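International Symposium on Hanzi Character Codes and Databases, Kyoto and Tokyo, October 4, 1996. From the Missing-Character Problem to a Redesign of the Hanzi Interchange Code, Part 2: A Descriptive Method for Re-engineering Hanzi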
Page 6 of 9 |
3. Design Issues
(1) Three-Segment Code Word

To provide a mapping among a character, its associated variants, and a possible selection of font types, a three-segment code word for Hanzi is designed as follows:

<code word> ::= <character code> (. <glyph extension>) (- <font id>)

where parentheses indicate optional terms. For example, a code word of a Hanzi can be any one of the forms shown as follows. In the simplest case, the code word of a Hanzi is simply:

<character code>

The data type of <glyph extension> is a positive integer N, which indicates the Nth variant of a character. The <font id> is an identifier of a certain font type. The <character code> can be any existing Hanzi code or a new code such as the one we propose later.

The three-segment coding scheme can be used in conjunction with the glyph expression described earlier. In this case, the character code can be replaced by the glyph expression or the component sequence of a glyph. In addition, the glyph extension can be coded to identify various character sets, for example by assigning glyph extension 2 to simplified characters. Thus we can encode more knowledge about characters into the system.

(2) Fidelity Levels

Another advantage of the three-segment coding scheme is its flexibility in providing different levels of fidelity for various application requirements. The fidelity levels are:
(3) Glyph Structure as Code Word

In previous sections, it was shown that each glyph has a unique glyph structure expression, which can therefore be used as the identifier of the glyph. In this section, a new coding scheme for Hanzi interchange based upon glyph structure is presented. There are several advantages to using glyph structure as the foundation for designing a Hanzi interchange code.

Firstly, the component set, as well as the root set, of Hanzi is a closed system. It will not expand indefinitely as the character set does. Although there is a small chance that these sets might expand, they are far more manageable than the character set. For reference, consider the root set developed at Chiao-Tung University in 1972: the 496 roots derived from a set of 9132 glyphs can actually produce 48713 glyphs in a bigger collection without introducing any new root.

The second advantage is that the glyph structure model is a productive system, in mathematical terms. This means it has the expandability and flexibility to accommodate newly created characters without changing the existing system.

Thirdly, glyph structure is a kind of knowledge representation. It not only facilitates character coding and human reading, but also provides more knowledge for further information processing applications.

(4) The Glyph Data Base

A glyph data base is implemented on an IBM-compatible PC running the Windows 3.1 operating system with the Chinese Extension (Taiwan version). Programs are developed in Visual Basic, and the databases use the ProFox DBMS. So far, three character sets have been built in our system.

(1) A set of fundamental characters for computer use, 《中文電腦基本用字》. This set contains 8529 characters with 593 variants. In its glyph model, there are 629 components and 457 roots. A pictorial illustration of this set is shown in [Figure 3].
(2) An extension of the former set that includes the simplified characters used in the PRC. The character count does not increase, but the number of variants increases to 2284, the number of components to 664, and the number of roots to 492.
(3) A supplementary character set for Buddhist texts, 《電子佛典補充字集》. At present, more than 2000 missing characters have been collected in this set, and the number is growing day by day.
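Returning to the three-segment code word of item (1), the grammar can be illustrated with a small parser. This is only a minimal sketch under stated assumptions, not the implementation used in the glyph data base: the names `CodeWord` and `parse_code_word` are hypothetical, the character code is assumed to be written as a run of hexadecimal digits (the paper allows any existing Hanzi code), and the font id is assumed to be a short alphanumeric label.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed concrete syntax: <character code>[.<glyph extension>][-<font id>]
# The hexadecimal character code and alphanumeric font id are assumptions
# made for this sketch; the paper does not fix their lexical form.
_CODE_WORD = re.compile(
    r"(?P<char>[0-9A-Fa-f]+)"          # character code (required)
    r"(?:\.(?P<glyph>[1-9][0-9]*))?"   # glyph extension: positive integer N
    r"(?:-(?P<font>[A-Za-z0-9]+))?"    # font id
)


@dataclass(frozen=True)
class CodeWord:
    char_code: str
    glyph_ext: Optional[int] = None   # Nth variant of the character, if given
    font_id: Optional[str] = None     # identifier of a font type, if given

    def __str__(self) -> str:
        # Emit only the segments that are present.
        text = self.char_code
        if self.glyph_ext is not None:
            text += f".{self.glyph_ext}"
        if self.font_id is not None:
            text += f"-{self.font_id}"
        return text


def parse_code_word(text: str) -> CodeWord:
    """Split a code word into its three segments; the last two are optional."""
    match = _CODE_WORD.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid code word: {text!r}")
    glyph = match.group("glyph")
    return CodeWord(
        char_code=match.group("char"),
        glyph_ext=int(glyph) if glyph is not None else None,
        font_id=match.group("font"),
    )
```

With this sketch, `parse_code_word("A4A4.2-kai")` yields a character code, a glyph extension of 2, and a font id, while `parse_code_word("A4A4")` yields the character code alone. Here `A4A4` and `kai` are placeholder values chosen for illustration. Keeping the two trailing segments optional mirrors the parenthesized terms in the grammar, so an application that needs less detail can simply ignore or omit them.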