A Descriptive Method for Re-engineering Hanzi Information Interchange Codes 9-6

3. Design Issues

(1) Three-Segment Code Word

In order to provide a mapping among a character, its associated variants, and a possible selection of font types, a three-segment code word for Hanzi is designed as follows:

<code word> ::= <character code> (. <glyph extension>) (- <font id.>)

where parentheses indicate optional terms. For example, the code word of a Hanzi can take any of the following forms:

a character code of a Hanzi is simply : <character code>
a glyph code of a Hanzi is : <character code>.<glyph extension>
a font code of a Hanzi is : <character code>-<font id.>
a style code of a Hanzi is : <character code>.<glyph extension>-<font id.>
or, <character code>-<font id.>.<glyph extension>

The data type of <glyph extension> is simply a positive integer N, which indicates the Nth variant of a character. The <font id.> is an identifier of a particular font. The <character code> can be any existing Hanzi code or the new code that we propose later.

The three-segment coding scheme for Hanzi can be used in conjunction with the glyph expression described earlier. In this case, the character code can be replaced by the glyph expression or the component sequence of a glyph. In addition, the glyph extension can be coded to identify various character sets, for example by assigning a glyph extension of 2 to simplified characters. In this way, more knowledge about characters can be coded into the system.
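The code word grammar above can be sketched as a small parser. This is only an illustration, not the authors' implementation: the names `CodeWord`, `parse_code_word` and `kind` are hypothetical, the sample character code `A578` and font id `kai` are made up, and the sketch assumes that neither the character code nor the font id contains a `.` or `-`.

```python
import re
from typing import NamedTuple, Optional


class CodeWord(NamedTuple):
    char_code: str            # any existing Hanzi code or a new one
    glyph_ext: Optional[int]  # N = the Nth variant of the character
    font_id: Optional[str]    # identifier of a font


def parse_code_word(text: str) -> CodeWord:
    """Parse a character, glyph, font, or style code (either segment order)."""
    # Assumption: '.' and '-' appear only as segment separators.
    match = re.fullmatch(r"([^.\-]+)((?:[.\-][^.\-]+){0,2})", text.strip())
    if match is None:
        raise ValueError(f"malformed code word: {text!r}")
    char_code, tail = match.groups()
    glyph_ext: Optional[int] = None
    font_id: Optional[str] = None
    for separator, value in re.findall(r"([.\-])([^.\-]+)", tail):
        if separator == ".":
            # The glyph extension must be a positive integer and appear once.
            if glyph_ext is not None or not re.fullmatch(r"[1-9][0-9]*", value):
                raise ValueError(f"bad glyph extension in {text!r}")
            glyph_ext = int(value)
        else:
            if font_id is not None:
                raise ValueError(f"duplicate font id in {text!r}")
            font_id = value
    return CodeWord(char_code, glyph_ext, font_id)


def kind(word: CodeWord) -> str:
    """Name the form of a code word using the paper's four terms."""
    if word.glyph_ext is not None and word.font_id is not None:
        return "style code"
    if word.glyph_ext is not None:
        return "glyph code"
    if word.font_id is not None:
        return "font code"
    return "character code"
```

Under these assumptions, `parse_code_word("A578.2-kai")` and `parse_code_word("A578-kai.2")` yield the same `CodeWord`, which reflects the two equivalent orderings allowed for a style code.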

(2) Fidelity Levels

Another advantage of the three-segment coding scheme is its flexibility to provide different levels of fidelity for various application requirements. The fidelity levels are:

1. The fidelity level is the lowest when only the character code is used. In this case, the application does not care which glyph or which font is being used. For example, 台灣 and 臺灣 are treated as the same. The only thing that matters is the meaning that the text carries. Many applications can be satisfied at this level.
2. The next higher fidelity level is the one where the glyph code is used. At this level, the correct glyph structure is required, but not the font.
3. When the font code is used, the correct font is required, but not the glyph.
4. When the style code is used, both the correct glyph and the correct font are required. This fidelity level is higher than those of items 2 and 3.
5. When some font parameters are specified as an extension of the font code, for example by using the font tags of HTML 3.2, Netscape 3.0, or Microsoft Internet Explorer 3.0, the system provides the highest fidelity level.
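One way to read the first four levels is as projections of a code word onto the segments that an application cares about. The sketch below illustrates that reading only; `Fidelity`, `projection` and `same_at` are hypothetical names, the character code `TAI` and font id `kai` are made up, and the fifth level (font parameters) is left out because the text does not specify its encoding.

```python
from enum import Enum
from typing import NamedTuple, Optional, Tuple


class CodeWord(NamedTuple):
    char_code: str
    glyph_ext: Optional[int]
    font_id: Optional[str]


class Fidelity(Enum):
    CHARACTER = 1  # only the character code matters
    GLYPH = 2      # character code + glyph extension
    FONT = 3       # character code + font id
    STYLE = 4      # character code + glyph extension + font id


def projection(word: CodeWord, level: Fidelity) -> Tuple:
    """Keep only the segments that are significant at the given level."""
    if level is Fidelity.CHARACTER:
        return (word.char_code,)
    if level is Fidelity.GLYPH:
        return (word.char_code, word.glyph_ext)
    if level is Fidelity.FONT:
        return (word.char_code, word.font_id)
    return (word.char_code, word.glyph_ext, word.font_id)


def same_at(a: CodeWord, b: CodeWord, level: Fidelity) -> bool:
    """Two code words are interchangeable at a level if their projections match."""
    return projection(a, level) == projection(b, level)


# Hypothetical encoding of the variant pair 台 / 臺 under one character code.
tai_variant_1 = CodeWord("TAI", 1, "kai")
tai_variant_2 = CodeWord("TAI", 2, "kai")
```

With this made-up encoding, the two variants compare equal at the `CHARACTER` and `FONT` levels but differ at the `GLYPH` and `STYLE` levels, which matches the 台灣 / 臺灣 example in item 1.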

(3) Glyph Structure as Code Word

In previous sections, it was shown that each glyph has a unique glyph structure expression, which can therefore be used as the identifier of the glyph. In this section, a new coding scheme for Hanzi interchange based on glyph structure is presented.

There are several advantages to using glyph structure as the foundation for designing a Hanzi interchange code. Firstly, the component set, as well as the root set, of Hanzi is a closed system. It will not expand indefinitely as the character set does. Although there is a small chance that these sets might expand, they are far more manageable than the character set.

For reference, consider the root set developed at Chiao-Tung University in 1972: the 496 roots derived from a set of 9132 glyphs can in fact produce the 48713 glyphs of a bigger collection without introducing any new root. The second advantage is that the glyph structure model is a productive system, in mathematical terms. This means that it can be extended to accommodate newly created characters without changing the existing system. Thirdly, glyph structure is a kind of knowledge representation. It not only facilitates character coding and human reading, but also provides more knowledge for further information processing applications.
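The closed-system claim can be made concrete with a coverage check: a new glyph needs no change to the root set if every root in its decomposition is already present. The sketch below uses a toy root set and toy decompositions chosen for illustration; they are not taken from the Chiao-Tung root set or from the authors' glyph data base, and `uncovered_roots` is a hypothetical helper.

```python
from typing import Iterable, Set

# Toy root set, for illustration only (not the 496-root set cited above).
TOY_ROOTS: Set[str] = {"木", "日", "月", "口"}


def uncovered_roots(decomposition: Iterable[str], root_set: Set[str]) -> Set[str]:
    """Return the roots of a glyph that are missing from the root set."""
    return {root for root in decomposition if root not in root_set}


# Glyphs written as flat root sequences; structural operators are omitted here.
lin = ["木", "木"]        # 林
ming = ["日", "月"]       # 明
xin = ["金", "金", "金"]  # 鑫

for name, roots in [("林", lin), ("明", ming), ("鑫", xin)]:
    missing = uncovered_roots(roots, TOY_ROOTS)
    status = "covered" if not missing else "needs new roots"
    print(name, status, sorted(missing))
```

In this toy setting, 林 and 明 are expressible with the existing roots, while 鑫 would require adding 金. The figures in the text suggest that, for a well-chosen root set, the second case is rare.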

(4) The Glyph Data Base

A glyph data base has been implemented on an IBM-compatible PC running the Windows 3.1 operating system with the Chinese Extension (Taiwan Version). Programs are developed in Visual Basic and databases with the ProFox DBMS.

So far, three character sets have been built into our system. (1) A set of fundamental characters for computer use,《中文電腦基本用字》. This set contains 8529 characters with 593 variants. In its glyph model, there are 629 components and 457 roots. A pictorial illustration of this set is shown in〔Figure 3〕. (2) An extension of the first set that includes the simplified characters used in the PRC. The character count does not increase, but the number of variants increases to 2284, the number of components to 664, and the number of roots to 492. (3) A supplementary character set for Buddhist texts,《電子佛典補充字集》. At present, more than 2000 missing characters have been collected in this set, and the number is growing day by day.