A Descriptive Method for Re-engineering Hanzi Information Interchange Codes 9-4

2. The Representation of Hanzi Knowledge

(3)A Glyph Model

Our glyph model is shown in 〔Figure 4〕. In this model, the part connected by lines 1, 2, 3, 4, and 5 is the formal system shown in 〔Table 4〕. Lines 6 and 7 show the composition of strokes into roots. The decomposition of a glyph into components, and of a root into basic strokes, can be expressed by the following two equations.

Figure 4

Let G be a glyph, R a root, K a component, and T a basic stroke. Let p and s represent position and size, respectively. Then

G = ΣK(p, s)　　(1)

R = ΣT(p, s)　　(2)

Equation (1) is applied iteratively, subject to the rules in 〔Table 4〕.
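The iterative use of Equation (1) can be sketched in code. The following is a minimal illustration, not the implementation described in this paper: it assumes that p is an (x, y) offset and s a (width, height) scale inside the parent's unit box, it uses a hypothetical two-entry component table (林 and 森), and it treats any part not in the table as a root. The rules of 〔Table 4〕 are not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placed:
    part: str   # a component K or, at the leaves, a root
    p: tuple    # assumed: (x, y) position inside the parent's unit box
    s: tuple    # assumed: (width, height) relative to the parent's box


# Hypothetical component table: glyph -> placed parts, as in Equation (1).
# Any part that has no entry here is treated as a root.
COMPONENTS = {
    "林": [Placed("木", (0.0, 0.0), (0.5, 1.0)),
           Placed("木", (0.5, 0.0), (0.5, 1.0))],
    "森": [Placed("木", (0.0, 0.0), (1.0, 0.5)),
           Placed("林", (0.0, 0.5), (1.0, 0.5))],
}


def flatten(glyph, p=(0.0, 0.0), s=(1.0, 1.0)):
    """Apply Equation (1) repeatedly until only roots remain."""
    if glyph not in COMPONENTS:
        return [Placed(glyph, p, s)]
    roots = []
    for k in COMPONENTS[glyph]:
        # Map the child's box from the parent's coordinates to absolute ones.
        abs_p = (p[0] + k.p[0] * s[0], p[1] + k.p[1] * s[1])
        abs_s = (s[0] * k.s[0], s[1] * k.s[1])
        roots.extend(flatten(k.part, abs_p, abs_s))
    return roots
```

Under these assumptions, `flatten("森")` returns three placed copies of 木, each with its absolute position and size in the glyph's box.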

(a)The Representation of Glyph Structure

We now explain by example how the structure of a glyph is represented in a computer. In the tree shown below, the glyphs in the rightmost branch can be expressed as

Representation of Glyph Structure

These equations are generally referred to as “glyph structure expressions”, or “glyph expressions” for short, in which the operators 熷, 岙, and 㶊 represent “vertical composition”, “horizontal composition”, and “contain composition”, respectively. Although each of the above equations contains only one composition operator, a computer can apply these expressions iteratively, eliminating components one after another, to obtain an expression consisting solely of roots, as follows.

灕＝𠺫岙（离岙隹）
　＝𠺫岙（（橣熷禸）岙（�岙庆））
　＝𠺫岙（（（𢯎熷凶）熷禸）岙（�岙（𢯎岙主）））
　＝𠺫岙（（（𢯎熷（塳㶊乂））熷禸）岙（�岙（𢯎岙主）））　　(3)

In Equation (3), the glyph “灕” is composed of 8 roots. A glyph expression that consists solely of roots, such as Equation (3), is called a “root expression”. Similarly, a glyph expression in terms of components, such as equations 1 to 11 described previously, may be called a “component expression”. When all the operators are eliminated, Equation (3) becomes

灕＝𠺫𢯎乂塳禸�𢯎主　　(4)

Equation (4) is called “the root sequence” of 灕. By the same naming convention, equations 1 to 11 with their operators eliminated may be called “component sequences”.
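The step from component expressions to a root expression, and from there to a root sequence, can be sketched as follows. This is only an illustration under stated assumptions: the ASCII operator names "V", "H", and "C" (vertical, horizontal, and contain composition) and the small table of expressions are hypothetical, and any part without an entry in the table is treated as a root. The sequence is read off left to right from the expression, which may differ from the root ordering used in this paper.

```python
# Hypothetical component expressions: glyph -> (operator, left, right).
# "V", "H", "C" are assumed names for vertical, horizontal, and
# contain composition; they are not the paper's operator symbols.
EXPRESSIONS = {
    "想": ("V", "相", "心"),
    "相": ("H", "木", "目"),
    "困": ("C", "囗", "木"),
}


def root_expression(glyph):
    """Expand components recursively until only roots remain."""
    if glyph not in EXPRESSIONS:
        return glyph  # no further decomposition known: treat as a root
    op, left, right = EXPRESSIONS[glyph]
    return (op, root_expression(left), root_expression(right))


def sequence(expr):
    """Drop all operators, keeping the parts in left-to-right order."""
    if isinstance(expr, str):
        return expr
    _, left, right = expr
    return sequence(left) + sequence(right)
```

Here `sequence(EXPRESSIONS["想"])` gives the component sequence of 想, while `sequence(root_expression("想"))` gives its root sequence under the assumed table.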

In general, any representation called an “expression” carries the complete structure information of a glyph, while a “sequence” does not. In other words, each glyph has a unique glyph expression, and hence this expression can serve as the identifier of that glyph. Although a “sequence” does not carry the complete structure information of a glyph, it still discriminates very well among glyphs. For instance, in the 林樹字集 of 9129 glyphs, only 8 pairs of glyphs, such as （唄、員）, have exactly the same component sequence. All other glyphs can be uniquely identified by their component sequences.
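Finding glyphs that share a component sequence amounts to grouping the stored expressions by their operator-free form. The sketch below shows this for the pair （唄、員） mentioned above. The operator names "H" and "V" and the specific decompositions are assumptions made for illustration; only the fact that the two glyphs share a component sequence comes from the text.

```python
from collections import defaultdict

# Hypothetical component expressions: glyph -> (operator, part, part).
# "H" and "V" are assumed names for horizontal and vertical composition.
EXPRESSIONS = {
    "唄": ("H", "口", "貝"),
    "員": ("V", "口", "貝"),
    "林": ("H", "木", "木"),
}


def component_sequence(expr):
    """Drop the operator and concatenate the components in order."""
    _, *parts = expr
    return "".join(parts)


def collisions(expressions):
    """Return the sequences shared by more than one glyph."""
    groups = defaultdict(list)
    for glyph, expr in expressions.items():
        groups[component_sequence(expr)].append(glyph)
    return {seq: glyphs for seq, glyphs in groups.items() if len(glyphs) > 1}
```

In this toy table the two glyphs collide on their sequence but not on their expressions, which is why the expression, rather than the sequence, can serve as an identifier.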

In practice, all the glyph expressions of a character set can be stored in a computer. Doing so formally represents and stores all the glyph structure knowledge of the character set for further use. For《中文電腦基本用字》, there are 9756 glyph expressions, of which 8529 are for authority characters, 593 for variants, and 629 for components. When simplified characters are included, 2284 expressions are added for simplified variants and 35 for new simplified components. The number of components thus increases to 644, and the total number of glyph expressions increases to 11477. The 8529 expressions for authority characters remain unchanged.