Document Processing Laboratory (文獻處理實驗室): Paper Index

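International Symposium on Hanzi Character Codes and Databases, Kyoto and Tokyo, October 4, 1996
On Redesigning Hanzi Interchange Codes in Light of the Missing-Character Problem, Part II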

A Descriptive Method for Re-engineering Hanzi
Information Interchange Codes

2. The Representation of Hanzi Knowledge

(3) A Glyph Model

Our glyph model is shown in Figure 4. In this model, the part connected by lines 1, 2, 3, 4, and 5 is the formal system shown in Table 4. Lines 6 and 7 show the composition of strokes to produce a root. The decomposition of a glyph into components can be expressed by the following two equations.

Figure 4

Let G be a glyph, R a root, K a component, and T a basic stroke. Let p and s represent position and size, respectively. Then

$$G = \sum K_{p,s} \tag{1}$$

$$R = \sum T_{p,s} \tag{2}$$

Equation (1) is applied iteratively, since a component may itself be decomposed further, and it is governed by the rules in Table 4.
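As a minimal sketch of how Equations (1) and (2) might be held in a program, the Python below stores each glyph as a list of placed components and each root as a list of placed strokes, then applies the decomposition repeatedly until only strokes with absolute placements remain. The class names, the box convention (x, y, width, height in a unit square), the stroke names and coordinates, and the treatment of 木 as a root are all assumptions made for illustration; none of them is taken from Table 4 or Figure 4.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# Assumed convention: (x, y, width, height), relative to the enclosing frame.
Box = Tuple[float, float, float, float]


@dataclass
class Root:
    name: str
    strokes: List[Tuple[str, Box]]  # Equation (2): basic strokes T with p and s


@dataclass
class Glyph:
    name: str
    parts: List[Tuple[Union["Glyph", Root], Box]]  # Equation (1): components K with p and s


def place(inner: Box, frame: Box) -> Box:
    """Convert a box given relative to `frame` into the frame's own coordinates."""
    ix, iy, iw, ih = inner
    fx, fy, fw, fh = frame
    return (fx + ix * fw, fy + iy * fh, iw * fw, ih * fh)


def flatten_strokes(node, frame=(0.0, 0.0, 1.0, 1.0)):
    """Apply Equation (1) repeatedly, then Equation (2); return (stroke, box) pairs."""
    if isinstance(node, Root):
        return [(name, place(box, frame)) for name, box in node.strokes]
    placed = []
    for part, box in node.parts:
        placed.extend(flatten_strokes(part, place(box, frame)))
    return placed


# Hypothetical data: stroke names and coordinates are invented for the example.
mu = Root("木", [
    ("horizontal", (0.0, 0.25, 1.0, 0.0)),
    ("vertical", (0.5, 0.0, 0.0, 1.0)),
    ("left-falling", (0.0, 0.25, 0.5, 0.75)),
    ("right-falling", (0.5, 0.25, 0.5, 0.75)),
])
lin = Glyph("林", [(mu, (0.0, 0.0, 0.5, 1.0)), (mu, (0.5, 0.0, 0.5, 1.0))])

strokes = flatten_strokes(lin)
```

Here `strokes` holds the eight strokes of 林, each scaled into the left or right half of the glyph frame. Storing positions relative to the enclosing component lets the same root definition be reused at any size.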

(a) The Representation of Glyph Structure

Let us explain by example how the structure of a glyph is represented in a computer. In the tree shown below, the glyphs in the rightmost branch can be expressed as

[Figure: Representation of Glyph Structure]

These equations are generally referred to as glyph structure expressions, or glyph expressions for short, in which r, s, and t represent the operations of vertical composition, horizontal composition, and containing composition, respectively. Although each of the above equations has only one composition operator, a computer can iterate these expressions, eliminating components successively, to obtain an expression consisting solely of roots, as follows.

灕 = [root expression of 灕: four successive expansions ending in an expression of 8 roots; the operator and root symbols did not survive encoding] ......(3)

In Equation (3), the glyph “灕” is composed of 8 roots. A glyph expression written solely in terms of roots, such as Equation (3), is called a root expression. Similarly, a glyph expression written in terms of components, such as equations 1 to 11 described previously, may be called a component expression. When all the operators are eliminated, Equation (3) becomes

灕 = [root sequence of 灕: its 8 roots in order, with the operators removed] ......(4)

Equation (4) is called the root sequence of 灕. By the same naming convention, equations 1 to 11 with their operators eliminated may be called component sequences.
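The expansion from component expressions to a root expression, and the elimination of operators that yields a root sequence, can be sketched in Python as follows. The table below is a toy decomposition written for this example: it assumes 灕 splits horizontally into 氵 and 離, that 離 splits horizontally into 离 and 隹, and that any symbol without an entry is a root. It is not the paper's root inventory, so the expansion stops at 3 symbols rather than the 8 roots of Equation (3). The operator letters `s` (horizontal) and `r` (vertical) follow the text; the function names are hypothetical.

```python
from typing import Dict, Tuple, Union

# An expression is a symbol, or (operator, left operand, right operand).
Expr = Union[str, Tuple[str, "Expr", "Expr"]]

# Toy component expressions, one operator each (assumed, not from the paper).
COMPONENT_EXPRESSIONS: Dict[str, Expr] = {
    "灕": ("s", "氵", "離"),
    "離": ("s", "离", "隹"),
}


def expand(expr: Expr, table: Dict[str, Expr]) -> Expr:
    """Substitute components by their expressions until only roots remain."""
    if isinstance(expr, str):
        # A symbol with no table entry is treated as a root.
        return expand(table[expr], table) if expr in table else expr
    op, left, right = expr
    return (op, expand(left, table), expand(right, table))


def to_sequence(expr: Expr) -> Tuple[str, ...]:
    """Eliminate all operators, keeping the leaves in left-to-right order."""
    if isinstance(expr, str):
        return (expr,)
    _, left, right = expr
    return to_sequence(left) + to_sequence(right)


def to_text(expr: Expr) -> str:
    """Render an expression in the prefix form op(left,right)."""
    if isinstance(expr, str):
        return expr
    op, left, right = expr
    return f"{op}({to_text(left)},{to_text(right)})"


root_expression = expand("灕", COMPONENT_EXPRESSIONS)
root_sequence = to_sequence(root_expression)
```

With this toy table, `to_text(root_expression)` gives `s(氵,s(离,隹))` and `root_sequence` is `("氵", "离", "隹")`. A fuller table would simply add entries for 氵, 离, and 隹 until every leaf is a root.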

In general, any representation called an expression carries complete glyph structure information, while a sequence does not. In other words, each glyph has a unique glyph expression, and this expression can therefore serve as the identifier of that glyph. Although a sequence does not carry the complete structure information of a glyph, it still discriminates very well among glyphs. For instance, in the 林樹 character set (林樹字集) of 9129 glyphs, only 8 pairs of glyphs, such as (唄, 員), have exactly the same component sequence. All others can be uniquely identified by their component sequences.
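The difference between expressions and sequences as identifiers can be illustrated with a short Python sketch that groups glyphs by component sequence and reports the groups with more than one member. The three component expressions below are assumptions for the example (唄 as 口 beside 貝, 員 as 口 above 貝, 林 as 木 beside 木), using the operator letters `s` (horizontal) and `r` (vertical) from the text; the function names are hypothetical.

```python
from collections import defaultdict

# Assumed component expressions: (operator, left or upper part, right or lower part).
EXPRESSIONS = {
    "唄": ("s", "口", "貝"),
    "員": ("r", "口", "貝"),
    "林": ("s", "木", "木"),
}


def sequence(expr):
    """Drop operators from an expression, keeping symbols in order."""
    if isinstance(expr, str):
        return (expr,)
    _, left, right = expr
    return sequence(left) + sequence(right)


def colliding_groups(expressions):
    """Return the lists of glyphs that share one component sequence."""
    by_sequence = defaultdict(list)
    for glyph, expr in expressions.items():
        by_sequence[sequence(expr)].append(glyph)
    return [glyphs for glyphs in by_sequence.values() if len(glyphs) > 1]


collisions = colliding_groups(EXPRESSIONS)
distinct_expressions = len(set(EXPRESSIONS.values()))
```

In this sample the only collision is the pair 唄 and 員, whose sequences are both (口, 貝), while all three expressions remain distinct because the operator differs.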

In practice, all of the glyph expressions of a character set can be stored in a computer. Doing so formally represents all the glyph structure knowledge of the character set and stores it for further use. For《中文電腦基本用字》, there are 9756 glyph expressions, of which 8529 are for authority characters, 593 for variants, and 629 for components. When simplified characters are included, 2284 expressions are added for simplified variants and 35 for new simplified components. Thus, the number of components increases to 644, and the total number of glyph expressions increases to 11477. The 8529 expressions for authority characters remain unchanged.

