Document Processing Laboratory: List of Papers

International Conference on Hanzi Character Codes and Databases, Kyoto and Tokyo, October 4, 1996
From the Missing-Character Problem to the Redesign of Hanzi Interchange Codes: Part 2

A Descriptive Method for Re-engineering Hanzi
Information Interchange Codes

Page 2 of  9

2. The Representation of Hanzi Knowledge

(1) Previous Studies

(a) A Fundamental Character Set for Computer Use

Around 1970, research on using computers to process Hanzi information was launched in Taiwan. At that time, the available statistical knowledge of characters and words was not sufficient to support research needs. Therefore, Mr. Su Lin of the Computer and Control Engineering Department of National Chiao-Tung University conducted a study to establish "A Fundamental Character Set for Computer Use" with the support of Wang Laboratories. The research started in October 1971, took more than 2000 man-days, and produced a draft report in March 1972. Some highlights of this research are as follows.

  1. A thorough survey of all statistical studies of Hanzi from 1856 to 1971 was carried out, together with a survey of the character sets used by popular dictionaries and the press during that period.

  2. Eleven of the character sets surveyed in item 1 were selected and weighted. Their union was named the Fundamental Character Set for Computer Use (中文電腦基本用字). These 11 sets are listed in [Table 1]; together they cover more than 30 previous statistical works on character sets.

  3. Variants were also collected. References on variants include 林語堂, 〈整理漢字草案〉 and 國立編譯館, 《常用字統一字形暫用表》 (1968). Some guidelines for selecting the authoritative glyph are listed in [Table 2]. Other glyphs, or variants, were collected as "參照字形" (reference glyphs), so the collected character set contains simplified characters. In general, this research does not care whether a glyph is right or wrong, complicated or simplified, standard or popular, ancient or up to date. The collection simply aims at every glyph that a computer might possibly encounter. This attitude differs from that of all previous work.

  4. The research collected 8532 characters plus 597 variants. The total usage frequency count is 2,022,604, distributed as follows:
Category               Number of characters   Percentage of use
most frequently used   1857                   97.34%
frequently used        2068                    2.27%
occasionally used      2182                    0.27%
rarely used            2425                    0.12%

The entropy of the system is 9.60. The accumulated frequency of the 500 most frequently used characters is listed in [Table 3]. To date, this survey remains the most comprehensive one, and we use it as the foundation of our study.
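The text does not say how the entropy figure was computed. The following is a minimal sketch, assuming it is the Shannon entropy in bits (base-2 logarithm) taken over per-character usage counts. The function name and the toy counts are illustrative and do not come from the survey.

```python
import math

def shannon_entropy(counts):
    """Return the Shannon entropy, in bits, of a list of usage counts."""
    total = sum(counts)
    # Zero counts are skipped: they contribute nothing to the sum.
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Hypothetical toy data: four characters used equally often.
uniform = shannon_entropy([25, 25, 25, 25])

# Hypothetical toy data: one character dominates, so entropy drops.
skewed = shannon_entropy([97, 1, 1, 1])

# Entropy if all 8532 collected characters were equally frequent.
ceiling = math.log2(8532)
```

Under this assumption, 8532 equally frequent characters would give about 13.06 bits, so a reported value of 9.60 reflects how strongly usage is concentrated in the most frequent characters.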

Table 1: The 11 character sets collected in 《中文電腦基本用字集》

  1. 莊澤宣, 《基本字彙》, 廣州中山大學教育學研究所, 1930
  2. 胡顏立, 《小學初級分級暫用字彙》, 教育部, 1935
  3. 教育部, 《注音漢字》, 商務印書館, first edition 1935, first Taiwan edition 1961
  4. 蔡樂生, 《常用字選》, 英文中國郵報社, 1946
  5. 台灣省國語推行委員會, 《國音標彙編》, 開明書局, first edition 1947, second Taiwan edition 1971
  6. 王清波, 《國民小學現行國語課本國字初現課次、重現次數之分析研究》, 高雄市政府, 1963
  7. 國立編譯館, 《國民小學常用字彙研究》, 中華書局, 1967
  8. 台灣電信局, 《電碼新編》, revised and enlarged edition, 1967
  9. 星華打字儀器行, 《中文打字機新版文字排列表》, Taipei, 1969
  10. 世界中文報業協會, 《新聞常用字彙》, 1970
  11. 中南鑄字廠, 《常用字表》, Taipei, 1971

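Table 2: Principles for handling variant characters in 《中文電腦基本用字集》

  1. Select from existing vocabularies; do not create new characters.
  2. When a character has several forms, take the simplest one, regardless of whether it is the orthodox or the popular form, ancient or modern. Where the ancient form is simpler, follow the ancient, e.g. take 「 」 rather than 「§」;
    where the modern form is simpler, follow the modern, e.g. take 「‚ï」 rather than 「¸」.
  3. When a character has several forms, take the one whose structure suits computer design, e.g. take 「略」 rather than 「 」, and 「裡」 rather than 「裏」.
  4. When a character has several forms, take the one in common use, e.g. take 「拿」 rather than 「拏」.
  5. Where one form has become established in popular usage while the original character still carries other meanings, keep both, e.g. 「尿」 and 「溺」.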

Table 3: Accumulated frequency of 《中文電腦基本用字集》

Number of    Accumulated   Number of    Accumulated   Number of    Accumulated
characters   frequency     characters   frequency     characters   frequency
5            9.24%         50           32.39%        372          +70%
10           14.20%        60           35.22%        472          +75%
15           17.79%        80           39.92%        500          76.27%
20           20.54%        100          43.74%        1000         89.38%
30           25.13%        141          +50%
40           29.02%        232          +60%

Note: The sign "+" indicates "just over". For example, +50% means that the accumulated frequency of the first 141 characters is just over 50%.
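The "just over" entries can be read as the smallest number of top-ranked characters whose accumulated share of usage exceeds a given threshold. A minimal sketch of that lookup follows, assuming per-character usage counts are available; the function name and the toy counts are hypothetical.

```python
from itertools import accumulate

def chars_needed(frequencies, target):
    """Smallest number of top-ranked characters whose share exceeds target.

    frequencies: per-character usage counts, in any order.
    target: a fraction between 0 and 1 (0.5 means 50%).
    Returns None if the accumulated share never exceeds the target.
    """
    ranked = sorted(frequencies, reverse=True)
    total = sum(ranked)
    for n, running in enumerate(accumulate(ranked), start=1):
        if running / total > target:
            return n
    return None

# Hypothetical toy counts for seven characters (they sum to 100).
toy_counts = [40, 25, 15, 10, 5, 3, 2]
half = chars_needed(toy_counts, 0.5)
three_quarters = chars_needed(toy_counts, 0.75)
```

With the toy counts, the two most frequent characters already account for more than half of all usage, which mirrors the steep curve in Table 3.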

(b) The Chiao-Tung Root System

The earliest study of glyph structure in Taiwan was carried out at National Chiao-Tung University, and the resulting system was therefore named the Chiao-Tung Root System. In 1972, the master's thesis of Mr. 倪耿, 《中國文字之結構模式及其分析》, analyzed 16 different compositional operators of glyphs and found that only three of them, namely horizontal composition, vertical composition and contain composition, are frequently used and can be assembled into an effective system for representing the structure of glyphs. A formal representation of the system in Backus Normal Form is shown in [Table 4].

Table 4: A formal representation of Hanzi glyphs in Backus Normal Form

<character set> ::= <glyph> | <symbols>
<symbols>       ::= including punctuation symbols, pronunciation symbols, and others
<glyph>         ::= <root> | <component> | <glyph> <operator> <glyph>
<component>     ::= <root> | <component> <operator> <component>
<operator>      ::= horizontal | vertical | contain
<root>          ::= there are 496 roots as shown in [Table 5]

Mathematically speaking, the system in [Table 4] is a production system. That is, the outcomes it can generate, such as characters, glyphs and components, are usually far more numerous than those we are willing to accept; whether an outcome is a legal item is up to our choice. The system is therefore expandable, which is exactly what we need.

In this production system, the structure of a glyph is expressed as an expression of roots. A root is a basic component that is not decomposed further. An example of glyph structure is shown in [Figure 1]. The term component usually refers to the intermediate parts between a glyph and its roots. In [Figure 1], 灣 and 彎 are glyphs, ‰µ is a component, 弓, 言 and 系 are both glyphs and roots, and „E is a root.
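The idea of a glyph as an expression of roots can be sketched as a small tree structure. This is only an illustration under stated assumptions: the class names, the `roots_of` helper and the placeholder root labels "A", "B" and "C" are hypothetical, and the sketch treats every composition as binary, as the grammar suggests.

```python
from dataclasses import dataclass
from typing import Union

# The three compositional operators named in the text.
OPERATORS = {"horizontal", "vertical", "contain"}

@dataclass(frozen=True)
class Root:
    name: str  # label of an indivisible root

@dataclass(frozen=True)
class Compose:
    operator: str  # one of OPERATORS
    left: "Glyph"
    right: "Glyph"

    def __post_init__(self):
        # Reject operators outside the three the system allows.
        if self.operator not in OPERATORS:
            raise ValueError(f"unknown operator: {self.operator}")

Glyph = Union[Root, Compose]

def roots_of(glyph):
    """List the roots of a glyph expression in left-to-right order."""
    if isinstance(glyph, Root):
        return [glyph.name]
    return roots_of(glyph.left) + roots_of(glyph.right)

# Hypothetical glyph: root A beside (root B stacked over root C).
example = Compose("horizontal", Root("A"),
                  Compose("vertical", Root("B"), Root("C")))
```

Here the inner `Compose` plays the role of a component: it is neither a root nor the complete glyph, but an intermediate part between the two.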

Table 5: 中文字根表 (table of Chinese character roots, in descending order of frequency)

The work of Mr. 倪耿 was carried out in parallel with Mr. Su Lin's work on 《中文電腦基本用字集》, and together they established the set of Chiao-Tung Roots. A distinctive feature of this root set is that it was obtained through three iterations of an optimization procedure. The procedure is derived from optimizing a polynomial expression in the total number of roots and the average number of roots per glyph. In general, the fewer the roots, the longer the decomposition of each glyph. The optimization yields the following criterion: a glyph should not be decomposed if its frequency of usage is over 0.3758%; it should be decomposed into no more than 2 roots if its frequency is between 0.1879% and 0.3758%, into no more than 3 roots between 0.1236% and 0.1879%, and into no more than 4 roots between 0.0939% and 0.1236% [1]. This criterion sets the lower limit of decomposition.
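The frequency bands of this criterion can be written as a simple lookup. The sketch below is an illustration only: the function name is hypothetical, the treatment of values that fall exactly on a band boundary is an assumption, and the text gives no limit for frequencies below 0.0939%, so the function returns `None` there.

```python
def max_roots(frequency_percent):
    """Upper limit on the number of roots for a glyph.

    frequency_percent: usage frequency of the glyph, in percent.
    Returns 1 when the glyph is kept whole (not decomposed), and None
    when the frequency is below the lowest band quoted in the text.
    """
    if frequency_percent > 0.3758:
        return 1  # not decomposed: the glyph itself acts as a root
    if frequency_percent >= 0.1879:
        return 2
    if frequency_percent >= 0.1236:
        return 3
    if frequency_percent >= 0.0939:
        return 4
    return None  # no limit stated in the text for rarer glyphs
```

For example, a glyph used 0.2% of the time falls in the second band and would be split into at most two roots.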

From the 9129 glyphs in 《中文電腦基本用字表》, 496 roots were obtained, as listed in [Table 5]. Of these, 305 roots are themselves characters, and their combined frequency of usage exceeds 50% of total usage. The accumulated frequency of the 25 most frequently used roots is 30%; it rises to 49% for 50 roots, 66.7% for 100 roots, 84.9% for 200 roots, and 95% for 300 roots. [2]

As a result of the optimization, the weighted average number of roots per glyph is only 1.9. The power of the system is illustrated by a check against the 49905 characters of 《中文大辭典》: 48713 of them can be expressed by the Chiao-Tung Root System. The remaining 1129 characters are ó¤å, 篆字 (seal characters), or ancient forms such as 古文, 反文 and 圖騰. If needed, these 1129 glyphs can be added to the Chiao-Tung Root System at any time without difficulty. This result is illustrated in [Figure 2].
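The text does not spell out the weighting behind the figure of 1.9. A minimal sketch, assuming each glyph's root count is weighted by its usage count, is given below; the function name and the sample data are hypothetical.

```python
def weighted_average_roots(glyphs):
    """Usage-weighted average number of roots per glyph.

    glyphs: iterable of (usage_count, number_of_roots) pairs.
    """
    glyphs = list(glyphs)
    total_usage = sum(count for count, _ in glyphs)
    weighted_sum = sum(count * n_roots for count, n_roots in glyphs)
    return weighted_sum / total_usage

# Hypothetical sample: frequent glyphs kept whole, rare ones decomposed.
sample = [(900, 1), (80, 2), (20, 4)]
average = weighted_average_roots(sample)
```

Because frequent glyphs are decomposed into few roots or none at all, such an average stays low even when rare glyphs need several roots.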

The Chiao-Tung Root System was developed on the basis of the 楷書 (regular script) font. It therefore does not cover the 篆 (seal), 隸 (clerical), 行 (running) or 草 (cursive) scripts, nor does it take into account calligraphic variants, print fonts or modern artistic fonts. What it provides is the common structure of the glyph, which is the basis of every font design.

Figure 2. The 496 roots obtained from 9129 glyphs can produce 48713 glyphs and more. (A 1972 study by [1])

[1] For the calculation of this marginal utility, see: 謝清俊, 黃永文, 林樹, 《中文字根之分析》, 交大學刊, Vol. 6, No. 1, February 1973.

[2] For data on the decomposition of the 9129 glyphs and the usage frequency of the roots, see: 劉達人, 杜敏文, 謝清俊, 張仲陶, 蔡中川, 林樹, 《漢字綜合索引字典》, Asian Associates, Bedford, New York, 1979.

