Document Processing Laboratory, Paper Index

International Symposium on Hanzi Character Codes and Databases, Kyoto and Tokyo, 4 October 1996
From the Missing-Character Problem to the Redesign of Hanzi Interchange Codes: Part II

A Descriptive Method for Re-engineering Hanzi
Information Interchange Codes

Page 7 of  9

3. Design Issues

(5) The Kernel Set

Because a component expression serves as the identifier, or code word, of a glyph, the average number of components per glyph has to be minimized. This can be done by limiting each component expression to a single operator. In that case a glyph usually has 2 components, and occasionally 3. Examples of one-operator expressions can be found in [Table 7]. Once the one-operator expressions of all characters in a character set have been collected, a kernel set can be found that contains every element needed to write a one-operator expression for any character/glyph in the original character set. A pictorial presentation of the kernel set of our glyph database is shown in [Figure 5]. This kernel set includes 457 roots, 629 components, and 1044 kernel glyphs. There are 2173 elements in total. It also contains 1313 characters, whose total frequency of usage is 62.60%.
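The collection step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the operator names `LR` (left-right) and `TB` (top-bottom), the function name `kernel_set`, and the six sample glyphs are hypothetical and are not taken from Table 7 or from the authors' glyph database.

```python
from typing import Dict, Set, Tuple

# Hypothetical one-operator expression: (operator, operands).
Expression = Tuple[str, Tuple[str, ...]]

# Illustrative sample data, not the paper's Table 7.
SAMPLE_EXPRESSIONS: Dict[str, Expression] = {
    "明": ("LR", ("日", "月")),   # 日 beside 月
    "林": ("LR", ("木", "木")),   # 木 beside 木
    "森": ("TB", ("木", "林")),   # 木 above 林
}
SAMPLE_CHARSET: Set[str] = {"日", "月", "木", "明", "林", "森"}


def kernel_set(expressions: Dict[str, Expression], charset: Set[str]) -> Set[str]:
    """Collect every element needed to write a one-operator expression
    for each glyph in charset."""
    kernel: Set[str] = set()
    for glyph in charset:
        expression = expressions.get(glyph)
        if expression is None:
            # No decomposition is known, so the glyph itself must be a kernel element.
            kernel.add(glyph)
        else:
            _operator, operands = expression
            # Every operand of a one-operator expression must be in the kernel.
            kernel.update(operands)
    return kernel


sample_kernel = kernel_set(SAMPLE_EXPRESSIONS, SAMPLE_CHARSET)
non_kernel = SAMPLE_CHARSET - sample_kernel
```

In this toy set, 林 ends up in the kernel because 森 needs it as an operand, while 明 and 森 stay outside the kernel and are identified by their expressions alone.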

Figure 5: The Development of Kernel Set (1995~1996)                      1996.08.23


(6) A Descriptive Method of Coding

The kernel set is a closed set, so it can be coded in the traditional way by assigning a numerical code word to each element. Let us refer to this method as numerical coding. Since the kernel has relatively few elements, only 2173, it fits easily into a 2-byte code space. For the remaining 7213 non-kernel characters/glyphs, glyph expressions can be used as code words. This method is called descriptive coding. A descriptive code word does not require numerical coding space. Moreover, descriptive coding is a production system, so it can accommodate characters/glyphs added to the existing character set.
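The split between numerical and descriptive coding can be illustrated with a small sketch. Everything named here is an assumption made for illustration: the tags `"N"` and `"D"`, the operator names `LR` and `TB`, the sorted-order code assignment, and the sample glyphs are hypothetical and do not reproduce the authors' code tables.

```python
from typing import Dict, Iterable, Tuple, Union

Expression = Tuple[str, Tuple[str, ...]]          # (operator, operands)
NumericCode = Tuple[str, int]                     # ("N", code number)
DescriptiveCode = Tuple[str, str, Tuple[int, ...]]  # ("D", operator, operand codes)

# Illustrative kernel and expressions, not the paper's data.
KERNEL = ["日", "月", "木", "林"]
EXPRESSIONS: Dict[str, Expression] = {
    "明": ("LR", ("日", "月")),
    "森": ("TB", ("木", "林")),
}


def assign_numeric_codes(kernel: Iterable[str]) -> Dict[str, int]:
    """Numerical coding: give each kernel element a fixed code number."""
    return {element: index for index, element in enumerate(sorted(kernel))}


def encode(glyph: str, codes: Dict[str, int],
           expressions: Dict[str, Expression]) -> Union[NumericCode, DescriptiveCode]:
    """Kernel elements get a numeric code word; every other glyph is
    described by its operator and the numeric codes of its operands."""
    if glyph in codes:
        return ("N", codes[glyph])
    operator, operands = expressions[glyph]
    return ("D", operator, tuple(codes[operand] for operand in operands))


CODES = assign_numeric_codes(KERNEL)
# A 2-byte code space offers 2**16 code points.
assert len(CODES) <= 2 ** 16

# A new glyph needs only a new expression; the numeric code table is untouched.
EXPRESSIONS["朋"] = ("LR", ("月", "月"))
new_glyph_code = encode("朋", CODES, EXPRESSIONS)
```

The last two lines show the production-system property in miniature: adding 朋 consumes no numerical coding space because its operands are already kernel elements.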

(7) Optimization

The kernel set in [Figure 5] is not very efficient, because the total frequency of usage of the kernel is only 62.60%, as mentioned earlier. This percentage can be raised by adding frequently used characters to the kernel set. A frequency distribution chart of our glyph database is shown in [Figure 6]. [Figure 6] shows that adding the 1071 characters of the most frequently used category to the kernel raises the kernel's total frequency of usage to 97.31%. The price paid is an increase of the numerical coding space to 3243 code points, which is modest and can be accommodated by any 2-byte coding scheme.
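The effect of enlarging the kernel can be sketched as follows. This is a simplified illustration: the paper adds a whole frequency category at once, whereas the hypothetical helper `extend_kernel` below adds the `n` most frequent non-kernel glyphs, and the glyph names and counts are invented placeholders rather than the authors' frequency data.

```python
from typing import Dict, Set

# Invented placeholder counts, not data from the paper.
SAMPLE_FREQUENCY: Dict[str, int] = {"g1": 50, "g2": 30, "g3": 15, "g4": 5}


def coverage(kernel: Set[str], frequency: Dict[str, int]) -> float:
    """Fraction of total usage accounted for by glyphs in the kernel."""
    total = sum(frequency.values())
    covered = sum(count for glyph, count in frequency.items() if glyph in kernel)
    return covered / total


def extend_kernel(kernel: Set[str], frequency: Dict[str, int], n: int) -> Set[str]:
    """Return a new kernel with the n most frequent non-kernel glyphs added."""
    candidates = sorted(
        (glyph for glyph in frequency if glyph not in kernel),
        key=lambda glyph: frequency[glyph],
        reverse=True,
    )
    return set(kernel) | set(candidates[:n])


small_kernel = {"g4"}
larger_kernel = extend_kernel(small_kernel, SAMPLE_FREQUENCY, 2)

# The size check from the text: 2172 kernel elements plus 1071 added
# characters still fit in a 2-byte code space.
assert 2172 + 1071 == 3243 and 3243 <= 2 ** 16
```

Each added glyph costs one numerical code point and raises coverage by its share of usage, which is the trade-off the paragraph above describes.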

Following the same line of thought, there are 5 possible levels of optimization of the kernel, as listed in [Table 6]. The fourth choice in [Table 6] is recommended; it includes all glyphs of the most frequently used and the frequently used categories. It has 4988 elements, and its average code length is only 1.008 times that of a 2-byte code.

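Table 6: Average code length at each level of the Glyph System (compiled 24 August 1996)

Level                                                 Code points   Weighted average code length   Entropy
1. Root set                                           457           1.9+1=2.9                      7.3038
2. Kernel set                                         2172          (37.43x3+62.57)%=1.7486        8.6782
3. Kernel set plus most frequently used characters    3243          (2.2476x3+97.7724)%=1.0452
4. Kernel set plus most frequently used and
   frequently used characters                         4988          (0.3956x3+99.604)%=1.008-
5. Full character set                                 9346          1                              9.1982

Note: A multiplier of 3 means the length is computed from the glyph expression.
      A multiplier of 2 means the length is computed from the component sequence.
      When the nature of a root is determinate, as with enclosing roots and [illegible glyph], [illegible glyph], etc., omitting the connecting symbol does not affect the fidelity of the expression.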
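The weighted average code lengths for levels 2 to 4 of Table 6 can be rechecked with a few lines of arithmetic. The sketch below assumes, following the note to the table, that a descriptively coded glyph costs 3 units and a numerically coded glyph costs 1 unit, where one unit is a single 2-byte code word. The function name and parameter names are illustrative only.

```python
def weighted_code_length(descriptive_percent: float,
                         numeric_percent: float,
                         descriptive_units: int = 3,
                         numeric_units: int = 1) -> float:
    """Usage-weighted average code length, in units of one 2-byte code word.

    descriptive_percent: share of usage (in percent) coded by glyph expressions.
    numeric_percent: share of usage (in percent) coded by numeric code words.
    """
    weighted_sum = (descriptive_percent * descriptive_units
                    + numeric_percent * numeric_units)
    return weighted_sum / 100.0


# Percentages copied from rows 2, 3 and 4 of Table 6.
level_2 = weighted_code_length(37.43, 62.57)
level_3 = weighted_code_length(2.2476, 97.7724)
level_4 = weighted_code_length(0.3956, 99.604)

for name, value in (("level 2", level_2), ("level 3", level_3), ("level 4", level_4)):
    print(name, round(value, 4))
```

The level 4 result falls just under 1.008, which agrees with the trailing minus sign in the table entry.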

Figure 6: Kernel Set and Frequency Distribution Chart

(8) Re-engineering of Existing Codes

Descriptive coding is compatible with existing codes. When the 629 components and 188 non-character roots of the kernel we developed are added to an existing code, that code gains descriptive capability. If necessary, it can be further optimized by excluding some rarely used characters to save coding space. If a kernel set can be derived for each country, then a CJK unified kernel can be formed, and hence a CJK unified descriptive code can be constructed.

