Document Processing Laboratory (文獻處理實驗室): Paper Index

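International Symposium on Hanzi Character Codes and Databases (漢字字碼與資料庫國際研討會), Kyoto and Tokyo, October 4, 1996
From the Missing-Character Problem to the Redesign of Hanzi Interchange Codes: Part 2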

A Descriptive Method for Re-engineering Hanzi
Information Interchange Codes

Page 2 of  9

2. The Representation of Hanzi Knowledge

(1) Previous Studies

(a) A Fundamental Character Set for Computer Use

Around 1970, studies on using computers to process Hanzi information were launched in Taiwan. In those days, statistical knowledge of characters and words was not sufficient to support research needs. Therefore, Mr. Su Lin of the Department of Computer and Control Engineering at National Chiao-Tung University conducted a study to find a Fundamental Character Set for Computer Use, with the support of Wang Laboratories. The research started in October 1971 and took more than 2000 man-days; a draft report was published in March 1972. Some highlights of this research are as follows.

  1. A thorough survey of all statistical studies of Hanzi from 1856 to 1971 was carried out, along with a survey of the character sets used by popular dictionaries and the press during that period.

  2. Eleven of the character sets surveyed in item 1 were selected and weighted. Their union was named the Fundamental Character Set for Computer Use (中文電腦基本用字). These 11 sets are listed in Table 1. Together they cover more than 30 previous statistical works on character sets.

  3. Variants were also collected. References on variants include 林語堂〈整理漢字草案〉 and 國立編譯館《常用字統一字形暫用表》(1968). Some guidelines for selecting the authoritative glyph are listed in Table 2. Other glyphs, or variants, were also collected as “reference glyphs” (參照字形). As a result, the collected character set contains simplified characters. In general, this research did not care whether a glyph was right or wrong, complicated or simplified, standard or popular, ancient or contemporary. The collection simply aimed to cover every glyph that a computer might encounter. This attitude differs from that of all previous work.

  4. This research collected 8532 characters, plus an additional 597 variants. The total usage frequency count is 2,022,604, distributed as follows:

| Category | Number of characters | Percentage of use |
|---|---|---|
| Most frequently used | 1857 | 97.34% |
| Frequently used | 2068 | 2.27% |
| Occasionally used | 2182 | 0.27% |
| Rarely used | 2425 | 0.12% |

The entropy of the system is 9.60. The accumulated frequencies of the 500 most frequently used characters are listed in Table 3. To date, this survey remains the most comprehensive one, and we use it as the foundation of our study.
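The entropy figure quoted above can be illustrated with a minimal sketch. It assumes the figure is the Shannon entropy of the character frequency distribution, measured in bits per character; the source does not state the formula or the unit. The function name and the toy corpus are illustrative only and are not the 1972 survey data.

```python
import math
from collections import Counter

def shannon_entropy_bits(counts):
    """Shannon entropy (bits per character) of a table of usage counts.

    Assumption: this is the kind of entropy the text refers to.
    """
    total = sum(counts.values())
    return -sum(
        (n / total) * math.log2(n / total)
        for n in counts.values()
        if n > 0  # characters that never occur contribute nothing
    )

# Toy corpus, purely illustrative (not the survey data).
sample = Counter("的一是的不的一了是的")
print(shannon_entropy_bits(sample))

# Four equally likely characters carry exactly 2 bits each.
print(shannon_entropy_bits({"a": 1, "b": 1, "c": 1, "d": 1}))  # 2.0
```

A highly skewed distribution, such as the one in the table above where a small group of characters accounts for most usage, yields an entropy well below the logarithm of the number of distinct characters.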

Table 1 : The 11 character sets collected in《中文電腦基本用字集》

  1. 莊澤宣,《基本字彙》,廣州中山大學教育學研究所,1930
  2. 胡顏立,《小學初級分級暫用字彙》教育部,1935
  3. 教育部,《注音漢字》商務印書館,1935初版,1961台一版
  4. 蔡樂生,《常用字選》英文中國郵報社,1946
  5. 台灣省國語推行委員會,《國音標彙編》,開明書局,1947初版,1971台二版
  6. 王清波,《國民小學現行國語課本國字初現課次、重現次數之分析研究》, 高雄市政府,1963
  7. 國立編譯館,《國民小學常用字彙研究》中華書局,1967
  8. 台灣電信局,《電碼新編》,1967增訂版
  9. 星華打字儀器行,《中文打字機新版文字排列表》,台北,1969
  10. 世界中文報業協會,《新聞常用字彙》,1970
  11. 中南鑄字廠,《常用字表》,台北,1971

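Table 2: Guidelines for handling variant characters in《中文電腦基本用字集》

  1. Select from existing character lists; do not coin new characters.
  2. Where one character has several forms, take the simpler one, regardless of whether it is the orthodox or the popular form, ancient or modern. Where the ancient form is simpler, follow the ancient, e.g. take「[missing glyph]」rather than「禮」; where the modern form is simpler, follow the modern, e.g. take「[missing glyph]」rather than「繡」.
  3. Where one character has several forms, take the one whose structure is better suited to computer design, e.g. take「略」rather than「[missing glyph]」, and「裡」rather than「[missing glyph]」.
  4. Where one character has several forms, take the one in common use, e.g. take「拿」rather than「拏」.
  5. Where one form has become established in popular usage while the original character still carries other meanings, keep both, e.g.「尿」and「溺」.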

Table 3: Accumulated frequency of《中文電腦基本用字集》

| Number of characters | Accumulated frequency | Number of characters | Accumulated frequency | Number of characters | Accumulated frequency |
|---|---|---|---|---|---|
| 5 | 9.24% | 50 | 32.39% | 372 | +70% |
| 10 | 14.20% | 60 | 35.22% | 472 | +75% |
| 15 | 17.79% | 80 | 39.92% | 500 | 76.27% |
| 20 | 20.54% | 100 | 43.74% | 1000 | 89.38% |
| 30 | 25.13% | 141 | +50% | | |
| 40 | 29.02% | 232 | +60% | | |

Note: The sign “+” indicates “just over”. For example, +50% means that the accumulated frequency of the first 141 characters is just over 50%.
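The accumulated-frequency figures and the “just over” entries in Table 3 can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and the counts are toy values rather than the survey data.

```python
from itertools import accumulate

def cumulative_coverage(counts):
    """Cumulative share of total usage, most frequent character first."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    return [running / total for running in accumulate(ordered)]

def chars_needed(counts, threshold):
    """Smallest number of top-ranked characters whose accumulated
    frequency is just over `threshold` (a fraction between 0 and 1)."""
    for rank, share in enumerate(cumulative_coverage(counts), start=1):
        if share > threshold:
            return rank
    return None  # the threshold is never exceeded

toy_counts = [50, 20, 15, 10, 5]  # hypothetical usage counts
print(cumulative_coverage(toy_counts))
print(chars_needed(toy_counts, 0.60))  # 2
```

With the real survey counts, `chars_needed(counts, 0.50)` would correspond to the “+50%” row of Table 3.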

(b) The Chiao-Tung Root System

The earliest study of glyph structure in Taiwan was carried out at National Chiao-Tung University; the resulting system was therefore named the Chiao-Tung Root System. In 1972, the master's thesis of Mr. 倪耿,《中國文字之結構模式及其分析》, analyzed 16 different compositional operators of glyphs and found that only three of them, namely horizontal composition, vertical composition, and contain composition, are frequently used and can be assembled into an effective system for representing the structure of glyphs. A formal representation of the system in Backus Normal Form is shown in Table 4.

Table 4: A formal representation of Hanzi glyphs in Backus Normal Form

〈character set〉 ::= 〈glyph〉/〈symbols〉

〈symbols〉 ::= including punctuation symbols, pronunciation symbols, and others

〈glyph〉 ::= 〈root〉/〈component〉/〈glyph〉〈operator〉〈glyph〉

〈component〉 ::= 〈root〉/〈component〉〈operator〉〈component〉

〈operator〉 ::= horizontal / vertical / contain

〈root〉 ::= there are 496 roots as shown in Table 5

Mathematically speaking, the system in Table 4 is a production system. This means that the outcomes it can produce, such as characters, glyphs, and components, are usually far more numerous than those we would accept. Whether an outcome is a legal item is up to our choice. The system is therefore expandable, which is exactly what we need.

In this production system, the structure of a glyph is expressed as an expression of roots. A root is a basic component that is not decomposed further. An example of glyph structure is shown in Figure 1. The term “component” usually refers to the intermediate parts between a glyph and its roots. In Figure 1, [missing glyph] and [missing glyph] are glyphs, [missing glyph] is a component, 弓、言、系 are glyphs as well as roots, and E is a root.
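The idea of expressing a glyph as an expression of roots can be sketched as a small tree structure. This is an illustrative model only: the class and function names are hypothetical, the operator names follow Table 4, and the decompositions of 明 and 萌 are common-sense examples, not entries taken from the Chiao-Tung root table.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Union

class Op(Enum):
    HORIZONTAL = "horizontal"  # left part, then right part
    VERTICAL = "vertical"      # upper part, then lower part
    CONTAIN = "contain"        # outer part, then inner part

@dataclass(frozen=True)
class Root:
    name: str  # a basic component that is not decomposed further

@dataclass(frozen=True)
class Compound:
    op: Op
    first: "Node"
    second: "Node"

Node = Union[Root, Compound]

def roots_of(node: Node) -> List[str]:
    """Flatten a glyph expression into its sequence of roots."""
    if isinstance(node, Root):
        return [node.name]
    return roots_of(node.first) + roots_of(node.second)

def to_expression(node: Node) -> str:
    """Render the structure as a parenthesised expression of roots."""
    if isinstance(node, Root):
        return node.name
    return f"({to_expression(node.first)} {node.op.value} {to_expression(node.second)})"

# Illustrative decompositions (assumed, not from Table 5):
ming = Compound(Op.HORIZONTAL, Root("日"), Root("月"))  # 明, here a component
meng = Compound(Op.VERTICAL, Root("艹"), ming)          # 萌, the full glyph

print(to_expression(meng))  # (艹 vertical (日 horizontal 月))
print(roots_of(meng))       # ['艹', '日', '月']
```

In this sketch 明 plays the role of a component, the intermediate part between the glyph 萌 and its roots, while also being a legal glyph in its own right.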

Table 5: Table of Chinese roots (中文字根表), ordered by descending frequency

The work of Mr. 倪耿 was done in parallel with Mr. Su Lin's work on《中文電腦基本用字集》, and together they derived the set of Chiao-Tung Roots. This root set has a unique feature: it was obtained by three iterations of an optimization procedure. The optimization procedure is derived from a mathematical calculation that optimizes a polynomial expression of the total number of roots and the average number of roots per glyph. In general, the fewer the roots, the longer the decomposition of each glyph. The optimization produces the following criterion: a glyph should not be decomposed if its frequency of usage is over 0.3758%; it should be decomposed into no more than 2 roots if its frequency is from 0.1879% to 0.3758%, no more than 3 roots from 0.1236% to 0.1879%, and no more than 4 roots from 0.0939% to 0.1236%.【1】 This criterion determines the bottom line of decomposition.
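The decomposition criterion stated above can be written as a small lookup. This is a sketch under assumptions: the function name is hypothetical, the treatment of the exact boundary values is a guess (the text does not say which side each boundary belongs to), and the text gives no limit for glyphs below 0.0939%.

```python
from typing import Optional

def max_roots(freq_percent: float) -> Optional[int]:
    """Maximum number of roots a glyph may be decomposed into,
    given its frequency of usage in percent.

    A result of 1 means the glyph is not decomposed at all.
    Boundary handling is an assumption, not stated in the text.
    """
    if freq_percent > 0.3758:
        return 1  # "should not be decomposed"
    if freq_percent >= 0.1879:
        return 2
    if freq_percent >= 0.1236:
        return 3
    if freq_percent >= 0.0939:
        return 4
    return None  # no limit is stated for rarer glyphs

for f in (0.5, 0.2, 0.15, 0.1, 0.05):
    print(f, max_roots(f))
```

The rule caps the decomposition length of frequent glyphs, which is what keeps the usage-weighted number of roots per glyph low.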

From the 9129 glyphs in《中文電腦基本用字表》, 496 roots were obtained, as listed in Table 5. In this root set, 305 roots are themselves characters, and their combined frequency of usage exceeds 50% of the total usage. The accumulated frequency of the 25 most frequently used roots is 30%; it rises to 49% for 50 roots, 66.7% for 100 roots, 84.9% for 200 roots, and 95% for 300 roots.【2】

As a result of the optimization, the weighted average number of roots per glyph is only 1.9. The power of this system is illustrated by checking it against the 49905 characters of《中文大辭典》: 48713 characters can be expressed by the Chiao-Tung Root System. The remaining 1129 characters are 籀文, 篆字, or ancient forms such as 古文, 反文, and 圖騰. If needed, these 1129 glyphs can be included in the Chiao-Tung Root System at any time without difficulty. A pictorial illustration of this result is shown in Figure 2.
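The weighted average mentioned above can be illustrated with a short sketch. It assumes the weights are usage counts, which the text does not spell out; the function name and the sample data are hypothetical.

```python
def weighted_average_roots(glyphs):
    """Usage-weighted average number of roots per glyph.

    `glyphs` is an iterable of (usage_count, number_of_roots) pairs.
    Assumption: the weights are usage counts.
    """
    glyphs = list(glyphs)  # allow generators; we iterate twice
    total_usage = sum(count for count, _ in glyphs)
    if total_usage == 0:
        raise ValueError("no usage recorded")
    return sum(count * n_roots for count, n_roots in glyphs) / total_usage

# Hypothetical data: one frequent undecomposed glyph, two rarer compound ones.
toy = [(900, 1), (80, 2), (20, 3)]
print(weighted_average_roots(toy))                # 1.12
print(sum(n for _, n in toy) / len(toy))          # unweighted mean: 2.0
```

Because frequent glyphs are kept short, the weighted average sits well below the unweighted mean in this toy example.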

The Chiao-Tung Root System was developed on the basis of the 楷書 style. It therefore does not cover the 篆, 隸, 行, or 草 styles, nor does it take into account calligraphic variants, print fonts, or modern artistic fonts. What it provides is the common structure of a glyph, which is the basis of every font design.

Figure 2. 496 roots obtained from 9129 glyphs can produce 48713 glyphs and more. (A study of 1972; see【1】.)


【1】For the calculation of this marginal utility, see: 謝清俊、黃永文、林樹,《中文字根之分析》, 交大學刊, Vol. 6, No. 1, February 1973.

【2】For data on the decomposition of the 9129 glyphs and the usage frequency of the roots, see: 劉達人、杜敏文、謝清俊、張仲陶、蔡中川、林樹,《漢字綜合索引字典》, Asian Associates, Bedford, New York, 1979.

