Document Processing Laboratory (文獻處理實驗室): Paper Index

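International Symposium on Hanzi Character Codes and Databases (漢字字碼與資料庫國際研討會), Kyoto and Tokyo, October 4, 1996
Redesigning Hanzi Interchange Codes in Light of the Missing-Character Problem: Part 2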

A Descriptive Method for Re-engineering Hanzi
Information Interchange Codes

Page 5 of 9

2. The Representation of Hanzi Knowledge

(4) Markup Tags for Missing Characters and Variants

(a) Kanji Placeholder

Wittern and App were the first to use SGML tags to mark up missing characters.【3】 The technique they developed is called the Kanji Placeholder (漢字位標, or 位標 for short). A Kanji Placeholder starts with "&" and closes with ";", with two fields in between. The first field is a code identifier, and the second field is the code word at which the missing glyph is found in that code. The Kanji Placeholder thus provides a link across Interchange Codes so that glyphs can be shared.

As an example, &U4AB5; represents a missing glyph found at location 4AB5 of Unicode, which is identified by the field U. The Kanji Placeholder helps share glyphs collected by various Interchange Codes, provided that a collection of Interchange Codes is accessible to the user. Wittern and App maintain a Kanji Base on the Internet that collects many Interchange Codes, such as CNS with more than 45 thousand glyphs and Unicode with approximately 22 thousand glyphs. They have also built a bank for missing characters that cannot be found in any of the Interchange Codes they collected.
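The placeholder format described above can be sketched in a few lines of Python. This is only an illustration, not Wittern and App's implementation: it assumes that the code identifier is a single uppercase letter and that the code word is a hexadecimal string, which fits the &U4AB5; example but is not stated in general. The names `Placeholder` and `find_placeholders` are hypothetical.

```python
import re
from typing import List, NamedTuple


class Placeholder(NamedTuple):
    code_id: str    # first field: identifies the Interchange Code, e.g. "U"
    code_word: str  # second field: location of the glyph in that code


# Assumption: one uppercase letter for the code identifier, followed by
# a hexadecimal code word, all enclosed between "&" and ";".
PLACEHOLDER_RE = re.compile(r"&([A-Z])([0-9A-F]+);")


def find_placeholders(text: str) -> List[Placeholder]:
    """Return every Kanji Placeholder found in text, in order."""
    return [Placeholder(m.group(1), m.group(2))
            for m in PLACEHOLDER_RE.finditer(text)]


found = find_placeholders("text with &U4AB5; inside")
```

Here `found[0].code_id` is "U" and `found[0].code_word` is "4AB5". A front end would then use the identifier to pick the Interchange Code and the code word to look up the glyph in it.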

【3】 Christian Wittern and Urs App, "IRIZ Kanji Base: A New Strategy for Dealing with Missing Chinese Characters," Electronic Buddhist Text Initiative (EBTI) meeting (世界電子佛典會議), Taipei, April 1996.

(b) Hanzi Glyphholder

The Kanji Placeholder uses an SGML tag as the carrier of a pointer to the location of the missing glyph. We extend this idea by using the SGML tag to show the structure of the missing glyph, and by using that structure as the identifier of the missing glyph as well. The technique is called the Hanzi Glyphholder.

To avoid using any graphic symbol that already has a code word as a control symbol, two special symbols are created to represent the open delimiter and the close delimiter of the glyphholder tag, respectively. The carrier formed by the glyphholder tags contains the glyph expression of the missing glyph. For convenience, a component sequence may replace the glyph expression as long as no ambiguity arises. For example, in the Buddhist Canon, 阿門佛 can be expressed as 阿門人人人佛, or 阿門 佛.

The glyphholder can also be used to represent a variant by applying the glyph code of the three-segment coding scheme of a character, which is described in a later section. For example, "芍藥•3" represents "芍葯" if 葯 is the third variant of 藥.
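The variant notation can be illustrated with a small lookup. This is a sketch under stated assumptions, not the authors' scheme: it takes the example above to mean that 藥 followed by "•" and the number 3 stands for the glyph 葯. The table `VARIANTS` and the function `resolve_variant` are hypothetical, and the glyphholder delimiters are left out.

```python
from typing import Dict, Tuple

SEPARATOR = "\u2022"  # the "•" between a character and its variant number

# Hypothetical variant table: (character, variant number) -> variant glyph.
# It holds only the single example given in the text.
VARIANTS: Dict[Tuple[str, int], str] = {("藥", 3): "葯"}


def resolve_variant(token: str,
                    table: Dict[Tuple[str, int], str] = VARIANTS) -> str:
    """Map 'character + separator + number' to its variant glyph.

    A token without the separator is returned unchanged; an unknown
    (character, number) pair raises KeyError.
    """
    base, sep, number = token.partition(SEPARATOR)
    if not sep:
        return token
    return table[(base, int(number))]


word = "芍" + resolve_variant("藥" + SEPARATOR + "3")
```

With the table above, `word` is "芍葯". A real system would draw the table from the glyph codes of the three-segment coding scheme rather than from a hand-written dictionary.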

The glyphholder and the placeholder are compatible with each other because they share the same tagging structure. The two fields of the placeholder can also be accepted in a glyphholder if the parser is designed to recognize them. The glyphholder has some useful properties that the placeholder cannot provide. First, the glyphholder is more readable than the placeholder. Second, the user does not need to look for the missing glyph elsewhere in order to obtain an identifier or a code word for the missing character. Third, the user's front end does not need to be equipped with a database of collected glyphs and variants. Finally, the glyphholder can express the mapping relation between character and glyph.

(5) Attributes

The attributes we have collected so far for a character are listed in the following table.

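Table 6. Character attribute fields (Note: items marked "*" may be repeated)

A. Missing-character attributes

  1. Unified missing-character number
  2. Interchange code
  3. Internal code (in the user-defined character file)
* 4. Radical
* 5. Stroke count
  6. First stroke
  7. Second stroke
  8. Last stroke
* 9. Phonetic notation (注音)
*10. Interchange code of variant characters
*11. Registration date and revision record
*12. Fields for each organization supplying the missing character (including number and internal code)

B. Glyph-structure attributes

  1. Number of the character set it belongs to
  2. Interchange code
  3. Glyph code
* 4. Radical
* 5. Stroke count
  6. First stroke
  7. Decomposition method
  8. Component 1
  9. Component 2
 10. Component 3
 11. Character frequency
 12. Root frequency (字根頻次, when used as a root)
 13. Root order (字根次, when used as a root)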
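As an illustration of how the missing-character attributes (part 甲 of the table) might be held in a program, the following sketch models one record as a Python dataclass. The class and field names are hypothetical English renderings of the table entries, and the field types are assumptions. The repeatable items are modelled as lists.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MissingCharacterRecord:
    unified_number: str          # 1. unified missing-character number
    interchange_code: str        # 2. interchange code
    internal_code: str           # 3. internal code in the user-defined file
    radicals: List[str] = field(default_factory=list)        # *4
    stroke_counts: List[int] = field(default_factory=list)   # *5
    first_stroke: str = ""       # 6.
    second_stroke: str = ""      # 7.
    last_stroke: str = ""        # 8.
    phonetics: List[str] = field(default_factory=list)       # *9
    variant_codes: List[str] = field(default_factory=list)   # *10
    revision_log: List[str] = field(default_factory=list)    # *11
    contributors: List[str] = field(default_factory=list)    # *12


# The identifier below is made up for the example.
record = MissingCharacterRecord("0001", "", "")
record.radicals.append("門")
```

Using `default_factory=list` gives each record its own lists, so appending a radical to one record does not affect another.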
