A Descriptive Method for Re-engineering Hanzi Information Interchange Codes 9-5

2. The Representation of Hanzi Knowledge

(4) Markup Tags for Missing Characters and Variants

(a) Kanji Placeholder

The use of SGML tags to mark up missing characters was introduced by Wittern and App.【3】 Their technique is called the Kanji Placeholder (漢字位標, abbreviated 位標). A Kanji Placeholder starts with "&" and closes with ";", with two fields in between. The first field is a code identifier, and the second field is the code word at which the missing glyph is found in that code. The Kanji Placeholder thus provides a link across Interchange Codes so that glyphs can be shared.

For example, "&U4AB5;" represents a missing glyph found at location 4AB5 of Unicode, which is identified by the field "U". The Kanji Placeholder helps share glyphs collected in various Interchange Codes, provided that a collection of Interchange Codes is accessible to the user. Wittern and App maintain a Kanji Base on the Internet that collects many Interchange Codes, such as CNS, with more than 45 thousand glyphs, and Unicode, with approximately 22 thousand glyphs. In addition, they have built a bank for missing characters that cannot be found in any of the Interchange Codes they have collected.
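The placeholder syntax described above can be illustrated with a minimal Python sketch. It assumes, for illustration only, that the code identifier is a single uppercase letter and that the code word is a run of hexadecimal digits; the names `Placeholder`, `find_placeholders`, and `expand` are hypothetical, and only the identifier "U" (Unicode) is resolved, since no other identifiers are specified in the text.

```python
import re
from typing import List, NamedTuple, Optional


class Placeholder(NamedTuple):
    code_id: str    # first field: code identifier, e.g. "U" for Unicode
    code_word: str  # second field: code word, e.g. "4AB5"


# Assumed syntax: "&", one uppercase letter, hexadecimal digits, ";".
PLACEHOLDER_RE = re.compile(r"&([A-Z])([0-9A-Fa-f]+);")


def find_placeholders(text: str) -> List[Placeholder]:
    """Return every placeholder found in the text, in order."""
    return [Placeholder(m.group(1), m.group(2).upper())
            for m in PLACEHOLDER_RE.finditer(text)]


def resolve(ph: Placeholder) -> Optional[str]:
    """Map a placeholder to a character when the code is known."""
    if ph.code_id == "U":
        return chr(int(ph.code_word, 16))
    return None  # other Interchange Codes would need their own tables


def expand(text: str) -> str:
    """Replace resolvable placeholders; leave the others untouched."""
    def repl(m: "re.Match[str]") -> str:
        glyph = resolve(Placeholder(m.group(1), m.group(2)))
        return glyph if glyph is not None else m.group(0)
    return PLACEHOLDER_RE.sub(repl, text)
```

A placeholder whose identifier is not known to the resolver is left in the text unchanged, so a document remains interchangeable even when the receiving side lacks the corresponding code table.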

【3】 Christian Wittern and Urs App, "IRIZ Kanji Base: A New Strategy for Dealing with Missing Chinese Characters," World Electronic Buddhist Text Conference (EBTI), Taipei, April 1996.

(b) Hanzi Glyphholder

The Kanji Placeholder uses an SGML tag as the carrier of a pointer to the location of the missing glyph. We extend this idea by using the SGML tag to show the structure of the missing glyph, so that the structure also serves as the identifier of the missing glyph. The technique is called the "Hanzi Glyphholder".

To avoid using any graphic symbol of a code word as a control symbol, the special symbols 䏁 and 㗱 are created to represent the open delimiter and the close delimiter of the glyphholder tag, respectively. The carrier formed by the glyphholder tags contains the glyph expression of the missing glyph. For convenience, a component sequence may replace the glyph expression as long as no ambiguity arises. For example, in the Buddhist Canon, 阿門佛 can be expressed as 阿䏁門人人人㗱佛 or 阿䏁門　㗱佛.

The glyphholder can also represent a variant by using the glyph code from the "three-segment coding scheme" of a character, described in a later section. For example, "芍䏁藥‧3㗱" represents "芍葯" if 葯 is the third variant of 藥.
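The two uses of the glyphholder described above (a component sequence or glyph expression, and a variant reference) can be sketched in Python as follows. This is only an illustration under stated assumptions: the delimiter and separator characters are copied as they appear in this text, the variant form is assumed to be a base character, the separator, and a decimal number, and the names `Glyphholder`, `parse_holder`, and `find_holders` are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional

# Delimiters and separator as they appear in this text (assumed encoding).
OPEN, CLOSE, SEP = "䏁", "㗱", "‧"


class Glyphholder(NamedTuple):
    body: str               # glyph expression, component sequence, or base character
    variant: Optional[int]  # variant number when the tag has the form "character SEP n"


HOLDER_RE = re.compile(re.escape(OPEN) + r"(.*?)" + re.escape(CLOSE))


def parse_holder(content: str) -> Glyphholder:
    """Interpret the text found between the open and close delimiters."""
    base, sep, number = content.partition(SEP)
    if sep and number.isascii() and number.isdigit():
        return Glyphholder(base, int(number))
    return Glyphholder(content, None)


def find_holders(text: str) -> List[Glyphholder]:
    """Return every glyphholder found in the text, in order."""
    return [parse_holder(m.group(1)) for m in HOLDER_RE.finditer(text)]
```

For the examples in the text, the tag in 阿䏁門人人人㗱佛 parses to the component sequence 門人人人 with no variant number, and the tag in 芍䏁藥‧3㗱 parses to the base character 藥 with variant number 3.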

The glyphholder and the placeholder are compatible with each other because they share the same tagging structure. The two fields of the placeholder can also be accepted in a glyphholder if the parser is designed to recognize them. The glyphholder has several properties that the placeholder cannot provide: it is more readable than the placeholder; the user does not need to look up the missing glyph elsewhere to obtain an identifier or a code word for the missing character; the user's front end does not need to be equipped with a database of glyphs and variants; and the glyphholder can express the mapping relation between character and glyph.

(5) Attributes

The attributes we have collected so far for a character are listed in the following table.

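Table 6. Character attribute fields (Note: fields marked "*" may be repeated)

A. Attributes of missing characters

  1. Unified serial number of the missing character
  2. Interchange code
  3. Internal code (in the user-defined character file)
* 4. Radical
* 5. Stroke count
  6. First stroke
  7. Second stroke
  8. Last stroke
* 9. Phonetic notation
*10. Interchange code of variant characters
*11. Registration date and revision record
*12. Fields for each organization that supplied the missing character (including serial number and internal code)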

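B. Attributes of glyph structure

  1. Number of the character set to which the glyph belongs
  2. Interchange code
  3. Glyph code
* 4. Radical
* 5. Stroke count
  6. First stroke
  7. Decomposition method
  8. Component 1
  9. Component 2
 10. Component 3
 11. Character frequency
 12. Root frequency (when used as a root)
 13. Root rank (when used as a root)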
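As an illustration of how the glyph-structure attributes in the second list above might be held in a program, the following Python sketch defines a record type. The class name, field names, and sample values are hypothetical; fields marked "*" in the table are modeled as lists because they may be repeated, the three component fields are grouped into one list, and item 13 is omitted because its exact meaning is not stated in the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GlyphStructureRecord:
    charset_id: str                  # 1. character set the glyph belongs to
    interchange_code: str            # 2. interchange code
    glyph_code: str                  # 3. glyph code
    radicals: List[str] = field(default_factory=list)       # 4. repeatable (*)
    stroke_counts: List[int] = field(default_factory=list)  # 5. repeatable (*)
    first_stroke: Optional[str] = None                      # 6. first stroke
    decomposition: Optional[str] = None                     # 7. decomposition method
    components: List[str] = field(default_factory=list)     # 8-10. components 1 to 3
    char_frequency: int = 0                                 # 11. character frequency
    root_frequency: int = 0                                 # 12. frequency as a root

    def __post_init__(self) -> None:
        # The table provides only three component fields.
        if len(self.components) > 3:
            raise ValueError("at most three components are allowed")


# Hypothetical sample record for the character 明 (values for illustration only).
sample = GlyphStructureRecord(
    charset_id="U",
    interchange_code="660E",
    glyph_code="0",
    radicals=["日"],
    stroke_counts=[8],
    components=["日", "月"],
)
```

Modeling the repeatable fields as lists keeps a single record per glyph even when, for example, different dictionaries assign different radicals or stroke counts.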