A Descriptive Method for Re-engineering Hanzi Information Interchange Codes

謝清俊

1. Criticisms of Existing Interchange Codes

Everybody agrees that the existing Hanzi interchange codes are not good enough for daily applications; they never seem to satisfy users' requirements. As described in Part 1, the missing character problem is decisive evidence of the insufficiency of the existing Hanzi interchange codes. The consequences of missing characters are severe, and it seems that the only solution to the problem is to re-engineer the existing coding system for Hanzi.

It is interesting to note that the meaning of “missing character” is not clearly defined. It is unclear what is missing: a character, a glyph, or a typeface? This question leads to another drawback of the interchange codes: the existing codes do not distinguish between character and glyph. Although definitions of character and glyph are stated in the documents, they seem never to have been faithfully implemented. For example, there are two code positions, 饑 (C4C8) and 飢 (B047), in the Big-5 code. Do they represent different characters? By definition they should, because they have different code words. In fact, however, they are merely two glyphs of one character. This situation is common to all existing codes without exception.
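
The point can be checked with a minimal Python sketch. It assumes that the `big5` codec shipped with CPython reflects the Big-5 table under discussion; the helper name `big5_code_word` is introduced here only for illustration. The sketch shows only that the code assigns the two forms separate code words, and that nothing in the code itself links them as variants of one character.

```python
# The two glyphs cited in the text as forms of one character.
variant_a = "饑"
variant_b = "飢"


def big5_code_word(ch: str) -> str:
    """Return the Big-5 code word of one character as upper-case hex digits.

    Relies on CPython's built-in 'big5' codec (an assumption about which
    Big-5 table is meant).
    """
    return ch.encode("big5").hex().upper()


code_a = big5_code_word(variant_a)
code_b = big5_code_word(variant_b)

# Print each glyph next to the code word the codec assigns to it.
print(variant_a, code_a)
print(variant_b, code_b)

# The code words differ; the code carries no field relating one to the other.
print("same code word:", code_a == code_b)
```

Any program that compares text by code word alone, such as a plain string search, will therefore treat the two forms as unrelated characters.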

One may argue that similar situations exist in ISO 646, such as the upper and lower cases of the alphabet. It is true that there are two sets of glyphs for the alphabet in ISO 646, but note that the two sets also share the same collating sequence. In Hanzi codes, with the exception of CCCII and EACC, there is no such structure to show the corresponding collating sequence between variants. Therefore, a structure that maps a character to its associated glyphs, or variants, should be added to the existing coding structure.
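
One possible shape for such a structure is sketched below in Python. It is an illustration under stated assumptions, not the structure used by CCCII or EACC: the names `VARIANT_TO_CHARACTER`, `character_of`, `collation_key`, and `same_text` are hypothetical, the table holds only the pair cited earlier, and the choice of which form stands for the character is arbitrary.

```python
import string

# In ISO 646 every lower-case letter sits exactly 0x20 above its capital,
# so the two sets of glyphs run in the same order.
assert all(ord(c.lower()) - ord(c) == 0x20 for c in string.ascii_uppercase)

# Hypothetical variant table: each glyph maps to the character it represents.
# Only the pair cited in the text is listed; the representative is arbitrary.
VARIANT_TO_CHARACTER = {
    "饑": "飢",
    "飢": "飢",
}


def character_of(glyph: str) -> str:
    """Return the character a glyph represents; unlisted glyphs map to themselves."""
    return VARIANT_TO_CHARACTER.get(glyph, glyph)


def collation_key(text: str) -> str:
    """Replace every glyph by its character so that variants compare equal."""
    return "".join(character_of(glyph) for glyph in text)


def same_text(a: str, b: str) -> bool:
    """Compare two strings at the character level, ignoring glyph variation."""
    return collation_key(a) == collation_key(b)


# Two spellings of one word that differ only in the variant used.
print(same_text("饑餓", "飢餓"))
```

Keeping the mapping outside the code words leaves existing text files untouched, while searching and sorting can work on the key instead of the raw codes.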

Another drawback of the existing coding structure is the implied assumption that the set of Hanzi is a closed, finite set, just like an alphabet. Theoretically, this may be true as long as the Chinese language never produces new characters. In real life, however, the assumption is too strong to hold and is not practical at all. Various reasons are given in Part 1. Besides, so far no one knows how many Hanzi characters there are. Therefore, from an engineering point of view, an open coding structure is required to accommodate the characteristics of the Hanzi set. Otherwise the missing character problem will never be eliminated, no matter how diligently we try to collect “all” of the possible glyphs that may be found in numerous documents.

The central theme of solving the missing character problem is to represent more knowledge of Hanzi, especially knowledge about glyphs, in the computer, so that the computer can use this knowledge to do what we want it to do. To do so, the code must be changed, because the Hanzi code is at present the major knowledge structure for Hanzi in the computer. In addition, to improve the performance of existing Hanzi processing systems for possible future applications, the following principles should be considered.

Do not sacrifice Hanzi information sharing in order to solve the coding or missing character problem.
The solution should be fair to all Hanzi used in different countries and regions.
Establish the capability to represent, input, search, share, and manage missing characters.
Give formal working definitions of character, glyph, font, typeface, etc. that are acceptable to linguistics and Wen-zi-xue, and formalize their relations.
Establish a database of the attributes of Hanzi.
Use ISO 8879 SGML to describe missing glyphs and Hanzi text files for text sharing.
The expandability and flexibility of the system must be considered so that traditional Wen-zi-xue knowledge can later be adopted into the system.
Keep the working environment within the two-byte code space so that existing application software can still be used.