An Overview of the Development of the Academia Sinica Full-Text Databases of Ancient Chinese Texts

謝清俊, 林晰, March 1997

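Abstract (translated from the Chinese)

Academia Sinica has been using computers to process ancient Chinese texts for twelve years, and the development of its full-text databases has drawn the most attention. The full-text databases now online contain more than 110 million characters in total, and the technology behind them was developed entirely by Academia Sinica staff. Five institutes took part in building the databases: the Institute of History and Philology, the Institute of Taiwan History, the Institute of Information Science, the Institute of Modern History, and the Institute of Literature and Philosophy, together with the Academia Sinica Computing Center. The Academia Historica (國史館) of the Office of the President has also been actively involved in developing the Qing history database. Since 1995, several universities have entered into cooperative relationships with Academia Sinica to share ancient text data: in Taiwan, Chung-San University, Chung-Cheng University, and National Taiwan Normal University; abroad, the University of London, Stanford University, the University of Michigan, and the Chinese University of Hong Kong. This paper first describes the current state of each full-text database and then presents the related technologies developed in-house, including the structure of the full-text databases, the text markup system, the management of data entry, the management of missing characters and character creation, and the related research and development projects now under way in each unit.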

Abstract

This paper surveys the full-text databases and related text-processing techniques for ancient Chinese documents developed at Academia Sinica over the past 12 years. Five institutes (the Institute of History and Philology, the Institute of Taiwan History, the Institute of Literature and Philosophy, the Institute of Information Science, and the Institute of Modern History) and the Computing Center of Academia Sinica have actively participated in this long-range project since 1984. In addition, the Academia Historica has participated in developing the database of Ching Dynasty history. Since 1995, collaboration projects have been launched with other universities to produce more digital texts, including the University of London in England, Stanford University and the University of Michigan in the USA, the Chinese University of Hong Kong, and Chung-Cheng University, Chung-San University, and National Taiwan Normal University in Taiwan. The on-line full-text databases now total more than 115 million characters, and databases containing more than 80 million further characters are forthcoming. In this report, we also survey some of the important techniques developed, including the structure of the full-text database, the handling of missing characters, the management of data entry jobs, and the development of the markup system. Finally, the status of some ongoing related research projects is summarized as a perspective on the future development of digital ancient Chinese documents.