IsOBS: An Information System for Oracle Bone Script

Oracle bone script (OBS) is the earliest known ancient Chinese writing system and the ancestor of modern Chinese. As the Chinese writing system is the oldest continuously used system in the world, the study of OBS plays an important role in both linguistic and historical research. In order to utilize advanced machine learning methods to automatically process OBS, we construct an information system for OBS (IsOBS) to symbolize, serialize, and store OBS data at the character level, based on efficient databases and retrieval modules. Moreover, we apply few-shot learning methods to build an effective OBS character recognition module, which can recognize a large number of OBS characters (especially those characters with only a handful of examples) and makes the system easy to use. The demo system of IsOBS is available at http://isobs.thunlp.org/. In the future, we will add more OBS data to the system, and we hope that IsOBS can support further efforts in automatically processing OBS and advance scientific progress in this field.


Introduction
Oracle bone script (OBS) refers to characters carved on animal bones or turtle plastrons. Research on OBS is important for both Chinese linguistic and historical research: (1) As shown in Figure 1, OBS is the direct ancestor of modern Chinese and is closely related to other languages in East Asia (Xueqin, 2002). Analyzing and understanding OBS is vital for studying the etymology and historical evolution of Chinese as well as other East Asian languages.
(2) As shown in Figure 2, the number of characters on one OBS document, carved on one animal bone or turtle plastron, ranges from fewer than ten to more than one hundred. Moreover, as OBS was used for divination in ancient China, these documents cover a variety of topics, including war, ceremonial sacrifice, and agriculture, as well as the births, illnesses, and deaths of royal members (Flad et al., 2008). Therefore, OBS documents constitute the earliest Chinese textual corpora, and analyzing and understanding OBS is of great significance to historical research.
Considering that it is often complicated and time-consuming to manually process ancient languages, some efforts have been devoted to utilizing machine learning techniques in this field. In order to detect and recognize ancient characters, Anderson and Levoy (2002); Rothacker et al. (2015); Mousavi and Lyashenko (2017); Rahma et al. (2017); Yamauchi et al. (2018) utilize computer vision techniques to visualize cuneiform tablets and recognize cuneiform characters, while Franken and van Gemert (2013); Nederhof (2015); Iglesias-Franjo and Vilares (2016) apply similar techniques to recognize Egyptian hieroglyphs. For understanding ancient text, Snyder et al. (2010) first show the feasibility of automatically deciphering a dead language by designing a Bayesian model to match the alphabet with non-parallel data. Then, Berg-Kirkpatrick and Klein (2011) propose a more effective decipherment approach and achieve promising results. Pourdamghani and Knight (2017) adopt a method similar to non-parallel machine translation (Mukherjee et al., 2018; Lample et al., 2018) to decipher related languages, which further inspires Luo et al. (2019) to propose a novel neural approach for the automatic decipherment of Ugaritic and Linear B. Doostmohammadi and Nassajian (2019); Bernier-Colborne et al. (2019) explore learning language models for cuneiform text.
These previous efforts have inspired us to apply machine learning methods to the task of processing OBS. However, there are still three main challenges: (1) Unlike ancient Greek and Central Asian scripts, in which letters are mainly used to constitute words and sentences, OBS is hieroglyphic and does not have any delimiter to mark word boundaries. This challenge also exists for modern Chinese. (2) Although OBS is the origin of modern Chinese, it is quite different from modern Chinese characters. Typically, one OBS character may have several different glyphs. Moreover, there are many compound OBS characters that correspond to multiple modern Chinese words. (3) An effective and stable system to symbolize and serialize OBS data is still lacking. Most OBS data is stored in the form of unserialized bone/plastron photos, which support neither recognizing characters nor understanding text.

Figure 1: The historical evolution of the character "horse" from OBS to modern Chinese.
The above three challenges make it difficult to use existing machine learning methods for understanding OBS, and the third one is the most crucial. To this end, we construct an information system for OBS (IsOBS) to symbolize and serialize OBS data at the character level, so that we can utilize machine learning methods to process OBS in the future: (1) We construct an OBS character database, where each character is matched to its corresponding modern Chinese character (if it has been deciphered) and is stored with a variety of its glyphs. (2) We construct an OBS document database, which stores more than 5,000 OBS documents. We also split the images of these documents into character images, and use these character images to construct both the OBS and the corresponding modern Chinese character sequences for each document. (3) We implement a character recognition module for OBS characters based on few-shot learning models, considering that there are only a handful of examples for each OBS character. Based on the character recognition module, we construct an information retrieval module for searching the character and document databases.
The databases, character recognition module, and retrieval module of IsOBS provide an effective and efficient approach to symbolize, serialize, and store OBS data. We believe IsOBS can serve as a foundation to support further research (especially character recognition and language understanding) on automatically processing OBS in the future.

Application Scenarios
As mentioned before, IsOBS is designed for symbolizing, serializing, and storing the OBS data. Hence, the application scenarios of IsOBS mainly focus on constructing databases for both OBS characters and documents, as well as implementing character recognition and retrieval modules for data search.

Character Database for OBS
In IsOBS, we construct a database to store OBS characters. For each OBS character, we store both its corresponding modern Chinese character (only for those OBS characters that have been deciphered) and its glyph set. As shown in Figure 3, users can input a modern Chinese character to search for all glyphs of its corresponding OBS character. For those OBS characters that have no corresponding modern Chinese characters, we provide interfaces that use our character recognition module to search for them. We will introduce this part in more detail later.

Document Database for OBS
Besides the character database, we also construct a document database to store OBS documents. As shown in Figure 4, for each document in the document database, we store the image of its original animal bone or turtle plastron, together with both the OBS and the modern Chinese character sequences of the document. By querying the specific identity number designated by official collections, users can retrieve the corresponding OBS document from our database. In addition, we align the character database with the document database, so that when users input a modern Chinese character to retrieve OBS glyphs, the documents mentioning this character can also be retrieved.
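The alignment between the character and document databases can be pictured with a small in-memory sketch. The Python below is illustrative only: the class names, field names, and identifiers are hypothetical and are not taken from the IsOBS implementation, which is not described at this level of detail.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class OBSCharacter:
    """One OBS character with its glyph variants (hypothetical schema)."""
    char_id: str
    modern: Optional[str]  # None if the character has not been deciphered
    glyph_images: List[str] = field(default_factory=list)


@dataclass
class OBSDocument:
    """One OBS document (hypothetical schema)."""
    doc_id: str            # identity number designated by the collection
    image_path: str        # image of the bone or plastron
    char_ids: List[str]    # OBS character sequence, in reading order


class OBSStore:
    """Toy store that keeps both databases and the indexes aligning them."""

    def __init__(self) -> None:
        self.characters: Dict[str, OBSCharacter] = {}
        self.documents: Dict[str, OBSDocument] = {}
        self._by_modern: Dict[str, Set[str]] = defaultdict(set)
        self._docs_by_char: Dict[str, Set[str]] = defaultdict(set)

    def add_character(self, ch: OBSCharacter) -> None:
        self.characters[ch.char_id] = ch
        if ch.modern is not None:
            self._by_modern[ch.modern].add(ch.char_id)

    def add_document(self, doc: OBSDocument) -> None:
        self.documents[doc.doc_id] = doc
        for cid in doc.char_ids:
            self._docs_by_char[cid].add(doc.doc_id)

    def modern_sequence(self, doc_id: str) -> List[Optional[str]]:
        """Modern Chinese sequence of a document (None where undeciphered)."""
        doc = self.documents[doc_id]
        return [self.characters[cid].modern for cid in doc.char_ids]

    def search_modern(self, modern: str) -> Tuple[List[str], List[str]]:
        """Return (glyph image paths, document ids) for a modern character."""
        glyphs: List[str] = []
        docs: Set[str] = set()
        for cid in sorted(self._by_modern.get(modern, ())):
            glyphs.extend(self.characters[cid].glyph_images)
            docs |= self._docs_by_char.get(cid, set())
        return glyphs, sorted(docs)
```

In this sketch, a single query by modern Chinese character returns both the glyph set and the documents that mention the character, which mirrors the aligned retrieval described above.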

Character Recognition and Information Retrieval Modules
Since OBS characters are hieroglyphs and the character-glyph mappings are quite complex, the character recognition module is designed to map input glyph images to their OBS characters. As shown in Figure 5, after a handwritten glyph image of a character is input, the character recognition module returns several candidate matching pairs of OBS characters and their corresponding modern Chinese characters. Users can select one matching result for the next search. We also provide other commonly used retrieval methods (e.g., index retrieval), which help users quickly find characters and documents in our system for further research.

System Framework and Details
In this section, we mainly focus on introducing the overall framework and details of our system, especially introducing how to construct OBS databases and build the character recognition module. The overall framework of IsOBS including all databases and modules is shown in Figure 6.

OBS Databases
Our databases are constructed from two well-known collections. One is the collection of OBS rubbings and standardized characters compiled by experts at the Chinese Academy of Social Sciences (CASS) (Moruo and Houxuan, 1982), and the other is the collection of variant written forms (glyphs) of OBS characters with their corresponding modern Chinese characters (Zhao et al., 2009). For the standardized OBS document collection, our databases now contain more than 5,000 items, each including images of OBS rubbings, the corresponding standardized OBS characters, and their modern Chinese characters. Previous database platforms have not been able to cut out individual characters, making it difficult to support automatic operations. In contrast, our platform provides finer-grained oracle data in a sequential form, which makes it easier for various electronic systems to operate on the data.
For the hand-written OBS character collection, we obtain 22,161 oracle character examples in 2,342 classes, from which we create our dataset for training and testing our character recognition module.

Character Recognition Module
In the available OBS character data, each character class usually has just a handful of examples. Due to the scarcity of OBS data, we adopt a few-shot learning model as our classifier to capture patterns from small amounts of data. Specifically, we implement a prototypical network (Snell et al., 2017) for classification, which learns a non-linear mapping that embeds examples into a feature space where examples of the same class cluster around a single prototype representation, as shown in Figure 7.
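The architecture of the prototypical network is shown in Figure 8. For simplicity, we denote the prototypical network as $f_\phi : \mathbb{R}^D \to \mathbb{R}^M$, where $\phi$ denotes the parameters to be learned by training, and $D$ and $M$ stand for the dimension of the input data and the dimension of the embedded features, respectively.
For each class, the prototype $c_i$ is set as the average of the embeddings of the support set, so the prototype of class $i$ can be denoted as
$$c_i = \frac{1}{n_i} \sum_{j=1}^{n_i} f_\phi(x_{i,j}),$$
where $x_{i,j}$ is the $j$-th sample in the support set of class $i$ and $n_i$ is the number of samples in the support set of the class.
For each query $x$, we use $f_\phi$ to embed the query instance, then compute the distribution of $x$ over classes by a softmax over the negative Euclidean distances between $f_\phi(x)$ and the prototypes of each class; in other words,
$$p_\phi(y = i \mid x) = \frac{\exp\left(-d(f_\phi(x), c_i)\right)}{\sum_{i'} \exp\left(-d(f_\phi(x), c_{i'})\right)},$$
where $d(\cdot, \cdot)$ denotes the Euclidean distance.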
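The prototype averaging and the distance-based softmax described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the embeddings have already been produced by the network, uses plain lists instead of tensors, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def prototype(embeddings: List[Vector]) -> List[float]:
    """Mean of the (already embedded) support examples of one class."""
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(e[k] for e in embeddings) / n for k in range(dim)]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def class_distribution(
    query: Vector, prototypes: Dict[str, Vector]
) -> Dict[str, float]:
    """Softmax over negative distances from the query to each prototype."""
    logits = {c: -euclidean(query, p) for c, p in prototypes.items()}
    shift = max(logits.values())  # subtract the max for numerical stability
    exps = {c: math.exp(v - shift) for c, v in logits.items()}
    total = sum(exps.values())
    return {c: v / total for c, v in exps.items()}
```

A query is assigned high probability for the class whose prototype lies closest in the embedding space, so adding a new class only requires computing one more prototype from its support examples.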
Aside from the prototypical network, we apply other neural networks for comparison and finally select the most powerful one for our character recognition module. We adopt the relation network (Sung et al., 2018), which is also an effective model for few-shot learning, and the siamese network (Chopra et al., 2005), which is widely used for character classification.

Experiment and Evaluation
We evaluate different character recognition models on our self-created dataset. The results show that our implementation of the prototypical network achieves stable and competitive results. The datasets and source code are available at https://github.com/thunlp/isobs.

Dataset
Our newly created dataset is obtained from the collection of hand-written OBS characters mentioned in Section 3.1. The whole dataset contains 22,161 character images from 2,342 classes annotated by experts in OBS character research. Each class refers to a unique character and is available on our website. Each image in the dataset is 110 by 200. Since neither the training set nor the test set should be empty for any class, our experiment is conducted on a part of the dataset that contains 1,621 classes and 20,420 character images. Because certain classes lack enough data for few-shot training, we create three datasets, as shown in Table 1. Each dataset is partitioned into training examples and test examples at a ratio of 3:1 or 4:1.
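A per-class partition with the constraint that both sets stay non-empty can be sketched as follows. The exact procedure used to build the datasets is not specified here, so this Python is only one plausible reading: the function name, the `train_parts` parameter, and the rule of skipping classes with fewer than two images are assumptions.

```python
import random
from typing import Dict, List, Tuple


def split_per_class(
    images_by_class: Dict[str, List[str]],
    train_parts: int = 3,
    seed: int = 0,
) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Split each class into train/test at roughly train_parts:1.

    Classes with fewer than two images are skipped (an assumption),
    since they cannot contribute to both sets.
    """
    rng = random.Random(seed)
    train: Dict[str, List[str]] = {}
    test: Dict[str, List[str]] = {}
    for label, images in images_by_class.items():
        if len(images) < 2:
            continue
        shuffled = images[:]  # copy so the caller's list is not reordered
        rng.shuffle(shuffled)
        # At least one test image; the remainder goes to training.
        n_test = max(1, len(shuffled) // (train_parts + 1))
        test[label] = shuffled[:n_test]
        train[label] = shuffled[n_test:]
    return train, test
```

Setting `train_parts` to 3 or 4 corresponds to the two ratios mentioned above; for very small classes the realized ratio necessarily deviates from the target.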

Evaluation Metric
As mentioned above, we use the prototypical network to classify OBS characters. For training, we use a typical few-shot learning method to train the prototypical network. For evaluation, since we aim to assess the practicability of the model as an OBS character classifier, we score our model by the top-k accuracy of classification over all classes of the given dataset, rather than by the common few-shot learning evaluation. Considering that only the classes in oracle300 have ample data for few-shot training, we use the training set of oracle300 to train our model, and perform classification evaluation on oracle300, oracle600, and oracle1600, respectively.
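For concreteness, top-k accuracy over a whole dataset can be computed as in the following Python sketch. The input format (one dictionary of class scores per test example) and the function name are assumptions made for illustration.

```python
from typing import Dict, Sequence


def top_k_accuracy(
    scores: Sequence[Dict[str, float]],
    labels: Sequence[str],
    k: int,
) -> float:
    """Fraction of examples whose true label is among the k best-scored classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        return 0.0
    hits = 0
    for per_class, truth in zip(scores, labels):
        # Rank all candidate classes by score, best first.
        ranked = sorted(per_class, key=per_class.get, reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(labels)
```

Unlike the usual N-way few-shot evaluation, each example here is ranked against every class in the dataset, which is the harder setting a practical classifier faces.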

Neural Network Hyper-Parameters
For the few-shot learning models, we train 100 steps in each epoch. In each step, we randomly select 60 classes for training the prototypical network, while the number of selected classes for the relation network is 5. For each class, there are 5 randomly chosen support examples and 5 query examples. The learning rate is set to 0.001 at the beginning and is halved every 20 steps (for the prototypical network) or every 100,000 steps (for the relation network). For the siamese network, the learning rate is set to 0.0001 and the weight decay to 0.00001.

Overall Results
Table 2 shows the overall performance of the prototypical network on different datasets, and Table 3 shows the performance of different models on oracle600. From these two tables, we can find that:
(1) The prototypical network performs well on both oracle300 and oracle600, with top-10 accuracy above 90%.
(2) When generalized to oracle1600, which is larger and consists of classes that contain few examples, our model still reaches 54.4% accuracy, indicating that our model works in more general circumstances. As we train models only on oracle300, i.e., most characters in the test sets are not contained in the training set, this is a quite difficult scenario.
(3) The prototypical network notably outperforms the other models. Considering this, our character recognition module is finally based on the prototypical network.

Conclusion and Future Work
As research on OBS is important for both Chinese linguistic and historical research, we construct an information system for OBS and name the system IsOBS. IsOBS provides an open digitalized platform consisting of the OBS databases, the character recognition module, and the retrieval module. The experimental results further demonstrate that our character recognition module, based on few-shot learning models, achieves satisfactory performance on our self-created hand-written OBS character dataset.
In the future, we plan to explore the following directions: (1) to include more OBS document and character data from collection books into our existing databases, (2) to employ generative learning and adversarial algorithms to add more robustness to our model, and (3) to construct a language model for ancient languages. We believe that these three directions will be beneficial for ancient languages research and support further exploration of utilizing machine learning for understanding OBS.

Acknowledgments
This work is supported by the National Key Research and Development Program of China (No.