Eigencharacter: An Embedding of Chinese Character Orthography

Chinese characters are unique in their logographic nature, which inherently encodes world knowledge accumulated over thousands of years of evolution. This paper proposes an embedding approach, the eigencharacter (EC) space, which helps NLP applications easily access the knowledge encoded in Chinese orthography. These EC representations are automatically extracted, encode both structural and radical information, and integrate easily with other computational models. We built EC representations of 5,000 Chinese characters, investigated the orthographic knowledge encoded in ECs, and demonstrated how these ECs identify visually similar characters using both structural and radical information.


Introduction
Chinese is unique in its logographic writing system. The Chinese script consists of a sequence of characters, each carrying rich linguistic information on its own. Chinese characters are not only media for pronunciation and lexical meaning; they also carry abundant information in their visual patterns.
Chinese orthography has been closely investigated in the literature, from the structural analysis of ShuoWenJieZi in the Han dynasty to contemporary sociolinguistic perspectives (Tsou, 1981). Recent behavioral studies even argued that, given the salience of the Chinese writing system, orthographic components are activated first during reading, followed by phonological and semantic activation (Perfetti et al., 2005). However, previous orthographic approaches emphasized radicals, components, or their respective positions in whole characters, and how Chinese readers recognize characters in a processing theory. This paper presents a computational approach, eigencharacter representations, to describe Chinese characters in a vector space. The resulting representations encode lexical knowledge embedded in Chinese characters, thereby providing unique insights into the integration between computational models and linguistics.

Previous Works
Chinese characters are visual patterns occupying a square space. Depending on the number of strokes, the visual pattern may be simple, such as a single-stroke character (⼀, yī, "one"), or complex, such as a 16-stroke character (⿔, guī, "turtle"). Psychophysics studies showed that Chinese characters carry more information at high spatial frequencies than alphabetic scripts do (Wang and Legge, 2018). Although some Chinese characters are unique characters, in which no further components can be distinguished, identifying the radicals and components of a character is the most common way to analyze Chinese orthography.

Components decomposition
The Chinese classic text ShuoWenJieZi identified 540 radicals in Chinese characters, 214 of which are derived and used in modern Chinese. The radical often carries the semantic meaning of a character, and the rest of the character forms a component that may hint at the character's pronunciation. For example, 燃, rán, "burning" has the radical ⽕, huǒ, "fire" on its left side, which has an apparent semantic connection to the whole character. The right side of the character, 然, rán, "then", provides a phonological cue, which in this example is identical to the pronunciation of the whole character. This decomposition strategy is especially useful in pedagogical contexts and behavioral experiments, since it separates meaning from pronunciation, so that they can be taught or manipulated separately.
Not all Chinese characters are amenable to this decomposition strategy. Some characters have unique structures that cannot be easily separated into a radical and a component. For instance, 東, dōng, "east" has the radical ⽊, mù, "wood", but the rest of the character, ⽈, yuē, "say", is tightly embedded in the character and bears no phonological relation to it. Some characters cannot be decomposed at all: 我, wǒ, "I, me", for example, leaves no meaningful component after its radical (⼽, gē, "weaponry") is removed.
Some studies have tried to decompose characters into finer components, through which characters can be divided recursively into a component hierarchy (Chuang and Hsieh, 2005). For example, instead of decomposing 燃 into its semantic radical (⽕) and phonological component (然), the phonological component can be further divided into three components: ⼣, xì, "dusk", ⽝, quǎn, "dog", and ⺣, huǒ, "fire". This approach provides a complete description of characters, but it is not without caveats. Specifically, it is not easy to find visually similar characters with a component hierarchy (e.g. 已 and ⼰ are visually similar but share no common components), and the definition of a component is not always clear (e.g. ⿓, lóng, "dragon" could have one, two, or three components, depending on the definition).
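As a minimal sketch, the recursive decomposition described above can be modeled as a nested tree; the tuple representation and the `leaves` helper below are our own illustrative choices, not the encoding used by Chuang and Hsieh (2005):

```python
# Illustrative sketch: a component hierarchy as nested (component, children) tuples.
# The decomposition of 燃 follows the example in the text; the data structure
# itself is an assumption made for illustration.
ran = ("燃", [
    ("⽕", []),                                   # semantic radical "fire"
    ("然", [("⼣", []), ("⽝", []), ("⺣", [])]),  # phonological component, subdivided
])

def leaves(node):
    """Collect the finest-grained components of a hierarchy."""
    component, children = node
    if not children:
        return [component]
    out = []
    for child in children:
        out.extend(leaves(child))
    return out

print(leaves(ran))  # ['⽕', '⼣', '⽝', '⺣']
```

Note how 已 and ⼰ would both decompose into single, distinct leaves under such a scheme, so no shared component ever links them despite their visual similarity.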

Eigendecomposition of Visual Stimuli
Although decomposing characters into components is advantageous in pedagogical contexts and behavioral experiments, the discrete nature of components prevents a simple coding scheme for Chinese orthography. Specifically, there are 214 radicals in modern Chinese, so encoding radicals and other components would require hundreds of dimensions in a vector. Moreover, were the positions of each radical/component considered, the number of dimensions needed to encode a single character would increase exponentially. An alternative approach to constructing a computational representation of Chinese characters leverages the fact that the script is written in square blocks: each character can be considered an information-laden visual pattern. The computational task is to extract common components among these patterns (characters) and choose the fewest possible components that best represent a given set of characters. This idea is closely related to eigenface decomposition in face recognition and face processing studies (Sirovich and Kirby, 1987). Chinese characters and faces are two distant but strikingly similar concepts, both in computational tasks and in cognitive neuroscience. Faces and characters have been shown to share similar processing mechanisms and even closely related neural mechanisms (Farah et al., 1995; Zhang et al., 2018). In addition, face recognition and (handwritten) character recognition have both been attempted in low-dimensional spaces (Sirovich and Kirby, 1987; Long et al., 2011). The low-dimensional face space (eigenface) was later applied in cognitive science, where a face space was constructed and used to explain phenomena concerning face recognition (O'Toole et al., 1994).
In this paper, inspired by the concept of eigenfaces, we construct an eigencharacter space to represent Chinese characters and investigate the orthographic information implicit in eigencharacters.
Constructing eigencharacters provides unique advantages in computational modeling. These representations are valuable in that they are (1) clearly and automatically defined given a set of characters; (2) helpful for finding similar characters even when they share no common components; (3) insightful when considering the structure and essential components of Chinese orthography; and (4) easily manipulable and conveniently incorporated into recent computational models (e.g. neural network models), since they are inherently vectors.

Eigencharacter
We constructed the eigencharacter space from the 5,000 most frequently used characters, the estimated vocabulary size of an average college student in Taiwan (Hue, 2003). The mean stroke count of these characters was 12.24, with a standard deviation of 4.48. The character with the fewest strokes (1 stroke) was ⼀, yī, "one"; the one with the most strokes was 籲, yù, "call, implore".
Each character was first drawn in white ink on a binary bitmap with a black background, using the Microsoft JhengHei font at size 64. Each bitmap was then reshaped into a column vector of length 4800 (i.e. 64 × 75). The resulting character matrix therefore has dimensions 4800 × 5000.
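A minimal sketch of building the character matrix is shown below. The paper rendered real glyphs with a font; since font rendering is outside the scope of this sketch, `render_character` stands in with random binary bitmaps, and only three placeholder characters are used. The dimensions follow the paper (64 × 75 = 4800 pixels per character):

```python
import numpy as np

H, W = 64, 75    # bitmap size from the paper: 64 x 75 = 4800 pixels
rng = np.random.default_rng(0)

def render_character(char):
    """Stand-in for font rendering: white ink (1) on black background (0).
    The actual work drew each glyph with Microsoft JhengHei; a random
    binary bitmap is used here only so the sketch is runnable."""
    return (rng.random((H, W)) > 0.5).astype(np.float64)

# Each flattened bitmap becomes one column of the character matrix M.
chars = ["一", "燃", "東"]   # a tiny sample; the paper uses 5,000 characters
M = np.column_stack([render_character(c).ravel() for c in chars])
print(M.shape)  # (4800, 3)
```

With the full character set, `M` would be the 4800 × 5000 matrix described above.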
The character matrix was then decomposed with singular value decomposition, M = UΣVᵀ, where M is the original character matrix, U and V are orthonormal matrices, and Σ is a diagonal matrix of singular values. To determine the number of singular vectors, i.e. the number of eigencharacters (ECs), needed to best represent the character matrix, we first examined the scree plot of the singular values normalized by the Frobenius norm of M (Figure 1).
Figure 1 shows that the proportion of variance explained drops quickly after 50 ECs. To verify this observation, we reconstructed the characters with the first 10, 50, and 100 ECs (i.e. the first 10, 50, and 100 columns of U). The results are shown in Figure 2. The reconstruction with the first 10 ECs recovered only limited patterns of each character; interestingly, the recovered patterns were mostly vertical or horizontal stripes. With 50 ECs, the resulting patterns started to be recognizable, and with 100 ECs they were clearly identifiable. Based on these results, we chose the first 50 ECs to construct the eigencharacter space.
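The decomposition and rank-k reconstruction above can be sketched with numpy. A random stand-in matrix replaces the real 4800 × 5000 character matrix so the sketch is self-contained; the variance proportions correspond to the normalized scree values, and the reconstruction mirrors the 10/50/100-EC comparison:

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in character matrix (pixels x characters); the paper's M is 4800 x 5000.
M = rng.random((4800, 200))

U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Proportion of variance captured by each EC, normalized by the squared
# Frobenius norm of M (the sum of squared singular values).
prop = s**2 / np.sum(s**2)

def reconstruct(k):
    """Rank-k reconstruction using the first k ECs (columns of U)."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Relative reconstruction error shrinks as more ECs are used.
for k in (10, 50, 100):
    err = np.linalg.norm(M - reconstruct(k)) / np.linalg.norm(M)
    print(k, round(err, 4))
```

Plotting `prop` against the EC index reproduces the scree plot used to pick the cutoff.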

Experiments
The constructed EC space serves multiple purposes. Among its potential advantages in capturing the orthographic knowledge naturally inherent in the Chinese writing system, we demonstrate how ECs reveal structural and component information in Chinese orthography, and how they are particularly effective at finding visually similar characters.

Rendering Eigencharacters
ECs are abstract mathematical constructs extracted through singular value decomposition, so they might not be directly interpretable. However, since these ECs are essentially the bases that best represent the 5,000 characters, their actual patterns may offer interesting insights into Chinese orthography.
We rendered the 50 ECs extracted in the previous section as if they were normal characters. The renderings are shown in Figure 3, with ECs ordered by descending singular value.
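One plausible way to render an EC, sketched below, is to reshape its column of U back into the bitmap shape and rescale it to grayscale; the exact normalization used in the paper is not specified, so the min-max scaling here is an assumption. A random stand-in matrix again replaces the real character matrix:

```python
import numpy as np

H, W = 64, 75
rng = np.random.default_rng(2)
M = rng.random((H * W, 100))            # stand-in character matrix
U, s, Vt = np.linalg.svd(M, full_matrices=False)

def render_ec(i):
    """Reshape the i-th EC (column of U) into a bitmap and scale it to
    0-255 so it can be displayed or saved as a grayscale image."""
    ec = U[:, i].reshape(H, W)
    ec = (ec - ec.min()) / (ec.max() - ec.min())   # min-max normalize to [0, 1]
    return (ec * 255).astype(np.uint8)

img = render_ec(0)
print(img.shape)  # (64, 75)
```

Saving each `render_ec(i)` for i = 0..49 yields a grid of images analogous to Figure 3.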
The renderings show interesting patterns. By visual inspection, we can observe that (1) the first few ECs encode "low spatial frequency" information, such as the general character block in EC0, vertical stripes in EC1 and EC2, and horizontal stripes in EC4 and EC5; and (2) the ECs do not correspond directly to radicals, but some important radicals can nevertheless be identified, such as the ⺡ radical, "water", in EC14, the ⾔ radical, "words", in EC15, and the ⼥ radical, "female", in EC31.
In addition to visual inspection, we can also understand ECs through the characters with the highest or lowest coefficients on each EC. From these positively or negatively loaded characters, we can infer the information each EC encodes in character space. For example, the 3 most highly loaded characters on EC0 are 轟, 竇, and 鷹, and the 3 most negatively loaded are ⼀, ⼘, and ⼆; aided by the EC rendering, we can infer that EC0 is a component of "stroke complexity". Likewise, the 3 most highly loaded characters on EC1 are 圄, 鬩, and 閘, and the 3 most negatively loaded are 值, 椿, and 捧, so EC1 is inferred to be a component of "enclosing structure". These structural components echo behavioral studies showing that Chinese readers use structural information to judge character similarity (Yeh and Li, 2002). Aside from components of structural representation, there are also components of radicals. For instance, EC31, which shows a ⼥ radical in its rendering, is a component of the ⼥ radical: it has the highest loadings on characters with the ⼥ radical, such as 媒, 娩, and 妮.
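The inspection above can be sketched as follows: projecting the character matrix onto the first K ECs gives each character a coefficient on each EC, and sorting those coefficients yields the most positively and negatively loaded characters. Placeholder labels and a random stand-in matrix are used, and the `extreme_characters` helper is our own illustrative name:

```python
import numpy as np

rng = np.random.default_rng(3)
n_chars = 50
chars = [f"char_{i}" for i in range(n_chars)]   # placeholder character labels
M = rng.random((4800, n_chars))                 # stand-in character matrix

U, s, Vt = np.linalg.svd(M, full_matrices=False)
K = 50
coeffs = U[:, :K].T @ M   # (K x n_chars): coefficient of each character on each EC

def extreme_characters(ec, top=3):
    """Characters with the highest and lowest coefficients on a given EC."""
    order = np.argsort(coeffs[ec])                 # ascending by coefficient
    lowest = [chars[j] for j in order[:top]]
    highest = [chars[j] for j in order[-top:][::-1]]
    return highest, lowest

hi, lo = extreme_characters(0)
print(hi, lo)
```

With the real character matrix, `extreme_characters(0)` would return character lists like (轟, 竇, 鷹) and (⼀, ⼘, ⼆) for EC0.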
The renderings of ECs and the inspection of their loaded characters suggest that ECs are not merely abstract mathematical constructs. Instead, they automatically encode and reflect the structural and radical aspects of Chinese orthography.

Finding Similar Characters
Eigencharacters encode structural and radical information, which makes them ideal for finding visually similar characters that would be impossible to identify with the component-decomposition approach. Table 1 shows examples of similar characters identified with eigencharacters, where character similarity is defined as the Euclidean distance between two characters in EC space. In the first row of Table 1, the EC space found similar characters with an identical radical (⺡) and component (胡), and, remarkably, it considered the three-part vertical structure simultaneously. In the second row, the similar characters of 語 highlight another property of the EC space: it is not restricted to exact components but extends to visually similar ones; 諮, 晤, and 誤 either share the same radical/component or have a similar right-hand-side component. The last row also shows the advantage of the EC space in finding visually similar characters: 東 and 泉 both have unique structures, yet they share similar patterns (⽈ in the middle and two oblique strokes in the lower half) that would be difficult to accommodate were component-based decomposition used.
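The nearest-neighbor search just described can be sketched in a few lines: each character's coordinates in EC space are its coefficients on the first K ECs, and similar characters are those at the smallest Euclidean distance. Placeholder labels and a random stand-in matrix are used; `most_similar` is an illustrative helper name:

```python
import numpy as np

rng = np.random.default_rng(4)
n_chars = 50
chars = [f"char_{i}" for i in range(n_chars)]   # placeholder character labels
M = rng.random((4800, n_chars))                 # stand-in character matrix

U, s, Vt = np.linalg.svd(M, full_matrices=False)
K = 50
ec_coords = (U[:, :K].T @ M).T   # (n_chars x K): one row per character in EC space

def most_similar(query, top=3):
    """Nearest characters to `query` by Euclidean distance in EC space."""
    q = ec_coords[chars.index(query)]
    dists = np.linalg.norm(ec_coords - q, axis=1)
    order = np.argsort(dists)                       # ascending distance
    return [chars[j] for j in order if chars[j] != query][:top]

print(most_similar("char_0"))
```

Run on the real character matrix, a query such as 語 would return visually similar characters like 諮, 晤, and 誤.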
These illustrative examples show that the EC space, inherently equipped with structural and radical knowledge, provides an ideal representation for exploring Chinese orthography.

Conclusion
This paper introduces eigencharacters, an embedding representation of Chinese orthography.
It provides unique advantages over component-based character decomposition, in that it can be automatically extracted, encodes both structural and radical information, and integrates easily with other computational models. Equipped with EC representations, the human knowledge encoded in Chinese orthography becomes easily accessible to downstream NLP applications.