Exploiting Common Characters in Chinese and Japanese to Learn Cross-Lingual Word Embeddings via Matrix Factorization

Learning vector space representations of words (i.e., word embeddings) has recently attracted wide research interest, and has been extended to the cross-lingual scenario. Currently, most cross-lingual word embedding learning models are based on sentence alignment, which inevitably introduces noise. In this paper, we show that in Chinese and Japanese, the acquisition of semantic relations among words can benefit from the large number of common characters shared by both languages; inspired by this unique feature, we design a method named CJC that generates cross-lingual contexts of words. We combine CJC with GloVe, which is based on matrix factorization, and propose an integrated model named CJ-Glo. Taking two sentence-aligned models and CJ-BOC (which also exploits common characters but is based on CBOW) as baselines, we compare them with CJ-Glo on a series of NLP tasks including cross-lingual synonym comparison, word analogy, and sentence alignment. The results indicate that CJ-Glo achieves the best performance among these methods and is more stable in cross-lingual tasks; moreover, compared with CJ-BOC, CJ-Glo is less sensitive to the alteration of parameters.


Introduction
Word representation is critical to various NLP tasks, and the traditional one-hot representation, despite its simplicity, suffers from at least two defects: the vector dimensionality increases with vocabulary size, leading to the "curse of dimensionality"; more importantly, it fails to capture the semantic relations among words.
Due to the defects of one-hot representation, the majority of research interest has now switched to distributed word representation (also known as "word embedding"), which represents each word as a real-valued vector. Represented as vectors, the semantics of words are better reflected, as the relatedness of words can be quantified using vector arithmetic.
To efficiently train word embeddings, a range of models have been proposed, most of them targeting monolingual word embeddings. Though word embedding is often discussed in the monolingual scenario, cross-lingual embeddings can serve as a useful tool in several NLP tasks, including machine translation (Wu et al., 2016) and word sense disambiguation (Chen et al., 2014). This is because cross-lingual word embeddings map words from two languages into one vector space, thereby making it possible to measure the semantic relations among words from different languages. However, compared with the bulk of work studying monolingual word embeddings, cross-lingual word embedding is still at its initial stage, with no learning model being widely accepted.
In this paper, we present a method named CJC (Chinese-Japanese Common Character) that aims to extract cross-lingual contexts of words from a sentence-aligned Chinese-Japanese corpus. Given the large number of common characters shared by both languages and the rich semantic connections thereof, we exploit them to acquire potential word-level alignments. The acquired cross-lingual contexts can be flexibly integrated with various models; in this paper, CJC is mainly integrated with the matrix factorization model GloVe (Pennington et al., 2014), and the integrated model is thus called CJ-Glo.
To evaluate the performance of CJ-Glo, we take two sentence-aligned models, respectively based on CBOW (Mikolov et al., 2013a) and GloVe, and the CJ-BOC model (based on common characters + CBOW) as baselines, and compare the trained word embeddings of these methods on three typical NLP tasks: cross-lingual synonym comparison, word analogy, and sentence alignment. According to the experimental results, the word embeddings acquired by CJ-Glo have better quality than those of the other models; moreover, CJ-Glo performs more stably than its competitors and is less sensitive to parameter alteration.

Related work
Word embedding was initiated by Hinton (1986), and essentially encodes a word as a real-valued vector. With word embeddings, the intrinsic relatedness among words can be explicitly measured as the distances or angles between word pairs. This favorable feature soon led to the popularity of word embeddings in industry and academia over the past decades. Specifically, word embedding has found applications in machine translation (Wu et al., 2016; Lample et al., 2017), word sense disambiguation (Chen et al., 2014; Guo et al., 2014), information retrieval (Vulić and Moens, 2015), and so on.
To efficiently acquire high-quality word embeddings, vast research efforts have emerged. A representative framework for learning word embeddings is the Neural Network Language Model (NNLM) proposed by Bengio et al. (2003), which adopts back-propagation to train the word embeddings and the parameters of the model. Another typical approach is matrix factorization, whose basic idea is to approximate original matrices with low-rank matrices by leveraging statistical information. For example, GloVe (Pennington et al., 2014) explicitly factorizes the co-occurrence matrix, training only on non-zero elements instead of the entire sparse matrix.
Traditionally, word embedding was studied in a monolingual setting, and it was then naturally extended to the bilingual scenario. Compared with monolingual word embeddings, bilingual word embeddings reveal the internal relations among words of different languages; this capability makes bilingual word embeddings a powerful tool to assist machine translation, or even a substitute for the word mapping matrices and dictionaries of previous machine translation methods. A range of works have been proposed to learn bilingual word embeddings, such as (Mikolov et al., 2013b), which attempts to map separately trained word embeddings into one vector space to acquire bilingual word embeddings. BilBOWA (Gouws et al., 2015) is a model whose most notable merit is that the whole training process requires neither word alignment nor a dictionary. (Shi et al., 2015) is another work that utilizes matrix factorization in word embedding learning. Ruder et al. (2017) provide a detailed survey, which enumerates the input formats and basic principles of various bilingual word embedding learning methods.
When it comes to non-alphabet-based languages like Chinese and Japanese, an essential difference from alphabet-based languages is that each character in a word carries abundant information and makes sense by itself. In addition, an underlying correlation between Chinese and Japanese is the large portion of characters shared by both languages; with the help of these characters, Chu et al. (2014) extracted texts from the Chinese and Japanese versions of Wikipedia web pages, based on which they constructed a Chinese-Japanese parallel corpus. A natural conjecture about the common characters is the semantic similarity, or even equivalence, among them. In light of this, we proposed the CJ-BOC model in our previous work to learn Chinese-Japanese bilingual word embeddings, which outperforms sentence-alignment approaches in terms of embedding quality. To our knowledge, our previous work is the first attempt to learn Chinese-Japanese word embeddings using common Chinese characters.

Chinese-Japanese Common Character
Historically, Chinese characters spread to a group of countries in East Asia as a major carrier of Chinese culture, thereby influencing the writing systems of these countries. Traditional Chinese, Simplified Chinese, and Japanese Kanji are now in use, all developed from Traditional Chinese; given this common root, the three writing systems share a large portion of common characters: for a certain character in one of them, we can find its counterparts in the other two, with minor variation or even the same shape. Chu et al. (2012) proposed a Chinese character table comparing Traditional Chinese, Simplified Chinese, and Japanese Kanji. As summarized in Table 1, the glyphs of such common characters can be 1) the same in all three writing systems; 2) consistent in two of them; or 3) different in all three.
With regard to their semantics, Simplified and Traditional Chinese are only two written forms of the same language, and therefore common characters between them are semantically equivalent. For Japanese Kanji, most characters are semantically equivalent or relevant to their counterparts in Chinese.
In our previous work, we quantified such semantic relatedness from the view of information theory, using mutual information (MI) and conditional mutual information (CMI). By repeating those experiments for this paper, we acquired the results in Table 2. All these 5 characters have multiple meanings in both Chinese and Japanese, and their respective meanings differ to some extent between the two languages. Normally, CMI should be larger than MI, which indicates that in a translated sentence pair, if two words, one from each sentence, share a common character, they are likely to form a translation word pair. The results shown in Table 2 are no exception, providing the theoretical grounding for the model we propose in Section 4.

Context of Word and CJC Method
Before delving into the learning models, we should first clarify the concept of context. In natural language processing, a widely adopted semantic representation model is Bag-of-Words (Zhang et al., 2010). The fundamental assumption of this model is: within a given sentence or paragraph, the target word is prone to have the most intimate semantic relation with its closest context words. Formally, define a sentence S with l words as an ordered sequence S = ⟨w_0, w_1, ..., w_l⟩; the context function Ctx(·) of a target word w_i under a window of size c is then often formulated as:

Ctx(w_i, S) = {w_{i-c}, ..., w_{i-1}, w_{i+1}, ..., w_{i+c}}    (1)

In the cross-lingual scenario, besides the two monolingual corpora, a parallel corpus is often required by most models, aligned at either the word level (Guo et al., 2016) or the sentence level. Some recent works attempt to learn embeddings without a parallel corpus, such as (Artetxe et al., 2017). Now consider the bilingual context of a given target word in an aligned parallel corpus. Let ⟨S_zh, S_ja⟩ be a sentence pair; then define:

Ctx(w_zh,i, ⟨S_zh, S_ja⟩) = Ctx(w_zh,i, S_zh) ∪ Ctx(w_zh,i, S_ja)    (2)

As formulated above, the context of a target word is the union of its contexts in both sentences. In a word-aligned parallel corpus, let ⟨w_zh,i, w_ja,j⟩ be a pair of aligned words; then the cross-lingual context Ctx_w(w_zh,i, S_ja) is equal to Ctx_w(w_ja,j, S_ja), since contexts in both languages are taken into account in this definition. In a sentence-aligned parallel corpus, the cross-lingual context Ctx_s(w_zh,i, S_ja) is defined as the set of all words in the corresponding sentence. In real applications, sentence alignment data are usually easier to acquire. For example, Chu et al. (2014) proposed an approach to align a Chinese-Japanese cross-lingual wiki corpus using the common characters between both languages.
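As an illustration (not the paper's own code), the window context function Ctx(·) for a target word can be sketched as follows, assuming a symmetric window of size c:

```python
def ctx(sentence, i, c=2):
    """Return the context words of sentence[i] within a symmetric window of size c."""
    left = sentence[max(0, i - c):i]
    right = sentence[i + 1:i + 1 + c]
    return left + right

s = ["w0", "w1", "w2", "w3", "w4"]
print(ctx(s, 2, c=2))  # ['w0', 'w1', 'w3', 'w4']
```

The window is truncated at sentence boundaries, which matches the usual Bag-of-Words treatment of words near the start or end of a sentence.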
Table 1: Corresponding examples and percentages (%) of common characters in Simplified Chinese (SC), Traditional Chinese (TC), and Japanese Kanji (KJ).

According to the analysis in Section 3, given an aligned Chinese-Japanese sentence pair, word alignment can be performed upon word pairs that share common characters. Based on this conclusion, using common characters, we can now give a definition of context similar to that in a sentence-aligned corpus. Define a character matching function CC(·) that generates the set of words each of which has at least one common character with the target word w_zh,i:

CC(w_zh,i, S_ja) = {w_ja,j ∈ S_ja | w_ja,j shares at least one common character with w_zh,i}    (3)

The parallel context Ctx_c(w_zh,i, S_ja) can thus be acquired via common character matching:

Ctx_c(w_zh,i, S_ja) = CC(w_zh,i, S_ja)    (4)

Hence, when multiple words in the corresponding sentence have common characters with the target word, all of them are included in Ctx_c(w_zh,i, S_ja). However, such cases rarely occurred in our experiments.
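A minimal sketch of the character matching function CC(·), under the simplifying assumption that "common character" means an identical code point (the real method also maps variant glyphs across the three writing systems); the words and sentence below are toy examples:

```python
def cc(target, aligned_sentence):
    """Return words of the aligned sentence sharing >= 1 character with target."""
    chars = set(target)
    return [w for w in aligned_sentence if chars & set(w)]

# Toy example: the Chinese word "学生" shares "学" with "学校"
# and "生" with "生徒" in the Japanese sentence.
print(cc("学生", ["彼", "は", "学校", "の", "生徒", "です"]))  # ['学校', '生徒']
```

When several words match, all of them enter the CJC context, mirroring the behavior described above.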
We name this method CJC (Chinese-Japanese Common Character); it uses CC(·) to determine context. Different from our previous work, which exploited common characters to facilitate only CBOW, CJC is a generalized scheme that can be integrated with various models, including CBOW, Skip-Gram, GloVe, etc.

CBOW-like Models
CBOW is a model proposed by Mikolov et al. (2013a), whose optimization goal is to maximize a probabilistic language model. In the cross-lingual, especially Chinese-Japanese, scenario, the objective function for training w_zh,i is:

J(w_zh,i) = log P_zh,i,zh + µ log P_zh,i,ja,s + λ log P_zh,i,ja,c    (5)

where P_zh,i,zh, P_zh,i,ja,s, and P_zh,i,ja,c are the softmax probabilities of the target word w_zh,i given its monolingual context, its sentence-aligned cross-lingual context, and its CJC context, respectively. Both λ and µ are parameters of the model. If λ = 0, this reduces to a plain sentence-aligned CBOW model; otherwise it is a CJC+CBOW model. The CJ-BOC model in our previous work uses a similar approach, and serves as a baseline in our experiments.

GloVe
The GloVe model was originally proposed by Pennington et al. (2014). As the name implies, GloVe utilizes the global information of the corpus for vector training. GloVe and CBOW, though both commonly adopted learning models, differ greatly in their mathematical foundations, being respectively based on matrix factorization and neural networks. GloVe proceeds as follows: first, construct a word-word co-occurrence matrix M = (m_ij)_{n×n}, where n is the vocabulary size and m_ij is the number of occurrences of w_j in the context of w_i over all sentences S.
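The construction of the monolingual co-occurrence matrix can be sketched as follows (an illustrative simplification: raw counts with a symmetric window, whereas GloVe implementations typically also apply distance-based weighting within the window):

```python
from collections import defaultdict

def cooccurrence(sentences, c=2):
    """Build a sparse word-word co-occurrence table with window size c."""
    m = defaultdict(float)
    for s in sentences:
        for i, wi in enumerate(s):
            for j in range(max(0, i - c), min(len(s), i + 1 + c)):
                if j != i:
                    m[(wi, s[j])] += 1.0
    return m

m = cooccurrence([["a", "b", "a"]], c=1)
print(m[("a", "b")])  # 2.0: "b" is adjacent to each occurrence of "a"
```

Storing only the non-zero entries in a sparse table is what lets GloVe train on non-zero elements instead of the entire sparse matrix, as noted in Section 2.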
The learning problem of GloVe can then be transformed into the optimization of a function F(·) over the word embeddings x_i and the context (probe) word embeddings x̃_j; the objective function is defined below:

J = Σ_{i,j=1}^{n} f(m_ij) (x_i^T x̃_j + b_i + b̃_j − log m_ij)^2    (6)

In this function, b_i and b̃_j are biases, and f is a weighting function aiming to mitigate the impact of dataset size on the training results:

f(m) = (m / m_max)^α if m < m_max, and 1 otherwise    (7)

In GloVe, m_max is set to 100 and α to 3/4.

Cross-lingual GloVe and CJ-Glo
To fit GloVe into the cross-lingual scenario, one should first expand the word-word co-occurrence matrix. Suppose the two languages respectively contain n and t words; the new matrix then has size (n + t) × (n + t). If w_i and w_j belong to the same language, m_ij can be computed in exactly the same way as in GloVe; otherwise, suppose ⟨S_zh, S_ja⟩ is a pair of parallel sentences with w_i ∈ S_zh and w_j ∈ S_ja, and we have:

m_ij = Σ_{⟨S_zh, S_ja⟩} [ µ · Cnt(w_j, Ctx_s(w_i, S_ja)) + λ · Cnt(w_j, Ctx_c(w_i, S_ja)) ]    (8)

where Cnt(·) counts the frequency of w_j in a certain context of w_i, either the sentence-aligned context or the CJC context. Once the cross-lingual word-word co-occurrence matrix is obtained, the subsequent optimization unfolds similarly to the monolingual GloVe model, training with objective function (6) and weighting function (7).
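A hedged sketch of how the cross-lingual blocks of the expanded matrix are filled for one parallel sentence pair: sentence-aligned co-occurrences are weighted by µ and common-character (CJC) co-occurrences by λ. The character test and all data below are illustrative assumptions, not the paper's implementation:

```python
from collections import defaultdict

def cross_counts(s_zh, s_ja, mu=1.0, lam=0.5):
    """Accumulate cross-lingual co-occurrence counts for one sentence pair."""
    m = defaultdict(float)
    for wi in s_zh:
        # Toy CJC test: words sharing at least one identical character.
        cjc = {wj for wj in s_ja if set(wi) & set(wj)}
        for wj in s_ja:
            m[(wi, wj)] += mu       # sentence-aligned context (Ctx_s)
            if wj in cjc:
                m[(wi, wj)] += lam  # extra CJC context weight (Ctx_c)
    return m

m = cross_counts(["学生"], ["学校", "です"], mu=1.0, lam=0.5)
print(m[("学生", "学校")], m[("学生", "です")])  # 1.5 1.0
```

Summing such per-pair tables over the whole parallel corpus yields the cross-lingual entries of equation (8); setting lam to 0 recovers the plain sentence-aligned variant.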
Similar to the cross-lingual CBOW model, if the CJC learning rate λ = 0 in equation (8), this is a sentence-aligned cross-lingual GloVe model; otherwise, it is a CJC-enhanced model, which we call CJ-Glo. Figure 1 demonstrates the operational principle of CJ-Glo: the square in the figure is the cross-lingual word co-occurrence matrix, in which the green square is the Chinese monolingual co-occurrence sub-matrix and the orange square is the Japanese one. The blue sections are cross-lingual sub-matrices, whose elements are calculated using equation (8). When two parallel sentences each contain a word sharing common characters, each word is taken as a co-occurrence in the context of the other. Every point crossed by the dotted lines and dotted rectangles represents an element to increment when processing the sentence pair.

Evaluation Methods
To evaluate the quality of cross-lingual word embeddings obtained from various models, we conducted three groups of experiments: 1) the straightforward cross-lingual synonym comparison; 2) cross-lingual word analogy; 3) sentence alignment.

Cross-lingual synonym comparison.
In the monolingual scenario, the word embeddings of a pair of synonyms should have a high cosine similarity. This property also applies to cross-lingual word embeddings, i.e., the cosine similarity between a word embedding and that of its translated counterpart should also be high. In real applications, the correspondence between words in the source language and words in the target language can be one-to-one, one-to-many, or many-to-one. To effectively eliminate ambiguity, we picked 200 one-to-one corresponding word pairs ⟨w_zh, w_ja⟩ at random; for each word pair, we calculated the cosine similarity between w_zh and w_ja, denoted d, and computed the rank of d among the cosine similarities from w_zh to every Japanese word in the vocabulary V_ja:

rank = |{w ∈ V_ja | cos(w_zh, w) ≥ d}|    (9)

The rank is then converted to a relative rate among all words:

rate = 1 − (rank − 1) / (|V_ja| − 1)    (10)

We conducted the same operation for w_ja against all words in the vocabulary V_zh, then calculated the average rate over all 200 word pairs, acquiring the average rates of w_zh → w_ja and w_ja → w_zh respectively. Since ambiguity is eliminated in all these word pairs, a large rate is favored.
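The rank-to-rate computation for one word pair can be sketched as follows, assuming the conversion rate = 1 − (rank − 1)/(|V| − 1), so that rank 1 gives rate 1.0; the toy 2-d vectors are illustrative only:

```python
import math

def cos(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def rate(query, truth, vocab):
    """Rank the true translation among all target-language words, as a rate in [0, 1]."""
    sims = sorted((cos(query, v) for v in vocab.values()), reverse=True)
    rank = sims.index(cos(query, vocab[truth])) + 1
    return 1 - (rank - 1) / (len(vocab) - 1)

vocab = {"ja_good": [1.0, 0.1], "ja_bad": [-1.0, 0.0], "ja_mid": [0.3, 0.9]}
print(rate([0.9, 0.2], "ja_good", vocab))  # 1.0: the translation ranks first
```

Averaging this rate over the 200 test pairs, in both directions, gives the numbers reported for this task.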

Cross-lingual word analogy.
Word analogy is probably the most widely adopted task for evaluating the performance of word embeddings, because it depicts the connection between the trained vector space and word semantics. Both CBOW (Mikolov et al., 2013a) and GloVe (Pennington et al., 2014) used a dataset of 19,544 queries for evaluation.
More formally, the cross-lingual analogy task was undertaken as follows:
1. Input a quadruple of word embeddings ⟨w_1 : w_2 :: w_3 : w_4⟩, where each word can be either Chinese or Japanese;
2. Compute the target vector u = w_2 − w_1 + w_3, and acquire the rank and rate of u → w_4 as in the cross-lingual synonym comparison;
3. Based on the ratio of Chinese words to Japanese words in the quadruple ⟨w_1 : w_2 :: w_3 : w_4⟩, divide the word analogy task into 5 subtasks, whose ratios are (0 : 4), (1 : 3), (2 : 2), (3 : 1), and (4 : 0), with respective query counts of 420, 1680, 2520, 1680, and 420 in our experiment;
4. Calculate the average rate on every subtask.
As before, the average rate is expected to be as large as possible.
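The core analogy step can be sketched with toy 2-d embeddings (chosen purely for illustration): form u = w_2 − w_1 + w_3 and check which vocabulary word lies nearest to u by cosine similarity:

```python
import math

def cos(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(w1, w2, w3, vocab):
    """Return the vocabulary word closest to w2 - w1 + w3."""
    u = [b - a + c for a, b, c in zip(w1, w2, w3)]
    return max(vocab, key=lambda w: cos(u, vocab[w]))

vocab = {"w4": [1.0, 1.0], "other": [-1.0, 0.5]}
# w2 - w1 + w3 = [1, 1], whose nearest neighbour should be w4
print(analogy([0.0, 0.0], [1.0, 0.0], [0.0, 1.0], vocab))  # w4
```

In the cross-lingual setting, the vocabulary searched is the union of both languages, which is what makes the mixed-ratio subtasks meaningful.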
Sentence alignment.
The above experiments respectively evaluated the direct similarity and the cross-lingual features of word embeddings. We now consider a more complicated task: sentence alignment. The dataset from (Chu et al., 2014) includes, besides the training data, a manually checked test set of 198 sentence pairs. Using this dataset, we conducted the experiment as follows:
1. For a Chinese sentence S_zh,i, calculate its average word vector U_zh,i and the average vectors U_ja of all Japanese sentences S_ja, then compute the cosine similarity between U_zh,i and each U_ja;
2. Sort the cosine similarities from step 1, and acquire the rank of the average vector U_ja,i of S_ja,i (the parallel sentence of S_zh,i);
3. Transform the rank into a rate using formula (10), where the total number is 198;
4. Compute the average rate for S_zh → S_ja;
5. Follow the same steps to compute S_ja → S_zh.
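The steps above can be sketched as follows: each sentence is represented by the average of its word embeddings, and the true parallel sentence should rank first by cosine similarity. The 2-d vectors are toy data standing in for trained embeddings:

```python
import math

def avg(vectors):
    """Average a list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]

def cos(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

zh_sentence = [[1.0, 0.0], [0.8, 0.2]]          # word vectors of S_zh
ja_candidates = [[[0.9, 0.1]], [[-1.0, 0.0]]]   # two candidate sentences S_ja
u_zh = avg(zh_sentence)
sims = [cos(u_zh, avg(s)) for s in ja_candidates]
print(sims.index(max(sims)))  # 0: the true parallel sentence ranks first
```

Averaging word vectors is a deliberately simple sentence representation; it suffices here because the task only measures how well cross-lingual word embeddings transfer to a sentence-level signal.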
Compared with the previous experiments, which evaluate only the relations between individual word embeddings, sentence alignment is a comprehensive task using word embeddings, and is a critical indicator of the overall quality of the trained embeddings.

Dataset and Training Details
As mentioned previously, (Chu et al., 2014) generated a parallel corpus of Chinese-Japanese sentence pairs from Wikipedia; train.ja and train.zh in this dataset were used throughout our empirical study, both containing 126,811 lines of text. Concretely, every line in these two files is a complete sentence, parallel to its counterpart in the other file. As preprocessing, both files were segmented using MeCab 1 and Jieba 2 for Japanese and Chinese, respectively, and we ensured that segmentation was performed consistently on both languages. The parameters of the CJC learning rate λ and the sentence learning rate µ are shown in Table 3. Both SenGlo and CJ-Glo use an m_max of 100 and an α of 3/4. The thread count is 16 in the implementations of all four models, the output vector dimensionality is 100, and the training process is iterated 15 times. We set the parameters to these values because the models achieved their optimal performance under such settings in our evaluation. All models are implemented in C, and the code can be found on GitHub 3.

Figure 2: Cross-lingual word analogy experiment result. X-axis is the number ratio of Chinese words and Japanese words in the analogy query (w_1 : w_2 :: w_3 : w_4).

Results
The results of the cross-lingual synonym comparison are shown in Table 4, from which we can see that the integration of common characters leads to an obvious performance improvement for both CBOW-like and GloVe-like models compared with the sentence-aligned models, and CJ-Glo achieves the best result. Figure 2 summarizes the results of the cross-lingual word analogy task, whose X-axis represents the ratio of Chinese words to Japanese words. In the figure, the leftmost point represents the result of pure Japanese word analogy, and the rightmost the pure Chinese word analogy. We can see that all 4 models achieve fair performance on pure Chinese/Japanese word analogy. However, when it comes to cross-lingual word analogy, the CJ- models outperform the Sen- models, and GloVe-like models generally beat CBOW-like ones. Another noticeable fact is that CJ-Glo performs equally well under all 5 ratios, showing basically no difference between cross-lingual and monolingual word analogy.
We display the sentence alignment results in Table 5. Again, the CJ- models outperform the Sen- models, and GloVe-like models beat CBOW-like ones; CJ-Glo has the best performance.
According to the above experiments, we can see that, compared with typical sentence-aligned methods, common character enhanced models achieve better performance. Moreover, CJ-Glo performs better than CJ-BOC, and is more stable in cross-lingual tasks.

Model Analysis: CJC Learning Rate
The CJC learning rate here refers to the multiplying factor of the CJC context Ctx_c(·), which is λ in both CJ-BOC and CJ-Glo. It is worth discussing how the CJC learning rate affects the performance of our proposed models. To explore this issue, we conducted a simple experiment: fixing the other parameters as set in Section 5.2, we varied only the CJC learning rate and applied the acquired word embeddings to the synonym w_zh → w_ja task. The results are displayed in Figure 3, in which we find that as λ increases in CJ-BOC, the accuracy first rises and then declines, showing an obvious local optimum, while in CJ-Glo, the accuracy keeps improving as λ increases. Note that both parameters should be less than 1, because otherwise the impact of the cross-lingual context would dominate the learning process, obviously resulting in overfitting. CJ-Glo is thus more stable under changes of the CJC learning rate; this interesting difference between the two models is related to their underlying learning mechanisms.

Conclusion and Future Work
In this paper, we quantified the semantic connections among common characters shared by Chinese and Japanese, and utilized them as the theoretical grounding for our cross-lingual context extraction method, CJC. CJC makes use of the common characters of both languages to assist the acquisition of parallel contexts. The effectiveness of the CJC-enhanced matrix factorization model CJ-Glo was verified via a series of tasks including cross-lingual synonym comparison, word analogy, and sentence alignment. As the experimental results show, models like CBOW and GloVe achieved notable performance gains after being integrated with CJC. Furthermore, CJ-Glo performed the best among all evaluated state-of-the-art methods, and showed stability on cross-lingual tasks and insensitivity to training parameter changes.

Figure 3: Accuracy of the CJ-BOC and CJ-Glo models on cross-lingual synonym w_zh → w_ja with different CJC learning rates.
Below are several directions we may pursue in the future: 1) The idea of jointly training character and word embeddings (Chen et al., 2015) is applicable to Chinese-Japanese word embedding training. Meanwhile, we can also align common characters and train cross-lingual character embeddings to further improve the quality of the trained word embeddings. 2) A recent work (Lai et al., 2016) indicates that the performance of a model may vary across tasks. Therefore, we shall study the performance fluctuation of CJ-Glo on more tasks, including machine translation.