Evaluating Features for Identifying Japanese-Chinese Bilingual Synonymous Technical Terms from Patent Families

In the process of translating patent documents, a bilingual lexicon of technical terms is an indispensable knowledge source. It is therefore important to develop techniques for automatically acquiring technical term translation equivalent pairs from parallel patent documents. We take the approach of utilizing the phrase table of a state-of-the-art phrase-based statistical machine translation model. First, we collect candidates of synonymous translation equivalent pairs from parallel patent sentences. Then, we apply Support Vector Machines (SVMs) to the task of identifying bilingual synonymous technical terms. This paper focuses in particular on examining the effectiveness of each feature and identifies the minimum number of features that perform comparably to the optimal set of features. Finally, we achieve a precision of over 90% under the condition of a recall of at least 25%.


Introduction
For both high-quality machine translation and human translation, a large-scale, high-quality bilingual lexicon is the most important resource. Since manual compilation of a bilingual lexicon requires a great deal of time and manual labor, automatic bilingual lexicon compilation has been studied in the research area of knowledge acquisition from natural language text. Techniques invented so far include translation term pair acquisition based on statistical co-occurrence measures from parallel sentences (Matsumoto and Utsuro, 2000), compositional translation generation based on an existing bilingual lexicon for human use (Tonoike et al., 2006), translation term pair acquisition by collecting partially bilingual texts through a search engine (Huang et al., 2005), and translation term pair acquisition from comparable corpora (Fung and Yee, 1998; Aker et al., 2013; Kontonatsios et al., 2014; Rapp and Sharoff, 2014).
Among those efforts to acquire bilingual lexicons from text, Morishita et al. (2008) studied the acquisition of a Japanese-English technical term translation lexicon from phrase tables trained by a phrase-based SMT model on parallel sentences automatically extracted from parallel patent documents. Building on this, Liang et al. (2011a) studied the issue of identifying Japanese-English synonymous translation equivalent pairs in the task of acquiring Japanese-English technical term translation equivalent pairs. Based on the technique and results of Liang et al. (2011a), Long et al. (2014) then studied how to identify Japanese-Chinese synonymous translation equivalent pairs from Japanese-Chinese patent families.
In the task of identifying Japanese-Chinese synonymous translation equivalent pairs from Japanese-Chinese patent families (Figure 1) studied in Long et al. (2014), this paper modifies some of the features studied there and further focuses on examining the effectiveness of each feature. In particular, this paper identifies the minimum number of features that perform comparably to the optimal set of features, and finds that the most effective feature is the rate of intersection in translation by the phrase table. Based on the evaluation results, we finally achieve a precision of over 90% under the condition of a recall of at least 25%.

Japanese-Chinese Parallel Patent Documents
Japanese-Chinese parallel patent documents are collected from the Japanese patent documents [...]. From them, we extract 312,492 patent families, apply the method of Utiyama and Isahara (2007) to the text of those patent families [1], and align the Japanese and Chinese sentences. In this paper, we use the 3.6M parallel patent sentences with the highest sentence alignment scores [2].
[1] We used a Japanese-Chinese translation lexicon consisting of about 170,000 Chinese head words.
[2] The maximum score of the method of Utiyama and Isahara (2007) is set to be 1.0, while the lower bound of its score is about 0.152 with the 3.6M parallel patent sentences.

Phrase Table of an SMT Model
As the toolkit for the phrase-based SMT model, we use Moses (Koehn et al., 2007) and apply it to all of the 3.6M parallel patent sentences. Before applying Moses, Japanese sentences are segmented into sequences of morphemes by the Japanese morphological analyzer MeCab with the morpheme lexicon IPAdic. For Chinese sentences, we examine two types of segmentation, i.e., segmentation by characters [5] and segmentation by morphemes [6].
Figure 2: Developing a Reference Set of Bilingual Synonymous Technical Terms
As the result of applying Moses, we have a phrase table in the direction of Japanese to Chinese translation, and another one in the opposite direction of Chinese to Japanese translation. In the direction of Japanese to Chinese translation, when the Chinese side of the parallel sentences is segmented by morphemes, we obtain 108M translation pairs with 75M unique Japanese phrases, together with the Japanese to Chinese phrase translation probabilities $P(p_C \mid p_J)$ of translating a Japanese phrase $p_J$ into a Chinese phrase $p_C$. When Chinese sentences are segmented by characters, on the other hand, we obtain 274M translation pairs with 197M unique Japanese phrases. For each Japanese phrase, the multiple translation candidates in the phrase table are ranked in descending order of Japanese to Chinese phrase translation probabilities. Similarly, in the phrase table in the opposite direction of Chinese to Japanese translation, for each Chinese phrase, the multiple Japanese translation candidates are ranked in descending order of Chinese to Japanese phrase translation probabilities.
[5] A consecutive sequence of numbers, as well as a consecutive sequence of alphabetical characters, is segmented into a single token.
[6] Chinese sentences are segmented into sequences of morphemes by the Chinese morphological analyzer Stanford Word Segmenter (Tseng et al., 2005) trained with the Chinese Penn Treebank.
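The character-based segmentation rule for Chinese sentences (single characters, with runs of numbers and runs of alphabetical characters kept as one token) can be illustrated with a minimal sketch. It assumes that "numbers" and "alphabetical characters" mean ASCII digits and ASCII letters, which the paper does not state; the function name `segment_by_characters` is hypothetical.

```python
import re

# Assumption: digit runs and letter runs are ASCII only; how full-width
# digits or letters were handled is not stated in the paper.
_TOKEN = re.compile(r"[0-9]+|[A-Za-z]+|\S")


def segment_by_characters(sentence: str) -> list[str]:
    """Split a sentence into single characters, keeping each run of
    digits and each run of letters together as one token.
    Whitespace is dropped."""
    return _TOKEN.findall(sentence)
```

For instance, under these assumptions a string such as "100mm" would yield the two tokens "100" and "mm", while every Chinese character becomes a token of its own.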
Those two phrase tables are then referred to when identifying a bilingual technical term pair, given a parallel sentence pair $\langle S_J, S_C \rangle$ and either a Japanese technical term $t_J$ or a Chinese technical term $t_C$. In the direction of Japanese to Chinese, as shown in Figure 1 (a), given a parallel sentence pair $\langle S_J, S_C \rangle$ containing a Japanese technical term $t_J$, the Chinese translation candidates collected from the Japanese to Chinese phrase table are matched against the Chinese sentence $S_C$ of the parallel sentence pair. Among those found in $S_C$, the candidate $t_C$ with the largest translation probability $P(t_C \mid t_J)$ is selected, and the bilingual technical term pair $\langle t_J, t_C \rangle$ is identified. Similarly, in the opposite direction of Chinese to Japanese, given a parallel sentence pair $\langle S_J, S_C \rangle$ containing a Chinese technical term $t_C$, the Chinese to Japanese phrase table is referred to when identifying a bilingual technical term pair.
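The selection step in the Japanese to Chinese direction can be sketched as follows. This is only an illustration under stated assumptions: the phrase table is represented as a nested dictionary (not the Moses file format), and "matched against the Chinese sentence" is approximated by a plain substring test on the unsegmented sentence. The names `PhraseTable` and `identify_term_pair` are hypothetical.

```python
from typing import Optional

# Assumed in-memory form: source phrase -> {target phrase: P(target | source)}.
PhraseTable = dict[str, dict[str, float]]


def identify_term_pair(
    t_j: str, s_c: str, table_jc: PhraseTable
) -> Optional[tuple[str, str]]:
    """Return (t_J, t_C), where t_C is the translation candidate of t_J
    that occurs in the Chinese sentence S_C and has the largest
    P(t_C | t_J); return None when no candidate occurs in S_C."""
    candidates = table_jc.get(t_j, {})
    # Assumption: a candidate "matches" when it is a substring of S_C.
    found = {c: p for c, p in candidates.items() if c in s_c}
    if not found:
        return None
    best = max(found, key=found.get)
    return (t_j, best)
```

The opposite direction would use the Chinese to Japanese table with the roles of the two sentences swapped.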

Developing a Reference Set of Bilingual Synonymous Technical Terms
When developing a reference set of bilingual synonymous technical terms (the detailed procedure is given in Long et al. (2014)), as illustrated in Figure 2, we start from a seed bilingual term pair $s_{JC} = \langle s_J, s_C \rangle$, repeat the translation estimation procedure of the previous section in both the Japanese-Chinese and the Chinese-Japanese directions six times in total, and generate the set $CBP(s_J)$ of candidates of bilingual synonymous technical term pairs. Then, we manually divide the set $CBP(s_J)$ into $SBP(s_{JC})$, the pairs which are synonymous with $s_{JC}$, and the remaining $NSBP(s_{JC})$. As shown in Table 1, we collect 114 seeds, where the total number of bilingual technical terms included in $SBP(s_{JC})$ for all of the 114 seed bilingual technical term pairs is around 2,300 to 2,400, which amounts to around 21 per seed on average [7]. As shown in Figure 1 (b), the procedure of identifying the synonymous sets is applied to all of those bilingual term pairs.
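The repeated translation from a seed can be pictured with a rough sketch. The actual procedure is detailed in Long et al. (2014) and relies on parallel sentences; here it is simplified, as an assumption, to two lookup functions that return the translations of a term in each direction, applied alternately. The names `generate_candidates`, `translate_jc`, and `translate_cj` are hypothetical.

```python
from typing import Callable

Translate = Callable[[str], set[str]]


def generate_candidates(
    seed_j: str,
    translate_jc: Translate,
    translate_cj: Translate,
    steps: int = 6,
) -> set[tuple[str, str]]:
    """Alternate J->C and C->J translation `steps` times, starting from
    the Japanese seed term, and collect every (Japanese, Chinese) pair
    produced along the way."""
    pairs: set[tuple[str, str]] = set()
    frontier = {seed_j}
    from_japanese = True  # direction of the current step
    for _ in range(steps):
        next_frontier: set[str] = set()
        for term in frontier:
            translate = translate_jc if from_japanese else translate_cj
            for translation in translate(term):
                # Always store pairs as (Japanese term, Chinese term).
                pairs.add((term, translation) if from_japanese
                          else (translation, term))
                next_frontier.add(translation)
        frontier = next_frontier
        from_japanese = not from_japanese
    return pairs
```

In this simplified picture, the returned set plays the role of $CBP(s_J)$, which is then divided manually into synonymous and non-synonymous pairs.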

Identifying Bilingual Synonymous Technical Terms by Machine Learning
In this section, we apply Support Vector Machines (SVMs) (Vapnik, 1998) to the task of identifying bilingual synonymous technical terms. In this paper, we model this task as that of judging whether or not the input bilingual term pair $\langle t_J, t_C \rangle$ is synonymous with the seed bilingual technical term pair $s_{JC} = \langle s_J, s_C \rangle$.

The Procedure
First, let $CBP$ be the union of the sets $CBP(s_J)$ of candidates of bilingual synonymous technical term pairs for all of the 114 seed bilingual technical term pairs. In the training and testing of the classifier for identifying bilingual synonymous technical terms, we first divide the set of 114 seed bilingual technical term pairs into 10 subsets. Here, for each $i$-th subset ($i = 1, \ldots, 10$), we construct the union $CBP_i$ of the sets $CBP(s_J)$ of candidates of bilingual synonymous technical term pairs, where $CBP_1, \ldots, CBP_{10}$ are 10 disjoint subsets of $CBP$.
[7] We manually generate the reference set by discarding the bilingual pairs which are judged as not synonymous with the seed pair. The procedure of generating the whole reference sets took about 30 hours, i.e., about 3 seconds for judging a bilingual term pair on average.
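The division of the seeds into ten subsets, and the rotation over them used in this procedure (eight subsets for training, one for tuning, one for testing, repeated ten times), can be sketched as follows. How seeds are assigned to subsets and which subset is used for tuning in each round are not specified in the paper; the round-robin assignment and the "next subset" rule below are assumptions, and both function names are hypothetical.

```python
from typing import Iterator, Sequence


def split_seeds(seeds: Sequence[str], k: int = 10) -> list[list[str]]:
    """Divide the seed pairs into k disjoint subsets.
    Assumption: round-robin assignment."""
    return [list(seeds[i::k]) for i in range(k)]


def rotations(k: int = 10) -> Iterator[tuple[list[int], int, int]]:
    """Yield (training subset indices, tuning index, test index) for
    each of the k rounds; each subset is the test subset exactly once.
    Assumption: the tuning subset is the one following the test subset."""
    for test in range(k):
        tune = (test + 1) % k
        train = [j for j in range(k) if j not in (test, tune)]
        yield train, tune, test
```

Because the split is made over seeds rather than over individual candidate pairs, all candidates generated from one seed fall into the same subset.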
As a tool for learning SVMs, we use TinySVM (http://chasen.org/~taku/software/TinySVM/). As the kernel function, we use the polynomial (1st order) kernel. In testing an SVM classifier, we regard the distance from the separating hyperplane to each test instance as a confidence measure, and return as positive samples (i.e., synonymous with the seed) only those test instances whose confidence measure exceeds a certain lower bound. For training, we use 8 of the 10 subsets $CBP_1, \ldots, CBP_{10}$. We then tune the lower bound of the confidence measure with one of the remaining two subsets. With this subset, we also tune the TinySVM parameter for the trade-off between training error and margin. Finally, we test the trained classifier against the other of the remaining two subsets. We repeat this procedure of training / tuning / testing 10 times, and average the 10 results of test performance.

Features
Table 2 lists all the features used for training and testing of SVMs for identifying bilingual synonymous technical terms. Features are roughly divided into two types: those of the first type, $f_1, \ldots, f_6$, simply represent various characteristics of the input bilingual technical term pair $\langle t_J, t_C \rangle$, while those of the second type, $f_7, \ldots, f_{17}$, represent the relation between the input bilingual technical term pair $\langle t_J, t_C \rangle$ and the seed bilingual technical term pair $s_{JC} = \langle s_J, s_C \rangle$.
Among the features of the first type are the frequency ($f_1$), the ranks of terms with respect to the conditional translation probabilities ($f_2$ and $f_3$), the lengths of terms ($f_4$ and $f_5$), and the number of times the procedure of generating translations with the phrase tables is repeated until the input terms $t_J$ and $t_C$ are generated from the Japanese seed term $s_J$ ($f_6$).
Among the features of the second type are identity of monolingual terms ($f_7$ and $f_8$), edit distance of monolingual terms ($f_9$), character bigram similarity of monolingual terms ($f_{10}$), rate of identical morphemes (in Japanese, $f_{11}$) / characters (in Chinese, $f_{12}$), string subsumption and variants for Japanese ($f_{13}$), identical stem for Chinese ($f_{14}$), rate of intersection in translation by the phrase table ($f_{15}$), rate of intersection in translation by the phrase table for the substrings not common between the seed and a term ($f_{16}$), and translation by the phrase tables ($f_{17}$).
As we discuss in the next section, among all of those features, $f_{15}$ and $f_{16}$, which utilize the rate of intersection in translation by the phrase table, are the most effective. Of these, $f_{16}$ is added in this paper to the features studied in Long et al. (2014).
Table 2: Features for identifying bilingual synonymous technical terms
Features of the input bilingual technical term pair $\langle t_J, t_C \rangle$:
f1: frequency. [...]
f2: rank of the Chinese term. Given $t_J$, log of the rank of $t_C$ with respect to the descending order of the conditional translation probability $P(t_C \mid t_J)$.
f3: rank of the Japanese term. Given $t_C$, log of the rank of $t_J$ with respect to the descending order of the conditional translation probability $P(t_J \mid t_C)$.
f4: number of Japanese characters. Number of characters in $t_J$.
f5: number of Chinese characters. Number of characters in $t_C$.
f6: number of times generating translation by applying the phrase tables. The number of times the procedure of generating translation by applying the phrase tables is repeated until $t_C$ or $t_J$ is generated from $s_J$, as in $s_C \rightarrow \cdots \rightarrow t_J \rightarrow t_C$, or $s_J \rightarrow \cdots \rightarrow t_C \rightarrow t_J$.
Features for the relation of the bilingual technical term pair $\langle t_J, t_C \rangle$ and the seed $\langle s_J, s_C \rangle$:
f7: identity of Japanese terms. Returns 1 when $t_J = s_J$.
f8: identity of Chinese terms. Returns 1 when $t_C = s_C$.
f9: edit distance similarity of monolingual terms. $f_9(t_X, s_X) = 1 - \frac{ED(t_X, s_X)}{\max(|t_X|, |s_X|)}$ (where $ED$ is the edit distance of $t_X$ and $s_X$, and $|t|$ denotes the number of characters of $t$).
f10: character bigram similarity of monolingual terms. [...] (where $bigram(t)$ is the set of character bigrams of the term $t$).
f11: rate of identical morphemes (for Japanese terms). [...] (where [...]$(t)$ is the set of morphemes in the Japanese term $t$).
f12: rate of identical characters (for Chinese terms). [...] (where [...]$(t)$ is the set of characters in the Chinese term $t$).
f13: subsumption relation of strings / variants relation of surface forms (for Japanese terms). Returns 1 when $t_J$ and $s_J$ differ only in their suffixes, or only in whether or not they have the prolonged sound "ー", or only in their hiragana parts.
f14: identical stem (for Chinese terms). Returns 1 when $t_C$ and $s_C$ differ only in whether or not they have the word " ", which is not the prefix or suffix.
f15: rate of intersection in translation by the phrase table. $f_{15}(t_X, s_X) = \frac{|trans(t_X) \cap trans(s_X)|}{\max(|trans(t_X)|, |trans(s_X)|)}$ (where $trans(t)$ is the set of translations of the term $t$ from the phrase table).
f16: rate of intersection in translation by the phrase table (for the substrings not common between $t_X$ and $s_X$). Suppose that $x_t^1, \ldots, x_t^m$ and $x_s^1, \ldots, x_s^n$ are the substrings which are not common between $t_X$ and $s_X$. We find $l$ ($= \min(m, n)$) pairs of one-to-one mappings between $x_t^i$ ($i = 1, \ldots, m$) and $x_s^j$ ($j = 1, \ldots, n$) which maximize the product of the rates $f_{15}(x_t^i, x_s^j)$ of intersection in translation by the phrase table, and return this product.
f17: translation by the phrase table. Returns 1 when $s_J$ can be generated by translating $t_C$ with the phrase table, or $s_C$ can be generated by translating $t_J$ with the phrase table.

Evaluating the Effectiveness of Features
Table 3 shows the evaluation results for a baseline as well as for SVMs. As the baseline, we simply judge the input bilingual term pair $\langle t_J, t_C \rangle$ as synonymous with the seed bilingual technical term pair $s_{JC} = \langle s_J, s_C \rangle$ when $t_J$ and $s_J$ are identical, or $t_C$ and $s_C$ are identical. When training / testing an SVM classifier, we tune the lower bound of the confidence measure (the distance from the separating hyperplane) in two ways: for maximizing precision and for maximizing F-measure. As shown in Table 3, when we use the sets of features which maximize precision, we achieve higher precisions of 89.0% and 90.4% for morpheme-based and character-based segmentation, respectively, than when we use all of the proposed features (86.5% and 89.0%), under the condition of an F-measure of at least 40% [10]. The sets of features which maximize precision are $f_{1 \sim 6} + f_{9 \sim 16}$ for morpheme-based segmentation and $f_{2,3} + f_{6 \sim 9} + f_{11,12,15,16}$ for character-based segmentation. However, the differences are not significant (5% significance level).
Next, we evaluate the effect of each single feature as well as of combinations of a small number of features. Among those results, Table 4 shows pairs of features each of which achieves a precision with no significant difference (5% significance level) from that of the set of features having the maximum precision. The results show that features $f_{15}$ and $f_{16}$, which utilize the rate of intersection in translation by the phrase table, are the most effective. Moreover, when we remove features $f_{15}$ and $f_{16}$ from the set of all features, precision drops significantly (5% significance level) to 78.5% and 79.4% for morpheme-based and character-based segmentation, respectively.
[10] Out of 655 (for morpheme-based segmentation) / 605 (for character-based segmentation) pairs which are correctly judged as synonymous with the seed pair by SVM, 197 (30.1%) / 161 (26.6%) are not judged as synonymous by the baseline method, i.e., neither the Japanese term nor the Chinese term is identical to that of the seed pair. On the other hand, out of 986 (for morpheme-based segmentation) / 927 (for character-based segmentation) pairs which are correctly judged as synonymous by the baseline method, 458 (46.5%) / 444 (47.9%) are judged as synonymous with the seed pair by SVM, while the rest are not judged as synonymous by SVM.
These features are more effective than the others because they directly measure the degree of synonymy within one language through the rate of intersection of translations into the other language, while the other features measure only character-based or morpheme-based similarity within one language. We further compare the performance of the proposed features with those studied in Tsunakawa and Tsujii (2008), where we modify the features of Tsunakawa and Tsujii (2008) as shown in Table 5, and then evaluate those modified features. Comparing the performance of the proposed features and the modified features of Tsunakawa and Tsujii (2008) in Table 3, it is clear that the proposed features outperform the modified features of Tsunakawa and Tsujii (2008).
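The contrast between a surface similarity feature and the translation-intersection features can be made concrete with a minimal sketch of $f_9$, $f_{15}$, and $f_{16}$ as defined in Table 2. Several points are assumptions rather than the authors' implementation: the edit distance is taken to be the standard Levenshtein distance, the values returned for empty inputs are chosen here, the substrings not common between the two terms are passed in already extracted (the paper does not say how they are found), and `trans` is a hypothetical lookup function returning the set of phrase-table translations of a string.

```python
from itertools import permutations
from math import prod
from typing import Callable, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def f9(t: str, s: str) -> float:
    """1 - ED(t, s) / max(|t|, |s|). Assumption: 1.0 for two empty strings."""
    longest = max(len(t), len(s))
    return 1.0 if longest == 0 else 1.0 - edit_distance(t, s) / longest


def f15(trans_t: set[str], trans_s: set[str]) -> float:
    """|trans(t) & trans(s)| / max(|trans(t)|, |trans(s)|).
    Assumption: 0.0 when neither term has any translation."""
    denom = max(len(trans_t), len(trans_s))
    return 0.0 if denom == 0 else len(trans_t & trans_s) / denom


def f16(parts_t: Sequence[str], parts_s: Sequence[str],
        trans: Callable[[str], set[str]]) -> float:
    """Maximum, over one-to-one mappings of min(m, n) non-common
    substrings, of the product of their f15 values.
    Assumption: 1.0 (the empty product) when one side has no
    non-common substring."""
    if len(parts_t) <= len(parts_s):
        short, long_ = parts_t, parts_s
    else:
        short, long_ = parts_s, parts_t
    # Exhaustive search; fine for the handful of substrings in a term.
    return max(
        prod(f15(trans(a), trans(b)) for a, b in zip(short, chosen))
        for chosen in permutations(long_, len(short))
    )
```

Two terms with no characters in common receive $f_9 = 0$, yet can still receive $f_{15} = 1$ when their translation sets coincide, which is the situation in which the intersection features add information.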
Next, Table 6 shows examples of improvement by SVM compared with the baseline. As shown in Table 6 (a), the relation between the input bilingual term pair and the seed bilingual term pair is correctly judged as "synonym" by SVM, while the judgement by the baseline is "not synonym" since neither the Chinese terms nor the Japanese terms are identical. Among our proposed features, $f_{17}$ contributes to the correct judgement: it returns 1 because the translation pairs " "," " and " "," " exist in the phrase table. In the example shown in Table 6 (b), on the other hand, SVM correctly judges the pair as "not synonym", unlike the baseline, where both the edit distance similarity ($f_9$) and the character bigram similarity ($f_{10}$) between the Japanese terms " " and " " are 0 ($f_9(\langle t_J, t_C \rangle, \langle s_J, s_C \rangle) = 0$ and $f_{10}(\langle t_J, t_C \rangle, \langle s_J, s_C \rangle) = 0$).
Table 6: Examples of Improvement in Identifying Bilingual Synonymous Technical Terms by SVM. Baseline: judge the input bilingual term pair $\langle t_J, t_C \rangle$ as synonymous with the seed bilingual term pair $\langle s_J, s_C \rangle$ when $t_J$ and $s_J$ are identical, or $t_C$ and $s_C$ are identical. SVM: maximize precision by tuning the lower bound of the confidence measure of the distance from the separating hyperplane (Chinese sentences are segmented by morphemes). (a) Correct Judgement as "Synonym" only by SVM. (b) Correct Judgement as "Not Synonym" only by SVM.
Finally, Table 7 shows examples of erroneous judgements by SVM. As shown in Table 7 (a), since the erroneous translation pairs " "," " and " "," " exist in the phrase table, both $f_{17}$ (both of the translation pairs $\langle s_J, t_C \rangle$ and $\langle t_J, s_C \rangle$ exist in the phrase table) and $f_{17}$ (either the translation pair $\langle s_J, t_C \rangle$ or $\langle t_J, s_C \rangle$ exists in the phrase table) return 1, resulting in an erroneous judgement.
Another example is shown in Table 7 (b), where the proposed method returns an erroneous judgement of "not synonym". In this case, since only the translation pair " "," " exists in the phrase table, $f_{17}$ (either the translation pair $\langle s_J, t_C \rangle$ or $\langle t_J, s_C \rangle$ exists in the phrase table) returns 1, while $f_{17}$ (both of the translation pairs $\langle s_J, t_C \rangle$ and $\langle t_J, s_C \rangle$ exist in the phrase table) returns 0. Furthermore, even though the Chinese words " " and " " are synonymous, their character bigram similarity is computed as 0, since their characters appear in opposite orders.

Related Work
Among related work on acquiring bilingual lexicons from text, Lu and Tsou (2009) and Yasuda and Sumita (2013) studied the extraction of bilingual terms from comparable patents: they first extract parallel sentences from comparable patents, and then extract bilingual terms from the parallel sentences. Those studies differ from this paper in that they did not address the issue of acquiring bilingual synonymous technical terms. Tsunakawa and Tsujii (2008) is most closely related to our study, in that they also proposed applying a machine learning technique to the task of identifying bilingual synonymous technical terms. However, Tsunakawa and Tsujii (2008) studied the issue of identifying bilingual synonymous technical terms only within a manually compiled bilingual technical term lexicon, and their approach is thus quite limited in its applicability. Our approach, on the other hand, has the advantage that we start from parallel patent documents, which continue to be published every year, and that we can generate candidates of bilingual synonymous technical terms automatically. Furthermore, as we show in the previous section, the features proposed in this paper outperform those of Tsunakawa and Tsujii (2008).

Conclusion
In the task of acquiring Japanese-Chinese technical term translation equivalent pairs from parallel patent documents, this paper studied the issue of identifying synonymous translation equivalent pairs. In particular, this paper focused on examining the effectiveness of each feature and identified the minimum number of features that perform comparably to the optimal set of features. One of the most important directions for future work is to improve recall. To do this, we plan to apply the semi-automatic framework (Liang et al., 2011b), which was invented for the task of identifying Japanese-English synonymous translation equivalent pairs and has been proven to be effective in improving recall. Another important direction for future work is to train the SVM for identifying bilingual synonymous technical term pairs with one set of patent families, and then to evaluate the trained SVM against parallel patent sentences and phrase tables extracted from another set of patent families.
Table 7: Examples of Errors in Identifying Bilingual Synonymous Technical Terms by the Proposed Method. (a) Incorrect Judgement as "Synonym" by SVM. (b) Incorrect Judgement as "Not Synonym" by SVM.