Passive and Pervasive Use of Bilingual Dictionary in Statistical Machine Translation

There are two primary approaches to the use of a bilingual dictionary in statistical machine translation: (i) the passive approach of appending a bilingual dictionary to the parallel training data and (ii) the pervasive approach of enforcing translations as per the dictionary entries when decoding. Previous studies have shown that both approaches provide external lexical knowledge to statistical machine translation, thus improving translation quality. We empirically investigate the effects of both approaches on the same dataset and provide further insights on how lexical information can be reinforced in statistical machine translation.


Introduction
Statistical Machine Translation (SMT) obtains the best translation, e_best, by maximizing the product of the translation model probability p(f|e), i.e. the conditional probability of the foreign sentence f given the candidate translation e, and the a priori probability of the translation, p_LM(e) (Brown, 1993). State-of-the-art SMT systems rely on (i) large bilingual corpora to train the translation model p(f|e) and (ii) monolingual corpora to build the language model p_LM(e).
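In noisy-channel notation, the decision rule described above can be written as follows. The Bayes-rule step and the use of f for the foreign input sentence follow the standard formulation; they are given here as a reading aid, not as the authors' own equation.

```latex
e_{\text{best}} = \arg\max_{e} \, p(e \mid f) = \arg\max_{e} \, p(f \mid e)\, p_{LM}(e)
```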
One approach to improving the translation model is to extend the parallel data with a bilingual dictionary prior to training the model. The primary motivation is to use the additional lexical information for domain adaptation, overcoming out-of-vocabulary words during decoding (Koehn and Schroeder, 2007; Meng et al. 2014; Wu et al. 2008). Alternatively, adding an in-domain lexicon to the parallel data has also been shown to improve SMT. The intuition is that adding extra counts of bilingual lexical entries improves word alignment accuracy, resulting in a better translation model (Skadins et al. 2013; Tan and Pal, 2014; Tan and Bond, 2014).
Another approach to using a bilingual dictionary is to hijack the decoding process and force word/phrase translations as per the dictionary entries. Previous research has used this approach to explore various improvements in industrial and academic translation experiments. For instance, Tezcan and Vandeghinste (2011) injected a bilingual dictionary into the SMT decoding process and integrated it with a Computer Assisted Translation (CAT) environment to translate documents in the technical domain. They showed that using a dictionary in decoding improves machine translation output and reduces the post-editing time of human translators. Carpuat (2009) experimented with translating sentences in discourse context, using discourse-specific dictionary annotations to resolve lexical ambiguities, and showed that this can potentially improve translation quality.
In this paper, we investigate the improvements made by both approaches to using a bilingual dictionary in SMT. We refer to the first approach of extending the parallel data with a dictionary as the passive use and to the latter approach of hijacking the decoding process as the pervasive use of a dictionary in statistical machine translation.
Unlike the usual use of a dictionary for domain adaptation, where a domain-specific lexicon is appended to a translation model trained on generic texts, we investigate the use of an in-domain dictionary in statistical machine translation.
More specifically, we seek to understand how much improvement can be made by skewing the lexical information towards the passive and pervasive use of the dictionary in statistical machine translation.

Passive vs Pervasive Use of Dictionary
We view both the passive and the pervasive use of a dictionary in statistical machine translation as types of lexically constrained statistical hybrid MT. In the passive use, the dictionary acts as a supplementary set of bi-lexical rules affecting word and phrase alignments and the resulting translation model. In the pervasive use, the dictionary constrains the decoding search space, enforcing translations as per the dictionary entries.
To examine the passive use of a dictionary, we explore the effects of adding the lexicon to the training data n times, increasing n until the performance of the machine translation degrades.
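The passive setup can be pictured with a minimal sketch, assuming the parallel corpus and the dictionary are both held as lists of (Japanese, English) string pairs. The function name `append_dictionary` and the toy sentence pair are hypothetical and only illustrate the idea of appending the lexicon n times before alignment and training.

```python
from typing import List, Tuple

Pair = Tuple[str, str]  # (Japanese side, English side), already tokenized


def append_dictionary(corpus: List[Pair], dictionary: List[Pair], n: int) -> List[Pair]:
    """Return the parallel corpus followed by n copies of the dictionary."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return list(corpus) + list(dictionary) * n


# Toy data for illustration only; the sentence pair is made up.
corpus = [("磁気 センサ を 開発 し た 。", "a magnetic sensor was developed .")]
dictionary = [("磁気 センサ", "magnetic sensor")]

for n in range(4):
    train = append_dictionary(corpus, dictionary, n)
    print(n, len(train))  # training size grows by len(dictionary) per copy
```

Each value of n would then yield a separate training set on which the alignment and translation model are retrained and evaluated.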
For the pervasive use of a dictionary, we assign a uniform translation probability to the possible translations of the source phrase. For instance, according to the dictionary, the English term "abnormal hemoglobin" could be translated to 異常ヘモグロビン or 異常血色素, so we assign a translation probability of 0.5 to each Japanese translation, i.e. p(異常ヘモグロビン | abnormal hemoglobin) = p(異常血色素 | abnormal hemoglobin) = 0.5. If there is only one translation for a term in the dictionary, we force that translation by assigning it a translation probability of 1.0.
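The uniform assignment can be sketched as below. The helper names are hypothetical, and the markup assumes the `translation` and `prob` attribute convention with `||` separators described in the Moses documentation for XML input; the exact tag name and decoder flags used in the experiments are not given in the text.

```python
from xml.sax.saxutils import escape, quoteattr


def uniform_probs(translations):
    """Give every candidate translation the same probability 1/k."""
    if not translations:
        raise ValueError("at least one translation is required")
    p = 1.0 / len(translations)
    return [p] * len(translations)


def moses_xml_span(source_phrase, translations, tag="np"):
    """Wrap a source phrase in XML markup carrying its dictionary translations.

    Assumed convention: alternatives and their probabilities are joined
    with '||' in the 'translation' and 'prob' attributes.
    """
    probs = uniform_probs(translations)
    trans_attr = quoteattr("||".join(translations))
    prob_attr = quoteattr("||".join(f"{p:g}" for p in probs))
    return f"<{tag} translation={trans_attr} prob={prob_attr}>{escape(source_phrase)}</{tag}>"


print(moses_xml_span("abnormal hemoglobin", ["異常ヘモグロビン", "異常血色素"]))
# <np translation="異常ヘモグロビン||異常血色素" prob="0.5||0.5">abnormal hemoglobin</np>
```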
One issue with the pervasive use of dictionary translations is the problem of compound phrases in the test sentence that are made up of component phrases in the dictionary. For instance, when decoding the sentence, "Here was developed a phase shift magnetic sensor system composed of two sets of coils , amplifiers , and phase shifts for sensing and output .", we fetch entries such as the following from the dictionary to translate the multiword term that contains "magnetic sensor":
• magnetic sensor = 磁気センサ
In such a situation, where the dictionary does not provide a translation for the complete multiword string, we set the preference for the dictionary entry with the longest length in the direction from left to right and select the "magnetic sensor" + "system" entries for forced translation.
Finally, we investigate the effects of using the bilingual dictionary both passively and pervasively by appending the dictionary before training and hijacking the decoding by forcing translations using the same dictionary.
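The left-to-right longest-match preference for compound phrases can be sketched as a greedy scan. This is an illustration under assumptions, not the authors' implementation: the function name, the `max_len` cap, and every dictionary value other than 磁気センサ are hypothetical placeholders.

```python
def longest_match_segments(tokens, dictionary, max_len=5):
    """Scan left to right; at each position take the longest dictionary entry.

    Returns (phrase, translations) pairs; translations is None for tokens
    that no dictionary entry covers.
    """
    segments = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shrink it.
        for j in range(min(len(tokens), i + max_len), i, -1):
            phrase = " ".join(tokens[i:j])
            if phrase in dictionary:
                segments.append((phrase, dictionary[phrase]))
                i = j
                matched = True
                break
        if not matched:
            segments.append((tokens[i], None))
            i += 1
    return segments


# Only the "magnetic sensor" entry comes from the text; other values are placeholders.
toy_dictionary = {
    "magnetic sensor": ["磁気センサ"],
    "sensor system": ["JA_sensor_system"],
    "sensor": ["JA_sensor"],
    "system": ["JA_system"],
}

print(longest_match_segments("magnetic sensor system".split(), toy_dictionary))
# [('magnetic sensor', ['磁気センサ']), ('system', ['JA_system'])]
```

Because the scan commits to the longest entry at the leftmost position, "magnetic sensor" wins over the overlapping "sensor system" entry, leaving "system" to be translated on its own.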

Experimental Setup
We experimented with the passive and pervasive uses of a dictionary in SMT using the Japanese-English dataset provided in the Workshop on Asian Translation (WAT) (Toshiaki et al. 2014). We used the Asian Scientific Paper Excerpt Corpus (ASPEC) as the training corpus. The ASPEC corpus consists of 3 million parallel sentences extracted from Japanese-English scientific abstracts from Japan's Largest Electronic Journal Platform for Academic Societies (J-STAGE). In our experiments, we follow the setup of the WAT shared task, with 1800 development sentences and 1800 test sentences from the ASPEC corpus.
We use the Japanese-English (JA-EN) translation dictionaries (JICST, 2004) from the Japan Science and Technology Corporation. They contain 800,000 entries for technical terms extracted from scientific and technological documents. Both the parallel data and the bilingual dictionary are tokenized with the MeCab segmenter (Kudo et al. 2004).
• For English translations, we trained a truecasing model to keep/reduce tokens' capitalization to their statistical canonical form (Wang et al., 2006; Lita et al., 2003) and we recased the translation output after the decoding process
Additionally, we applied the following methods to optimize the phrase-based translation model for efficiency:
• To reduce the size of the language model and the time needed to query it when decoding, we used the binarized trie-based quantized language model provided in KenLM (Heafield et al. 2013; Whittaker and Raj, 2001)
• To minimize the computing load on the translation model, we compressed the phrase-table and lexical reordering model using the cmph tool (Junczys-Dowmunt, 2012)
For the passive use of the dictionary, we simply appended the dictionary to the training data before the alignment and training process. For the pervasive use of the dictionary, we used the xml-input function in the Moses toolkit to force lexical knowledge in the decoding process (http://www.statmt.org/moses/?n=Advanced.Hybrid#ntoc1).

Results
Table 2 presents the BLEU scores of the Japanese to English (JA-EN) translation outputs from the phrase-based SMT system on the WAT test set. The leftmost columns indicate the number of times the dictionary is appended to the parallel training data (Baseline = 0 times, Passive x1 = 1 time). The rightmost columns present the results from combining the passive and pervasive use of dictionary translations, with the exception of the top-right cell, which shows the result of the pervasive dictionary usage without appending any dictionary.
By repeatedly appending the dictionary to the parallel data, the BLEU score improves significantly from 16.75 to 17.31. Although the system's performance degrades when the dictionary is passively added three times, the score remains significantly better than the baseline. Without the passive use of the dictionary, the pervasive use of the dictionary also improves on the baseline.
The best performance is achieved when the dictionary is passively added four times with the pervasive use of the dictionary during decoding.
The fluctuations in improvement from coupling the passive and pervasive use of an in-domain dictionary give no indication of how the two approaches should be used in tandem. However, using either or both of the approaches improves the translation quality of the baseline system.
Table 3 presents the BLEU scores of the English to Japanese (EN-JA) translation outputs from the phrase-based SMT system on the WAT test set. As in the JA-EN direction, the passive use of the dictionary outperforms the baseline. Unlike in the JA-EN direction, however, the pervasive use of the dictionary consistently yields significantly worse BLEU scores.

Conclusion
Empirically, both the passive and the pervasive use of an in-domain dictionary to extend statistical machine translation models with lexical knowledge modestly improve translation quality. Interestingly, the fact that adding the in-domain dictionary information to the training data multiple times improves MT suggests that there may be a critical probability mass at which a lexicon begins to affect the word and phrasal alignments in a corpus. This may provide insight into optimizing the weights of the salient in-domain phrases in the phrase table.
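One way to picture this effect is through the relative-frequency estimate commonly used for phrase translation probabilities. The expression below is an illustration under simplifying assumptions, not a result from the experiments: c(s, t) denotes the number of times the phrase pair (s, t) is extracted from the original parallel data, the dictionary is assumed to list t as the only translation of s, each appended copy of the dictionary is assumed to contribute exactly one extra extraction of the pair, and alignments elsewhere in the corpus are assumed to stay unchanged.

```latex
\phi_n(t \mid s) = \frac{c(s,t) + n}{\sum_{t'} c(s,t') + n}
```

Under these assumptions, whenever other translations of s have been observed, the estimate increases with n and shifts probability mass toward the dictionary translation.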
Although the pervasive use of dictionary information provides minimal or no improvements to the BLEU scores in our experiments, it remains relevant in industrial machine translation where terminological standardization is crucial in ensuring consistent translations of technical manuals or legal texts where incorrect use of terminology may have legal consequences (Porsiel, 2011).
The reported BLEU improvements from the passive use of dictionary information are a good indication of improved machine translation quality, but the deterioration in BLEU scores under the pervasive use only indicates that the output differs from the reference translation. Further manual evaluation is necessary to verify the poor performance of the pervasive use of dictionary information in machine translation.