Novel Document Level Features for Statistical Machine Translation

In this paper, we introduce document level features that capture the information an MT system needs to perform better word sense disambiguation during translation. We describe enhancements to a Maximum Entropy based translation model that utilize long distance contextual features, identified across the entire document and from both the source and target sides, to improve the likelihood of the correct translation for words with multiple meanings and to improve the consistency of the translation output in a document setting. The proposed features achieve substantial improvements in MT performance, measured by TER and BLEU, on a variety of standard test sets.


Introduction
Most statistical machine translation (MT) systems use the sentence as the processing unit for both training and decoding. This strategy, adopted mainly for efficiency, assumes that each sentence is independent. It therefore loses many kinds of "global" information, such as domain, topic and inter-sentence dependency, which are particularly important for word sense disambiguation (Chan et al., 2007) and need to be learned from the entire document. Table 1 shows the output of our sentence level Arabic-to-English translation engine on two sentences excerpted from a news article discussing Middle East politics. The Arabic sentences are displayed in Romanized form. The Arabic word mrsy denotes the name of the former Egyptian president Morsi in both sentences. In the first sentence it is translated together with the preceding word mHmd (Mohamed) as a phrase and is mapped to the name correctly. In the second sentence, where no relevant local context is present, it is incorrectly translated into the word thank, which is the English word most frequently aligned to mrsy in our training data. This example shows that for ambiguous words like mrsy, local features alone are insufficient to find the correct translation hypothesis. This example also illustrates another weakness of sentence level MT. It has been observed that a word tends to keep the same meaning within one document (Gale et al., 1992; Carpuat, 2009).
To address these issues, this paper investigates document level features that utilize information from a wider context. Three types of document level features, namely source side long distance context, target side long distance context, and "quasi-topic", are integrated into our MT system via the Maximum Entropy framework and lead to substantial improvements in translation performance.

A Practical Scheme to Approximate Document Level Machine Translation
Let $D_f$ denote a document in source language $f$ consisting of $N$ sentences:
$$D_f = \langle f_1, f_2, \ldots, f_N \rangle$$
The goal of document level MT is to search for the best document hypothesis $D_e^*$ in target language $e$ that maximizes the translation probability:
$$D_e^* = \arg\max_{D_e} \Pr(D_e \mid D_f) \qquad (1)$$
We require the number of sentences in $D_f$ and $D_e$ to be equal: $D_e = \langle e_1, e_2, \ldots, e_N \rangle$. Using the chain rule, $\Pr(D_e \mid D_f)$ is estimated as follows:
$$\Pr(D_e \mid D_f) = \prod_{i=1}^{N} \Pr(e_i \mid f_i, D_{f,\bar{i}}, D_{e,i^-}) \qquad (2)$$
where $D_{f,\bar{i}}$ denotes the source document excluding the current sentence $f_i$, and $D_{e,i^-} = \langle e_1, e_2, \ldots, e_{i-1} \rangle$ is the MT output up to the previous sentence. If we keep the i.i.d. assumption of sentence generation, i.e. that $D_{f,\bar{i}}$ and $D_{e,i^-}$ are irrelevant to $\langle f_i, e_i \rangle$, Eq. (2) backs off to standard sentence level translation:
$$\Pr(D_e \mid D_f) = \prod_{i=1}^{N} \Pr(e_i \mid f_i) \qquad (3)$$
In our document level MT experiments, the estimate of $\Pr(e_i \mid f_i, D_{f,\bar{i}}, D_{e,i^-})$ is divided into three separate modules combined by a normalization function:
$$\Pr(e_i \mid f_i, D_{f,\bar{i}}, D_{e,i^-}) = \Theta\big(\Pr(e_i \mid f_i),\ \Pr(e_i \mid D_{f,\bar{i}}),\ \Pr(e_i \mid f_i, D_{e,i^-})\big) \qquad (4)$$
Eq. (4) provides a scheme to integrate document level context features into the translation process. In Eq. (4), $\Pr(e_i \mid f_i)$ is the standard sentence level translation model. $\Pr(e_i \mid D_{f,\bar{i}})$ models how source side long distance features, e.g. features regarding the document topic or a trigger word outside the current sentence, affect the generation of $e_i$. $\Pr(e_i \mid f_i, D_{e,i^-})$ can be viewed as a module that explores the target side cross-sentence dependency between $e_i$ and $D_{e,i^-}$ to maintain translation consistency. $\Theta(\cdot)$ is a normalized combination function that brings the three modules together to generate a probabilistic estimate for each hypothesis. Note that, for the sake of decoding speed, the proposed scheme does not search for the optimal hypothesis in document space directly, but rather enhances sentence translation by utilizing "global" information not limited to the current sentence $f_i$.
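The exact form of Θ(.) is not spelled out in this section. As a minimal sketch only, assuming a weighted log-linear combination that is normalized over the candidate hypotheses of one sentence, the combination of the three modules could look as follows. The function name `combine_modules`, the weights, and all probabilities are hypothetical and chosen purely for illustration.

```python
import math


def combine_modules(candidates, weights=(1.0, 1.0, 1.0)):
    """Combine three module probabilities per hypothesis and normalize.

    candidates maps a hypothesis to a tuple
    (p_sentence, p_source_context, p_target_context).
    ASSUMPTION: a weighted log-linear combination; the actual
    combination function is not specified in the text.
    """
    scores = {}
    for hyp, probs in candidates.items():
        log_score = sum(w * math.log(p) for w, p in zip(weights, probs))
        scores[hyp] = math.exp(log_score)
    z = sum(scores.values())  # normalizing term over all hypotheses
    return {hyp: s / z for hyp, s in scores.items()}


# Hypothetical module outputs for the two translations of "mrsy".
candidates = {
    "thank": (0.6, 0.2, 0.3),
    "Morsi": (0.4, 0.8, 0.7),
}
document_level = combine_modules(candidates)
# Zero weight on the two document modules mimics the sentence level back-off.
sentence_level = combine_modules(candidates, weights=(1.0, 0.0, 0.0))
```

With zero weight on the two document level modules the sketch reduces to the sentence level model, which mirrors the way Eq. (2) backs off under the i.i.d. assumption.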

Document Level Context Features
The MT system adopted in our experiments is a direct translation model that uses the Maximum Entropy framework to combine multiple types of lexical and syntactic features (Ittycheriah and Roukos, 2007). The model has the following form:
$$\Pr(t, j \mid s) = \frac{\Pr_0(t, j \mid s)}{Z} \exp\Big(\sum_k \alpha_k \phi_k(s, t)\Big) \qquad (5)$$
where $s$ is a source side word or phrase, $t$ is the corresponding word or phrase translation, $j$ is the transition distance from the last translated word, $\Pr_0$ is a prior distribution related to the phrase-to-phrase translation model and the distortion model, and $Z$ is a normalizing term. In Eq. (5), the feature $\phi_k(s, t)$ can be viewed as a binary question about lexical and syntactic attributes of $s$ and $t$, e.g. whether $s$ and $t$ share the same POS class. The weight $\alpha_k$ is estimated using the Iterative Scaling algorithm. Results from many evaluation tasks have shown that the MaxEnt system performs significantly better than a regular phrase-based system and as well as a hierarchical system.
This section introduces three new types of document level features to model $\Pr(e_i \mid D_{f,\bar{i}})$ and $\Pr(e_i \mid f_i, D_{e,i^-})$. All three types of features can be expressed as a triplet $\phi_k(s, t) = \langle s, c, t \rangle$, where $c$ denotes a source or target side context word, identified from the entire document, which works as a bridge connecting $s$ and $t$. Note that $\phi_k$ is still a binary feature, indicating whether a particular context word $c$ of a certain type exists for $s$ and $t$.
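To make the role of the triplet features concrete, the following is a minimal sketch of how fired binary features of the form <s, c, t> rescale a prior distribution in a MaxEnt model. It is a simplification under stated assumptions: the transition distance j is ignored, and the function name `maxent_prob`, the prior values and the feature weights are hypothetical rather than learned values from our system.

```python
import math


def maxent_prob(s, context_words, prior, weights):
    """Return Pr(t | s) proportional to prior[t] * exp(sum of fired weights).

    A triplet feature (s, c, t) fires when context word c is available
    for source word s and candidate translation t.
    ASSUMPTION: the transition distance j is left out for brevity.
    """
    scores = {}
    for t, p0 in prior.items():
        fired = sum(weights.get((s, c, t), 0.0) for c in context_words)
        scores[t] = p0 * math.exp(fired)
    z = sum(scores.values())  # normalizing term Z
    return {t: v / z for t, v in scores.items()}


# Hypothetical prior and feature weights for the running example.
prior = {"thank": 0.7, "Morsi": 0.3}
weights = {
    ("mrsy", "mHmd", "Morsi"): 1.5,
    ("mrsy", "AlmSry", "Morsi"): 1.0,
}
without_context = maxent_prob("mrsy", set(), prior, weights)
with_context = maxent_prob("mrsy", {"mHmd", "AlmSry"}, prior, weights)
```

Without any context word the distribution equals the prior, so thank wins; once both context features fire, the probability mass shifts toward Morsi.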

Source Side Long Distance Context Feature
The first type of document level feature is motivated by the example shown in Table 1. The ambiguous Arabic word mrsy in the second sentence is mistranslated into the English word thank because there is no local evidence to suggest that it is a person name rather than a verb. The verb reading is more common in the training data and thus has a higher translation probability in the prior phrase model $\Pr_0$. In this case, if the words co-occurring with mrsy in the first sentence, such as mHmd (Mohamed), can be identified and passed to subsequent sentences, the probability of mrsy being translated into Morsi elsewhere in the same document is likely to increase. $\phi_{LDC}$, the long distance context (LDC) feature, is implemented as follows in the training stage. Suppose the source word in question, $w_f$, occurs in sentence $i$ with translation $w_e$. To identify the relevant LDC word $c_f$, the entire document excluding the current sentence $i$ is analyzed to find whether the alignment $(w_f, w_e)$ also occurs in other sentences. If so, the source words within a window centered on $w_f$ at that position are collected as candidates for $c_f$. For instance, if the two Arabic sentences of Table 1 are in the training data, the words mHmd (Mohamed) and AlmSry (Egyptian) in the first sentence will be viewed as LDC words for mrsy in the second sentence, which results in two $\phi_{LDC}$ features, i.e. <mrsy, mHmd, Morsi> and <mrsy, AlmSry, Morsi>. As illustrated in Eq. (5), the two features can boost the translation probability Pr(Morsi|mrsy) for the entire document if their weights are properly learned.
In the training stage, checking the aligned target word $w_e$ ensures that only words with the same meaning are grouped together to share context features. In decoding, where the true $w_e$ is unknown, we use only $w_f$ instead of $(w_f, w_e)$ to identify the LDC word $c_f$. In our experiments, function words are not allowed to be $c_f$, and the tf-idf score is used to filter out irrelevant context words.
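The extraction procedure can be sketched as below. This is an illustrative sketch, not the system's implementation: it assumes one-to-one word alignments, represents a function word list by a plain stopword set, and omits the tf-idf filter. The function name `ldc_candidates`, the window size and the toy alignments are hypothetical.

```python
def ldc_candidates(doc, i, w_f, w_e=None, window=2, stopwords=frozenset()):
    """Collect long distance context (LDC) words for w_f in sentence i.

    doc is a list of sentences; each sentence is a list of
    (source_word, aligned_target_word) pairs. ASSUMPTION: one-to-one
    alignment. With w_e given (training), only occurrences aligned to
    w_e count; with w_e=None (decoding), any occurrence of w_f counts.
    """
    found = set()
    for k, sent in enumerate(doc):
        if k == i:  # the current sentence is excluded
            continue
        for pos, (src, tgt) in enumerate(sent):
            if src != w_f or (w_e is not None and tgt != w_e):
                continue
            lo = max(0, pos - window)
            hi = min(len(sent), pos + window + 1)
            for q in range(lo, hi):
                c = sent[q][0]
                if q != pos and c not in stopwords:
                    found.add(c)
    return found


# Hypothetical word-aligned toy document based on the running example.
doc = [
    [("AlmSry", "Egyptian"), ("mHmd", "Mohamed"),
     ("mrsy", "Morsi"), ("ySf", "describes")],
    [("mrsy", "Morsi"), ("ytHdY", "defies")],
]
context = ldc_candidates(doc, i=1, w_f="mrsy", w_e="Morsi")
features = {("mrsy", c, "Morsi") for c in context}
```

In this toy setting the context words found for mrsy in the second sentence are AlmSry, mHmd and ySf. A tf-idf filter, omitted here, would be one way to suppress less relevant candidates.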

Target Side Long Distance Context Feature
To improve the consistency of word choice in hypothesis generation, $\phi_{LDC}$ can be extended to the target side to utilize the correlation between $D_{e,i^-}$, the translation of the previous sentences, and $e_i$, the translation of the current sentence. In the training stage, $\phi_{tLDC}$, the target side long distance context (tLDC) feature, is implemented as follows. For a source word in question, $w_f$, which is aligned to target word $w_e$ in sentence $i$, we search for occurrences of the pair in all previous sentences from 1 to $i-1$. If one exists, the target side words within the window centered on $w_e$ in that sentence are identified as candidates for the tLDC word $c_e$ of $w_f$. For the example used before, the English side words Mohamed and president are expected to produce $\phi_{tLDC}$ features for the word mrsy in the second sentence.
In the decoding stage the feature is implemented similarly, by remembering the previous translation $D_{e,i^-}$ and its alignment to the source words. Note that we do not use the hypothesized translation $w_e$ itself as the tLDC word for $w_f$. If it were an incorrect translation, the error could propagate to subsequent sentences and be repeated there.
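The decoding-time bookkeeping can be sketched as a small memory of previously translated sentences. This is a hedged illustration under simplifying assumptions: each source word is aligned to a single target position, and the class name `TargetContextMemory`, its methods, the window size and the stopword set are hypothetical.

```python
from collections import defaultdict


class TargetContextMemory:
    """Remember target side context words seen around earlier translations."""

    def __init__(self, window=2, stopwords=frozenset()):
        self.window = window
        self.stopwords = stopwords
        self.context = defaultdict(set)  # source word -> tLDC candidates

    def add_sentence(self, target_words, alignment):
        """Store context from one translated sentence.

        alignment is a list of (source_word, target_index) pairs.
        ASSUMPTION: each source word aligns to one target position.
        """
        for src, pos in alignment:
            lo = max(0, pos - self.window)
            hi = min(len(target_words), pos + self.window + 1)
            for q in range(lo, hi):
                # The translation itself (q == pos) is never stored, so a
                # wrong hypothesis cannot reinforce itself later on.
                if q != pos and target_words[q] not in self.stopwords:
                    self.context[src].add(target_words[q])

    def context_for(self, w_f):
        """Return tLDC candidates for w_f from all sentences added so far."""
        return set(self.context.get(w_f, set()))


memory = TargetContextMemory(window=2)
first = "The deposed Egyptian president Mohamed Morsi describes himself".split()
memory.add_sentence(first, [("mrsy", 5)])
tldc_words = memory.context_for("mrsy")
```

Because sentences are added only after they have been translated, a lookup for the current sentence sees the output up to the previous sentence and nothing later.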

LSA based Quasi-Topic Feature
LDC and tLDC features are effective for repeated words. For words occurring only once in a document, the quasi-topic (QT) feature $\phi_{QT}$ is proposed as a back-off model that utilizes underlying topic information to resolve the ambiguity of these words.
In the training stage, Latent Semantic Analysis (LSA) is performed on a bilingual corpus consisting of a large set of documents with parallel sentences. Both source and target side words are mapped to vectors located in a unified high dimensional space. For a source word in question, $w_f$, its QT feature words are selected as follows. First, a tf-idf score is computed for every source side content word in the same document, and the words are sorted by score from high to low. The top $L$ words are then collected as indicators of the underlying document topic. Next, the semantic similarity between $w_f$ and each of the $L$ candidates is measured with the cosine metric. Only words showing strong correlation are selected as QT feature words $c_t$ for $w_f$.
In the decoding stage, the MaxEnt model shown in Eq. (5) is used to estimate the probability of a hypothesis $w_e$ being generated from $w_f$ and the QT feature words $c_t$. Our preliminary experiments found that the MaxEnt model performs better than a commonly used vector based similarity metric. Generally speaking, the QT features provide an implicit way of performing topic adaptation. When applied to translation, they change the lexical distribution of target words to prefer the one more relevant to the hidden topic represented by $c_t$.
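The selection of QT feature words can be sketched as follows. The sketch assumes that LSA vectors are already available as plain lists of floats, uses a standard tf-idf formula (term count times log inverse document frequency), and excludes the word in question from its own topic indicators. The function names, the threshold, the toy words and the toy vectors are all hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def qt_feature_words(w_f, doc_words, doc_freq, n_docs, vectors,
                     top_l=3, threshold=0.5):
    """Select quasi-topic feature words for w_f.

    doc_words: content words of the document (source side).
    doc_freq:  number of documents each word occurs in.
    vectors:   word -> LSA vector (ASSUMED to be precomputed).
    """
    # Step 1: tf-idf score for every content word except w_f itself.
    tf = Counter(w for w in doc_words if w != w_f)
    tfidf = {w: c * math.log(n_docs / doc_freq[w]) for w, c in tf.items()}
    # Step 2: the top L words act as indicators of the document topic.
    topic_words = sorted(tfidf, key=tfidf.get, reverse=True)[:top_l]
    # Step 3: keep only indicators strongly correlated with w_f.
    return [c for c in topic_words
            if cosine(vectors[w_f], vectors[c]) >= threshold]


# Hypothetical toy data, not taken from the experiments.
doc_words = ["bank", "river", "river", "water", "loan"]
doc_freq = {"river": 10, "water": 20, "loan": 50}
vectors = {"bank": [1.0, 0.0], "river": [0.9, 0.1],
           "water": [0.8, 0.6], "loan": [0.0, 1.0]}
selected = qt_feature_words("bank", doc_words, doc_freq, 100, vectors)
```

The selected words would then serve as the context element c in <s, c, t> triplets scored by the MaxEnt model.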

Related Work
Recent years have witnessed a growing interest in exploiting long distance dependency to improve MT performance (Wong and Kit, 2012; Hardmeier et al., 2013). Domain adaptation and topic adaptation have attracted considerable attention (Eidelman et al., 2012; Chen et al., 2013; Hewavitharana et al., 2013; Xiong and Zhang, 2013a; Hasler et al., 2014). There are also efforts that explore lexical cohesion with the help of WordNet to describe semantic co-occurrence across the document (Ben et al., 2013; Xiong et al., 2013b). Translation consistency, related to the observation of one sense per discourse (Gale et al., 1992; Carpuat, 2009; Guillou, 2013), has been discussed recently as an additional metric for evaluating translation quality (Xiao et al., 2011; Ture et al., 2012). There are also efforts on Arabic proper name disambiguation (Hermjakob et al., 2008). This paper investigates novel document level features that utilize lexical and semantic dependencies between sentences. In contrast to Ben et al. (2013) and Xiong et al. (2013b), our work does not need external resources such as WordNet or human effort to identify word cohesion, and it is not limited to a certain word type. This advantage makes the proposed features better suited to low-resource languages.

Experiments
Our system is primarily built for an Arabic dialect to English MT task. The training data contains LDC-released parallel corpora for the BOLT project. In total there are 6.9M sentence pairs, with 207M Arabic ATB tokens and 201M English words. The system is tuned with the method of Hopkins and May (2011) to minimize the (TER-BLEU) score.
We select NIST Arabic MT03-MT09 as the test sets. Results are shown in Table 2. The two numbers in each score column are TER followed by BLEU. The best performance is shown in bold. The result of the MT system using only sentence level features is listed as the baseline. The systems that add the three document features are denoted +LDC, +tLDC and +QT, respectively. Table 2 shows that substantial improvements in translation quality, measured by both TER and BLEU, are achieved on most of the test sets.
To understand the effectiveness of the document features on different types of data, we further split the MT09 set into newswire and weblog and test on each. Table 3 shows that the long distance context features, $\phi_{LDC}$ and $\phi_{tLDC}$, perform better on newswire than on weblog with respect to the relative improvement of TER and BLEU. One explanation is that the rate of content word repetition differs between the two types of data. According to our calculation, about 19% of content words in newswire repeat themselves, while the ratio on weblog is about 13%.
Table 3: MT performance on MT09 newswire and weblog in terms of TER and BLEU.
Table 4 shows the new MT output of the two example sentences. Three LDC features are fired for mrsy in the second sentence: <mrsy, AlmSry, Morsi>, <mrsy, mHmd, Morsi> and <mrsy, ySf, Morsi>, where the third is a false alarm. The items in each triplet correspond to the source word, the context word and the hypothesized word, respectively. Three tLDC features are also fired: <mrsy, Egyptian, Morsi>, <mrsy, Mohamed, Morsi> and <mrsy, describes, Morsi>, where the third is again a false alarm. To our surprise, the word Alr}ys and its translation president are not fired as context features. Analysis found that this is because our LDC-released training data was collected before Dr. Morsi was elected president in 2012. Therefore no relevant feature was learned by the MaxEnt model.
AR: Alr}ys AlmSry AlmEzwl mHmd mrsy ysf nfsh b nh r}ys Aljmhwryp
MT: The deposed Egyptian president Mohamed Morsi describes himself as the president of the republic
AR: mrsy ytHdY AlqADy fy mHAkmth bthmp Alhrwb mn Alsjn
MT: Morsi defies the judge in his trial on charges of escaping from prison