Document-Level Machine Translation Evaluation with Gist Consistency and Text Cohesion

Current Statistical Machine Translation (SMT) is significantly affected by Machine Translation (MT) evaluation metrics. The emergence of document-level MT research increases the demand for corresponding evaluation metrics. This paper proposes two superior yet low-cost quantitative, objective methods that enhance traditional MT metrics by modeling document-level phenomena from the perspectives of gist consistency and text cohesion. The experimental results show that the proposed metrics correlate better with human judgments than traditional metrics when evaluating document-level translation quality.


Introduction
Since most current SMT models impose strong independence assumptions on words and sentences, most of these systems work only at the sentence level and cannot exploit useful relationships among sentences during decoding. However, a text, rather than individual words or fragments of sentences, is the basic unit of communication (Al-Amri, 2007). Beaugrande and Dressler (1981) define a text as a communicative occurrence which meets seven standards of textuality, such as cohesion and coherence. A text is constituted by sentences, but there exist separate principles of text construction beyond the rules for making sentences (Fowler, 1991).
A document is the carrier of text in modern computer systems. Currently, more research work focuses on document-level SMT (Tiedemann, 2010; Xiao et al., 2011; Gong et al., 2011; Ture et al., 2012; Hardmeier et al., 2012; Xiong et al., 2013). However, most of these studies show their improvements using system-level metrics, such as BLEU (Papineni et al., 2002). Whether improvements at the system level really reflect changes in text-level translation quality remains in doubt.
Recently, the study of genuinely document-level MT metrics has been drawing more and more attention. Based on Discourse Representation Theory (Kamp and Reyle, 1993), Gimenez et al. (2010) propose to use co-reference and discourse relations to build evaluation metrics. Metrics that extend traditional metrics with lexical cohesion devices show some positive experimental results (Wong and Kit, 2012). Bilingual topic models (Blei et al., 2003) are applied to MT quality estimation (Raphael et al., 2012; Raphael et al., 2013). Guzman et al. (2014) use two discourse-aware similarity measures based on discourse structure to improve existing MT evaluation metrics.
According to the aforementioned definition of text, the most important standard for evaluating the translation quality of a document should be the degree to which the MT output correctly communicates the main idea of the original text. In this regard, this paper first proposes to measure the gist consistency of a text via a topic model. A topic model is a statistical model which assumes that each document can be characterized by a particular set of topics. A variety of probabilistic topic models (Landauer et al., 1998; Hofmann, 1999; Blei et al., 2003) have been used to analyze the content of documents and the meaning of words. Our experimental results show that MT evaluation metrics with a robust topic model can effectively capture changes in translation quality between the reference and the MT output at the document level.
Furthermore, cohesion and coherence are important standards of textuality. Coherence concerns the connectedness of meaning in the underlying text, while cohesion can be formulated quite explicitly on the basis of grammatical and lexical properties (Halliday and Hasan, 1976). This paper describes a simple yet effective cohesion function that measures text cohesion via lexical chains. Our experimental results show that the number of matching lexical chains between the reference and the MT output can reflect the goodness of a translation at the document level.
The rest of this paper is organized as follows: Sections 2 and 3 describe how to model the two kinds of document-level features, respectively. Section 4 shows the framework for combining document-level scores with traditional metrics. Section 5 presents the experimental results and Section 6 gives a discussion. Finally, we conclude this paper in Section 7.

Gist Consistency Score based on Topic Model
Reeder (2006) proposes to measure MT adequacy at the document level with Latent Semantic Analysis (LSA) (Landauer et al., 1998). However, Reeder only uses a set of complex configurations to show the close correlation between the LSA model and human assessments, and does not suggest how to use it to design an evaluation metric. Raphael et al. (2012; 2013) exploit bilingual topic models for quality estimation (without references) of machine translation. In this study, since each evaluation document has 4 references, we show a simple way to design document-level metrics with a monolingual topic model.

Topic Model
LDA (Blei et al., 2003) is one of the most common topic models. It assumes that each document is a mixture of various topics and that each word is generated from a multinomial distribution conditioned on a topic. We use an off-the-shelf LDA tool to train a topic model on 86,070 news documents from the year 2004, taken from the Xinhua portion of the Gigaword corpus (LDC2005T12). A trained LDA model produces two kinds of distributions: the "document-topic" distribution and the "topic-word" distribution. Suppose there are K topics; the k-th dimension P(z = k|d) is the probability of topic k given document d. The whole document-topic distribution over K topics for one document d, denoted as P(Z|d), can be represented by a K-dimensional vector. In this study, the trained LDA model reaches its minimal perplexity (Blei et al., 2003) when K is set to 120.

Measure of Topic Consistency
After constructing a trained topic model, the "document-topic" distributions of the MT output and of the reference on the evaluation dataset (see Section 5.1) can each be inferred.
We use the Kullback-Leibler (KL) divergence to measure topic consistency between the MT output and the reference, with the document as the basic unit. Denote the "document-topic" distribution of one reference (d_r) as P(Z|d_r), and that of its MT output (d_t) as Q(Z|d_t). The KL divergence of Q from P is defined to be:
$$D_{KL}(P \,\|\, Q) = \sum_{k=1}^{G} P(z = k \mid d_r) \, \ln \frac{P(z = k \mid d_r)}{Q(z = k \mid d_t)} \tag{1}$$
where G is the number of topics over which the sum is taken. In theory, G should be the same as the number of topics of the trained LDA model (K = 120). However, our initial experimental results show that the hybrid METEOR drops on adequacy on the evaluation dataset when a static G is used.
To address this problem, we output, for each reference document, the number of topics whose document-topic probability is greater than 0.01 (called valid topics) and found that this number lies in the range [7, 31]. Evidently, the inferred topic distribution contains plenty of noise topics, and for each document we need to measure the consistency of the valid topics rather than of all topics.
Therefore, before computing topic consistency, we first record the IDs of the valid topics of one reference, and then obtain the corresponding "document-topic" probabilities of the evaluated document according to these topic IDs. Thus, in this study, G is set dynamically according to the number of valid topics of each reference.
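The valid-topic restriction can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the `eps` floor applied to the MT-side probability, and the choice not to renormalize the restricted distributions are all assumptions that the text does not specify.

```python
import math

def valid_topic_ids(ref_dist, threshold=0.01):
    """IDs of topics whose probability in the reference exceeds the threshold."""
    return [k for k, p in enumerate(ref_dist) if p > threshold]

def valid_topic_kl(ref_dist, mt_dist, threshold=0.01, eps=1e-12):
    """KL divergence of the MT distribution from the reference distribution,
    summed over the reference's valid topics only (G = number of valid topics).
    """
    total = 0.0
    for k in valid_topic_ids(ref_dist, threshold):
        p = ref_dist[k]
        # Assumption: floor the MT probability to avoid division by zero.
        q = max(mt_dist[k], eps)
        total += p * math.log(p / q)
    return total
```

Because the two distributions are truncated to the valid topics without renormalization in this sketch, the sum is not guaranteed to be non-negative the way a full KL divergence is.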
There are 4 references per document in the evaluation data. Each machine-translated document is scored against each reference independently, and the minimal D_KL is used. The score of topic consistency for each evaluation document, denoted as S_topic, is computed by the following formula:

Cohesion Score based on Simplified Lexical Chain
Text adequacy is the most important standard for the purpose of successful communication.
According to the work of Wong and Kit (2012), cohesion is another important element in organizing a text. They found that SMT systems tend to use fewer lexical cohesion devices than human translators. Here, lexical cohesion devices mainly refer to content words that are repeated one or more times in a document. They propose to build document-level MT metrics by integrating a cohesion score based on lexical cohesion devices. However, Carpuat and Simard (2012) draw a different conclusion: MT output tends to have more incorrect repetitions than human translation, especially when the MT model is trained on smaller corpora. If we regard these incorrect repetitions as "false" cohesion, the metrics in (Wong and Kit, 2012) fail to distinguish such "false" cohesion devices.
In our opinion, the shortcoming of Wong and Kit's work is that it completely ignores the text cohesion of the references and only models the cohesion score of the MT output. In this study, we assume that the correct cohesion of the MT output should be consistent with that of the references. A reference is the equivalent of its source text, and the MT output can be cohesive only if the source text is cohesive, so the assumption is reasonable.
In this paper, we implement this assumption via a special structure, the simplified lexical chain.

Simplified Lexical Chain
Unlike the lexical chain in previous work (Morris and Hirst, 1991; Galley and McKeown, 1993; Xiong et al., 2013), which is a sequence of semantically related words based on a special thesaurus, our lexical chain refers to repeated words, including stem-matched words. Furthermore, it only records position information for each content word. Our lexical chain is simpler and might gain broader use because it does not require a special thesaurus such as WordNet or HowNet. Thus, we call it a simplified lexical chain.
The detailed procedure for establishing simplified lexical chains is described in another work of ours (Gong and Zhou, 2015). The key of this procedure is to assure that each content word occurring at different sentences one more time is
Figure 1 shows a lexical chain LC1 for the word "die" (perhaps with different morphology); it records that "die" occurs in the 1st, 2nd and 3rd sentences. One document often contains several lexical chains, so a hash table ht is used to organize all these chains.
For clarity, ht is called the lexical-chain index. In this hash table, keys are content words and values are lexical chains.
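A lexical-chain index of this kind can be sketched as a dictionary that maps each stemmed content word to the sorted list of sentence positions where it occurs. The sketch below is illustrative only: the stop-word list, the `crude_stem` helper and the rule that a chain needs occurrences in at least two different sentences are assumptions, not the procedure of Gong and Zhou (2015).

```python
import re
from collections import defaultdict

# Hypothetical, tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "in", "on", "of", "and", "to",
              "is", "are", "was", "were", "will"}

def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s", "d"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_lexical_chain_index(sentences):
    """Map each stemmed content word to the 1-based sentence positions
    where it occurs, keeping only words seen in two or more sentences."""
    positions = defaultdict(set)
    for idx, sentence in enumerate(sentences, start=1):
        for token in re.findall(r"[a-z]+", sentence.lower()):
            if token not in STOP_WORDS:
                positions[crude_stem(token)].add(idx)
    # Assumption: a chain requires repetition across different sentences.
    return {word: sorted(pos) for word, pos in positions.items() if len(pos) >= 2}
```

For a three-sentence document in which "died", "dies" and "die" appear in sentences 1, 2 and 3, the index holds the chain `[1, 2, 3]` under the key `"die"`, which mirrors the LC1 example.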

Cohesion Score
We constructed in advance a lexical-chain index for each document in our evaluation data, including the 4 human translations (references) and all MT outputs on the evaluation corpus. Due to the high flexibility of natural language utterances, few lexical chains from the MT output completely match those from its references. We therefore design a special function that permits incomplete matching to score text cohesion.
Denote the lexical-chain indexes of the reference and of the MT output as ht_ref and ht_mt. Suppose we find a pair of matching lexical chains in ht_ref and ht_mt, denoted as LC_r and LC_t. LC_r contains m elements and LC_t contains n elements, but only m' (m' <= m) elements occur in both LC_r and LC_t. The cohesion score of LC_t can then be calculated by the following formula:
CS_i refers to only one pair of matching chains. If a chain of the MT output cannot be found in its reference, the chain is invalid ("false"). Suppose ht_mt contains K lexical chains; we punish such "false" cohesion by averaging over K. Given that the number of matching chains is L, the final cohesion score assigned to ht_mt is calculated as follows:
$$Doc_{cs} = \frac{1}{K} \sum_{i=1}^{L} CS_i \tag{4}$$
We choose the best Doc_cs for one MT output against the 4 references.
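The document-level aggregation can be sketched as follows. The per-chain score CS_i is not reproduced in the text, so `overlap_ratio` below is a purely hypothetical stand-in (shared positions divided by the longer chain length) and should be swapped for the actual formula. Only the averaging over all K chains of the MT output, and the choice of the best score over the references, follow the description above. Chains are assumed to be lists of sentence positions keyed by content word.

```python
def overlap_ratio(chain_ref, chain_mt):
    """Hypothetical per-chain score: shared positions over the longer chain.
    This is an assumption, not the paper's CS_i formula."""
    shared = len(set(chain_ref) & set(chain_mt))
    return shared / max(len(chain_ref), len(chain_mt))

def doc_cohesion(ht_ref, ht_mt, chain_score=overlap_ratio):
    """Sum the scores of the matching chains and divide by K, the number of
    chains in the MT output, so unmatched ("false") chains lower the score."""
    if not ht_mt:
        return 0.0
    total = sum(chain_score(ht_ref[word], chain)
                for word, chain in ht_mt.items() if word in ht_ref)
    return total / len(ht_mt)

def best_doc_cohesion(ht_refs, ht_mt, chain_score=overlap_ratio):
    """Best cohesion score of one MT output against several references."""
    return max(doc_cohesion(ht_ref, ht_mt, chain_score) for ht_ref in ht_refs)
```

Passing the chain scorer as a parameter keeps the aggregation independent of whichever per-chain formula is used.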

Traditional MT Evaluation Metrics
For fair comparison and possible integration of our proposed document-level features, this section gives a brief introduction to two widely adopted MT evaluation metrics: BLEU (Papineni et al., 2002) and METEOR (Banerjee and Lavie, 2005). As the most famous evaluation metric, BLEU is based on n-gram matching. Given a system translation, BLEU first collects all n-grams and counts how many of them exist in one or more references (sentence by sentence), and then integrates the precisions of n-grams of different lengths into one score as follows:
$$BLEU = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right) \tag{5}$$
where p_n is the precision of n-grams of length n, w_n is the corresponding weight, and BP is a brevity penalty that prevents BLEU from favoring short segments, since recall is not considered directly. Although BLEU takes all n-grams into consideration, it ignores the differing importance of n-grams apart from their lengths.
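As an illustration of the n-gram matching just described, the following sketch computes a document-level BLEU by pooling clipped n-gram counts over all sentences of a document. It is a minimal sketch, not the NIST script: uniform weights, the closest-reference-length rule and the absence of smoothing are assumptions made here for brevity.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def document_bleu(hyp_sents, ref_sents_list, max_n=4):
    """hyp_sents: list of token lists, one per sentence.
    ref_sents_list: for each sentence, a list of reference token lists."""
    matched = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, refs in zip(hyp_sents, ref_sents_list):
        hyp_len += len(hyp)
        # Closest reference length; ties go to the shorter reference.
        ref_len += min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            max_ref = Counter()
            for r in refs:
                max_ref |= ngrams(r, n)  # element-wise maximum of counts
            matched[n - 1] += sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
            total[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matched) == 0:
        return 0.0  # no smoothing in this sketch
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)
```

Pooling counts over the whole document before taking precisions is what distinguishes this from averaging sentence-level scores.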
METEOR is based on the unigram alignment of references and MT output. Each unigram in a system translation is mapped to at most one unigram in the references, and three successive stages of "exact", "porter stem" and "WN synonymy" matching are used in turn to create the alignment. Once the final alignment is produced, unigram precision (P) and recall (R) are calculated and combined into one F_mean score:
$$F_{mean} = \frac{10 P R}{R + 9 P} \tag{6}$$
Finally, the METEOR score is obtained as follows:
$$METEOR = F_{mean} \cdot (1 - pen) \tag{7}$$
where pen is a penalty factor. METEOR is explicitly designed to improve the correlation with human judgments of MT quality at the sentence level, and it outperforms BLEU at that level. Based on formula 5 or 7, a document-level BLEU/METEOR score can be generated by aggregating the sentences of a document rather than simply averaging sentence-level scores.
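The scoring step of METEOR can be sketched from alignment statistics alone. The sketch below follows the original formulation of Banerjee and Lavie (2005), in which recall is weighted nine times as heavily as precision and the penalty is 0.5 * (chunks / matches)^3. The METEOR 1.4 toolkit used in the experiments is parameterized differently, so this should be read as an illustration rather than a reimplementation of that toolkit. The function name and arguments are hypothetical.

```python
def meteor_2005(matches, hyp_len, ref_len, chunks):
    """METEOR score from alignment statistics (2005 formulation).

    matches: number of aligned unigrams
    hyp_len, ref_len: unigram counts of the MT output and the reference
    chunks: number of contiguous aligned chunks
    """
    if matches == 0:
        return 0.0
    precision = matches / hyp_len
    recall = matches / ref_len
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean * (1 - penalty)
```

For a document-level score, the four statistics would be summed over the sentences of a document before the formula is applied.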

The Combining Framework
Gist consistency and text cohesion are top-level characteristics of a text, while traditional MT evaluation metrics, such as document-level BLEU, show the degree to which the n-grams of the references also occur in the MT output. Inspired by the work of Wong and Kit (2012), we construct a document-level metric by extending a traditional metric with the two aforementioned kinds of document-level scores as
$$H = \alpha \, G_{mdoc} + \beta \, S_{mdoc} \tag{8}$$
where G_mdoc refers to the document-level BLEU or METEOR score (one score per document), and S_mdoc to the gist consistency score (S_topic) or the text cohesion score (Doc_cs) proposed in this paper. α and β are weights tuned on the MTC2 evaluation dataset (see Section 5.1) by a gradient ascent algorithm whose objective is to maximize the correlation value (Liu and Gildea, 2007).

Evaluation Data
Each machine-translated sentence in MTC4 and MTC2 was evaluated by 2 to 3 human judges for adequacy and fluency on a 5-point scale.
To avoid the bias in the distributions of different judges' assessments in the evaluation data, we normalize the scores following Blatz et al. (2003).
It is worth noting that, because the two evaluation datasets lack document-level human assessments, we obtain document-level human assessments by averaging the sentence scores, weighted by sentence length. This method is also adopted by the well-known MetricsMaTr (the NIST Metrics for Machine Translation Challenge) and approximated in Gimenez et al. (2010) and Wong and Kit (2012).
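The length-weighted averaging can be written down directly. In this minimal sketch the function name is hypothetical, and sentence length is assumed to be a token count supplied by the caller, since the text does not say which side's length is used.

```python
def document_human_score(sentence_scores, sentence_lengths):
    """Document-level human score: sentence scores averaged with
    sentence lengths as weights."""
    total_length = sum(sentence_lengths)
    if total_length == 0:
        raise ValueError("document has no tokens")
    weighted = sum(s * n for s, n in zip(sentence_scores, sentence_lengths))
    return weighted / total_length
```

For example, two sentences scored 4.0 and 2.0 with lengths 30 and 10 yield (4.0 * 30 + 2.0 * 10) / 40 = 3.5, so the longer sentence dominates the document score.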

The Performance of Extending Metrics
In this study, both Pearson and Kendall coefficients are used to measure correlation, following the practice of MetricsMaTr. Note that Pearson ranges from -1 to 1, with 1 for total positive correlation, 0 for no correlation and -1 for total negative correlation, while Kendall likewise ranges from -1 to 1, with 1 for complete agreement between the two rankings, 0 for no association and -1 for complete disagreement. The document-level BLEU and METEOR scores (one score per document) are first obtained via the NIST BLEU script (version 13) and the METEOR toolkit 1.4. The correlation between traditional metrics and human judgments is shown in Table 2.
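The evaluation loop can be sketched in plain Python: combine a traditional document-level score with a document-level feature score, then correlate the result with human judgments. Several points are assumptions made for illustration: the hybrid score is taken to be a weighted sum with weights alpha and beta, Kendall's coefficient is computed as tau-a (ties count as neither concordant nor discordant), and all function names are hypothetical. The text does not state which Kendall variant MetricsMaTr uses.

```python
import math
from itertools import combinations

def hybrid_scores(metric_scores, feature_scores, alpha, beta):
    """Assumed weighted-sum combination, one value per document."""
    return [alpha * g + beta * s for g, s in zip(metric_scores, feature_scores)]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / pairs
```

Tuning then amounts to searching for the alpha and beta that maximize the chosen correlation between `hybrid_scores(...)` and the human scores on the development data.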
After introducing the gist consistency score into traditional MT metrics, the Kendall correlation between the hybrid BLEU (HBLEU(S_topic)) and human judgments rises from 42.56% to 48.66% on adequacy on MTC4, with a similar increase on MTC2. The Kendall correlation of the hybrid METEOR (HMETEOR(S_topic)) also rises significantly (0.8%-1.4%) on both MTC4 and MTC2.
After introducing the cohesion score into traditional metrics, the Kendall correlation between the hybrid BLEU (HBLEU) and human judgments rises from 42.56% to 48.00% on MTC4, with a similar increase on MTC2. Furthermore, in contrast to the results in Wong and Kit's work, our hybrid METEOR (HMETEOR) scores also obtain a moderate rise (0.64%-0.67%) on both MTC4 and MTC2.
It seems that gist consistency outperforms text cohesion in evaluating document-level MT output. It is worth noting that α and β are 1.47 and 0.51 when the gist consistency score is combined with METEOR, and 1.82 and 0.02 when the text cohesion score is combined with METEOR. The cohesion score thus seems to play only a minor role in improving METEOR in this study. We think the approximated document-level human judgments may be the major reason (see Section 5.1).

The Impacts of Associating Gist Consistency with Text Cohesion
In this paper, gist consistency is obtained from an LDA topic model, which uses representative terms for the major topics present in a document, and the training procedure of LDA actually relies on term repetition. Text cohesion is obtained from simplified lexical chains, which also depend on repeated words. In a sense, both measures are based on the same kind of information, although it is measured differently. It is therefore interesting to see whether combining them with BLEU or METEOR increases performance.
According to the results shown in Table 3, both document-level BLEU and METEOR enhanced with the combination of gist consistency and text cohesion are inferior to the corresponding metrics enhanced with gist consistency only. BLEU with the combination is still superior to BLEU enhanced with text cohesion only, while METEOR with the combination shows a slight drop compared with METEOR enhanced with text cohesion only.
Table 3: The Kendall correlation between human judgments and the proposed metrics with the combination of gist consistency and text cohesion
METEOR uses WordNet to help evaluation, so METEOR can utilize synonym information. In this paper, the LDA model utilizes an additional large training corpus (see Section 2), so it may contain synonym information in some topics. Furthermore, we only focus on the major topics of a document, which may help METEOR highlight some important words in the scope of the document.
In this study, METEOR with text cohesion shows only a slight improvement, since our lexical chain ignores synonyms for the sake of generality. However, using different target words to translate the same source word in different contexts is common. In future work, we will build lexical chains by introducing synonyms. Furthermore, note that one additional weight in formula 8 needs to be tuned with the gradient ascent algorithm, which might be another reason for the degraded performance.
Table 2: The correlation between human judgments and the proposed metrics combined with gist consistency/text cohesion

The Characteristic of Text Cohesion based on Simplified Lexical Chain
We output the lexical chains on the two evaluation datasets, as shown in Table 4. On MTC4, the average number of chains extracted from the references (2111) is indeed larger than that of the evaluated documents (1999), which is consistent with the observation in Wong and Kit's work. However, this observation does not hold on MTC2: Table 4 also shows that each MT system on MTC2 produces more lexical chains (2380) than the average number for its references (2030).

Conclusion
We describe two kinds of document-level measures and successfully use them to construct document-level evaluation metrics. Hybrid metrics based on a topic model can produce significant positive impacts when given a robustly trained topic model. Since important words are repeated in a text, lexical chains can not only model text cohesion but also highlight key words. Our proposed metrics thus obtain a very significant improvement for BLEU and a slight improvement for METEOR. Furthermore, hybrid metrics based on text cohesion have fewer limitations than the topic-based method, since they do not need additional training data and can be easily integrated into existing traditional metrics.
In the future, we will explore how to model more document-level features, such as co-reference matching, and we hope our study can inspire further work on document-level SMT.