Improving evaluation and optimization of MT systems against MEANT

We show that, consistent with MEANT-tuned systems that translate into Chinese, MEANT-tuned MT systems that translate into English also outperform BLEU-tuned systems across commonly used MT evaluation metrics, even in BLEU. This result is achieved by significantly improving MEANT's sentence-level ranking correlation with human preferences through incorporating a more accurate distributional semantic model for lexical similarity and a novel backoff algorithm for evaluating MT output that the automatic semantic parser fails to parse. The surprising result that MEANT-tuned systems attain higher BLEU scores than BLEU-tuned systems suggests that MEANT is a more accurate objective function for guiding the development of MT systems towards producing more adequate translations.


Introduction
Lo et al. (2013b) showed that a MEANT-tuned system for translating into Chinese outperforms a BLEU-tuned system across commonly used MT evaluation metrics, even in BLEU. However, this phenomenon was not observed in MEANT-tuned systems translating into English. In this paper, we present for the first time MT systems for translating into English that, when tuned against an improved version of MEANT, also outperform BLEU-tuned systems across commonly used MT evaluation metrics, even in BLEU. The improvements to MEANT include incorporating a more accurate distributional semantic model for lexical similarity and a novel backoff algorithm for evaluating MT output that the automatic semantic parser fails to parse. Empirical results show that the new version of MEANT is significantly improved in terms of sentence-level ranking correlation with human preferences. (This work was completed at HKUST.)
The accuracy of MEANT relies heavily on the accuracy of the model that determines the lexical similarities of the semantic role fillers. However, the discrete context vector model based on raw co-occurrence counts, used in the original proposal of MEANT, does not work well in predicting the similarity of the words used in the reference and machine translations. Baroni et al. (2014) show that word embeddings trained by predict models, such as word2vec (Mikolov et al., 2013), outperform count-based models on a wide range of lexical semantic tasks. It is also common knowledge that raw co-occurrence counts do not work very well, and that performance can be improved by reweighting the counts for context informativeness and by dimensionality reduction. In contrast to conventional count-based word vector models, prediction-based word vector models estimate the vectors directly as a supervised task, where the weights in a word vector are set to maximize the probability of the contexts in which the word is observed in the corpus (Bengio et al., 2006; Collobert and Weston, 2008; Collobert et al., 2011; Huang et al., 2012; Mikolov et al., 2013; Turian et al., 2010).
In this paper, we show that MEANT's correlation with human adequacy judgments can be further improved by incorporating word embeddings trained by predict models. Subsequently, tuning an MT system against the improved version of MEANT produces more adequate translations than tuning against BLEU.

Figure 1: Examples of automatic shallow semantic parses. Both the reference and machine translations are parsed using automatic English SRL. There are no semantic frames for MT3 since there is no predicate in the MT output.
MEANT evaluates translation adequacy by matching the semantic frames and role fillers in the reference and machine translations. MEANT typically outperforms BLEU, NIST, METEOR, WER, CDER and TER in correlation with human adequacy judgments, and is relatively easy to port to other languages, requiring only an automatic semantic parser and a monolingual corpus of the output language, which is used to train the discrete context vector model for computing the lexical similarity between the semantic role fillers of the reference and translation. A cross-lingual quality estimation variant, XMEANT, evaluates translation quality without the need for expensive human reference translations by utilizing semantic parses of the original foreign input sentence instead of a reference translation. MEANT is generally computed as follows:

1. Apply an automatic shallow semantic parser to both the reference and machine translations. (Figure 1 shows examples of automatic shallow semantic parses on both reference and MT.)

2. Apply the maximum weighted bipartite matching algorithm to align the semantic frames between the reference and machine translations according to the lexical similarities of the predicates.
3. For each pair of the aligned frames, apply the maximum weighted bipartite matching algorithm to align the arguments between the reference and MT output according to the lexical similarity of role fillers.

4. Compute the weighted f-score over the matching role labels of these aligned predicates and role fillers:

MEANT = (2 · precision · recall) / (precision + recall)

where w_pred and w_j are the weights of the lexical similarities of the predicates and of the role fillers of the arguments of type j of all frames between the reference translation and the MT output. There are a total of 12 weights for the set of semantic role labels in MEANT as defined in Lo and Wu (2011b). The values of these weights are determined in a supervised manner for MEANT, using a simple grid search to optimize the correlation with human adequacy judgments (Lo and Wu, 2011a), and in an unsupervised manner for UMEANT, using the relative frequency of each semantic role label in the references. As shown above, MEANT uses an f-score to aggregate individual token similarities into the phrasal similarities of the semantic role fillers. Another variant of MEANT, IMEANT, uses ITG to constrain the token alignments between the semantic role fillers of the reference and machine translations, and has been shown to outperform MEANT.
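The matching and scoring steps above can be sketched as follows. This is only an illustrative sketch, not the reference implementation: the exact-match similarity function, the brute-force matcher, and the data layout are our own simplifying assumptions (the real computation uses the weighted role similarities and the 12 role-label weights described above).

```python
from itertools import permutations

def align(ref_items, mt_items, sim):
    # Maximum weighted bipartite matching by brute force -- acceptable here
    # because a sentence only has a handful of frames and role fillers.
    if len(ref_items) <= len(mt_items):
        small, large, swapped = ref_items, mt_items, False
    else:
        small, large, swapped = mt_items, ref_items, True
    best, best_score = [], -1.0
    for perm in permutations(range(len(large)), len(small)):
        pairs = list(enumerate(perm))
        score = sum(sim(small[i], large[j]) for i, j in pairs)
        if score > best_score:
            best_score, best = score, pairs
    # Report pairs as (ref_index, mt_index).
    return [(j, i) for i, j in best] if swapped else best

def f_score(precision, recall):
    # Harmonic mean used to aggregate the weighted similarities.
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# Toy usage: with exact-match similarity, frames align regardless of order.
exact = lambda a, b: 1.0 if a == b else 0.0
pairs = align(["eat", "run"], ["run", "eat"], exact)
```

The same `align` routine stands in for both step 2 (frame alignment by predicate similarity) and step 3 (argument alignment by role-filler similarity).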

Improvements to MEANT
We improve the performance of MEANT by incorporating a word embedding model for more accurate evaluation of semantic role filler similarity, and a novel backoff algorithm for evaluating translations when the automatic semantic parser fails to reconstruct their semantic structure. Our evaluation results show that the new version of MEANT correlates significantly better with human ranking preferences at both the sentence level and the document level.

Discrete context vectors vs. word embeddings
MEANT's discrete context vector model is very sparse because of its extremely high dimensionality: the number of dimensions of a vector is the total number of token types in the training corpus. This sparsity makes the lexical similarity highly sensitive to exact token matching and thus hurts the accuracy of MEANT. We tackle the sparse vector issue by replacing the discrete context vector model with continuous word embeddings. For a fair comparison, we first train the word embeddings on the same monolingual corpus as the discrete context vector model, i.e. Gigaword. However, since the memory consumption of the word embeddings is significantly lower than that of the discrete context vector model, owing to the reduced vector dimensionality, it becomes possible to increase the size of the training corpus so as to improve the token coverage of the lexical similarity model. We compare our in-house Gigaword word embeddings, which cover 1.2 million words and phrases, with the Google pretrained word embeddings (Mikolov et al., 2013), which are trained on a news dataset of 100 billion tokens and cover 3 million words and phrases. We show that the high portability of MEANT is preserved when replacing the discrete context vector model with word embeddings, as the size of the monolingual training data for the word embeddings does not significantly affect the correlation of MEANT with human adequacy judgments.
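To make the sparsity issue concrete, here is a toy sketch; the vectors below are invented for illustration (real count vectors have vocabulary-sized dimension and real embeddings a few hundred dimensions). Two related words whose raw co-occurrence counts share no context get cosine similarity exactly 0, while their dense embeddings can still lie close together.

```python
import math

def cosine(u, v):
    # Cosine similarity, the lexical similarity measure over word vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical raw co-occurrence counts over a tiny 6-word context vocabulary:
# the two words never appear with the same contexts, so cosine is exactly 0.
sparse_car, sparse_auto = [3, 0, 1, 0, 0, 0], [0, 2, 0, 0, 4, 0]

# Hypothetical dense embeddings for the same two words: still close.
dense_car, dense_auto = [0.8, 0.1, 0.3], [0.7, 0.2, 0.4]
```

This is exactly the failure mode that makes the discrete model over-reliant on exact token matches.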
Another interesting property of word embeddings is the compositionality of word vectors into phrases. As described in Mikolov et al. (2013), for example, the result of the linear vector calculation vec("Madrid") - vec("Spain") + vec("France") is closer to vec("Paris") than to any other vector. It seems natural that the phrasal similarity of the semantic role fillers could be computed more accurately using the composite phrase vector than using the align-and-aggregate approach, because the vector composition approach is not affected by token misalignment errors. However, we show that, surprisingly, the align-and-aggregate approach outperforms naive linear word vector composition in computing the phrasal similarities of the semantic role fillers.
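The two phrasal similarity strategies being compared can be sketched as follows. Note the hedges: the greedy best-match aggregation below is a simplification of MEANT's bipartite-matching f-score aggregation, and the vector table is a toy assumption.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def phrase_sim_compose(p1, p2, vec):
    # Naive linear composition: sum the token vectors, one cosine at the end.
    v1 = [sum(xs) for xs in zip(*(vec[w] for w in p1))]
    v2 = [sum(xs) for xs in zip(*(vec[w] for w in p2))]
    return cosine(v1, v2)

def phrase_sim_align(p1, p2, vec):
    # Align-and-aggregate (simplified to greedy best match per token):
    # score each token by its best counterpart, then average.
    return sum(max(cosine(vec[a], vec[b]) for b in p2) for a in p1) / len(p1)

# Toy vector table (invented): "cat" and "feline" share a direction.
vec = {"the": [1.0, 0.0], "a": [1.0, 0.0],
       "cat": [0.0, 1.0], "feline": [0.0, 1.0]}
```

On role fillers like "the cat" vs. "a feline", both strategies can be run side by side; the paper's finding is that the aggregation route is the more reliable of the two in practice.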

Backoff algorithm for evaluating translations without semantic parse
MEANT fails to evaluate the quality of a translation if the automatic semantic parser fails to reconstruct its semantic structure. According to previous error analysis, the two main reasons the automatic shallow semantic parser fails to identify semantic frames are the failure to identify frames for copula or existential senses of "be" in a perfectly grammatical sentence, and the absence of any predicate verb at all in the sentence. Manually reconstructing the "be" semantic frames for MEANT was shown to yield significantly higher correlation with human adequacy judgments. Thus, we present a novel backoff algorithm for MEANT that reconstructs the "be" semantic frame, and that otherwise evaluates the whole sentence using the lexical similarity function, weighted according to the ratio of unlabeled tokens in the MT output and reference.
The reconstruction of the "be" semantic frame is triggered when the automatic shallow semantic parser fails to find any semantic frame in the sentence. It utilizes the syntactic parse of the sentence and labels the verb-to-be as the predicate. It then labels the constituent of the NP subtree sibling immediately to the left of the predicate as the "who" role, the constituent of the NP subtree sibling immediately to the right of the predicate as the "what" role, and the constituents of any other subtree siblings of the predicate as "other" roles. The reconstructed "be" frame is then evaluated in the same way as other semantic frames in MEANT.
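A minimal sketch of this backoff step, assuming the syntactic parse has already been flattened into a list of (tag, tokens) clause children -- a deliberate simplification of the real constituency tree:

```python
def reconstruct_be_frame(siblings):
    """siblings: clause children as (tag, tokens) pairs, e.g. from a
    flattened constituency parse (a simplifying assumption)."""
    copulas = {"am", "is", "are", "was", "were", "be", "been", "being"}
    be_idx = next((i for i, (tag, toks) in enumerate(siblings)
                   if tag.startswith("VB") and toks[0].lower() in copulas), None)
    if be_idx is None:
        return None                        # no copula: nothing to reconstruct
    frame, used = {"pred": siblings[be_idx][1]}, {be_idx}
    if be_idx > 0 and siblings[be_idx - 1][0] == "NP":
        frame["who"] = siblings[be_idx - 1][1]    # NP immediately to the left
        used.add(be_idx - 1)
    if be_idx + 1 < len(siblings) and siblings[be_idx + 1][0] == "NP":
        frame["what"] = siblings[be_idx + 1][1]   # NP immediately to the right
        used.add(be_idx + 1)
    # All remaining siblings of the predicate fill "other" roles.
    frame["other"] = [toks for i, (_, toks) in enumerate(siblings) if i not in used]
    return frame
```

For a clause like "The answer is 42 in the book", this yields "is" as the predicate, "The answer" as the "who" role, "42" as the "what" role, and "in the book" as an "other" role.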
When there is no predicate verb in the whole sentence, we evaluate the whole sentence using the lexical similarity function, weighted according to the proportion of unlabeled tokens in the MT output and reference. Thus, equations (1), (2) and (3) are replaced by equations (4) and (5). Note that we have also introduced a weight α for precision and recall. Later, we show that the optimal value of α for MT evaluation is different from that for MT optimization. Table 1 shows the document-level Pearson's score correlation, and Table 2 the sentence-level Kendall's rank correlation, with human preferences of the improved version of MEANT against the previous version of MEANT on the WMT2014 metrics task test set (Macháček and Bojar, 2014). For the sake of stable performance across all tested language pairs, the weights of the semantic role labels are estimated in an unsupervised manner.
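Since equations (1)-(5) are not reproduced in this excerpt, the following is only a hedged sketch of the two ingredients introduced above: an α-weighted precision/recall combination (α = 1 reduces to recall alone, α = 0.5 to the balanced f-score) and a blend of the frame-based score with a sentence-level lexical similarity, weighted by the fraction of unlabeled tokens. The exact equations in the paper may differ from these forms.

```python
def alpha_f(precision, recall, alpha):
    # Weighted harmonic mean: alpha = 1.0 -> recall only,
    # alpha = 0.5 -> the usual balanced f-score.
    denom = alpha * precision + (1.0 - alpha) * recall
    return precision * recall / denom if denom else 0.0

def backoff_blend(frame_score, lexsim_score, n_unlabeled, n_total):
    # Weight the sentence-level lexical similarity by the fraction of
    # tokens left unlabeled by the semantic parser.
    w = n_unlabeled / n_total if n_total else 1.0
    return (1.0 - w) * frame_score + w * lexsim_score
```

With `alpha = 1.0` the score collapses to recall, which is the setting that correlates best with human preferences in the evaluation experiments below.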

Results
First and most importantly, the document-level score correlation with human preferences of all versions of MEANT consistently outperforms all the metrics submitted to Macháček and Bojar (2014). Since the variations in document-level correlation with human preferences among the different versions of MEANT are not significant, we focus our discussion on the sentence-level results.
On sentence-level ranking, MEANT with Gigaword word embeddings correlates significantly better with human preferences than MEANT with Gigaword discrete context vectors. Although the Google pretrained word embeddings cover more than twice as many token types as the Gigaword word embeddings, our results show that MEANT with the Google pretrained word embeddings is only marginally better than MEANT with the Gigaword word embeddings. This suggests that MEANT's portability to lower-resource languages is preserved, as MEANT with Gigaword word embeddings achieves comparable accuracy without requiring huge amounts of data.
While the linear vector composition property of word embeddings has received a lot of attention recently, our results show that, surprisingly, MEANT with word embeddings using the align-and-aggregate approach to computing the phrasal similarities significantly outperforms that using simple linear vector composition, across all language pairs in the test set. Our results suggest that more investigation is necessary before word embedding composition becomes useful for efficient evaluation of phrasal similarities.
Our results also show that MEANT with an α value of 1, i.e. recall only, significantly outperforms that with balanced precision and recall weighting, in correlation with human preferences. This could be due to the fact that MT systems tend to under-generate (i.e., to drop meaning from the translation output) rather than over-generate. This also explains why precision-oriented metrics, such as BLEU, usually correlate poorly with human adequacy judgments. Lastly, our results show that the novel backoff algorithm significantly improves MEANT's correlation with human preferences.

Lo et al. (2013b) show that for MT systems translating into Chinese, tuning against MEANT outperforms the common practice of tuning against BLEU or TER across commonly used MT evaluation metrics, i.e. beating BLEU-tuned systems in BLEU and TER-tuned systems in TER. However, for MT systems translating into English, previous work (Lo et al., 2013a) shows that tuning against MEANT only achieves balanced performance across both n-gram based metrics and edit distance based metrics, without overfitting to either type. We argue that, given the significant improvement in sentence-level correlation with human preferences when evaluating translations into English, the performance of MT systems tuned against the newly improved MEANT should also improve.

Tuning against the new MEANT
For the WMT2015 tuning task, we tuned the basic Czech-English baseline system against the newly improved MEANT using the official development set and k-best MERT (with a 100-best hypothesis list). Unfortunately, there was a bug in the integration of MEANT and Moses k-best MERT in the submitted system. Tables 3 and 4 show the results of both the submitted buggy system and the debugged version of the experiments on the official dev and test sets.
In the previous section, MEANT with an α value of 1, i.e. 100% recall, had the highest correlation with human preferences on the test set. However, surprisingly, our tuning experiments show that tuning against a balanced precision-recall version of MEANT yields better scores across the commonly used MT evaluation metrics. This is because the optimization algorithm needs guidance from precision to avoid blindly generating too many words in pursuit of high recall.
More importantly, our results show that the MT system tuned against the improved MEANT beats the BLEU-tuned system across the commonly used MT evaluation metrics, even in BLEU.

Related Work
Most of the commonly used MT evaluation metrics, like BLEU (Papineni et al., 2002), NIST (Doddington, 2002), CDER (Leusch et al., 2006), WER (Nießen et al., 2000), and TER (Snover et al., 2006), rely heavily on exact matches of the surface forms of the tokens in the reference and the MT output, and thus fail to capture similarity in meaning when the surface forms differ. Owczarzak et al. (2007a,b) improved the correlation with human fluency judgments by using LFG to extend the approach of evaluating syntactic dependency structure similarity in Liu and Gildea (2005), but did not improve the correlation with human adequacy judgments compared with METEOR. Similarly, TINE (Rios et al., 2011), an automatic recall-oriented evaluation metric based on basic event structure, correlates with human adequacy judgments comparably to BLEU but not as highly as METEOR. ULC (Giménez and Màrquez, 2007, 2008) incorporates several semantic similarity features and shows improved correlation with human judgments of translation quality (Callison-Burch et al., 2007; Giménez and Màrquez, 2007; Callison-Burch et al., 2008; Giménez and Màrquez, 2008), but no work has been done towards tuning an MT system against a pure form of ULC, perhaps due to its expensive run time.
By incorporating word embeddings into MEANT, translations are evaluated accurately in terms of both structural and lexical semantics, and thus an MT system tuned against the improved MEANT beats a BLEU-tuned system across commonly used metrics, even in BLEU.

Conclusion
In this paper we presented the first results of using word embeddings to improve the correlation with human adequacy judgments of MEANT, the state-of-the-art semantic MT evaluation metric. We also showed that using a smaller and easier-to-obtain monolingual corpus (e.g., Gigaword, Wikipedia) for training the word embeddings does not significantly affect the accuracy of MEANT. We showed that the align-and-aggregate approach outperforms naive linear word vector composition, although the compositional property is highly advertised as an advantage of word embeddings. We also described a novel backoff algorithm in MEANT for evaluating the meaning accuracy of the MT output when the automatic shallow semantic parser fails to parse the sentence. In this tuning shared task, we successfully integrated MEANT with the Moses framework. This enables further investigation into tuning MT systems against MEANT using newer tuning techniques and features. Most importantly, we showed that tuning an MT system against the improved version of MEANT outperforms the BLEU-tuned system across all commonly used MT evaluation metrics, even in BLEU.