Translation reranking using source phrase dependency features

We describe an N-best reranking model based on features that combine source-side dependency syntactic information with segmentation and alignment information. Specifically, we consider segmentation-aware "phrase dependency" features.


Introduction
Dependency features have been used in the past for both direct translation and reranking (Gimpel and Smith, 2013), usually in a string-to-tree or a tree-to-tree configuration. These approaches generally require either a decoder specifically designed to produce suitable dependency structures on its output, or a specialized target-side parser capable of parsing potentially ungrammatical and unidiomatic sentences.
Instead, we investigated a tree-to-string N-best reranking model suitable for use with a standard phrase-based decoder and a standard source-side dependency parser.

Source phrase dependency model
Dependency relations in a conventional dependency tree are syntactic relations between individual words. A phrase-based decoder, instead, operates in terms of phrase pairs.
Each N-best candidate translation e_i of a source sentence f is defined by its derivation, which describes how f has been segmented into source phrases, how these source phrases have been reordered, and, for each source phrase, which corresponding target phrase has been chosen.
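As a concrete illustration, a derivation can be represented as a list of phrase pairs in target order. This is a minimal sketch with hypothetical names (`PhrasePair`, `Derivation` are illustrative, not the paper's actual data structures):

```python
from dataclasses import dataclass

@dataclass
class PhrasePair:
    src_span: tuple   # (start, end) source word indices, end exclusive
    tgt_phrase: str   # target phrase chosen for this source phrase

@dataclass
class Derivation:
    # Phrase pairs listed in *target* order, so the pair at position
    # j - 1 precedes the pair at position j in the translation even
    # when their source spans are not adjacent in source order.
    pairs: list

# Example: "la casa rossa" -> "the red house", where the decoder
# swaps the phrases covering "casa" and "rossa".
deriv = Derivation(pairs=[
    PhrasePair((0, 1), "the"),
    PhrasePair((2, 3), "red"),
    PhrasePair((1, 2), "house"),
])
```

The target-order listing makes both the segmentation and the reordering explicit: the source spans partition the source sentence, and their order in `pairs` encodes the distortion.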
In our model, we focus on the quality of phrase segmentation and reordering.

Segmentation features
The source phrases produced by the decoder's segmentation do not necessarily correspond to subtrees in the dependency parse tree (or forest) g_f of the sentence. Moreover, if the dependency parse is not projective, subtrees do not necessarily correspond to contiguous phrases under any possible segmentation.
We propose a set of features which operate at the source phrase level, inspired by the phrase dependency relations of Gimpel and Smith (2013): given a source phrase f_j in a derivation, we define the set of its parent phrases PARENTS(f_j) as the set of other phrases in the same derivation that contain at least one word which is a parent of some word in f_j. We also define the sets of left parents PARENTS_L(f_j), right parents PARENTS_R(f_j), left children CHILDREN_L(f_j) and right children CHILDREN_R(f_j). Note that only word dependency relations that cross phrase boundaries are relevant to the definition of these phrase dependency relations.
We propose a set of segmentation phrase feature functions over these relations. When phrase segmentation breaks syntactic structures, these features should be able to detect it, and the model will penalize (or perhaps reward) different types of breakage using parameters learned automatically during tuning, similarly to Cherry (2008) or Marton and Resnik (2008).
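The directional phrase dependency sets can be computed from the word-level head indices and the phrase segmentation. The sketch below is illustrative (the function name and set labels `PL`/`PR`/`CL`/`CR` are our own shorthand, not the paper's notation); it counts only dependencies that cross phrase boundaries, as required by the definition:

```python
def phrase_dependency_sets(heads, phrases):
    """Compute directional phrase dependency sets for one derivation.

    heads[i] is the head word index of word i (-1 for the root).
    phrases is a list of (start, end) source spans, end exclusive,
    that partition the source sentence. Returns, for each phrase
    index j, the sets of left/right parent and left/right child
    phrase indices. Intra-phrase dependencies are ignored.
    """
    # Map each source word to the index of the phrase covering it.
    phrase_of = {}
    for j, (s, e) in enumerate(phrases):
        for w in range(s, e):
            phrase_of[w] = j

    rel = {j: {"PL": set(), "PR": set(), "CL": set(), "CR": set()}
           for j in range(len(phrases))}
    for w, h in enumerate(heads):
        if h < 0:
            continue  # the root word has no parent
        pw, ph = phrase_of[w], phrase_of[h]
        if pw == ph:
            continue  # dependency inside one phrase: not relevant
        # ph is a parent phrase of pw; direction follows source order.
        if h < w:
            rel[pw]["PL"].add(ph)  # parent word lies to the left
            rel[ph]["CR"].add(pw)  # so pw is a right child of ph
        else:
            rel[pw]["PR"].add(ph)
            rel[ph]["CL"].add(pw)
    return rel
```

For "la casa rossa" with "casa" as root (heads `[1, -1, 1]`) and segmentation `[(0, 2), (2, 3)]`, the phrase covering "rossa" has the phrase covering "la casa" as a left parent, and symmetrically the latter has the former as a right child.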

Distortion features
We consider pairs of source phrases which are aligned to target phrases that are contiguous in target order.
We also define an inversion feature function, a(j − 1) > a(j), which is included both as an individual feature and in logical conjunction with each of the feature functions defined above, resulting in a total of nine boolean distortion feature functions.
These features detect reordering operations which swap syntactic structures related by a dependency relation between themselves or with a shared parent structure, similarly to the reordering operations in the synchronous dependency insertion grammar of Ding and Palmer (2005) or the syntactic coupling features of Nikoulina and Dymetman (2008).
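The distortion feature scheme can be sketched as follows. Here `src_positions[j]` plays the role of a(j), the source-side position of the phrase aligned to the j-th target phrase, and `pair_feats` stands in for the paper's dependency-based pairwise features (their concrete definitions are not reproduced here); with four such base features, the scheme yields the nine boolean features per adjacent pair mentioned above:

```python
def distortion_features(src_positions, pair_feats):
    """Sketch of the distortion features for one derivation.

    src_positions[j] = a(j): source position of the phrase aligned to
    the j-th target phrase (target order). pair_feats is a list of
    boolean feature functions f(j) over the adjacent target-order pair
    (j - 1, j). For each adjacent pair we emit: the inversion indicator
    a(j-1) > a(j), each base feature, and each conjunction
    (inversion AND base feature).
    """
    feats = []
    for j in range(1, len(src_positions)):
        inverted = src_positions[j - 1] > src_positions[j]
        row = [inverted]
        for f in pair_feats:
            v = f(j)
            row.append(v)            # base feature on its own
            row.append(inverted and v)  # conjunction with inversion
        feats.append(row)
    return feats
```

With one base feature that always fires, the source order `[0, 2, 1]` produces no inversion for the first adjacent pair and an inversion (with its conjunction firing) for the second.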

Scoring model
The feature functions defined in the two previous paragraphs are combined into a vector which is concatenated to the feature vector produced by the decoder and multiplied by a parameter vector θ to obtain the final reranking score for each candidate translation. θ is trained using a standard machine translation tuning technique, namely K-best batch MIRA (Cherry and Foster, 2012).
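The scoring step amounts to a dot product over the concatenated feature vector. This is a minimal sketch with illustrative names; in practice θ would be learned with k-best batch MIRA rather than set by hand:

```python
def rerank_score(decoder_feats, syntax_feats, theta):
    """Score one candidate: concatenate the decoder's feature vector
    with our syntactic phrase features and take the dot product with
    the learned parameter vector theta."""
    feats = decoder_feats + syntax_feats
    assert len(feats) == len(theta)
    return sum(f * t for f, t in zip(feats, theta))

def best_candidate(candidates, theta):
    """Pick the highest-scoring hypothesis from an N-best list.
    candidates: list of (hypothesis, decoder_feats, syntax_feats)."""
    return max(candidates,
               key=lambda c: rerank_score(c[1], c[2], theta))[0]
```

Since the reranker only rescores a fixed N-best list, the search over candidates is a single pass, which is consistent with the negligible overhead reported below.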

Experiments
Setup We tested our model on an Italian-to-English 1000-best translation reranking task.
We trained the baseline phrase-based system using a parallel corpus assembled from Europarl v7 (Koehn, 2005), JRC-ACQUIS v2.2 (Steinberger et al., 2006) and additional bilingual articles crawled from online newspaper websites, totaling 3,081,700 sentence pairs, which were split into a 3,075,777-pair phrase-table training corpus, a 3,923-pair tuning corpus, and a 2,000-pair test corpus.
We trained and tuned phrase-based Moses (Koehn et al., 2007) using a "sparse features" configuration (the "word translation" and "phrase translation" feature sets described by Chiang et al. (2009)). We performed model parameter tuning using k-best batch MIRA. Non-projective dependency parse trees (forests, in fact) for the Italian source sentences were computed using the transition-based DeSR parser in its tree-revision configuration (Attardi and Ciaramita, 2007).

Results
The results of these experiments are shown in fig. 1.
We obtain a small but significant BLEU score improvement.
We also experimented with slightly different feature function configurations but obtained lower scores, although never lower than the decoder's baseline score.
From a computational point of view, the reranker adds a negligible overhead to the runtime of the decoder, even in our unoptimized Python implementation.

Conclusions and future work
We identified a set of syntactic dependency features which can provide small but significant translation quality improvements when used in N-best reranking, at least on the Italian-to-English language pair. We need to perform experiments on other language pairs to determine whether this result generalizes.
Spurious effects due to optimizer instability that cannot be detected by our significance tests might be present. More advanced statistical tests, such as those of Clark et al. (2011), should be performed to increase confidence in the validity of our result.
In addition to reranking, our feature functions could also be used during decoding in a standard phrase-based or hierarchical translation system without a significant increase in decoding complexity, since they decompose additively over phrases or pairs of phrases adjacent in target order. Performing such experiments is a natural extension of our work.