EED: Extended Edit Distance Measure for Machine Translation

Over the years a number of machine translation metrics have been developed in order to evaluate the accuracy and quality of machine-generated translations. Metrics such as BLEU and TER have been used for decades. However, with the rapid progress of machine translation systems, the need for better metrics is growing. This paper proposes an extension of the edit distance, which achieves better human correlation, whilst remaining fast, flexible and easy to understand.


Introduction
Machine Translation (MT) has been a popular research topic for the past few years. It deals with the problem of how to automatically translate a sentence or a set of sentences from a source language to a different target language. In statistical MT, this can be formally described as finding the translation e_1^I = e_1 ... e_i ... e_I with the highest probability for a given source language sentence f_1^J = f_1 ... f_j ... f_J:

ê_1^I = argmax_{e_1^I} Pr(e_1^I | f_1^J)     (1)

This approach models the translation task as a search for the sentence that best suits a given criterion, for example through log-linear models as described by Och and Ney, 2002. However, all approaches have to be evaluated to quantify the quality and accuracy of the produced translations. Naturally, the best method would be to have human experts rate each produced translation in order to evaluate the whole MT system. This, however, is a costly process and is not viable during the development of MT systems. For this reason, a number of metrics exist that automate the process and use different scoring methods to evaluate the produced translation automatically based on a reference sentence. Two of the earliest and most popular metrics are BLEU (Papineni et al., 2002) and TER (Snover et al., 2006). This paper introduces a new MT metric: Extended Edit Distance (EED), based on an extension of the Levenshtein distance (Levenshtein, 1966). This metric follows a number of criteria:
• It is bound between zero and one.
• Its definition is kept simple, as it does not depend on external dictionaries or language analysis.
• It has competitive human correlation.
• It is fast to compute.
The remainder of this paper is structured as follows: first, related work is reviewed in Section 2; Section 3 introduces the concept of edit distance and its existing extensions; Section 4 presents the EED metric in detail; a comparison with other metrics regarding human correlation and speed is performed in Section 5; finally, a conclusion is drawn in Section 6.

Background
MT metrics compute a score based on the output of an MT system, here called the "candidate", and a provided "reference" sentence. The reference is a valid translation of the original source sentence into the target language, usually obtained from a human expert. A metric aims to use the pair of reference and candidate to assign a numerical value to the correctness of the translation. A naïve approach would be to directly compare the candidate and the reference in order to assess translation quality. This, however, cannot be a good evaluation criterion since human language has multiple ways of expressing the same idea, and thus there is seldom one unique translation of a sentence from one language to another.
Over the years, a number of metrics have been created based on a variety of ideas and principles.
Count-based metrics compute the n-grams of both reference and candidate and then compare them with each other using a scoring function. One of the most widely used metrics, BLEU, uses word-level n-grams as input to a modified version of precision to evaluate translation accuracy. Furthermore, a brevity penalty is applied if the candidate is shorter than the reference. CHRF uses the F-score to produce a score based on character-level n-grams. In most cases, the shift from word-level to character-level n-grams results in better human correlation (Popović, 2015).
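To make the count-based idea concrete, a CHRF-style character-level F-score can be sketched in a few lines. This is a simplified illustration, not the official CHRF implementation (which differs in whitespace handling, smoothing and, for CHRF+, a word-level component); the names `chrf_like_score`, `max_n` and `beta` are ours.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count the character n-grams of a sentence (spaces included)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf_like_score(candidate, reference, max_n=6, beta=2.0):
    """Average F-beta score over character n-gram orders 1..max_n.

    beta > 1 weights recall higher than precision, as CHRF does.
    """
    f_scores = []
    for n in range(1, max_n + 1):
        cand = char_ngrams(candidate, n)
        ref = char_ngrams(reference, n)
        if not cand or not ref:
            continue  # sentence shorter than n
        overlap = sum((cand & ref).values())  # clipped n-gram matches
        precision = overlap / sum(cand.values())
        recall = overlap / sum(ref.values())
        if precision + recall == 0:
            f_scores.append(0.0)
            continue
        f_scores.append((1 + beta ** 2) * precision * recall
                        / (beta ** 2 * precision + recall))
    return sum(f_scores) / len(f_scores) if f_scores else 0.0
```

An identical candidate and reference score 1.0; completely disjoint strings score 0.0.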
Edit distance based metrics utilise the edit distance to express the difference between the candidate and the reference. Since written language allows for the word order to be changed without significant change in meaning, the pure edit distance is too restrictive and is often extended by additional operations. TER extends it by introducing "shifts" which allow for words or phrases to be moved from one position in the candidate to another with a certain cost.
CDER gives another solution to the problem by introducing the operation of jumps. These "jumps" allow for a more flexible alignment. Of course, as in the n-gram based metrics, it is possible to apply these methods at both the word and the character level. CHARACTER uses the edit distance at the character level while keeping the shift operations at the word level with suitably adjusted costs.

Edit Distance
Since the metric presented in this paper belongs to the category of the edit distance based metrics, a more thorough introduction to the concept of edit distance is needed. The goal of the Levenshtein distance is to find the minimum number of operations required to transform the candidate into the reference. The Levenshtein distance in its purest form consists of three basic operations:
• Substitution: the act of switching one symbol with another
• Deletion: the removal of a symbol
• Insertion: the addition of a symbol
All of the basic operations are defined as having a uniform cost of one. To not penalise matching symbols with substitutions, the substitution cost can be defined via the Kronecker delta: 1 − δ(c_n, r_m), with c_n and r_m standing for the symbols at positions n ∈ {1, 2, ..., |c|} and m ∈ {1, 2, ..., |r|} of the candidate c and the reference r, respectively. The edit distance is then computed as the sum of the substitution, insertion and deletion operations made. It can be computed efficiently via the dynamic programming algorithm by Wagner and Fischer, 1974, which allows for a computation in O(|c| · |r|).
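The Wagner-Fischer recurrence described above can be sketched as follows (a minimal Python illustration with uniform costs; the function name is ours):

```python
def levenshtein(candidate, reference):
    """Plain Levenshtein distance via the Wagner-Fischer dynamic
    program, in O(|c| * |r|) time and O(|r|) memory.

    Substitution, insertion and deletion each cost 1; matching
    symbols (the Kronecker-delta case) cost 0.
    """
    # prev[m] = distance between the empty prefix and reference[:m]
    prev = list(range(len(reference) + 1))
    for n, c_sym in enumerate(candidate, start=1):
        curr = [n]  # transforming candidate[:n] into the empty string
        for m, r_sym in enumerate(reference, start=1):
            substitution = prev[m - 1] + (0 if c_sym == r_sym else 1)
            insertion = curr[m - 1] + 1
            deletion = prev[m] + 1
            curr.append(min(substitution, insertion, deletion))
        prev = curr
    return prev[-1]
```

For example, `levenshtein("kitten", "sitting")` yields 3 (two substitutions and one insertion).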
In MT, the Levenshtein distance is not usually used in its original definition since it does not provide the required flexibility. The reason is that written language allows for multiple ways to express the same concept or idea. To alleviate this problem extensions to the edit distance have been proposed.
The most prominent extension of the edit distance, implemented by both TER and CHARACTER, is the introduction of an additional operation prior to computing the edit distance on the candidate. Namely, to permute the words in the candidate to most closely match the reference. This permutation is termed shift. Since computing all possible shifts of a given sentence is quite costly, in practice, the beam search algorithm is used to reduce the search space.
Another possible extension of the edit distance is to define so called jumps. Jumps provide the opportunity to continue the edit distance computation from a different point. A more detailed explanation of the jumps is presented in the next section.
To obtain a final score, the edit distance is normalised either over the length of the candidate or over the length of the reference. Naturally, in the case where every symbol is wrong and the normalising term is the shorter one of the candidate and the reference, the resulting score may significantly exceed 1.0. This in turn results in scores which are not easily interpretable.
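A small worked example makes this boundary problem concrete (the helper function below is purely illustrative):

```python
def normalised_score(distance, length):
    """Edit distance divided by a chosen normalisation length."""
    return distance / length

# Reference "ab", candidate "xxxxx": no symbol matches, so turning the
# candidate into the reference costs 2 substitutions + 3 deletions = 5.
candidate, reference, distance = "xxxxx", "ab", 5
normalised_score(distance, len(candidate))  # 1.0 -- stays bounded
normalised_score(distance, len(reference))  # 2.5 -- exceeds 1.0
```

Normalising over the shorter of the two lengths (here the reference) yields 2.5, which is no longer interpretable as an error rate.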

Extended Edit Distance
One aspect of each metric is its input, which usually comes in tokenized form. Punctuation marks are separated from words via a white space, while abbreviation dots are kept next to the word, e.g. "e.g.". EED additionally adds a white space at both the beginning and the end of each sentence.
EED utilises the idea of jumps as an extension of the edit distance. It operates at the character level and is defined as follows:

EED = min( (e + α · j + ρ · v) / (|r| + ρ · v), 1 )     (2)

where e is the sum of the edit operation costs, with a uniform cost of 1 for insertions and substitutions and 0.2 for deletions; j denotes the number of jumps performed, weighted by the control parameter α = 2.0; and v is the number of characters that have been visited multiple times or not at all, scaled by ρ = 0.3. The parameter values have been optimised based on the average correlation scores (both from and to English) from WMT17 and WMT18 (Bojar et al., 2017; Ma et al., 2018). EED is normalised over the length of the reference |r| and the coverage penalty. To keep it within the [0, 1] interval, the minimum between 1 and the metric score is taken. This makes the metric more robust in cases of extreme discrepancy between candidate and reference length.

Jumps are a way to move between characters or blocks thereof and can be incorporated into the dynamic programming algorithm for the Levenshtein distance (Leusch et al., 2006). This provides an optimal solution for the matching between candidate and reference in reasonable computation time. In EED, jumps may only be performed when a blank in the reference is reached, allowing the metric to take word boundaries into account and restricting inter-word jumps. Figure 1 illustrates the way jumps work. Here "Die Fans" from the reference is aligned with "die Fans" from the candidate via a jump, after which normal edit distance operations are performed. When the "s" is reached, another jump is made to the blank before "n", in order to align "nicht" to "Nicht". Finally, another jump is performed to align the period and white spaces. In total, this results in two edit operation errors (from the difference in capitalisation) and three jumps.
To further refine the metric, a coverage penalty is introduced that aims to penalise characters in the candidate which are aligned more than once or not at all. This allows the metric to penalise repetition of words in the reference with more than just the jump costs. The sum v of visits for all characters visited more than once is computed and added, after multiplication with the scaling factor ρ, to the total cost. To keep the situations where 1 is chosen by the minimum in Equation (2) as rare as possible, the coverage penalty is also used in the denominator.
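Assuming the edit cost e, the jump count j and the coverage count v have already been produced by the dynamic programming search, the final scoring step described in this section can be sketched as follows (the function name and argument layout are ours; this is an illustration, not the reference implementation):

```python
def eed_score(e, j, v, ref_len, alpha=2.0, rho=0.3):
    """Combine the EED components into the final score.

    e       -- summed edit-operation costs (1.0 for insertions and
               substitutions, 0.2 for deletions)
    j       -- number of jumps, weighted by alpha
    v       -- characters visited more than once or not at all; its
               scaled cost also enters the normalisation term
    ref_len -- reference length |r|, including the padding blanks
    The minimum with 1 keeps the score inside [0, 1].
    """
    return min(1.0, (e + alpha * j + rho * v) / (ref_len + rho * v))

# A perfect match has no edits, jumps or coverage violations:
eed_score(0, 0, 0, 10)   # -> 0.0
# Extreme mismatches are cut to the upper bound:
eed_score(20, 5, 0, 10)  # -> 1.0
```

Note how the cut to 1 makes a hopeless candidate score exactly 1.0 instead of an uninterpretable value above one.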
Using only the length of the reference as the normalisation factor does not guarantee that the metric score is in the range [0, 1]. This is undesirable since scores above one are not interpretable as an error measure. For this reason, a number of strategies were considered to enforce this bound:
• taking the maximum length of candidate and reference;
• taking the average length of candidate and reference;
• using just the candidate or just the reference;
• cutting the score to 1.0 if it is above 1.0;
• mapping the score to accuracy via the function 1/(1 + EED) (Zhang et al., 2011).
Out of all of these, the simplest and most effective method is to use the reference for normalisation and to cut the score if it is above one. In our experiments, taking the maximum or average length of candidate and reference leads to a decline in correlation. The accuracy mapping yields different results depending on the parameter setting of the metric and the test set used. For this reason, EED uses the cut method for normalisation. Although EED utilises the same movement technique as CDER, there are a few notable differences:
• the edit distance is computed at the character level;
• jumps are performed only upon reaching a blank in the reference;
• an additional penalty for multiple matching of the same symbol (the coverage cost) is applied.

Results
EED is implemented in C++ and made available in Python via a wrapper. This implementation retains the ease of use of Python while obtaining the speed of a C++ implementation. EED was evaluated via the scripts provided by Ma et al., 2018 as part of WMT18. The evaluation is done at both the segment and the system level.
The data consists of about 3000 sentences per language pair from the newstest2018 test set and provides one reference per translation. In total there are 14 language pairs. For the system level evaluation, direct assessment (DA) was used to obtain human scores, and Pearson's r is used as the correlation coefficient. The segment level uses relative ranking (RR) pooled from the system level DA scores, resulting in DARR. The correlation coefficient used at the segment level is the Kendall's τ-like formulation defined by Graham et al., 2015.

To obtain the best possible human correlation, a parameter search was performed over ρ, α and the edit operation costs. For substitutions and insertions there is no relevant correlation improvement. However, changes to the deletion cost parameter resulted in human correlation improvement. Since searching over the whole search space is infeasible, the parameter search on the WMT18 segment level test set was done in a sequential manner. The results of the search are shown in Figure 2. From these results, combined with the findings on WMT16 and WMT17 (Bojar et al., 2016, 2017), the deletion cost is set to 0.2.

The error distribution of EED was skewed quite heavily towards performing jumps, even after restricting the jump operation to blanks on the reference side. For this reason it was restricted further by increasing the jump cost. In order to determine the optimal jump penalty α, a parameter search was performed, which is presented in Figure 3. It is evident that the optimal jump cost lies close to 2.0 for the to-English direction. For the from-English direction the optimum is clear, thus α is set to 2.0.
Similar to the deletion cost and the jump penalty, a parameter search was carried out for the coverage cost in order to increase human correlation. The results of the search are presented in Figure 4. The resulting optimum is ρ = 0.3.
After the parameter tuning, the performance of EED was measured by the human correlation achieved on the WMT18 test set. The results, obtained at the segment and system level and in both the to-English and from-English directions, are presented in Tables 1 to 4. At the segment level, EED offers competitive results compared with the top-ranking metrics BEER, RUSE and CHRF+. At the system level, EED performs best for the from-English direction, followed by CHARACTER and CDER. For the to-English direction, EED is the second best after RUSE.
Apart from human correlation, the performance of EED was compared to that of the most common metrics. This measurement was performed by letting each metric evaluate 1M (10^6) sentence pairs and tracking the time and memory needed to complete the task. The following metrics were tested: BEER, BLEU, CHARACTER, CHRF and EED. The results of the resource usage test are summarised in Table 5. The fastest is BLEU, followed by EED. Concerning memory usage, all metrics have similar needs, except for the shift based metrics, which needed considerably more. Since CHARACTER needs more memory, candidate sentences longer than 200 words were truncated to 200 words for this test.

Conclusion
A number of different metrics have been developed over the years to help evaluate MT. Metrics such as BLEU and TER have been used for some time, but are surpassed by others both in terms of speed and human correlation.
EED provides a fast and reliable automatic evaluation measure. It achieves competitive human correlation in comparison to the best metrics, BEER and CHRF, and surpasses the most used metrics, BLEU and TER. Due to its simplicity and low resource usage, it can be used to quickly evaluate an MT system's output during development.
Since a number of metrics are based on extensions of the Levenshtein distance, a more in-depth analysis of the field is warranted. Furthermore, the relationship between shifts and jumps will be investigated in future work.