chrF: character n-gram F-score for automatic MT evaluation

We propose the use of character n -gram F-score for automatic evaluation of machine translation output. Character n - grams have already been used as a part of more complex metrics, but their individual potential has not been investigated yet. We report system-level correlations with human rankings for 6 -gram F1-score ( CHR F) on the WMT 12, WMT 13 and WMT 14 data as well as segment-level correlation for 6 - gram F1 ( CHR F) and F3-scores ( CHR F3) on WMT 14 data for all available target languages. The results are very promising, especially for the CHR F3 score – for translation from English, this variant showed the highest segment-level correlations out-performing even the best metrics on the WMT 14 shared evaluation task.


Introduction
Recent investigations have shown that character level n-grams play an important role for automatic evaluation as a part of more complex metrics such as MTERATER (Parton et al., 2011) and BEER (Stanojević and Sima'an, 2014a;Stanojević and Sima'an, 2014b). However, they have not been investigated as an individual metric so far. On the other hand, the n-gram based F-scores, especially the linguistically motivated ones based on Part-of-Speech tags and morphemes (Popović, 2011), are shown to correlate very well with human judgments clearly outperforming the widely used metrics such as BLEU and TER. In this work, we propose the use of the Fscore based on character n-grams, i.e. the CHRF score. We believe that this score has a potential as a stand-alone metric because it is shown to be an important part of the previously mentioned complex measures, and because, similarly to the morpheme-based F-score, it takes into account some morpho-syntactic phenomena. Apart from that, in contrast to the related metrics, it is simple, it does not require any additional tools and/or knowledge sources, it is absolutely language independent and also tokenisation independent.
The CHRF scores were calculated for all available translation outputs from the WMT12 (Callison-Burch et al., 2012), WMT13 (Bojar et al., 2013) and WMT14 (Bojar et al., 2014) shared tasks, and then compared with human rankings. System-level correlation coefficients are calculated for all data, segment-level correlations only for WMT14 data. The scores were calculated for all available target languages, namely English, Spanish, French, German, Czech, Russian and Hindi.

CHRF score
The general formula for the CHRF score is: where CHRP and CHRR stand for character ngram precision and recall arithmetically averaged over all n-grams: • CHRP percentage of n-grams in the hypothesis which have a counterpart in the reference; • CHRR percentage of character n-grams in the reference which are also present in the hypothesis.
and β is a parameter which assigns β times more importance to recall than to precision -if β = 1, they have the same importance.

Experiments
As a first step, we carried out several experiments regarding n-gram length. Since the optimal n for word-based measures is shown to be n = 4, MTERATER used up to 10-gram and BEER up to 6-gram, we investigated those three variants. In addition, we investigated a dynamic n calculated for each sentence as the average word length. The best correlations are obtained for 6-gram, therefore we carried out further experiments only on them.
Apart from the n-gram length, we investigated the influence of the space treating it as an additional character. However, taking space into account did not yield any improvement regarding the correlations and therefore has been abandoned.

words
This is an example. characters T h i s i s a n e x a m p l e . +space T h i s i s a n e x a m p l e . In the last stage of our current experiments, we have compared two β values on the WMT14 data: the standard CHRF with β = 1 i.e. the harmonic mean of precision and recall, as well as CHRF3 where β = 3, i.e. the recall has three times more weight. The number 3 has been taken arbitraly as a preliminary value, and the CHRF3 is tested only on WMT14 data -more systematic experiments in this direction should be carried out in the future work.

Correlations with human rankings
System-level correlations The evaluation metrics were compared with human rankings on the system-level by means of Spearman's correlation coefficients ρ for the WMT12 and WMT13 data and Pearson's correlation coefficients r for the WMT14 data. Spearman's rank correlation coefficient is equivalent to Pearson correlation on ranks, and it makes fewer assumptions about the data.
Average system-level correlations for CHRF score(s) together with the word n-gram F-score WORDF and the three mostly used metrics BLEU (Papineni et al., 2002), TER (Snover et al., 2006) and METEOR (Banerjee and Lavie, 2005) are shown in Table 2. It can be seen that the CHRF score is comparable or better than the other metrics, especially the CHRF3 score. Table 3 presents the percentage of translation outputs where the particular F-score metric (WORDF, CHRF and CHRF3) has higher correlation (no ties) than the particular standard metric (BLEU, TER and METEOR). It can be seen that the WORDF score outperforms BLEU and TER for about 60% of documents, however METEOR only in less than 40%. Standard CHRF is better than METEOR for half of the documents, and better than BLEU and TER for 68% of the documents thus being definitely more promising than the wordbased metrics. Finally, CHRF3 score outperforms all standard metric for about 70-80% of texts, thus being the most promising variant.

Segment-level correlations
The segment-level quality of metrics is measured using Kendall's τ rank correlation coefficient. It measures the metric's ability to predict the results of the manual pairwise comparison of two systems. The τ coefficients were calulated only on the WMT14 data using the official WMT14 script, and the obtained WMT14 variant is reported for the WORDF score, both CHRF scores, as well as for the best ranked metrics in the shared evaluation task. Table 4 shows the τ coefficients for translation into English (above) and for translation from English (below). For translation into English, it can be seen that the CHRF3 score is again the most promising F-score. Furthermore, it can be seen that the correlations for both CHRF scores are close to the two best ranked metrics (DISCOTKPARTY and BEER) and the METEOR metrics, which is very well ranked too. For translation from English, the CHRF3 score yields the highest average correlation, and the CHRF score is comparable with the best ranked BEER metric.

Conclusions
The results presented in this paper show that the character n-gram F-score CHRF represents a promising metric for automatic evaluation of machine translation output for several reasons: it is language-independent, tokenisationindependent and it shows good correlations with human judgments both on the system-as well as   Table 3: rank> for three F-scores (WORDF, CHRF and CHRF3) in comparison with three standard metrics (BLEU, TER and METEOR) -percentage of translation outputs where the given F-score metrics has higher correlation than the given standard metric.
Kendall's τ fr-en de-en hi-en cs-en ru-en avg.  Table 4: Segment-level Kendall's τ correlations on WMT 14 data for WORDF, CHRF and CHRF3 score together with the best performing metrics on the shared evaluation task.
on the segment-level, especially the CHRF3 variant. Therefore both of the CHRF scores were submitted to the WMT15 shared metrics task. In future work, different β values should be investigated, as well as different weights for particular n-grams. Apart from this, CHRF is so far tested on only one non-European language (Hindi) -application on more languages using different writing systems such as Arabic, Chinese, etc. has to be explored systematically.