chrF deconstructed: beta parameters and n-gram weights

The character n-gram F-score (CHRF) has been shown to correlate very well with human rankings of different machine translation outputs, especially for morphologically rich target languages. However, only two versions have been explored so far, namely CHRF1 (the standard F-score, β = 1) and CHRF3 (β = 3), both with uniform n-gram weights. In this work, we investigated CHRF in more detail, namely β parameters in the range from 1/6 to 6, and found that CHRF2 is the most promising version. We then investigated different n-gram weights for CHRF2 and found that uniform weights are the best option. Apart from this, CHRF scores were systematically compared with WORDF scores, and a preliminary experiment carried out on a small amount of data with direct human scores indicates that the main advantage of CHRF is that it does not penalise acceptable variation in high-quality translations too harshly.


Introduction
Recent investigations (Popović, 2015; Stanojević et al., 2015) have shown that the character n-gram F-score (CHRF) is a very promising evaluation metric for machine translation, especially for morphologically rich target languages: it is simple, it does not require any additional tools or information, it is language independent and tokenisation independent, and it correlates very well with human rankings. However, only two versions of this score have been investigated so far: the standard F-score CHRF1, where β = 1, i.e. precision and recall have the same weight, and CHRF3, where recall has three times more weight.
The CHRFβ and WORDFβ scores are calculated for all available translation outputs from the WMT14 (Bojar et al., 2014) and WMT15 shared tasks and then compared with human rankings on the segment level using Kendall's τ rank correlation coefficient.
The scores were analysed for all available target languages, i.e. English, French, German, Czech, Russian, Hindi and Finnish.

CHRF and WORDF scores
The general formula for the n-gram based F-score is:

Fβ = (1 + β²) · (ngrP · ngrR) / (β² · ngrP + ngrR)

where ngrP and ngrR stand for n-gram precision and recall, arithmetically averaged over all n-gram orders from n = 1 to N:
• ngrP (n-gram precision): percentage of n-grams in the hypothesis which have a counterpart in the reference;
• ngrR (n-gram recall): percentage of n-grams in the reference which are also present in the hypothesis;
and β is a parameter which assigns β times more weight to recall than to precision: if β = 1, they have the same weight; if β = 4, recall has four times more importance than precision; if β = 1/4, precision has four times more importance than recall. WORDF is calculated on word n-grams and CHRF on character n-grams. The maximum n-gram length N for both metrics was investigated in previous work: N = 4 is shown to be optimal for WORDF (Popović, 2011) and N = 6 for CHRF (Popović, 2015).
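As a rough illustration, the score defined above can be sketched in Python. This is a simplified reading, not the official chrF implementation: the function names are ours, and details such as whitespace handling and smoothing are omitted.

```python
from collections import Counter

def ngram_prf(hyp, ref, n):
    """Clipped n-gram precision and recall for character n-grams of order n."""
    hyp_ngrams = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
    # Overlap: each n-gram counted at most as often as it appears in both.
    overlap = sum((hyp_ngrams & ref_ngrams).values())
    prec = overlap / max(sum(hyp_ngrams.values()), 1)
    rec = overlap / max(sum(ref_ngrams.values()), 1)
    return prec, rec

def chrf(hyp, ref, n_max=6, beta=2.0):
    """CHRFbeta: arithmetic mean of per-order n-gram P and R, combined via F_beta."""
    precs, recs = [], []
    for n in range(1, n_max + 1):
        p, r = ngram_prf(hyp, ref, n)
        precs.append(p)
        recs.append(r)
    ngr_p = sum(precs) / len(precs)
    ngr_r = sum(recs) / len(recs)
    if ngr_p + ngr_r == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * ngr_p * ngr_r / (b2 * ngr_p + ngr_r)
```

With beta = 2 (CHRF2), recall carries twice the weight of precision; identical strings score 1.0 and fully disjoint strings score 0.0.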

Comparison of CHRFβ and WORDFβ scores
The CHRFβ and WORDFβ scores are calculated for the following β parameters: 1/6, 1/5, 1/4, 1/3, 1/2, 1, 2, 3, 4, 5 and 6. For each CHRFβ and WORDFβ score, the segment-level τ correlation coefficients are calculated for each translation output. In total, 20 τ coefficients were obtained for each score: five English outputs from the WMT14 task and five from WMT15, together with ten outputs in other languages, i.e. two French, two German, two Czech, two Russian, one Hindi and one Finnish. The obtained τ coefficients were then summarised into the following four values:
• mean: τ averaged over all translation outputs;
• diff: averaged difference between the τ of the particular metric and the τs of all other metrics investigated in this work;
• rank>: percentage of translation outputs where the particular metric has a better τ than the other metrics investigated in this work;
• rank≥: percentage of translation outputs where the particular metric has a better or equal τ compared to the other metrics investigated in this work.
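Under one plausible reading of these definitions (interpreting rank> as "strictly better than all other metrics on that output", and rank≥ analogously), the four summary values could be computed as follows; the function name and input layout are ours, not the paper's:

```python
def summarize(tau):
    """tau: dict mapping metric name -> list of per-output Kendall tau values.

    Returns, per metric, the tuple (mean, diff, rank_gt, rank_geq):
    mean tau, average tau difference to the other metrics, and the
    percentage of outputs where the metric beats (or ties) all others."""
    metrics = list(tau)
    n_out = len(next(iter(tau.values())))
    summary = {}
    for m in metrics:
        others = [o for o in metrics if o != m]
        mean = sum(tau[m]) / n_out
        # Average difference over all (other metric, output) pairs.
        diff = sum(tau[m][i] - tau[o][i]
                   for o in others for i in range(n_out)) / (len(others) * n_out)
        gt = sum(all(tau[m][i] > tau[o][i] for o in others)
                 for i in range(n_out)) / n_out
        geq = sum(all(tau[m][i] >= tau[o][i] for o in others)
                  for i in range(n_out)) / n_out
        summary[m] = (mean, diff, 100 * gt, 100 * geq)
    return summary
```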
These values for each metric are presented in Table 1. In addition, the values are shown separately for translation into English (Table 2) and for translation out of English (Table 3). Table 1 shows that:
• CHRF ranks better than WORDF;
• recall is more important than precision;
• the most promising metric is CHRF2;
• β = 2 is the best option both for CHRF (bold) and for WORDF (underline).

Table 1: Overall average segment-level correlation: mean τ (column 1), diff (column 2), rank> (column 3) and rank≥ (column 4) for each CHRFβ score. Bold represents the overall best value and underline represents the best WORDFβ value. The most promising metric is CHRF2.
Additional observations from Tables 2 and 3:
• for translation into English, the most promising metrics are CHRF2 and CHRF1; the best WORDFβ variant is WORDF2;
• for translation out of English, the most promising metrics are CHRF2 and CHRF3, and the best WORDFβ variants are WORDF2 and WORDF3, indicating that recall is even more important for morphologically rich(er) languages.

Regardless of these slight differences between English and non-English texts, CHRF2 can be considered the most promising variant overall.

Table 2: Translation into English: average segment-level correlation: mean τ (column 1), diff (column 2), rank> (column 3) and rank≥ (column 4) for each CHRFβ score. Bold represents the overall best value and underline represents the best WORDFβ value. The most promising metric is CHRF2.
However, taking these differences into account, together with the fact that for English CHRF1 performed better than CHRF3 in the WMT15 metrics shared task, we decided to submit CHRF2 together with CHRF1 and CHRF3 in order to be able to draw more reliable conclusions.

Investigating n-gram weights for CHRF2
As already mentioned, all CHRFβ variants explored so far are based on a uniform distribution of n-gram weights. Nevertheless, one can assume that character n-grams of different lengths are not equally important: for example, it is conceivable that character 1-grams are not really important for the assessment of translation quality. Therefore we carried out the following experiment on the best CHRF variant, namely CHRF2. The first step was to examine τ coefficients independently for each n-gram order; the results are presented in Table 4a.

Table 3: Translation from English: average segment-level correlation: mean τ (column 1), diff (column 2), rank> (column 3) and rank≥ (column 4) for each CHRFβ score. Bold represents the overall best value and underline represents the best WORDFβ value. The most promising metric is CHRF2.
The τ coefficients for each n-gram weight distribution are shown in Table 4. Although some of the proposed distributions outperform the uniform one for some of the texts, especially for translation out of English, none of them is unquestionably better than the uniform distribution of weights. Therefore, uniform n-gram weights were used for the WMT16 metrics task.

Table 4: Analysis of n-grams: (a) average τ for individual n-grams; (b) τ on WMT14 (left) and WMT15 (right) documents for different n-gram weight distributions.
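The paper does not spell out the weighting formula, but a natural variant replaces the uniform arithmetic mean over n-gram orders with a weighted one before applying the F-score combination. The sketch below assumes the per-order precisions and recalls have already been computed; the function name is ours:

```python
def weighted_chrf2(per_n_prec, per_n_rec, weights, beta=2.0):
    """CHRF2 with a weighted (instead of uniform) average over the
    per-order n-gram precisions and recalls.

    per_n_prec, per_n_rec: precision/recall for each n-gram order 1..N.
    weights: one non-negative weight per order (uniform = all equal)."""
    z = sum(weights)
    p = sum(w * x for w, x in zip(weights, per_n_prec)) / z
    r = sum(w * x for w, x in zip(weights, per_n_rec)) / z
    if p + r == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)
```

Setting all weights equal reproduces the uniform CHRF2; setting, say, the 1-gram weight to zero implements the intuition that character 1-grams may not matter for quality assessment.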

CHRF and WORDF for good and bad translations
In order to better understand the differences between the WORDF and CHRF scores, i.e. the advantages of the CHRF score, we carried out a preliminary experiment on three data sets for which absolute (direct) human scores were available. The data sets are rather heterogeneous: they contain three different target languages, they were produced and evaluated independently for different purposes, and the human scores were not defined in the same way. In addition, two of the three data sets are rather small. The described experiment is therefore rather preliminary; however, we believe that it represents a good starting point for further research on the differences between word- and character-based metrics.
τ coefficients for comparing four systems using direct human scores

The starting point was testing τ coefficients for CHRF2 and WORDF2 on the English→Spanish data set described in (Specia et al., 2010); the motivation was simply to explore the correlations obtained on direct human scores instead of relative rankings. The data set contains 4000 source segments and their reference translations, the machine translation outputs of four SMT systems, as well as human estimates of the required post-editing effort on a scale from 1 (requires complete retranslation) to 4 (fit for purpose). The distribution of segments with each of the four human ratings for each of the systems is shown in Table 5a; it can be seen that the fourth system is significantly worse than the other three, which are rather close. The obtained τ coefficients (Table 5b, first column) were, however, puzzling: the coefficients are very close, and the one for WORDF2 is even slightly higher, which differs from all the results described in the previous sections and in related work. On the other hand, taking into account that the number of systems is small and that the performance of the fourth system is clearly distinct from that of the others, another experiment was carried out: the worst system was removed and only the remaining three similar systems were compared. For this set-up, the expected results were obtained (second column), i.e. the τ coefficients are higher for the CHRF2 score. This somewhat surprising finding led to the following two hypotheses:
1. word-based metrics are good at distinguishing systems/segments of distinct quality but not so good at ranking similar systems/segments;
2. word-based metrics are good for evaluating low-quality systems/segments but not so good for evaluating high-quality systems/segments.

Table 5: English→Spanish data set with direct human scores: (a) percentage of the sentence-level human scores for each of the four systems together with the average human score for each system (system 4 is significantly worse than the other three); (b) τ coefficients for all four systems (first column) and for the three similar systems (second column).

Standard deviations of automatic metrics for different direct human scores
In order to further examine the two hypotheses, the following experiment was carried out: for each of the human ratings, the standard deviation of the corresponding automatic scores is calculated. This experiment is carried out on the previously described data set as well as on two additional small data sets:
• English→Irish SMT translations rated from 1 to 4 for overall quality (1 = bad, 4 = good);
• English→Serbian SMT translations rated from 1 to 5 in terms of adequacy and fluency (1 = bad, 5 = good); the mean of the two ratings was taken as the direct human score.
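The per-rating standard deviation computation described above can be sketched as follows (a minimal illustration; the function name and data layout are ours):

```python
from statistics import pstdev

def stdev_by_rating(ratings, scores):
    """Group sentence-level automatic scores by their human rating and
    return the (population) standard deviation of scores within each rating."""
    groups = {}
    for rating, score in zip(ratings, scores):
        groups.setdefault(rating, []).append(score)
    return {r: pstdev(v) for r, v in sorted(groups.items())}
```

Running this once with CHRF2 scores and once with WORDF2 scores, on the same sentences and human ratings, yields the per-rating deviations compared in Table 6.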
The obtained standard deviations in Table 6 show that for poorly rated sentences, the deviations of CHRF2 and WORDF2 are similar: both metrics assign relatively similar (low) scores. On the other hand, for sentences with higher human ratings, the deviations for CHRF2 are (much) lower. In addition, the higher the human rating, the greater the difference between the WORDF2 and CHRF2 deviations. These results confirm hypothesis 2, namely that CHRF is better than WORDF mainly for segments/systems of higher translation quality. The most probable reason is that CHRF, contrary to word-based metrics, does not penalise acceptable morpho-syntactic variations too harshly. The CHRF scores for good translations are therefore more concentrated in the higher range, whereas the WORDF scores are often too low. The results are also consistent with hypothesis 1; however, this one is confirmed only partially, since the outlier is a low-quality system. Further work should include a comparison of different low-quality systems.
Nevertheless, as stated at the beginning of the section, it should be kept in mind that this is only a preliminary experiment, performed on a very limited amount of data. Further experiments on larger data sets, with more systems and more languages, should be carried out in order to obtain more reliable results and better insight into the underlying phenomena.

Summary and outlook
The results presented in this work show that, in general, F-scores which are biased towards recall correlate better with human rankings than those biased towards precision. In particular, the CHRF2 version of the CHRF score with uniform n-gram weights is shown to be the most promising for machine translation evaluation. This version has therefore been submitted to the WMT16 metrics task, together with CHRF1 and CHRF3, in order to explore the differences between English and morphologically richer target languages more systematically.
In addition, it is shown that the CHRF score performs better than the WORDF score. Preliminary experiments on small data sets with available direct human scores show that for sentences of higher translation quality, the standard deviations of WORDF are much larger than those of CHRF, indicating that the main advantage of CHRF is that it does not penalise different variants of acceptable translations too strongly. However, more systematic experiments on large data sets should be carried out in this direction. Furthermore, a broader investigation including different word- and character-based metrics in addition to the two presented F-scores would be useful.
Apart from this, the application of CHRF to more distant languages such as Arabic and Chinese should be explored.