A Study in Improving BLEU Reference Coverage with Diverse Automatic Paraphrasing

We investigate a long-perceived shortcoming in the typical use of BLEU: its reliance on a single reference. Using modern neural paraphrasing techniques, we study whether automatically generating additional *diverse* references can provide better coverage of the space of valid translations and thereby improve its correlation with human judgments. Our experiments on the into-English language directions of the WMT19 metrics task (at both the system and sentence level) show that using paraphrased references does generally improve BLEU, and when it does, the more diverse the better. However, we also show that better results could be achieved if those paraphrases were to specifically target the parts of the space most relevant to the MT outputs being evaluated. Moreover, the gains remain slight even when human paraphrases are used, suggesting inherent limitations to BLEU’s capacity to correctly exploit multiple references. Surprisingly, we also find that adequacy appears to be less important, as shown by the high results of a strong sampling approach, which even beats human paraphrases when used with sentence-level BLEU.


Introduction
There is rarely a single correct way to translate a sentence; work attempting to encode the entire translation space of a sentence suggests there may be billions of valid translations (Dreyer and Marcu, 2012). Despite this, in machine translation (MT), system outputs are usually evaluated against a single reference. This especially affects MT's dominant metric, BLEU (Papineni et al., 2002), since it is a surface metric that operates on explicit n-gram overlap. Almost since its creation, BLEU's status as the dominant metric for MT evaluation has been challenged (e.g., Callison-Burch et al., 2006; Mathur et al., 2020). Such work typically uses only a single reference, however, which is a deficient form of the metric, since one of BLEU's raisons d'être was to permit the use of multiple references, in a bid to represent "legitimate differences in word choice and word order." Unfortunately, multiple references are rarely available due to the high cost and effort of producing them. One way to inexpensively create them is with automatic paraphrasing. This has been tried before (Zhou et al., 2006; Kauchak and Barzilay, 2006), but only recently have paraphrase systems become good enough to generate fluent, high-quality sentential paraphrases (with neural MT-style systems). Moreover, it is currently unclear (i) whether adding automatically paraphrased references can provide the diversity needed to better cover the translation space, and (ii) whether this increased coverage overlaps with observed and valid MT outputs, in turn improving BLEU's correlation with human judgments.
We explore these questions, testing on all into-English directions of the WMT19 metrics shared task (Ma et al., 2019) at the system and segment level. We compare two approaches: (i) generating diverse references with the hope of covering as much of the valid translation space as possible, and (ii) more directly targeting the relevant areas of the translation space by generating paraphrases that contain n-grams selected from the system outputs. This allows us to compare the effects of diversity against an upper bound that has good coverage. We anchor our study by comparing automatically produced references against human-produced ones on a subset of our data.
Our experiments show that adding paraphrased references rarely hurts BLEU and can provide moderate gains in its correlation with human judgments. Where it does help, the gains are correlated with diversity (and less so with adequacy), but show diminishing returns, and fall short of the non-diverse method designed specifically to increase coverage. Manual paraphrasing does give the best system-level BLEU results, but even these gains are relatively limited, suggesting that diversity alone has its limits in addressing the weaknesses of surface-based evaluation metrics like BLEU.

Related Work
Paraphrasing for MT evaluation There is a long history of using paraphrasing to overcome the limitations of BLEU-style metrics. Some early approaches rely on external resources (e.g. WordNet) to provide support for synonym matching (Banerjee and Lavie, 2005; Kauchak and Barzilay, 2006; Denkowski and Lavie, 2014). More automatic methods of identifying paraphrases have also been developed. An early example is ParaEval (Zhou et al., 2006), which provides local paraphrase support using paraphrase sets automatically extracted from MT phrase tables. More recently, Apidianaki et al. (2018) exploit contextual word embeddings to build automatic HyTER networks. However, they achieve mixed results, particularly when evaluating high-performing (neural) models.
The use of MT systems to produce paraphrases has also been studied previously. Albrecht and Hwa (2008) create pseudo-references by using out-of-the-box MT systems and see improved correlations with human judgments, helped by the systems being of better quality than those evaluated. This method was extended by Yoshimura et al. (2019), who filter the pseudo-references for quality. An alternative strategy is to use MT-style systems as paraphrasers, applied to the references. Madnani et al. (2007) show that additional (paraphrased) references, even noisy ones, reduce the number of human references needed to tune an SMT system, without significantly affecting MT quality. However, because their aim is coverage rather than quality, their paraphrases are unlikely to be good enough for use in a final evaluation metric.
Despite the attention afforded to the task, success has been limited by the fact that until recently, there were no good sentence-level paraphrasers (Federmann et al. (2019) showed that neural paraphrasers can now outperform humans for adequacy and cost). Attempts (e.g. Napoles et al., 2016) using earlier MT paradigms were not able to produce fluent output, and publicly available paraphrase datasets have only been recently released (Wieting and Gimpel, 2018;Hu et al., 2019a). Moreover, most works focus on synonym substitution rather than more radical changes in sentence structure, limiting the coverage achieved.
Structurally diverse outputs Diverse generation is important to ensure a wide coverage of possible translations. Diversity, both lexical and structural, has been a major concern of text generation tasks (Colin and Gardent, 2018; Iyyer et al., 2018). State-of-the-art neural MT-style text generation models used for paraphrasing (Prakash et al., 2016; Mallinson et al., 2017) typically suffer from limited diversity in the beam. Techniques such as sampling from the model distribution or from noisy outputs have been proposed to tackle this (Edunov et al., 2018) but can harm output quality.
An effective strategy to encourage structural diversity is to add syntactic information (which can be varied) to the generated text. The constraints can be specified manually, for example by adding a parse tree (Colin and Gardent, 2018; Iyyer et al., 2018) or by specifying more abstract constraints such as rewriting embeddings (Xu et al., 2018). A similar but more flexible approach was adopted more recently by Shu et al. (2019), who augment target training sentences with cluster pseudo-tokens representing the structural signature of the output sentence. When decoding, the top cluster codes are selected automatically using beam search, and a different hypothesis is produced for each one. We adopt Shu et al.'s approach here, due to the automatic nature of constraint selection and the flexibility afforded by constraint definition, allowing us to test different types of diversity by varying the sentence clustering method.

Generating paraphrased references
We look at two ways to produce paraphrases of English references using English-English NMT architectures. The first (Sec. 3.1) aims for maximal lexical and syntactic diversity, in a bid to better cover the space of valid translations. In contrast, the second (Sec. 3.2) aims to produce paraphrases that target the most relevant areas of the space (i.e. that are as close to the good system outputs as possible). Of course, not all outputs are good, so we attempt to achieve coverage while maintaining adequacy to the original reference by using information from the MT outputs. While less realistic practically, this approach furthers the study of the relationship between diversity and valid coverage.

Creating diverse paraphrases
To encourage diverse paraphrases, we use Shu et al.'s (2019) method for diverse MT, which consists in clustering sentences according to their type and training a model to produce outputs corresponding to each type. Applied to our paraphrasing scenario, the methodology is as follows (a minimal data-preparation sketch is given after the list):

1. Cluster target sentences by some property (e.g., a semantic or syntactic representation);
2. Assign a code to each cluster and prefix each target sentence in the training data with its code (a pseudo-token), e.g.:
   cl 14 They knew it was dangerous .
   cl 101 They had chickens, too .
   cl 247 That 's the problem .
3. Train an NMT-style paraphrase model using this augmented data;
4. At test time, apply the paraphraser to each reference in the test set; beam search is run for each of the n most probable sentence codes to produce n paraphrases per reference.
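To make steps 1–2 concrete, here is a minimal sketch of the data preparation, assuming a generic `encode` function standing in for the LASER or TreeLSTM sentence encoder; the pseudo-token format and function names are illustrative rather than the exact implementation.

```python
from sklearn.cluster import KMeans

def add_cluster_codes(targets, encode, n_clusters=256, seed=0):
    """Cluster target sentences and prefix each with its cluster pseudo-token.

    targets : list of target-side training sentences (strings)
    encode  : callable mapping a list of sentences to a 2-D array of
              sentence vectors (standing in for LASER or the TreeLSTM encoder)
    """
    vectors = encode(targets)
    labels = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(vectors)
    # Prefix each target sentence with its cluster pseudo-token,
    # e.g. "cl_14 They knew it was dangerous ."
    return [f"cl_{label} {sent}" for label, sent in zip(labels, targets)]
```

The augmented target sentences are then paired with the unchanged source side to train the paraphraser (step 3); at test time the model decodes once per high-probability cluster code (step 4).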
As in (Shu et al., 2019), we test two different types of diversity: semantic, using LASER sentential embeddings (Artetxe and Schwenk, 2019), and syntactic, using a TreeLSTM encoder (Tai et al., 2015). Both methods encode each sentence as a vector, and the vectors are clustered using k-means into 256 clusters (full details in App. C).
Syntactic: As in (Shu et al., 2019), we encode constituency trees into hidden vectors using a TreeLSTM-based recursive autoencoder, with the difference that we use k-means clustering to make the method more comparable to the above, and we encode syntactic information only.

Output-guided constrained paraphrases
Diversity is good, but even a highly diverse set of references may not necessarily lie in the same space as the MT outputs. We attempt to achieve high coverage of the system outputs by using a weak signal from those outputs. The signal we use is unrewarded n-grams from the best systems, i.e. n-grams that appear in system outputs but are absent from the original reference. We identify them as follows. For each sentence in a test set, we find all n-grams that are (a) not in the reference but (b) present in at least 75% of the system outputs, (c) considering only the top half of systems in the human system-level evaluation (Barrault et al., 2019). Then, for each such n-gram, we generate one paraphrase of the reference using constrained decoding (Post and Vilar, 2018), with that n-gram as a constraint. This gives a variable-sized set of paraphrased references for each sentence. In order to limit overfitting to the best systems, we use a cross-validation framework, in which we randomly split the submitted systems into two groups, the first used to compute the n-gram constraints and the augmented references, and the second used for evaluation. We repeat this ten times and report the average correlation across the splits.
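The n-gram selection itself is straightforward to sketch; whitespace tokenisation, the n-gram orders considered, and the helper names below are assumptions rather than the exact implementation.

```python
from collections import Counter

def ngrams(tokens, n):
    """Set of n-grams (as tuples) of order n in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def unrewarded_ngrams(reference, top_system_outputs, orders=(1, 2, 3, 4), min_frac=0.75):
    """N-grams absent from the reference but present in at least `min_frac`
    of the outputs of the top-half systems for this sentence."""
    ref_tokens = reference.split()
    ref_ngrams = set().union(*(ngrams(ref_tokens, n) for n in orders))
    counts = Counter()
    for output in top_system_outputs:
        toks = output.split()
        seen = set().union(*(ngrams(toks, n) for n in orders))
        counts.update(seen - ref_ngrams)   # count each n-gram once per system
    threshold = min_frac * len(top_system_outputs)
    return [ng for ng, c in counts.items() if c >= threshold]
```

Each selected n-gram then becomes a single positive constraint for constrained decoding, yielding one paraphrased reference per constraint.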

Experiments
Our goal is to assess whether we can generate paraphrases that are representative of the translation space and which, when used with BLEU, improve its utility as a metric. We therefore carry out experiments to (i) evaluate the adequacy and diversity of our paraphrases (Sec. 5.2) and (ii) compare the usefulness of all methods in improving BLEU's correlation with human judgments of MT quality (Sec. 4.1). BLEU is a corpus-level metric, and our primary evaluation is therefore its system-level correlation. However, it is often also used at the segment level (with smoothing to avoid zero counts). It stands to reason that multiple references would be more important at the segment level, so we also look at the effect of adding paraphrased references on SENTBLEU.

Metric evaluation
For each set of extra references, we produce multi-reference BLEU and SENTBLEU metrics, which we use to score all into-English system outputs from the WMT19 news task. We evaluate the scores as in the metrics task (Ma et al., 2019), by calculating the correlation with manual direct assessments (DA) of MT quality (Graham et al., 2013). System-level scores are evaluated using Pearson's r, with the statistical significance of improvements (against single-reference BLEU) assessed using the Williams test (Williams, 1959). Segment-level correlations are calculated using Kendall's τ (with significance against single-reference SENTBLEU assessed by bootstrap resampling) on the DA assessments transformed into relative rankings.
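For illustration, here is a sketch of the two correlation computations, assuming the metric and DA scores are already aligned; the construction of relative rankings from DA and the official handling of ties are simplified.

```python
from scipy.stats import pearsonr

def system_level_pearson(metric_scores, da_scores):
    """Pearson's r between system-level metric scores and DA scores."""
    return pearsonr(metric_scores, da_scores)[0]

def segment_level_tau(metric_scores, relative_rankings):
    """Kendall-style tau over DA relative rankings.

    relative_rankings : iterable of (better_idx, worse_idx) segment pairs
                        derived from the direct assessments.
    Ties in the metric scores are ignored here for simplicity.
    """
    concordant = discordant = 0
    for better, worse in relative_rankings:
        if metric_scores[better] > metric_scores[worse]:
            concordant += 1
        elif metric_scores[better] < metric_scores[worse]:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)
```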

Baseline and contrastive systems
Our true baselines are case-sensitive corpus BLEU and SENTBLEU, both calculated with sacreBLEU (Post, 2018) using the standard BLEU formula. Though likely familiar to the reader, we review it here. BLEU is the geometric mean of the modified n-gram precisions $p_n$ ($n = 1..4$), multiplied by a brevity penalty (BP) that penalizes overly short translations and thereby works to balance precision with recall:

$$\text{BLEU} = \text{BP} \cdot \exp\Big(\sum_{n=1}^{4} \tfrac{1}{4}\log p_n\Big), \qquad \text{BP} = \min\big(1,\, e^{1 - r/c}\big),$$

$$p_n = \frac{\sum_{h \in H} \sum_{ngram \in h} \#_{\mathrm{clip}}(ngram)}{\sum_{h \in H} \sum_{ngram \in h} \#(ngram)},$$

with $c$ and $r$ the lengths of the hypothesis and reference sets respectively, $H$ the set of hypothesis translations, $\#(ngram)$ the number of times $ngram$ appears in the hypothesis, and $\#_{\mathrm{clip}}(ngram)$ the same count clipped to the maximum number of times the n-gram appears in any one reference. By definition, BLEU is a corpus-level metric, since the statistics above are computed across all sentences of a test set. The sentence-level variant requires a smoothing strategy to counteract the effect of zero n-gram precisions, which are more probable for shorter texts; we use exponential smoothing. Both baselines use the single provided reference only. We also compare against several contrastive paraphrasing approaches: (i) BEAM, which adds to the provided reference the n-best hypotheses in the beam of a baseline paraphraser, and (ii) SAMPLED, which samples from the top 80% of the probability mass at each time step (Edunov et al., 2018). For the sentence encoding methods, we also include (iii) RANDOM, where randomly selected cluster codes are used at training and test time.
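As an illustration of how the multi-reference scores can be computed with sacreBLEU's Python API, here is a minimal sketch; the file names are placeholders and each file is assumed to hold one segment per line, aligned across files.

```python
import sacrebleu

# Hypothetical files: one segment per line, aligned across files.
hyps = [line.rstrip("\n") for line in open("system.out")]
refs = [[line.rstrip("\n") for line in open(f)]
        for f in ("reference.txt", "paraphrase1.txt", "paraphrase2.txt")]

# Corpus-level, case-sensitive BLEU against all references at once.
print(sacrebleu.corpus_bleu(hyps, refs).score)

# Segment-level BLEU with exponential smoothing of zero n-gram counts.
sent_scores = [
    sacrebleu.sentence_bleu(hyp, [ref_set[i] for ref_set in refs],
                            smooth_method="exp").score
    for i, hyp in enumerate(hyps)
]
```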
As a topline, we compare against manually paraphrased references (HUMAN), which we produce for a subset of 500 sentences from the de-en test set. Two native English speakers together produced five paraphrases per reference (alternately two or three paraphrases). They were instructed to craft paraphrases that were maximally different (lexically and syntactically) from both the reference and the other paraphrases (to which they had access), without altering the original meaning.

Paraphrase model training
We train our paraphrasers using data from Parabank 2 (Hu et al., 2019b), containing ≈20M sentences with up to 5 paraphrases each, of which we use the first paraphrase only. We preprocess by removing duplicate sentences and those longer than 100 words, and then segment into subwords using SentencePiece (Kudo and Richardson, 2018) (a unigram model (Kudo, 2018) of size 16k). The data splits are created by randomly shuffling the data and reserving 3k pairs each for dev and test. For the syntactic sentence encoding method, we parse with the Berkeley Parser (Petrov et al., 2006) (using its internal tokenisation and settings prioritising accuracy) and prune trees to a depth of 4, giving ≈6M distinct trees. Paraphrase models are Transformer base models (Vaswani et al., 2017) (cf. App. B for details). All models are trained using the Marian NMT toolkit (Junczys-Dowmunt et al., 2018), except for SAMPLED and the constraint approach, for which we use the Sockeye toolkit (Hieber et al., 2018), since Marian does not support these features.
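A sketch of the subword step with the SentencePiece Python API, assuming the cleaned training text has been written to a single file; the paths are placeholders.

```python
import sentencepiece as spm

# Train a 16k unigram SentencePiece model on the cleaned training text.
spm.SentencePieceTrainer.train(
    input="parabank2.clean.txt",     # hypothetical path to the cleaned corpus
    model_prefix="para_unigram16k",
    vocab_size=16000,
    model_type="unigram",
)

# Load the trained model and segment a sentence into subword pieces.
sp = spm.SentencePieceProcessor(model_file="para_unigram16k.model")
pieces = sp.encode("They knew it was dangerous .", out_type=str)
```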
For baseline models, we produce n additional references by taking the n-best in the beam (using a beam size of 20, which is the maximum number of additional references we test). For models using cluster codes, paraphrases are produced by selecting the n-best cluster codes at the first decoding step and then decoding each of these hypotheses using separate beam searches (of size 6).

Adequacy
To ensure our automatically produced paraphrases are of sufficient quality, we first assess their adequacy (i.e., faithfulness to the original meaning). We determine adequacy by manually evaluating paraphrases of the first 100 sentences of the de-en test set. We compare a subset of the automatic methods (BEAM, SAMPLED, LASER, TREELSTM) as well as HUMAN. Five annotators (two native and three fluent English speakers) rated the paraphrases' adequacy using DA, indicating how well (0-100) the official reference's meaning is preserved by its paraphrases. 25 judgments were collected per sentence (sampling from each system's top 5 paraphrases), and system-level scores are produced by averaging across all annotations.

Table 1: Direct assessment (DA) adequacy scores for the BEAM and SAMPLED baselines, the two diverse approaches and human paraphrases for the 100-sentence de-en subset. We also provide each method's top 3 paraphrases for two references.
The results and examples of some of the paraphrased references are given in Tab. 1 (more examples are given in App. G). Whilst the task is inherently subjective, we see a clear preference for human paraphrases, providing a reference point for interpreting the scores. The automatic paraphrase systems are not far behind, and the scores are further corroborated by the lowest score being assigned to the sampled output, which we expect to be less faithful to the reference meaning.

Diversity
We evaluate the diversity of paraphrased references using two diversity scores (DS):

$$\mathrm{DS}_x(Y) = \frac{1}{|Y|\,(|Y|-1)} \sum_{y \in Y} \sum_{\substack{y' \in Y \\ y' \neq y}} \big(1 - \Delta_x(y, y')\big),$$

where Y is the set of paraphrases of a sentence produced by a given system, and ∆_x calculates the similarity of paraphrases y and y′. We use two different similarity functions: ∆_BOW (for lexical similarity) and ∆_tree (for syntactic similarity). Both give scores between 1 (identical) and 0 (maximally dissimilar). DS_BOW is based on the lexical overlap between the sets of words in two paraphrases: ∆_BOW(y, y′) corresponds to the number of unique words in common between y and y′, divided by their mean length.
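A small sketch of the lexical variant, following the definition above; whitespace tokenisation is an assumption, and averaging over unordered pairs is equivalent to the ordered-pair formula since ∆ is symmetric.

```python
from itertools import combinations

def delta_bow(y1, y2):
    """Lexical similarity: unique words in common, divided by the mean length."""
    t1, t2 = y1.split(), y2.split()
    return len(set(t1) & set(t2)) / ((len(t1) + len(t2)) / 2)

def ds_bow(paraphrases):
    """Average pairwise lexical dissimilarity over a set of paraphrases."""
    pairs = list(combinations(paraphrases, 2))
    return sum(1 - delta_bow(a, b) for a, b in pairs) / len(pairs)
```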
DS_tree uses ∆_tree, the average tree kernel similarity score between paraphrases. We compute tree kernels using the "subset tree" (SST) comparison function presented in (Moschitti, 2006, §2.2), with a decay value of λ = 0.5, and excluding leaves (σ = 0).

The results (Tab. 2) show that all methods other than RANDOM give more diversity than BEAM. Shu et al.'s cluster code method generates diverse paraphrases. As expected, random cluster codes are not helpful, producing mostly identical paraphrases differing only in the cluster code. Diversity increases for all methods as paraphrases are added. TREELSTM produces structurally more diverse paraphrases than LASER and has high lexical diversity too, despite its codes being entirely syntactic, suggesting that structural diversity leads to varied lexical choices. The most lexically and structurally diverse method (except for HUMAN) is in fact the strong baseline SAMPLED, which is likely due to the noise introduced by sampling.
The increased diversity is generally reflected by an increase in the average BLEU score (final column of Tab. 2). These higher BLEU scores indicate that the additional paraphrases are better covering the translation space of the MT outputs, but it remains to be seen whether this concerns the space of valid and/or invalid translations. In contrast, some of the diversity makes less of an impact on the BLEU score; the gap in syntactic diversity between LASER and TREELSTM (+20 references) is not reflected in a similar gap in BLEU score, indicating that this added diversity is not relevant to the evaluation of these specific MT outputs.

Metric Correlation Results
The correlation results for each of the metrics (both system- and segment-level) for different numbers of additional references are shown in Tab. 3a (aggregated results) and Tab. 3b (for the de-en 500-sentence subset). We aggregate the main results to make them easier to interpret by averaging over all into-English test sets (the Ave. column), and we also provide the gains for the language pairs that gave the smallest and greatest gains (Min and Max respectively). Full raw results can be found in App. D.
System-level Adding paraphrased references does not significantly hurt performance, and usually improves it; we see small gains for most languages (Ave. column), although the size of the gain varies, and correlations for two directions (fi-en and gu-en) are degraded, but non-significantly (shown by the small negative minimum gains). Fig. 1 (top) shows that for the diverse approaches, the average gain is positively correlated with the method's diversity: increased diversity does improve coverage of the valid translation space. This positive correlation holds for all directions for which adding paraphrases helps (i.e., all except fi-en and gu-en). For these exceptions, none of the methods significantly improves over the baseline, and RANDOM gives as good if not marginally better results. The constraints approach achieves the highest average gain, suggesting that it more efficiently targets the space of valid translations, even though its paraphrases are significantly less diverse (Tab. 2).
Finally, and in spite of these improvements, we note that all systems fall far short of the best WMT19 metrics, shown in the last row. Automatic paraphrases do not seem to address the weakness of BLEU as an automatic metric.
Segment-level Similar results can be seen at the segment level, with most diverse approaches showing improvements over the baseline (this time SENTBLEU) and a minority showing non-significant deteriorations (i.e., no change). The diversity of the approaches is again positively correlated with the gains seen (Fig. 1, bottom), with the exception of zh-en, for no easily discernible reason.
The best result of the diverse approaches is again achieved by the SAMPLED baseline.
The constraint-based approach achieves good scores, comparable to SAMPLED, despite an anomalously poor score for one language pair (kk-en, with a degradation of 0.097). This approach also had the highest BLEU scores, however, suggesting that the targeted paraphrasing approach here missed its mark.

De-en 500-sentence subset The general pattern is the same as for the averages over all languages in Tab. 3a, with the more diverse methods (especially SAMPLED) resulting in the greatest gains. The human results also follow this pattern, resulting in the highest gains of all at the system level. Interestingly, the constrained system yields higher average BLEU scores than HUMAN (Tab. 2) yet a comparable system-level correlation gain, indicating that it targets more of the invalid translation space. For this particular subset, the constraint-based approach helps slightly more at the segment level than at the system level, even surpassing the human paraphrases in terms of relative gains, despite having considerably less diversity.

Discussion
Does diversity help? In situations where adding paraphrases helps (which is the case for a majority of language directions), the diversity of those paraphrases tends to positively correlate with gains in metric performance for both BLEU and SENTBLEU. The adequacy of the paraphrases appears to be a less important factor, shown by the fact that the best automatic diverse method at both levels was the SAMPLED baseline, the most diverse but the least adequate. The comparison against human paraphrases on the de-en subsample suggests room for improvement in automated techniques, at least at the system level, where all automatic methods are beaten by HUMAN paraphrases, which are both more diverse and more adequate.
However, diversity is not everything; although HUMAN has nearly twice the lexical diversity of SAMPLED, it improves BLEU only somewhat and harms SENTBLEU. Conversely, the targeted constraints have relatively low diversity but higher correlation gains. Diversity in itself does not necessarily result in coverage of the space occupied by good translation hypotheses.
What effect do more references have? Diversity increases as more paraphrases are added, and it is positively correlated with gains for most language directions. However, improvements are slight, especially with respect to what we would hope to achieve (using human references results in much more diversity and also greater improvements). The relationship between the number of extra references and system-level correlations shown in Fig. 2 suggests that increasing the number of references results in gains, but for most test sets the initial paraphrase has the most impact and subsequent ones lead to lesser gains or even occasional deteriorations. Similar results are seen at the segment level.

Why are gains only slight? With respect to the SENTBLEU baseline, we calculate the percentage of comparisons for which the decision is improved (the baseline scored the worse translation higher than the better one and the new paraphrase-augmented metric reversed this) and for which the decision is degraded (the opposite reversal). The results (Fig. 3) show that although all the systems improve a fair number of comparisons (up to 9.6%), they degrade almost as many. So, while paraphrasing adds references that represent the space of valid translations, references are also being added that match the space of invalid ones. Interestingly, the same pattern can be seen for human paraphrases, with 6.46% of comparisons degraded vs. 8.30% improved, suggesting that even when gold-standard paraphrases are produced, the way the references are used by SENTBLEU still rewards some invalid translations, though the balance is shifted slightly in favour of valid translations. This suggests that, at least at the segment level, BLEU is a balancing act between rewarding valid translations and avoiding rewarding invalid ones. Some of these effects may be smoothed out in system-level BLEU, but there is still likely to be an effect. It is worth noting that for the two language directions, fi-en and gu-en, for which diversity was negatively correlated with correlation gain (i.e., diversity could be harming performance), the most conservative approach (RANDOM) leads to some of the best results.
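A sketch of how the improved/degraded percentages can be computed, reusing the relative-ranking pairs from the segment-level evaluation; names are illustrative.

```python
def decision_changes(baseline_scores, new_scores, relative_rankings):
    """Fraction of DA relative-ranking pairs whose ordering is fixed or broken
    when moving from single-reference to paraphrase-augmented sentence BLEU."""
    improved = degraded = 0
    for better, worse in relative_rankings:
        base_ok = baseline_scores[better] > baseline_scores[worse]
        new_ok = new_scores[better] > new_scores[worse]
        if not base_ok and new_ok:
            improved += 1
        elif base_ok and not new_ok:
            degraded += 1
    total = len(relative_rankings)
    return improved / total, degraded / total
```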
What is the effect on individual n-grams? We study which new n-grams are being matched by the additional references for the two language directions with the largest system-level correlation gain (ru-en and de-en). For each sentence, we collect and count the n-grams that were not in the original reference but were in the five paraphrased references of BEAM (newly matched n-grams), accumulated across all test set sentences. We also look at the most frequent n-grams not found at all, even with the help of the paraphrases (i.e., the unrewarded n-grams from Sec. 3.2, the missing n-grams). The results are in Table 4.
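A sketch of how the newly matched n-grams can be accumulated (reusing the `ngrams` helper from the earlier sketch); tokenisation and the choice of n are assumptions.

```python
from collections import Counter

def newly_matched_ngrams(references, paraphrase_sets, n=4):
    """Count n-grams absent from each original reference but present in at
    least one of its paraphrased references, accumulated over the test set."""
    counts = Counter()
    for ref, paraphrases in zip(references, paraphrase_sets):
        ref_ngrams = ngrams(ref.split(), n)
        new = set().union(*(ngrams(p.split(), n) for p in paraphrases)) - ref_ngrams
        counts.update(new)   # count each new n-gram once per sentence
    return counts.most_common()
```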
Unsurprisingly, most 1-grams are common grammatical words (e.g., a, of, to, in, the) that may be present (or not) in any sentence; it is hard to draw any conclusions from them. For 4-grams, however, we see some interesting patterns. Present in both lists are acronym variants such as 'U . S .' for 'United States' and 'p . m .' for 'afternoon' or the 24-hour clock; their presence on both sides indicates success in sometimes capturing these variants as well as failure to do so consistently. We also see phrasal variants such as ', according to' and ', " he said'. These last points corroborate a point made by Freitag et al. (2020, §7.2) that references may omit such common variants. It also suggests a more focused method for generating paraphrases: identify a high-precision set of common variants, and ensure their presence in the set of references, via constrained decoding or other means (in the spirit of Meteor's (Denkowski and Lavie, 2011) synonym-based matching). We note, however, that our paraphrasing methods do seem to contain complementary information, as they also tend to improve Meteor (see results in App. F).

Conclusion
Table 4: The most frequent newly matched n-grams and missing n-grams (1-grams and 4-grams), with their counts.

We studied the feasibility of using diverse automatic paraphrasing of English references to improve BLEU. Although increased diversity of paraphrases does lead to increased gains in correlation with human judgments at both the system and segment levels, the gains are small and inconsistent. We can do a slightly better job by using cues from the system outputs themselves to produce paraphrases providing a helpful form of "targeted" diversity. The comparison with manually produced paraphrases shows that there is room for improvement, both in terms of how much diversity is achieved and how much BLEU can be improved. However, the lack of any improvement for some languages points to how hard it is to target this "right kind" of diversity a priori; this, together with the relatively limited gains overall (especially in comparison with the best WMT19 metrics), suggests an intrinsic limit to BLEU's capacity to handle multiple references.

B Paraphraser training details
All paraphrase models are Transformer base models (Vaswani et al., 2017): 6 layers, 8 heads, word embedding dimension of 512, feedforward dimension of 2048. We set dropout to 0.1 and tie all embeddings to the output layer with a shared vocabulary size of 33,152. We use the same vocabulary (including the 256 cluster codes) for all models. We use Adam optimisation with a scheduled learning rate (initial value 3 × 10⁻⁴) and a mini-batch size of 64. We train each model on 4 GTX Titan X GPUs with a gradient update delay of 2, and select the final model based on validation BLEU.

C Sentence clustering training details
We set k to 256 for k-means clustering. We train TREELSTM sentence encoders using Adagrad with a learning rate of 0.025, weight decay of 10⁻⁴ and batch size of 400 for a maximum of 20 iterations. We set the model size to 256 and limit the maximum number of child nodes to 10.

D Full raw results

Table 7 shows the raw correlations of each paraphrase-augmented BLEU metric on WMT19 (system-level results at the top and segment-level results at the bottom). These correspond to the raw scores used to calculate the gains of each method with respect to the true baseline (BLEU or SENTBLEU) shown in the main results section in Table 3. We indicate the best system from WMT19 as a point of reference.

F Results with the Meteor metric
Although we focus in this article on ways of improving BLEU using paraphrases, since BLEU is the dominant metric, it is also interesting to look at how adding paraphrases could help similar metrics. We apply the same method to the Meteor metric (version 1.5) (Denkowski and Lavie, 2014), a metric which already integrates synonym support. Summarised results (as gains with respect to the single-reference Meteor metric) are shown in Tab. 8 and raw results in Tab. 9, for both system-level and segment-level correlations. We observe that the true baselines (Meteor and sentenceMeteor) are improved in both cases, possibly more so than BLEU and in different ways, showing that the information added by the paraphrases is complementary to the synonym support offered by Meteor.

G Further examples of automatically paraphrased references
We provide additional examples of paraphrased references.