Are BLEU and Meaning Representation in Opposition?

One possible way of obtaining continuous-space sentence representations is by training neural machine translation (NMT) systems. The recent attention mechanism, however, removes the single point in the neural network from which the source sentence representation can be extracted. We propose several variations of the attentive NMT architecture that bring this meeting point back. Empirical evaluation suggests that the better the translation quality, the worse the learned sentence representations serve in a wide range of classification and similarity tasks.


Introduction
Deep learning has brought the possibility of automatically learning continuous representations of sentences. On the one hand, such representations can be geared towards particular tasks such as classifying the sentence in various aspects (e.g. sentiment, register, question type) or relating the sentence to other sentences (e.g. semantic similarity, paraphrasing, entailment). On the other hand, we can aim at "universal" sentence representations, that is representations performing reasonably well in a range of such tasks.
Regardless of the evaluation criterion, the representations can be learned either in an unsupervised way (from simple, unannotated texts) or in a supervised way, relying on manually constructed training sets of sentences equipped with annotations of the appropriate type. A different approach is to obtain sentence representations from training neural machine translation models (Hill et al., 2016).
Since Hill et al. (2016), NMT has seen substantial advances in translation quality and it is thus natural to ask how these improvements affect the learned representations.
One of the key technological changes was the introduction of "attention", which has since become the central component of the network (Vaswani et al., 2017). Attention allows the NMT system to dynamically choose which parts of the source are most important when deciding on the current output token. As a consequence, there is no longer a static vector representation of the sentence available in the system.
In this paper, we remove this limitation by proposing a novel encoder-decoder architecture with a structured fixed-size representation of the input that still allows the decoder to explicitly focus on different parts of the input. In other words, our NMT system has both the capacity to attend to various parts of the input and to produce static representations of input sentences.
We train this architecture on English-to-German and English-to-Czech translation and evaluate the learned representations of English on a wide range of tasks in order to assess its performance in learning "universal" meaning representations.
In Section 2, we briefly review recent efforts in obtaining sentence representations. In Section 3, we introduce a number of variants of our novel architecture. Section 4 describes some standard and our own methods for evaluating sentence representations. Section 5 then provides experimental results: translation and representation quality. The relation between the two is discussed in Section 6.

Related Work
The properties of continuous sentence representations have always been of interest to researchers working on neural machine translation. The first works on RNN sequence-to-sequence models, such as Sutskever et al. (2014), provided visualizations of the phrase and sentence embedding spaces and observed that they reflect semantic and syntactic structure to some extent. Hill et al. (2016) perform a systematic evaluation of sentence representations in different models, including NMT, by applying them to various sentence classification tasks and by relating semantic similarity to closeness in the representation space. Shi et al. (2016) investigate the syntactic properties of representations learned by NMT systems by predicting sentence- and word-level syntactic labels (e.g. tense, part of speech) and by generating syntax trees from these representations. Schwenk and Douze (2017) aim to learn language-independent sentence representations using NMT systems with multiple source and target languages. They do not consider the attention mechanism and evaluate primarily by similarity scores of the learned representations for similar sentences (within or across languages).

Table 1: Different RNN-based architectures and their properties. Legend: "pooling": vectors combined by mean or max (AVGPOOL, MAXPOOL); "sent. emb.": sentence embedding, i.e. the fixed-size sentence representation computed by the encoder; "init": initial decoder state; "ctx": context vector, i.e. input for the decoder cell; "input for att.": input for the decoder attention.

Model Architectures
Our proposed model architectures differ in (a) which encoder states are considered in subsequent processing, (b) how they are combined, and (c) how they are used in the decoder. Table 1 summarizes all the examined configurations of RNN-based models. The first three (ATTN, FINAL, FINAL-CTX) correspond roughly to standard sequence-to-sequence models from the literature, including that of Sutskever et al. (2014). The last column (ATTN-ATTN) is our main proposed architecture: compound attention, described in Section 3.1.
In addition to RNN-based models, we experiment with the Transformer model, see Section 3.3.
Figure 1: An illustration of compound attention with 4 attention heads. The figure shows the computations that result in the decoder state $s_3$ (in addition, each state $s_i$ depends on the previous target token $y_{i-1}$). Note that the matrix $M$ is the same for all positions in the output sentence and it can thus serve as the source sentence representation.

Compound Attention
Our compound attention model incorporates attention in both the encoder and the decoder. Its architecture is shown in Fig. 1.
Encoder with inner attention. First, we process the input sequence $x_1, x_2, \ldots, x_T$ using a bidirectional recurrent network with gated recurrent units (GRU, Cho et al., 2014):

$$\overrightarrow{h_t} = \overrightarrow{\mathrm{GRU}}(x_t, \overrightarrow{h_{t-1}}), \qquad \overleftarrow{h_t} = \overleftarrow{\mathrm{GRU}}(x_t, \overleftarrow{h_{t+1}}), \qquad h_t = [\overrightarrow{h_t}, \overleftarrow{h_t}].$$

We denote by $u$ the combined number of units in the two RNNs, i.e. the dimensionality of $h_t$. Next, our goal is to combine the states $(h_1, h_2, \ldots, h_T) = H$ of the encoder into a vector of fixed dimensionality that represents the entire sentence. Traditional seq2seq models concatenate the final states of both encoder RNNs ($\overrightarrow{h_T}$ and $\overleftarrow{h_1}$) to obtain the sentence representation (denoted as FINAL in Table 1). Another option is to combine all encoder states using the average or maximum over time (Collobert and Weston, 2008; Schwenk and Douze, 2017) (AVGPOOL and MAXPOOL in Table 1 and following).
We adopt an alternative approach, which is to use inner attention (Liu et al., 2016; Li et al., 2016) to compute several weighted averages of the encoder states (Lin et al., 2017). The main motivation for incorporating these multiple "views" of the state sequence is that it removes the need for the RNN cell to accumulate the representation of the whole sentence as it processes the input, and therefore it should have more capacity for modeling local dependencies.
Specifically, we fix a number $r$, the number of attention heads, and compute an $r \times T$ matrix $A$ of attention weights $\alpha_{jt}$, representing the importance of position $t$ in the input for the $j$-th attention head. We then use this matrix to compute $r$ weighted sums of the encoder states, which become the rows of a new matrix $M$:

$$M = AH. \tag{1}$$

A vector representation of the source sentence (the "sentence embedding") can be obtained by flattening the matrix $M$. In our experiments, we project the encoder states $h_1, h_2, \ldots, h_T$ down to a given dimensionality before applying Eq. (1), so that we can control the size of the representation. Following Lin et al. (2017), we compute the attention matrix by feeding the encoder states to a two-layer feed-forward network:

$$A = \mathrm{softmax}\left(U \tanh(W H^{\top})\right), \tag{2}$$

where $W$ and $U$ are weight matrices of dimensions $d \times u$ and $r \times d$, respectively ($d$ is the number of hidden units); the softmax function is applied along the second dimension, i.e. across the encoder states.
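The inner attention step can be illustrated with a small sketch in plain Python. It assumes that $H$ stores the encoder states as rows ($T \times u$), so that the attention matrix is the row-wise softmax of $U \tanh(W H^{\top})$ and the representation is $M = AH$. The helper names (`matmul`, `softmax`, `inner_attention`) and all numeric values are hypothetical and chosen only for illustration; the projection of the encoder states and any learned parameters are left out.

```python
import math

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def inner_attention(H, W, U):
    """Return (A, M) for encoder states H (T x u), W (d x u) and U (r x d)."""
    Ht = [list(col) for col in zip(*H)]                               # u x T
    hidden = [[math.tanh(v) for v in row] for row in matmul(W, Ht)]   # d x T
    scores = matmul(U, hidden)                                        # r x T
    A = [softmax(row) for row in scores]     # one distribution over positions per head
    M = matmul(A, H)                         # r x u, one weighted sum per head
    return A, M

# Toy values (assumed): T = 3 positions, u = 2, d = 2, r = 2 heads.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[0.5, -0.5], [0.3, 0.8]]
U = [[1.0, 0.0], [0.0, 1.0]]
A, M = inner_attention(H, W, U)
sentence_embedding = [v for row in M for v in row]   # flattened M, length r * u
```

Each row of `A` sums to one, so every row of `M` is a convex combination of the encoder states, and the flattened embedding has a size that does not depend on the sentence length.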
Attentive decoder. In vanilla seq2seq models with a fixed-size sentence representation, the decoder is usually conditioned on this representation via the initial RNN state. We propose to instead leverage the structured sentence embedding by applying attention to its components. This is no different from the classical attention mechanism used in NMT, except that it acts on this fixed-size representation instead of the sequence of encoder states.
In the $i$-th decoding step, the attention mechanism computes a distribution $\{\beta_{ij}\}_{j=1}^{r}$ over the $r$ components of the structured representation. This is then used to weight these components to obtain the context vector $c_i$, which in turn is used to update the decoder state. Again, we can write this in matrix form as

$$C = BM, \tag{3}$$

where $B = (\beta_{ij})_{i=1,j=1}^{T',r}$ is the attention matrix, $T'$ is the number of decoding steps, and $C = (c_1, c_2, \ldots, c_{T'})$ are the context vectors.
Note that by combining Eqs. (1) and (3), we get

$$C = BAH. \tag{4}$$

Hence, the composition of the encoder and decoder attentions (the "compound attention") defines an implicit alignment between the source and the target sequence. From this viewpoint, our model can be regarded as a restriction of the conventional attention model.

The decoder uses a conditional GRU cell ($\mathrm{cGRU}_{\mathrm{att}}$; Sennrich et al., 2017), which consists of two consecutively applied GRU blocks. The first block processes the previous target token $y_{i-1}$, while the second block receives the context vector $c_i$ and predicts the next target token $y_i$.
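As a small numeric check of the composition described above, the following sketch multiplies a decoder attention matrix $B$ with an encoder attention matrix $A$ and verifies that the product behaves like an attention matrix over source positions. All values are made up for illustration, and `matmul` is a hypothetical helper rather than part of any toolkit.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

# Assumed toy sizes: T = 3 source positions, r = 2 heads, 4 decoding steps, u = 2.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]               # encoder states, T x u
A = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]                 # encoder inner attention, r x T
B = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8], [0.0, 1.0]]   # decoder attention, steps x r

M = matmul(A, H)              # structured sentence representation, r x u
C = matmul(B, M)              # context vectors, one row per decoding step
alignment = matmul(B, A)      # implicit source-target alignment, steps x T
C_via_alignment = matmul(alignment, H)
```

Because the rows of both `A` and `B` sum to one, so do the rows of `alignment`, and `C_via_alignment` equals `C` up to rounding. The alignment matrix is a product through an inner dimension of $r$, so its rank is at most $r$, which is one way to read the "restriction" mentioned above.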

Constant Context
Compared to the FINAL model, the compound attention architecture described in the previous section undoubtedly benefits from the fact that the decoder is presented with information from the encoder (i.e. the context vectors $c_i$) in every decoding step. To investigate this effect, we include baseline models where we replace all context vectors $c_i$ with the entire sentence embedding (indicated by the suffix "-CTX" in Table 1). Specifically, we provide either the flattened matrix $M$ (for models with inner attention; ATTN or ATTN-CTX), the final state of the encoder (FINAL-CTX), or the result of mean- or max-pooling (*POOL-CTX) as a constant input to the decoder cell.

Transformer with Inner Attention
The Transformer (Vaswani et al., 2017) is a recently proposed model based entirely on feed-forward layers and attention. It consists of an encoder and a decoder, each with 6 layers; every layer consists of multi-head attention on the previous layer and a position-wise feed-forward network.
In order to introduce a fixed-size sentence representation into the model, we modify it by adding inner attention after the last encoder layer. The attention in the decoder then operates on the components of this representation (i.e. the rows of the matrix M ). This variation on the Transformer model corresponds to the ATTN-ATTN column in Table 1 and is therefore denoted TRF-ATTN-ATTN.

Representation Evaluation
Continuous sentence representations can be evaluated in many ways, see e.g. Kiros et al. (2015), Conneau et al. (2017) or the RepEval workshops. We evaluate our learned representations with classification and similarity tasks from SentEval (Section 4.1) and by examining clusters of sentence paraphrase representations (Section 4.2).

SentEval
We perform evaluation on 10 classification and 7 similarity tasks using the SentEval (Conneau et al., 2017) evaluation tool. This is a superset of the tasks from Kiros et al. (2015). Table 2 describes the classification tasks (number of classes, data size, task type and an example) and Table 3 lists the similarity tasks. The similarity (relatedness) datasets contain pairs of sentences labeled with a real-valued similarity score. A given sentence representation model is evaluated either by learning to directly predict this score given the respective sentence embeddings ("regression"), or by computing the cosine similarity of the embeddings ("similarity") without the need for any training. In both cases, the Pearson and Spearman correlation of the predictions with the gold ratings is reported.
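The training-free "similarity" protocol can be sketched as follows. This is a minimal illustration in plain Python and not the SentEval implementation: the function names, the toy embeddings and the gold scores are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical sentence-pair embeddings and gold similarity ratings.
pairs = [([1.0, 0.0], [1.0, 0.1]), ([1.0, 0.0], [0.5, 0.5]), ([1.0, 0.0], [0.0, 1.0])]
gold = [4.8, 3.0, 1.0]
predictions = [cosine(u, v) for u, v in pairs]
print(pearson(predictions, gold), spearman(predictions, gold))
```

In this toy case the cosine scores order the pairs exactly as the gold ratings do, so the Spearman correlation is 1 while the Pearson correlation is slightly lower.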
See Dolan et al. (2004) for details on MRPC and Hill et al. (2016) for the remaining tasks.

Paraphrases
We also evaluate the representation of paraphrases. We use two paraphrase sources for this purpose: COCO and HyTER Networks.
COCO (Common Objects in Context; Lin et al., 2014) is an object recognition and image captioning dataset, containing 5 captions for each image. We extracted the captions from its validation set to form a set of 5 × 5k = 25k sentences grouped by the source image. The average sentence length is 11 tokens and the vocabulary size is 9k types.
HyTER Networks (Dreyer and Marcu, 2014) are large finite-state networks representing a subset of all possible English translations of 102 Arabic and 102 Chinese sentences. The networks were built manually based on reference sentences in Arabic, Chinese and English. Each network contains up to hundreds of thousands of possible translations of the given source sentence. We randomly generated 500 translations for each source sentence, obtaining a corpus of 102k sentences grouped into 204 clusters, each containing 500 paraphrases. The average length of the 102k English sentences is 28 tokens and the vocabulary size is 11k token types.
For every model, we encode each dataset to obtain a set of sentence embeddings with cluster labels. We then compute the following metrics:

Cluster classification accuracy (denoted "Cl"). We remove 1 point (COCO) or half of the points (HyTER) from each cluster, and fit an LDA classifier on the rest. We then compute the accuracy of the classifier on the removed points.

Nearest-neighbor paraphrase retrieval accuracy (NN). For each point, we find its nearest neighbor according to cosine or $L_2$ distance, and count how often the neighbor lies in the same cluster as the original point.

Inverse Davies-Bouldin index (iDB). The Davies-Bouldin index (Davies and Bouldin, 1979) measures cluster separation. For every pair of clusters, we compute the ratio $R_{ij}$ of their combined scatter (average $L_2$ distance to the centroid) $S_i + S_j$ and the $L_2$ distance of their centroids $d_{ij}$, then average the maximum values for all clusters:

$$R_{ij} = \frac{S_i + S_j}{d_{ij}}, \qquad \mathrm{DB} = \frac{1}{N} \sum_{i=1}^{N} \max_{j \neq i} R_{ij},$$

where $N$ is the number of clusters. The lower the DB index, the better the separation. To match with the rest of our metrics, we take its inverse: $\mathrm{iDB} = \frac{1}{\mathrm{DB}}$.
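Two of these metrics, nearest-neighbor retrieval accuracy and the inverse Davies-Bouldin index, are simple enough to sketch directly. The code below is an illustration under our reading of the definitions above, not the evaluation script used for the reported numbers; function names and the toy points are hypothetical.

```python
import math

def l2(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """One minus cosine similarity, for non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nn_accuracy(points, labels, distance=cosine_distance):
    """Fraction of points whose nearest other point carries the same label."""
    hits = 0
    for i, p in enumerate(points):
        others = (j for j in range(len(points)) if j != i)
        nearest = min(others, key=lambda j: distance(p, points[j]))
        hits += labels[nearest] == labels[i]
    return hits / len(points)

def inverse_davies_bouldin(points, labels):
    """1 / DB; undefined (division by zero) if every cluster has zero scatter."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    centroids = {lab: [sum(c) / len(ps) for c in zip(*ps)] for lab, ps in clusters.items()}
    scatter = {lab: sum(l2(p, centroids[lab]) for p in ps) / len(ps)
               for lab, ps in clusters.items()}
    worst_ratios = [
        max((scatter[i] + scatter[j]) / l2(centroids[i], centroids[j])
            for j in clusters if j != i)
        for i in clusters
    ]
    return len(worst_ratios) / sum(worst_ratios)

# Hypothetical embeddings of two paraphrase clusters.
points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["a", "a", "b", "b"]
nn = nn_accuracy(points, labels)
idb = inverse_davies_bouldin(points, labels)
```

For these two well-separated toy clusters, every nearest neighbor lies in the same cluster and the scatter is small relative to the centroid distance, so both scores are high.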

Experimental Results
We trained English-to-German and English-to-Czech NMT models using Neural Monkey (Helcl and Libovický, 2017a; https://github.com/ufal/neuralmonkey). In the following, we distinguish these models using the code of the target language, i.e. de or cs.
The de models were trained on the Multi30K multilingual image caption dataset (Elliott et al., 2016), extended by Helcl and Libovický (2017b), who acquired additional parallel data using backtranslation (Sennrich et al., 2016) and perplexity-based selection (Yasuda et al., 2008). This extended dataset contains 410k sentence pairs, with an average sentence length of 12 ± 4 tokens in English. We train each model for 20 epochs with a batch size of 32. We truecased the training data as well as all data we evaluate on. For German, we employed Neural Monkey's reversible pre-processing scheme, which expands contractions and performs morphological segmentation of determiners. We used a vocabulary of at most 30k tokens for each language (no subword units).
The cs models were trained on CzEng 1.7 (Bojar et al., 2016). We used byte-pair encoding (BPE) with a vocabulary of 30k sub-word units, shared for both languages. For English, the average sentence length is 15 ± 19 BPE tokens and the original vocabulary size is 1.9M. We performed 1 training epoch with a batch size of 128 on the entire training section (57M sentence pairs).
The datasets for both de and cs models come with their respective development and test sets of sentence pairs, which we use for the evaluation of translation quality. (We use 1k randomly selected sentence pairs from CzEng 1.7 dtest as a development set. For evaluation, we use the entire etest.)
We also evaluate the InferSent model (Conneau et al., 2017) as pre-trained on the natural language inference (NLI) task. InferSent has been shown to achieve state-of-the-art results on the SentEval tasks. We also include a bag-of-words baseline (GloVe-BOW) obtained by averaging GloVe word vectors (Pennington et al., 2014).

Translation Quality
We estimate translation quality of the various models using single-reference case-sensitive BLEU (Papineni et al., 2002) as implemented in Neural Monkey (the reference implementation is mteval-v13a.pl from NIST or Moses).
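For readers unfamiliar with the metric, the core of single-reference, case-sensitive corpus BLEU can be sketched as below. This is a simplified illustration and not the Neural Monkey or mteval-v13a.pl implementation: it assumes plain whitespace tokenization, uses no smoothing, and the function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100) with one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()      # assumed whitespace tokenization
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped counts: an n-gram is credited at most as often as in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if min(matches) == 0:
        return 0.0                           # no smoothing in this sketch
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)

exact = corpus_bleu(["the cat sat on the mat"], ["the cat sat on the mat"])
cased = corpus_bleu(["The cat sat on the mat"], ["the cat sat on the mat"])
```

An exact match scores 100, while the second hypothesis loses n-gram matches only because of the capitalized first word, which illustrates how strictly the metric rewards reproducing the reference word forms.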
Tables 4 and 5 provide the results on the two datasets. The cs dataset is much larger and the training takes much longer. We were thus able to experiment with only a subset of the possible model configurations.
The columns "Size" and "Heads" specify the total size of sentence representation and the number of heads of encoder inner attention.
In both cases, the best-performing model is ATTN (Bahdanau et al.), followed by the Transformer (de only) and our ATTN-ATTN (compound attention). The non-attentive FINAL (Cho et al.) is the worst, except for cs-MAXPOOL.
For 5 selected cs models, we also performed the WMT-style 5-way manual ranking on 200 sentence pairs. The annotations are interpreted as simulated pairwise comparisons. For each model, the final score is the number of times the model was judged better than the other model in the pair. Tied pairs are excluded. The results, shown in Table 5, confirm the automatic evaluation results.
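The conversion from rankings to scores can be sketched as follows. The data layout (one dictionary of ranks per annotated sentence, lower rank meaning better) and the model names are assumptions made for the example; only the counting rule follows the description above.

```python
from collections import Counter
from itertools import combinations

def pairwise_wins(rankings):
    """Count, for each model, the simulated pairwise comparisons it wins."""
    wins = Counter()
    for ranking in rankings:                 # one dict per annotated sentence
        for a, b in combinations(sorted(ranking), 2):
            if ranking[a] == ranking[b]:
                continue                     # tied pairs are excluded
            winner = a if ranking[a] < ranking[b] else b
            wins[winner] += 1
    return wins

# Hypothetical annotations for three systems on two sentences.
rankings = [
    {"model-A": 1, "model-B": 2, "model-C": 2},
    {"model-A": 2, "model-B": 1, "model-C": 3},
]
scores = pairwise_wins(rankings)
```

Here `model-A` wins three comparisons, `model-B` wins two and `model-C` none; the tie between `model-B` and `model-C` on the first sentence contributes to neither.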
We also checked the relation between BLEU and the number of heads and representation size. While there are many exceptions, the general tendency is that the larger the representation or the more heads, the higher the BLEU score. The Pearson correlation between BLEU and the number of heads is 0.87 for cs and 0.31 for de.

SentEval
Due to the large number of SentEval tasks, we present the results abridged in two different ways: by reporting averages (Table 6) and by showing only the best models in comparison with other methods (Table 7). The full results can be found in the supplementary material. Table 6 provides averages of the classification and similarity results, along with the results of selected tasks (SNLI, SICK-E). As the baseline for classification tasks, we assign the most frequent class to all test examples. The de models are generally worse, most likely due to the higher OOV rate and overall simplicity of the training sentences. On cs, we see a clear pattern that more heads hurt the performance. The de set has more variations to consider but the results are less conclusive.
For the similarity results, it is worth noting that cs-ATTN-ATTN performs very well with 1 attention head but fails miserably with more heads. Otherwise, the relation to the number of heads is less clear. Table 7 compares our strongest models with other approaches on all tasks. Besides InferSent and GloVe-BOW, we include SkipThought as evaluated by Conneau et al. (2017), and the NMT-based embeddings by Hill et al. (2016) trained on the English-French WMT15 dataset (this is the best result reported by Hill et al. for NMT).
We see that the supervised InferSent clearly outperforms all other models in all tasks except for MRPC and TREC. Results by Hill et al. are always lower than our best setups, except MRPC. On classification tasks, our models are outperformed even by GloVe-BOW, except for the NLI tasks (SICK-E and SNLI) where cs-FINAL-CTX is better.
Table 6 also provides our measurements based on sentence paraphrases. For paraphrase retrieval (NN), we found cosine distance to work better than $L_2$ distance. We therefore do not list or further consider $L_2$-based results (except in the supplementary material). This evaluation seems less stable and discerning than the previous two, but we can again confirm the victory of InferSent followed by our non-attentive cs models. cs and de models are no longer clearly separated.

Table legend: AvgAcc is the average of all 10 SentEval classification tasks (see Table S1 in supplementary material), AvgSim averages all 7 similarity tasks (see Table S2). Hy- and CO- stand for HyTER and COCO, respectively. "H." is the number of attention heads. We give the out-of-vocabulary (OOV) rate and the perplexity of a 4-gram language model (LM) trained on the English side of the respective parallel corpus and evaluated on all available data for a given task.

Discussion
To assess the relation between the various measures of sentence representations and translation quality as estimated by BLEU, we plot a heatmap of Pearson correlations in Fig. 2. As one example, Fig. 3 details the cs models' BLEU scores and AvgAcc (average of SentEval accuracies).
A good sign is that on the cs dataset, most metrics of representation are positively correlated (the pairwise Pearson correlation is 0.78 ± 0.32 on average), the outlier being TREC (−0.16 ± 0.16 correlation with the other metrics on average). On the other hand, most representation metrics correlate with BLEU negatively (−0.57 ± 0.31) on cs. The pattern is less pronounced but still clear on the de dataset as well.
A detailed understanding of what the learned representations contain is difficult. We can only speculate that if the NMT model has some capability for following the source sentence superficially, it will use it and spend its capacity on closely matching the target sentences rather than on deriving some representation of meaning which would reflect e.g. semantic similarity. We assume that this can be a direct consequence of NMT being trained for cross entropy: putting the exact word forms in exact positions as the target sentence requires. Performing well in single-reference BLEU is not an indication that the system understands the meaning but rather that it can maximize the chance of producing the n-grams required by the reference.
The negative correlation between the number of attention heads and the representation metrics from Fig. 3 (−0.81 ± 0.12 for cs and −0.18 ± 0.19 for de, on average) can be partly explained by the following observation. We plotted the induced alignments (e.g. Fig. 4) and noticed that the heads tend to "divide" the sentence into segments. While one would hope that the segments correspond to some meaningful units of the sentence (e.g. subject, predicate, object), we failed to find any such interpretation for ATTN-ATTN and for cs models in general. Instead, the heads divide the source sentence more or less equidistantly, as documented by Fig. 5. Such a multi-headed sentence representation is then less fit for representing e.g. paraphrases where the subject and object swap their position due to passivization, because their representations are then accessed by different heads, and thus end up in different parts of the sentence embedding vector.
For de-ATTN-CTX models, we observed a much flatter distribution of attention weights for each head and, unlike in the other models, we were often able to identify a head focusing on the main verb. This difference between ATTN-ATTN and some ATTN-CTX models could be explained by the fact that in the former, the decoder is oblivious to the ordering of the heads (because of decoder attention), and hence it may not be useful for a given head to look for a specific syntactic or semantic role.

Conclusion
We presented a novel variation of attentive NMT models (Vaswani et al., 2017) that again provides a single meeting point with a continuous representation of the source sentence. We evaluated these representations with a number of measures reflecting how well the meaning of the source sentence is captured.
While our proposed "compound attention" leads to translation quality not much worse than the fully attentive model, it generally does not perform well in the meaning representation. Quite on the contrary, the better the BLEU score, the worse the meaning representation.
We believe that this observation is important for representation learning where bilingual MT now seems less likely to provide useful data, but perhaps more so for MT itself, where the struggle towards a high single-reference BLEU score (or even worse, cross entropy) leads to systems that refuse to consider the meaning of the sentence.

Table S1: Classification accuracy of sentence representations on a number of SentEval tasks. Reprinted results are marked with †, others are our measurements. We give the out-of-vocabulary (OOV) rate and the perplexity of a 4-gram language model (LM) trained on the English side of the respective parallel corpus and evaluated on all available data for a given task. "AvgAcc" is the average of each row.

Table S2: Similarity scores of sentence representations: Pearson/Spearman correlation between cosine similarity of pairs of sentence embeddings and similarity as marked manually by humans. Reprinted results are marked with †, others are our measurements. "AvgSim" averages both correlation coefficients for all tasks. Perplexity and OOV rate as in Table S1.