Learning Monolingual Compositional Representations via Bilingual Supervision

Bilingual models that capture the semantics of sentences are typically evaluated only on cross-lingual transfer tasks such as cross-lingual document categorization or machine translation. In this work, we evaluate the quality of the monolingual representations learned with a variant of the bilingual compositional model of Hermann and Blunsom (2014), viewing translations in a second language as a semantic annotation of the original-language text. We show that compositional objectives based on phrase translation pairs outperform compositional objectives based on bilingual sentences and on monolingual paraphrases.


Introduction
The effectiveness of new representation learning methods for distributional word representations has brought renewed interest to the question of how to compose semantic representations of words to capture the semantics of phrases and sentences. These representations offer the promise of capturing phrasal or sentential semantics in a general fashion, and could in principle benefit any NLP application that analyzes text beyond the word level, and improve its ability to generalize beyond contexts seen in training.
While most prior work has focused either on composing words into short phrases (Mitchell and Lapata, 2010; Baroni and Zamparelli, 2010; Hermann et al., 2012; Fyshe et al., 2015), or on supervised task-specific composition functions (Iyyer et al., 2015; Rocktäschel et al., 2015; Iyyer et al., 2014; Tai et al., 2015, inter alia), Wieting et al. (2016) recently showed that a simple composition architecture (vector averaging) can yield sentence models that consistently perform well in semantic textual similarity tasks in a wide range of domains, and outperform more complex sequence models (Tai et al., 2015). Interestingly, these models are trained using PPDB, the paraphrase database (Ganitkevitch et al., 2013), which was learned from bilingual parallel corpora.
In bilingual settings, there are also a few examples of bilingual sentence models (Zou et al., 2013; Hermann and Blunsom, 2014; Lauly et al., 2014; Gouws et al., 2014). However, they have only been evaluated in cross-lingual transfer settings (e.g., cross-lingual document classification, or machine translation), which do not directly evaluate the quality of the sentence-level semantic representations learned.
In this work, we directly evaluate the usefulness of modeling semantic equivalence using compositional models of translated texts for detecting semantic textual similarity in a single language. For instance, in addition to using translated texts to model cross-lingual transfer from English to a foreign language, we can view English translations as a semantic annotation of the foreign text, and evaluate the usefulness of the resulting foreign representations. While learning representations in languages other than English is a pressing practical problem, this paper focuses on evaluating learned English sentence representations on English semantic similarity tasks, to facilitate comparison with prior work.
Our results show that sentence representations learned using a bilingual compositional objective outperform representations learned using monolingual evidence, whether compositional or not. In addition, phrasal translations yield better representations than full sentence translations, even when applied to sentence-level tasks.

Table 1: Types of semantic equivalence used for training.
Bilingual sentences
    así que podríamos decir, de hecho, que se adelantó a la decisión de nuestro colega.
    thus, in fact, we might say that he hurried ahead of the decision by our fellow member.
    señor presidente, la votación sobre sellafield ha sido una novedad en el parlamento europeo.
English paraphrases
  + by our fellow member | by our colleague
    by our fellow member | of the committee's work
  + slowly than anticipated | slowly than expected
Bilingual phrases
  + by our fellow member | de nuestro colega diputado
    by our fellow member | miles de personas de todo
  + book and buy airline tickets | reserva y adquisición de billetes
  + the air fare advertised should show | el precio del billete anunciado debería indicar
  + a book by the american writer noam | un libro del escritor norteamericano noam

Models
Inspired by the bilingual model of Hermann and Blunsom (2014) and the paraphrase model of Wieting et al. (2016), representations for multi-word segments are built with a simple bag-of-words additive combination of word representations, which are trained to minimize the distance between semantically equivalent segments.

Three Views of Semantic Equivalence
The different types of semantic equivalence used for training are illustrated in Table 1.
Parallel Sentences occur naturally, and provide training examples that are more consistent with downstream applications. However, they can be noisy due to automatic sentence alignment and one-to-many mappings, and bag-of-words representations of sentence meaning are likely to become increasingly noisy as segments get longer.
Monolingual Paraphrases are invaluable resources, but rarely occur naturally, and creating paraphrase resources therefore requires considerable effort. Ganitkevitch et al. (2013) automatically created paraphrase resources for many languages using parallel corpora.
Parallel Phrases or phrasal translations might provide a tighter definition of semantic equivalence than longer sentence pairs, but phrase pairs have to be extracted automatically based on word alignments, which is a noisy process.

Models and Learning Objectives
Our main model is based on the bilingual composition model of Hermann and Blunsom (2014), which learns a word embedding matrix $W$ from a training set $X$ of aligned sentence pairs $\langle x_1, x_2 \rangle$. Each of $x_1$ and $x_2$ is represented as a bag-of-words, i.e. a multiset of column indices in $W$. Each aligned pair $\langle x_1, x_2 \rangle$ is augmented with $k$ randomly selected sentences that are not aligned to $x_1$, and another $k$ that are not aligned to $x_2$. Given this augmented example $\langle x_1, x_2, x_1^1, \ldots, x_1^k, x_2^1, \ldots, x_2^k \rangle$, the model training objective $J_{bi}$ is defined in terms of the hinge function $[v]_h = \max(0, v)$, whose margin is given by $\delta$, with $\lambda$ as a regularization parameter.
The paraphrase-based model of Wieting et al. (2016) shares the same structure as the bilingual model above, but differs in the nature of the segments used to define semantic equivalence (sentence pairs vs. paraphrases), the distance function used (Euclidean distance vs. cosine similarity), as well as the negative sampling strategies, and word embedding initialization and regularization. We provide empirical comparisons with the Wieting et al. (2016) embeddings, and also define a simplified version of that objective, $J_{pa}$, to allow for controlled comparisons with $J_{bi}$. $J_{pa}$ uses random initialization and penalizes large values in $W$ with a $\|W\|_F^2$ regularization term. The choice of distance function (Euclidean distance or cosine similarity) and of negative sampling strategy are viewed as tunable hyperparameters.
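To make the ingredients of such an objective concrete, the following minimal sketch computes a margin-based hinge loss for one augmented example, using additive bag-of-words composition. It is an illustration under stated assumptions, not the authors' implementation: the squared Euclidean distance, the way each group of $k$ negatives is contrasted with the aligned pair, and the names `compose`, `sq_dist` and `hinge_objective` are all hypothetical choices made here.

```python
import numpy as np

def compose(W, bag):
    """Additive composition: sum the columns of W indexed by the bag.

    `bag` is a list of column indices; repeated indices are allowed.
    """
    return W[:, bag].sum(axis=1)

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    diff = u - v
    return float(diff @ diff)

def hinge_objective(W, x1, x2, neg_for_x1, neg_for_x2, delta=1.0, lam=0.0):
    """Hinge loss for one aligned pair plus its negative samples.

    ASSUMPTION: each negative segment is contrasted with the aligned pair
    through [delta + d(x1, x2) - d(anchor, negative)]_h, and the regularizer
    is lam * ||W||_F^2. The source text does not spell out this exact form.
    """
    hinge = lambda v: max(0.0, v)
    v1, v2 = compose(W, x1), compose(W, x2)
    pos = sq_dist(v1, v2)
    loss = 0.0
    for n in neg_for_x1:   # segments not aligned to x1
        loss += hinge(delta + pos - sq_dist(v1, compose(W, n)))
    for n in neg_for_x2:   # segments not aligned to x2
        loss += hinge(delta + pos - sq_dist(compose(W, n), v2))
    return loss + lam * float(np.sum(W ** 2))

# Toy usage with a random embedding matrix (8 dimensions, 20 word types).
rng = np.random.default_rng(0)
W_toy = rng.normal(size=(8, 20))
toy_loss = hinge_objective(W_toy, [0, 1, 2], [10, 11], [[12, 13]], [[3, 4]])
```

The loss is zero only when every negative segment is farther from its anchor than the aligned segment by at least the margin, which is the behaviour the hinge function is meant to enforce.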

Evaluating Sentence Representations
Following Wieting et al. (2016), the models above are evaluated on the four Semantic Textual Similarity (STS) datasets (Agirre et al., 2012; Agirre et al., 2013; Agirre et al., 2014; Agirre et al., 2015), which provide pairs of English sentences from different domains (e.g., Tweets, news, web forums, image captions), annotated with human judgments of similarity on a 0 to 5 scale. Systems have to output a similarity score for each pair, and are evaluated using the Pearson correlation between gold and predicted scores.
The Sentences Involving Compositional Knowledge (SICK) test set (Marelli et al., 2014) provides a complementary evaluation. It consists of sentence pairs annotated with semantic relatedness scores. While STS examples were simply drawn from existing NLP datasets, SICK examples were constructed to avoid non-compositional phenomena such as multiword expressions, named entities and world knowledge.

Experimental Conditions
At training time we learn word embeddings for each combination of objective (Section 2.2) and type of training examples (Table 2), using modified versions of the open-source implementations for $J_{bi}$ (Hermann and Blunsom, 2014) and $J_{pa}$ (Wieting et al., 2016). This results in six model configurations. Each was trained for 10 epochs using tuned hyperparameters.
Tuning results confirmed the importance of negative sampling and distance function in our models: in $J_{bi}$, increasing $k$ consistently helps the bilingual models, whereas the correlation scores of the monolingual models degrade for $k > 10$. In $J_{pa}$, MAX always outperforms MIX. Euclidean distance was consistently chosen for bilingual sentences and monolingual phrases, while cosine similarity was chosen for bilingual phrases.
At test time we construct sentence-level embeddings by averaging the representations of words in each sentence, and compute cosine similarity to capture the similarity between sentences. Table 3 reports the Pearson correlation scores achieved for each approach and dataset.
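The test-time procedure described above can be sketched as follows. This is a minimal illustration rather than the evaluation script used for the reported results: the handling of out-of-vocabulary words (they are skipped), the zero vector for sentences with no known word, and the function names are assumptions made here.

```python
import numpy as np

def sentence_embedding(tokens, vectors, dim):
    """Average the embeddings of the tokens found in `vectors`.

    ASSUMPTION: unknown tokens are skipped; a sentence with no known
    token is mapped to the zero vector.
    """
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

def cosine(u, v):
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    return float(np.corrcoef(xs, ys)[0, 1])

def evaluate(pairs, gold, vectors, dim):
    """Score each sentence pair by cosine similarity, then correlate with gold."""
    predicted = [
        cosine(sentence_embedding(a, vectors, dim),
               sentence_embedding(b, vectors, dim))
        for a, b in pairs
    ]
    return pearson(predicted, gold)

# Toy usage with hypothetical two-dimensional embeddings.
toy_vectors = {"a": np.array([1.0, 0.0]), "b": np.array([0.0, 1.0])}
toy_pairs = [(["a"], ["a"]), (["a"], ["a", "b"]), (["a"], ["b"])]
toy_score = evaluate(toy_pairs, [5.0, 3.0, 0.0], toy_vectors, dim=2)
```

Because Pearson correlation is invariant to linear rescaling, the cosine scores do not need to be mapped onto the range of the gold annotations before evaluation.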

Bilingual phrases yield the best models in controlled settings
Overall, the best representations are obtained using bilingual phrase pairs and the $J_{bi}$ objective. They outperform all other compositional models for all tasks, except for one subset of STS-2015.
The best objective for a given type of training example varies: $J_{pa}$ generally yields better results with monolingual phrases, while $J_{bi}$ performs better with bilingual examples. Bilingual phrases seem to benefit from a larger number of randomly selected negative samples and from using the Euclidean distance rather than cosine similarity. The best bilingual compositional representations are better than non-compositional GloVe embeddings (Pennington et al., 2014), but worse than compositional Paragram embeddings (Wieting et al., 2016). However, Paragram initialization requires large amounts of text and human word similarity judgments for tuning, while our models were initialized randomly.

Table 3: Pearson correlation scores obtained on the English STS sets (with per-year averages) and on the semantic relatedness task (SICK). The left columns report results based on new representations learned in this work, while the two rightmost columns report reference results from prior work (Wieting et al., 2016). Column groups: Monolingual Phrases, Bilingual Phrases, Bilingual Sentences, Reference.

Bilingual sentences vs. bilingual phrases
Why do bilingual phrases outperform the bilingual sentences they are extracted from? In this section, we verify that this is not explained by systematic biases in the distribution of training examples. First, Table 4 shows that bilingual sentences have the smallest ratios of undertrained words, and are therefore not penalized by rare words more than bilingual phrases.
Second, we see that the rankings are not biased by memorization of phrases seen during training. The ranking of models does not change when testing on unseen word sequences, as shown by SICK results with models trained using $J_{bi}$ on a filtered training set that contains none of the bigrams observed at test time (Table 5).
Third, the advantage of bilingual phrases over bilingual sentences is not due to the larger number of training examples. 1.9M (and even 1M) bilingual … (Table 6). Taken together, these additional results support our initial intuition that the main advantage of bilingual phrases over bilingual sentences is that phrase pairs have stronger semantic equivalence than sentence pairs, since phrase pairs are shorter and are constructed by identifying strongly aligned subsets of sentence pairs.
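The filtering used in the memorization check (a training set containing none of the bigrams observed at test time) could be implemented along the following lines. This is a sketch under assumptions: segments are taken to be pre-tokenized lists, only the English side of each training pair is compared against the test bigrams, and the function names are hypothetical.

```python
def bigrams(tokens):
    """Return the set of adjacent token pairs in a tokenized segment."""
    return set(zip(tokens, tokens[1:]))

def filter_unseen(train_pairs, test_sentences):
    """Keep only the training pairs that share no bigram with the test data.

    ASSUMPTION: each training pair is (english_tokens, other_tokens) and
    only the English side is checked against the English test sentences.
    """
    test_bigrams = set()
    for sentence in test_sentences:
        test_bigrams |= bigrams(sentence)
    kept = []
    for english, other in train_pairs:
        if bigrams(english).isdisjoint(test_bigrams):
            kept.append((english, other))
    return kept

# Toy usage: the first pair is dropped because ("b", "c") occurs at test time.
toy_train = [
    ("a b c".split(), "x y".split()),
    ("d e".split(), "z".split()),
]
toy_filtered = filter_unseen(toy_train, ["b c d".split()])
```

Single-token segments contain no bigram and therefore always survive this filter, which may or may not match the intended setup.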

Monolingual vs. bilingual phrases
Based on the analysis thus far, we hypothesize that paraphrase pairs with overlapping tokens make the compositional training objective less useful. Around 40% of the paraphrase training pairs differ only by one token. With Euclidean distance in the training objective, overlapping tokens cancel each other out of the composition term. For example, the pair $\langle$healthy and stable, healthy and steady$\rangle$ yields the compositional term $\|(\text{healthy} + \text{and} + \text{stable}) - (\text{healthy} + \text{and} + \text{steady})\|^2 = \|\text{stable} - \text{steady}\|^2$. In contrast, overlap cannot occur in the bilingual setting, and all words within bilingual phrases contribute to the compositional objective. Furthermore, bilingual pairs provide a more explicit semantic signal, as translations can disambiguate polysemous words (Diab, 2004; Carpuat and Wu, 2007) and help discover synonyms by pivoting (Yao et al., 2012).
All these factors might explain why training with bilingual phrases can take advantage of a larger number of negative samples $k$.
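The cancellation of overlapping tokens under additive composition and Euclidean distance can be checked numerically. The sketch below uses random five-dimensional vectors as hypothetical embeddings; the vocabulary, dimensionality and function names are assumptions chosen only for illustration.

```python
import numpy as np

def compose(E, phrase):
    """Additive composition of a whitespace-tokenized phrase."""
    return sum(E[w] for w in phrase.split())

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    diff = u - v
    return float(diff @ diff)

# Hypothetical random embeddings for the four words of the example.
rng = np.random.default_rng(0)
E = {w: rng.normal(size=5) for w in ["healthy", "and", "stable", "steady"]}

full = sq_dist(compose(E, "healthy and stable"), compose(E, "healthy and steady"))
reduced = sq_dist(E["stable"], E["steady"])

# Changing a shared word leaves the distance untouched, so that word
# receives no training signal from this positive pair.
E_shifted = dict(E)
E_shifted["healthy"] = E["healthy"] + 10.0
shifted = sq_dist(compose(E_shifted, "healthy and stable"),
                  compose(E_shifted, "healthy and steady"))
```

Here `full`, `reduced` and `shifted` agree up to floating-point error, which is the sense in which only the differing tokens contribute to the composition term of such a pair.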

Conclusion
We conducted the first evaluation of compositional representations learned using bilingual supervision … Phrase and sentence representations are constructed by composing word representations using a simple additive composition function. We considered two training objectives that encourage the resulting representations to distinguish English-Spanish segment pairs that are semantically equivalent from those that are not. The resulting English sentence representations consistently outperform compositional models trained to detect monolingual paraphrases on five different English semantic textual similarity tasks from SemEval.
Bilingual phrase pairs are consistently the best evidence of semantic equivalence in our experiments. They yield better results than the sentence pairs they are extracted from, despite the noise introduced by the automatic extraction process.
Furthermore, the composed representations outperform non-compositional word representations derived from monolingual co-occurrence statistics. While the sizes of monolingual and bilingual corpora are not directly comparable, it is remarkable that representations learned with only 500k bilingual phrase pairs outperform GloVe embeddings trained on 840B tokens.
Since our best models still underperform Paragram vectors, which require a more sophisticated initialization process, we will turn to improving our initialization strategies in future work. Nevertheless, current results provide further evidence of the usefulness of compositional text representations, even with a simple bag-of-words additive composition function, and of bilingual translation pairs as a strong signal of semantic equivalence.