Retrofitting Contextualized Word Embeddings with Paraphrases

Contextualized word embeddings, such as ELMo, provide meaningful representations for words and their contexts. They have been shown to have a great impact on downstream applications. However, we observe that the contextualized embeddings of a word might change drastically when its contexts are paraphrased. As these embeddings are over-sensitive to the context, the downstream model may make different predictions when the input sentence is paraphrased. To address this issue, we propose a post-processing approach to retrofit the embedding with paraphrases. Our method learns an orthogonal transformation on the input space of the contextualized word embedding model, which seeks to minimize the variance of word representations on paraphrased contexts. Experiments show that the proposed method significantly improves ELMo on various sentence classification and inference tasks.


Introduction
Contextualized word embeddings have been shown to be useful for a variety of downstream tasks (Peters et al., 2018, 2017; McCann et al., 2017). Unlike traditional word embeddings that represent words with fixed vectors, these embedding models encode both words and their contexts and generate context-specific representations. While contextualized embeddings are useful, we observe that a language model-based embedding model, ELMo (Peters et al., 2018), cannot accurately capture the semantic equivalence of contexts. Specifically, in cases where the contexts of a word have equivalent or similar meanings but differ in sentence formation or word order, ELMo may assign very different representations to the word. Table 1 shows two examples, where ELMo generates very different representations for the boldfaced words under semantically equivalent contexts. Quantitatively, for 28.3% of the shared words in the paraphrase sentence pairs on the MRPC corpus (Dolan et al., 2004), the embedding distance is larger than the average distance between good and bad in random contexts, and 41.5% of those exceed the distance between large and small. As a result, the downstream model is not robust to paraphrasing and its performance is hindered.
Table 1: L2 and cosine distances between embeddings of boldfaced words. The distance between the shared word in the paraphrases is even greater than the distance between large and small in random contexts.
* Both authors contributed equally to this work.
Infusing the model with the ability to capture the semantic equivalence no doubt benefits semantic-oriented downstream tasks. Yet, finding an effective solution presents key challenges. First, the solution inevitably requires the embedding model to effectively identify paraphrased contexts. On top of that, the model needs to minimize the difference of a word's representations on paraphrased contexts, without compromising the varying representations on unrelated contexts. Moreover, the long training time prevents us from redesigning the learning objectives of contextualized embeddings and retraining the model.
To address these challenges, we propose a simple and effective paraphrase-aware retrofitting (PAR) method that is applicable to arbitrary pretrained contextualized embeddings. In particular, PAR prepends an orthogonal transformation layer to a contextualized embedding model. Without retraining the parameters of an existing model, PAR learns the transformation to minimize the difference of the contextualized representations of the shared word in paraphrased contexts, while differentiating between those in other contexts. We apply PAR to retrofit ELMo (Peters et al., 2018) and show that the resulting embeddings provide more robust contextualized word representations as desired, which in turn lead to significant improvements on various sentence classification and inference tasks.

Related Work
Contextualized word embedding models have been studied by a series of recent research efforts, where different types of pre-trained language models are employed to capture the context information. CoVe (McCann et al., 2017) trains a neural machine translation model and extracts representations of input sentences from the source language encoder. ELMo (Peters et al., 2018) pretrains LSTM-based language models from both directions and combines the vectors to construct contextualized word representations. Recent studies substitute LSTMs with Transformers (Radford et al., 2018, 2019; Devlin et al., 2019). As shown in these studies, contextualized word embeddings perform well on downstream tasks at the cost of extensive parameter complexity and a long training process on large corpora (Strubell et al., 2019).
Retrofitting methods have been used to incorporate semantic knowledge from external resources into word embeddings (Faruqui et al., 2015; Yu et al., 2016; Glavaš and Vulić, 2018). These techniques are shown to improve the characterization of word relatedness and the compositionality of word representations. To the best of our knowledge, none of the previous approaches has been applied to contextualized word embeddings.

Paraphrase-Aware Retrofitting
Our method, illustrated in Figure 1, integrates the constraint of the paraphrased context into the contextualized word embeddings by learning the orthogonal transformation on the input space.

Contextualized Word Embeddings
We use $S = (w_1, w_2, \dots, w_l)$ to denote a sequence of words of length $l$, where each word $w$ belongs to the vocabulary $V$. We use boldfaced $\mathbf{w} \in \mathbb{R}^k$ to denote a $k$-dimensional input word embedding, which can be pre-trained or derived from a character-level encoder (e.g., the character-level CNN used in ELMo (Peters et al., 2018)). A contextualized embedding model $E$ takes the input vectors of the words in $S$ and computes the context-specific representation of each word. The representation of word $w$ specific to the context $S$ is denoted as $E(\mathbf{w}, S)$.

Paraphrase-aware Retrofitting
PAR learns an orthogonal transformation $M \in \mathbb{R}^{k \times k}$ that reshapes the input representations into a space where the contextualized embedding vectors of a word in paraphrased contexts are collocated, while those in unrelated contexts are differentiated. Specifically, given two contexts $S_1$ and $S_2$ that both contain a shared word $w$, the contextual difference of an input representation $\mathbf{w}$ is defined by the L2 distance,
$$d_{S_1,S_2}(\mathbf{w}) = \lVert E(\mathbf{w}, S_1) - E(\mathbf{w}, S_2) \rVert_2 .$$
Let $P$ be the set of paraphrases on the training corpus. We minimize the following hinge loss ($L_H$):
$$L_H = \sum_{(S_1,S_2) \in P} \; \sum_{w \in S_1 \cap S_2} \left[ d_{S_1,S_2}(M\mathbf{w}) + \gamma - d_{\hat{S}_1,\hat{S}_2}(M\mathbf{w}) \right]_+ .$$
Here $(S_1, S_2) \in P$ is a pair of paraphrases, and $(\hat{S}_1, \hat{S}_2) \notin P$ is a negative sample generated by randomly substituting either $S_1$ or $S_2$ with another sentence in the dataset that contains $w$. $\gamma > 0$ is a hyperparameter representing the margin. The operator $[x]_+$ denotes $\max(x, 0)$.
The orthogonalization is realized by the following regularization term:
$$L_O = \lVert I - M^\top M \rVert_F ,$$
where $\lVert \cdot \rVert_F$ denotes the Frobenius norm and $I$ is an identity matrix. The learning objective of PAR is then $L = L_H + \lambda L_O$ with a positive hyperparameter $\lambda$.
Orthogonalizing M has two important effects: (i) It preserves the word similarity captured by the original input word representation (Rothe et al., 2016); (ii) It prevents the model from converging to a trivial solution where all word representations collapse to the same embedding vector.
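The objective can be illustrated with a small pure-Python sketch. It is not the authors' implementation: it assumes the hinge term has the form [d_pos + γ − d_neg]_+ and the regularizer is the Frobenius norm of I − MᵀM, and `toy_encoder` is a hypothetical stand-in for the contextualized model E (a real setup would call a frozen ELMo). All vectors are made-up toy values.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def l2_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def toy_encoder(word_vec, sentence_vecs):
    """Hypothetical stand-in for E(w, S): squashes the word vector plus the sentence mean."""
    n = len(sentence_vecs)
    mean = [sum(vec[i] for vec in sentence_vecs) / n for i in range(len(word_vec))]
    return [math.tanh(w + m) for w, m in zip(word_vec, mean)]

def contextual_distance(M, word_vec, sent_a, sent_b, encoder=toy_encoder):
    """L2 distance between the encodings of a shared word in two contexts,
    with M applied to every input vector before encoding."""
    w = matvec(M, word_vec)
    a = [matvec(M, v) for v in sent_a]
    b = [matvec(M, v) for v in sent_b]
    return l2_distance(encoder(w, a), encoder(w, b))

def hinge(pos_dist, neg_dist, gamma):
    """[d_pos + gamma - d_neg]_+ : zero once the negative pair is gamma farther apart."""
    return max(pos_dist + gamma - neg_dist, 0.0)

def orthogonality_penalty(M):
    """Frobenius norm of I - M^T M; zero exactly when M is orthogonal."""
    k = len(M)
    total = 0.0
    for i in range(k):
        for j in range(k):
            mtm_ij = sum(M[r][i] * M[r][j] for r in range(k))
            target = 1.0 if i == j else 0.0
            total += (target - mtm_ij) ** 2
    return math.sqrt(total)

def par_loss(M, examples, gamma, lam):
    """Hinge loss summed over examples plus lam times the orthogonality penalty.
    Each example is (word_vec, S1, S2, S1_neg, S2_neg)."""
    l_h = 0.0
    for word_vec, s1, s2, s1_neg, s2_neg in examples:
        pos = contextual_distance(M, word_vec, s1, s2)
        neg = contextual_distance(M, word_vec, s1_neg, s2_neg)
        l_h += hinge(pos, neg, gamma)
    return l_h + lam * orthogonality_penalty(M)

# Toy data: M starts at the identity, as in the experimental setup.
identity = [[1.0, 0.0], [0.0, 1.0]]
theta = math.pi / 6
rotation = [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta), math.cos(theta)]]
word = [0.5, -0.2]
s1 = [word, [0.1, 0.3]]         # context 1 containing the shared word
s2 = [[0.2, 0.2], word]         # its paraphrase
s_neg = [word, [-1.0, 2.0]]     # unrelated sentence that also contains the word
examples = [(word, s1, s2, s1, s_neg)]
loss_at_identity = par_loss(identity, examples, gamma=3.0, lam=1.0)
```

The penalty vanishes for the identity and for any rotation, so the regularizer only constrains M to stay orthogonal; it is the hinge term that moves M toward a space where paraphrased contexts agree.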

Experiment
Our method can be integrated with any contextualized word embedding model. In our experiments, we apply PAR to ELMo (Peters et al., 2018) and evaluate the quality of the retrofitted ELMo on a broad range of sentence-level tasks and the adversarial SQuAD corpus.

Experimental Configuration
We use the officially released 3-layer ELMo (original), which has 93.6 million parameters and is trained on the 1 Billion Word Benchmark. We retrofit ELMo with PAR on the training sets of three paraphrase datasets: (i) MRPC contains 2,753 paraphrase pairs; (ii) Sampled Quora contains 20,000 randomly sampled paraphrased question pairs (Iyer et al., 2017); and (iii) the PAN training set (Madnani et al., 2012) contains 5,000 paraphrase pairs.
The orthogonal transformation M is initialized as an identity matrix. In our preliminary experiments, we observed that the SGD optimizer is more stable and less likely to quickly overfit the training set than optimizers with adaptive learning rates (Reddi et al., 2018; Kingma and Ba, 2015). Therefore, we use SGD with a learning rate of 0.005 and a batch size of 128. To determine the terminating condition, we train a Multi-Layer Perceptron (MLP) classifier on the same paraphrase training set and terminate training based on the paraphrase identification performance on a set of held-out paraphrases. Each sentence in the dataset is represented by the average of its word embeddings. λ is selected from {0.1, 0.5, 1, 2} and γ from {1, 2, 3, 4} based on the validation set. The best margin γ and number of epochs ζ found by early stopping are {γ = 3, ζ = 20} on MRPC, {γ = 2, ζ = 14} on PAN, and {γ = 3, ζ = 10} on Sampled Quora, with λ = 1 in all settings.
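The selection procedure can be sketched as a small grid search. This is an illustration only, not the authors' training script: `validation_curve` is a hypothetical callback that would retrofit with the given λ and γ and return the held-out paraphrase identification score after each epoch, and picking the epoch with the highest held-out score is an assumed reading of the early-stopping rule. `fake_curve` returns made-up numbers so the sketch runs on its own.

```python
from itertools import product

LAMBDAS = (0.1, 0.5, 1, 2)   # candidate values of lambda listed in the text
GAMMAS = (1, 2, 3, 4)        # candidate margins gamma listed in the text

def select_configuration(validation_curve):
    """Return the (lambda, gamma, epochs) setting with the best held-out score.

    validation_curve(lam, gamma) is a hypothetical callback returning one
    held-out score per training epoch.
    """
    best = None
    for lam, gamma in product(LAMBDAS, GAMMAS):
        scores = validation_curve(lam, gamma)
        for epoch, score in enumerate(scores, start=1):
            if best is None or score > best["score"]:
                best = {"lambda": lam, "gamma": gamma, "epochs": epoch, "score": score}
    return best

def fake_curve(lam, gamma):
    """Made-up validation scores that peak at epoch 3, highest for lambda=1, gamma=3."""
    peak = 0.70 + (0.05 if (lam, gamma) == (1, 3) else 0.0)
    return [peak - 0.01 * abs(epoch - 3) for epoch in range(1, 6)]

best = select_configuration(fake_curve)
```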

Evaluation
We use the SentEval framework (Conneau and Kiela, 2018) to evaluate the sentence embeddings on a wide range of sentence-level tasks. We consider two baseline models: (1) ELMo (all layers) constructs a 3,072-dimensional sentence embedding by averaging the hidden states of all the language model layers. (2) ELMo (top layers) encodes a sentence as a 1,024-dimensional vector by averaging the representations of the top layer. We compare these baselines with four variants of PAR built upon ELMo (all layers) that are trained on different paraphrase corpora.
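The two baseline sentence encoders can be sketched as mean pooling over tokens. The sketch assumes that the "all layers" variant averages each layer over the tokens and concatenates the per-layer averages, which is an assumption about how the layers are combined; the tiny 4-dimensional hidden states are made-up values standing in for ELMo outputs.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

def sentence_embedding(layers, top_only=False):
    """Build a sentence vector from per-layer hidden states.

    layers[l][t] is the hidden state of token t at layer l; the last entry
    is treated as the top layer.
    """
    if top_only:
        return mean_pool(layers[-1])
    pooled = []
    for layer in layers:
        pooled.extend(mean_pool(layer))  # concatenate the per-layer averages
    return pooled

# Toy input: 3 layers, 2 tokens, hidden size 4.
toy_layers = [
    [[1.0, 0.0, 0.0, 0.0], [3.0, 0.0, 0.0, 0.0]],
    [[0.0, 2.0, 0.0, 0.0], [0.0, 4.0, 0.0, 0.0]],
    [[0.0, 0.0, 1.0, 1.0], [0.0, 0.0, 3.0, 5.0]],
]
all_layers_vec = sentence_embedding(toy_layers)
top_layer_vec = sentence_embedding(toy_layers, top_only=True)
```

With this layout, the "all layers" vector is as long as the number of layers times the hidden size, while the "top layers" vector keeps the hidden size of a single layer.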

Task Descriptions
Sentence classification tasks. We evaluate the sentence embeddings on four sentence classification tasks: two sentiment analysis tasks (MR (Pang and Lee, 2004) and SST-2 (Socher et al., 2013)), product reviews (CR (Hu and Liu, 2004)), and opinion polarity (MPQA (Wiebe et al., 2005)). These tasks are all binary classification tasks. We employ an MLP with a single hidden layer of 50 neurons to train the classifier, using a batch size of 64 and the Adam optimizer.
Sentence inference tasks. We consider two sentence inference tasks: paraphrase identification on MRPC (Dolan et al., 2004) and textual entailment on SICK-E (Marelli et al., 2014). MRPC consists of pairs of sentences, where the model aims to classify whether two sentences are semantically equivalent. The SICK dataset contains 10,000 English sentence pairs annotated for relatedness in meaning and entailment. The aim of SICK-E is to detect the relations of entailment, contradiction, and neutral between the two sentences. As for the sentence classification tasks, we apply an MLP with the same hyperparameters to conduct the classification.
Semantic textual similarity tasks. Semantic Textual Similarity (STS-15 (Agirre et al., 2015) and STS-16 (Agirre et al., 2016)) measures the degree of semantic relatedness of two sentences based on human-labeled scores from 0 to 5. We report the Pearson correlation between the cosine similarity of two sentence representations and the normalized human-labeled scores.
Semantic relatedness tasks. The semantic relatedness tasks include SICK-R (Marelli et al., 2014) and the STS Benchmark dataset (Cer et al., 2017), which comprise pairs of sentences annotated with semantic scores between 0 and 5. The goal of the tasks is to measure the degree of semantic relatedness between two sentences. We train a tree-structured LSTM (Tai et al., 2015) to predict the probability distribution of relatedness scores.
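For the semantic textual similarity tasks, the reported score can be sketched as follows. This is a minimal illustration rather than the SentEval implementation: dividing the human scores by 5 is an assumed normalization, and the sentence vectors and scores are made-up toy values.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

def sts_score(sentence_pairs, human_scores):
    """Correlate embedding cosine similarities with human scores on a 0-5 scale."""
    sims = [cosine_similarity(a, b) for a, b in sentence_pairs]
    gold = [score / 5.0 for score in human_scores]  # assumed normalization to [0, 1]
    return pearson(sims, gold)

# Toy sentence vectors and made-up human scores.
toy_pairs = [
    ([1.0, 0.0], [1.0, 0.0]),   # identical direction
    ([1.0, 0.0], [1.0, 1.0]),   # partly related
    ([1.0, 0.0], [0.0, 1.0]),   # orthogonal
]
toy_human = [5.0, 3.0, 0.0]
rho = sts_score(toy_pairs, toy_human)
```

Because Pearson correlation is invariant to rescaling, the particular normalization of the human scores does not change the reported value.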
Adversarial SQuAD. The Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016) is a machine comprehension dataset containing 107,785 human-generated reading comprehension questions annotated on Wikipedia articles. Adversarial SQuAD (Jia and Liang, 2017) appends adversarial sentences to the passages in the SQuAD dataset to study the robustness of the model. We conduct evaluations on two Adversarial SQuAD datasets: AddOneSent, which adds a random human-approved sentence, and AddSent, which adds grammatical sentences that look similar to the question. We train the Bi-Directional Attention Flow (BiDAF) network (Seo et al., 2017) with self-attention and ELMo embeddings on the SQuAD dataset and test it on the adversarial SQuAD datasets.

Result Analysis
The results reported in Table 2 show that PAR leads to a 2% to 4% improvement in accuracy on sentence classification and sentence inference tasks. It leads to a 0.03 to 0.04 improvement in Pearson correlation (ρ) on semantic relatedness and textual similarity tasks. The improvements on sentence similarity and semantic relatedness tasks show that ELMo-PAR is more stable under semantics-preserving modifications but more sensitive to subtle yet semantics-changing perturbations. The PAR model trained on the combined corpus (PAN+MRPC+Sampled Quora) achieves the best improvement across all these tasks, showing that the model benefits from a larger paraphrase corpus.
Besides sentence-level tasks, Table 3 shows that the proposed PAR method notably improves the performance of a downstream question-answering task. For AddSent, ELMo-PAR achieves 40.8% in EM and 47.1% in F1. For AddOneSent, it boosts EM to 51.6% and F1 to 57.9%, which shows that the proposed PAR method enhances the robustness of the downstream model combined with ELMo.

Case Study
Shared word distances. We compute the average embedding distance of shared words in paraphrase and non-paraphrase sentence pairs from the test sets of MRPC, PAN, and Quora. Results are listed in Table 4. Table 5 shows the ELMo-PAR embedding distance for the shared words in the examples in Table 1. Our model effectively minimizes the embedding distance of the shared words in paraphrased contexts and maximizes this distance in non-paraphrased contexts.
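The shared-word statistic can be sketched as an average of pairwise distances. The sketch assumes cosine distance means one minus cosine similarity, which is a common convention rather than something stated here, and the vectors are made-up toy values rather than ELMo or ELMo-PAR outputs.

```python
import math

def l2_distance(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """One minus the cosine similarity of two non-zero vectors (assumed convention)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def mean_shared_word_distance(pairs, distance=l2_distance):
    """Average distance over (embedding in context 1, embedding in context 2) pairs,
    where each pair holds the two contextualized vectors of one shared word."""
    return sum(distance(u, v) for u, v in pairs) / len(pairs)

# Toy embeddings of shared words in paraphrase and non-paraphrase sentence pairs.
paraphrase_pairs = [([1.0, 0.0], [1.0, 0.0]), ([0.0, 2.0], [0.0, 1.0])]
non_paraphrase_pairs = [([1.0, 0.0], [0.0, 1.0]), ([3.0, 0.0], [0.0, 4.0])]

avg_paraphrase = mean_shared_word_distance(paraphrase_pairs)
avg_non_paraphrase = mean_shared_word_distance(non_paraphrase_pairs)
avg_paraphrase_cos = mean_shared_word_distance(paraphrase_pairs, cosine_distance)
```

A retrofitted model that behaves as intended would show a small average for paraphrase pairs and a larger one for non-paraphrase pairs, as in this toy data.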

Conclusion
We propose a method for retrofitting contextualized word embeddings, which leverages semantic equivalence information from paraphrases. PAR learns an orthogonal transformation on the input space of an existing model by minimizing the difference between the contextualized representations of shared words in paraphrased contexts, without compromising the varying representations in non-paraphrased contexts. We demonstrate the effectiveness of this method applied to ELMo on a wide selection of semantic tasks. We seek to extend the use of PAR to other contextualized embeddings (Devlin et al., 2019; McCann et al., 2017) in future work.