Retrofitting Sense-Specific Word Vectors Using Parallel Text

Jauhar et al. (2015) recently proposed to learn sense-specific word representations by "retrofitting" standard distributional word representations to an existing ontology. We observe that this approach does not require an ontology


Introduction
Vector space models (VSMs) provide a powerful tool for representing word meanings and modeling the relations between them. While these models have demonstrated impressive success in capturing some aspects of word meaning (Landauer and Dumais, 1997; Turney et al., 2010; Mikolov et al., 2013; Levy et al., 2014), they generally fail to capture the fact that single word forms often have multiple meanings. This can lead to counterintuitive results: for example, it should be possible for the nearest word to rock to be stone in everyday usage, punk in discussions of music, and crack (cocaine) in discussions about drugs.
In a recent paper, Jauhar et al. (2015) introduce a method for "retrofitting" generic word vectors to create sense-specific vectors using the WordNet semantic lexicon (Miller, 1995). From WordNet, they create a graph structure comprising two classes of relations: form-based relations between each word form and its respective senses, and meaning-based relations between word senses with similar meanings. This graph structure is then used to transform a traditional VSM into an enriched VSM, where each point in the space represents a word sense, rather than a word form. This approach is appealing because, unlike prior sense-aware representations, its senses are defined categories in a semantic lexicon rather than clusters induced from raw text (Reisinger and Mooney, 2010; Huang et al., 2012; Neelakantan et al., 2015; Tian et al., 2014), and the method does not require performing word sense disambiguation (Guo et al., 2014).
In this paper, we observe that the crucial meaning relationships in the Jauhar et al. retrofitting process (the word sense graph) can be inferred from another widely available resource: bilingual parallel text. This observation is grounded in a well-established tradition of using cross-language correspondences as a form of sense annotation (Gale et al., 1992; Diab and Resnik, 2002; Ng et al., 2003; Carpuat and Wu, 2007; Lefever and Hoste, 2010, and others). Using parallel text to define sense distinctions sidesteps the persistent difficulty of identifying a single correct sense partitioning based on human intuition, and avoids large investments in manual curation or annotation.
We use parallel text and word alignment to infer both the word sense identities and the inter-sense relations required for the sense graph, and apply the approach of Jauhar et al. to retrofit existing word vector representations, creating a sense-based vector space in which bilingual correspondences define the word senses. When evaluated on semantic judgment tasks, the vector spaces derived from this graph perform comparably to and sometimes better than the WordNet-based space of Jauhar et al., indicating that parallel text is a viable alternative to WordNet for defining graph structure. Combining the output of parallel-data-based and WordNet-based retrofitted VSMs consistently improves performance, suggesting that the different sense graph methods make complementary contributions to this sense-specific retrofitting process.

Model
Retrofitting. The technique introduced by Jauhar et al. (2015) is based on what we will call a sense graph, which we formulate as follows. Nodes in the sense graph comprise the words $w_i$ in a vocabulary $W$ together with the senses $s_{ij}$ for those words. Labeled, undirected edges include word-sense edges $\langle w_i, s_{ij} \rangle$, which connect each word to all of its possible senses, and sense-sense edges $\langle s_{ij}, s_{i'j'} \rangle$ labeled with a meaning relationship $r$ that holds between the two senses. Jauhar et al. use WordNet to define their sense graph. Synsets in the WordNet ontology define the sense nodes, a word-sense edge exists between any word and every synset to which it belongs, and WordNet's synset-to-synset relations of synonymy, hypernymy, and hyponymy define the sense-sense edges. Figure 1 illustrates a fragment of a WordNet-based sense graph, suppressing edge labels.
Adopting Jauhar et al.'s notation, the original vector space to be retrofitted is defined by the original word-form vectors $\hat{u}_i$ for each $w_i \in W$, and the goal is to infer a set $V$ of sense-specific vectors $v_{ij}$ corresponding to each sense $s_{ij}$. Jauhar et al. use the sense graph to define a Markov network with variables for all word vectors and sense vectors, within which each word's vector $\hat{u}_i$ is connected to all of its sense vectors $v_{ij}$, and the variables for sense vectors $v_{ij}$ and $v_{i'j'}$ are connected iff the corresponding senses are connected in the sense graph.
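Retrofitting then consists in optimizing the following objective, where $\alpha$ is a sense-agnostic weight and $\beta_r$ are relation-specific weights for types of relations between senses:

$$C(V) = \arg\min_{V} \sum_{i-ij} \alpha \, \lVert \hat{u}_i - v_{ij} \rVert^2 + \sum_{ij-i'j'} \beta_r \, \lVert v_{ij} - v_{i'j'} \rVert^2 \qquad (1)$$

Here the first sum ranges over word-sense edges and the second over sense-sense edges. The objective encourages similarity between a word's vector and its senses' vectors (first term), as well as similarity between the vectors for senses that are related in the sense graph (second term).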
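As a rough illustration of how such an objective can be minimized, the sketch below evaluates the two terms and applies a coordinate-wise update that sets each sense vector to the weighted mean of its word vector and its neighbouring sense vectors. The update rule, the function names, and the data layout (`word_of`, `edges`) are assumptions made for this example and are not necessarily the inference procedure used by Jauhar et al.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def objective(word_vecs: Dict[str, Vector], sense_vecs: Dict[str, Vector],
              word_of: Dict[str, str], edges: Dict[Tuple[str, str], float],
              alpha: float = 1.0) -> float:
    """Word-sense term plus sense-sense term; each undirected edge counted once."""
    total = sum(alpha * sq_dist(word_vecs[word_of[s]], v)
                for s, v in sense_vecs.items())
    total += sum(beta * sq_dist(sense_vecs[s], sense_vecs[t])
                 for (s, t), beta in edges.items())
    return total


def retrofit(word_vecs: Dict[str, Vector], word_of: Dict[str, str],
             edges: Dict[Tuple[str, str], float],
             alpha: float = 1.0, iterations: int = 10) -> Dict[str, Vector]:
    """Hypothetical coordinate-wise minimizer (an assumption of this sketch)."""
    # Start every sense vector at its word-form vector.
    sense_vecs = {s: list(word_vecs[w]) for s, w in word_of.items()}
    # Undirected adjacency list: sense -> [(neighbour sense, beta), ...].
    neighbours: Dict[str, List[Tuple[str, float]]] = {s: [] for s in word_of}
    for (s, t), beta in edges.items():
        neighbours[s].append((t, beta))
        neighbours[t].append((s, beta))
    for _ in range(iterations):
        for s, w in word_of.items():
            denom = alpha + sum(beta for _, beta in neighbours[s])
            numer = [alpha * x for x in word_vecs[w]]
            for t, beta in neighbours[s]:
                numer = [n + beta * y for n, y in zip(numer, sense_vecs[t])]
            # Closed-form minimizer for this sense with all others held fixed.
            sense_vecs[s] = [n / denom for n in numer]
    return sense_vecs
```

Each update is the exact minimizer of the objective with respect to one sense vector, so repeated sweeps cannot increase the objective value.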
Defining a sense graph from parallel text. Our key observation is that, although Jauhar et al. (2015) assume their sense graph to be an ontology, this graph can be based on any inventory of word-sense and sense-sense relationships. In particular, given a parallel corpus, we can follow the tradition of translation-as-sense-annotation: the senses of an English word type can be defined by different possible translations of that word in another language.
Operationalizing this observation is straightforward, given a word-aligned parallel corpus. If English word form $e_i$ is aligned with Chinese word form $c_j$, then $e_i(c_j)$ is a sense of $e_i$ in the sense graph, and there is a word-sense edge $\langle e_i, e_i(c_j) \rangle$. Edges signifying a meaning relation are drawn between sense nodes if those senses are defined by the same translation word. For instance, English senses swear(发誓) and vow(发誓) both arise via alignment to 发誓 (fashi), so a sense-sense edge will be drawn between these two sense nodes. See Figure 2 for illustration.
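The construction just described can be sketched in a few lines. The function name `build_sense_graph` and the representation of a sense as an (English word, Chinese word) tuple are choices made for this illustration only; the input is assumed to be a list of alignment links that have already been extracted and filtered.

```python
from collections import defaultdict
from itertools import combinations


def build_sense_graph(aligned_pairs):
    """Build word-sense and sense-sense edges from alignment links.

    aligned_pairs: iterable of (english_word, chinese_word) tuples.
    A sense e_i(c_j) is represented here as the tuple (e_i, c_j).
    """
    word_sense_edges = set()
    senses_by_translation = defaultdict(set)
    for en, zh in aligned_pairs:
        sense = (en, zh)
        word_sense_edges.add((en, sense))
        senses_by_translation[zh].add(sense)

    # Senses defined by the same translation word are linked to each other.
    sense_sense_edges = set()
    for senses in senses_by_translation.values():
        for a, b in combinations(sorted(senses), 2):
            sense_sense_edges.add((a, b))
    return word_sense_edges, sense_sense_edges
```

With links for swear and vow that share the translation 发誓, the function returns one word-sense edge per link and a single sense-sense edge joining swear(发誓) and vow(发誓).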

Evaluation
Tasks. We evaluate on both the synonym selection and word similarity rating tasks used by Jauhar et al. Synonym selection nicely demonstrates the advantages afforded by sense partitioning: if we believe that spin means "make up a story", then we are not likely to perform well on a question in which the correct synonym is twirl. Word similarity rating, on the other hand, is a classic test of the extent to which vector representations simulate human intuitions of word relations in general.
For synonym selection, we follow Jauhar et al. in testing with ESL-50 (Turney, 2001), RD-300 (Jarmasz and Szpakowicz, 2004), and TOEFL-80 (Landauer and Dumais, 1997), using maxSim for multi-[...] (Rubenstein and Goodenough, 1965), MC-30 (Miller and Charles, 1991), and the designated test subset (1000 items) of MEN-3k (Bruni et al., 2014), using avgSim (Jauhar et al., 2015, eq. 8) as the similarity rating, and evaluating model ratings against human similarity ratings via Spearman's rank correlation coefficient (ρ).
Initial word representations. We use the word2vec (Mikolov et al., 2013) skip-gram architecture to train 80-dimensional word vectors (in keeping with Jauhar et al.), based on evidence that this model shows consistently strong performance on a wide array of tasks (Levy et al., 2015). Training is on ukWaC (Ferraresi et al., 2008), a diverse 2B-word web corpus.
Sense-graph construction from parallel text. To construct the sense graph per Section 2, we use ∼5.8M lines of segmented Chinese-English parallel text from the DARPA BOLT project and the Broadcast Conversation subset of the segmented Chinese-English parallel data in the OntoNotes corpus (Weischedel et al., 2013). We perform word alignment with the Berkeley aligner (Liang et al., 2006). We filter out noisy alignments using the G-test statistic (Dunning, 1993), with a threshold selected during tuning on a development set.
We set $\alpha$ (see Equation 1) to 1.0. Each sense-sense edge $\langle e_i(c_j), e_{i'}(c_j) \rangle$ has an individual weight $0 < \beta_r \le 1$, computed by obtaining the G-test statistic for the alignment of $e_i$ with $c_j$ and for the alignment of $e_{i'}$ with $c_j$, running these values through a logistic function, and averaging. Parameters for these computations, as well as the G-test statistic threshold below which we filtered out noisy alignments, were selected during tuning on the development set.
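To make the weighting scheme concrete, the following sketch computes a G-test (log-likelihood ratio) statistic from a 2x2 table of alignment counts, maps it through a logistic function, and averages the two values for an edge. The contingency-table layout, the `midpoint` and `steepness` parameters, and their default values are assumptions for illustration; the tuned parameter values are not given in the text.

```python
import math


def g_test(k11: float, k12: float, k21: float, k22: float) -> float:
    """G statistic for a 2x2 contingency table.

    Assumed layout: k11 = links of e with c, k12 = links of e with other
    words, k21 = links of c with other words, k22 = all remaining links.
    """
    total = k11 + k12 + k21 + k22
    row = (k11 + k12, k21 + k22)
    col = (k11 + k21, k12 + k22)
    cells = ((k11, row[0], col[0]), (k12, row[0], col[1]),
             (k21, row[1], col[0]), (k22, row[1], col[1]))
    g = 0.0
    for observed, r, c in cells:
        if observed > 0:  # empty cells contribute nothing to the sum
            expected = r * c / total
            g += observed * math.log(observed / expected)
    return 2.0 * g


def logistic(x: float, midpoint: float, steepness: float) -> float:
    """Squash a non-negative G statistic into the interval (0, 1]."""
    return 1.0 / (1.0 + math.exp(-steepness * (x - midpoint)))


def edge_weight(g_a: float, g_b: float,
                midpoint: float = 50.0, steepness: float = 0.1) -> float:
    """Average of the two squashed statistics (hypothetical parameter values)."""
    return 0.5 * (logistic(g_a, midpoint, steepness)
                  + logistic(g_b, midpoint, steepness))
```

The same `g_test` value could also drive the noise filter, by discarding any alignment whose statistic falls below a tuned threshold.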
Note that we have not currently incorporated special treatment for alignments of a single word to a multi-word phrase. This does create the possibility of noisy or uninformative sense annotations (e.g., sense annotations corresponding to parts of aligned Chinese phrases) when such alignments are not filtered out by the G-test thresholding.
Experimental conditions. We evaluate the following experimental conditions: Skip-gram (SG) uses the un-retrofitted word2vec vectors, WordNet (WN) retrofits using the WordNet-based sense graph, and Parallel Data (PD) retrofits using the sense graph built from parallel text. We also combine the two retrofitting approaches (PD-WN). For synonym selection, we compute maxSim over all sense pairs for WN and PD separately, and select the sense pair with the overall maximum cosine similarity across the two. For similarity rating, we explore two PD-WN combination approaches: for each word pair, we take the avgSim from each separate model, and then we (a) take the average of the values given by the two models (avg), or (b) take the maximum value between the two models (max).

Results
Table 1 shows that combining our new method with Jauhar et al.'s WN retrofitting performs best on synonym selection across all datasets, and both retrofitted models consistently outperform the no-retrofitting model (SG). Error analysis on RD-300, the only set on which WN substantially outperforms PD, suggests that PD's errors are driven by the large number of lower-frequency items that characterize this dataset. Given that WordNet is a hand-curated lexicon while the parallel data mirrors actual usage, it is not surprising that the latter suffers when it comes to low-frequency items.
Error analysis also indicates that PD performs particularly well on the synonym task precisely when one would expect: when the probe and the correct answer have an alignment to the same Chinese word form, so that the corresponding sense vectors are extremely close in vector space. Occasionally, PD yields "the wrong answer for the right reason", choosing an option for which there is indeed a correct alignment that matches an alignment of the probe word. For instance, though the probe passage is intended to have the answer hallway, PD chooses ticket because both passage and ticket have a sense defined by alignment to the Chinese word 机票 (jipiao), meaning "air ticket". Though this is a less frequent sense of passage, it is a reasonable one.
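For concreteness, the scoring behind these comparisons can be sketched as follows: maxSim and avgSim are taken here to be the maximum and the mean cosine similarity over all pairs of sense vectors, and the PD-WN combinations follow the description under Experimental conditions. The function names and the dictionary layout (word mapped to a list of sense vectors) are assumptions of this example, as is the fallback score for words missing from a space.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def max_sim(senses_a, senses_b):
    """Highest cosine similarity over all sense pairs."""
    return max(cosine(u, v) for u in senses_a for v in senses_b)


def avg_sim(senses_a, senses_b):
    """Mean cosine similarity over all sense pairs."""
    sims = [cosine(u, v) for u in senses_a for v in senses_b]
    return sum(sims) / len(sims)


def choose_synonym(probe, options, spaces):
    """Pick the option with the best maxSim across one or more sense spaces.

    spaces: list of dicts mapping a word to its list of sense vectors,
    e.g. [pd_space] for PD alone or [pd_space, wn_space] for PD-WN.
    """
    def score(option):
        return max((max_sim(space[probe], space[option])
                    for space in spaces
                    if probe in space and option in space),
                   default=-1.0)  # assumed fallback when a word is missing
    return max(options, key=score)


def combine_similarity(pd_value, wn_value, how="avg"):
    """PD-WN combination for similarity rating: 'avg' or 'max'."""
    if how == "avg":
        return (pd_value + wn_value) / 2.0
    if how == "max":
        return max(pd_value, wn_value)
    raise ValueError("how must be 'avg' or 'max'")
```

Because maxSim needs only one close sense pair, a shared translation between the probe and an option is enough to make that option win, which is the behaviour seen in the passage and ticket example.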
Results on the similarity rating task (presented in Table 2) are less clearly interpretable: top performance is divided between the PD model and the combined models, with the exception of WS-353. We note that WS-353 is a test set for which human raters were explicitly told to rate relatedness, rather than similarity, while the retrofitting process is intended to encourage similarity per se. If we exclude this set from consideration, we can observe that SG is outperformed by at least one sense-specific model in all cases.
Note that, as expected, the amount of training data has an impact on the quality of the alignments and of the sense graph. Retrofitting sense-specific embeddings using only 300k sentence pairs, which represent about 5% of the total training data, does not give a clear benefit over word-form embeddings.

Conclusions and future work
Building on Jauhar et al. (2015), we have presented an alternative means of deriving information about senses and sense relations to build sense-specific vector space representations of words, making use of parallel text rather than a manually constructed ontology. We show that this is a viable alternative, producing representations that perform on par with those retrofitted to sense graphs based on WordNet.
Based on these results, it would be interesting to evaluate further refinements of the sense graph: alignment-based senses could be clustered, or further filtered to reduce the impact of alignment noise, and new edges could be added using other multilingual resources. Finally, it will be important to evaluate the effectiveness of the retrofitted word embeddings on extrinsic tasks that require disambiguating word meaning in context.