Sense-Aware Statistical Machine Translation using Adaptive Context-Dependent Clustering

Statistical machine translation (SMT) systems use local cues from n-gram translation and language models to select the translation of each source word. Such systems do not explicitly perform word sense disambiguation (WSD), although this would enable them to select translations depending on the hypothesized sense of each word. Previous attempts to constrain word translations based on the results of generic WSD systems have suffered from their limited accuracy. We demonstrate that WSD systems can be adapted to help SMT, thanks to three key achievements: (1) we consider a larger context for WSD than SMT can afford to consider; (2) we adapt the number of senses per word to the ones observed in the training data using clustering-based WSD with K-means; and (3) we initialize sense-clustering with definitions or examples extracted from WordNet. Our WSD system is competitive, and in combination with a factored SMT system improves noun and verb translation from English to Chinese, Dutch, French, German, and Spanish.


Introduction
Selecting the correct translation of polysemous words remains an important challenge for machine translation (MT). While some translation options may be interchangeable, substantially different senses of source words must generally be rendered by different words in the target language. In this case, an MT system should identify, implicitly or explicitly, the correct sense conveyed by each occurrence in order to select the appropriate translation.
Source: And I do really like this shot, because it shows all the detritus that's sort of embedded in the sole of the sneakers.
Baseline SMT: Und ich mag dieses Bild . . .
Current statistical or neural MT systems perform word sense disambiguation (WSD) implicitly, for instance through the n-gram frequency information stored in the translation and language models. However, the context taken into account by an MT system when performing implicit WSD is limited. For instance, in the case of phrase-based SMT, the context is bounded by the order of the language model (often between 3 and 5) and the size of n-grams in the phrase table (seldom above 5). In attention-based neural MT systems, the context extends to the entire sentence, but is not specifically trained to be used for WSD.
For instance, Figure 1 shows an English sentence translated into German by a baseline statistical MT system, an online neural MT system, and the sense-aware MT system proposed in this paper. The word shot is translated as Schuss (gun shot) by the online NMT system, as Bild (drawing) by the baseline, and as Aufnahme (picture) by our sense-aware system. Our system selects the correct sense, which is identical to the reference, while the other two are incorrect (especially the online NMT).
In this paper, we introduce a sense-aware statistical MT system that performs explicit WSD, using a larger context than is accessible to state-of-the-art SMT. Our WSD system performs context-dependent clustering of word occurrences and is initialized with knowledge from WordNet, in the form of vector representations of definitions or examples for each sense. The labels of the resulting clusters are used as abstract source-side sense labels within a factored phrase-based SMT system. The stages of our method are presented in Figure 2, and will be explained in detail in Section 3.
Our results (Section 5) show first that our WSD system is competitive on the SemEval 2010 WSD task, but especially that it helps SMT to increase its BLEU scores and to improve the translation of polysemous nouns and verbs, when translating from English into Chinese, German, French, Spanish or Dutch, in comparison to an SMT baseline that is not aware of word senses.
With respect to previous work that used WSD for MT, discussed in Section 2, we innovate on the following points:
• we design a sense clustering method with explicit knowledge (WordNet definitions or examples) to disambiguate polysemous nouns and verbs;
• we represent each token by its context vector, obtained from word2vec word vectors in a large surrounding window;
• we adapt the possible number of senses per word to the ones observed in the training data rather than constraining them by the full list of senses from WordNet;
• we use the abstract sense labels for each analyzed word as factors in an SMT system.

Related Work
Word sense disambiguation aims to identify the sense of a word appearing in a given context (Agirre and Edmonds, 2007). Resolving word sense ambiguities should be useful, in particular, for lexical choice in MT. An initial investigation found that an SMT system which makes use of off-the-shelf WSD does not yield significantly better translations than an SMT system not using it (Carpuat and Wu, 2005). However, another study (Vickrey et al., 2005) reformulated the task of WSD for SMT as predicting possible target translations rather than senses of ambiguous source words, and showed that WSD can improve such a simplified word translation task. Subsequent studies which adopted this formulation (Cabezas and Resnik, 2005; Chan et al., 2007; Carpuat and Wu, 2007) successfully integrated WSD into hierarchical or phrase-based SMT. These systems yielded slightly better translation quality than SMT baselines in most cases (0.15-0.30 BLEU).
Although the WSD reformulation above proved helpful for SMT, it did not answer whether actual source-side senses are helpful for end-to-end SMT. Xiong and Zhang (2014) attempted to answer this question by performing word sense induction for large scale data. In particular, they proposed a topic model that automatically learned sense clusters for words in the source language. In this way, on the one hand, they avoided using a pre-specified inventory of word senses as traditional WSD does, but on the other hand, they created the risk of discovering sense clusters which do not correspond to the common senses of words needed for MT. Hence, this study left open an important question, namely whether WSD based on semantic resources such as WordNet (Fellbaum, 1998) can be successfully integrated with SMT.
Neale et al. (2016) attempted such an integration, by using a WSD system based on a sense graph from WordNet (Agirre and Soroa, 2009). This system detects the senses of words in context using a random walk algorithm over the sense graph. The authors used it to specify the senses of the source words and integrate them as contextual features with a MaxEnt-based translation model for English-Portuguese MT. Similarly, Su et al. (2015) built a large weighted graph model of both source and target word dependencies and integrated them as features into an SMT model. However, apart from the sense graph, WordNet also provides textual information such as sense definitions and examples, which should be useful for disambiguating senses, but were not used in the above studies. Here, we aim to exploit this information to perform word sense induction from large scale monolingual data (in a first phase), thus combining the benefits of semantic ontologies and word sense induction for WSD.
Several other studies integrate additional information from a larger context using factored MT models (Koehn and Hoang, 2007). Birch et al. (2007) integrated supertags from a Combinatory Categorial Grammar as factors in a phrase-based translation model. Avramidis and Koehn (2008) added source-side syntactic information for each word for translating from a morphologically poorer language to a richer one (English-Greek). The levels of improvement achieved with factored models such as the ones above range from 0.15 to 0.50 BLEU points. Here, we also observe improvements in the upper part of this range, and they are consistent across several language pairs.

Adaptive Sense Clustering for SMT
In this section, we describe our adaptive WSD method and show how we integrate it with SMT, as represented in Figure 2. In a nutshell, we consider all source words that have more than one sense (synset) in WordNet, and extract from WordNet the definition of each sense and, if available, the example. We represent them with word embeddings built using word2vec. For each occurrence of these words in the training data, we also build vectors for their contexts (i.e. neighboring words) using the same model. All the vectors are passed to a clustering algorithm, resulting in the labeling of each occurrence with a cluster number that will be used as a factor in statistical MT.
Our method answers several limitations of previous supervised or unsupervised WSD methods. Supervised methods require data with manually sense-annotated labels and are therefore often limited to a small number of word types: for instance, only 50 nouns and 50 verbs were targeted in SemEval 2010 (Manandhar et al., 2010). On the contrary, our method does not require labeled texts for training, and applies to all word types appearing with multiple senses in WordNet.
Unsupervised methods often pre-define the number of possible senses for each ambiguous word before clustering the various occurrences according to the senses. If these numbers come from WordNet, the senses may be too fine-grained for the needs of translation, especially when a specific domain is targeted. In contrast, as we explain below, our WSD method initializes a context-dependent clustering algorithm with information from WordNet senses for each word (nouns and verbs), but then adapts the number of clusters to the observed training data for MT.

Representing Definitions, Examples and Contexts of Word Occurrences
For each noun or verb type $W_t$ appearing in the training data, as identified by the Stanford POS tagger, we extract the senses associated with it in WordNet by using its Web interface, specifically the definitions $D_t = \{d_{tj} \mid j = 1, \ldots, m_t\}$ and examples of use $E_t = \{e_{tj} \mid j = 1, \ldots, n_t\}$, each of them containing multiple words. While most of the senses are accompanied by a definition, only a smaller subset also include an example of use, as shown in the last four columns of Table 1; some senses, however, contain examples without definitions.
Each definition $d_{tj}$ and example $e_{tj}$ is represented by a vector, which is the average of the word embeddings over all the words constituting it. Formally, these are $\mathbf{d}_{tj} = (\sum_{w_l \in d_{tj}} \mathbf{w}_l)/m_t$ and respectively $\mathbf{e}_{tj} = (\sum_{w_l \in e_{tj}} \mathbf{w}_l)/n_t$. While the entire definition $d_{tj}$ is used to build the vector, we do not consider all words in the example $e_{tj}$, but limit the sum to a subset of $e_{tj}$, obtained by considering only a window of size $c$ centered around the noun or verb of type $W_t$ (similarly to the window used for context representation below), to avoid noise from potentially long examples.
For all the word vectors $\mathbf{w}_l$ above, we use word2vec pre-trained embeddings from Google (code.google.com/archive/p/word2vec/) (Mikolov et al., 2013). If $d$ is the dimensionality of the word vector space, then all vectors $\mathbf{w}_l$, $\mathbf{d}_{tj}$, and $\mathbf{e}_{tj}$ are in $\mathbb{R}^d$. Each definition vector $\mathbf{d}_{tj}$ or example vector $\mathbf{e}_{tj}$ for a word type $W_t$ will be considered as a center vector for each sense during the clustering procedure.
Similarly, each word token $w_i$ in a source sentence is represented by the average vector $\mathbf{u}_i$ of the words in its context, which is defined as a window of $c$ words centered on $w_i$. The value $c$ of the context size needs to be even, since we calculate the vector $\mathbf{u}_i$ for $w_i$ by averaging vectors from $c/2$ words before $w_i$ and from $c/2$ words after it. We nevertheless stop at the sentence boundaries, and filter out stop words before averaging.
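The construction of a context vector can be illustrated with a short Python sketch. It is not the implementation used in the experiments: the function names (`average_vector`, `context_vector`), the toy two-dimensional embeddings, and the choices of skipping out-of-vocabulary words and of removing stop words inside the window are assumptions made for illustration only.

```python
from typing import Dict, FrozenSet, List, Optional, Sequence

Vector = List[float]


def average_vector(words: Sequence[str], emb: Dict[str, Vector]) -> Optional[Vector]:
    """Average the embeddings of the words that have one; None if none do."""
    vecs = [emb[w] for w in words if w in emb]  # assumption: skip unknown words
    if not vecs:
        return None
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def context_vector(tokens: Sequence[str], i: int, emb: Dict[str, Vector],
                   c: int = 8,
                   stop_words: FrozenSet[str] = frozenset()) -> Optional[Vector]:
    """Context vector of token i: c/2 words before it and c/2 words after it."""
    if c % 2 != 0:
        raise ValueError("the window size c must be even")
    half = c // 2
    # Slicing clips the window at the sentence boundaries.
    left = list(tokens[max(0, i - half):i])
    right = list(tokens[i + 1:i + 1 + half])
    # Assumption: stop words are dropped after the window has been cut.
    window = [w for w in left + right if w not in stop_words]
    return average_vector(window, emb)


# Toy embeddings (hypothetical values, not word2vec vectors).
toy_emb = {"gun": [1.0, 0.0], "fired": [1.0, 1.0], "photo": [0.0, 1.0]}
sentence = ["the", "gun", "fired", "a", "shot"]
u = context_vector(sentence, 4, toy_emb, c=8, stop_words=frozenset({"the", "a"}))
print(u)
```

In this toy sentence the occurrence of "shot" is the last token, so the window is clipped on the right and only "gun" and "fired" contribute to the average.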
We will now explain how to cluster according to their senses all vectors $\mathbf{u}_i$ for the occurrences $w_i$ of a given word type $W_t$, using as initial centers either the definition or the example vectors.

Clustering Word Occurrences According to their Senses
We now aim to group all occurrences $w_i$ of a given word type $W_t$ into clusters according to the similarity of their senses, which we model as the similarity of their context vectors. The correctness of this hypothesis will be supported by the empirical results. We modify the k-means algorithm in several ways to achieve an optimal clustering of word senses for MT.
The original k-means algorithm (MacQueen, 1967) aims to partition a set of items, which are here tokens $w_1, w_2, \ldots, w_n$ of the same word type $W_t$, represented through their embeddings $\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_n$ where $\mathbf{u}_i \in \mathbb{R}^d$. The goal of k-means is to partition (or cluster) them into $k$ sets $S = \{S_1, S_2, \ldots, S_k\}$ so as to minimize the within-cluster sum of squares, as follows:
$$S = \arg\min_{S} \sum_{i=1}^{k} \sum_{\mathbf{u} \in S_i} \lVert \mathbf{u} - \mu_i \rVert^2$$
where $\mu_i$ is the centroid of each set $S_i$. At the first iteration, when there are no clusters yet, the algorithm selects $k$ random points to be the centroids of the $k$ clusters. Then, at each subsequent iteration $t$, k-means calculates for each candidate cluster a new point to be the centroid of the observations, defined as their average vector, as follows:
$$\mu_i^{t+1} = \frac{1}{|S_i^t|} \sum_{\mathbf{u}_j \in S_i^t} \mathbf{u}_j$$
We make the following modifications to the original k-means algorithm, to make it adaptive to the word senses observed in the training data.
1. We define the initial number of clusters $k_t$ for each ambiguous word type $W_t$ in the data as the number of its senses in WordNet (but this number may be reduced by the final re-clustering described below at point 3). Specifically, we run two series of experiments (the results of which will be compared in Section 5.1.1): one in which each $k_t$ is set to $m_t$, i.e. the number of senses that possess a definition in WordNet, and another one in which we consider only senses that are illustrated with an example, hence setting each $k_t$ to $n_t$. These settings avoid fixing the number of clusters $k_t$ arbitrarily for each ambiguous word type.
2. We initialize the centroids of the clusters to the vectors representing the senses from WordNet, either using their definition vectors $\mathbf{d}_{tj}$ in one series of experiments, or their example vectors $\mathbf{e}_{tj}$ in the other one. This second modification attempts to provide a reasonably accurate starting point for the clustering process.
3. After running the k-means algorithm, we reduce the number of clusters for each word type by merging the clusters which contain fewer than 10 tokens with the nearest larger cluster. This is done by calculating the cosine similarity between each token vector $\mathbf{u}_i$ and the centroids of the larger clusters and assigning the tokens to the closest large cluster. This re-clustering adapts the final number of clusters to the observed occurrences in the training data: indeed, when there are few occurrences of a sense for a given ambiguous word type in the data, the SMT system is unlikely to translate them properly due to the lack of training samples.
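The three modifications above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the Scikit-learn-based implementation used in the experiments: the names `seeded_kmeans` and `merge_small_clusters` are hypothetical, an empty cluster simply keeps its previous centroid, and labels are left unchanged when no cluster reaches the minimum size.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a: Vector, b: Vector) -> float:
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0.0 or nb == 0.0 else sum(x * y for x, y in zip(a, b)) / (na * nb)


def assign(points: Sequence[Vector], centroids: Sequence[Vector]) -> List[int]:
    """Label each point with the index of its nearest centroid (squared distance)."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]


def seeded_kmeans(points: Sequence[Vector], seeds: Sequence[Vector],
                  n_iter: int = 20) -> Tuple[List[int], List[List[float]]]:
    """k-means where k and the initial centroids come from the sense vectors."""
    centroids = [list(s) for s in seeds]  # steps 1 and 2: one seed per sense
    for _ in range(n_iter):
        labels = assign(points, centroids)
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, l in zip(points, labels) if l == j]
            # Assumption: an empty cluster keeps its previous centroid.
            updated.append([sum(col) / len(members) for col in zip(*members)]
                           if members else old)
        if updated == centroids:
            break
        centroids = updated
    return assign(points, centroids), centroids


def merge_small_clusters(points: Sequence[Vector], labels: List[int],
                         centroids: Sequence[Vector], min_size: int = 10) -> List[int]:
    """Step 3: move tokens of small clusters to the most similar large cluster."""
    sizes = [labels.count(j) for j in range(len(centroids))]
    large = [j for j, s in enumerate(sizes) if s >= min_size]
    if not large:  # assumption: nothing to merge into, keep the labels
        return list(labels)
    return [l if sizes[l] >= min_size
            else max(large, key=lambda j: cosine(p, centroids[j]))
            for p, l in zip(points, labels)]


# Toy 2-D data: two well-populated senses and one sense seen only once.
points = [[1.0, 0.1], [1.0, 0.0], [0.9, 0.1], [1.1, 0.0],
          [0.0, 1.0], [0.1, 1.0], [0.0, 0.9], [0.1, 1.1],
          [2.0, 1.0]]
seeds = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # stand-ins for definition vectors
labels, centroids = seeded_kmeans(points, seeds)
merged = merge_small_clusters(points, labels, centroids, min_size=3)
print(labels, merged)
```

With the toy threshold `min_size=3`, the single token of the third cluster is reassigned to the first cluster, whose centroid has the highest cosine similarity with it, so that two sense labels remain for this word type.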
Finally, after clustering the training data, we use the centroids to assign each new token from the test data to a cluster, i.e. an abstract sense label, by selecting the closest centroid to it in terms of cosine distance in the embedding space.
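The labeling of a test token can be sketched as a nearest-centroid lookup under cosine distance. The function names and the dictionary layout (cluster label mapped to centroid, so that labels removed by the merging step are simply absent) are assumptions of this illustration.

```python
import math
from typing import Dict, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; a zero vector is treated as maximally distant."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0
    return 1.0 - sum(x * y for x, y in zip(a, b)) / (na * nb)


def assign_sense(u: Sequence[float], centroids: Dict[int, Sequence[float]]) -> int:
    """Return the label of the centroid closest to the context vector u."""
    return min(centroids, key=lambda label: cosine_distance(u, centroids[label]))


# Hypothetical centroids kept after clustering the training data.
kept = {0: [1.0, 0.05], 1: [0.05, 1.0]}
print(assign_sense([0.9, 0.2], kept))
```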

Integration with Machine Translation
Our adaptive WSD system assigns a sense number to each ambiguous word token on the source side of a parallel corpus. To pass this information to an SMT system, we use a factored phrase-based translation model (Koehn and Hoang, 2007). The factored model offers a principled way to supplement words with additional information (such as, traditionally, part-of-speech tags) without requiring any intervention in the translation tables. The features are combined in a log-linear way with those of a standard phrase-based decoder, and the goal remains to find the most probable target sentence for a given source sentence. To each source noun or verb token, we add a sense label obtained from our adaptive WSD system. To all the other words, we assign a NULL label. The translation system will thus take the source-side sense labels into consideration during the training and the decoding processes.
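As an illustration of the resulting input, the sketch below writes a source sentence in the factored representation read by Moses, in which the factors of a token are joined by a vertical bar. The helper `to_factored` and the sense number used in the example are hypothetical; escaping a literal vertical bar as `&#124;` follows the convention of the Moses tokenizer and is included here as an assumption about the preprocessing.

```python
from typing import Dict, Sequence


def to_factored(tokens: Sequence[str], senses: Dict[int, int]) -> str:
    """Join each token with its sense label, or NULL when it has none.

    `senses` maps the position of a disambiguated noun or verb to its
    cluster number.
    """
    out = []
    for i, tok in enumerate(tokens):
        surface = tok.replace("|", "&#124;")  # "|" is the factor separator
        label = str(senses[i]) if i in senses else "NULL"
        out.append(f"{surface}|{label}")
    return " ".join(out)


line = to_factored(["I", "like", "this", "shot"], {3: 2})
print(line)  # I|NULL like|NULL this|NULL shot|2
```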

Datasets, Preparation and Settings
We evaluate our sense-aware SMT on the UN Corpus (Rafalovitch and Dale, 2009) as well as on Europarl (Koehn, 2005). We select 0.5 million parallel sentences for each language pair from Europarl, as shown in Table 1. We also use the WIT3 Corpus (Cettolo et al., 2012), a smaller collection of transcripts of TED talks, to evaluate the impact of costly model choices, namely the type of the resource (definition vs. examples), the length of the context window, and the k-means method (adaptive vs. original).
Before assigning sense labels, we first tokenize all the texts and identify the parts of speech (POS) using the Stanford POS tagger. Then, we filter out stop words and nouns which are proper names according to the Stanford Named Entity Recognizer. Furthermore, we convert the plural forms of nouns to their singular form and the verb forms to the infinitive using the stemmer and lemmatizer from NLTK. This is essential because WordNet has description entries only for singular nouns and infinitive forms of verbs. The preprocessed text is used for assigning sense labels to each occurrence of a noun or verb which has more than one sense in WordNet. For translation, we train and tune baseline and factored phrase-based models with Moses (Koehn et al., 2007).
We select the optimal model configuration based on the MT performance, measured with the traditional BLEU score (Papineni et al., 2002), on the WIT3 corpus for EN/ZH and EN/DE. Unless otherwise stated, we use the following settings in the k-means algorithm, starting from the implementation provided in Scikit-learn (Pedregosa et al., 2011):
• we use the definition of each sense for initializing the centroids in the adaptive k-means methods (and compare this later with using the examples);
• we also set $k_t$ equal to $m_t$, i.e. the number of senses of an ambiguous word type $W_t$;
• the window size for the context surrounding each occurrence is set to $c = 8$.
For the evaluation of intrinsic WSD performance, we use the V-metric, the $F_1$-metric, and their average, as used for instance at SemEval 2010 (Manandhar et al., 2010). To measure the impact of WSD on MT, besides BLEU, we also measure the actual impact on the nouns and verbs that appear in WordNet with several senses, by comparing how many of them are translated as in the reference translation by our system vs. the baseline. For a certain set of tokens in the source data, we note as $N_{\text{improved}}$ the number of tokens which are translated by our system as in the reference translation, but whose baseline translation differs from it. Conversely, we note as $N_{\text{degraded}}$ the number of tokens which are translated by the baseline system as in the reference, but differently by our system. We will use the normalized coefficient $\rho = (N_{\text{improved}} - N_{\text{degraded}})/T$, where $T$ is the total number of tokens, as a metric focusing explicitly on the words submitted to WSD.
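The coefficient ρ can be computed from three aligned lists of word translations, as in the sketch below. The function name `rho` and the German words in the example are illustrative assumptions; they are toy data, not results from the experiments.

```python
from typing import Sequence


def rho(reference: Sequence[str], baseline: Sequence[str], system: Sequence[str]) -> float:
    """(N_improved - N_degraded) / T over the tokens submitted to WSD.

    The three sequences hold, for the same T source tokens, the reference
    translation, the baseline translation and the sense-aware translation.
    """
    if not (len(reference) == len(baseline) == len(system)) or not reference:
        raise ValueError("expected three non-empty sequences of equal length")
    triples = list(zip(reference, baseline, system))
    improved = sum(1 for r, b, s in triples if s == r and b != r)
    degraded = sum(1 for r, b, s in triples if b == r and s != r)
    return (improved - degraded) / len(reference)


# Toy example with four labeled tokens.
ref = ["Aufnahme", "Felsen", "Bank", "Schuss"]
base = ["Bild", "Felsen", "Bank", "Ball"]
ours = ["Aufnahme", "Felsen", "Bank", "Schuss"]
print(rho(ref, base, ours))  # 0.5
```

Tokens that both systems translate as in the reference, or that both translate differently from it, leave ρ unchanged; only the disagreements between the two systems count.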

Results
Using the data, settings, and metrics above, we investigate first the impact of two model choices on the performance: centroid initialization for k-means (definition or examples vs. random), and the length of the context window for each word. Then, we evaluate our adaptive clustering method on the WSD task, to estimate its intrinsic quality, and finally measure WSD+MT performance.

Initialization of Adaptive k-means
We examine first the impact of the initialization of the sense clusters, on the WIT3 Corpus. In Table 2, we present the BLEU scores of our WSD+MT system in two conditions: when the k-means clusters are initialized with vectors from the definitions vs. from the examples provided in the WordNet synsets of ambiguous words. Moreover, we provide BLEU scores of baseline and oracle (i.e. correct senses as factors) systems, as well as the $\rho$ score indicating the relative improvement of ambiguous words in our system with respect to the baseline. The use of definitions outperforms the use of examples, probably because there are more words with definitions than with examples in WordNet (twice as many, as shown in Table 1 in Section 4), but also because definitions may provide more helpful words to build the initial vectors, as they are more explicit than the examples. All the values of $\rho$ show clear improvements over the baseline, with up to 4% for DE/EN. As for the oracle scores, their improvement over the baseline is 2-3 times larger than that of our system.
In addition, we compare the two initialization options above with random initializations of k-means clusters, in Table 3. To offer a fair comparison, we set the number of clusters, in the case of random initializations, respectively to the num-

Length of the Context Window
We now investigate the effect of the size of the context window surrounding each ambiguous token, i.e. the number of words surrounding it that are considered for building its vector representation. Figure 3 displays the BLEU score of our WSD+MT factored system when varying this size, on EN/ZH translation in the WIT3 Corpus, along with the (constant) score of the baseline. The performance of our system improves with the size of the window, reaching a peak around 8-10. This result highlights the importance of a longer context compared to the typical settings of SMT systems, which never go beyond 6 (order of language model and maximum size of phrases in the translation model). It also suggests that MT systems which effectively exploit longer context, as we show here with a sense-aware factored MT system for ambiguous nouns and verbs, can significantly improve their lexical choice and their overall translation quality.
Figure 3: BLEU scores of our WSD+MT factored system on EN/ZH WIT3 data, along with the baseline score (constant), when the size of the context window around each ambiguous token (for building its context vector) varies from 2 to 14.
System | V-score All | V-score Noun | V-score Verb | F1 All | F1 Noun | F1 Verb | Average All | Average Noun | Average Verb | #Clusters
(Korkontzelos and Manandhar, 2010) | 15.70 | 20.60 | 8.50 | 49.80 | 38.20 | 66.60 | 32.75 | 29.40 | 37.50 | 11.54
KSU KDD (Elshamy et al., 2010) | 15.70 | 18.00 | 12.40 | 36.90 | 24.60 | 54.70 | 26.30 | 21.30 | 33.50 | 17.50
Duluth-WSI (Pedersen, 2010) | 9.00 | 11.40 | 5.70 | 41.10 | 37.10 | 46.70 | 25.05 | 24.20 | 26.20 | 4.15
Duluth-WSI-SVD-Gap (Pedersen, 2010) | 0.00 | 0.00 | 0.10 | 63.30 | 57.00 | 72.40 | 31.65 | 28.50 | 36.20 | 1.02
KCDC-PT (Kern et al., 2010) | 1 . . .
. . . | 11.35 | 11.00 | 11.70 | 53.25 | 47.70 | 58.80 | 32.28 | 29.30 | 35.25 | 3.58
Table 4: WSD results from the SemEval 2010 shared task in terms of V-score, $F_1$-score and their average. Our adaptive k-means using definitions (last but one line) outperforms all the other systems on the average of V and $F_1$, when considering both nouns and verbs, or nouns only.

Word Sense Disambiguation Results
We evaluate in this section our WSD system on the dataset from the SemEval 2010 shared task (Manandhar et al., 2010), to assess how competitive it is, while acknowledging that our system uses external knowledge not available to SemEval participants.
Table 4 shows the WSD results in terms of V-score and $F_1$-score, comparing our method (bottom two lines) with other WSD systems that participated in SemEval 2010 (top four systems for each metric). We add three baselines provided by the task organizers for comparison: (1) Most Frequent Sense (MFS), which groups all occurrences of a word into one cluster, (2) 1ClusterPerInstance, which produces one cluster for each occurrence of a word, and (3) Random, which randomly assigns an occurrence to 1 out of 4 clusters (4 is the average number of senses from the ground truth).
The V-score is biased towards systems generating a higher number of clusters than the number of gold standard senses. The $F_1$-score measures the classification performance, i.e. how well a method assigns two occurrences of a word belonging to the same gold standard class to the same cluster. Hence, this metric favors systems that generate fewer clusters (for instance, if all instances were grouped into 1 cluster, the $F_1$-score would be high). As these two metrics are biased towards either small or large numbers of clusters, their average is a useful metric as well.
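The opposite biases of the two metrics can be checked on a toy example. The sketch below uses the standard definitions of the V-measure (harmonic mean of homogeneity and completeness) and of the paired F-score (F-score over pairs of instances placed in the same group); it is assumed, not verified here, that these match the official SemEval 2010 scorer in every detail, and the function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Hashable, Sequence, Set, Tuple


def entropy(labels: Sequence[Hashable]) -> float:
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(target: Sequence[Hashable], given: Sequence[Hashable]) -> float:
    """H(target | given), estimated from label co-occurrence counts."""
    n = len(target)
    joint = Counter(zip(given, target))
    marginal = Counter(given)
    return -sum((c / n) * math.log(c / marginal[g]) for (g, _), c in joint.items())


def v_measure(gold: Sequence[Hashable], pred: Sequence[Hashable]) -> float:
    h_gold, h_pred = entropy(gold), entropy(pred)
    homogeneity = 1.0 if h_gold == 0 else 1.0 - conditional_entropy(gold, pred) / h_gold
    completeness = 1.0 if h_pred == 0 else 1.0 - conditional_entropy(pred, gold) / h_pred
    total = homogeneity + completeness
    return 0.0 if total == 0 else 2 * homogeneity * completeness / total


def same_group_pairs(labels: Sequence[Hashable]) -> Set[Tuple[int, int]]:
    return {(i, j) for i, j in combinations(range(len(labels)), 2)
            if labels[i] == labels[j]}


def paired_f1(gold: Sequence[Hashable], pred: Sequence[Hashable]) -> float:
    g, p = same_group_pairs(gold), same_group_pairs(pred)
    if not g or not p:
        return 0.0
    common = len(g & p)
    precision, recall = common / len(p), common / len(g)
    return 0.0 if common == 0 else 2 * precision * recall / (precision + recall)


gold = [0, 0, 0, 1, 1, 1]           # two gold senses, three instances each
one_cluster = [0] * 6               # all instances grouped together
one_per_instance = list(range(6))   # one cluster per instance

for name, pred in [("one cluster", one_cluster), ("one per instance", one_per_instance)]:
    print(name, round(v_measure(gold, pred), 3), round(paired_f1(gold, pred), 3))
```

On this toy data, grouping everything into one cluster obtains a V-measure of zero but a non-zero paired F-score, while one cluster per instance obtains the opposite pattern, which is why the average of the two is reported as well.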
Table 4 shows that k-means initialized with definitions achieves high performance and ranks among the top systems for each metric individually, outperforming all other systems on the averaged metric (especially on "All" and "Noun" analysis). Moreover, the adaptive k-means method finds an average number of senses of 4, which is close to the ground-truth value provided by SemEval (4.46). These results show that our method, despite its simplicity, is effective and provides competitive performance against prior art, partly thanks to additional knowledge not available to the shared task systems.
Table 6: BLEU scores of our WSD+MT factored system, trained separately on disambiguated nouns vs. verbs, and tested separately or jointly, along with baseline MT and oracle WSD+MT, on five language pairs.

Machine Translation Results
Table 5 displays the performance of our factored MT system trained with noun and verb senses on five language pairs. Our system performs consistently better than the MT baseline on all pairs, with the largest improvements achieved on EN/ZH and EN/DE. To better understand the improvements over the baseline MT, we also provide the BLEU score of an oracle system which has access to the reference translation of the ambiguous words through the alignment provided by GIZA++. According to the results, our factored MT system bridges around 40% of the BLEU gap between the baseline MT system and the oracle system on EN/DE and 30% on EN/ZH.
As shown in Table 6, the translation quality of our factored MT outperforms the baseline when trained with either noun senses or verb senses separately. However, in some cases, our factored MT system trained with both noun and verb senses performs worse than with noun and verb senses separately. This may be due to the lack of sufficient training data to learn reliably using all the additional factors, as we observed when training on the smaller WIT3 Corpus.
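The share of the baseline-to-oracle gap that is bridged by a system is a simple ratio of BLEU differences, sketched below. The function name and the BLEU values in the example are hypothetical placeholders chosen for illustration; they are not scores from Table 5.

```python
def gap_bridged(baseline_bleu: float, system_bleu: float, oracle_bleu: float) -> float:
    """Fraction of the BLEU gap between baseline and oracle covered by the system."""
    gap = oracle_bleu - baseline_bleu
    if gap == 0:
        raise ValueError("oracle and baseline scores are identical")
    return (system_bleu - baseline_bleu) / gap


# Placeholder scores: a 0.8 BLEU gain out of a possible 2.0.
share = gap_bridged(baseline_bleu=20.0, system_bleu=20.8, oracle_bleu=22.0)
print(f"{share:.0%}")
```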
Lastly, Table 7 shows the confusion matrix for our factored MT and the baseline MT systems when compared with the reference translation of nouns and verbs separately, using GIZA++ alignment. In particular, the confusion matrix displays the number of labeled tokens whose translation is identical to the reference or not (Y, N). As we can observe, the number of tokens that our factored MT system translates correctly while the baseline MT does not is twice the number of tokens that the baseline MT system translates correctly while our factored MT does not.

Conclusion
We presented a sense-aware statistical MT system which obtains access to a longer context than standard ones, through an adaptive context-dependent k-means clustering algorithm for WSD. The algorithm utilizes semantic information from WordNet to identify the dominant clusters, which correspond to senses on the source side of a parallel corpus. The proposed adaptive k-means method is straightforward, yet it provides competitive WSD performance on data from the SemEval 2010 shared task. For MT, our experiments with five language pairs show that our sense-aware MT system consistently improves over the baseline. As future work, we plan to integrate sense information for ambiguous words into neural MT and investigate other effective ways to enable access to longer context.

Figure 1: Example of a sense-aware translation that is closer to the reference translation than those of a baseline statistical MT system or an online neural one.

Figure 2: Adaptive WSD for MT: vectors from WordNet definitions (or examples) are clustered with context vectors of each occurrence (here of 'rock'), resulting in sense labels used as factors for MT.

Table 1: Statistics of the corpora used for machine translation: '∼' indicates a similar size, though not identical texts, because the English source texts for the different language pairs from Europarl are different. Hence, the number of words found in WordNet differs as well.

Table 2: Performance of our WSD+MT factored system for two language pairs from WIT3, with two initialization conditions for the k-means clusters, i.e. definitions or examples for each sense.

Table 3: Performance of our WSD+MT factored system for EN-ZH from WIT3, comparing the two initialization conditions for the k-means clusters, i.e. definitions or examples for each sense, with random initializations.

Table 5: BLEU scores of our WSD+MT factored system, with both noun and verb senses, along with baseline MT and oracle WSD+MT, on five language pairs.