HCCL at SemEval-2017 Task 2: Combining Multilingual Word Embeddings and Transliteration Model for Semantic Similarity

In this paper, we introduce an approach that combines word embeddings and machine translation for multilingual semantic word similarity, Task 2 of SemEval-2017. Thanks to the unsupervised transliteration model, our cross-lingual word embeddings encounter fewer out-of-vocabulary words (OOVs). Our results are produced using only monolingual Wikipedia corpora and a limited amount of sentence-aligned data. Although relatively few resources are used, our system ranked 3rd in the monolingual subtask and would rank 6th in the cross-lingual subtask.


Introduction
As convenient word representation methods have been proposed, word embeddings have been used successfully in state-of-the-art systems for tasks ranging from text classification (Kim, 2014), opinion categorization (Enríquez et al., 2016) and machine translation (Zou et al., 2013) to stock price prediction (Peng and Jiang, 2016).
In earlier studies, latent semantic analysis (LSA) was introduced by Deerwester (1990) and popularized by Landauer (1997). It is called a topic model because terms are represented as vectors of topics. In 2003, researchers developed a topic model based on latent Dirichlet allocation (LDA) (Blei et al., 2003). LDA did not spread widely until Gibbs sampling was applied to the online training of LDA (Hoffman et al., 2010). Another traditional distributional method, the pointwise mutual information metric, was proposed by Turney and Pantel (2010). More recently, fast distributed embeddings such as word2vec (Mikolov et al., 2013c) and GloVe (Pennington et al., 2014) are based on the assumption that the meaning of a word depends on its context. As Levy et al. (2015) pointed out, there is no significant performance difference between them.
Methods for cross-lingual word representation generally fall into four categories: monolingual mapping (Mikolov et al., 2013b), pseudo-cross-lingual training (Gouws and Søgaard, 2015), cross-lingual training (Hermann and Blunsom, 2014) and joint optimization (Coulmance et al., 2015). As presented in (Mogadala and Rettinger, 2016), the joint optimization method represents the state of the art in cross-lingual text classification and translation. These methods train embeddings on both monolingual and parallel corpora by jointly optimizing the losses. However, they are rarely used for word similarity because of their unsatisfying performance.
In this task, we adopt different strategies for the two subtasks. We use word2vec for subtask 1, monolingual word similarity. For subtask 2, cross-lingual word similarity, we use jointly optimized cross-lingual word representations in addition to a transliteration model. We build a cross-lingual word embedding system and a special machine translation system. Our approach has the following characteristics:
• Fast and efficient. Both word2vec and the cross-lingual word embedding tool are fast (Coulmance et al., 2015) and do not need expensive annotated word-aligned data.
• Decreasing OOVs. Our translation system features a transliteration model that deals with OOVs outside the parallel corpus.
We constructed a naive system and, given the limited time, did not tune the parameters of the embedding and translation models.

Our Approach
We use skip-gram word embeddings directly for the monolingual subtask. For the cross-lingual subtask, we use English as the pivot language and train multilingual word embeddings on monolingual corpora and sentence-aligned parallel data. A translation model is also trained with our statistical machine translation system. We then translate the words in the test set into English and look up their word embeddings. Words whose translations are missing from the English word embeddings are looked up in the word embeddings of their original language.
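The lookup order described above can be sketched as follows. This is a minimal illustration rather than the system's actual code: the function name `lookup_vector`, the `translate` callable and the dictionary-based embedding tables are hypothetical stand-ins for the SMT system and the trained embeddings.

```python
import numpy as np

def lookup_vector(word, lang, translate, english_vectors, native_vectors):
    """Return an embedding for `word`, or None if no table contains it.

    `translate` is a hypothetical callable (word, lang) -> English word or None.
    `english_vectors` maps English words to vectors; `native_vectors` maps a
    language code to that language's word -> vector table.
    """
    # Step 1: translate into the pivot language (English) unless already English.
    english_word = word if lang == "en" else translate(word, lang)
    # Step 2: prefer the English embedding of the translation.
    if english_word is not None and english_word in english_vectors:
        return english_vectors[english_word]
    # Step 3: otherwise fall back to the embedding in the original language.
    return native_vectors.get(lang, {}).get(word)


# Toy data, for illustration only.
english_vectors = {"cat": np.array([1.0, 0.0])}
native_vectors = {"es": {"perro": np.array([0.0, 1.0])}}
toy_translations = {("gato", "es"): "cat"}

def toy_translate(word, lang):
    return toy_translations.get((word, lang))

print(lookup_vector("gato", "es", toy_translate, english_vectors, native_vectors))
print(lookup_vector("perro", "es", toy_translate, english_vectors, native_vectors))
```

The fallback in step 3 assumes that the jointly trained embeddings of all languages live in one shared vector space, so a native-language vector remains comparable with an English one.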

Word Embeddings
For the monolingual task, we choose word2vec to generate our word representations because of its robustness. Mikolov et al. (2013c) model the input word embedding $\vec{w}$ as the weights from the input layer to the projection layer, and its output vector $\vec{w}_o$ as the weights from the projection layer to the one-hot output layer.
Skip-gram Model. The skip-gram model assumes that $P(w \mid c) = \sigma(\vec{w} \cdot \vec{c})$, where $\vec{c}$ is the embedding of the context word $c$. The model then minimizes the loss function, which is simplified as:
$$J = \sum_{s \in C} \sum_{w \in s} \sum_{c \in s[w-l : w+l]} -\log \sigma(\vec{w} \cdot \vec{c})$$
where $C$ is the set of sentences in the training corpus, $s$ is a sentence, $l$ is the window length and $\sigma$ is the sigmoid function. Negative sampling is omitted from the equation for simplicity.
Trans-gram Model. Having introduced the skip-gram model, we now extend it to the trans-gram model (Coulmance et al., 2015) for the cross-lingual task. For sentence-aligned data $A_{s,t}$, where $s$ is the source language and $t$ is the target language, we consider the whole sentence $s_t$ as the context of each word $w_s$ in sentence $s_s$. The loss for the source language is written as:
$$\Omega_{s,t} = \sum_{(s_s, s_t) \in A_{s,t}} \sum_{w_s \in s_s} \sum_{c_t \in s_t} -\log \sigma(\vec{w}_s \cdot \vec{c}_t)$$
where $c_t$ ranges over the words of $s_t$. As in the skip-gram model, negative sampling is also adopted.
The skip-gram model is known for its efficiency (Mikolov et al., 2013a). The trans-gram model has the same computational complexity and therefore the same speed. Although cross-lingual embeddings can be trained quickly, their performance on the word similarity task is unsatisfying (a correlation of 0.493) with word-aligned data (Luong et al., 2015). We therefore turn to machine translation, assisted by these word embeddings, for steadier performance.
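To make the two objectives concrete, the sketch below computes the positive-pair part of both losses on a toy corpus. It is an illustration under stated assumptions, not the training code: sentences are lists of integer word ids, `W` and `C` are hypothetical word and context embedding matrices, the center word is excluded from its own window, and negative sampling is left out as in the simplified description.

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)

def skipgram_loss(sentences, W, C, window):
    """Sum of -log sigmoid(w . c) over (word, context) pairs within `window`."""
    loss = 0.0
    for sent in sentences:
        for i, w in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # assumption: a word is not its own context
                    loss -= log_sigmoid(W[w] @ C[sent[j]])
    return loss

def transgram_loss(aligned_pairs, W_src, C_tgt):
    """Same loss, but every word of the aligned target sentence is a context."""
    loss = 0.0
    for src_sent, tgt_sent in aligned_pairs:
        for w in src_sent:
            for c in tgt_sent:
                loss -= log_sigmoid(W_src[w] @ C_tgt[c])
    return loss


# Toy example: 3-word vocabularies with 2-dimensional random embeddings.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))
C = rng.normal(size=(3, 2))
print(skipgram_loss([[0, 1, 2]], W, C, window=1))
print(transgram_loss([([0, 1], [0, 1, 2])], W, C))
```

The only structural difference between the two functions is the context set: a sliding window inside one sentence versus the entire aligned sentence, which is why the two models share the same per-pair cost.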

Machine Translation System
We constructed a phrase-based statistical machine translation (SMT) system with a transliteration model (TM) (Durrani et al., 2014). Our SMT system is illustrated in Figure 1. Like most phrase-based machine translation models, our system follows the steps shown in light gray in the diagram. First, we use GIZA++ (Och and Ney, 2003) as our aligner to align words and obtain the lexical translation table. Phrases are then extracted, and we estimate their translation scores in both directions by refining the word alignments heuristically. Subsequently, a distance-based bidirectional reordering model conditioned on both the source and target languages is built to arrange the word order. For more details, please see (Koehn et al., 2003). Since our SMT system is a discriminative model, once all the features are captured, their weights are tuned using minimum error rate training (MERT) (Och, 2003). We choose KenLM (Heafield et al., 2013) as our language model and a stack decoder (Zens and Ney, 2008) with beam search for our system.
Transliteration model. Since the parallel corpus is small and its word coverage is very limited, we apply a transliteration model to translate the OOVs. It models the character relationships between words and generates words at the character level. For word alignments with a character relationship, consider a word pair $(e, f)$; the transliteration model is defined as:
$$p_{tr}(e, f) = \sum_{a \in Align(e, f)} \prod_{j=1}^{|a|} p(q_j)$$
where $Align(e, f)$ is the set of possible character alignment sequences, $a$ is one of the alignment sequences and $q_j$ is one character alignment in $a$. Word pairs without a character relationship are modeled by multiplying the source and target character unigram models, which gives $p_{ntr}(e, f)$. The whole model is defined as the combination of the transliteration and non-transliteration submodels, where $\lambda$ is the prior probability of non-transliteration:
$$p(e, f) = (1 - \lambda)\, p_{tr}(e, f) + \lambda\, p_{ntr}(e, f)$$
The transliteration model learns the character alignments using expectation maximization (EM) over the character pairs. $\lambda$ is computed in the tuning stage of the whole system.
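The mixture of the two submodels can be illustrated with a small dynamic program. This is only a sketch under assumptions that the text does not state: alignment units are restricted to one-to-one, one-to-null and null-to-one character pairs, and the probability tables `pair_prob`, `src_unigram` and `tgt_unigram` are hypothetical toy values rather than EM estimates.

```python
def transliteration_prob(e, f, pair_prob):
    """Sum, over all monotone character alignments of e and f, of the
    product of the alignment-unit probabilities (forward algorithm)."""
    n, m = len(e), len(f)
    # alpha[i][j] = total probability of aligning e[:i] with f[:j]
    alpha = [[0.0] * (m + 1) for _ in range(n + 1)]
    alpha[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:  # character e[i-1] aligned with f[j-1]
                alpha[i][j] += alpha[i - 1][j - 1] * pair_prob.get((e[i - 1], f[j - 1]), 0.0)
            if i > 0:  # character e[i-1] aligned with nothing
                alpha[i][j] += alpha[i - 1][j] * pair_prob.get((e[i - 1], ""), 0.0)
            if j > 0:  # character f[j-1] aligned with nothing
                alpha[i][j] += alpha[i][j - 1] * pair_prob.get(("", f[j - 1]), 0.0)
    return alpha[n][m]

def non_transliteration_prob(e, f, src_unigram, tgt_unigram):
    """Product of the source and target character unigram probabilities."""
    p = 1.0
    for ch in e:
        p *= src_unigram.get(ch, 0.0)
    for ch in f:
        p *= tgt_unigram.get(ch, 0.0)
    return p

def mixture_prob(e, f, pair_prob, src_unigram, tgt_unigram, lam):
    """Interpolate both submodels; lam is the non-transliteration prior."""
    p_tr = transliteration_prob(e, f, pair_prob)
    p_ntr = non_transliteration_prob(e, f, src_unigram, tgt_unigram)
    return (1.0 - lam) * p_tr + lam * p_ntr


# Toy tables, for illustration only.
pair_prob = {("a", "b"): 0.5, ("a", ""): 0.1, ("", "b"): 0.1}
src_unigram = {"a": 0.2}
tgt_unigram = {"b": 0.3}
print(mixture_prob("a", "b", pair_prob, src_unigram, tgt_unigram, lam=0.5))
```

The forward table avoids enumerating alignment sequences explicitly, so the cost is proportional to the product of the two word lengths.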

Implementation
Word representations trained on different corpora may differ significantly in performance, and larger corpora typically yield better word embeddings. For the sake of comparison, however, we use only the shared corpus.
Data. We use the benchmark monolingual Wikipedia and Europarl corpora specified in the task description.
Preprocessing. For the Wikipedia data, we first filter out stop words using the list from RANKS NL. Then we remove digits and normalize the punctuation marks. Empty lines and web tags are also deleted. For the parallel data, we only filter out the stop words and normalize the punctuation marks. The parallel data are split into a training set (99%) and a development set (1%) used for tuning the translation system.
Similarity score. We use the cosine similarity of two embeddings as the similarity score of a word pair. Its range is [-1, 1].
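The cleaning steps and the similarity score can be sketched as below. This is a simplified illustration, not the actual preprocessing script: the regular expressions, the helper names and the toy stop-word set are assumptions, punctuation normalization is omitted, and returning 0.0 for a zero vector is a choice made here for safety.

```python
import re
import numpy as np

TAG_RE = re.compile(r"<[^>]+>")   # web/markup tags such as <doc ...>
DIGIT_RE = re.compile(r"\d+")     # runs of digits

def preprocess_line(line, stop_words):
    """Strip tags and digits, then drop stop words (compared in lowercase)."""
    line = TAG_RE.sub(" ", line)
    line = DIGIT_RE.sub(" ", line)
    tokens = [t for t in line.split() if t.lower() not in stop_words]
    return " ".join(tokens)  # an empty result corresponds to a line to delete

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; lies in [-1, 1]."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        return 0.0  # assumption: treat zero vectors as unrelated
    return float(np.dot(u, v) / denom)


print(preprocess_line("<doc id=1> The cat has 9 lives </doc>", {"the", "has"}))
print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))
```

Because the score depends only on the angle between the vectors, it is unaffected by vector length, which matters when phrase vectors are built by summing word vectors.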

Monolingual Experiments
We conduct an experiment on English word embeddings to evaluate our vectors. We use phrasing and positional context during training. Phrasing extracts phrases based on co-occurrence, with a threshold of 400. Positional context treats the same word in different positions as different words. Our monolingual embeddings are trained with 500 dimensions, 5 iterations, 15 negative samples, win=5 and mincount=10. We use the similarity part of WordSim353 (Agirre et al., 2009), MEN (Bruni et al., 2012), M.Turk (Radinsky et al., 2011), Rare Words (Luong et al., 2013) and SimLex (Hill et al.) as test sets, which contain 203, 3000, 287, 2034 and 999 word pairs respectively. Table 1 lists the results of our embeddings alongside those reported in (Levy et al., 2015) for the same window size without phrasing and positional context.
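Positional context can be illustrated by the way training pairs are generated. The sketch below is an assumption about the mechanism rather than the tool actually used: each context word is tagged with its offset from the center word, so the same word at different positions becomes a different context symbol. The function name and the `word_offset` naming scheme are hypothetical.

```python
def positional_pairs(tokens, window):
    """Return (word, context) pairs where each context carries its offset."""
    pairs = []
    for i, word in enumerate(tokens):
        for offset in range(-window, window + 1):
            j = i + offset
            if offset != 0 and 0 <= j < len(tokens):
                # "cat_-1" and "cat_1" are distinct context symbols.
                pairs.append((word, f"{tokens[j]}_{offset}"))
    return pairs


for pair in positional_pairs(["the", "cat", "sat"], window=1):
    print(pair)
```

With win=5, every context word can appear under up to ten different offsets, so the context vocabulary grows accordingly while the word vocabulary is unchanged.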
Table 2 lists the performance on all languages of the submitted systems that use extra resources, together with ours (in bold) and RUFINO, the other system that uses the same corpus.

Cross-lingual Experiments
In the cross-lingual word similarity subtask, each word pair is composed of words in different languages. This subtask consists of ten cross-lingual word similarity datasets: EN-DE, EN-ES, EN-FA, EN-IT, DE-ES, DE-FA, DE-IT, ES-FA, ES-IT, and FA-IT. We define OOVs as the words that can be found neither in the parallel data nor in the word embeddings. In this subtask, because of the limited amount of parallel data, OOVs make up a large proportion of the test sets. Table 3 shows the OOV statistics of the test sets before and after applying the transliteration model, and the final counts after looking up the cross-lingual word embeddings.
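The three OOV counts reported per test set can be computed with simple set operations. The sketch below is an assumption about the bookkeeping, not the script used for Table 3: `parallel_vocab` stands for the words covered by the parallel data, `transliterated` for the words that the transliteration model turned into a real word, and `embedding_vocab` for the words present in the cross-lingual embeddings.

```python
def oov_counts(test_words, parallel_vocab, transliterated, embedding_vocab):
    """Return OOV counts (before TM, after TM, after embedding lookup)."""
    words = set(test_words)
    before = words - set(parallel_vocab)       # not covered by parallel data
    after_tm = before - set(transliterated)    # still untranslated after TM
    final = after_tm - set(embedding_vocab)    # not in the embeddings either
    return len(before), len(after_tm), len(final)


print(oov_counts(["a", "b", "c", "d"], {"a"}, {"b"}, {"c"}))
```

Counting distinct words rather than occurrences is a choice made here; counting per occurrence would only require replacing the sets with lists and filters.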
In subtask 2, because of limited time, we did not use phrasing and positional context as in subtask 1. For phrases in the test sets, we sum the vectors of all words in the phrase to obtain its embedding. Table 4 lists the results of random embeddings, which amount to random guessing without any semantics, the corrected results of our system, and the results of the top system (Luminoso2).
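Summing word vectors to embed a phrase can be sketched in a few lines. The helper below is illustrative only: whitespace tokenization, the `vectors` dictionary and the decision to skip words that have no vector are assumptions.

```python
import numpy as np

def phrase_vector(phrase, vectors):
    """Sum the vectors of the words in `phrase`; None if no word is known."""
    found = [vectors[tok] for tok in phrase.split() if tok in vectors]
    if not found:
        return None
    return np.sum(found, axis=0)


toy_vectors = {"new": np.array([1.0, 0.0]), "york": np.array([0.0, 2.0])}
print(phrase_vector("new york", toy_vectors))
```

Since the similarity score is a cosine, summing and averaging the word vectors give the same score for a phrase.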

Results
Compared with the results in (Levy et al., 2015), our embeddings improve by 4.2% on WordSim353s and 3.3% on SimLex, but decline slightly by 0.3% on MEN, 1.3% on M.Turk and 1.0% on RareWords. Phrasing and positional context thus fail to benefit word embeddings on some test sets. Overall, the embeddings we trained are comparable to theirs. Table 2 shows that our system ranked 3rd and performs consistently better than RUFINO on subtask 1. With phrasing and positional context, word2vec can achieve satisfying performance.
As Table 3 shows, the transliteration model reduces the number of OOVs by up to 43.3%. The replacement words are generated at the character level and proved to be real words. Our transliteration model can therefore substantially reduce OOVs.
Our cross-lingual system was ranked 8th in the official results because it used mismatched data. We reran our model with the correct data, and the corrected results (to be mentioned in the task description paper), listed in Table 4, would rank 6th. Our results for subtask 2 are much better than those of the random embeddings, which are equivalent to blind guessing. However, the gap between the best system and ours is significant. Insufficient parallel data and too few training epochs for the non-English embeddings may account for this.

Conclusion
For the monolingual subtask, we trained word2vec-based word embeddings with positional context and phrasing. For the cross-lingual subtask, we built a cross-lingual word representation model and a statistical machine translation system with an unsupervised transliteration model, which can translate many OOVs. We are the only team that uses the benchmark corpus and achieves good performance on both subtasks. In the global ranking for open resources, however, there is much room for improvement, for example by using more iterations, more resources and more advanced models.