Multi-lingual Common Semantic Space Construction via Cluster-consistent Word Embedding

We construct a multilingual common semantic space based on distributional semantics, in which words from multiple languages are projected into a shared space so that all available resources and knowledge can be shared across languages. Beyond word alignment, we introduce multiple cluster-level alignments and enforce that word clusters are consistently distributed across languages. We exploit three signals for clustering: (1) neighboring words in the monolingual word embedding space; (2) character-level information; and (3) linguistic properties (e.g., apposition, locative suffix) derived from linguistic structure knowledge bases available for thousands of languages. We introduce a new cluster-consistent correlational neural network to construct the common semantic space by aligning words as well as clusters. Intrinsic evaluation on monolingual and multilingual QVEC tasks shows that our approach achieves significantly higher correlation with linguistic features extracted from manually crafted lexical resources than state-of-the-art multilingual embedding learning methods do. Using low-resource language name tagging as a case study for extrinsic evaluation, our approach achieves up to a 14.6% absolute F-score gain over the state of the art on cross-lingual direct transfer. Our approach is also shown to be robust even when the size of the bilingual dictionary is small.


Introduction
More than 3,000 languages have electronic records; for example, at least a portion of the Christian Bible has been translated into 2,508 different languages. However, training data for mainstream natural language processing (NLP) tasks such as Information Extraction (IE) and Machine Translation (MT) is only available for dozens of dominant languages. In this paper we aim to construct a multilingual common semantic space where words in multiple languages are mapped into a distributed, language-agnostic, continuous semantic space, so that resources and knowledge can be shared across languages. (The resources and programs are available for research purposes: https://github.com/wilburOne/CommonSpace/)
Previous multilingual embedding methods align the semantic distributions of words from multiple languages within the common semantic space. Though several recent attempts (Artetxe et al., 2017, 2018; Conneau et al., 2017) have shown that it is possible to extract multilingual word embeddings from a pair of potentially unaligned corpora, we argue that it is necessary to impose more constraints to preserve linguistic properties and facilitate downstream NLP tasks such as cross-lingual IE and MT. We find that words can also be clustered through explicit clues (e.g., sharing affixes with certain linguistic functions) or implicit clues (e.g., sharing neighbors in the monolingual word embedding space), and that such clusters should be consistent across languages. To do so, we design a new algorithm, called cluster-consistent multilingual word embedding, that extracts multilingual word embedding vectors which preserve the natural clustering structures of words across multiple languages.
We propose to create clusters through three kinds of signals as follows, without any extra human annotation effort. Then we aggregate the embedding vectors of words in each cluster and ensure that the clusters (or the words therein) are consistent across multiple languages.
Neighbor based clustering and alignment. We build our common space on the correlational neural network (CorrNet), an extension of the autoencoder framework that enables cross-lingual reconstruction. In contrast to previous work (Chandar et al., 2016; Rajendran et al., 2015), we extend CorrNet to a neighborhood-consistent correlational network by using each word's neighbors (the nearest words within the monolingual semantic space) to ensure that the cross-lingual mapping from and to the common semantic space is locally smooth. For instance, the neighboring words of China in English (Japan, India, and Taiwan) should be close to the neighboring words of Cina in Italian (Beijing, Korea, Japan) in the common semantic space. In other words, we encourage the consistency of neighborhoods across languages.
Character based clustering and alignment. Many related languages share very similar character sets, and many words that refer to the same concept share similar compositional characters or patterns, e.g., China (English), Kina (Danish), and Cina (Italian).
Linguistic property based clustering and alignment. Many languages also share linguistic properties, e.g., apposition, conjunction, and plural suffixes (English -s / -es, Turkish -lar / -ler, Somali -o). Linguists have created a wide variety of linguistic property knowledge bases, which are readily available for thousands of languages. For example, the CLDR (Unicode Common Locale Data Repository, cldr.unicode.org) includes closed word classes and affixes indicating various linguistic properties. We propose to take advantage of these language-universal resources to create clusters, where the words within one cluster share the same linguistic property, and to build alignments between clusters for common semantic space construction.
We evaluate our approach on monolingual and multilingual QVEC (Tsvetkov et al., 2015) tasks, which measure the quality of word embeddings based on the alignment of the embeddings to linguistic feature vectors extracted from manually crafted linguistic resources, as well as an extrinsic evaluation on name tagging for low-resource languages. Experiments demonstrate that our framework is effective at capturing linguistic properties and significantly outperforms state-of-the-art multi-lingual embedding learning methods.

Related Work
Multilingual word embeddings have advanced many multilingual NLP tasks, such as machine translation (Zou et al., 2013; Mikolov et al., 2013b; Madhyastha and España-Bonet, 2017), dependency parsing (Guo et al., 2015; Ammar et al., 2016a), and name tagging (Zhang et al., 2017a; Tsai and Roth, 2016; Zhang et al., 2018; Cheung et al.; Zhang et al., 2017b; Feng et al., 2017). Using bilingually aligned words, previous methods project multiple monolingual embeddings into a shared semantic space using linear mappings (Mikolov et al., 2013b; Rothe et al., 2016; Baroni et al., 2015; Xing et al., 2015; Smith et al., 2017) or canonical correlation analysis (CCA) (Ammar et al., 2016b; Faruqui and Dyer, 2014; Lu et al., 2015). Compared with CCA, which only optimizes the correlation for each individual pair of languages, linear mapping based methods can jointly optimize all the languages in the common semantic space. We focus on learning linear mappings to construct the common semantic space and adopt correlational neural networks (CorrNet) (Chandar et al., 2016; Rajendran et al., 2015) as the basic model. In contrast to previous work, which only exploited monolingual word semantics, we introduce multiple cluster-level alignments and design a new cluster-consistent CorrNet to align both words and clusters.
Another branch of approaches to multilingual word embeddings is based on parallel or comparable data, such as parallel sentences (AP Chandar et al., 2014; Gouws et al., 2015; Luong et al., 2015; Hermann and Blunsom, 2014; Schwenk et al., 2017), phrase translations (Duong et al., 2016), and comparable documents (Vulic and Moens, 2015). Moreover, to reduce the need for bilingual alignment, several approaches have been designed to learn cross-lingual embeddings from a small seed dictionary (Vulic and Korhonen, 2016; Artetxe et al., 2017), or even with no supervision (Cao et al., 2016; Zhang et al., 2017d,c; Conneau et al., 2017; Artetxe et al., 2018). However, such methods are still limited to bilingual word embedding learning and remain to be explored for common semantic space construction.
Approach

Overview
Figure 1 shows the overview of our neural architecture. We project all monolingual word embeddings into a common semantic space based on word-level as well as cluster-level alignments and learn the transformation functions. First, at the word level, we build a neighborhood-consistent CorrNet to augment word representations with neighbor based clusters and align them in the common semantic space. In addition, we apply a language-independent convolutional neural network to compose character-level word representations and concatenate them with word representations in the common semantic space. Finally, we construct clusters based on linguistic properties, including closed word classes and affixes, and align them in the common semantic space. We jointly optimize all the alignments in the common semantic space for each pair of languages.

Basic Model
We briefly describe the basic model for learning the common semantic space: the correlational neural network (CorrNet) (Chandar et al., 2016; Rajendran et al., 2015). It combines the advantages of canonical correlation analysis (CCA) and the autoencoder (AE).
Given the bilingual aligned word pairs between two languages $l_1$ and $l_2$, we first use their monolingual word embeddings to initialize each word with a vector and obtain $M_{l_1} \in \mathbb{R}^{|V_{l_1}| \times d_{l_1}}$ and $M_{l_2} \in \mathbb{R}^{|V_{l_2}| \times d_{l_2}}$, where $V_{l_1}$ and $V_{l_2}$ are the bilingual dictionary entries of $l_1$ and $l_2$ ($V^i_{l_1}$ is the translation of $V^i_{l_2}$), and $d_{l_1}$ and $d_{l_2}$ are the vector dimensionalities. Then for each language we learn a linear projection function to project $M_{l_1}$ and $M_{l_2}$ into the common semantic space:

$$H_{l_1} = \sigma(M_{l_1} W_{l_1} + b_{l_1}), \qquad H_{l_2} = \sigma(M_{l_2} W_{l_2} + b_{l_2})$$

where $H_{l_1} \in \mathbb{R}^{|V_{l_1}| \times h}$ and $H_{l_2} \in \mathbb{R}^{|V_{l_2}| \times h}$ are the vector representations of $V_{l_1}$ and $V_{l_2}$ in the common semantic space respectively, and $h$ is the vector dimensionality of the shared semantic space. $W_{l_1} \in \mathbb{R}^{d_{l_1} \times h}$ and $W_{l_2} \in \mathbb{R}^{d_{l_2} \times h}$ are the transformation matrices, and $b_{l_1}$ and $b_{l_2}$ are the bias vectors. $\sigma$ denotes the sigmoid function.
After we project the monolingual embeddings into the common semantic space, we further reconstruct $M_{l_1}$ and $M_{l_2}$ from $H_{l_1}$ and $H_{l_2}$ separately:

$$M'_{l_1} = \sigma(H_{l_1} W'_{l_1} + b'_{l_1}), \qquad M'_{l_2} = \sigma(H_{l_2} W'_{l_2} + b'_{l_2})$$
$$\tilde{M}_{l_1} = \sigma(H_{l_2} W'_{l_1} + b'_{l_1}), \qquad \tilde{M}_{l_2} = \sigma(H_{l_1} W'_{l_2} + b'_{l_2})$$

where $M'_{l_1}$ and $M'_{l_2}$ are monolingual reconstructions, $\tilde{M}_{l_1}$ and $\tilde{M}_{l_2}$ are cross-lingual reconstructions, and $W'_{l_1}$ and $W'_{l_2}$ are the transposes of $W_{l_1}$ and $W_{l_2}$ respectively. To learn the common semantic representations, we minimize the distance between the aligned word vectors as well as the losses of monolingual and cross-lingual reconstruction:

$$J_{word} = \sum_{(l_1, l_2) \in A} \big[ L(H_{l_1}, H_{l_2}) + L(M_{l_1}, M'_{l_1}) + L(M_{l_2}, M'_{l_2}) + L(M_{l_1}, \tilde{M}_{l_1}) + L(M_{l_2}, \tilde{M}_{l_2}) \big]$$

where $l_1$ and $l_2$ range over the languages that we want to project into the common semantic space, $A$ denotes all bilingual dictionaries, and $L$ denotes a similarity metric. In our work, we use cosine similarity as the similarity metric.
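The projection and reconstruction steps above can be sketched in NumPy. This is a minimal illustration of one evaluation of the CorrNet objective for a single bilingual dictionary, not the paper's released code; the function names, shapes, and the exact way the five similarity terms are combined are our own assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cosine_rows(a, b):
    """Mean cosine similarity between corresponding rows of a and b."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-12
    return float(np.mean(num / den))

def corrnet_loss(M1, M2, W1, W2, b1, b2, b1p, b2p):
    """One evaluation of a CorrNet-style objective for a bilingual dictionary.

    M1, M2: aligned word embeddings of shapes (|V|, d1) and (|V|, d2);
    row i of M1 is the translation of row i of M2.
    W1, W2: projection matrices of shapes (d1, h) and (d2, h).
    """
    # Project both languages into the shared h-dimensional space.
    H1 = sigmoid(M1 @ W1 + b1)
    H2 = sigmoid(M2 @ W2 + b2)
    # Monolingual reconstructions via the transposed projections.
    R1 = sigmoid(H1 @ W1.T + b1p)
    R2 = sigmoid(H2 @ W2.T + b2p)
    # Cross-lingual reconstructions: decode each language from the other.
    X1 = sigmoid(H2 @ W1.T + b1p)
    X2 = sigmoid(H1 @ W2.T + b2p)
    # Maximizing total similarity = minimizing its negation.
    return -(cosine_rows(H1, H2)
             + cosine_rows(M1, R1) + cosine_rows(M2, R2)
             + cosine_rows(M1, X1) + cosine_rows(M2, X2))
```

In a real trainer this loss would be minimized with gradient descent over $W$, $b$, and the bias terms; here it only shows how the five alignment and reconstruction terms fit together.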

Neighborhood-Consistent CorrNet
CorrNet can project multiple monolingual word embeddings into a common semantic space using bilingual word alignment. However, the same concept may have different semantic biases in different languages. For example, the top five nearest words of the concept "China" are (Japan, India, Taiwan, Chinese, Asia) in English, (Cosco, Shenzhen, Australian, Shanghai, manufacturing) in Danish, and (Beijing, Korea, Japan, aluminum, copper) in Italian respectively. To ensure the consistency of neighborhoods within the common semantic space and make the cross-lingual mapping locally smooth, we propose to augment each monolingual word representation with its top-N nearest neighboring words from the original monolingual semantic space. (We set N = 10 in our experiments since it performed best on the intrinsic evaluation among {2, 5, 10, 20, 50}.) Given the monolingual embeddings of the bilingually aligned words for two languages $l_1$ and $l_2$, $M_{l_1}$ and $M_{l_2}$, for each word we extract the top-N nearest neighbors and construct the neighborhood clusters. Each cluster $t_l = \{w_1, w_2, ..., w_{|t_l|}\}$ in language $l$ is represented by

$$c_{t_l} = \frac{1}{|t_l|} \sum_{w \in t_l} E_w$$

where $E_w$ denotes the monolingual word embedding of $w$. We obtain all the neighborhood cluster vector representations $C_{l_1}$ and $C_{l_2}$ for $l_1$ and $l_2$. We incorporate the neighborhood cluster information into the common semantic space when projecting the monolingual embeddings:

$$H_{l_1} = \sigma([M_{l_1}; C_{l_1}] W_{l_1} + b_{l_1}), \qquad H_{l_2} = \sigma([M_{l_2}; C_{l_2}] W_{l_2} + b_{l_2})$$

where $[\cdot\,;\cdot]$ denotes row-wise concatenation. Besides the monolingual and cross-lingual reconstructions of $M_{l_1}$ and $M_{l_2}$ in CorrNet, we also add monolingual and cross-lingual reconstructions of the neighborhood clusters $C_{l_1}$ and $C_{l_2}$.
In addition to optimizing the loss functions of the basic model, we further optimize the monolingual and cross-lingual reconstruction losses for the neighborhood clusters:

$$J_{neighbor} = \sum_{(l_1, l_2) \in A} \big[ L(C_{l_1}, C'_{l_1}) + L(C_{l_2}, C'_{l_2}) + L(C_{l_1}, \tilde{C}_{l_1}) + L(C_{l_2}, \tilde{C}_{l_2}) \big]$$

where $C'_l$ and $\tilde{C}_l$ denote the monolingual and cross-lingual reconstructions of the cluster vectors $C_l$, computed in the same way as the word reconstructions.
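The neighborhood-cluster construction above (top-N cosine-nearest neighbors, averaged into a cluster vector per word) can be sketched as follows. This is a small NumPy illustration under our own naming; the paper does not specify an implementation.

```python
import numpy as np

def neighborhood_clusters(E, n=10):
    """For each row (word) of the embedding matrix E, return the mean
    embedding of its top-n cosine-nearest neighbors: the word's
    neighborhood-cluster vector, one row per word."""
    X = E / (np.linalg.norm(E, axis=1, keepdims=True) + 1e-12)
    sim = X @ X.T                          # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)         # a word is not its own neighbor
    idx = np.argsort(-sim, axis=1)[:, :n]  # indices of top-n neighbors
    return E[idx].mean(axis=1)             # (|V|, d) cluster vectors C
```

The resulting matrix plays the role of $C_l$ above: it is concatenated with the word embeddings before projection and given its own reconstruction terms.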

Character-Level Word Alignment
Bilingual word alignment is not always sufficient or available to induce a common semantic space, especially for low-resource languages. Although the words that refer to the same concept are not spelled identically across languages, they often share a set of similar characters, especially in related languages written in the same script. For example, the same entity is spelled slightly differently in three languages: Semsettin Gunaltay in English, Şemsettin Günaltay in Turkish, and Semsetin Ganoltey in Somali. Beyond word-level alignment, we introduce character-level alignment by composing word representations from their compositional characters using convolutional neural networks (CNNs). For each language, we adopt a language-independent CNN to generate character-level word representations.

Character lookup embeddings. Let $S_l$ be the character set of language $l$ and $E_{S_l} \in \mathbb{R}^{|S_l| \times d}$ be the character lookup embeddings, where $d$ is the dimensionality of each character embedding.
Here, we use a simple yet effective method to induce character embeddings from word embeddings. For each character $c$, we initialize its embedding by averaging the embeddings of all words which contain the character. The character embeddings are further tuned by the model.

Character-level CNN. Following Kim et al. (2016), the input layer is a sequence of characters of length $k$ for each word. Each character is represented by a $d$-dimensional lookup embedding, so each input sequence is represented as a feature map of dimensionality $d \times k$.
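The character-embedding initialization just described can be sketched directly: each character starts as the average of the embeddings of the words containing it. The dict-based interface here is our own illustration.

```python
import numpy as np

def init_char_embeddings(word_vectors):
    """Initialize each character's embedding as the average of the
    embeddings of all words that contain that character.

    word_vectors: dict mapping word -> 1-D numpy embedding.
    Returns a dict mapping character -> embedding.
    """
    sums, counts = {}, {}
    for word, vec in word_vectors.items():
        for ch in set(word):  # count each word once per character
            sums[ch] = sums.get(ch, 0.0) + vec
            counts[ch] = counts.get(ch, 0) + 1
    return {ch: sums[ch] / counts[ch] for ch in sums}
```

In the full model these vectors are only a starting point; they are fine-tuned along with the other parameters.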
We use a convolution layer to learn a representation for each sliding character n-gram. Let $p_i$ be the concatenated embeddings of $n$ contiguous columns of the input matrix, where $n$ is the filter width. We then apply the convolution weights $W \in \mathbb{R}^{d \times nd}$ to $p_i$ with a bias vector $b \in \mathbb{R}^d$: $\hat{p}_i = \tanh(W p_i + b)$. All n-gram representations $\hat{p}_i$ are used to generate the word representation $y$ by max-pooling.
In our experiments, we apply multiple filters with various widths to obtain the representation of word $w^l_i$. The final character-level word representation $\hat{w}^l_i$ is the concatenation of the representations obtained with the different filter widths.

Cross-lingual mapping. Given the bilingually aligned word pairs, we directly minimize the distance between the character-level word representations in the common semantic space:

$$J_{char} = \sum_{(l_1, l_2) \in A} \sum_{i} L(\hat{w}^{l_1}_i, \hat{w}^{l_2}_i)$$

The final representation of $w^l_i$ in the common semantic space is the concatenation of the character-level word representation $\hat{w}^l_i$ and the projected word representation $h^l_i$.
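The n-gram convolution, max-pooling, and concatenation across filter widths described above can be sketched as a single forward pass. This is a minimal NumPy version under our own parameter layout (each filter is a weight matrix over flattened n-gram windows), not the paper's implementation.

```python
import numpy as np

def char_cnn_word_rep(char_ids, E, filters):
    """Compose a character-level word representation: for each filter
    width n, slide over the word's character embeddings, apply a tanh
    convolution, max-pool over positions, then concatenate the results.

    char_ids: list of character indices for one word (length k)
    E:        character lookup embeddings, shape (|S|, d)
    filters:  list of (W, b) with W of shape (d_out, n*d), b of shape (d_out,)
    """
    chars = E[char_ids]                                # (k, d) feature map
    outputs = []
    for W, b in filters:
        n = W.shape[1] // E.shape[1]                   # recover filter width
        # n-gram windows: concatenate n consecutive character embeddings
        grams = [chars[i:i + n].reshape(-1)
                 for i in range(len(char_ids) - n + 1)]
        conv = np.tanh(np.stack(grams) @ W.T + b)      # (k-n+1, d_out)
        outputs.append(conv.max(axis=0))               # max-pool over positions
    return np.concatenate(outputs)                     # concat across widths
```

Because the CNN weights are shared per language but the architecture is language-independent, the same function serves every language's character vocabulary.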

Linguistic Property Alignment
Linguists have made great efforts to build linguistic property knowledge bases for thousands of the world's languages. These knowledge bases include a large number of typological properties (phonological, lexical, and grammatical), which we use to build a high-level alignment between words across languages. We exploit the following resources:
• CLDR (Unicode Common Locale Data Repository), which includes multilingual gazetteers for months, weekdays, and cardinal and ordinal numbers;
• Wiktionary, a multilingual, web-based collaborative project to create an English-content dictionary, which includes word and prefix/suffix dictionaries for 1,247 languages;
• Panlex, a database which contains 1.1 billion pairwise translations among 21 million expressions in about 10,000 language varieties.
We mainly exploit two types of linguistic properties to extract word clusters. The first type is language-independent closed word classes, such as colors, weekdays, and months; these tend to be consistent across many languages. The second type is affixes that perform particular linguistic functions. For example, "-like" is a suffix denoting "similar to" in English, while in Danish "-agtig" performs the same function. Wiktionary and Panlex include affix alignments between English and other languages. We filter out the many-to-many affix alignments and obtain hundreds of alignments between each language and English. For each affix, we derive a set of word pairs (basic word, extended word with affix) by first selecting all word pairs where basic word + affix = extended word, then ranking all word pairs based on the cosine similarity of their monolingual word embeddings. Finally we select the top ranked 20 word pairs to form the cluster for each affix. We extract a set of word clusters from each language, and align the clusters based on their functions as defined in CLDR, Wiktionary, and Panlex. For each language $l$, each cluster $r^l_i \in R^l$ contains a set of words or word pairs sharing the same function. We use averaging to obtain an overall vector representation $M_{R_l}$ for each cluster. Then we project the cluster-level vectors into the shared semantic space and minimize the distance between aligned clusters:

$$J_{ling} = \sum_{(l_1, l_2) \in A} L(M_{R_{l_1}} W_{l_1}, M_{R_{l_2}} W_{l_2})$$

where $W_l$ is the same transformation matrix used to project each language's word embeddings into the common space. We finally optimize the sum of all the losses described above (word-level alignment and reconstruction, neighborhood cluster reconstruction, character-level alignment, and linguistic property alignment) over the parameters $W_l$ and $b_l$, where $l$ denotes each language projected into the common space.
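The affix-cluster extraction above (keep pairs where basic word + affix = extended word, rank by embedding similarity, keep the top 20) can be sketched for a single suffix. The function name and dict interface are our own illustration of the procedure, not released code.

```python
import numpy as np

def affix_cluster(affix, vocab_vectors, top_k=20):
    """Build the word-pair cluster for one suffix: keep pairs where
    base + affix = extended word, rank them by cosine similarity of
    their monolingual embeddings, and return the top_k pairs.

    vocab_vectors: dict mapping word -> 1-D numpy embedding.
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    pairs = []
    for word, vec in vocab_vectors.items():
        extended = word + affix
        if extended in vocab_vectors:  # base + affix = extended word
            pairs.append((cos(vec, vocab_vectors[extended]), word, extended))
    pairs.sort(reverse=True)           # highest-similarity pairs first
    return [(w, e) for _, w, e in pairs[:top_k]]
```

Running this for the English "-like" cluster and the aligned Danish "-agtig" cluster would yield two word-pair sets whose averaged vectors are then aligned in the shared space.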

Experiment Setup
Previous work (Ammar et al., 2016b; Duong et al., 2017) evaluated multilingual word embeddings on a series of intrinsic (e.g., monolingual and cross-lingual word similarity, word translation) and extrinsic (e.g., multilingual document classification, multilingual dependency parsing) evaluation tasks. To evaluate the quality of the multilingual embeddings, we use the QVEC (Tsvetkov et al., 2015) tasks (described in the intrinsic evaluation section below) as the intrinsic evaluation platform. In addition, to demonstrate the effectiveness of our common semantic space for knowledge transfer, especially in low-resource scenarios, we adopt low-resource language name tagging as the extrinsic evaluation task.
For fair comparison with state-of-the-art methods for building multilingual embeddings (Ammar et al., 2016b; Duong et al., 2017), we use the same monolingual data and bilingual dictionaries as in their work. We build multilingual word embeddings for 3 languages (English, Italian, Danish) and for 12 languages (Bulgarian, Czech, Danish, German, Greek, English, Spanish, Finnish, French, Hungarian, Italian, Swedish). The monolingual data for each language is the combination of the Leipzig Corpora Collection and Europarl. The bilingual dictionaries are the same as those used in Ammar et al. (2016b). For each task, we compare the performance of our common semantic space with previously published multilingual word embeddings (MultiCluster, MultiCCA, MultiSkip, and MultiCross). MultiCluster (Ammar et al., 2016b) groups multilingual words into clusters based on bilingual dictionaries and forces all the words within one cluster, across languages, to share the same embedding. MultiCCA (Ammar et al., 2016b; Faruqui and Dyer, 2014) uses CCA to estimate linear projections for each pair of languages. MultiSkip (Luong et al., 2015) is an extension of the multilingual skip-gram model, which requires parallel data. MultiCross (Duong et al., 2017) unifies bilingual word embeddings into a shared semantic space using post hoc linear transformations. Table 2 lists the hyper-parameters used in the experiments.

Intrinsic Evaluation: QVEC
To evaluate the quality of multilingual embeddings, we adopt QVEC (Tsvetkov et al., 2015) as the intrinsic evaluation measure. It evaluates the quality of word embeddings based on the alignment of distributional word vectors to linguistic feature vectors extracted from manually crafted lexical resources, e.g., SemCor (Miller et al., 1993). For each word, each dimension of its linguistic feature vector gives the probability that the word belongs to a supersense (e.g., NN.FOOD) derived from WordNet (Fellbaum, 1998).
QVEC is computed as

$$\text{QVEC} = \max_{A} \sum_{i=1}^{D} \sum_{j=1}^{P} a_{ij} \, r(x_i, s_j)$$

where $x \in \mathbb{R}^{D \times 1}$ denotes a distributional word vector and $s \in \mathbb{R}^{P \times 1}$ denotes a linguistic word vector, $D$ and $P$ denote the sizes of the respective vectors, $a_{ij} = 1$ iff $x_i$ is aligned to $s_j$ and $a_{ij} = 0$ otherwise, and $r(x_i, s_j)$ is the Pearson correlation between $x_i$ and $s_j$; the alignment matrix $A = \{a_{ij}\}$ is chosen to maximize the cumulative correlation. QVEC-CCA (Ammar et al., 2016b) extends QVEC by using CCA to measure the correlation between the distributional matrix and the linguistic vector matrix, instead of cumulative dimension-wise correlation.

Table 3: QVEC and QVEC-CCA scores. W: word alignment. N: neighbor based clustering and alignment. C: character based clustering and alignment. L: linguistic property based clustering and alignment.
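For a fixed alignment, the cumulative dimension-wise correlation underlying QVEC can be sketched as below. This is our own minimal illustration (the official QVEC implementation also searches over alignments; here the alignment is given as input).

```python
import numpy as np

def qvec(X, S, alignment):
    """Cumulative dimension-wise QVEC-style score for a fixed alignment:
    the sum of Pearson correlations between each distributional dimension
    and the linguistic dimension it is aligned to.

    X: (num_words, D) distributional word vectors
    S: (num_words, P) linguistic feature vectors
    alignment: dict mapping distributional dim i -> linguistic dim j
    """
    def pearson(a, b):
        a = a - a.mean()
        b = b - b.mean()
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    return sum(pearson(X[:, i], S[:, j]) for i, j in alignment.items())
```

A higher score means the embedding dimensions track the manually defined linguistic features more closely, which is the sense in which QVEC measures embedding quality.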
As shown in Table 3, our approaches outperform previous approaches in almost all cases. 13 Specifically, augmenting word representations with neighboring words in the common semantic space consistently improves performance on the monolingual and multilingual QVEC and QVEC-CCA tasks. In addition, aligning character-level compositional representations and linguistic property based clusters in the shared semantic space further improves monolingual and multilingual representation quality.

Impact of Bilingual Dictionary Size
To show the impact of the size of the bilingual lexicons, we use three languages as a case study and gradually reduce the size of the lexicon for each pair of languages from 40,000 to 10,000, and further to 2,000, 1,000, 500, and 250. For the following experiments, we use MultiCluster and MultiCCA as baselines. 14 Table 4 shows the results. We observe that both MultiCCA and the CorrNet approaches are sensitive to the size of the bilingual lexicons. Our approach, on the other hand, maintains high performance even when the size of the bilingual lexicons is reduced to 250. The performance of MultiCluster is close across the various dictionary sizes because it jointly trains the embeddings of multiple languages from scratch and by default takes advantage of identical strings among all the languages.

13 We conduct a paired t-test between CorrNet W+N+Ch+L and all the other models on 10 randomly sampled subsets. The differences are all statistically significant, with all p-values less than 0.05.

14 MultiSkip requires parallel corpora to train cross-lingual embeddings, while the original implementation of MultiCross is not public.

Low-Resource Name Tagging
We evaluate the quality of multilingual embeddings on a downstream task by using the embeddings as input features. Here, we use low-resource language name tagging as the target task, which aims to automatically identify named entities in text and classify them into types including Person (PER), Location (LOC), Organization (ORG), and Geo-Political Entity (GPE). We experiment with two sets of languages. The first set, Amh+Tig, consists of Amharic and Tigrinya; both languages share the same Ge'ez script and descend from the proto-Semitic language family. The other set, Eng+Uig+Tur, consists of one high-resource language (English), one medium-resource language (Turkish), and one low-resource language (Uighur). It also covers two distinct scripts: English and Turkish use Latin script while Uighur uses Arabic script.
We use an LSTM-CRF architecture (Huang et al., 2015; Lample et al., 2016; Ma and Hovy, 2016) for name tagging. It takes only word embeddings as input and predicts a tag for each word. Table 5 shows the statistics of the training, development, and test sets for each language, released by the Linguistic Data Consortium (LDC). 15 For each language pair, we combine the bilingually aligned words extracted from Wiktionary with monolingual dictionaries based on identical strings. We evaluate the quality from several aspects.

15 The annotations are from: Amh (LDC2016E87), Tig (LDC2017E27), Uig (LDC2016E70), Tur (LDC2014E115), Eng (Tjong Kim Sang and De Meulder, 2003). We combined these corpora with a Wikipedia dump to train word embeddings with the Word2Vec toolkit (Mikolov et al., 2013a).

Monolingual embedding quality evaluation. Table 6 shows the name tagging performance for each language using the original monolingual embeddings and the multilingual embeddings. For all languages, the multilingual embeddings learned by our approach significantly outperform those learned by MultiCCA and MultiCluster, which shows the effectiveness of our approach. More importantly, the multilingual embeddings learned by our approach also outperform the original monolingual embeddings, which demonstrates that projecting multiple languages into one common space can further improve monolingual embedding quality.
Cross-lingual direct transfer. In this setting, we train a name tagger on one or two languages using multilingual embeddings and test it on a new language without any annotated data. Our approach achieves better performance than MultiCCA and MultiCluster. The closer the languages are, as with Amharic and Tigrinya, the better the performance.

Cross-lingual mutual enhancement. We finally examine the effect of adding annotated data from other languages while using multilingual embeddings. Adding the annotated sets from other languages (e.g., Uig+Tur or Eng+Uig+Tur) doesn't necessarily result in improvement. This is partially due to the use of Arabic script in Uighur, which differs from Turkish and English. Thus we suggest projecting closely related languages that use the same script into the common semantic space. We take Turkish name tagging as a case study to show the benefit of the common semantic space with extra English annotations. The monolingual model failed to identify Belgrad'da as a geopolitical entity (GPE) because it doesn't occur in the Turkish training data. However, by adding English annotations, the tagger successfully tags it as a GPE: it is semantically close to Belgrade in the common semantic space according to their character-level compositional embeddings, and Belgrade is frequently tagged as GPE in the English annotations. In another example, using Turkish annotations only, Kraliyet Donanması'na is mistakenly tagged as a GPE because it follows da, and all entity mentions following da in the Turkish annotations are annotated as GPE. After adding the English annotations into training, it is correctly tagged as an ORG because da is well aligned with in in the common semantic space according to the linguistic property alignment between Turkish and English, and many entity mentions following in are annotated as ORG in the English annotations.

Conclusions and Future Work
We construct a common semantic space for multiple languages based on a cluster-consistent correlational neural network. It combines word-level alignment and multi-level cluster alignment, including neighbor based clusters, character-level compositional word representations, and linguistic property based clusters induced from the readily available language-universal linguistic knowledge bases. Our approach achieved significantly higher performance than state-of-the-art multilingual embedding learning methods through both intrinsic and extrinsic evaluations. In the future, we will further extend our approach to multi-lingual multimedia common semantic space construction.