Learning Multilingual Meta-Embeddings for Code-Switching Named Entity Recognition

In this paper, we propose Multilingual Meta-Embeddings (MME), an effective method to learn multilingual representations by leveraging monolingual pre-trained embeddings. MME learns to utilize information from these embeddings via a self-attention mechanism without explicit language identification. We evaluate the proposed embedding method on the code-switching English-Spanish Named Entity Recognition dataset in multilingual and cross-lingual settings. The experimental results show that our proposed method achieves state-of-the-art performance in the multilingual setting and that it can generalize to an unseen language task.


Introduction
Learning a representation through embedding is a fundamental technique to capture latent word semantics (Clark, 2015). In practice, word-level representations have been extensively explored to improve many downstream natural language processing (NLP) tasks (Mikolov et al., 2013; Pennington et al., 2014; Grave et al., 2018). A new wave of "meta-embeddings" research aims to learn how to effectively combine pre-trained word embeddings in supervised training into a single dense representation (Yin and Schütze, 2016; Muromägi et al., 2017; Coates and Bollegala, 2018). This method is known to be effective for overcoming domain and modality limitations. However, the generalization ability of previous works has been limited to monolingual tasks, so we aim to extend the method to multilingual contexts, which benefits the processing of code-switching text.
In multilingual societies, speakers tend to move back and forth from one language to another during the same conversation, which is commonly ... called "code-switching". Code-switching occurs in both written text and speech. Recent studies in code-switching have mainly focused on natural language tasks, such as language modeling (Winata et al., 2018a; Pratapa et al., 2018; Garg et al., 2018), named entity recognition (Aguilar et al., 2018), and language identification (Solorio et al., 2014; Molina et al., 2016; Barman et al., 2014). Code-switching is considered challenging because words from different languages may co-exist within a sequence, and models are required to recognize the context of mixed-language sentences. Meanwhile, some words with the same spelling may have entirely different meanings (e.g., cola in English and Spanish) (Winata et al., 2018b). Language identifiers have commonly been used to resolve word ambiguity in mixed-language sentences. However, they may not reliably cover all code-switching cases, and they create a bottleneck, since annotating language identifiers correctly in code-switching data would require large-scale crowdsourcing.
To overcome the code-switching problem, we introduce a multilingual meta-embedding model learned from different languages. Our approach can be seen as a method to create a universal multilingual meta-embedding, learned in a supervised way with code-switching contexts by gathering information from monolingual sources. At the same time, the approach is language-agnostic: it does not require language information for any word. We show the possibility of transferring information from multiple languages to unseen languages, and this approach can also be useful in a low-resource setting. To effectively leverage the embeddings, we use FastText subword information to solve out-of-vocabulary (OOV) issues. By applying this method, our model can align the words with the corresponding languages. Our contributions are two-fold:
• We propose to generate multilingual meta-representations from pre-trained monolingual word embeddings. The model can learn how to construct the best word representation by mixing multiple sources without explicit language identification.
• We evaluate our multilingual meta-embedding on English-Spanish code-switching Named Entity Recognition (NER). The results show the effectiveness of the method in the multilingual setting and demonstrate that our meta-embedding can generalize to unseen languages in a cross-lingual setting.

Meta-Embeddings
Word embedding pre-training is a well-known method to transfer knowledge from previous tasks to a target task that has less high-quality training data. Word embeddings are commonly used as features in supervised learning problems. We propose to generate a single word representation by extracting information from different pre-trained embeddings. We extend the idea of meta-embeddings to solve a multilingual task. We define a sentence that consists of $m$ words $\{x_i\}_{i=1}^{m}$, and, for each word $x_i$, the word vectors $\{w_{i,j}\}_{j=1}^{n}$ from $n$ pre-trained word embeddings.

Baselines
We compare our method to two baselines: (1) concatenation and (2) linear ensembles.
Concatenation We concatenate word embeddings by merging the dimensions of word representations. This is the simplest way to utilize all sources of information; however, it is very inefficient due to the high-dimensional input.
Linear Ensembles We sum all word embeddings into a single word vector with equal weights. This method is efficient since it does not increase the dimensionality of the input. Before summing, we apply a projection layer to each $w_{i,j}$ so that all embeddings have the same dimension. The trainable parameters of this projection are $a_j \in \mathbb{R}^{l \times d}$ and $b_j \in \mathbb{R}^{d}$, where $l$ and $d$ are the original dimension of the pre-trained embedding and the projected dimension, respectively.
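The two baselines can be illustrated with a minimal sketch in plain Python. It assumes that the projection is the affine map $w'_{i,j} = w_{i,j} a_j + b_j$, which is consistent with the parameter shapes given above but is not spelled out in the text. The helper names (`project`, `concat`, `linear_ensemble`) and the toy dimensions are hypothetical.

```python
import random

def project(w, a, b):
    """Affine projection (assumed form): w is l-dim, a is l x d, b is d-dim."""
    l, d = len(a), len(b)
    return [sum(w[r] * a[r][c] for r in range(l)) + b[c] for c in range(d)]

def concat(embeddings):
    """Concatenation baseline: join the n source vectors end to end."""
    return [value for w in embeddings for value in w]

def linear_ensemble(embeddings, projections):
    """Linear baseline: project every source to d dims, then sum with equal weight."""
    projected = [project(w, a, b) for w, (a, b) in zip(embeddings, projections)]
    return [sum(column) for column in zip(*projected)]

# Toy example for a single word with two sources of different sizes.
rng = random.Random(0)
dims = [3, 5]   # hypothetical original dimensions l of the two sources
d = 4           # hypothetical projected dimension
embeddings = [[rng.uniform(-1, 1) for _ in range(l)] for l in dims]
projections = [
    ([[rng.uniform(-1, 1) for _ in range(d)] for _ in range(l)], [0.0] * d)
    for l in dims
]

w_concat = concat(embeddings)
w_linear = linear_ensemble(embeddings, projections)
print(len(w_concat), len(w_linear))  # 8 4
```

The concatenated vector grows with every added source (3 + 5 = 8 here), while the linear ensemble stays at the projected dimension regardless of how many sources are combined.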

Multilingual Meta-Embedding
We generate a multilingual vector representation for each word by taking a weighted sum of monolingual embeddings. Each embedding $w_{i,j}$ is projected into a $d$-dimensional vector by a fully connected layer, and an attention mechanism with a non-linear scoring function $\phi$ (e.g., tanh) calculates the attention weight $\alpha_{i,j} \in \mathbb{R}^{d}$ of each projected embedding.
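The weighting step can be sketched as follows. Since the equations are not reproduced in the text, the sketch makes several assumptions: the projection is affine, the scoring function $\phi$ is an element-wise tanh, and the weights $\alpha_{i,j}$ are obtained by a softmax over the $n$ sources, computed separately for each of the $d$ dimensions. The function names are hypothetical.

```python
import math

def project(w, a, b):
    """Affine projection (assumed form): w is l-dim, a is l x d, b is d-dim."""
    l, d = len(a), len(b)
    return [sum(w[r] * a[r][c] for r in range(l)) + b[c] for c in range(d)]

def meta_embed(embeddings, projections):
    """Return (u, alphas) for one word.

    u      : d-dim meta-embedding, the weighted sum of projected sources
    alphas : n x d attention weights; each column sums to 1 over the n sources
    """
    projected = [project(w, a, b) for w, (a, b) in zip(embeddings, projections)]
    n, d = len(projected), len(projected[0])
    # Assumed scoring function: element-wise tanh of the projected vector.
    scores = [[math.tanh(value) for value in p] for p in projected]
    alphas = [[0.0] * d for _ in range(n)]
    u = []
    for c in range(d):
        exps = [math.exp(scores[j][c]) for j in range(n)]
        total = sum(exps)
        for j in range(n):
            alphas[j][c] = exps[j] / total  # softmax over sources, per dimension
        u.append(sum(alphas[j][c] * projected[j][c] for j in range(n)))
    return u, alphas

# Toy example: two 2-dim sources with identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_bias = [0.0, 0.0]
sources = [[2.0, -1.0], [-2.0, 1.0]]
u, alphas = meta_embed(sources, [(identity, zero_bias)] * 2)
print(alphas)
```

Unlike the linear ensemble, the weights here depend on the input word, so the model can favor one source for some tokens and another source for others.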

Named Entity Recognition
Our proposed model is based on a self-attention mechanism from a transformer encoder (Vaswani et al., 2017) followed by a Conditional Random Field (CRF) layer (Lafferty et al., 2001).
Encoder Architecture We apply a multi-layer transformer encoder as our sentence encoder. In this encoder, $W_t$ is the projection matrix, $W_p$ is the positional encoding matrix, $W_o$ is the output layer, $h_0$ is the hidden state of the first layer, and $h_l$ is the output representation from the final transformer layer. The output layer produces the logits $o$.
Conditional Random Field This model calculates the dependencies across tag labels. NER imposes strong constraints on the tag order; for example, I-PERSON should only follow B-PERSON. We use a CRF to learn the correlations between the current label and its neighbors (Lafferty et al., 2001). We consider $A \in \mathbb{R}^{(k+2) \times (k+2)}$ as a trainable matrix of tag transition scores, where $k$ is the number of tags and $A_{i,j}$ denotes the transition score from tag $i$ to tag $j$. We include a start tag and an end tag in the matrix, and calculate the score of a tag sequence $y$ given $o$ from the transition scores and from $P \in \mathbb{R}^{n \times k}$, where $P_{i,y_i}$ represents the output probability of tag $y_i$ at position $i$ of a sequence of $n$ tokens. We use the Viterbi algorithm to select the best sequence.
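A minimal sketch of the scoring and decoding steps is given below. It assumes the common linear-chain form in which the sequence score is the sum of the emission scores $P_{i,y_i}$ and the transition scores $A_{y_i,y_{i+1}}$, including the transitions from the start tag and to the end tag; the exact formula is not reproduced in the text. The tag set, the hand-set scores, and the function names are hypothetical.

```python
def sequence_score(P, A, y, start, end):
    """Score of tag sequence y: emissions plus transitions (assumed form)."""
    score = A[start][y[0]] + A[y[-1]][end]
    score += sum(P[i][tag] for i, tag in enumerate(y))
    score += sum(A[y[i]][y[i + 1]] for i in range(len(y) - 1))
    return score

def viterbi(P, A, start, end):
    """Return the highest-scoring tag sequence under sequence_score."""
    k = len(P[0])
    best = [A[start][t] + P[0][t] for t in range(k)]
    back = []  # back[i][t] = best previous tag when position i + 1 has tag t
    for i in range(1, len(P)):
        previous, pointers, best = best, [], []
        for t in range(k):
            score, q = max((previous[q] + A[q][t], q) for q in range(k))
            best.append(score + P[i][t])
            pointers.append(q)
        back.append(pointers)
    last = max(range(k), key=lambda t: best[t] + A[t][end])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Hypothetical tags: 0 = O, 1 = B-PERSON, 2 = I-PERSON, 3 = start, 4 = end.
O, B, I, START, END = range(5)
A = [[0.0] * 5 for _ in range(5)]
A[O][I] = -1e4       # I-PERSON must not follow O
A[START][I] = -1e4   # a sequence must not begin with I-PERSON

# Emission scores for three tokens; the per-token argmax would be O, I, O.
P = [[1.0, 0.9, 0.0],
     [0.0, 0.1, 2.0],
     [2.0, 0.0, 0.0]]

path = viterbi(P, A, START, END)
print(path)  # [1, 2, 0]
```

In this toy case, picking the best tag per token would give the invalid sequence O, I-PERSON, O. The transition scores steer the decoder to B-PERSON, I-PERSON, O instead, which is the kind of label dependency the CRF layer is meant to capture.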

Dataset
For our experiment, we use the English-Spanish tweet data provided by Aguilar et al. (2018). There are nine entity labels. The labels use the IOB format: the first token of a named entity is labeled with a B-label, each following token of the same entity with an I-label, and every token outside a named entity with O.
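The IOB scheme can be illustrated with a short sketch that converts entity spans into token-level tags. The example sentence, the span format (start index, exclusive end index, label), and the label names are hypothetical and are not taken from the dataset.

```python
def spans_to_iob(num_tokens, spans):
    """Convert (start, end, label) spans, with end exclusive, into IOB tags."""
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        tags[start] = "B-" + label          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining tokens of the entity
    return tags

tokens = ["Maria", "Garcia", "vive", "en", "Miami"]
spans = [(0, 2, "PERSON"), (4, 5, "LOCATION")]
tags = spans_to_iob(len(tokens), spans)
print(list(zip(tokens, tags)))
# [('Maria', 'B-PERSON'), ('Garcia', 'I-PERSON'), ('vive', 'O'), ('en', 'O'), ('Miami', 'B-LOCATION')]
```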

Experimental Setup
We use pre-trained FastText English (EN) and Spanish (ES) word embeddings (Grave et al., 2018) as our primary language embeddings, and pre-trained FastText Catalan (CA) and Portuguese (PT) word embeddings as our auxiliary language embeddings. We opt for CA and PT because they come from the same Romance language family as Spanish. We also include the GloVe Twitter English embedding (GLOVE_EN) (Pennington et al., 2014). Experiments are conducted in two different settings. In the multilingual setting, we learn our meta-embedding from primary languages and auxiliary languages, while in the cross-lingual setting only auxiliary languages are used. We run all experiments five times and calculate the average and standard deviation. To improve our final predictions, we ensemble all five runs and take the results from a majority consensus.
Implementation Details Our model is trained using a Noam optimizer with a dropout of 0.1 for the multilingual setting and 0.3 for the cross-lingual setting. Our model contains four layers of transformer blocks with a hidden size of 200 and four heads. We start the training with a learning rate of 0.1. We replace user hashtags (#user) and mentions (@user) with <USR>, and URLs (https://domain.com) with <URL>, similarly to Winata et al. (2018b).
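The token normalization and the majority-consensus ensemble can be sketched as follows. This is only an illustration under assumptions: the URL pattern, the token-level voting, and the tie-breaking rule (the tag seen first among the tied ones wins) are not specified in the text, and the function names are hypothetical.

```python
import re
from collections import Counter

# Assumed URL pattern; the text only gives https://domain.com as an example.
URL_RE = re.compile(r"^(https?://|www\.)\S+$")

def normalize_token(token):
    """Replace URLs with <URL> and hashtags or mentions with <USR>."""
    if URL_RE.match(token):
        return "<URL>"
    if len(token) > 1 and token[0] in "#@":
        return "<USR>"
    return token

def majority_vote(runs):
    """Combine tag sequences from several runs by voting on each token.

    runs: list of tag sequences for the same sentence, one per run.
    Ties are broken in favor of the tag that appears first among the runs.
    """
    return [Counter(tags).most_common(1)[0][0] for tags in zip(*runs)]

tweet = ["@maria", "vamos", "a", "Miami", "https://domain.com"]
print([normalize_token(t) for t in tweet])
# ['<USR>', 'vamos', 'a', 'Miami', '<URL>']

runs = [
    ["B-PERSON", "O"],
    ["B-PERSON", "O"],
    ["O", "O"],
    ["B-PERSON", "I-PERSON"],
    ["O", "O"],
]
print(majority_vote(runs))  # ['B-PERSON', 'O']
```

Voting on each token independently can in principle produce a tag sequence that no single run predicted, so a sequence-level check may be needed in practice.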

Results
Multilingual experimental results are shown in Table 1. Interestingly, both concatenation and linear ensemble are strong baselines, since they achieve higher performance than existing works that use more complicated features, such as character-based features from a bidirectional long short-term memory (LSTM) (Winata et al., 2018b) or a convolutional neural network (CNN) with additional gazetteers (Trivedi et al., 2018). Overall, our transformer encoder using a single word embedding achieves better performance than the LSTM encoder (Trivedi et al., 2018). More importantly, MME outperforms the two baselines on different language combinations, which shows its effectiveness. The results also show that the two baselines cannot effectively exploit the information from auxiliary languages. Here we note that the main advantage of MME is that it dynamically weights the pre-trained embeddings of the different languages for each input token, while the concatenation and linear ensemble approaches always weight them equally.
In the cross-lingual setting, our model does not perform well when we only use one auxiliary language, as seen in Table 2. A significant improvement is shown after we combine both languages, and MME shows a similar performance to the previous state-of-the-art result (Trivedi et al., 2018). This implies that our approach can effectively generalize word representations to an unseen language task by transferring information from languages that come from the same root as the primary languages.
We inspect the weights assigned to the word embeddings to see which embedding our model attends to. Figure 2 visualizes the weights for the multilingual and cross-lingual cases. It appears that our model can align words to their languages with high confidence (e.g., Spanish words such as "ti", "te", and "ponen" attend to ES). In most cases, our model strongly attends to a single language and takes a small proportion of information from other languages. This shows the potential to automatically learn how to construct a multilingual embedding from semantically similar embeddings without requiring any language labels.

Related Work
Early studies on named entity recognition heavily relied on language-specific knowledge resources, such as hand-crafted features or gazetteers (Lafferty et al., 2001; Ratinov and Roth, 2009; Tsai et al., 2016). However, this approach was costly for new languages and domains. Thus, end-to-end approaches that do not rely on any external knowledge were proposed. Sobhana et al. (2010) proposed to use a CRF without any external resources to leverage the label dependencies. Then, neural-based approaches, such as an LSTM with a CRF (Lample et al., 2016; Lin et al., 2017; Greenberg et al., 2018) and an LSTM with a CNN (Chiu and Nichols, 2016), showed a significant improvement in performance. Liu et al. (2018) and Trivedi et al. (2018) proposed a character-level LSTM to capture the underlying style and structure, such as word boundaries and spellings. Finally, word-embedding ensemble techniques and preprocessing techniques, such as tokenization and normalization, have been introduced to reduce OOV issues (Winata et al., 2018b).

Conclusion
In this paper, we propose a novel approach to learn multilingual representations by leveraging monolingual pre-trained embeddings. MME removes the dependency on language identification in the code-switching Named Entity Recognition task, since it utilizes more information from semantically similar embeddings. The experimental results show that our method surpasses previous works and baselines, achieving state-of-the-art performance. Moreover, experiments in the cross-lingual setting demonstrate the ability of MME to generalize to an unseen language task.