Hierarchical Meta-Embeddings for Code-Switching Named Entity Recognition

In countries where multiple languages are spoken, mixing languages within a single conversation is commonly called code-switching. Previous work addressing this challenge has mainly focused on word-level aspects such as word embeddings. However, in many cases, languages share common subwords, especially closely related languages, but also languages that are seemingly unrelated. Therefore, we propose Hierarchical Meta-Embeddings (HME), which learn to combine multiple monolingual word-level and subword-level embeddings into language-agnostic lexical representations. On the task of Named Entity Recognition for English-Spanish code-switching data, our model achieves state-of-the-art performance in the multilingual setting. We also show that, in cross-lingual settings, our model not only leverages closely related languages but also learns from languages with different roots. Finally, we show that combining different subunits is crucial for capturing code-switching entities.


Introduction
Code-switching is a phenomenon that often occurs among multilingual speakers, in which they switch between their languages within a conversation; hence, it is practically useful for spoken language systems to handle it well (Winata et al., 2018a). This occurs more often for entities such as organizations or products, which motivates us to focus on the specific problem of Named Entity Recognition (NER) in code-switching scenarios. We show one example as follows:
• walking dead le quita el apetito a cualquiera
• (translation) walking dead (a TV series title) takes away the appetite of anyone
For this task, previous works have mostly focused on applying pre-trained word embeddings from each language to represent noisy mixed-language texts, combining them with character-level representations (Trivedi et al., 2018; Winata et al., 2018b). However, despite the effectiveness of such word-level approaches, they neglect the importance of subword-level characteristics shared across different languages. Such information is often hard to capture with word embeddings or randomly initialized character-level embeddings. Naturally, we can turn toward subword-level embeddings such as FastText (Grave et al., 2018), which allow us to leverage the morphological structure shared across different languages.
Despite this expected usefulness, little attention has been paid to subword-level features in this task. This is partly because of the non-trivial difficulty of combining different language embeddings in the subword space, which arises from each language having a distinct segmentation into subwords. This leads us to the literature on Meta-Embeddings (Yin and Schütze, 2016; Muromägi et al., 2017; Coates and Bollegala, 2018), a family of methods for learning how to combine different embeddings. In this paper, we propose Hierarchical Meta-Embeddings (HME), which learn how to combine different pre-trained monolingual embeddings at the word, subword, and character levels into a single language-agnostic lexical representation without using specific language identifiers. To address the issue of different segmentations, we add a Transformer encoder (Vaswani et al., 2017) which learns the important subwords in a given sentence. We evaluate our model on the task of Named Entity Recognition for English-Spanish code-switching data using Transformer-CRF, a transformer-based encoder for sequence labeling based on the implementation of Winata et al. (2019). Our experimental results confirm that HME significantly outperforms the state-of-the-art system in absolute F1 score. The analysis shows that, on English-Spanish mixed texts, not only do similar languages such as Portuguese or Catalan help, but seemingly distant languages of Celtic origin also significantly increase performance.

Meta-embeddings
Several studies have combined multiple word embeddings in a pre-processing step (Yin and Schütze, 2016; Muromägi et al., 2017; Coates and Bollegala, 2018). Later work introduced a method to dynamically learn word-level meta-embeddings, which can be used effectively in a supervised setting. Winata et al. (2019) proposed leveraging multiple embeddings from different languages to generate language-agnostic meta-representations for mixed-language data.

Hierarchical Meta-Embeddings
We propose a method to combine word, subword, and character representations into a mixture of embeddings. We generate multilingual meta-embeddings of words and subwords, and then concatenate them with character-level embeddings to generate the final word representations, as shown in Figure 1. Let w = [w_1, ..., w_n] be a sequence of n words. Each word can be tokenized into a list of subwords s = [s_1, ..., s_m] and a list of characters c = [c_1, ..., c_p]. The list of subwords s is generated using a segmentation function f that maps a word into a sequence of subwords: s = f(w). Further, let E^(w), E^(s), and E^(c) be sets of word, subword, and character embedding lookup tables, where each set consists of different monolingual embeddings. Each element is transformed into an embedding vector in R^d. We use subscripts {i, j} as element and embedding-language indices, and superscripts (w, s, c) to denote the word, subword, and character levels.
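To make the notation concrete, here is a toy illustration of one word, its subword segmentation f(w), and its character list. The BPE-style split of "walking" into ["walk", "ing"] is a hypothetical example, not the output of a trained BPE vocabulary:

```python
# Toy illustration of the notation above. The subword split is a
# hypothetical BPE-style segmentation, chosen for readability.
w = "walking"
s = ["walk", "ing"]   # s = f(w), the subword sequence (m = 2)
c = list(w)           # the character sequence (p = 7)

assert "".join(s) == w
assert len(c) == 7
```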

Multilingual Meta-Embeddings (MME)
We generate a meta-representation by taking the vector representations from multiple monolingual pre-trained embeddings over different subunits, such as words and subwords. We apply a projection matrix W_j to transform each vector from its original space x_{i,j} ∈ R^{d_j} into a shared space x'_{i,j} ∈ R^d. Then, we calculate scalar attention weights α_{i,j} with a non-linear scoring function φ (e.g., tanh) to extract important information from each individual embedding x'_{i,j}. The MME u_i is calculated by taking the weighted sum of the projected embeddings:

x'_{i,j} = W_j x_{i,j} + b_j,
α_{i,j} = exp(φ(x'_{i,j})) / Σ_k exp(φ(x'_{i,k})),
u_i = Σ_j α_{i,j} x'_{i,j}.
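A minimal numerical sketch of MME for a single word, assuming two monolingual embedding spaces with different dimensions. The random values, the toy dimensions, and the reduction of φ to a tanh-then-sum score are all simplifying assumptions; in the model, the projections and scoring function are learned end-to-end:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                  # shared embedding dimension
dims = [3, 5]          # original dims of two monolingual embeddings (toy)

# Pre-trained vectors for one word in each language space (toy values).
x = [rng.standard_normal(dj) for dj in dims]

# Projection matrices W_j: R^{d_j} -> R^d (learned in the real model).
W = [rng.standard_normal((d, dj)) for dj in dims]
x_proj = [Wj @ xj for Wj, xj in zip(W, x)]        # x'_{i,j} in R^d

# Scoring phi (here: tanh then sum, an assumption), softmax over languages.
scores = np.array([np.tanh(xp).sum() for xp in x_proj])
alpha = np.exp(scores) / np.exp(scores).sum()      # attention weights

# MME: weighted sum of the projected embeddings.
u = sum(a * xp for a, xp in zip(alpha, x_proj))
assert u.shape == (d,)
```

Note that, because each language is first projected into the shared space R^d, the weighted sum is well-defined even though the source embeddings have different dimensions.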

Mapping Subwords and Characters to Word-Level Representations
We propose to map subwords into word-level representations, choosing byte-pair encodings (BPE) (Sennrich et al., 2016) for their compact vocabulary. First, we apply f to segment each word into a set of subwords, and then we extract the pre-trained subword embedding vectors x^(s)_{i,j} ∈ R^d for language j. Since each language has a different f, we replace the projection matrix with a Transformer (Vaswani et al., 2017) to learn and combine important subwords into a single vector representation. Then, we create u^(s)_i ∈ R^d, the subword-level MME, by taking the weighted sum of the x^(s)_{i,j}. To combine character-level representations, we apply an encoder to each character.
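The pooling step can be sketched as follows, with a simple attention pool standing in for the Transformer encoder (an intentional simplification). The point it illustrates: each language segments the same word into a different number of subwords, yet each sequence is pooled into a single word-level vector of the same dimension, so the language-level combination stays well-defined:

```python
import numpy as np

d = 4
rng = np.random.default_rng(1)

def attention_pool(subword_vecs):
    """Weighted sum of subword vectors; tanh-sum scores, softmax weights.
    A stand-in for the Transformer pooling described in the text."""
    scores = np.tanh(subword_vecs).sum(axis=1)
    alpha = np.exp(scores) / np.exp(scores).sum()
    return alpha @ subword_vecs

# "walking" segmented differently by two hypothetical BPE vocabularies:
segmentations = [
    ["walk", "ing"],        # language A: 2 subwords -> (2, d) matrix
    ["wal", "k", "ing"],    # language B: 3 subwords -> (3, d) matrix
]
pooled = []
for seg in segmentations:
    vecs = rng.standard_normal((len(seg), d))  # toy subword embeddings
    pooled.append(attention_pool(vecs))

# Both languages yield one word-level vector in R^d despite the
# different segmentation lengths.
assert all(u.shape == (d,) for u in pooled)
```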
We combine the word-level, subword-level, and character-level representations by concatenation: u_i = [u^(w)_i; u^(s)_i; u^(c)_i], where u^(w)_i, u^(s)_i ∈ R^d are the word-level MME and BPE-level MME, and u^(c)_i is a character embedding. We randomly initialize the character embeddings and keep them trainable, and we fix all pre-trained subword and word embeddings during training.
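The final concatenation is straightforward; the dimensions below are toy values (in the model, d and the character-encoding size are hyperparameters):

```python
import numpy as np

# Final word representation: concatenate the three levels.
u_word = np.ones(4)   # u^(w): word-level MME, in R^d (toy d = 4)
u_bpe  = np.ones(4)   # u^(s): BPE-level MME, in R^d
u_char = np.ones(2)   # u^(c): trainable character encoding (toy size 2)

u = np.concatenate([u_word, u_bpe, u_char])
assert u.shape == (10,)   # dimensions add up under concatenation
```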

Sequence Labeling
To predict the entities, we use Transformer-CRF, a transformer-based encoder followed by a Conditional Random Field (CRF) layer (Lafferty et al., 2001). The CRF layer is useful for constraining the dependencies between adjacent labels.
The best output sequence is decoded with the Viterbi algorithm.
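Viterbi decoding over a CRF can be sketched as follows. The emission scores and the label-transition matrix are toy values here (in the model they come from the encoder and the learned CRF parameters), and the two-label tagset is an assumption for brevity:

```python
import numpy as np

def viterbi(emissions, transitions):
    """Return the highest-scoring label sequence.
    emissions: (n, k) per-token label scores; transitions: (k, k)."""
    n, k = emissions.shape
    score = emissions[0].copy()          # best score ending in each label
    back = np.zeros((n, k), dtype=int)   # backpointers
    for t in range(1, n):
        # total[i, j]: best path ending in label i, then moving to j.
        total = score[:, None] + transitions + emissions[t][None, :]
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    # Follow backpointers from the best final label.
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

emissions = np.array([[2.0, 0.0],
                      [0.0, 1.0],
                      [1.0, 0.5]])
transitions = np.array([[0.0, -2.0],
                        [-2.0, 0.0]])   # penalize label switches
best = viterbi(emissions, transitions)
# The switch penalty outweighs the emission gain at t = 1, so the
# decoder keeps label 0 throughout.
assert best == [0, 0, 0]
```

This illustrates the constraint the CRF adds: the transition scores can overrule a locally attractive label, which is exactly what prevents invalid IOB sequences.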

Experimental Setup
We train our model to solve Named Entity Recognition on the English-Spanish code-switching tweet data from Aguilar et al. (2018). There are nine entity labels in IOB format. The training, development, and test sets contain 50,757, 832, and 15,634 tweets, respectively. We use FastText word embeddings trained on Common Crawl and Wikipedia (Grave et al., 2018) for English (en) and Spanish (es), as well as four Romance languages: Catalan (ca), Portuguese (pt), French (fr), and Italian (it); a Germanic language: German (de); and, as the distant language group, five Celtic languages: Breton (br), Welsh (cy), Irish (ga), Scottish Gaelic (gd), and Manx (gv). We also add English Twitter GloVe word embeddings (Pennington et al., 2014) and BPE-based subword embeddings from Heinzerling and Strube (2018). We train our model in two settings: (1) a multilingual setting, in which we combine the main languages (en-es) with Romance languages and a Germanic language, and (2) a cross-lingual setting, in which we use Romance and Germanic languages without the main languages. Our model contains four transformer encoder layers with a hidden size of 200, four heads, and a dropout of 0.1. We use the Adam optimizer, starting with a learning rate of 0.1 and early stopping after 15 iterations. We replace user hashtags and mentions with <USR>, emoji with <EMOJI>, and URLs with <URL>. We evaluate our model using the absolute F1 score.
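The token-replacement preprocessing can be sketched as below. The placeholder tokens follow the text; the particular regular expressions (and the Unicode ranges used to match emoji) are assumptions, not the paper's exact rules:

```python
import re

def preprocess(tweet: str) -> str:
    """Replace URLs, mentions/hashtags, and emoji with placeholders.
    Regexes are illustrative assumptions."""
    tweet = re.sub(r"https?://\S+", "<URL>", tweet)
    tweet = re.sub(r"[@#]\w+", "<USR>", tweet)            # mentions and hashtags
    tweet = re.sub(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]", "<EMOJI>", tweet)
    return tweet

assert preprocess("@user mira esto http://t.co/x 😂") == \
    "<USR> mira esto <URL> <EMOJI>"
```

The URL rule runs first so that `#` or `@` characters inside a URL are not picked up by the mention/hashtag rule.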

Baselines
CONCAT We concatenate word embeddings by merging the dimensions of the word representations. This method combines embeddings into a high-dimensional input that may cause inefficient computation.
LINEAR We sum all word embeddings into a single word vector with equal weight. This method combines embeddings without considering the importance of each of them.
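The two flat baselines reduce to the following one-liners (toy vectors; LINEAR additionally assumes the embeddings already share a dimension):

```python
import numpy as np

e_en = np.array([1.0, 2.0, 3.0])   # toy English embedding
e_es = np.array([3.0, 0.0, 1.0])   # toy Spanish embedding

# CONCAT: dimensions add up, producing a high-dimensional input.
concat = np.concatenate([e_en, e_es])
assert concat.shape == (6,)

# LINEAR: equal-weight sum, ignoring each embedding's importance.
linear = e_en + e_es
assert linear.shape == (3,)
```

Neither baseline has a mechanism to weight one language over the other per word, which is precisely what the attention in MME adds.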
Random Embeddings We use randomly initialized word embeddings and keep them trainable to estimate the lower-bound performance.
Aligned Embeddings We align English and Spanish FastText embeddings using CSLS in two scenarios: English (en) as the source language and Spanish (es) as the target (en → es), and vice versa (es → en). We run MUSE using the code released by Conneau et al. (2017).

Results & Discussion
In general, from Table 1, we can see that word-level meta-embeddings, even without subword or character-level information, consistently perform better than the flat baselines (e.g., CONCAT and LINEAR) in all settings. This is mainly because of the attention layer, which does not require additional parameters. Furthermore, our proposed approaches all significantly outperform the previous state-of-the-art models. From Table 1, in the multilingual setting, which trains with the main languages, it is evident that adding both closely related and distant language embeddings improves performance. This shows that our model is able to leverage the lexical similarity between languages. This is shown even more distinctly in the cross-lingual setting, where using distant languages performs significantly worse than using closely related ones (e.g., ca-pt). Interestingly, for distant languages, adding subwords still yields a drastic performance increase. We hypothesize that even though the characters are mostly different, the lexical structure is similar to that of our main languages.
On the other hand, adding subword inputs to the model is consistently better than adding characters, owing to the transfer of information from the pre-trained subword embeddings. As shown in Table 1, subword embeddings are more effective for distant languages (the Celtic languages) than for closely related languages such as Catalan or Portuguese.
Figure 2: Heatmap of attention over languages from a validation sample. Left: word-level MME. Right: BPE-level MME. We extract the attention weights from a multilingual model (en-es-ca-pt-de-fr-it).
Moreover, we visualize the model's attention weights at the word and subword levels to interpret the model dynamics. From the left image of Figure 2, at the word level, the model mostly chooses the correct language embedding for each word, while also mixing in other languages. Without any language identifiers, it is impressive to see that our model learns to attend to the right languages. The right side of Figure 2, which shows the attention weight distributions at the subword level, demonstrates interesting behavior: for most English subwords, the model leverages the ca, fr, and de embeddings. We hypothesize that this is because the dataset is mainly constructed from Spanish words, which can also be verified from Figure 3, in which most NER tags are classified as es.

Conclusion
We propose Hierarchical Meta-Embeddings (HME), which learn how to combine multiple monolingual word-level and subword-level embeddings into language-agnostic representations without specific language information. We achieve state-of-the-art results on the task of Named Entity Recognition for English-Spanish code-switching data. We also show that our model can very effectively leverage subword information from languages with different roots to generate better word representations.