Code-Switched Named Entity Recognition with Embedding Attention

We describe our work for the CALCS 2018 shared task on named entity recognition for code-switched data. Our system ranked first place for Modern Standard Arabic-Egyptian (MSA-EGY) named entity recognition and third place for English-Spanish (ENG-SPA).


Introduction
The tendency of multilingual speakers to engage in code-switching, i.e. alternating between multiple languages or language varieties, poses important problems for NLP systems: traditional monolingual techniques quickly break down on mixed-language input. Even for problems such as POS-tagging and language identification, which the community often considers "solved", performance deteriorates in proportion to the degree of code-switching in the data. The shared task of the third workshop on Computational Approaches to Linguistic Code-Switching concerned named entity recognition (NER) for two code-switched language pairs (Aguilar et al., 2018): Modern Standard Arabic and Egyptian Arabic (MSA-EGY), and English and Spanish (ENG-SPA). Here, we describe our work on the shared task.
Traditional NER systems relied heavily on hand-crafted features and gazetteers, but have since been superseded by neural architectures that combine bidirectional LSTMs and CRFs (Lample et al., 2016). Equipped with supervised character-level representations and pre-trained unsupervised word embeddings, such neural architectures have not only come to dominate named entity recognition, but have also been applied successfully to code-switched language identification (Samih et al., 2016), which makes them highly suitable for the current task as well.
In this paper, we exploit recent advances in neural NLP systems, tailored to code-switching. We use high-quality FastText embeddings trained on Common Crawl and employ shortcut-stacked sentence encoders (Nie and Bansal, 2017) to obtain deep token-level representations to feed into a CRF. In addition, we make use of an embedding-level attention mechanism that learns task-specific attention weights for the multilingual and character-level representations, inspired by context-attentive embeddings (Kiela et al., 2018). In what follows, we describe our system in detail.

Approach
The input data consists of noisy user-generated social media text collected from Twitter. Code-switching can occur between different tweets in the training data, with many tweets being monolingual, but it can also occur within tweets (e.g. "[USER]: en los finales be like [URL]", "in the finals be like") or even morphologically within words (e.g. "pero esta twitteando y pitchandome los textos", "but he is tweeting and pitching me the texts"). The goal is to predict the correct IOB entity tag for nine categories: person, location, organization, group, title, product, event, time, and other.

The first work to combine CRFs with modern neural representation learning for NER is, to our knowledge, by Collobert et al. (2011). Our architecture is similar to more recent neural architectures for NER, e.g. Huang et al. (2015); Lample et al. (2016); Ma and Hovy (2016). Instead of using a straightforward bidirectional LSTM (BiLSTM), however, we use several layers and add shortcut connections; and instead of simply feeding in word (and/or character) embeddings, we add a self-attention mechanism over the embeddings.

Embedding Attention
We represent the input tweets on the word level and the character level. For all words in the data, we obtained FastText embeddings trained on Common Crawl and Wikipedia for each language. For every word, we first try to find an exact match in the FastText embeddings; if that fails, we check whether its lower-cased form is present. When a word embedding is available in one language but not in the other, the missing one is initialized as a zero vector. Totally unseen words are initialized uniformly at random in the range [−0.1, 0.1]. Thus, for every language pair, we obtain two sets of word embeddings, w^{L1} and w^{L2}.
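The lookup cascade above (exact match, lower-cased fallback, zero vector for the language that misses the word, uniform random initialization for fully unseen words) can be sketched as follows. This is an illustrative sketch, not the authors' code; the function name, dictionaries, and dimensionality are our own choices, while the 0.1 range and zero-vector rule come from the text.

```python
import numpy as np

def lookup(word, emb_l1, emb_l2, dim=300, seed=0):
    """Return the pair (w_L1, w_L2) for one token, following the cascade above.

    emb_l1 / emb_l2: dicts mapping surface forms to vectors, one per language.
    """
    rng = np.random.default_rng(seed)
    vecs = []
    for emb in (emb_l1, emb_l2):
        if word in emb:                       # 1) exact surface-form match
            vecs.append(np.asarray(emb[word], dtype=float))
        elif word.lower() in emb:             # 2) lower-cased fallback
            vecs.append(np.asarray(emb[word.lower()], dtype=float))
        else:
            vecs.append(None)                 # missing in this language
    if all(v is None for v in vecs):
        # 3) totally unseen word: uniform at random in [-0.1, 0.1]
        vecs = [rng.uniform(-0.1, 0.1, dim) for _ in vecs]
    else:
        # 4) present in one language only: zero vector for the other
        vecs = [np.zeros(dim) if v is None else v for v in vecs]
    return vecs[0], vecs[1]
```

In a full system these vectors would be rows of a fixed embedding matrix rather than per-token lookups, but the fallback logic is the same.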
On the character level, we encode every word using a BiLSTM, to which we apply max-pooling to obtain a token-level representation. That is, for a sequence of T characters {c_t}_{t=1,...,T}, a standard BiLSTM computes two sets of T hidden states, one for each direction. The hidden states are concatenated at each timestep, after which a max-pooling operation is applied over their components:

h_t = [→h_t; ←h_t],    c = max_{t=1,...,T} h_t

where the maximum is taken component-wise.

We take inspiration from context-attentive embeddings (Kiela et al., 2018) in that we learn weights over the embeddings, but we do not include the contextual dependency, for reasons of efficiency given the shared task's tight deadline. That is, we combine the language-specific word embeddings w^{L1} and w^{L2} and the character-level representation c via a simple self-attention mechanism:

α = softmax(f(w^{L1}), f(w^{L2}), f(c)),    w = α_1 w^{L1} + α_2 w^{L2} + α_3 c

where f is a learned scoring function that assigns a scalar to each embedding, and the attention weights α sum to one.
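The two operations above, component-wise max-pooling over BiLSTM states and the self-attention combination of the three embedding sources, can be sketched in a few lines of numpy. This is a simplified illustration: the scorer here is a plain learned dot product (`score_w`, `score_b`), one plausible instantiation of the scoring function, and all names are ours.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def char_max_pool(hidden_states):
    """Component-wise max over the T concatenated BiLSTM states, shape (T, d)."""
    return np.asarray(hidden_states).max(axis=0)

def embedding_attention(w_l1, w_l2, c, score_w, score_b=0.0):
    """Weight the three same-dimensional token representations with learned
    scalar attention weights and return their weighted sum (and the weights)."""
    embs = np.stack([w_l1, w_l2, c])            # (3, d)
    alpha = softmax(embs @ score_w + score_b)   # one weight per embedding source
    return alpha @ embs, alpha
```

With an untrained (zero) scorer the weights are uniform and the output is simply the mean of the three representations; training moves the weights toward whichever source is most useful for the token at hand.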

Capitalization
Additionally, we concatenate an embedding indicating the capitalization of the word, which can be one of no-capitals, starting-with-capital, or all-capitals. This information is already captured by the character-level encoder, but is made more explicit by this feature.
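A minimal sketch of the three-way capitalization feature, assuming simple surface rules (the exact rules are not specified in the text; the function name and class labels follow the prose above):

```python
def caps_feature(token):
    """Map a token to one of the three capitalization classes."""
    letters = [ch for ch in token if ch.isalpha()]
    if letters and all(ch.isupper() for ch in letters):
        return "all-capitals"
    if token[:1].isupper():
        return "starting-with-capital"
    return "no-capitals"
```

The resulting class would index a small trainable embedding table, whose row is concatenated onto the attended word representation.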

Shortcut-Stacked Sentence Encoders
The final word representations w are fed into a stacked BiLSTM with residual connections (i.e., "shortcuts"). This type of architecture has been found to work well for text classification in conjunction with a final max-pooling operation (Nie and Bansal, 2017). Denoting the input and hidden state of the i-th stacked BiLSTM layer at timestep t as x^i_t and h^i_t respectively, we have:

x^1_t = w_t,    x^i_t = [w_t; h^{i-1}_t; ...; h^1_t] for i > 1,    h^i_t = BiLSTM_i(x^i)_t

i.e., every layer receives the original word representation concatenated with the outputs of all previous layers.
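The shortcut wiring can be illustrated independently of any deep-learning framework by treating each BiLSTM layer as an opaque callable mapping a (T, d_in) matrix of timestep inputs to a (T, d_hidden) matrix of states. This is a sketch of the wiring only; real layers would of course be trained recurrent networks.

```python
import numpy as np

def shortcut_stack(W, layers):
    """W: (T, d) input word representations; layers: list of callables, each
    mapping (T, d_in) -> (T, d_hidden), standing in for BiLSTM layers.
    Layer i sees the original input concatenated with all previous outputs."""
    outputs = []
    for layer in layers:
        # x_t^i = [w_t; h_t^{i-1}; ...; h_t^1] (just w_t for the first layer)
        x = np.concatenate([W] + outputs, axis=1)
        outputs.append(layer(x))
    return outputs[-1]    # states of the last layer go on to the CRF
```

The shortcut connections mean that later layers never lose direct access to the word representations, which eases optimization of deeper stacks.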

CRFs for NER
The hidden states of the last stacked BiLSTM layer are fed into a CRF (Lafferty et al., 2001). CRFs are used to estimate probabilities for entire sequences of tags s = (s_1, ..., s_T) corresponding to sequences of tokens x:

p(s | x) = exp(Σ_j ψ_j(s_{j−1}, s_j, x)) / Σ_{s′} exp(Σ_j ψ_j(s′_{j−1}, s′_j, x))
To keep the CRF tractable, the potentials must look only at local features. We experiment with two different score functions ψ_j. One uses bigrams:

ψ_j(s_{j−1}, s_j, x) = x_j^T W_{s_{j−1}, s_j} + B_{s_{j−1}, s_j}

where W ∈ R^{|S|×|S|×H} is a weight tensor, |S| is the number of possible tags, H is the dimensionality of the encoder's features x, and B ∈ R^{|S|×|S|} is a bias matrix. The smaller score function uses unigrams:

ψ_j(s_{j−1}, s_j, x) = x_j^T W_{s_j} + B_{s_{j−1}, s_j}

where instead W ∈ R^{|S|×H}. The two terms in the score function can be thought of as the emission and transition potentials, respectively.
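For intuition, the bigram-potential CRF distribution can be computed exactly on toy inputs by brute-force enumeration of all tag sequences. This sketch (our names, with an extra designated start state added for the first transition) is only viable for tiny |S| and T; a real implementation computes the normalizer with the forward algorithm.

```python
import itertools
import numpy as np

def seq_score(tags, X, W, B, start):
    """Sum of bigram potentials psi_j = x_j^T W[s_{j-1}, s_j] + B[s_{j-1}, s_j],
    with a designated start state prepended to the tag sequence."""
    s = (start,) + tuple(tags)
    return sum(X[j - 1] @ W[s[j - 1], s[j]] + B[s[j - 1], s[j]]
               for j in range(1, len(s)))

def crf_prob(tags, X, W, B, n_tags, start):
    """p(s|x): exponentiated score, normalized over every possible tag sequence."""
    Z = sum(np.exp(seq_score(s, X, W, B, start))
            for s in itertools.product(range(n_tags), repeat=len(X)))
    return np.exp(seq_score(tags, X, W, B, start)) / Z
```

With all-zero parameters every sequence scores the same, so the distribution is uniform over the |S|^T candidate taggings, a handy sanity check.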

Preprocessing
The noisy nature of the data makes appropriate preprocessing necessary. We apply the following steps to the Twitter data:

• Replace URLs with [url]
• Replace user mentions (tokens starting with @) with [user]
• Replace hashtags (starting with # but not followed by a number) with [hashtag]
• Replace punctuation tokens with [punct]
• Replace integer and real numbers with [num]
• Replace emojis with [emoji]

In addition, we found that the Arabic tokenizer may have been imperfect: some words still had punctuation attached to them. To mitigate this, we removed any leading and trailing punctuation from tokens for MSA-EGY.
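The token-replacement steps above can be sketched with simple pattern checks. The exact patterns (URL prefixes, the emoji character ranges) are our assumptions, not the authors' code; a real system would use a proper emoji library. Note the ordering: emoji must be tested before generic punctuation, since emoji characters also match a "non-word" pattern.

```python
import re

NUM = re.compile(r"[+-]?\d+(?:[.,]\d+)?")       # integers and reals
PUNCT = re.compile(r"[^\w\s]+")                 # tokens of pure punctuation
# very rough emoji ranges, for illustration only
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]+")

def preprocess(tokens):
    out = []
    for tok in tokens:
        if tok.startswith(("http://", "https://", "www.")):
            out.append("[url]")
        elif tok.startswith("@"):
            out.append("[user]")
        elif tok.startswith("#") and not tok[1:2].isdigit():
            out.append("[hashtag]")
        elif NUM.fullmatch(tok):
            out.append("[num]")
        elif EMOJI.fullmatch(tok):              # before the punctuation check
            out.append("[emoji]")
        elif PUNCT.fullmatch(tok):
            out.append("[punct]")
        else:
            out.append(tok)
    return out
```

A token such as "#1" falls through every branch (it starts with # followed by a number, and is neither a pure number nor pure punctuation) and is kept as-is.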

Training
The LSTMs are initialized orthogonally (Saxe et al., 2013), and the attention mechanism is initialized with Xavier initialization (Glorot and Bengio, 2010). Word embeddings are kept fixed during training, but character embeddings and capitalization embeddings are updated. We set dropout to 0.5 and optimize using Adam (Kingma and Ba, 2014) with a learning rate of 4e-4 and batch size of
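Orthogonal initialization can be illustrated with the standard QR-based recipe (this is the common construction behind Saxe-style initialization, not necessarily the exact routine the authors used):

```python
import numpy as np

def orthogonal_init(n, seed=0):
    """Sample an n x n orthogonal matrix: QR-decompose a random Gaussian
    matrix and keep the orthogonal factor Q."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    # flip column signs by sign(diag(R)) so the sample is unbiased
    return q * np.sign(np.diag(r))
```

Because W W^T = I, repeated multiplication by W preserves vector norms, which is why this initialization helps gradients flow through recurrent weight matrices.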

Results & Discussion
For both tasks, we compare the proposed model to a simpler baseline in which we simply concatenate the FastText embeddings as input to the network. Table 1 shows the results for ENG-SPA. We observe that our system outperforms the baseline on the test set. The dev set for this task was very small (832 examples, versus 15.6k for the test set), which explains the discrepancy between dev and test performance; this discrepancy also made it difficult to tune hyperparameters properly. We also tried a very simple ensembling strategy, in which we took our top three models and randomly sampled one of their predictions, which only marginally improved the test score, to 62.67. We did not pursue proper ensembling due to time constraints. The best performing model had hidden dimensions [128, 128, 128] and used the bigram CRF.
The results for the MSA-EGY task are reported in Table 2. While English and Spanish are two distinct languages, Modern Standard Arabic and Egyptian Arabic are much more closely related, which leads to its own interesting challenges. We observe a similar improvement over the baseline on this task. As noted in the previous section, this task did require slightly different preprocessing. We did not try any ensembling strategies on this task. The best performing model

Tables 3 and 4 show a breakdown of the performance per category on the respective test sets. It is interesting to observe that the Title category is consistently hard in both tasks. The Other category was handled perfectly for MSA-EGY, while it was handled very poorly for ENG-SPA; this could, however, also be an artifact, since that category was quite small.
We felt that we could have benefited from a strong gazetteer, but also believe that relying on such features would somewhat defeat the purpose of our general neural architecture, which should not have to depend on hand-crafted resources.

Conclusion
Dealing with code-switching is a prominent problem in handling noisy user-generated social media data: the tendency of speakers to code-switch poses difficulties for standard NLP pipelines. Here, we described our work on the shared task: we introduced a system that performs self-attention over pre-trained and character-encoded word embeddings, combined with a shortcut-stacked sentence encoder and a CRF. The system performed well on the task, ranking first for MSA-EGY and third for ENG-SPA. In future work, we would like to analyze the system to see whether it has indeed learned to "code-switch" via embedding attention.