GHHT at CALCS 2018: Named Entity Recognition for Dialectal Arabic Using Neural Networks

This paper describes our system submission to the CALCS 2018 shared task on named entity recognition on code-switched data for the language variant pair of Modern Standard Arabic and Egyptian dialectal Arabic. We build a deep neural network that combines word and character-based representations in convolutional and recurrent networks with a CRF layer. The model is augmented with stacked layers of enriched information such as pre-trained embeddings, Brown clusters and named entity gazetteers. Our system is ranked second among those participating in the shared task, achieving an FB1 average of 70.09%.


Introduction
The CALCS 2018 shared task (Aguilar et al., 2018) is about performing named entity recognition (NER) on Modern Standard Arabic (MSA)-Egyptian Arabic (EGY) code-switched tweets. Unlike previous shared tasks on code-switching, the data provided contains no code-switching annotation. Only nine categories of named entities are annotated using BIO tagging. While this makes the task a "pure" NER task, the difficulty is to design a model that can cope with the noise introduced by code-switching, which challenges older systems tailored to MSA.
More recent work relies on neural networks. A number of architecture variants have proven to be effective (Huang et al., 2015; Lample et al., 2016; Chiu and Nichols, 2016; Ma and Hovy, 2016; Reimers and Gurevych, 2017). What they have in common is that they use a bidirectional LSTM (bi-LSTM) over vector representations of the input words in order to model their left and right contexts. On top of the bi-LSTM, they use a CRF layer to make the final tagging decisions. Unlike a softmax layer, which would treat tagging decisions independently, the CRF is able to model the linear dependencies between labels. This is essential for NER, where, for instance, B-LOCATION cannot be followed by I-PERSON. The architectures differ in how they obtain a vector representation for the input words. For instance, in Lample et al. (2016), each word embedding is obtained as a concatenation of the output of a bi-LSTM over its characters and a pre-trained word vector. Ma and Hovy (2016) use convolutions over character embeddings with max pooling to obtain morphological features from the character level, similar to Chiu and Nichols (2016).
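The label dependency mentioned above can be made concrete with a small check. The following is a minimal sketch, assuming the standard BIO convention in which an I-X tag may only continue an entity of the same type X; the function names are hypothetical, and a CRF learns such preferences as soft transition scores rather than applying a hard rule like this one.

```python
def is_valid_bio_transition(prev_tag, next_tag):
    """Return True if next_tag may follow prev_tag under the BIO scheme."""
    if not next_tag.startswith("I-"):
        # "O" and any "B-X" tag may follow any tag.
        return True
    if prev_tag == "O":
        # An entity cannot start with an "I-X" tag.
        return False
    # "I-X" must continue an entity of the same type X.
    return prev_tag[2:] == next_tag[2:]


def is_valid_bio_sequence(tags):
    """Check a whole tag sequence, treating the sentence start like "O"."""
    prev = "O"
    for tag in tags:
        if not is_valid_bio_transition(prev, tag):
            return False
        prev = tag
    return True
```

For example, `is_valid_bio_transition("B-LOCATION", "I-PERSON")` is rejected, while `is_valid_bio_transition("B-PERSON", "I-PERSON")` is accepted.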
Our system also relies on the bi-LSTM-CRF architecture. As input representation, we use both word embeddings and a character-level representation based on CNNs. Our system additionally employs a Brown Cluster representation, oversampling, and NE gazetteers.
The remainder of the paper is structured as follows. In the following section, we provide a short description of the task and the data set. Sect. 3 describes our system in detail. Sect. 4 presents our experiments, and Sect. 5 concludes the paper.

Task and Data Description
The shared task posed the problem of performing named entity recognition on code-switched data given nine categories, namely PERSON, LOCATION, ORGANIZATION, GROUP, TITLE, PRODUCT, EVENT, TIME, and OTHER.
The training set contains 10,100 tweets and 204,286 tokens, with an average tweet length of 20.2 tokens and 91.5 characters. 11.3% of all tokens are labeled as named entities. The most frequent category is PERSON with 4.3% of all tokens, followed by LOCATION (2.2%), GROUP and ORGANIZATION (1.3% each), as well as TITLE (1%). All other categories cover less than 1% of all tokens each, the least frequent category being OTHER (0.06%).
The validation set contains 1,122 tweets and 22,742 tokens, and exhibits a similar average tweet length as well as a similar distribution of labels.

System Description
We used a DNN model designed for sequence tagging. It is a variant of the bi-LSTM-CRF architecture proposed by Ma and Hovy (2016), Lample et al. (2016), and Huang et al. (2015). It combines a double representation of the input words by using word embeddings and a character-based representation (with CNNs). The input sequence is processed with bi-LSTMs, and the output layer is a linear-chain CRF. The model uses the following components.
Word-level embeddings allow the learning algorithms to use large unlabeled data to generalize beyond the seen training data. We explore both randomly initialized embeddings based on the seen training data and pre-trained embeddings.
We train our word embeddings using word2vec (Mikolov et al., 2013) on a corpus we crawled from the web with a total size of 383,261,475 words. It consists of dialectal texts from Facebook posts (8,241,244 words), tweets (2,813,016 words), and user comments on the news (95,241,480 words), as well as MSA texts of news articles from Al-Jazeera and Al-Ahram (276,965,735 words).
Character-level CNNs have proven effective for various NLP tasks due to their ability to extract sub-word information (e.g., prefixes or suffixes) and to encode character-level representations of words (Collobert et al., 2011; Chiu and Nichols, 2016; dos Santos and Guimarães, 2015).
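To illustrate how a character-level CNN turns a word into a fixed-size vector, the following is a minimal sketch of a one-dimensional convolution over character embeddings followed by max-over-time pooling. It is written in plain Python for readability and is not the system's implementation; the function name `char_cnn`, the zero-padding scheme, and the handling of unknown characters are assumptions made for the example.

```python
def char_cnn(word, char_emb, filters, width=3):
    """Max-over-time pooled 1D convolution over the characters of `word`.

    char_emb: dict mapping a character to its embedding (list of floats).
    filters:  list of filters, each a `width` x `dim` list of weight rows.
    Returns one feature per filter.
    """
    dim = len(next(iter(char_emb.values())))
    zero = [0.0] * dim
    # Assumption: pad both sides so that short words still yield windows,
    # and map characters without an embedding to a zero vector.
    pad = [zero] * (width - 1)
    seq = pad + [char_emb.get(c, zero) for c in word] + pad
    features = []
    for filt in filters:
        best = float("-inf")
        for start in range(len(seq) - width + 1):
            window = seq[start:start + width]
            score = sum(
                w * x
                for filt_row, char_vec in zip(filt, window)
                for w, x in zip(filt_row, char_vec)
            )
            best = max(best, score)  # max-over-time pooling
        features.append(best)
    return features
```

Because the pooling takes the maximum over all window positions, the output length depends only on the number of filters and not on the word length, which is what allows the result to be concatenated with a word embedding.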
Bi-LSTM. Recurrent neural networks (RNNs) are well suited for modeling sequential data, achieving ground-breaking results in many NLP tasks (e.g., machine translation).
Bi-LSTMs (Hochreiter and Schmidhuber, 1997; Schuster and Paliwal, 1997) are capable of learning long-term dependencies and maintaining contextual features from both past and future states while avoiding the vanishing/exploding gradients problem. They consist of two separate hidden layers, one reading the sequence forward and one reading it backward, that feed into the same output layer.
CRF is used jointly with bi-LSTMs to avoid the output label independence assumptions of bi-LSTMs and to impose sequence labeling constraints as in Lample et al. (2016).
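The effect of the CRF layer at prediction time can be illustrated with Viterbi decoding, which picks the best-scoring tag sequence as a whole instead of choosing each tag independently. This is a minimal sketch under stated assumptions: `emissions` stands in for the per-token scores produced by the bi-LSTM, `transitions` stands in for learned tag-to-tag scores (missing pairs default to 0), the input is non-empty, and all names are hypothetical. It covers decoding only, not CRF training.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions:   list with one {tag: score} dict per token (non-empty).
    transitions: {(prev_tag, next_tag): score}; missing pairs score 0.0.
    """
    # best[t][tag] = (score of the best path ending in `tag` at t, previous tag)
    best = [{tag: (emissions[0][tag], None) for tag in tags}]
    for t in range(1, len(emissions)):
        column = {}
        for tag in tags:
            prev_tag = max(
                tags,
                key=lambda p: best[t - 1][p][0] + transitions.get((p, tag), 0.0),
            )
            score = (
                best[t - 1][prev_tag][0]
                + transitions.get((prev_tag, tag), 0.0)
                + emissions[t][tag]
            )
            column[tag] = (score, prev_tag)
        best.append(column)
    # Backtrack from the best final tag.
    last = max(tags, key=lambda tag: best[-1][tag][0])
    path = [last]
    for t in range(len(emissions) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    path.reverse()
    return path
```

With a strongly negative score for the transition from O to I-PER, the decoder prefers a B-PER, I-PER sequence even when O has the highest emission score for the first token on its own.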
Brown clusters (BC). Brown clustering is an unsupervised learning method where words are grouped based on the contexts in which they appear (Brown et al., 1992). The assumption is that words that behave in similar ways tend to appear in similar contexts and hence belong to the same cluster. BCs can be learned from large unlabeled texts and have been shown to improve POS tagging (Owoputi et al., 2013; Stratos and Collins, 2015). We test the effectiveness of using Brown clusters in the context of named entity recognition in a DNN model. We train 100 Brown clusters on our crawled code-switched corpus of about 380 million words (described above).

Named Entity Gazetteers
We use a large collec-
The architecture of our model is shown in Figure 1. For each word in the sequence, the CNN computes the character-level representation with character embeddings as inputs. Then the character-level representation vector is concatenated with both the word embedding vector and the feature embedding vectors (Brown clusters and gazetteers) to feed into the bi-LSTM layer. Finally, an affine transformation followed by a CRF is applied over the hidden representation of the bi-LSTM to obtain the probability distribution over all the named entity labels. Training is performed using stochastic gradient descent with a momentum of 0.9 and a batch size of 150. We employ dropout (Hinton et al., 2012) and early stopping (Caruana et al., 2000) (with a patience of 35) to mitigate overfitting. We use the hyper-parameters detailed in Table 1.
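The training procedure described above combines a momentum update with early stopping. The following is a minimal sketch of both pieces, not the actual training code: the learning rate `lr` is a placeholder (its value is not given in the text), `run_epoch` is a hypothetical callback that trains for one epoch and returns a validation score, and patience is interpreted here as the number of epochs without improvement after which training stops.

```python
def sgd_momentum_step(params, grads, velocity, lr, momentum=0.9):
    """One in-place SGD-with-momentum update on flat lists of floats."""
    for i, grad in enumerate(grads):
        velocity[i] = momentum * velocity[i] - lr * grad
        params[i] += velocity[i]


def train_with_early_stopping(run_epoch, max_epochs, patience=35):
    """Stop once the validation score has not improved for `patience` epochs.

    run_epoch(epoch) must return a validation score (higher is better).
    Returns (best_epoch, best_score).
    """
    best_score, best_epoch = float("-inf"), -1
    for epoch in range(max_epochs):
        score = run_epoch(epoch)
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_score
```

In practice the model weights from `best_epoch` would also be saved and restored; that bookkeeping is omitted here.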
The only preprocessing operation we conducted on the data was to convert it into Buckwalter transliteration (a character-to-character mapping) in order to avoid the complexity of dealing with UTF-8 characters.
Baseline. We use word representations only, with randomly initialized embeddings. Note that the shared task baseline for the test set is 62.71%.
Word+Chars. We add character representations in a one-dimensional CNN layer.
Word+Chars+Embed. We use pre-trained embeddings for words trained on a corpus of about 380 million words (described above) consisting of dialectal Egyptian and MSA data.
Word+Chars+Embed+BC. We add Brown Clusters (BC) to the network.
Word+Chars+Embed+BC+OS. We add oversampling (OS) to the network. We conduct oversampling by heuristically making 10-fold repetitions of sentences containing minority labels, in this case all classes other than the "O" label.
Word+Chars+Embed+BC+GZ. We further add a new layer for the named entity gazetteer (GZ).
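The oversampling heuristic used in the OS configuration above can be sketched as follows. This is an illustration under assumptions rather than the exact procedure: it assumes that "10-fold repetition" means each sentence containing at least one non-"O" tag appears ten times in total, and the function name and data layout (a list of token and tag lists) are hypothetical.

```python
def oversample(sentences, factor=10, majority_label="O"):
    """Repeat sentences that contain at least one minority (non-"O") tag.

    sentences: list of (tokens, tags) pairs.
    Returns a new list in which such sentences appear `factor` times
    and all other sentences appear once.
    """
    result = []
    for tokens, tags in sentences:
        has_entity = any(tag != majority_label for tag in tags)
        copies = factor if has_entity else 1
        result.extend([(tokens, tags)] * copies)
    return result
```

The repeated sentences would typically be shuffled together with the rest of the training data before batching, so that copies of the same sentence do not end up in the same batch.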

The results in Table 2 are reported on the validation set (except for the last row), and they show that the DNN model improves incrementally as more features and external resources are added. The best result is obtained with the aggregation of all features. Table 3 shows a breakdown of our system performance (in terms of accuracy) on the validation set. It also shows the number of instances and the percentage for each label. As the table shows, the "O" (non-entity) label accounts for 88% of the entire data, while all other tags combined make up the remaining 12%, which shows an imbalance in the representation of the named entity categories. Our system performs best with 'B-PER', 'I-PER', 'B-LOC' and 'B-TIME'.
Our system is ranked second among those participating in the shared task, achieving an FB1 average of 70.09%, with the first-ranked system scoring 71.62%, a difference of about 1.5% absolute.

Conclusion
We have presented a description of our system participating in the Shared Task on "Named Entity Recognition on Code-switched Data". We build a deep neural network with multiple layers for accommodating various features, such as pre-trained word embeddings, Brown clusters and named entity gazetteers. We have not relied on any linguistic rules, morphological analyzers or PoS taggers. We also implement the different layers as optional plug-ins, which makes our system more adaptable and scalable for languages that do not have similar external resources.