Arabic Named Entity Recognition: What Works and What’s Next

This paper presents the winning solution to the Arabic Named Entity Recognition challenge run by Topcoder.com. The proposed model integrates various tailored techniques, including representation learning, feature engineering, sequence labeling, and ensemble learning. The final model achieves a test F_1 score of 75.82% on the AQMAR dataset and outperforms baselines by a large margin. Detailed analyses are conducted to reveal both its strengths and limitations. Specifically, we observe that (1) representation learning modules can significantly boost performance but require proper pre-processing and (2) the resulting embeddings can be further enhanced with feature engineering due to the limited size of the training data. All implementations and pre-trained models are made public.


Introduction
Aiming to identify entities in natural language, named entity recognition (NER) serves as one of the fundamental steps in various applications. In many languages, the performance of NER has been significantly improved by recent advances in representation learning (Peters et al., 2018; Akbik et al., 2018). To promote the development of Arabic NER, a challenge was hosted on Topcoder.com 2 based on the public Arabic NER benchmark dataset (i.e., the AQMAR dataset) (Mohit et al., 2012). Challenge submissions were required to use only annotations from the training set, and manual reviews of the submitted solutions were further conducted to prevent cheating.
1 https://github.com/LiyuanLucasLiu/ArabicNER
2 https://www.topcoder.com/challenges/30087004

Among 137 registrants competing in the challenge 3, we placed first by tailoring various techniques and incorporating them together. Intuitively, it is hard to rely only on feature engineering to capture textual signals, especially for morphologically rich languages like Arabic (Habash, 2010). At the same time, neural networks have demonstrated great potential to automate high-quality representation construction in an end-to-end manner. Therefore, we leverage embedding modules to represent words with pre-trained vectors for better quality. In addition, we observe that handcrafted features can bring a considerable improvement. Consuming all these features, we train multiple LSTM-CRF models to construct the mapping from representations to predictions, and further aggregate their outputs with ensemble learning. Moreover, we incorporate a dictionary-based string matching model and observe that it improves the recall at some cost of precision, which results in a marginal F_1-score improvement. Our final ensemble model achieves a test F_1 score of 75.82%, outperforming all other participants as well as the previous state of the art by significant margins. We further conduct analyses on our solution to gain deeper insights into the task: (1) the effectiveness of representation learning and (2) the role of feature engineering.
The rest of the paper is organized as follows. The next section discusses related work. Section 3 introduces the problem setting and presents the data analysis. The proposed framework is presented in Section 4, including the model ensemble and the dictionary-based model. Tailored representation modules are introduced in Section 5. Finally, we discuss the experimental results in Section 6.

Related Work
Typically, named entity recognition is conducted as a sequence labeling task. Before deep learning demonstrated its effectiveness, traditional methods relied on handcrafted features (e.g., features based on POS tags) and language-specific resources (e.g., gazetteers) to capture textual signals. Machine learning models like conditional random fields (CRFs) and hidden Markov models (HMMs) are employed to capture the label dependency (Lafferty et al., 2001; Florian et al., 2003; Chieu and Ng, 2002). Many attempts have been made to reduce the reliance on feature engineering or other human endeavors, allowing the NER task to be solved in an end-to-end manner (Lample et al., 2016; Ma and Hovy, 2016; Shang et al., 2018). Recent studies have revealed that language models are effective representation modules for NER (Peters et al., 2017, 2018; Liu et al., 2018b; Akbik et al., 2018; Liu et al., 2018a).
At the same time, many approaches have been proposed specifically to solve the NER task in Arabic. Traditional Arabic NER models are mostly rule-based models (Shaalan, 2014). Recently, people have started to attack this task with machine learning methods (Helwe and Elbassuoni, 2017; Gridach, 2016). To further improve the performance, attempts have been made to combine both rule-based and learning-based approaches into a unified framework (Pasha et al., 2014; Abdelali et al., 2016). In addition, incorporating supervision from other domains or languages has been explored as well (Darwish, 2013).

Problem Setting
In this section, we first introduce the problem setting of sequence labeling. Then, we discuss the aforementioned Arabic NER challenge.

Sequence Labeling
In the sequence labeling framework, NER corpora are usually annotated following labeling schemes like BIO and IOBES, which encode entity boundary information (Ratinov and Roth, 2009). For example, in the BIO scheme, when a token sequence is identified as a named entity, its starting token is labeled as B- and its middle/end tokens as I-, each followed by the entity type; all other tokens are labeled as O. The IOBES scheme is similar to BIO but further uses S- for singleton entities and E- for end-of-entity tokens.
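To make the two schemes concrete, the following sketch converts entity spans to BIO and IOBES labels (the helper names are ours, not from the paper):

```python
# Illustrative span-to-label converters; spans_to_bio / spans_to_iobes are
# hypothetical helper names, not part of the paper's released code.

def spans_to_bio(n_tokens, spans):
    """spans: list of (start, end, type) with end exclusive."""
    labels = ["O"] * n_tokens
    for start, end, etype in spans:
        labels[start] = "B-" + etype
        for i in range(start + 1, end):
            labels[i] = "I-" + etype
    return labels

def spans_to_iobes(n_tokens, spans):
    labels = ["O"] * n_tokens
    for start, end, etype in spans:
        if end - start == 1:                 # S- for singleton entities
            labels[start] = "S-" + etype
        else:
            labels[start] = "B-" + etype
            for i in range(start + 1, end - 1):
                labels[i] = "I-" + etype
            labels[end - 1] = "E-" + etype   # E- for end-of-entity
    return labels

# "John lives in New York" with entities John/PER and New York/LOC
spans = [(0, 1, "PER"), (3, 5, "LOC")]
print(spans_to_bio(5, spans))    # ['B-PER', 'O', 'O', 'B-LOC', 'I-LOC']
print(spans_to_iobes(5, spans))  # ['S-PER', 'O', 'O', 'B-LOC', 'E-LOC']
```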
Using such labels, we define the input sequence as X = {x_1, x_2, · · ·, x_T}, where x_i is the i-th token and y_i is its label. Moreover, we define the character-level input for X as C = {c_{1,1}, c_{1,2}, · · ·, c_{1,l_1}, c_{2,1}, · · ·, c_{T,l_T}}, where {c_{i,1}, · · ·, c_{i,l_i - 1}} are the characters contained in the word x_i and c_{i,l_i} is the space character right after x_i. Then, the goal of NER becomes to predict the label y_i for each token x_i in the input sequence X.

Arabic NER Challenge
The Arabic NER challenge uses the public Arabic NER benchmark dataset (i.e., the AQMAR dataset) (Mohit et al., 2012). Its annotated entities are classified into four types (i.e., "Person", "Location", "Organization", and "Miscellaneous"). The dataset contains 28 hand-annotated Arabic Wikipedia articles: 14 are used as the training set, 7 as the development set, and 7 as the test set.
Data cleaning is further conducted on this dataset. Specifically, we observed that the label sequences are encoded in a noisy manner. For example, some entities are labelled as {B-, O, I-}, while the legitimate label sequence should be {B-, I-, I-}; some entities are labelled as {B-T_0, I-T_1} (here, T_0 and T_1 are two different entity types), while the legitimate label sequence should be {B-T_0, B-T_1}. In pursuit of more powerful models and more meaningful comparisons, we conduct label cleaning to regularize the label sequences. The resulting dataset is released for future study 4, and its statistics are summarized in Table ??. In the following sections, all comparisons are conducted on this cleaned dataset.
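The two noisy patterns above can be repaired mechanically. The following is only a sketch, assuming BIO labels; the exact cleaning rules behind the released dataset may differ:

```python
# A minimal sketch of BIO label cleaning; repair_bio is an illustrative name,
# not the paper's implementation. It fixes the two patterns described above:
#   {B-, O, I-}      -> {B-, I-, I-}   (fill the gap inside one entity)
#   {B-T0, I-T1}     -> {B-T0, B-T1}   (type change starts a new entity)

def repair_bio(labels):
    fixed = list(labels)
    for i, lab in enumerate(fixed):
        if not lab.startswith("I-"):
            continue
        t = lab[2:]
        if i == 0 or fixed[i - 1] == "O":
            # If the same-type entity started just before the O, fill the gap;
            # otherwise this I- actually begins a new entity.
            if i >= 2 and fixed[i - 2] != "O" and fixed[i - 2][2:] == t:
                fixed[i - 1] = "I-" + t
            else:
                fixed[i] = "B-" + t
        elif fixed[i - 1][2:] != t:
            fixed[i] = "B-" + t
    return fixed

print(repair_bio(["B-LOC", "O", "I-LOC"]))  # ['B-LOC', 'I-LOC', 'I-LOC']
print(repair_bio(["B-PER", "I-LOC"]))       # ['B-PER', 'B-LOC']
```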

Model Framework
As visualized in Figure 1, we design a heterogeneous framework, which incorporates various techniques: (1) It employs representation learning and sequence labeling as the basic sequence labeling model; (2) It leverages ensemble learning to combine outputs from different NER models; and (3) It further incorporates a dictionary-based string matching model.

Sequence Labeling Model
As to the basic sequence labeling model, we assume there are n different representation modules. Given the j-th token in the input sequence, the representation vector produced by module M_i is denoted as f_{i,j}. In this paper, we concatenate the outputs from different modules as the representation (the input of LSTM-CRF), i.e., f_j = [f_{1,j}; f_{2,j}; · · ·; f_{n,j}]. Given the input sequence X, we define its token representations as F = {f_1, f_2, · · ·, f_T}.

Building upon the representation modules, we use LSTM-CRF (Huang et al., 2015) to conduct entity extraction: we first feed F into Bi-LSTMs, whose outputs are marked as Z = {z_1, z_2, · · ·, z_T}. A linear-chain CRF is further leveraged to model the whole label sequence simultaneously. Specifically, for the input sequence Z, the CRF defines the conditional probability of Y = {y_1, · · ·, y_T} as

p(Y | Z) = ∏_{t=1}^{T} φ(y_{t-1}, y_t, z_t) / ∑_{Ŷ ∈ Y(Z)} ∏_{t=1}^{T} φ(ŷ_{t-1}, ŷ_t, z_t),   (1)

where Ŷ = {ŷ_1, · · ·, ŷ_T} is a possible label sequence, Y(Z) refers to the set of all possible label sequences for Z, and φ(y_{t-1}, y_t, z_t) is the potential function of the CRF. In this paper, we define the potential function as

φ(y_{t-1}, y_t, z_t) = exp(W_{y_t} z_t + b_{y_{t-1}, y_t}),

where W_{y_t} and b_{y_{t-1}, y_t} are the weight and bias. During model training, we use the negative log-likelihood of Equation 1 as the loss function. In the inference stage, the predicted label sequence for input X is the one maximizing the probability in Equation 1. Although the denominator in Equation 1 contains an exponential number of terms 5, due to the definition of the potential function, both training and inference can be efficiently conducted using dynamic programming.
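The dynamic program behind the denominator of Equation 1 can be sketched as follows: the forward algorithm sums over all label sequences in O(T · |Y|^2) time instead of enumerating exponentially many terms. The toy potential tables below are illustrative only:

```python
import math

# A sketch of the forward algorithm for the log-partition (the log of the
# denominator in Equation 1). log_phi[t][y_prev][y] stands in for
# log phi(y_{t-1}, y_t, z_t); a fixed start state y_0 = 0 is assumed.

def log_sum_exp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_partition(log_phi, n_labels):
    """log_phi: list over t of [y_prev][y] log-potential tables."""
    alpha = list(log_phi[0][0])          # scores after the first step
    for t in range(1, len(log_phi)):
        alpha = [log_sum_exp([alpha[yp] + log_phi[t][yp][y]
                              for yp in range(n_labels)])
                 for y in range(n_labels)]
    return log_sum_exp(alpha)

# With uniform potentials (all log-potentials 0), the partition function just
# counts the label sequences: |Y|^T = 2^3 = 8 for two labels and three steps.
T, Y = 3, 2
uniform = [[[0.0] * Y for _ in range(Y)] for _ in range(T)]
print(math.exp(log_partition(uniform, Y)))  # ≈ 8.0
```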
The dictionary-based NER model and the representation learning modules are introduced in the following sections.

Sequence Labeling Model Ensemble
To obtain better performance, we apply ensemble learning to the sequence labeling results. Specifically, as shown in Figure 1, multiple NER models are separately trained with the shared representation modules, and their results are combined as the final output.
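The paper does not specify the exact combination rule; one common choice, shown here as a sketch, is a per-token majority vote over the models' predictions:

```python
from collections import Counter

# A minimal per-token majority-vote ensemble; this is one plausible
# combination rule, not necessarily the one used in the paper.

def ensemble_vote(predictions):
    """predictions: list of label sequences, one per model."""
    return [Counter(labels).most_common(1)[0][0]
            for labels in zip(*predictions)]

model_outputs = [
    ["B-PER", "O", "B-LOC"],
    ["B-PER", "O", "O"],
    ["B-PER", "I-PER", "B-LOC"],
]
print(ensemble_vote(model_outputs))  # ['B-PER', 'O', 'B-LOC']
```

A post-hoc repair step may still be needed after voting, since token-wise votes can produce label sequences that violate the BIO constraints.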

Dictionary-based NER Model
Besides the sequence labeling ensemble model, we also incorporate a dictionary-based NER model. Specifically, we first build a dictionary mapping surface names to their types from the training set, and then apply this dictionary via string matching. Dictionary-extracted entities are added to the final prediction if and only if they do not conflict with the sequence labeling results. For example, in Figure 1, since the two-word entity (i.e., B-LOC I-LOC) detected by the dictionary-based model overlaps with the sequence labeling results, this entity is dropped; at the same time, because the one-word entity (i.e., the second B-LOC) detected by the dictionary-based model does not overlap with any entity detected by the sequence labeling model, it is integrated into the final results. In our experiments, we found this enrichment by the dictionary-based model improves the recall at a relatively smaller cost of precision, thus improving the F_1 score.
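The matching-then-merging procedure can be sketched as below; the greedy longest-first matcher and the function names are our illustration, not the paper's implementation:

```python
# Illustrative dictionary matching plus conflict-aware merging.

def dict_matches(tokens, dictionary, max_len=3):
    """Greedy longest-first string matching; returns (start, end, type) spans."""
    spans, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            name = " ".join(tokens[i:i + n])
            if name in dictionary:
                spans.append((i, i + n, dictionary[name]))
                i += n
                break
        else:
            i += 1
    return spans

def merge(seq_spans, dict_spans):
    """Keep dictionary spans only when they touch no labeler-predicted token."""
    occupied = {i for s, e, _ in seq_spans for i in range(s, e)}
    merged = list(seq_spans)
    for s, e, t in dict_spans:
        if not any(i in occupied for i in range(s, e)):  # drop conflicts
            merged.append((s, e, t))
    return sorted(merged)

dictionary = {"New York": "LOC", "Cairo": "LOC"}
tokens = ["He", "flew", "from", "Cairo", "to", "New", "York"]
seq_spans = [(5, 7, "LOC")]              # the labeler already found "New York"
print(merge(seq_spans, dict_matches(tokens, dictionary)))
# [(3, 4, 'LOC'), (5, 7, 'LOC')]
```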

Representation Learning Modules
In this section, we introduce the three representation learning modules: (1) word embedding, (2) contextualized representation, and (3) handcrafted features.

Word Embedding
Based on the distributional hypothesis (i.e., "a word is characterized by the company it keeps" (Harris, 1954)), word embedding methods aim to learn distributed representations of words by analyzing their contexts (Mikolov et al., 2013). Recent work shows that word embeddings can uncover textual information at various levels (Artetxe et al., 2018). Hence, we leverage word embedding as a part of the word representation. Due to the limited size of the training set, we fix the pre-trained word embeddings during the training of the NER models. When the pre-trained embeddings have a high dimension, we add a linear projection to map them to a relatively low dimension.
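The projection step can be sketched with toy numbers as follows; the pre-trained vector, the projection matrix, and the dimensions are all illustrative, and a real model would learn W and b in a deep learning framework while keeping the embedding frozen:

```python
# A minimal sketch of projecting a frozen pre-trained embedding to a lower
# dimension with a learned linear map; pure-Python matrix math keeps the
# example self-contained.

def project(vec, weight, bias):
    """weight: d_out x d_in, bias: d_out; returns weight @ vec + bias."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]

pretrained = {"القاهرة": [0.5, -0.2, 0.1]}   # a frozen 3-d embedding (toy)
W = [[1.0, 0.0, 0.0],                        # a learned 2x3 projection (toy)
     [0.0, 1.0, 0.0]]
b = [0.0, 0.0]
print(project(pretrained["القاهرة"], W, b))  # [0.5, -0.2]
```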

Contextualized Representation
Contextualized representations have been widely adopted in state-of-the-art sequence labeling models. Typically, they rely on bidirectional neural language models to capture the local contextual information before and after a given word. Such representations provide rich information supplementary to the context-agnostic information contained in word embeddings. Specifically, character-level language models were first used to provide additional supervision (Liu et al., 2018b), and further exploration observed their effectiveness as a pre-training task for constructing contextualized word representations (Akbik et al., 2018).
We present the details of character-level language modeling and its integration below.
Character-Level Language Modeling. A bidirectional character-level language model contains two character-level language models that capture information from the two directions. Character-level language modeling aims to model the probability distribution of the character sequence. Typically, the probability of the sequence {c_1, · · ·, c_T} is defined in a "forward" manner: p(c_1, · · ·, c_T) = ∏_{t=1}^{T} p(c_t | c_1, · · ·, c_{t-1}). To calculate this conditional probability, we first map the input sequence C to a list of character embedding vectors and pass them into a recurrent neural network, whose output is referred to as h_t. Then, the probability p(c_t | c_1, · · ·, c_{t-1}) is calculated using the softmax function. The backward language model is the same as the forward language model, except that it decomposes the probability of the sequence {c_1, · · ·, c_T} from the end to the front as p(c_1, · · ·, c_T) = ∏_{t=1}^{T} p(c_t | c_{t+1}, · · ·, c_T). Its output for character c_t is denoted as h^r_t. Both language models use the negative log-likelihood as the training objective.
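The forward factorization can be illustrated with a toy model standing in for the recurrent network's softmax output; the bigram lookup table below is purely illustrative:

```python
import math

# A sketch of p(c_1..c_T) = prod_t p(c_t | c_<t); a toy bigram table plays
# the role of the RNN's per-step softmax distribution.

def sequence_log_prob(chars, cond_prob):
    """cond_prob(prev, c) -> p(c | history); summed in log space."""
    logp, prev = 0.0, "<s>"
    for c in chars:
        logp += math.log(cond_prob(prev, c))
        prev = c
    return logp

bigram = {("<s>", "a"): 0.5, ("a", "b"): 0.25, ("b", "a"): 0.5}
p = lambda prev, c: bigram.get((prev, c), 1e-9)
print(math.exp(sequence_log_prob("aba", p)))  # ≈ 0.0625 = 0.5 * 0.25 * 0.5
```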
Language Model Integration. Using the bidirectional character-level language models, we construct contextualized representations for each word. Specifically, we feed the input character sequence C to the language models, and then concatenate the hidden state of the forward language model at c_{i,l_i} (the space right after x_i) and the hidden state of the backward language model at c_{i-1,l_{i-1}} (the space right before x_i) as the representation for x_i. We refer to these two hidden states as h_i and h^r_i. Due to the complex nature of natural language, large dimensions of h_i and h^r_i are usually required in language models, which might lead to overfitting in the NER task. To avoid such cases, we add a linear transformation layer to project h_i and h^r_i to a lower dimension. In detail, we use

r_i = W_cr [h_i; h^r_i] + b_cr,

where W_cr and b_cr are parameters learned during the training of the NER models. The output r_i is the contextualized representation for x_i.
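A minimal sketch of how the per-word positions are located in the character sequence: the forward state is read at the space right after each word, and the backward state at the space right before it. Only the indices are computed here; the hidden states themselves would come from the trained language models:

```python
# Illustrative computation of the boundary positions used to read the
# forward/backward LM hidden states for each word.

def boundary_indices(sentence):
    """Return (before, after) space positions for each word; -1 marks the
    sentence start, and a trailing space is appended after the last word."""
    text = sentence + " "
    after = [i for i, ch in enumerate(text) if ch == " "]
    before = [-1] + after[:-1]
    return list(zip(before, after))

# Two words: forward states read at indices 2 and 6, backward at -1 and 2.
print(boundary_indices("mn msr"))  # [(-1, 2), (2, 6)]
```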

Handcrafted Features
Due to the limited amount of available annotations, we further handcraft word shape features to help the model better capture textual signals. Specifically, all words are classified into three classes: (1) all numbers are marked as "num"; (2) among the remaining words, those containing English characters are marked as "en"; (3) all other words are marked as "ar". These three categories are further mapped to three different vectors as part of the token representation.
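The three word-shape classes can be implemented directly; the regular expressions below reflect our reading of "numbers" and "English characters":

```python
import re

# A direct sketch of the three word-shape classes described above.

def word_shape(token):
    if re.fullmatch(r"[0-9]+", token):
        return "num"                      # (1) all numbers
    if re.search(r"[A-Za-z]", token):
        return "en"                       # (2) contains English characters
    return "ar"                           # (3) everything else

print([word_shape(t) for t in ["1912", "Wikipedia", "مصر"]])
# ['num', 'en', 'ar']
```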
Although these handcrafted features are quite simple, similar to existing work (Dozat, 2016), they result in a remarkable performance improvement in our experiments. More discussion of this feature engineering design is included in Section 6.

Experiments
In this section, we present the experimental results on the AQMAR dataset.

Implementation Detail
As to the pre-trained language models, we conduct training on Arabic Wikipedia texts with a vocabulary of 256 characters (out-of-vocabulary characters are mapped to a special <UNK> character). Since the resulting language model is used to construct contextualized representations for the downstream task, whose input is space-separated, we conduct further pre-processing. Specifically, we first tokenize the text and then concatenate the token sequence with spaces. To demonstrate the importance of pre-processing, we trained two kinds of language models, one with pre-processing and the other without.
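The tokenize-then-rejoin pre-processing can be sketched as follows; the simple punctuation-splitting tokenizer is our assumption, as the paper does not specify which tokenizer is used:

```python
import re

# Illustrative pre-processing: tokenize, then re-join with single spaces so
# the LM sees the same spacing as the space-separated NER input.

def preprocess(text):
    # Split into runs of word characters or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text, re.UNICODE)
    return " ".join(tokens)

print(preprocess("ولد عام 1912، في القاهرة."))
# 'ولد عام 1912 ، في القاهرة .'
```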
For pre-trained word embeddings, we adopt two sets of embeddings. One is trained with the word2vec model (Mikolov et al., 2013); it has 100 dimensions and is publicly available 6. The other is trained with the fastText model (Bojanowski et al., 2017), which is released together with those for 156 other languages 7; it has 300 dimensions and is projected to 100 dimensions before being concatenated with other vectors.

Hyper-parameter
For language model training, we use Nadam (Dozat, 2016) as the optimizer, set the learning rate to 0.002, clip the gradient at 1, set the batch size to 128, and limit the backpropagation length to 256. As to the RNN, we use one-layer LSTMs with 2048 hidden states. We set the character embedding to be 128-dimensional and project the LSTM outputs to 50 dimensions before concatenating them with other vectors.
As to the sequence labeling task, we use LSTMs with 250 hidden states in the LSTM-CRF layer, apply dropout with a ratio of 0.5, and apply additional word dropout to each representation module with a ratio of 0.1. Following previous work (Reimers and Gurevych, 2017), we use Nadam (Dozat, 2016) as the optimizer, set the learning rate to 0.002, clip the gradient at 1, and set the batch size to 32.

Performance Comparison
As summarized in Table 2, our final model achieves a test F_1 score of 75.82% and outperforms all baselines.

Ablation Study Setting. In the ablation study, we first detach the dictionary-based NER model from the resulting system and refer to the ensemble sequence labeling model as "-Dict-based". Then, we refer to the basic sequence labeling model as "-Ensemble". After that, we detach the handcrafted features and mark the resulting model as "-Word shape". Pre-processing is further removed from language model training, which is marked as "-Pre-process". In the end, we remove the language model, which leads to a typical LSTM-CRF model (Huang et al., 2015) with pre-trained word embeddings; we refer to this model as "-Language model". The results are summarized in Table 2.
Discussion. We find that the dictionary-based NER model 8 improves the recall at the cost of precision and improves the F_1 score by a small margin. The results also demonstrate the effectiveness of ensemble learning. At the same time, we find that the major F_1 improvements come from better capturing of task-related signals: by properly adding language models or designing handcrafted features, the F_1 score improves significantly. This verifies the effectiveness of contextualized representations, but it also reveals their weakness. Specifically, although the constructed character-level language model has the potential to capture word shape signals, adding handcrafted features (i.e., word shape) improves the F_1 score from 71.47% to 73.80%. We conjecture that this is caused by the limited amount of training data with English entities, which prevents the model from properly constructing task-related representations. A further comparison between these two models shows that their major differences lie in the predictions for entities containing both Arabic and English, which validates our intuition. Besides, we find that the pre-processing used in language model training is crucial: removing it drops the F_1 score from 71.47% to 66.07%. The main reason is that, although pre-trained language models are powerful, they are agnostic to the target task corpus and suffer from the mismatch between the two.

Conclusion
In this paper, we introduce the winning solution to the Arabic Named Entity Recognition challenge. First, we give a detailed introduction to the system design and the integrated techniques. We further conduct an ablation study to reveal the effectiveness of each module and find that all modules bring performance improvements. We observe that properly capturing task-related features is crucial to the performance. We also notice that current contextualized representation learning techniques, although effective, can be further enhanced by incorporating handcrafted features to better handle corner cases.