A Multi-task Approach for Named Entity Recognition in Social Media Data

Named Entity Recognition for social media data is challenging because of its inherent noisiness. In addition to improper grammatical structures, it contains spelling inconsistencies and numerous informal abbreviations. We propose a novel multi-task approach by employing a more general secondary task of Named Entity (NE) segmentation together with the primary task of fine-grained NE categorization. The multi-task neural network architecture learns higher order feature representations from word and character sequences along with basic Part-of-Speech tags and gazetteer information. This neural network acts as a feature extractor to feed a Conditional Random Fields classifier. We were able to obtain the first position in the 3rd Workshop on Noisy User-generated Text (WNUT-2017) with a 41.86% entity F1-score and a 40.24% surface F1-score.


Introduction
Named Entity Recognition (NER) aims at identifying different types of entities, such as person names, companies, and locations, within a given text. This information is useful for higher-level Natural Language Processing (NLP) applications such as information extraction, summarization, and data mining (Chen et al., 2004; Banko et al., 2007; Aramaki et al., 2009). Learning Named Entities (NEs) from social media is a challenging task mainly because (i) entities usually represent a small part of limited annotated data, which makes it hard for models to generalize, and (ii) they do not follow strict rules (Ritter et al., 2011; Li et al., 2012).
This paper describes a multi-task neural network that aims at generalizing the underlying rules of emerging NEs in user-generated text. In addition to the main category classification task, we employ an auxiliary but related secondary task called NE segmentation (i.e., a binary classification of whether a given token is a NE or not). We use both tasks to jointly train the network. More specifically, the model captures word shapes and some orthographic features at the character level by using a Convolutional Neural Network (CNN). For contextual and syntactical information at the word level, such as word and Part-of-Speech (POS) embeddings, the model implements a Bidirectional Long Short-Term Memory (BLSTM) architecture. Finally, to cover well-known entities, the model uses a dense representation of gazetteers. Once the network is trained, we use it as a feature extractor to feed a Conditional Random Fields (CRF) classifier. The CRF classifier jointly predicts the most likely sequence of labels, giving better results than the network itself.
With respect to the participants of the shared task, our approach achieved the best results in both categories: 41.86% F1-score for entities, and 40.24% F1-score for surface forms. The data for this shared task is provided by Derczynski et al. (2017).

Related Work
Traditional NER systems use hand-crafted features, gazetteers, and other external resources to perform well (Ratinov and Roth, 2009). Luo et al. (2015) obtain state-of-the-art results by relying on heavily hand-crafted features, which are expensive to develop and maintain. Recently, many studies have outperformed traditional NER systems by applying neural network architectures. For instance, Lample et al. (2016) use a bidirectional LSTM-CRF architecture and obtain state-of-the-art performance without relying on hand-crafted features. Limsopatham and Collier (2016), who achieved first place in the WNUT-2016 shared task, use a BLSTM neural network to leverage orthographic features. We use a similar approach, but we employ the CNN and the BLSTM in parallel instead of forwarding the CNN output to the BLSTM. Nevertheless, our main contribution lies in Multi-Task Learning (MTL) and in the combination of POS tag and gazetteer representations that feeds the network.
Recently, MTL has gained significant attention. Researchers have tried to correlate the success of MTL with label entropy, regularizers, training data size, and other aspects (Martínez Alonso and Plank, 2017; Bingel and Søgaard, 2017). For instance, Collobert and Weston (2008) use a multi-task network for different NLP tasks and show that the multi-task setting improves generality among shared tasks. In this paper, we take advantage of the multi-task setting by adding a more general secondary task, NE segmentation, along with the primary NE categorization task.

Methodology
This section describes our system[1] in three parts: feature representation, model description[2], and sequential inference.

Feature Representation
We select features to represent the most relevant aspects of the data for the task. The features are divided into three categories: character, word, and lexicons.
Character representation: we use an orthographic encoder similar to that of Limsopatham and Collier (2016) to encapsulate capitalization, punctuation, word shape, and other orthographic features. The only difference is that we handle non-ASCII characters. For instance, the sentence "3rd Workshop !" becomes "ncc Cccccccc p" as we map numbers to 'n', letters to 'c' (or 'C' if capitalized), and punctuation marks to 'p'. Non-ASCII characters are mapped to 'x'. This encoded representation reduces the sparsity of character features and allows us to focus on word shapes and punctuation patterns. Once we have an encoded word, we represent each character with a 30-dimensional vector (Ma and Hovy, 2016). We account for a maximum length of 20 characters per word, applying post padding on shorter words and truncating longer words.
Word representation: we have two different representations at the word level. The first one uses pre-trained word embeddings trained on 400 million tweets, representing each word with 400 dimensions (Godin et al., 2015). The second one uses Part-of-Speech tags generated by the CMU Twitter POS tagger (Owoputi et al., 2013). The POS tag embeddings are represented by 100-dimensional vectors. In order to capture contextual information, we account for a context window of 3 tokens on both words and POS tags, where the target token is in the middle of the window.
[1] https://github.com/tavo91/NER-WNUT17
[2] The neural network is implemented using Keras (https://github.com/fchollet/keras) with Theano as backend (http://deeplearning.net/software/theano/).
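As an illustration, the orthographic encoding, the padding to 20 characters, and the 3-token context window can be sketched in Python as follows. This is a minimal sketch rather than the released implementation: the function names, the padding symbols, and the treatment of ASCII control characters are assumptions made for the example.

```python
import string

MAX_CHARS = 20      # maximum characters kept per word, as stated in the text
CHAR_PAD = "0"      # hypothetical padding symbol; the text does not name one
TOKEN_PAD = "<PAD>"  # hypothetical sentence-boundary token for context windows


def encode_orthography(text):
    """Map every character to its orthographic class, keeping whitespace."""
    encoded = []
    for ch in text:
        if ch.isspace():
            encoded.append(ch)
        elif ord(ch) > 127:
            encoded.append("x")  # non-ASCII characters
        elif ch.isdigit():
            encoded.append("n")  # numbers
        elif ch.isalpha():
            encoded.append("C" if ch.isupper() else "c")  # letters
        elif ch in string.punctuation:
            encoded.append("p")  # punctuation marks
        else:
            encoded.append("x")  # assumption: remaining ASCII control characters
    return "".join(encoded)


def pad_or_truncate(encoded_word, max_len=MAX_CHARS, pad=CHAR_PAD):
    """Post-pad short words and truncate long ones to exactly max_len symbols."""
    return encoded_word[:max_len].ljust(max_len, pad)


def context_windows(tokens, size=3, pad=TOKEN_PAD):
    """Return one window per token with the target token in the middle."""
    half = size // 2
    padded = [pad] * half + list(tokens) + [pad] * half
    return [padded[i:i + size] for i in range(len(tokens))]


encoded = encode_orthography("3rd Workshop !")
windows = context_windows(["3rd", "Workshop", "!"])
```

Here `encoded` reproduces the "ncc Cccccccc p" example from the text, and each entry of `windows` holds the previous token, the target token, and the next token.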
We randomly initialize both the character features and the POS tag vectors using a uniform distribution whose range depends on dim, where dim is the dimension of the vectors from each feature representation (He et al., 2015).
Lexical representation: we use gazetteers provided by Mishra and Diesner (2016) to help the model improve its precision for well-known entities. For each word we create a binary vector of 6 dimensions (one dimension per class). Each of the vector dimensions is set to one if the word appears in the gazetteers of the related class.
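The 6-dimensional lexical vector can be sketched as below. The class list follows the WNUT-2017 entity types; the toy gazetteer entries, the function name, and the lowercased lookup are assumptions for illustration and are not taken from the gazetteers of Mishra and Diesner (2016).

```python
# One dimension per entity class (WNUT-2017 entity types).
CLASSES = ["person", "location", "corporation", "product", "creative-work", "group"]

# Hypothetical toy gazetteers; real gazetteers would be loaded from files.
TOY_GAZETTEERS = {
    "person": {"paris", "obama"},
    "location": {"paris", "houston"},
    "corporation": {"google"},
}


def gazetteer_vector(word, gazetteers, classes=CLASSES):
    """Return a binary vector with a 1 for every class whose gazetteer lists the word."""
    token = word.lower()  # assumption: case-insensitive lookup
    return [1 if token in gazetteers.get(cls, set()) else 0 for cls in classes]


paris_vector = gazetteer_vector("Paris", TOY_GAZETTEERS)
unknown_vector = gazetteer_vector("asdf", TOY_GAZETTEERS)
```

A word listed under several classes, such as "Paris" in the toy data, sets several dimensions at once, while an unlisted word yields the all-zero vector.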

Model Description
Character level CNN: we use a CNN architecture to learn word shapes and some orthographic features at the character level representation (see Figure 1). The characters are embedded into an $\mathbb{R}^{d \times l}$ space, where $d$ is the dimension of the features per character and $l$ is the maximum length of characters per word. Then, we take the character embeddings and apply 2-stacked convolutional layers. Following Zhou et al. (2015), we perform a global average pooling instead of the widely used max pooling operation. Finally, the result is passed to a fully-connected layer using a Rectified Linear Unit (ReLU) activation function, which yields the character-based representation of a word. The resulting vector is used as input for the rest of the network.
Figure 1: Orthographic character-based representation of a word (green) using a CNN with 2-stacked convolutional layers. The first layer takes the input from embeddings (red) while the second layer (blue) takes the input from the first convolutional layer. Global Average Pooling is applied after the second convolutional layer.
Word level BLSTM: we use a Bidirectional LSTM (Dyer et al., 2015) to learn the contextual information of a sequence of words as described in Figure 2. Word embeddings are initialized with pre-trained Twitter word embeddings from a Skip-gram model (Godin et al., 2015) using word2vec (Mikolov et al., 2013). Additionally, we use POS tag embeddings, which are randomly initialized using a uniform distribution. The model receives the concatenation of both POS tag and Twitter word embeddings. The BLSTM layer extracts the features from both forward and backward directions and concatenates the resulting vectors from each direction ($[\overrightarrow{h}; \overleftarrow{h}]$). Following Ma and Hovy (2016), we use 100 neurons per direction. The resulting vector is used as input for the rest of the network.
Lexicon network: we take the lexical representation vectors of the input words and feed them into a fully-connected layer. We use 32 neurons on this layer and a ReLU activation function. Then, the resulting vector is used as input for the rest of the network.
Multi-task network: we create a unified model to predict the NE segmentation and NE categorization tasks simultaneously. Typically, the additional task acts as a regularizer to generalize the model (Goodfellow et al., 2016; Collobert and Weston, 2008). The concatenation of character, word, and lexical vectors is fed into the NE segmentation and categorization tasks. We use a single-neuron layer with a sigmoid activation function for the secondary NE segmentation task, whereas for the primary NE categorization task, we employ a 13-neuron layer with a softmax activation function. Finally, we add the losses from both tasks and feed the total loss backward during training.
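The summed objective of the two output layers can be illustrated for a single token with the plain-Python sketch below. The text only states that the two losses are added; the choice of binary cross-entropy for the sigmoid output and categorical cross-entropy for the softmax output, as well as all function names, are assumptions of this sketch.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function for the segmentation output."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softmax(logits):
    """Softmax over the categorization logits (13 labels in the text)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def binary_cross_entropy(prob, target, eps=1e-12):
    """Assumed loss for the single-neuron segmentation layer."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1 - target) * math.log(1.0 - prob))


def categorical_cross_entropy(probs, gold_index, eps=1e-12):
    """Assumed loss for the softmax categorization layer."""
    return -math.log(max(probs[gold_index], eps))


def multitask_loss(seg_logit, cat_logits, is_entity, gold_label):
    """Total loss for one token: segmentation loss plus categorization loss."""
    seg_loss = binary_cross_entropy(sigmoid(seg_logit), is_entity)
    cat_loss = categorical_cross_entropy(softmax(cat_logits), gold_label)
    return seg_loss + cat_loss


# An untrained network with all-zero logits: sigmoid gives 0.5, softmax gives 1/13.
uniform_loss = multitask_loss(0.0, [0.0] * 13, is_entity=1, gold_label=0)
```

With all-zero logits the two terms are log 2 and log 13, so `uniform_loss` equals log 26; because both terms share the same input features, gradients from the segmentation task also shape the representation used for categorization.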

Sequential Inference
The multi-task network predicts probabilities for each token in the input sentence individually. Thus, those individual probabilities do not account for sequential information. We exploit the sequential information by using a Conditional Random Fields classifier over those probabilities. This allows us to jointly predict the most likely sequence of labels for a given sentence instead of performing a word-by-word prediction. More specifically, we take the weights learned by the multi-task neural network and use them as features for the CRF classifier (see Figure 3). Taking weights from the common dense layer captures both the segmentation and the categorization features.
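The difference between word-by-word prediction and joint sequence prediction can be illustrated with Viterbi decoding, the standard inference procedure for linear-chain CRFs. The sketch below covers decoding only: the label subset, the emission scores, and the transition scores are hand-picked assumptions, whereas a trained CRF would learn its weights from data.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions: list of {label: score}, one dict per token.
    transitions: {(previous_label, label): score}; missing pairs score 0.0.
    """
    if not emissions:
        return []
    best = [{lab: emissions[0][lab] for lab in labels}]
    back = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in labels:
            prev = max(
                labels,
                key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0),
            )
            scores[cur] = (
                best[-1][prev] + transitions.get((prev, cur), 0.0) + emissions[t][cur]
            )
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda lab: best[-1][lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


LABELS = ["O", "B-person", "I-person"]
# Hypothetical per-token scores for a two-token name.
EMISSIONS = [
    {"O": -0.6, "B-person": -0.9, "I-person": -3.0},
    {"O": -2.0, "B-person": -1.5, "I-person": -0.4},
]
# Hypothetical transition scores: "O" followed by "I-person" is heavily penalized.
TRANSITIONS = {("O", "I-person"): -10.0, ("B-person", "I-person"): 0.5}

greedy = [max(LABELS, key=lambda lab: e[lab]) for e in EMISSIONS]
decoded = viterbi(EMISSIONS, TRANSITIONS, LABELS)
```

In this toy case the word-by-word choice `greedy` is the inconsistent sequence O, I-person, while `decoded` is B-person, I-person, because the transition scores make the joint sequence more likely.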

Experimental Settings
We preprocess all the datasets by replacing the URLs with the token <URL> before performing any experiment. Additionally, we use half of the development set for validation and the other half for evaluation.
Figure 3: Overall system design. First, the system embeds a sentence into a high-dimensional space and uses CNN, BLSTM, and dense encoders to extract features. Then, it concatenates the resulting vectors of each encoder and performs multi-task learning. The top left single-node layer represents segmentation (red) while the top right three-node layer represents categorization (blue). Finally, a CRF classifier uses the weights of the common dense layer to perform a sequential classification.
Regarding the network hyper-parameters, in the case of the CNN, we set the kernel size to 3 on both convolutional layers. We also use the same number of filters on both layers: 64. Increasing the number of filters and the number of convolutional layers yields worse results, and it takes significantly more time. In the case of the BLSTM architecture, we add dropout layers before and after the Bidirectional LSTM layers with dropout rates of 0.5. The dropout layers allow the network to reduce overfitting (Srivastava et al., 2014). We also tried using a batch normalization layer instead of dropouts, but the experiment yielded worse results. The training of the whole neural network is conducted using a batch size of 500 samples, and 150 epochs. Additionally, we compile the model using the AdaMax optimizer (Kingma and Ba, 2014). Accuracy and F1-score are used as evaluation metrics.
For sequential inference, the CRF classifier uses L-BFGS as a training algorithm with L1 and L2 regularization. The penalties for L1 and L2 are 1.0 and 1.0e-3, respectively.
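For reference, the reported settings can be collected in a single configuration, sketched below. The values come from the text; the dictionary layout and key names are hypothetical, and the `c1` and `c2` keys follow the CRFsuite naming for the L1 and L2 coefficients, which is an assumption since this section does not name the CRF toolkit.

```python
# Hypothetical layout; only the values are taken from the text.
NETWORK_CONFIG = {
    "char_cnn": {
        "char_embedding_dim": 30,
        "max_word_length": 20,
        "conv_layers": 2,
        "kernel_size": 3,
        "filters": 64,
    },
    "blstm": {
        "units_per_direction": 100,
        "dropout_before": 0.5,
        "dropout_after": 0.5,
    },
    "lexicon_dense": {"units": 32, "activation": "relu"},
    "outputs": {
        "segmentation": {"units": 1, "activation": "sigmoid"},
        "categorization": {"units": 13, "activation": "softmax"},
    },
    "training": {"optimizer": "adamax", "batch_size": 500, "epochs": 150},
}

# L-BFGS training with L1 (c1) and L2 (c2) penalties.
CRF_CONFIG = {"algorithm": "lbfgs", "c1": 1.0, "c2": 1.0e-3}
```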

Results and Discussion
We compare the results of the multi-task neural network itself and the CRF classifier on each of our experiments. The latter one always shows the best results, which emphasizes the importance of sequential information. The results of the CRF, using the development set, are in Table 1.
Moreover, the addition of a secondary task allows the CRF to use more relevant features from the network, improving its results from an F1-score of 52.42% to 54.12%. Our finding that a multi-task architecture is generally preferable to the single-task architecture is consistent with prior research (Søgaard and Goldberg, 2016; Collobert and Weston, 2008; Attia et al., 2016; Maharjan et al., 2017). We also study the relevance of our features by performing multiple experiments with the same architecture and different combinations of features. For instance, removing gazetteers from the model drops the results from 54.12% to 52.69%. Similarly, removing POS tags gives worse results (51.12%). Among many combinations, the feature set presented in Section 3.1 yields the best results.
The final results of our submission to the WNUT-2017 shared task are shown in Table 2. Our approach obtains the best results for the person and location categories. It is less effective for corporation, and the most difficult categories for our system are creative-work and product. Our intuition is that the latter two classes are the most difficult to predict because they grow faster and have less restrictive patterns than the rest. For instance, products can have any type of letters or numbers in their names, and creative works can have as many words as their titles can hold (e.g., names of movies, books, songs, etc.). Regarding the shared-task metrics, our approach achieves a 41.86% F1-score for entities and 40.24% for surface forms. Table 3 shows that our system yields similar results to the other participants on both metrics. In general, the final scores are low, which reflects the difficulty of the task and shows that the problem is far from being solved.

Error Analysis
By evaluating the errors made by the CRF classifier, we find that the NE boundaries are a problem. For instance, when a NE is preceded by an article starting with a capitalized letter, the model includes the article as if it were part of the NE. This behavior may be caused by the capitalization features captured by the CNN. Similarly, if a NE is followed by a conjunction and another NE, the classifier tends to join both NEs as if the conjunction were part of a single unified entity. Another common problem shown by the classifier is that fully-capitalized NEs are disregarded most of the time. This pattern may be related to the switch of domains between the training and testing phases. For instance, some informal Twitter abbreviations may appear fully capitalized without representing NEs, whereas in Reddit and Stack Overflow fully-capitalized words are more likely to describe NEs.

Conclusion
We show that our multi-task neural network is capable of extracting relevant features from noisy user-generated text. We also show that a CRF classifier can boost the neural network results because it uses the whole sentence to predict the most likely set of labels. Additionally, our approach emphasizes the importance of POS tags in conjunction with gazetteers for NER tasks. Twitter word embeddings and orthographic character embeddings are also relevant for the task.
Finally, our ongoing work aims at improving these results by getting a better understanding of the strengths and weaknesses of our model. We also plan to evaluate the current system in related tasks where noise and emerging NEs are prevalent.