Transfer Learning and Sentence Level Features for Named Entity Recognition on Tweets

We present our system for the WNUT 2017 Named Entity Recognition challenge on Twitter data. We describe two modifications of a basic neural network architecture for sequence tagging. First, we show how we exploit additional labeled data, where the Named Entity tags differ from the target task. Then, we propose a way to incorporate sentence level features. Our system uses both methods and ranked second for entity level annotations, achieving an F1-score of 40.78, and second for surface form annotations, achieving an F1-score of 39.33.


Introduction
Named Entity Recognition (NER) is an important Natural Language Processing task. Its goal is to tag entities such as names of people and locations in text. State-of-the-art systems can achieve F1-scores of up to 92 points on English news texts (Chiu and Nichols, 2015). Achieving good performance on more complex domains such as user generated texts on social media is still a hard problem. The best system submitted for the WNUT 2016 shared task achieved an F1-score of 52.41 on English Twitter data (Strauss et al., 2016).
In this work, we present our submission for the WNUT 2017 shared task on "Novel and Emerging Entity Recognition" (Derczynski et al., 2017). We extend a basic neural network architecture for sequence tagging (Chiu and Nichols, 2015; Collobert et al., 2011) by incorporating sentence level feature vectors and exploiting additional labeled data for transfer learning. We build on and take inspiration from recent work by Falkner et al. (2017) and Sileo et al. (2017) on NER for French Twitter data (Lopez et al., 2017).
Our submitted solution reached an F1-score of 40.78 for entity level annotations and 39.33 for surface form annotations. This places us second on entity level annotations, where the best system achieved an F1-score of 41.86, and second on surface form annotations, where the best system achieved an F1-score of 40.24.

System Description
Our solution is based on a sequence labeling system that uses a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) to extract features for training a Conditional Random Field (Sutton and McCallum, 2012). We apply a transfer learning approach, since previous research has shown that this can improve sequence labeling systems (Yang et al., 2017). More precisely, we modify the base system to allow for joint training on the WNUT 2016 corpus (Strauss et al., 2016), which uses a different tag set than our target task. In addition, we extend the system to incorporate sentence level feature vectors. All these methods are combined to build the system that we used for our submission to the WNUT 2017 shared task. Figure 1 shows an overview of the different architectures, which are described in detail in the following sections.

Basic Sequence Labeling System
Figure 1a shows an overview of our base system. We use a bidirectional Long Short Term Memory network (LSTM) (Hochreiter and Schmidhuber, 1997) to learn the potential function for a linear chain Conditional Random Field (CRF) (Sutton and McCallum, 2012) to predict a sequence of Named Entity tags $y_{1:T}$ from a sequence of feature vectors $x_{1:T}$. This is based on an architecture previously used by Chiu and Nichols (2015), which was evaluated on the CoNLL 2003 data set (Tjong Kim Sang and De Meulder, 2003).
Bidirectional LSTM: For every word $w_t$ in a given input sentence $w_{1:T}$, we first compute a feature vector $x_t$, which is the concatenation of all the word level features described in Section 2.5. The sequence of feature vectors $x_{1:T}$ is then fed to a bidirectional LSTM. The outputs of the forward and backward LSTMs are concatenated to get $o_{1:T}$, which is passed through a Rectified Linear Unit (ReLU) (Nair and Hinton, 2010). Every $o_t \in o_{1:T}$ is then passed through a fully connected feed-forward network with one hidden layer and ReLU activation:

$$s_t = W_2 \, \mathrm{relu}(W_1 o_t + b_1) + b_2$$

Let $N_{tags}$ be the number of possible NER-tags, $d_o$ the dimension of $o_t$ and $d_h$ the dimension of the hidden layer. The parameter dimensions are then $W_1 \in \mathbb{R}^{d_h \times d_o}$, $b_1 \in \mathbb{R}^{d_h}$, $W_2 \in \mathbb{R}^{N_{tags} \times d_h}$ and $b_2 \in \mathbb{R}^{N_{tags}}$. The resulting vector $s_t \in \mathbb{R}^{N_{tags}}$ represents a score for every possible tag $y$ at time step $t$.

Conditional Random Field: A linear chain CRF models the conditional probability of an output sequence $y_{1:T}$ given an input sequence $x_{1:T}$ as:

$$P(y_{1:T} \mid x_{1:T}) = \frac{1}{Z(x_{1:T})} \prod_{t=1}^{T} \phi(y_{t-1}, y_t, x_{1:T}, t, \Theta)$$

where $Z(x_{1:T})$ is a normalization constant:

$$Z(x_{1:T}) = \sum_{y'_{1:T}} \prod_{t=1}^{T} \phi(y'_{t-1}, y'_t, x_{1:T}, t, \Theta)$$

$\phi$ is a potential function parametrized by a set of parameters $\Theta$. In our case we use:

$$\phi(y_{t-1}, y_t, x_{1:T}, t, \Theta = \{\theta, A\}) = \exp\left(A_{y_{t-1}, y_t} + s_{\theta, y_t, t}(x_{1:T})\right)$$

Let $\theta$ be the parameters of the network described above. Then $s_{\theta, y_t, t}(x_{1:T})$ is the score that the network parametrized by $\theta$ outputs for tag $y_t$ at time step $t$ given the input sequence $x_{1:T}$. $A \in \mathbb{R}^{N_{tags} \times N_{tags}}$ is a matrix such that $A_{i,j}$ is the score of transitioning from tag $i$ to tag $j$.
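To make the CRF objective concrete, the following is a minimal sketch in plain Python of the log-likelihood of a tag sequence under a linear chain CRF with per-step tag scores and a transition matrix. It is an illustration, not our TensorFlow implementation. The function names are hypothetical, and the sketch assumes that no transition score is applied at the first time step, since the handling of the initial tag is not specified above.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def path_score(scores, transitions, tags):
    """Unnormalized log-score of one tag sequence.

    scores[t][y] is the network score for tag y at step t,
    transitions[i][j] is the score of moving from tag i to tag j.
    Assumption: no transition score is added at the first step.
    """
    total = scores[0][tags[0]]
    for t in range(1, len(tags)):
        total += transitions[tags[t - 1]][tags[t]] + scores[t][tags[t]]
    return total


def log_partition(scores, transitions):
    """log Z, computed with the forward algorithm in log space."""
    n_tags = len(scores[0])
    alpha = list(scores[0])
    for t in range(1, len(scores)):
        alpha = [
            logsumexp([alpha[i] + transitions[i][j] for i in range(n_tags)])
            + scores[t][j]
            for j in range(n_tags)
        ]
    return logsumexp(alpha)


def log_likelihood(scores, transitions, tags):
    """log P(tags | input) = path score minus log partition function."""
    return path_score(scores, transitions, tags) - log_partition(scores, transitions)
```

Training maximizes this quantity for the true tag sequence. The forward recursion avoids enumerating all tag sequences when computing the normalization constant.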
Training: During training we maximize the likelihood of the true tag sequence $y_{1:T}$ given the input feature vectors $x_{1:T}$. We use the Adam algorithm (Kingma and Ba, 2014) to optimize the parameters $\Theta = \{\theta, A\}$. Additionally, we perform gradient clipping (Pascanu et al., 2012) and apply dropout (Srivastava et al., 2014) to the LSTM outputs $o_{1:T}$. The neural network parameters $\theta$ are randomly initialized from a normal distribution with mean zero and variance according to Glorot and Bengio (2010) (normal Glorot initialization). The transition scores $A$ are initialized from a uniform distribution with mean zero and variance according to Glorot and Bengio (2010) (uniform Glorot initialization).
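As an illustration of the two initialization schemes, the sketch below draws weight matrices with the variance proposed by Glorot and Bengio (2010), namely 2 / (fan_in + fan_out). The function names and the use of Python's `random` module are assumptions for this example. Deep learning frameworks ship their own initializers, which may differ in detail, for instance by truncating the normal distribution.

```python
import math
import random


def glorot_normal(fan_in, fan_out, rng):
    """Zero-mean normal samples with variance 2 / (fan_in + fan_out)."""
    std = math.sqrt(2.0 / (fan_in + fan_out))
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]


def glorot_uniform(fan_in, fan_out, rng):
    """Uniform samples on [-limit, limit] with the same variance.

    A uniform distribution on [-a, a] has variance a^2 / 3, so
    limit = sqrt(6 / (fan_in + fan_out)) gives variance 2 / (fan_in + fan_out).
    """
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)] for _ in range(fan_in)]
```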

Transfer Learning
In this setting we use the WNUT 2016 corpus (Strauss et al., 2016) as an additional source of labeled data. The idea is to train the shared layers of the neural network on both datasets to improve its generalization ability. Yang et al. (2017) showed that this can improve the system performance. Figure 1b gives an overview of our transfer learning architecture.
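Modified Architecture: We share all network layers except for the last linear projection to get separate tag scores for each data set:

$$s_t^{2016} = W_2^{2016} \, \mathrm{relu}(W_1 o_t + b_1) + b_2^{2016}, \qquad s_t^{2017} = W_2^{2017} \, \mathrm{relu}(W_1 o_t + b_1) + b_2^{2017}$$

The resulting tag scores are fed to separate CRFs, which have separate transition matrices $A^{2016}$ and $A^{2017}$.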
Training: During training we alternately use a batch from each dataset and backpropagate the loss of the corresponding CRF.
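The alternating schedule can be sketched as follows. This is a simplified illustration, not our training code: the names `alternate_batches`, `train_epoch` and `train_steps` are hypothetical, and stopping once the shorter dataset is exhausted is an assumption, since the behaviour for datasets of unequal size is not specified above.

```python
def alternate_batches(batches_2016, batches_2017):
    """Yield (dataset_name, batch) pairs, alternating between the two corpora.

    Assumption: iteration stops when the shorter corpus runs out of batches.
    """
    for batch_2016, batch_2017 in zip(batches_2016, batches_2017):
        yield "wnut2016", batch_2016
        yield "wnut2017", batch_2017


def train_epoch(batches_2016, batches_2017, train_steps):
    """Run one epoch of alternating updates.

    train_steps maps a dataset name to a callable that performs one update
    using the CRF loss of that dataset's output head and returns the loss.
    The shared layers receive gradients from both heads.
    """
    losses = []
    for name, batch in alternate_batches(batches_2016, batches_2017):
        losses.append((name, train_steps[name](batch)))
    return losses
```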

Incorporating Sentence Level Features
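Figure 1c shows how we include sentence level features into our architecture. In this setting we take an additional feature vector $f_{sent} = F(x_{1:T}) \in \mathbb{R}^{d_{sent}}$ for each input sentence $x_{1:T}$.
Modified Architecture: We use an additional feed-forward network to extract tag scores $s_{sent} \in \mathbb{R}^{N_{tags}}$ from the sentence feature vector $f_{sent}$:

$$s_{sent} = W_{2,sent} \, \mathrm{relu}(W_{1,sent} f_{sent} + b_{1,sent}) + b_{2,sent}$$

The dimensions used are: $W_{1,sent} \in \mathbb{R}^{d_{h,sent} \times d_{sent}}$, $b_{1,sent} \in \mathbb{R}^{d_{h,sent}}$, $W_{2,sent} \in \mathbb{R}^{N_{tags} \times d_{h,sent}}$ and $b_{2,sent} \in \mathbb{R}^{N_{tags}}$. The value $d_{h,sent}$ is the dimension of the hidden layer of the feed-forward network. Let $s_{1:T,word}$ be the scores that the basic network described in Section 2.1 outputs for sequence $x_{1:T}$. To get the final scores $s_{1:T}$ fed to the CRF, we add $s_{sent}$ to every $s_{t,word} \in s_{1:T,word}$:

$$s_t = s_{sent} + s_{t,word}$$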
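The combination rule s_t = s_sent + s_t,word amounts to adding the same sentence level score vector to the word level scores at every time step. The sketch below illustrates this in plain Python. The helper names are hypothetical, and the single hidden layer with ReLU activation in `sentence_scores` is an assumption made in analogy to the word level network.

```python
def relu(vector):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in vector]


def affine(weights, vector, bias):
    """Compute weights @ vector + bias for a list-of-rows matrix."""
    return [
        sum(w * v for w, v in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


def sentence_scores(f_sent, w1, b1, w2, b2):
    """Map a sentence feature vector to one score per tag.

    Assumption: one hidden layer with ReLU activation.
    """
    return affine(w2, relu(affine(w1, f_sent, b1)), b2)


def add_sentence_scores(word_scores, s_sent):
    """Add the sentence level tag scores to the scores of every time step."""
    return [[sw + ss for sw, ss in zip(s_t, s_sent)] for s_t in word_scores]
```

Because the same vector is added at every position, the sentence features act as a sentence-wide bias on the tag scores rather than changing the relative scores between positions.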

Combined System
The combined system adds the sentence level features to the transfer learning architecture. We share all layers except the linear projections to tag scores for both sentence features and word features in a manner analogous to Sections 2.2 and 2.3. The resulting architecture is shown in Figure 1d.

Features
Word Embeddings: The word embeddings form an embedding matrix $E_{word} \in \mathbb{R}^{N_{vocab} \times 100}$, where $N_{vocab}$ is the number of unique tokens in the WNUT 2016 and WNUT 2017 corpora. FastText predicts embedding vectors for words that were out-of-vocabulary during training by considering character n-grams of the word. The embedding matrix $E_{word}$ is not updated during training.
Word Capitalization Features: Following Chiu and Nichols (2015), we add explicit capitalization features, since capitalization information is lost during word embedding lookups. The 6 feature options are: all capitalized, uppercase initial, all lower cased, mixed capitalization, emoji and other. An embedding matrix $E_{wordCap} \in \mathbb{R}^{6 \times d_{wordCap}}$ is used to feed these features to the network and is updated during training via backpropagation. $E_{wordCap}$ is initialized using normal Glorot initialization.
Character Convolution Features: A convolutional neural network is used to extract additional character level features. Its architecture is shown in Figure 2. First, we add special padding tokens on both sides of the character sequence $w$ to extend it to a target length $l_{w,max}$. If there is an odd number of paddings, the additional padding is added on the right. For sequences longer than $l_{w,max}$, only the first $l_{w,max}$ characters are used. An embedding matrix $E_{char} \in \mathbb{R}^{N_c \times d_c}$ maps characters to vectors in $\mathbb{R}^{d_c}$. $N_c$ is the number of unique characters in the dataset with the addition of the padding token.
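The word capitalization categories and the character padding scheme described above can be sketched as follows. The category names, the padding token, the order of the checks and the code point ranges used to detect emoji are all assumptions made for this illustration, since the exact rules are not spelled out in the text.

```python
PAD = "<pad>"  # hypothetical padding token


def is_emoji(token):
    """Crude emoji check over a few common code point ranges (assumption)."""
    return any(
        0x1F300 <= ord(ch) <= 0x1FAFF or 0x2600 <= ord(ch) <= 0x27BF
        for ch in token
    )


def word_cap_feature(token):
    """Assign one of the 6 capitalization categories to a token."""
    if is_emoji(token):
        return "emoji"
    if token.isupper():
        return "all_caps"
    if token.islower():
        return "all_lower"
    if token[:1].isupper() and token[1:].islower():
        return "upper_initial"
    if any(ch.isupper() for ch in token) and any(ch.islower() for ch in token):
        return "mixed"
    return "other"


def pad_chars(word, max_len):
    """Truncate to max_len characters, then pad on both sides.

    If the number of padding tokens is odd, the extra one goes on the right.
    """
    chars = list(word)[:max_len]
    total = max_len - len(chars)
    left = total // 2
    right = total - left
    return [PAD] * left + chars + [PAD] * right
```

For example, `pad_chars("abc", 6)` places one padding token on the left and two on the right.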
Using $E_{char}$, we embed the padded sequence $w$ and get $C_w \in \mathbb{R}^{l_{w,max} \times d_c}$. A set of $m$ convolution filters, each in $\mathbb{R}^{d_c \times h}$, is then applied to $C_w$. This results in $m$ feature maps $M_i \in \mathbb{R}^{l_{w,max} - h + 1}$, which are passed through a ReLU activation. The final feature vector $F \in \mathbb{R}^m$ is obtained by max pooling, such that $F_i = \max_j M_{i,j}$. The embedding matrix is initialized using uniform Glorot initialization. The $m$ convolution filters are initialized using normal Glorot initialization.
Character Capitalization Convolution Features: Analogous to the word capitalization features, we use additional character capitalization features. The feature options are: upper, lower, punctuation, numeric and other. We apply a neural network with the same architecture as described above to extract the final character capitalization feature vector.
Sentence Embeddings: Pagliardini et al. (2017) introduce sent2vec, a method for computing sentence embeddings. They show that these embeddings provide improved performance for several downstream tasks.
To train the sent2vec model, we use the same training set as the one used for word embeddings and we use default values for all the model parameters 2. In particular, the resulting sentence feature vectors are in $\mathbb{R}^{100}$.
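Returning to the character convolution features described earlier in this section, the following sketch shows the convolution, ReLU and max pooling steps in plain Python. It assumes filters without a bias term, which is not specified above, and the function names are hypothetical.

```python
def conv_feature_map(char_matrix, conv_filter):
    """Apply one filter to an embedded character sequence.

    char_matrix has shape (l, d_c): one embedding row per character.
    conv_filter has shape (d_c, h): embedding dimension by filter width.
    Returns a ReLU-activated feature map of length l - h + 1.
    Assumption: no bias term.
    """
    length = len(char_matrix)
    d_c = len(conv_filter)
    width = len(conv_filter[0])
    feature_map = []
    for start in range(length - width + 1):
        total = sum(
            char_matrix[start + k][d] * conv_filter[d][k]
            for k in range(width)
            for d in range(d_c)
        )
        feature_map.append(max(0.0, total))
    return feature_map


def char_cnn_features(char_matrix, filters):
    """Max-pool every feature map to get one value per filter."""
    return [max(conv_feature_map(char_matrix, f)) for f in filters]
```

Max pooling over the positions makes the resulting vector independent of where in the word a pattern occurs, and its length depends only on the number of filters.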

Experiments
We implemented the system described in Section 2.4 using the TensorFlow framework 3.
We monitored the system's performance during training and aborted experiments that had an F1-score of less than 40 after two epochs (evaluated on the development set). We let successful experiments run for the full 6 epochs (cf. Section 3.2). For the submission to WNUT 2017, we ran 6 successful experiments and submitted the one which

2 https://github.com/epfml/sent2vec
3 https://www.tensorflow.org/

Preprocessing
Tokenization: Since the WNUT 2016 and WNUT 2017 corpora are in the CoNLL format, they are already tokenized. To tokenize the additional tweets used for training word and sentence embeddings (cf. Section 2.5), we use the Twitter tokenizer provided by the Python NLTK library.
Token Substitution: We perform some simple pattern-based token substitutions. To normalize Twitter user handles, we substitute every word starting with an @ character by a special user token. Similarly, all words starting with the prefix http are replaced by a url token. Finally, for words longer than one character, we remove up to one initial # character.

Table 1 shows the parameters used for training the model.
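The token substitutions described above can be sketched as a single function applied to every token. The literal strings chosen for the user and url tokens are hypothetical placeholders, since the actual token strings are not given in the text.

```python
USER_TOKEN = "<user>"  # hypothetical placeholder string
URL_TOKEN = "<url>"    # hypothetical placeholder string


def substitute_token(token):
    """Normalize user handles, URLs and hashtags."""
    if token.startswith("@"):
        return USER_TOKEN
    if token.startswith("http"):
        return URL_TOKEN
    # Remove at most one leading '#', and only for tokens longer than one character.
    if len(token) > 1 and token.startswith("#"):
        return token[1:]
    return token


def preprocess(tokens):
    """Apply the substitutions to an already tokenized tweet."""
    return [substitute_token(t) for t in tokens]
```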

Experiments Performed After The Submission
Following the submission, we conducted additional experiments to investigate the influence of the transfer learning approach and sent2vec features on the system performance. For each of the 4 systems described in Section 2, we ran 6 experiments. We use the same parameters as shown in Section 3.2.

Results
Table 2 shows precision, recall and F1-score of our system. We compute the mean and standard deviation over the 6 successful experiments we considered for submission (cf. Section 3). Table 3 shows the breakdown of the performance of the annotations we submitted for the WNUT 2017 shared task. Table 4 shows the performance of the different subsystems proposed in Section 2. For every system, we report the mean and standard deviation over the 6 experiments we performed after submission.
All reported scores were computed using the evaluation script provided by the task organizers.

Discussion
From Table 4 we can see that using sent2vec features increases precision and decreases recall slightly, leading to an overall lower performance compared to the basic system. The transfer learning system shows a more substantial decrease in precision and increase in recall, and overall performs best out of the 4 systems. Combining the two approaches is counterproductive: the combined system outperforms the basic system only slightly and does not reach the performance of transfer learning alone.
During training we observed that restarting experiments as described in Section 3 was only necessary when using sent2vec features.
One weakness of our transfer learning setting is that the two datasets we used contain almost identical samples and differ only in their annotations. The WNUT 2016 corpus uses 10 entity classes: company, facility, geo-location, movie, music artist, other, person, product, sports team, and TV show. Further work is needed to study the effect of using an unrelated data set for transfer learning.

Conclusion
We described a deep learning approach for Named Entity Recognition on Twitter data, which extends a basic neural network for sequence tagging by using sentence level features and transfer learning. Our approach achieved 2nd place at the WNUT 2017 shared task for Named Entity Recognition, obtaining an F1-score of 40.78.
For future work, we plan to explore the power of transfer learning for NER in more depth. For instance, it would be interesting to see how annotated NER data for other languages or other text types affects the system performance.