Multimedia Lab @ ACL WNUT NER Shared Task: Named Entity Recognition for Twitter Microposts using Distributed Word Representations

Due to the short and noisy nature of Twitter microposts, detecting named entities is often a cumbersome task. As part of the ACL 2015 Named Entity Recognition (NER) shared task, we present a semi-supervised system that detects 10 types of named entities. To that end, we leverage 400 million Twitter microposts to generate powerful word embeddings as input features and use a neural network to perform the classification. To further boost the performance, we employ dropout to train the network and use leaky Rectified Linear Units (ReLUs). Our system achieved fourth position in the final ranking, without using any kind of hand-crafted features such as lexical features or gazetteers.


Introduction
Users on Online Social Networks such as Facebook and Twitter have the ability to share microposts with their friends or followers. These microposts are short and noisy, and are therefore much more difficult to process for existing Natural Language Processing (NLP) pipelines. Moreover, due to the informal and contemporary nature of these microposts, they often contain Named Entities (NEs) that are not part of any gazetteer.
In this challenge, we tackled Named Entity Recognition (NER) in microposts. The goal was to detect named entities and classify them into one of the following 10 categories: company, facility, geolocation, music artist, movie, person, product, sports team, tv show and other entities. To do so, we only used word embeddings that were automatically inferred from 400 million Twitter microposts as input features. Next, these word embeddings were used as input to a neural network to classify the words in the microposts. Finally, a post-processing step was executed to check for inconsistencies, given that we classified on a word-per-word basis and that a named entity can span multiple words. An overview of the task can be found in Baldwin et al. (2015).
The challenge consisted of two subtasks. For the first subtask, the participants only needed to detect NEs without categorizing them. For the second subtask, the NEs also needed to be categorized into one of the 10 categories listed above. Throughout the remainder of this paper, only the latter subtask will be considered, given that solving subtask two makes subtask one trivial.

Related Work
NER in news articles gained substantial popularity with the CoNLL 2003 shared task, where the challenge was to classify four types of NEs: persons, locations, organizations and a set of miscellaneous entities (Tjong Kim Sang and De Meulder, 2003). However, all systems used hand-crafted features such as lexical features, look-up tables and corpus-related features. These systems provide good performance at a high engineering cost and need a lot of annotated training data (Nadeau and Sekine, 2007). Therefore, a lot of effort is needed to adapt them to other types of corpora.
More recently, semi-supervised systems have been shown to achieve near state-of-the-art results with much less effort (Turian et al., 2010; Collobert et al., 2011). These systems first learn word representations from large corpora in an unsupervised way and then use these word representations as input features for supervised training, instead of hand-crafted input features. There are three major types of word representations: distributional, clustering-based and distributed word representations, where the last type is also known as a word embedding. A very popular and fast-to-train word embedding is the word2vec representation of Mikolov et al. (2013). When complemented with traditional hand-crafted features, word representations can yield F1-scores of up to 91% (Tkachenko and Simanovsky, 2012).
However, when applied to Twitter microposts, the F1-score drops significantly. For example, Liu et al. (2011) report an F1-score of 45.8% when applying the Stanford NER tagger to Twitter microposts, and Ritter et al. (2011) even report an F1-score of 29% on their Twitter micropost dataset. Therefore, many researchers (Cano et al., 2013; Cano et al., 2014) trained new systems on Twitter microposts, but mainly relied on cost-intensive hand-crafted features, sometimes complemented with cluster-based features. Therefore, in this paper, we will investigate the power of word embeddings for NER applied to microposts. Although adding hand-crafted features such as lexical features or gazetteers would probably improve our F1-score, we will only focus on word embeddings, given that this approach can be easily applied to different corpora, thus quickly leading to good results.

System Overview
The system proposed for tackling this challenge consists of three steps. First, the individual words are converted into word representations. For this, only the word embeddings of Mikolov et al. (2013) are used. Next, we feed the word representations to a Feed-Forward Neural Network (FFNN) to classify the individual words with a matching tag. Finally, we execute a simple, rule-based post-processing step in which we check the coherence of individual tags within a Named Entity (NE).

Creating Feature Representations
Recently, Mikolov et al. (2013) introduced an efficient way for inferring word embeddings that are effective in capturing syntactic and semantic relationships in natural language. In general, a word embedding of a particular word is inferred by using the previous or future words within a number of microposts/sentences. Mikolov et al. (2013) proposed two architectures for doing this: the Continuous Bag Of Words (CBOW) model and the Skip-gram model.
To infer the word embeddings, a large dataset of microposts is used. The algorithm iterates a number of times over this dataset while updating the word embeddings of the words within the vocabulary of the dataset. The final result is a look-up table which can be used to convert every word w(t) into a feature vector w_e(t). If the word is not in the vocabulary, a vector containing only zeros is used.
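The look-up step can be sketched as follows (a toy Python example; the vocabulary, vector values and dimensionality here are purely illustrative, whereas the real table maps roughly three million words to 400-dimensional vectors):

```python
import numpy as np

DIM = 4  # illustration only; the paper uses 400-dimensional embeddings

# Hypothetical toy look-up table; in the paper it is produced by
# word2vec over 400 million microposts.
embeddings = {
    "obama": np.array([0.1, 0.3, -0.2, 0.5]),
    "visits": np.array([0.0, -0.1, 0.4, 0.2]),
    "paris": np.array([0.2, 0.1, 0.1, -0.3]),
}

def embed(word):
    """Return the embedding w_e(t) of a word w(t), or an all-zero
    vector if the word is out of vocabulary, as in the paper."""
    return embeddings.get(word.lower(), np.zeros(DIM))
```

Lower-casing before look-up is an assumption; the paper does not state how casing is handled.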

Neural Network Architecture
Based on the successful application of Feed-Forward Neural Networks (FFNN) using word embeddings as input features for both recognizing NEs in news articles (Turian et al., 2010; Collobert et al., 2011) and Part-of-Speech tagging of Twitter microposts (Godin et al., 2014), a FFNN is used as the underlying classification algorithm. Because a NE can consist of multiple words, the BIO (Begin, Inside, Outside NE) notation is used to classify the words. Given that each of the 10 NE categories has a Begin and an Inside tag, plus a single Outside tag, the FFNN assigns to every word w(t) a tag tag(t) out of 21 different tags.
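The resulting tag inventory can be enumerated as follows (category names are illustrative simplifications; the exact labels used in the shared task data may differ):

```python
# The 10 entity categories listed in the introduction (names shortened
# here for illustration).
CATEGORIES = [
    "company", "facility", "geoloc", "musicartist", "movie",
    "person", "product", "sportsteam", "tvshow", "other",
]

# One Begin and one Inside tag per category, plus a single Outside tag,
# gives the 21 tags the FFNN chooses from.
TAGS = ["O"] + [p + "-" + c for c in CATEGORIES for p in ("B", "I")]
```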
Because the tag tag(t) of a word w(t) is also determined by the surrounding words, a context window centered around w(t) that contains an odd number of words is used. As shown in Figure 1, the corresponding word embeddings w_e(t) are concatenated and used as input to the FFNN. This is the input layer of the neural network.
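Building the concatenated input vector can be sketched as follows (zero-padding at micropost boundaries is an assumption; the paper does not specify how boundaries are handled):

```python
import numpy as np

DIM = 4  # toy dimensionality; the paper uses 400

def window_features(vectors, t, size=3):
    """Concatenate the embeddings w_e(.) of the words in a context
    window of odd `size` centred on position t; positions outside the
    micropost are zero-padded (an assumption)."""
    half = size // 2
    pad = np.zeros(DIM)
    parts = [vectors[i] if 0 <= i < len(vectors) else pad
             for i in range(t - half, t + half + 1)]
    return np.concatenate(parts)
```

For a window of five words and 400-dimensional embeddings, as in the paper's best configuration, the input layer thus has 2000 units.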
The main design parameters of the neural network are the type of activation function, the number of hidden units and the number of hidden layers. We considered two types of activation functions, the classic tanh activation function and the newer (leaky) Rectified Linear Units (ReLUs). The output layer is a standard softmax function.

Figure 1: High-level illustration of the FFNN that classifies each word as part of one of the 10 named entity classes. At the input, a micropost containing four words is given. The different words w(t) are first converted into feature representations w_e(t) using a look-up table of word embeddings. Next, a feature vector is constructed for each word by concatenating all the feature representations w_e(t) of the words within the context window. In this example, a context window of size three is used. One by one, these concatenated vectors are fed to the FFNN. In this example, a one-hidden-layer FFNN is used. The output of the FFNN is the tag tag(t) of the corresponding word w(t).

Postprocessing the Neural Network Output
Given that NEs can span multiple words and given that the FFNN classifies individual words, we apply a postprocessing step after a micropost is completely classified to correct inconsistencies. The tags of the words are changed according to the following two rules:

• If the NE does not start with a word that has a B(egin)-tag, we select the word before the word with the I(nside)-tag, replace its O(utside)-tag with a B-tag and copy the category of the I-tag.

• If the individual words of a NE have different categories, we select the most frequently occurring category. In case of a tie, we select the category of the last word within the NE.
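A Python sketch of these two rules (how an I-tag at the very start of a micropost is handled, and how ties among more than two categories are broken, are assumptions):

```python
from collections import Counter

def postprocess(tags):
    """Apply the two consistency rules to the tag sequence of one
    micropost; tags look like 'O', 'B-person', 'I-person', ..."""
    tags = list(tags)
    # Rule 1: an I-tag whose preceding tag is O starts mid-entity;
    # promote the preceding O-tag to a B-tag of the same category.
    # An I-tag at position 0 is itself turned into a B-tag (assumption).
    for i, tag in enumerate(tags):
        if tag.startswith("I-"):
            if i == 0:
                tags[0] = "B-" + tag[2:]
            elif tags[i - 1] == "O":
                tags[i - 1] = "B-" + tag[2:]
    # Rule 2: within one entity span, keep the most frequent category;
    # on a tie, the category of the last tied word wins (assumption on
    # tie-breaking among more than two categories).
    i = 0
    while i < len(tags):
        if tags[i].startswith("B-"):
            j = i + 1
            while j < len(tags) and tags[j].startswith("I-"):
                j += 1
            cats = [t[2:] for t in tags[i:j]]
            counts = Counter(cats)
            best = max(counts.values())
            cat = [c for c in cats if counts[c] == best][-1]
            tags[i] = "B-" + cat
            for k in range(i + 1, j):
                tags[k] = "I-" + cat
            i = j
        else:
            i += 1
    return tags
```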

Dataset
The challenge provided us with three different datasets: train, dev and dev_2015. These datasets contain 1795, 599 and 420 microposts, with 1140, 356 and 272 NEs, respectively. The train and dev datasets came from the same period and therefore have some overlap in NEs. Moreover, they contain the complete dataset of Ritter et al. (2011). The microposts within dev_2015, however, were sampled more recently and resemble the test set of this challenge. The test set consists of 1000 microposts, containing 661 NEs. The train and dev datasets are used as training set throughout the experiments and the dev_2015 dataset is used as development set. For inferring the word embeddings, a set of raw Twitter microposts was used, collected during 300 days using the Twitter Streaming API, from 1/3/2013 till 28/2/2014. After removing all non-English microposts using the micropost language classifier of Godin et al. (2013), 400 million raw English Twitter microposts remained.

Preprocessing the Data
For preprocessing the 400 million microposts, we used the same tokenizer as Ritter et al. (2011). Additionally, we used replacement tokens for URLs, mentions and numbers on both the challenge dataset and the 400 million microposts we collected. However, we did not replace hashtags, as doing so was experimentally found to decrease accuracy.
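The replacement step could look roughly as follows (the actual regular expressions and placeholder tokens are not specified in the paper; those below are assumptions):

```python
import re

# Hypothetical patterns and placeholder tokens for the three
# replacements described in the text.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NUMBER_RE = re.compile(r"\b\d+([.,]\d+)?\b")

def replace_tokens(text):
    """Replace URLs, mentions and numbers with placeholder tokens."""
    text = URL_RE.sub("<url>", text)
    text = MENTION_RE.sub("<mention>", text)
    text = NUMBER_RE.sub("<number>", text)
    return text  # hashtags are deliberately left untouched
```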

Training the Model
The model was trained in two phases. First, the look-up table containing per-word feature vectors was constructed. To that end, we applied the word2vec software (v0.1c) of Mikolov et al. (2013) to our preprocessed dataset of 400 million Twitter microposts to generate word embeddings. Next, we trained the neural network. To that end, we used the Theano library (v0.6) (Bastien et al., 2012), which made it easy to use our NVIDIA Titan Black GPU. We used mini-batch stochastic gradient descent with a batch size of 20, a learning rate of 0.01 and a momentum of 0.5. We used the standard negative log-likelihood cost function to update the weights. We used dropout on both the input and hidden layers to prevent overfitting and used (Leaky) Rectified Linear Units (ReLUs) as hidden units (Srivastava et al., 2014). To do so, we used the implementation of the Lasagne library (https://github.com/Lasagne/Lasagne). We trained the neural network on both the train and dev datasets and iterated until the accuracy on the dev_2015 set no longer improved.
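A minimal NumPy sketch of such a network's forward pass (toy dimensions; inverted-dropout scaling is a modern convention used here as an assumption, and the momentum-based gradient updates are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(x, leak=0.01):
    # Leaky ReLU with the leak rate used in the paper's experiments.
    return np.where(x > 0, x, leak * x)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Toy dimensions; the real model has 5 * 400 = 2000 inputs,
# 500 hidden units and 21 output tags, with batches of 20.
n_in, n_hid, n_out, batch = 8, 16, 21, 20
W1 = rng.normal(0, 0.1, (n_in, n_hid))
b1 = np.zeros(n_hid)
W2 = rng.normal(0, 0.1, (n_hid, n_out))
b2 = np.zeros(n_out)

def forward(X, train=True, p_drop=0.5):
    """Forward pass with dropout on the input and hidden layer.
    Inverted-dropout scaling at train time is an assumption; the
    paper does not describe its exact dropout implementation."""
    if train:
        X = X * (rng.random(X.shape) > p_drop) / (1 - p_drop)
    h = leaky_relu(X @ W1 + b1)
    if train:
        h = h * (rng.random(h.shape) > p_drop) / (1 - p_drop)
    return softmax(h @ W2 + b2)

def nll(probs, y):
    """Mean negative log-likelihood of the gold tag indices y."""
    return -np.log(probs[np.arange(len(y)), y] + 1e-12).mean()

X = rng.normal(size=(batch, n_in))
probs = forward(X, train=False)  # shape (20, 21), rows sum to one
```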

Baseline
To evaluate our system, we made use of two different baselines. The word embeddings and the neural network architecture were evaluated in terms of word-level accuracy. For these components, the baseline system simply assigned the O-tag to every word, yielding an accuracy of 93.53%. For the postprocessing step and the overall system evaluation, we made use of the baseline provided by the challenge, which performs an evaluation at the level of NEs. This baseline system uses lexical features.

Table 1: Evaluation of the influence of the context window size of the word embeddings on the accuracy of predicting NER tags using a neural network with an input window of five words, 500 hidden Leaky ReLU units and dropout. All word embeddings are inferred using negative sampling and a Skip-gram architecture, and have a vector size of 400. The baseline accuracy is achieved when tagging all words of a micropost with the O-tag.

Word Embeddings
For inferring the word embeddings of the 400 million microposts, we mainly followed the suggestions of Godin et al. (2014), namely that the best word embeddings are inferred using a Skip-gram architecture and negative sampling. We used the default parameter settings of the word2vec software, except for the context window. As noted by Bansal et al. (2014), the type of word embedding created depends on the size of the context window. In particular, a bigger context window creates topic-oriented embeddings while a smaller context window creates syntax-oriented embeddings. Therefore, we trained an initial version of our neural network using an input window of five words and 300 hidden nodes, and evaluated the quality of the word embeddings based on the classification accuracy on the dev_2015 dataset. The results of this evaluation are shown in Table 1. Although the differences are small, a smaller context window consistently gave better results.
Additionally, we evaluated the vector size. As a general rule of thumb, the larger the word embeddings, the better the classification (Mikolov et al., 2013). However, too many parameters and too few training examples lead to suboptimal results and poor generalization. We chose word embeddings of size 400 because smaller embeddings were experimentally shown to capture less detail, resulting in a lower accuracy, while larger word embeddings made the model too complex to train.
The final word2vec word embeddings model has a vocabulary of 3,039,345 words and word representations of dimensionality 400. The model was trained using the Skip-gram architecture and negative sampling (k = 5) for five iterations, with a context window of one and subsampling with a factor of 0.001. Additionally, to be part of the vocabulary, a word must occur at least five times in the corpus.

The Neural Network Architecture
The next step is to evaluate the Neural Network Architecture. The most important parameters are the size of the input layer, the size of the hidden layers and the type of hidden units. Although we experimented with multiple hidden layers, the accuracy did not improve. We surmise that (1) the training set is too small and that (2) the word embeddings already contain a fixed summary of the information and therefore limit the feature learning capabilities of the neural network. Note that the word embeddings at the input layer can also be seen as a (fixed) hidden layer.
First, we evaluated the activation function and the effectiveness of dropout (p = 0.5). We compared the classic tanh function and the leaky ReLU with a leak rate of 0.01. As can be seen in Table 2, both activation functions perform equally well when no dropout is applied during training. However, when dropout is applied, the gap between the two configurations becomes larger. The combination of ReLUs and dropout appears to be the best one, compared to the classic configuration of a neural network.
Next, we will evaluate a number of neural network configurations for which we varied the input layer size and the hidden layer size. The results are depicted in Table 3. Although the differences are small, the best configuration seems to be a neural network with five input words and 500 hidden nodes. Finally, we evaluate our best model on the NE level instead of on the word level. To that end, we calculated the F1-score of our best model using the provided evaluation script. The F1-score of our best model on dev_2015 is 45.15%, which is an absolute improvement of almost 11% and an error reduction of 16.52% over the baseline (34.29%) provided by the challenge.

Postprocessing
As a last step, we corrected the output of the neural network for inconsistencies because our classifier does not see the labels of the neighbouring words. As can be seen in Table 4, the postprocessing causes a significant error reduction, yielding a final F1-score of 49.09% on the dev_2015 development set, and an error reduction of 22.52% over the baseline.
Additionally, we also report the F1-score on the dev_2015 development set for subtask one, where the task was to only detect the NEs but not to categorize them into the 10 different categories. For this, we retrained the model with the best parameters of subtask two. The results are shown in Table 5.
If we compare both subtasks, we see that the neural network achieves a similar error reduction for both, but that the postprocessing step yields a larger error reduction for subtask two. In other words, a common mistake of the neural network is to assign different categories to different words within one NE. These mistakes are easily corrected by the postprocessing step.

Conclusion
In this paper, we presented a system that applies Named Entity Recognition (NER) to microposts. Given that microposts are short and noisy compared to news articles, we did not want to invest time in crafting new features that would improve NER for microposts. Instead, we implemented the semi-supervised architecture of Collobert et al. (2011) for NER in news articles. This architecture only relies on good word embeddings inferred from a large corpus and a simple neural network. To realize this system, we used the word2vec software to quickly generate powerful word embeddings over 400 million Twitter microposts. Additionally, we employed a state-of-the-art neural network for classification, using leaky Rectified Linear Units (ReLUs) and dropout during training, showing a significant benefit over classic neural network configurations. Finally, we checked the output for inconsistencies when categorizing the named entities.
Our word2vec word embeddings trained on 400 million microposts are released to the community and can be downloaded at http://www.fredericgodin.com/software/.