Multi-channel BiLSTM-CRF Model for Emerging Named Entity Recognition in Social Media

In this paper, we present our multi-channel neural architecture for recognizing emerging named entities in social media messages, which we applied in the Novel and Emerging Named Entity Recognition shared task at the EMNLP 2017 Workshop on Noisy User-generated Text (W-NUT). We propose a novel approach that incorporates comprehensive word representations with multi-channel information and Conditional Random Fields (CRF) into a traditional Bidirectional Long Short-Term Memory (BiLSTM) neural network, without using any additional hand-crafted features such as gazetteers. Among the systems participating in the shared task, our system ranked 2nd.


Introduction
Named entity recognition (NER) is one of the first and most important steps in Information Extraction pipelines. Generally, its goal is to identify mentions of entities (persons, locations, organizations, etc.) within unstructured text. However, the diverse and noisy nature of user-generated content, together with emerging entities that have novel surface forms, makes NER in social media messages more challenging.
The first challenge brought by user-generated content is its unique characteristics: it is short, noisy and informal. For instance, tweets are typically short since the number of characters is restricted to 140, and people indeed tend to post short messages even on social media without such restrictions, such as YouTube comments and Reddit. Hence, the contextual information in a sentence is very limited. Apart from that, the use of colloquial language makes it more difficult to reuse existing NER approaches, which mainly focus on general-domain, formal text (Baldwin et al., 2015; Derczynski et al., 2015).
* The two authors made equal contributions.
Another challenge of NER in noisy text is that user-generated text contains large numbers of emerging named entities and rare surface forms, which tend to be tougher to detect (Augenstein et al., 2017); recall is thus a significant problem (Derczynski et al., 2015). For example, the surface form "kktny" in the tweet "so.. kktny in 30 mins?" actually refers to a new TV series called "Kourtney and Kim Take New York", which even human experts found hard to recognize. Additionally, netizens quite often mention entities using rare morphs as surface forms. For example, "black mamba", the name of a venomous snake, is actually a morph that Kobe Bryant created for himself for his aggressiveness in playing basketball games. Such morphs and rare surface forms are also very difficult to detect and classify.
The goal of this paper is to present our system participating in the Novel and Emerging Named Entity Recognition shared task at the EMNLP 2017 Workshop on Noisy User-generated Text (W-NUT 2017), which aims for NER in such noisy user-generated text. We investigate a multi-channel BiLSTM-CRF neural network model in our participating system, which is described in Section 3. The details of our implementation are presented in Section 4, where we also present some conclusions from our experiments.

Problem Definition
NER is a classic sequence labeling problem, in which we are given a sentence in the form of a sequence of tokens $w = (w_1, w_2, \ldots, w_n)$ and are required to output a sequence of token labels $y = (y_1, y_2, \ldots, y_n)$. In this specific task, we use the standard BIO2 annotation, and each named entity chunk is classified into one of 6 categories, namely Person, Location (including GPE, facility), Corporation, Consumer good (tangible goods, or well-defined services), Creative work (song, movie, book, and so on) and Group (subsuming music band, sports team, and non-corporate organizations).
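To make the BIO2 scheme concrete, the following minimal sketch converts entity spans into per-token tags. The helper name `spans_to_bio2`, the span format (start index, exclusive end index, label) and the label strings are assumptions made for illustration; they are not taken from the shared task's tooling.

```python
def spans_to_bio2(tokens, spans):
    """Convert entity spans to BIO2 tags.

    `spans` is a list of (start, end, label) triples over token indices,
    with `end` exclusive. This format is an assumption for illustration.
    """
    tags = ["O"] * len(tokens)           # tokens outside any entity get "O"
    for start, end, label in spans:
        tags[start] = "B-" + label       # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = "I-" + label       # remaining tokens of the chunk
    return tags


# Example tweet from the introduction; the label string is hypothetical.
tokens = ["so", "..", "kktny", "in", "30", "mins", "?"]
tags = spans_to_bio2(tokens, [(2, 3, "creative-work")])
print(list(zip(tokens, tags)))
```

In BIO2, every chunk starts with a B- tag, even when it consists of a single token, so the sequence labeler only has to predict one tag per token.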

Approach
In this section, we first give an overview of our proposed model and then present each part of the model in detail. Figure 1 shows the overall structure of our proposed model. Instead of solely using the original pretrained word embeddings as the final word representations, we construct a comprehensive word representation for each word in the input sentence. These comprehensive word representations contain the character-level sub-word information, the original pretrained word embeddings and multiple syntactical features. We then feed them into a bidirectional LSTM layer, which yields a hidden state for each word. The final CRF layer treats these hidden states as the feature vectors of the words, and from it we decode the final predicted tag sequence for the input sentence.

Comprehensive Word Representations
In this subsection, we present our proposed comprehensive word representations. We first build character-level word representations from the embeddings of every character in each word using a bidirectional LSTM. We then further enrich the word representation with embeddings of the syntactical information of each token, such as the part-of-speech tag, the dependency role, the word position in the sentence and the head position. Finally, we combine the original word embeddings with the above two parts to obtain the final comprehensive word representations.
Figure 1: Overview of our approach.

Character-level Word Representations
In noisy user-generated text analysis, sub-word (character-level) information is much more important than in normal text analysis, for two main reasons: 1) People are more likely to use novel abbreviations and morphs to mention entities, which are often out of vocabulary and occur only a few times. Thus, solely using the original word-level word embeddings as features to represent words is not adequate to capture the characteristics of such mentions. 2) Character-level word representations can capture the orthographic or morphological information of both formal words and Internet slang. There are two main network structures for making use of character embeddings: one is CNN (Ma and Hovy, 2016) and the other is BiLSTM (Lample et al., 2016). BiLSTM turned out to be better in our experiments on the development dataset. Thus, we follow Lample et al. (2016) and build a BiLSTM network to encode the characters in each token, as Figure 2 shows. We finally concatenate the forward embedding and the backward embedding to form the final character-level word representation.
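The character-level encoder can be sketched in plain Python as follows. This is only an illustration of the idea (run one LSTM over the characters left to right, another right to left, and concatenate the two final hidden states), not our actual implementation. The class `LSTMCell`, the helper names, the dimensions `CHAR_DIM` and `HIDDEN_DIM`, and the small uniform weight range are all assumptions chosen for the example, and the weights are random rather than trained.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class LSTMCell:
    """A minimal LSTM cell with random (untrained) weights."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        n_in = input_dim + hidden_dim
        # One weight matrix and bias per gate: input, forget, output, candidate.
        self.W = {g: [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                      for _ in range(hidden_dim)] for g in "ifoc"}
        self.b = {g: [0.0] * hidden_dim for g in "ifoc"}
        self.b["f"] = [1.0] * hidden_dim  # forget-gate bias set to 1.0

    def step(self, x, h, c):
        z = x + h  # list concatenation of the input and previous hidden state

        def affine(gate):
            return [sum(w * v for w, v in zip(row, z)) + bias
                    for row, bias in zip(self.W[gate], self.b[gate])]

        i = [sigmoid(a) for a in affine("i")]
        f = [sigmoid(a) for a in affine("f")]
        o = [sigmoid(a) for a in affine("o")]
        cand = [math.tanh(a) for a in affine("c")]
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, cand)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new


def final_state(cell, inputs):
    """Run the cell over the inputs and return the last hidden state."""
    h = [0.0] * cell.hidden_dim
    c = [0.0] * cell.hidden_dim
    for x in inputs:
        h, c = cell.step(x, h, c)
    return h


def char_word_representation(word, char_emb, fwd_cell, bwd_cell):
    chars = [char_emb[ch] for ch in word]
    # Concatenate the final forward state and the final backward state.
    return final_state(fwd_cell, chars) + final_state(bwd_cell, chars[::-1])


CHAR_DIM, HIDDEN_DIM = 25, 50  # hypothetical sizes for the example
rng = random.Random(0)
char_emb = {ch: [rng.uniform(-0.1, 0.1) for _ in range(CHAR_DIM)]
            for ch in "abcdefghijklmnopqrstuvwxyz"}
fwd_cell = LSTMCell(CHAR_DIM, HIDDEN_DIM, rng)
bwd_cell = LSTMCell(CHAR_DIM, HIDDEN_DIM, rng)
vec = char_word_representation("kktny", char_emb, fwd_cell, bwd_cell)
print(len(vec))  # 2 * HIDDEN_DIM
```

Because the representation is built from characters, an out-of-vocabulary form such as "kktny" still receives a vector, which is the property the paragraph above relies on.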

Syntactical Word Representations
We argue that the syntactical information, such as POS tags and dependency roles, should also be explicitly considered as contextual features of each token in the sentence.
TweetNLP and TweeboParser (Owoputi et al., 2013; Kong et al., 2014) are two popular tools for generating such syntactical tags for each token of a tweet. Given the nature of the noisy tweet text, a new set of POS tags and dependency [...]. See Table 1 for an example of POS tagging. Since a tweet often contains more than one utterance, the output of TweeboParser will often be a multi-rooted graph over the tweet. TweeboParser also tries to exclude some tokens from the parse tree, which results in a head index of -1 for those tokens. Word position embeddings are included as well, since they are widely used in other similar tasks, such as relation classification. In addition, head position embeddings are taken into account to further enrich the dependency information.
After calculating all 4 types of embedding vectors (POS tags, dependency roles, word positions, head positions) for every token, we concatenate them to form a syntactical word representation.
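The concatenation of the four syntactical embeddings can be sketched as below. The 5-dimensional size per feature follows the setting reported in the experiments section; everything else (the `Lookup` class, the lazy random initialization, the initialization range, and the example tag and role strings) is a hypothetical stand-in for trained lookup tables.

```python
import random

FEATURE_DIM = 5  # each syntactical feature is embedded in 5 dimensions


class Lookup:
    """Toy embedding table that creates a random vector on first access."""

    def __init__(self, dim, seed):
        self.dim = dim
        self.rng = random.Random(seed)
        self.table = {}

    def __call__(self, key):
        if key not in self.table:
            self.table[key] = [self.rng.uniform(-0.1, 0.1)
                               for _ in range(self.dim)]
        return self.table[key]


pos_emb = Lookup(FEATURE_DIM, seed=0)       # POS tags
dep_emb = Lookup(FEATURE_DIM, seed=1)       # dependency roles
position_emb = Lookup(FEATURE_DIM, seed=2)  # word positions
head_emb = Lookup(FEATURE_DIM, seed=3)      # head positions


def syntactic_representation(pos_tag, dep_role, position, head):
    # Concatenate the four feature embeddings into one vector.
    return (pos_emb(pos_tag) + dep_emb(dep_role)
            + position_emb(position) + head_emb(head))


# A token excluded from the parse tree has head index -1, which is
# embedded like any other head position. Tag and role names are made up.
syn_vec = syntactic_representation("N", "ROOT", 2, -1)
print(len(syn_vec))  # 4 * FEATURE_DIM
```

Treating the head index -1 as an ordinary key lets the model learn a dedicated vector for tokens that the parser leaves out of the tree.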

Combination with Word-level Word Representations
After obtaining the above two additional word representations, we combine them with the original word-level word representations, which are just traditional word embeddings. To sum up, our comprehensive word representations are the concatenation of three parts: 1) character-level word representations, 2) syntactical word representations and 3) original pretrained word embeddings.

BiLSTM Layer
LSTM-based networks have proven effective in sequence labeling problems because they can access both past and future contexts. However, the hidden states in unidirectional LSTMs only take information from the past, which may be adequate for classifying sentiment but is a shortcoming for labeling each token. Bidirectional LSTMs enable the hidden states to capture both historical and future context information before labeling a token.
Mathematically, the input of this BiLSTM layer is a sequence of comprehensive word representations (vectors) for the tokens of the input sentence, denoted as $(x_1, x_2, \ldots, x_n)$. The output of this BiLSTM layer is a sequence of hidden states, one for each input word vector, denoted as $(h_1, h_2, \ldots, h_n)$. Each final hidden state is the concatenation of the forward $\overleftarrow{h}_i$ and backward $\overrightarrow{h}_i$ hidden states, that is, $h_i = [\overleftarrow{h}_i ; \overrightarrow{h}_i]$.

CRF Layer
It is almost always beneficial to consider the correlations between the current label and neighboring labels, since there are many syntactical constraints in natural language sentences. For example, I-PERSON will never follow B-GROUP. If we simply feed the above-mentioned hidden states independently to a Softmax layer to predict the labels, such constraints are more likely to be broken. The linear-chain Conditional Random Field is the most popular way to control structured prediction; its basic idea is to use a series of potential functions to approximate the conditional probability of the output label sequence given the input word sequence. Formally, we take the above sequence of hidden states $h = (h_1, h_2, \ldots, h_n)$ as the input to the CRF layer, and its output is our final predicted label sequence $y = (y_1, y_2, \ldots, y_n)$, where $y_i$ is in the set of all possible labels. We denote by $Y(h)$ the set of all possible label sequences. The conditional probability of the output sequence given the input hidden state sequence is then
$$p(y \mid h; W, b) = \frac{\prod_{i=1}^{n} \exp\left(W_{y_{i-1}, y_i}^{T} h_i + b_{y_{i-1}, y_i}\right)}{\sum_{y' \in Y(h)} \prod_{i=1}^{n} \exp\left(W_{y'_{i-1}, y'_i}^{T} h_i + b_{y'_{i-1}, y'_i}\right)},$$
where W and b are the two weight matrices and the subscript indicates that we extract the weight vector for the given label pair $(y_{i-1}, y_i)$.
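The conditional probability of a label sequence under a linear-chain CRF can be sketched as follows. The sketch assumes the log-potentials have already been computed from the hidden states: `start[q]` is the score of label q at the first position, and `pairwise[i][p][q]` is the score of label p at position i followed by label q at position i+1. These names, the data layout and the toy numbers are assumptions for illustration. The normalizer is computed with the forward algorithm instead of enumerating every sequence.

```python
import math
from itertools import product


def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sequence_score(start, pairwise, path):
    """Unnormalized log-score of one label sequence."""
    score = start[path[0]]
    for step, (p, q) in zip(pairwise, zip(path, path[1:])):
        score += step[p][q]
    return score


def log_partition(start, pairwise):
    """Log of the sum of exp(score) over all label sequences (forward pass)."""
    n_labels = len(start)
    alpha = list(start)
    for step in pairwise:
        alpha = [logsumexp([alpha[p] + step[p][q] for p in range(n_labels)])
                 for q in range(n_labels)]
    return logsumexp(alpha)


def log_prob(start, pairwise, path):
    return sequence_score(start, pairwise, path) - log_partition(start, pairwise)


# Toy scores for a 3-token sentence with 2 labels (values are made up).
start = [0.2, 0.7]
pairwise = [
    [[0.1, 0.4], [0.9, -0.3]],
    [[0.5, 0.0], [-0.2, 0.6]],
]
total = sum(math.exp(log_prob(start, pairwise, path))
            for path in product(range(2), repeat=3))
print(round(total, 6))  # the probabilities of all sequences sum to 1
```

Maximizing `log_prob` of the gold sequences over the training data is the maximum conditional likelihood objective; the forward pass keeps the cost linear in the sentence length.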
To train the CRF layer, we use classic maximum conditional likelihood estimation. The final log-likelihood with respect to the weight matrices is
$$L(W, b) = \sum_{(h, y)} \log p(y \mid h; W, b),$$
where the sum runs over the training pairs $(h, y)$. Finally, we adopt the Viterbi algorithm for training the CRF layer and for decoding the optimal output sequence $y^*$.
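Viterbi decoding can be sketched as follows. The function assumes precomputed log-scores: `start[q]` for label q at the first position and `pairwise[i][p][q]` for label p at position i followed by label q at position i+1. The function name, the three-label tag set and all numbers are hypothetical; a very negative score stands in for a forbidden transition such as O followed by I-person.

```python
def viterbi(start, pairwise):
    """Return (best_score, best_path) for a linear-chain model."""
    n_labels = len(start)
    best = list(start)   # best[q]: best score of any prefix ending in label q
    backptrs = []        # backptrs[i][q]: best previous label for q
    for step in pairwise:
        new_best, ptr = [], []
        for q in range(n_labels):
            p_star = max(range(n_labels), key=lambda p: best[p] + step[p][q])
            new_best.append(best[p_star] + step[p_star][q])
            ptr.append(p_star)
        best = new_best
        backptrs.append(ptr)
    # Follow the back-pointers from the best final label.
    last = max(range(n_labels), key=lambda q: best[q])
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return best[last], path


labels = ["O", "B-person", "I-person"]
NEG = -1e9  # stands in for "forbidden"
transition = [
    [0.0, 0.0, NEG],  # from O: I-person is not allowed
    [0.0, 0.0, 1.0],  # from B-person
    [0.0, 0.0, 0.5],  # from I-person
]
start = [0.5, 1.0, NEG]  # a sentence cannot start with I-person
emissions = [[0.2, 0.1, 0.9], [1.0, 0.0, 0.3]]  # scores for tokens 2 and 3
pairwise = [[[transition[p][q] + e[q] for q in range(3)] for p in range(3)]
            for e in emissions]

score, path = viterbi(start, pairwise)
print([labels[i] for i in path])  # ['B-person', 'I-person', 'O']
```

The dynamic program keeps only the best-scoring prefix per label at each position, so decoding costs O(n k^2) for n tokens and k labels rather than enumerating all k^n sequences.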

Experiments
In this section, we discuss the implementation details of our system, such as hyperparameter tuning and the initialization of our model parameters.

Parameter Initialization
For the word-level word representation (i.e. the lookup table), we use the pretrained word embeddings from GloVe (Pennington et al., 2014). For all out-of-vocabulary words, we assign their embeddings by randomly sampling from the range $\left[-\sqrt{3/dim}, +\sqrt{3/dim}\right]$, where dim is the dimension of word embeddings, as suggested by He et al. (2015). Character embeddings are randomly initialized in the same way. We randomly initialize the weight matrices W and b with uniform samples from $\left[-\sqrt{6/(r+c)}, +\sqrt{6/(r+c)}\right]$, where r and c are the numbers of rows and columns, following Glorot and Bengio (2010). The weight matrices in the LSTM are initialized in the same way, while all LSTM hidden states are initialized to zero, except that the bias of the forget gate is initialized to 1.0, following Jozefowicz et al. (2015).
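The two uniform initialization schemes can be sketched as below. The bound sqrt(3/dim) for embeddings is an assumption based on the formula commonly attributed to He et al. (2015); the bound sqrt(6/(r+c)) is the Glorot and Bengio (2010) range. The function names and the example sizes are hypothetical.

```python
import math
import random


def uniform_embedding(dim, rng):
    """Sample an embedding uniformly from [-sqrt(3/dim), +sqrt(3/dim)].

    The bound is an assumption; see the lead-in above.
    """
    bound = math.sqrt(3.0 / dim)
    return [rng.uniform(-bound, bound) for _ in range(dim)]


def glorot_uniform(rows, cols, rng):
    """Sample a rows x cols matrix from [-sqrt(6/(r+c)), +sqrt(6/(r+c))]."""
    bound = math.sqrt(6.0 / (rows + cols))
    return [[rng.uniform(-bound, bound) for _ in range(cols)]
            for _ in range(rows)]


rng = random.Random(0)
oov_vector = uniform_embedding(100, rng)  # e.g. a 100-dimensional OOV word
weights = glorot_uniform(4, 6, rng)       # e.g. a small 4 x 6 weight matrix
print(len(oov_vector), len(weights), len(weights[0]))
```

Both bounds shrink as the dimensions grow, which keeps the variance of the initial activations roughly independent of layer size.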

Hyper Parameter Tuning
We tuned the dimension of word-level embeddings over {50, 100, 200}, of character embeddings over {10, 25, 50}, and of character BiLSTM hidden states (i.e. the character-level word representation) over {20, 50, 100}. We finally choose the bold ones. The dimensions of part-of-speech tags, dependency roles, word positions and head positions are all 5.
As for the learning method, we compare traditional SGD and Adam (Kingma and Ba, 2014). We found that Adam always performs better than SGD, and we tune the learning rate over {1e-2, 1e-3, 1e-4}.

Results
To evaluate the effectiveness of each feature in our model, we conduct feature ablation experiments; the results are shown in Table 2.

Related Work
Conditional random field (CRF) is one of the most effective approaches (Lafferty et al., 2001; McCallum and Li, 2003) for NER and other sequence labeling tasks, and it previously achieved state-of-the-art performance in Twitter NER (Baldwin et al., 2015). However, it often needs many hand-crafted features. More recently,  introduced a similar but more complex model based on BiLSTM, which also considers hand-crafted features. Lample et al. (2016) further introduced the use of a BiLSTM to incorporate character-level word representations, whereas Ma and Hovy (2016) replace the BiLSTM with a CNN to build the character-level word representation. Limsopatham and Collier (2016) used a similar model and achieved the best performance in the last shared task (Strauss et al., 2016). Building on this previous work, our system takes more syntactical information into account, such as part-of-speech tags, dependency roles, token positions and head positions, which prove to be effective.

Conclusion
In this paper, we present a novel multi-channel BiLSTM-CRF model for emerging named entity recognition in social media messages. We find that the BiLSTM-CRF architecture, combined with our proposed comprehensive word representations built from multiple sources of information, is effective in overcoming the noisy and short nature of social media messages.