Distributed Representation, LDA Topic Modelling and Deep Learning for Emerging Named Entity Recognition from Social Media

This paper reports our participation in the W-NUT 2017 shared task on emerging and rare entity recognition from user generated noisy text such as tweets, online reviews and forum discussions. To accomplish this challenging task, we explore an approach that combines LDA topic modelling with deep learning on word-level and character-level embeddings. The LDA topic modelling generates a topic representation for each tweet, which is used as a feature for each word in the tweet. The deep learning component consists of a two-layer bidirectional LSTM and a CRF output layer. Our submitted result achieved an F1 of 39.98 on entities and 37.77 on surface forms. Our new experiments after submission reached a best performance of 41.81 on entities and 40.57 on surface forms.


Introduction
The shared task Emerging and Rare Entity Recognition at the 3rd Workshop on Noisy User-generated Text (W-NUT 2017) takes on the challenge of identifying unusual, previously-unseen entities in noisy texts such as tweets, online reviews and other social discussions (http://noisytext.github.io/2017/emerging-rare-entities.html). The emergent nature of novel named entities in user generated content and the often very creative nature of their surface forms make automatic detection of such entities particularly difficult. To address these challenges, the shared task organizer prepared training, development and test datasets and provided them to the participants. The datasets try to "resemble turbulent data containing few repeated entities, drawn from rapidly-changing text types or sources of non-mainstream entities". Results from the shared task are evaluated using F1 measures on the entities and surface forms found in the test data. This evaluation rewards systems for correctly detecting a diverse range of entities rather than only the frequent ones.
Inspired by the work of Limsopatham and Collier (2016, winner of the W-NUT 2016 shared task on Named Entity Recognition in Twitter), Chiu and Nichols (2016), and Huang et al (2015), we approached this shared task with bidirectional LSTM (Long Short-Term Memory recurrent neural network) models enhanced by a CRF output layer, using both character-level and word-level embeddings as inputs. In addition, unlike Limsopatham and Collier (2016), we did not use orthographic features of characters but incorporated POS tags as well as document topics extracted from LDA topic modelling as optional inputs to the model. The LDA topic modelling generates a topic representation for each tweet, which is used as a feature for each word in the tweet.
Our submitted result achieved an F1 of 39.98 on entities and 37.77 on surface forms, using 10% of the combined training and development set for validation. After submission, we ran further experiments in which the model was trained on the full combined training and development sets; the ground truth, available by then, helped us select among the results. Our best result reached a performance of 41.81 on entities and 40.57 on surface forms.

Data and Preprocessing
The shared task datasets consist of a training set, a development set and a test set. Basic statistics of each dataset are shown in Table 1. The shared task focuses on discovering six types of target entities and surface forms: Corporation, Creative-Work, Group, Location, Person and Product (Derczynski et al, 2017).
In counting surface forms, each unique "word-label" combination is counted once, and matching is case-sensitive. When the same word appears twice but with different entity labels, both are counted.
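The counting rule above can be sketched as follows. This is a minimal illustration, not the official evaluation script: it assumes BIO-style labels, treats a surface form as a unique (text, entity type) pair, and computes F1 over sets. The function names and the toy data are hypothetical.

```python
from typing import List, Set, Tuple

Token = Tuple[str, str]   # (word, BIO label), e.g. ("Apple", "B-corporation")
Entity = Tuple[str, str]  # (surface text, entity type)


def extract_entities(tokens: List[Token]) -> List[Entity]:
    """Collect (surface text, entity type) pairs from BIO-tagged tokens."""
    entities: List[Entity] = []
    words: List[str] = []
    etype = None
    # A trailing "O" sentinel flushes an entity that ends the sequence.
    for word, label in tokens + [("", "O")]:
        starts_new = label.startswith("B-") or (
            label.startswith("I-") and label[2:] != etype
        )
        if (label == "O" or starts_new) and words:
            entities.append((" ".join(words), etype))
            words, etype = [], None
        if label != "O":
            words.append(word)
            etype = label[2:]
    return entities


def surface_forms(entities: List[Entity]) -> Set[Entity]:
    """Unique, case-sensitive (text, type) combinations."""
    return set(entities)


def f1(gold: Set[Entity], pred: Set[Entity]) -> float:
    """Harmonic mean of precision and recall over two sets."""
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Under this sketch, "Apple" tagged as a corporation and "Apple" tagged as a product are two distinct surface forms, and "apple" is distinct from "Apple".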
For stop word removal, we used the Stopwords ISO list (https://github.com/stopwordsiso/stopwords-en). When applying LDA modelling, the cutoff value for infrequent terms is set to one.
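A minimal sketch of this cleaning step is given below. It rests on several assumptions that the description does not settle: tokens are split on whitespace and lowercased, punctuation is stripped from token edges, and a cutoff of one means that terms occurring at most once in the corpus are dropped. The function name and the tiny stop word set in the usage note are placeholders, not the Stopwords ISO list.

```python
import string
from collections import Counter
from typing import Iterable, List, Set


def preprocess(docs: Iterable[str], stop_words: Set[str], cutoff: int = 1) -> List[List[str]]:
    """Tokenize, strip punctuation, drop stop words and infrequent terms."""
    cleaned: List[List[str]] = []
    for doc in docs:
        # Assumption: whitespace tokenization and lowercasing.
        tokens = [t.strip(string.punctuation).lower() for t in doc.split()]
        cleaned.append([t for t in tokens if t and t not in stop_words])

    # Corpus-wide term frequencies, computed after stop word removal.
    counts = Counter(t for doc in cleaned for t in doc)

    # Assumption: keep only terms that occur more than `cutoff` times.
    return [[t for t in doc if counts[t] > cutoff] for doc in cleaned]
```

For example, `preprocess(["The cat sat on the mat!", "A cat ran.", "Dogs ran away"], {"the", "on", "a"})` keeps only "cat" and "ran", since every other term occurs once.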

Emerging and Rare Entity Detection from Social Media: Framework and Methods
Our approach to emerging and rare entity detection from social media is illustrated in Figure 1. Our framework consists of the following components: (1) character-level embeddings and bidirectional LSTM modelling; (2) word-level embeddings and bidirectional LSTM modelling; (3) a bidirectional LSTM enhanced with LDA topic features and POS tags; (4) fully connected layers; and (5) a CRF (Conditional Random Fields) output layer.

Character-level Representation
Character-level information was found to be valuable input for named entity recognition from social media (Limsopatham and Collier, 2016;Vosoughi et al, 2016). Chiu and Nichols (2015) found that modelling both the character-level and word-level embeddings within a neural network for named entity recognition helps improve the performance.
In our system, each character is represented as an N dimensional embedding which is learned and adjusted during the training process. The character level representations will then be merged into one M dimensional (50d-200d seems to work well) representation for each word. Character capitalization is kept.
We used 20-dimensional embeddings to represent each character. To learn character-level representations for each word we use a bidirectional LSTM to create a 200-dimensional representation for each word.
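To make the input to the character-level model concrete, the following sketch maps each word to a sequence of character indices while keeping capitalization, so that each index can then be looked up in a trainable embedding table. The padding and unknown-character tokens, the fixed maximum length and the function names are assumptions for illustration; the embedding lookup and the bidirectional LSTM themselves are not shown.

```python
from typing import Dict, Iterable, List

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens


def build_char_vocab(words: Iterable[str]) -> Dict[str, int]:
    """Assign an index to every distinct character; case is preserved."""
    vocab = {PAD: 0, UNK: 1}
    for word in words:
        for ch in word:
            vocab.setdefault(ch, len(vocab))
    return vocab


def encode_word(word: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Convert a word to a fixed-length list of character indices."""
    ids = [vocab.get(ch, vocab[UNK]) for ch in word[:max_len]]
    return ids + [vocab[PAD]] * (max_len - len(ids))
```

Because "A" and "a" receive different indices, they also receive different embeddings, which is one way capitalization cues can reach the model without hand-crafted orthographic features.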

Word Embeddings
Word embeddings are distributed representations of words that offer continuous representations of words and of text features such as the linguistic context of words.

POS Tagging
Part of speech is also an important indicator of named entities, which we would like to include in our model (Huang et al, 2015). The GATE Twitter POS tagger (https://gate.ac.uk/wiki/twitterpostagger.html) is used to assign a POS tag to each word. Each POS tag is represented as a 50-dimensional trainable embedding.

LDA Topic Modelling
Topic modeling offers a powerful means for finding hidden thematic structure in large text collections. In topic models, topics are defined as a distribution over a fixed vocabulary of terms and documents are defined as a distribution over topics. LDA topic modeling and its variations represent the most popular methods (Blei et al, 2003;Blei, 2012).
We consider the topic composition of each tweet or social media post an important indicator of subject domain context, which can be used to complement the local linguistic context captured by word vectors. We use the topic representation of each tweet, derived from LDA modelling, as a feature for each word in the tweet.
We applied the online LDA method by Hoffman et al (2010), as implemented in Gensim (https://radimrehurek.com/gensim/models/ldamodel.html). It can handily analyze massive document collections or document streams.
When generating topic models, all three datasets are combined into one corpus, and each entry is treated as a separate document. Each document is cleaned and preprocessed, which includes removing stop words, punctuation and infrequent terms. An LDA model of 250 topics was trained and used for the system that generated our submitted results. Using the model, we get a document-level topic for each document; this topic value is then assigned to each word in the document. We also use the model to get a topic for each word in the documents. If the probability that a document or a word belongs to a topic is the same for every topic, a special token is assigned to it instead of a topic. Each topic token is then assigned a 250-dimensional embedding; the embeddings for document-level and word-level topics are initialized separately.
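The assignment of topic tokens can be sketched as follows. The sketch assumes that "the topic" of a document or word is its most probable topic under the LDA model, and that a uniform distribution is detected with a small numerical tolerance; the token names and function names are hypothetical. The topic distribution itself would come from the trained LDA model and is passed in here as a plain list of probabilities.

```python
from typing import List, Sequence

NO_TOPIC = "<no_topic>"  # hypothetical special token for uninformative distributions


def topic_token(probs: Sequence[float], tol: float = 1e-9) -> str:
    """Return a token for the most probable topic, or NO_TOPIC if all are equal."""
    if not probs or max(probs) - min(probs) <= tol:
        return NO_TOPIC
    best = max(range(len(probs)), key=probs.__getitem__)
    return f"topic_{best}"


def document_topic_features(doc_probs: Sequence[float], n_words: int) -> List[str]:
    """Copy the document-level topic token to every word in the document."""
    return [topic_token(doc_probs)] * n_words
```

Each resulting token would then index into an embedding table, with separate tables for document-level and word-level topics.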

Two-Layer Bidirectional LSTM
Bidirectional LSTM has been shown effective for modelling social media sentences (Huang et al., 2015;Dyer et al., 2015;Limsopatham and Collier, 2016). To learn deep neural models for named entity recognition we adopted a two-layer bidirectional LSTM, followed by two fully connected layers, and a Conditional Random Field (CRF) as an output layer where we maximize the joint likelihood.
For the first LSTM layer, we concatenate the 200-dimensional GloVe word embeddings and the 200-dimensional embeddings for character-level representation. For the second layer, we concatenate the output of the first layer with the POS-feature embeddings and LDA-feature embeddings. The LSTM output dimensions are 256 for the first layer and 512 for the second layer.
After the second LSTM layer, we use two fully connected layers at each time step, and feed this representation into the CRF output-layer. The dimensions of the fully connected layers are 128 and 64 for the first and second layer respectively.
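The dimensions described above can be tallied with a small bookkeeping sketch. It only does arithmetic on the sizes stated in the text and does not build a network. Whether the document-level and word-level topic embeddings are both fed to the second layer at once is an assumption, exposed here as the hypothetical parameter `n_topic_features`.

```python
from typing import Dict

WORD_EMB = 200    # GloVe word embedding size
CHAR_REPR = 200   # character-level word representation size
POS_EMB = 50      # POS tag embedding size
TOPIC_EMB = 250   # embedding size of one topic token
LSTM1_OUT = 256   # stated output size of the first LSTM layer
LSTM2_OUT = 512   # stated output size of the second LSTM layer
DENSE1_OUT = 128  # first fully connected layer
DENSE2_OUT = 64   # second fully connected layer, fed to the CRF


def layer_dims(use_pos: bool = True, n_topic_features: int = 2) -> Dict[str, int]:
    """Per-time-step input sizes implied by concatenating the enabled features."""
    lstm2_in = LSTM1_OUT
    if use_pos:
        lstm2_in += POS_EMB
    # Assumption: 0, 1 (document-level) or 2 (document- and word-level) topic embeddings.
    lstm2_in += n_topic_features * TOPIC_EMB
    return {
        "lstm1_in": WORD_EMB + CHAR_REPR,
        "lstm2_in": lstm2_in,
        "dense1_in": LSTM2_OUT,
        "dense2_in": DENSE1_OUT,
        "crf_in": DENSE2_OUT,
    }
```

With all optional features enabled, the second LSTM layer would see 256 + 50 + 2 x 250 = 806 values per time step under these assumptions.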
Between each pair of layers in the network we applied dropout and batch normalization (Ioffe and Szegedy, 2015). A dropout rate of 0.25 is used for the first two layers of the network (the character LSTM and the character + word LSTM). For all the other layers of the network, a dropout rate of 0.5 is used.
The fully connected layers are extra hidden layers before the CRF output layer, which allow the models to learn higher level representations without adding complexity through an extra compositional layer (Rei and Yannakoudakis, 2016).
The conditional random field (CRF) has been shown to be one of the most effective methods for named entity recognition in general and in social media (Lafferty et al., 2001;McCallum and Li, 2003;Baldwin et al., 2015;Limsopatham and Collier, 2016). It also helped our system gain performance in recognizing emerging entities and surface forms. The deep neural model was implemented using Keras with a TensorFlow backend and the Keras community contributions for the CRF implementation. One model is trained for both entity and surface form recognition. Any feature can be included or excluded as needed when running the model.
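To illustrate what a CRF output layer contributes at prediction time, the following is a minimal sketch of Viterbi decoding over per-token emission scores and label-transition scores. It is a generic textbook formulation in plain Python, not the Keras community contributions implementation used in our system; the label set and scores in the usage note are invented.

```python
from typing import Dict, List, Tuple


def viterbi(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    labels: List[str],
) -> List[str]:
    """Return the label sequence with the highest total score."""
    if not emissions:
        return []
    # Best score of any path ending in each label at the first token.
    score = {y: emissions[0][y] for y in labels}
    backpointers: List[Dict[str, str]] = []
    for emit in emissions[1:]:
        new_score: Dict[str, float] = {}
        pointer: Dict[str, str] = {}
        for y in labels:
            # Best previous label for reaching y; missing transitions score 0.
            prev = max(labels, key=lambda p: score[p] + transitions.get((p, y), 0.0))
            new_score[y] = score[prev] + transitions.get((prev, y), 0.0) + emit[y]
            pointer[y] = prev
        score = new_score
        backpointers.append(pointer)
    # Follow the backpointers from the best final label.
    path = [max(labels, key=score.__getitem__)]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    return path[::-1]
```

If the transition from "O" to "I-person" is heavily penalized, a sequence whose per-token scores favor "O" followed by "I-person" is decoded as "B-person" followed by "I-person" instead, which is the kind of label consistency that per-token classification alone cannot enforce.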

Experiments and Results
In this section, we report two sets of experiments and results. Results from the 1st set of experiments were submitted to the shared task organizer for evaluation. The 2nd set of experiments was done after the submission. Using the ground truth released by the organizer, we evaluated the results ourselves. The availability of the ground truth also helped us identify the best model.

1st Set of Experiments and Submitted Results
To train the model, the training set and the dev set are merged, of which 10% (in terms of size, about half of the original dev set) is used for validation. We used a batch size of 32 for training, and the RMSprop optimizer with an initial learning rate of 0.001. The results are shown in Table 2.
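The data split described above can be sketched as follows. The sketch assumes a random shuffle with a fixed seed before holding out 10% of the merged data; how the validation portion was actually drawn is not specified, and the function name is hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def merge_and_split(
    train: Sequence[T], dev: Sequence[T], val_fraction: float = 0.1, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Merge two datasets and hold out a fraction of the result for validation."""
    merged = list(train) + list(dev)
    # Assumption: a seeded random shuffle, so the split is reproducible.
    random.Random(seed).shuffle(merged)
    n_val = round(len(merged) * val_fraction)
    return merged[n_val:], merged[:n_val]
```

The first returned list is used for training and the second for validation; no example appears in both.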
The results from all participating systems are presented in Table 3 (Derczynski, et al, 2017). The overall performance of our system reached 39.98 on entities and 37.77 on surface forms. The performance on the Person and Location types of entities and surface forms is comparatively better, with F1 scores of 55.88 and 47.38 respectively for entities, and 53.30 and 42.80 for their surface forms. The system is less effective at identifying the Corporation, Product, Creative-work and Group types of entities and surface forms, and is especially disappointing in terms of recall. For Creative-work and Product type entities, recall only reached 9.86% and 11.02% respectively.

2nd Set of Experiments and Updated Results
After submission, we continued our modelling work with new training strategies. In terms of the data, all samples of the training set and dev set are used for training the model, which is then directly applied to the test set. We also experimented further with the number of topics in LDA topic modelling. We found that incorporating LDA features does have a positive effect on the performance. We tried models with 20, 50, 150, 250, 350 and 450 topics.

Discussion and Conclusion
In this paper, we reported our participation in the W-NUT 2017 shared task on emerging and rare entity recognition from user generated noisy text. We described our system, which leverages LDA topic modelling, POS tags, character-level and word-level embeddings, a bidirectional LSTM and a CRF. The LDA topic modelling generates a topic representation for each tweet or social media post. The deep learning model consists of a two-layer bidirectional LSTM, two fully connected layers and a CRF output layer. We use the topic representation of each tweet, derived from LDA modelling, as a feature for each word in the tweet or post. The topic composition of each post offers a certain subject domain context that could complement the local linguistic context of word embeddings.
We reported two sets of experiments and results. Results from the 1st set of experiments were submitted to the shared task organizer for evaluation. Our submitted results achieved an F1 of 39.98 on entities and 37.77 on surface forms.
The 2nd set of experiments was done as a follow-up study after the submission, adopting a different training strategy. Using the ground truth released by the organizer, we evaluated the results ourselves. The availability of the ground truth helped us identify the best model.
We experimented further with the number of topics in LDA topic modelling. We found that incorporating LDA features does have a positive effect on the performance. The new results reached a best performance of 41.81 on entities and 40.57 on surface forms, with the number of topics set to 150. When the number of topics is set to the same value as for our submitted results (i.e. 250), the new results also showed a performance gain, reaching 41.78 on entities and 39.90 on surface forms.
For future work, it would be interesting to train the LDA model on a larger corpus, in the hope of finding a more accurate subject domain context for each tweet or post. It would also be useful to explore the effects of alternative word embeddings such as fastText, and to apply our system to identifying city event related entities and surface forms in other social media data.