IIT (BHU) Submission for the ACL Shared Task on Named Entity Recognition on Code-switched Data

This paper describes the best performing system for the shared task on Named Entity Recognition (NER) on code-switched data for the language pair Spanish-English (ENG-SPA). We introduce a gated neural architecture for the NER task. Our final model achieves an F1 score of 63.76%, outperforming the baseline by 10%.


Introduction
Named Entity Recognition (NER) is an important Natural Language Processing task, which involves extracting named entities (i.e., names of persons, locations, organizations, etc.) from the provided text and classifying them into a certain number of predefined categories. The extracted entities provide important information about the content of the text (Nadeau and Sekine, 2007). For example, in the sentence "New Delhi is famous for its historical past.", the extracted entity (New Delhi) tells us that the text is associated with the location called New Delhi. The ability of NER to extract this useful information makes it an essential part of the Information Extraction pipeline.
Social media platforms like Twitter and Reddit have become a massive source of information due to their growth in the recent past. Performing NER on social texts can be challenging due to their unstructured and colloquial nature. Various attempts have been made in the past to solve the problem of NER on social texts (Derczynski et al., 2017; Strauss et al., 2016). However, most of the previous systems were developed to work with monolingual texts (Ritter et al., 2011; Lin et al., 2017), ignoring the phenomenon of code-switching (i.e., switching between different languages within a sentence), which is quite prevalent in social media texts. This paper describes our system for the Named Entity Recognition Shared Task on English-Spanish code-switched tweets held at the ACL 2018 Workshop on Computational Approaches to Linguistic Code-switching. The task involves categorizing a token into 19 different categories. More details about the task can be found in the task description paper (Aguilar et al., 2018).
We use a novel architecture based on gating of character-based and word-based representations of a token (Yang et al., 2016). The character-based representation is generated using a 'Char CNN' and the word-based representation is generated using an LSTM (Hochreiter and Schmidhuber, 1997). Furthermore, the activations from the last but one layer of the neural networks, trained with different hyperparameters, are ensembled and then passed as features to a Conditional Random Field (CRF) classifier for final predictions. We make use of English Twitter embeddings (Godin et al., 2015), aligned with the Spanish embeddings (Bojanowski et al., 2016) as described in Section 2.1. Our final submitted system achieves the best result on the shared task with an F1-score of 63.76%.

Proposed Approach
This section describes the feature representations, the model and the ensembling technique in detail.

Feature Representation
The following representations are used to capture the overall information for each token: word, character and lexical representations.
Word Representation: Word representations are created by concatenating two separate representations, one based on pre-trained word vectors and the other based on the Part-of-Speech (POS) tag of the token. For the word vector representation, we use Spanish FastText word vectors (Bojanowski et al., 2016) of 300 dimensions, trained on Wikipedia, and pre-trained English Twitter word embeddings (Godin et al., 2015) of 400 dimensions, trained on 400 million tweets. We use a Principal Component Analysis (PCA) based algorithm suggested by Raunak (2017) to reduce the dimensions of the Twitter word vectors. Since the two sets of word vectors lie in different vector spaces, we use Singular Value Decomposition (SVD) (Smith et al., 2017) to align them so that they are represented in a single vector space. For POS tagging, we use the CMU Part-of-Speech tagger (Owoputi et al., 2013). Each POS tag is represented as a vector of dimension dim. The vectors corresponding to the POS tags are initialized randomly from a uniform distribution over the range (−3/dim, 3/dim), as suggested by He et al. (2015). The word vector corresponding to the token is concatenated with the vector corresponding to the POS tag of the token to obtain the final vector representation.
To obtain the label for each token, we provide a composite vector as input to the model. The composite vector is generated by concatenating the word representations of the adjacent tokens (one on each side) with the token's own representation, in the manner of a trigram.
Character Representation: At the character level, we represent each token as a sequence of character embeddings. These embeddings are initialized randomly from a uniform distribution, in the same way as the POS tag embeddings. In the model, they are kept trainable so that the representation corresponding to each character is learned. Each token is either truncated or post-padded to a length of 20 characters.
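The two input-shaping steps described here (the trigram-style composite vector and the fixed 20-character token length) can be illustrated with a minimal sketch. The function names, the padding symbol and the use of zero vectors at sentence boundaries are assumptions made for illustration; the text does not specify them.

```python
from typing import List, Sequence

PAD_CHAR = "\0"  # hypothetical padding symbol; the paper does not name one


def fix_length(token: str, length: int = 20, pad: str = PAD_CHAR) -> str:
    """Truncate or post-pad a token to exactly `length` characters."""
    return token[:length] + pad * max(0, length - len(token))


def trigram_input(vectors: Sequence[Sequence[float]], i: int) -> List[float]:
    """Concatenate the vectors of tokens i-1, i and i+1.

    Positions outside the sentence are filled with zero vectors
    (an assumption; the boundary handling is not described in the text).
    """
    dim = len(vectors[0])
    zero = [0.0] * dim
    left = vectors[i - 1] if i > 0 else zero
    right = vectors[i + 1] if i + 1 < len(vectors) else zero
    return list(left) + list(vectors[i]) + list(right)
```

With word representations of dimension n, the composite vector produced this way has dimension 3n, regardless of the position of the token in the sentence.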
Lexical Representation: We use the gazetteer provided by Mishra and Diesner (2016) and some Spanish gazetteers of our own to provide world knowledge to our model. The top 1000 celebrity Twitter handles, taken from the list referenced in footnote 1, are also added. We represent the gazetteer input for a token as a 19-dimensional vector, with one binary value corresponding to each class. Each bit represents the presence (1) or absence (0) of the token in the gazetteer (i.e., word list) of the respective class.
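The binary gazetteer encoding can be sketched as follows. The class names and word lists below are toy data, and the case-insensitive lookup is an assumption; the actual system uses 19 classes and the gazetteers named above.

```python
from typing import Dict, List, Set


def gazetteer_vector(
    token: str, gazetteers: Dict[str, Set[str]], classes: List[str]
) -> List[int]:
    """Return one bit per class: 1 if the token is in that class's word list."""
    key = token.lower()  # case-insensitive lookup is an assumption
    return [1 if key in gazetteers.get(c, set()) else 0 for c in classes]


# Toy data for illustration only; the real system uses 19 classes.
CLASSES = ["PER", "LOC", "ORG"]
GAZETTEERS = {
    "PER": {"shakira"},
    "LOC": {"madrid", "texas"},
    "ORG": {"nasa"},
}
```

For instance, `gazetteer_vector("Madrid", GAZETTEERS, CLASSES)` sets only the bit of the LOC class, and a token found in no word list yields an all-zero vector.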

Model Description
BiLSTM for Word Representation: We use a bidirectional LSTM (Dyer et al., 2015) in the model to learn the contextual relationship between the words. The word representations described earlier are used as input to this layer. The BiLSTM layer consists of two LSTM layers having 3 units each.
With one layer connected in the forward direction and the other connected in the backward direction, the BiLSTM captures information from both the past and the future (Ma and Hovy, 2016). The outputs of the forward and backward LSTMs are then concatenated to produce a single embedding for the input token. We vary the recurrent dropouts (Gal and Ghahramani, 2016), input dropouts and output dropouts across three different models, as shown in Table 1. The gate layer is fed with the output of this layer (X_w).
Convolution Network for Character Representation: We use a CNN architecture to learn the character-based representation of a word. The character embeddings of a token form a matrix in R^(d×l), where d is the dimension of a single character's embedding and l is the maximum length of the token. This matrix is fed to two stacked convolutional layers, both activated using the ReLU function. The result is then passed to a pooling layer. We apply two different pooling techniques across the different models, as specified in Table 1. The output of the pooling layer serves as input to a dense layer, whose activation function (Char dense layer activation) is varied as shown in Table 1. Finally, we use the output of the dense layer (X_ch) as input to the gate layer.
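The two pooling techniques referred to in Table 1, global average pooling and global max pooling, both collapse the character positions of a convolutional feature map into one value per filter. A minimal sketch in plain Python follows; the function names and the position-by-filter layout of the feature map are assumptions made for illustration.

```python
from typing import List


def global_average_pooling(feature_map: List[List[float]]) -> List[float]:
    """Average each filter's activations over all character positions.

    feature_map[t][f] is the activation of filter f at position t.
    """
    steps = len(feature_map)
    return [sum(column) / steps for column in zip(*feature_map)]


def global_max_pooling(feature_map: List[List[float]]) -> List[float]:
    """Keep each filter's largest activation over all character positions."""
    return [max(column) for column in zip(*feature_map)]
```

Either way, the output has one entry per filter, so its size does not depend on the token length.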

Gate Layer: The concatenation of word representations and POS tag embeddings, denoted x, is used as input to a sigmoid dense layer. The value of the sigmoid output controls the relative contribution of the character and word representations in the final representation of the token. Following the work of Miyamoto and Cho (2016), the output of this layer, g, is used to take the weighted average of the BiLSTM network output (X_w) and the convolutional network output (X_ch):
$$g = \sigma\left(v_g^{\top} x + b_g\right)$$
$$X = (1 - g)\, X_w + g\, X_{ch}$$
where v_g is the trainable weight vector, b_g is the bias and σ(·) is the sigmoid function. The result of this layer, X, is then concatenated with the gazetteer embeddings of the token.
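The gating step can be sketched in plain Python as below. This is an illustration in the style of Miyamoto and Cho (2016), not the authors' implementation: the function names are hypothetical, and the choice of which representation is weighted by g and which by 1 − g is an assumption taken from that work.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def gate(
    x_in: Sequence[float],   # gate input: word + POS representation
    v_g: Sequence[float],    # trainable weight vector
    b_g: float,              # bias
    x_w: Sequence[float],    # BiLSTM (word-level) output
    x_ch: Sequence[float],   # CNN (character-level) output
) -> List[float]:
    """Mix word-level and character-level vectors with a scalar gate g."""
    if len(x_w) != len(x_ch):
        raise ValueError("x_w and x_ch must have the same dimension")
    g = sigmoid(sum(v * x for v, x in zip(v_g, x_in)) + b_g)
    # Assumption: g weights the character side, (1 - g) the word side.
    return [(1.0 - g) * w + g * c for w, c in zip(x_w, x_ch)]
```

Because the two vectors are averaged element-wise, the word-level and character-level outputs must have the same dimension in this formulation.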
Fully Connected Network: We use two fully connected networks after the concatenation of the gate network output and the gazetteer embeddings. The number of dense units is kept fixed at 100 for each. The activation function is varied according to Table 1 to produce different models.
Multitask Learning: Multitask learning has been shown to be an effective way to regularize models (Baxter, 2000; Collobert and Weston, 2008). Following the work of Aguilar et al. (2017), we split the task into Named Entity (NE) categorization (classifying a token into one of the NE classes) and NE segmentation (classifying a token as NE or Not-NE). We pass the dense layer's output as input to these final classification layers. A softmax layer with 19 classes is used for the categorization task and a single sigmoid neuron is used for the segmentation task, as depicted in Figure 1. The cross-entropy losses for these two tasks are added to yield the total loss for the model.
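For a single token, the summed loss can be written out as in the sketch below. The function names and the clipping constant are illustrative assumptions; in practice the losses are computed by the deep learning library over whole batches.

```python
import math
from typing import Sequence

EPS = 1e-12  # clipping constant to avoid log(0); an illustrative choice


def categorical_cross_entropy(probs: Sequence[float], gold_index: int) -> float:
    """Loss of the softmax (categorization) output for one token."""
    return -math.log(max(probs[gold_index], EPS))


def binary_cross_entropy(p_entity: float, is_entity: bool) -> float:
    """Loss of the sigmoid (segmentation) output for one token."""
    p = min(max(p_entity, EPS), 1.0 - EPS)
    return -math.log(p) if is_entity else -math.log(1.0 - p)


def multitask_loss(
    class_probs: Sequence[float], gold_index: int,
    p_entity: float, is_entity: bool,
) -> float:
    """Total loss: categorization loss plus segmentation loss."""
    return (categorical_cross_entropy(class_probs, gold_index)
            + binary_cross_entropy(p_entity, is_entity))
```

Both terms enter the sum with equal weight, as described in the text; no task-specific weighting is applied in this sketch.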

Conditional Random Fields and Ensembling
A linear-chain CRF classifier takes advantage of the sequence information to tag a token with the most probable label (Lafferty et al., 2001). Following Aguilar et al. (2017), we use the activations of the second common dense layer as the input feature vector for the CRF classifier. The CRF classifier produces better results than plain softmax classification and also reduces the number of invalid predictions (e.g., an I-PER tag without a preceding B-PER tag). For preparing the model ensemble, we use unweighted averaging of the activations generated by the networks described in Table 1.
Notes to Table 1: GAP: Global Average Pooling; GMP: Global Max Pooling; nadam is Adam RMSprop with Nesterov momentum (Dozat, 2016).
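The unweighted averaging of activations, and the kind of invalid prediction the CRF helps to avoid, can be illustrated with the following sketch. The function names and the nested-list layout are assumptions; the averaged vectors would then serve as per-token features for the CRF classifier.

```python
from typing import List, Sequence


def average_activations(
    model_outputs: Sequence[Sequence[Sequence[float]]],
) -> List[List[float]]:
    """Unweighted mean over models.

    model_outputs[m][t][k] is activation k of token t produced by model m.
    """
    n_models = len(model_outputs)
    averaged = []
    for token_rows in zip(*model_outputs):  # the same token across models
        averaged.append([sum(vals) / n_models for vals in zip(*token_rows)])
    return averaged


def has_invalid_bio(tags: Sequence[str]) -> bool:
    """True if some I-X tag is not preceded by B-X or I-X."""
    previous = "O"
    for tag in tags:
        if tag.startswith("I-"):
            entity = tag[2:]
            if previous not in ("B-" + entity, "I-" + entity):
                return True
        previous = tag
    return False
```

A sequence such as O, I-PER is flagged as invalid by `has_invalid_bio`, whereas B-PER, I-PER, O is accepted.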

Pre-processing
The data is pre-processed by making the following replacements: all URLs are replaced with the token url; all hashtags are replaced with the token hashtag; digits are replaced with the token number; apostrophes are removed. Finally, emoticons are replaced with their respective meaning, for example, ':-)' with smile.
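A minimal sketch of these replacements using regular expressions is given below. The patterns, the order of the steps, the exact spelling of the placeholder tokens, the treatment of a run of digits as a single number token, and every emoticon mapping other than ':-)' are assumptions made for illustration.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HASHTAG_RE = re.compile(r"#\w+")
DIGITS_RE = re.compile(r"\d+")

# Only ':-)' -> smile comes from the text; the other entries are hypothetical.
EMOTICONS = {":-)": "smile", ":)": "smile", ":-(": "sad", ":(": "sad"}


def preprocess(text: str) -> str:
    """Apply the replacements described above to one tweet."""
    text = URL_RE.sub("url", text)          # URLs first: they may contain '#' or digits
    text = HASHTAG_RE.sub("hashtag", text)
    for emoticon, meaning in EMOTICONS.items():
        text = text.replace(emoticon, meaning)
    text = DIGITS_RE.sub("number", text)    # a run of digits becomes one token
    text = text.replace("'", "")            # drop apostrophes
    return text
```

URLs are handled before hashtags and digits so that fragments inside a link are not replaced separately.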

Hyper-parameters
Different hyper-parameters are used to produce different models for ensembling. The following parameters are kept the same across all the models: 64 filters, a kernel size of 3 and ReLU activation in the convolutional network (Section 2.2), along with 50 hidden units in the BiLSTM network (Section 2.2).
Other hyper-parameters are set according to Table 1 for the respective models. All models are trained for 15 epochs with a batch size of 512. The CRF classifier is trained for 80 epochs with an L1 penalty of 1.0 and an L2 penalty of 1e-3. Hyper-parameters for Model-3 are obtained by a random search using hyperas (https://github.com/hyperopt/hyperopt). Hyper-parameters for the other two models are set based on our own experimental observations. All our models are implemented using the deep learning library Keras.

Results and Discussion
We compare our final results with the RNN baseline, which is the official baseline of the task (Aguilar et al., 2018). The major highlights of our results are described below.
• Our model achieves an F1-score of 63.76%, which beats the baseline by around 10% on the test set. Our results demonstrate the effectiveness of the gated neural architecture for Named Entity Recognition. Our system ranked first among the 8 systems submitted for the task.
• The system performance on the various entity classes is displayed in Table 2. Our model shows poor performance in the Title, Other and Event categories. This may be attributed both to the diverse set of patterns present and to the small number of samples available for these categories.

Conclusion
In this paper, we describe a gated neural network for performing NER on code-switched social media text. Our model uses SVD to align the word representations of English and Spanish words. Furthermore, we describe a novel way of ensembling the activations of the last but one layer to achieve better results. Our model is described in full detail in this paper so that the results can be replicated. The final system performs the best among all the participating systems.
In the future, we would like to experiment with other ways of combining character and word representations (e.g., fine-grained gating, Highway Networks (Liang et al., 2017), etc.) for the NER task.