Named Entity Recognition for Social Media Texts with Semantic Augmentation

Existing approaches for named entity recognition suffer from data sparsity problems when conducted on short and informal texts, especially user-generated social media content. Semantic augmentation is a potential way to alleviate this problem. Given that rich semantic information is implicitly preserved in pre-trained word embeddings, they are ideal resources for semantic augmentation. In this paper, we propose a neural-based approach to NER for social media texts where both local (from running text) and augmented semantics are taken into account. In particular, we obtain the augmented semantic information from a large-scale corpus, and propose an attentive semantic augmentation module and a gate module to encode and aggregate such information, respectively. Extensive experiments are performed on three benchmark datasets collected from English and Chinese social media platforms, where the results demonstrate the superiority of our approach over previous studies across all three datasets.


Introduction
The increasing popularity of microblogs results in a large amount of user-generated data, in which texts are usually short and informal. How to effectively understand these texts remains a challenging task, since the insights are hidden in the unstructured forms of social media posts. Thus, named entity recognition (NER) is a critical step for detecting proper entities in texts and providing support for downstream natural language processing (NLP) tasks (Pang et al., 2019; Martins et al., 2019).
However, NER in social media remains a challenging task because (i) it suffers from the data sparsity problem, since entities usually represent a small portion of proper names, which makes the task hard to generalize; and (ii) social media texts do not follow strict syntactic rules (Ritter et al., 2011). To tackle these challenges, previous studies tried to leverage domain information (e.g., existing gazetteers and embeddings trained on large social media corpora) and external features (e.g., part-of-speech tags) to help with social media NER (Peng and Dredze, 2015; Aguilar et al., 2017). However, these approaches require extra effort to obtain such information and suffer from noise in the resulting information. For example, training embeddings for the social media domain could bring many unusual expressions into the vocabulary. Inspired by studies using semantic augmentation (especially from lexical semantics) to improve model performance on many NLP tasks (Song and Xia, 2013; Song et al., 2018a; Kumar et al., 2019; Amjad et al., 2020), semantic augmentation is also a promising solution for social media NER. Figure 1 shows a typical case. "Chris", which should be tagged as "Person" in this example sentence, is tagged with other labels in most cases. Therefore, it is difficult to label "Chris" correctly at prediction time. A sound solution is to augment the semantic space of "Chris" through its similar words, such as "Jason" and "Mike", which can be obtained from existing pre-trained word embeddings in the general domain.

* Equal contribution. † Corresponding author. The code and the best-performing models are available at https://github.com/cuhksz-nlp/SANER.
In this paper, we propose an effective approach to NER for social media texts with semantic augmentation. In doing so, we augment the semantic space for each token from pre-trained word embedding models, such as GloVe (Pennington et al., 2014) and Tencent Embedding (Song et al., 2018b), and encode the semantic information through an attentive semantic augmentation module. Then we apply a gate module to weigh the contributions of the augmentation module and the context encoding module in the NER process. To further improve NER performance, we also leverage multiple types of pre-trained word embeddings for feature extraction, which has been demonstrated to be effective in previous studies (Akbik et al., 2018; Jie and Lu, 2019; Kasai et al., 2019; Kim et al., 2019). To evaluate our approach, we conduct experiments on three benchmark datasets, where the results show that our model outperforms the state-of-the-art with a clear advantage across all datasets.

The Proposed Model
The task of social media NER is conventionally regarded as a sequence labeling task, where an input sequence X = x_1, x_2, ..., x_n with n tokens is annotated with its corresponding NE labels Y = y_1, y_2, ..., y_n of the same length. Following this paradigm, we propose a neural model with semantic augmentation for social media NER. Figure 2 shows the architecture of our model, where the backbone model and the semantic augmentation module are illustrated on white and yellow backgrounds, respectively. For each token in the input sentence, we first extract the words most similar to the token according to their pre-trained embeddings. Then, the augmentation module uses an attention mechanism to weight the semantic information carried by the extracted words. Afterwards, the weighted semantic information is leveraged to enhance the backbone model through a gate module.
In the following text, we first introduce the encoding procedure for the augmented semantic information. Then, we present the gate module that incorporates the augmented information into the backbone model. Finally, we elaborate on the tagging procedure for NER with the aforementioned enhancement.

Attentive Semantic Augmentation
High-quality text representation is key to obtaining good model performance on many NLP tasks (Song et al., 2017; Sileo et al., 2019). However, obtaining such text representations is not easy in the social media domain because of the data sparsity problem. Motivated by this fact, we propose a semantic augmentation mechanism for social media NER that enhances the representation of each token in the input sentence with its most similar words in the semantic space, where similarity is measured by pre-trained embeddings.

Figure 2: The overall architecture of our proposed model with semantic augmentation. An example sentence and its output NE labels are given, where the augmented semantic information for the word "Chris" is also illustrated, with the processing through the augmentation module and the gate module.
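As a minimal sketch of this similar-word extraction step, nearest neighbors by cosine similarity over an embedding table can be computed as below; the toy table and the function name are illustrative, not the actual GloVe or Tencent vocabularies:

```python
import numpy as np

def top_m_similar(word, embeddings, m=3):
    """Return the m words most similar to `word` by cosine similarity.

    `embeddings` maps word -> 1-D vector, standing in for a pre-trained
    table such as GloVe or Tencent Embedding.
    """
    q = embeddings[word]
    q = q / np.linalg.norm(q)
    scores = []
    for w, v in embeddings.items():
        if w == word:
            continue  # skip the query word itself
        scores.append((float(q @ (v / np.linalg.norm(v))), w))
    scores.sort(reverse=True)  # highest cosine similarity first
    return [w for _, w in scores[:m]]

# Toy table: "chris" sits near other person names.
toy = {
    "chris":  np.array([0.9, 0.1, 0.0]),
    "jason":  np.array([0.8, 0.2, 0.1]),
    "mike":   np.array([0.85, 0.15, 0.05]),
    "guitar": np.array([0.0, 1.0, 0.3]),
}
print(top_m_similar("chris", toy, m=2))  # -> ['mike', 'jason']
```

In the actual model, m is set to 10 and the neighbors are retrieved once per token from the full pre-trained vocabulary.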
In doing so, for each token x_i ∈ X, we use pre-trained word embeddings (e.g., GloVe for English and Tencent Embedding for Chinese) to extract the top m words most similar to x_i based on cosine similarity, and denote them as C_i = {c_{i,1}, c_{i,2}, ..., c_{i,m}}. Afterwards, we use another embedding matrix to map all extracted words c_{i,j} to their corresponding embeddings e_{i,j}. Since not all c_{i,j} ∈ C_i are helpful for predicting the NE label of x_i in the given context, it is important to distinguish the contributions of different words to the NER task in that context. Considering that attention- and weight-based approaches have been demonstrated to be effective choices for selectively leveraging extra information in many tasks (Kumar et al., 2018; Margatina et al., 2019; Tian et al., 2020a,d,b,c), we propose an attentive semantic augmentation module (denoted as AU) to weight the words according to their contributions to the task in different contexts. Specifically, for each token x_i, the augmentation module assigns a weight to each word c_{i,j} ∈ C_i by

p_{i,j} = exp(h_i · e_{i,j}) / Σ_{j'=1}^{m} exp(h_i · e_{i,j'})

where h_i is the hidden vector for x_i obtained from the context encoder, with its dimension matching that of the embedding e_{i,j} of c_{i,j}. Then, we apply the weight p_{i,j} to the word c_{i,j} to compute the final augmented semantic representation by

v_i = Σ_{j=1}^{m} p_{i,j} · e_{i,j}

where v_i is the derived output of AU and contains the weighted semantic information. Therefore, the augmentation module ensures that the augmented semantic information is weighted based on its contribution, and important semantic information is distinguished accordingly.
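The weighting and aggregation of the AU module can be sketched as follows; `attentive_augment` is a hypothetical name, and the dot-product-plus-softmax scoring reflects the equations above:

```python
import numpy as np

def attentive_augment(h_i, E_i):
    """Weight the embeddings E_i (m x d) of the retrieved similar words
    by their dot-product affinity with the hidden vector h_i (d,),
    then return the weights p_{i,j} and the weighted sum v_i.
    """
    scores = E_i @ h_i                       # one scalar per similar word
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights = weights / weights.sum()        # p_{i,j}, sums to 1
    v_i = weights @ E_i                      # v_i = sum_j p_{i,j} * e_{i,j}
    return weights, v_i
```

For example, with h_i = [1, 0] and two candidate embeddings [2, 0] and [0, 2], the first candidate (better aligned with h_i) receives the larger weight, so v_i leans toward it.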

The Gate Module
We observe that the contribution of the obtained augmented semantic information to the NER task can vary across contexts, so a gate module (denoted by GA) is naturally desired to weight such information in the varying contexts. Therefore, to improve the capability of NER with the semantic information, we propose a gate module to aggregate such information into the backbone NER model. Particularly, we use a reset gate to control the information flow by

g_i = σ(W_1 · h_i + W_2 · v_i + b_g)

where W_1 and W_2 are trainable matrices and b_g is the corresponding bias term. Afterwards, we use

u_i = g_i • h_i + (1 − g_i) • v_i

to balance the information from the context encoder and the augmentation module, where u_i is the derived output of the gate module, • represents the element-wise multiplication operation, and 1 is a vector with all elements equal to 1.
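A minimal sketch of the gate computation, assuming a standard sigmoid for the reset gate (function and variable names are illustrative, not from the released code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate(h_i, v_i, W1, W2, b_g):
    """Reset-gate fusion of the context vector h_i and the augmented
    vector v_i: the closer g_i is to 1, the more the output u_i
    relies on the context encoder rather than the augmentation.
    """
    g_i = sigmoid(W1 @ h_i + W2 @ v_i + b_g)  # element-wise gate in (0, 1)
    u_i = g_i * h_i + (1.0 - g_i) * v_i       # element-wise balance
    return u_i
```

With zero-initialized W_1, W_2, and b_g, the gate opens halfway (g_i = 0.5) and u_i is simply the average of h_i and v_i; training then moves the gate toward whichever source is more useful in context.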

Tagging Procedure
To provide h_i to the augmentation module, we adopt an adapted Transformer-based context encoding module (denoted as CE) from previous work. Compared with vanilla Transformers, this encoder additionally models the direction and distance information of the input, which has been demonstrated to be useful for the NER task. Therefore, the encoding procedure for the input text can be denoted as

H = CE(E)

where H = [h_1, h_2, ..., h_n] and E = [e_1, e_2, ..., e_n] are the lists of hidden vectors and embeddings of X, respectively. In addition, since pre-trained word embeddings contain substantial context information from large-scale corpora, and different types of them may contain diverse information, a straightforward way of incorporating them is to concatenate their embedding vectors by

e_i = ⊕_{t ∈ T} e_i^(t)

where e_i is the final word embedding for x_i and T is the set of all embedding types. Afterwards, a trainable matrix W_u is used to map u_i obtained from the gate module to the output space by o_i = W_u · u_i. Finally, a conditional random field (CRF) decoder is applied to predict the label y_i ∈ L (where L is the set of all NE labels) in the output sequence Y by

ŷ_i = arg max_{y_i ∈ L} exp(W_c · o_i + b_c) / Σ_{y'_i ∈ L} exp(W'_c · o_i + b'_c)

where W_c and b_c are the trainable parameters that model the transition from y_{i−1} to y_i.
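The CRF decoding step can be illustrated with a minimal Viterbi search over emission and transition scores; this is a generic sketch that ignores start/stop transitions, not the authors' implementation:

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Find the best label sequence given emission scores (n x L),
    one row per token, and a transition matrix (L x L) scoring the
    move from label y_{i-1} to label y_i.
    """
    n, L = emissions.shape
    score = emissions[0].copy()          # best score ending in each label
    back = np.zeros((n, L), dtype=int)   # backpointers
    for i in range(1, n):
        # total[p, c]: best path ending in label p, then moving to c
        total = score[:, None] + transitions + emissions[i][None, :]
        back[i] = total.argmax(axis=0)
        score = total.max(axis=0)
    # Recover the best path by walking the backpointers.
    path = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i][path[-1]]))
    return path[::-1]
```

With a zero transition matrix this reduces to per-token argmax; non-zero transitions let the decoder forbid invalid label sequences such as "O" followed by "E-PER".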

Settings
In our experiments, we use three social media benchmark datasets: WNUT16 (W16) (Strauss et al., 2016), WNUT17 (W17) (Derczynski et al., 2017), and Weibo (WB) (Peng and Dredze, 2015), where W16 and W17 are English datasets constructed from Twitter, and WB is built from a Chinese social media platform (Sina Weibo). For all three datasets, we use their original splits and report their statistics in Table 1 (i.e., the number of sentences (#Sent.), the number of entities (#Ent.), and the percentage of unseen entities (%Uns.) with respect to the entities appearing in the training set).
For model implementation, we follow Lample et al. (2016) to use the BIOES tag schema to represent the NE labels of tokens in the input sentence.
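A minimal sketch of converting BIO annotations to the BIOES schema used here (the helper name is ours): single-token entities become S- tags and entity-final I- tags become E- tags.

```python
def bio_to_bioes(tags):
    """Convert a BIO tag sequence to the BIOES schema."""
    out = []
    for i, tag in enumerate(tags):
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        if tag == "O":
            out.append(tag)
        elif tag.startswith("B-"):
            # B- stays B- only if the entity continues; otherwise S-.
            out.append(tag if nxt == "I-" + tag[2:] else "S-" + tag[2:])
        elif tag.startswith("I-"):
            # I- stays I- only if the entity continues; otherwise E-.
            out.append(tag if nxt == "I-" + tag[2:] else "E-" + tag[2:])
    return out

print(bio_to_bioes(["B-PER", "O", "B-LOC", "I-LOC"]))
# -> ['S-PER', 'O', 'B-LOC', 'E-LOC']
```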
For the text input, we try two types of embeddings for each language. Specifically, for English, we use ELMo (Peters et al., 2018) and BERT-cased large (Devlin et al., 2019); for Chinese, we use Tencent Embedding (Song et al., 2018b) and ZEN (Diao et al., 2019). In the context encoding module, we use a two-layer adapted Transformer-based encoder with 128 hidden units and 12 heads. To extract similar words carrying augmented semantic information, we use the pre-trained word embeddings from GloVe for English and those from Tencent Embeddings for Chinese to extract the 10 most similar words (i.e., m = 10). In the augmentation module, we randomly initialize the embeddings of the extracted words (i.e., e_{i,j} for c_{i,j}) to represent the semantic information carried by those words. During the training process, we fix all pre-trained embeddings in the embedding layer and use Adam (Kingma and Ba, 2015) to optimize the negative log-likelihood loss function with the learning rate η = 0.0001, β_1 = 0.9, and β_2 = 0.99. We train each model for 50 epochs with the batch size set to 32 and tune the hyper-parameters on the development set. The model that achieves the best performance on the development set is evaluated on the test set, with F_1 scores obtained from the official conlleval toolkit.

Table 2: F_1 scores of the baseline model and ours enhanced with semantic augmentation ("SE") and the gate module ("GA") on the development (a) and test (b) sets. "DS" and "AU" represent the direct summation and attentive augmentation module, respectively. Y and N denote the use and non-use of the corresponding modules.
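Entity-level F_1 in the conlleval style can be sketched as below, assuming BIOES-tagged sequences; a predicted entity counts only when both its boundaries and its type match a gold entity exactly (function names are illustrative):

```python
def spans(tags):
    """Extract (start, end, type) entity spans from a BIOES sequence."""
    out, start = [], None
    for i, t in enumerate(tags):
        if t.startswith("S-"):
            out.append((i, i, t[2:]))        # single-token entity
        elif t.startswith("B-"):
            start = i                         # entity opens
        elif t.startswith("E-") and start is not None:
            out.append((start, i, t[2:]))     # entity closes
            start = None
    return out

def f1(gold, pred):
    """Entity-level F1: exact boundary and type match."""
    g, p = set(spans(gold)), set(spans(pred))
    tp = len(g & p)
    if tp == 0:
        return 0.0
    prec, rec = tp / len(p), tp / len(g)
    return 2 * prec * rec / (prec + rec)
```

For instance, a prediction that recovers the one gold entity but also hallucinates a second one gets recall 1.0 but precision 0.5, hence F_1 = 2/3.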

Overall Results
To explore the effect of the proposed attentive semantic augmentation module (AU ) and the gate module (GA), we run different settings of our model with and without the modules. In addition, we also try baselines that use direct summation (DS) to leverage the semantic information carried by the similar words, where the embeddings of the words are directly summed without weighting through attentions. The experimental results (F 1) of the baselines and our approach on the development and test sets of all datasets are reported in Table 2(a) and (b), respectively.
There are some observations from the results on the development and test sets. First, compared to the baseline without semantic augmentation (ID=1), models using direct summation (DS, ID=2) to incorporate semantic information undermine NER performance on two of the three datasets, namely W17 and WB; on the contrary, models using the proposed attentive semantic augmentation module (AU, ID=4) consistently outperform the baselines (ID=1 and ID=2) on all datasets. This indicates that AU can distinguish the contributions of the semantic information carried by different words in the given context and leverage them accordingly to improve NER performance. Second, comparing the results of models with and without the gate module (GA) (i.e., ID=3 vs. ID=2 and ID=5 vs. ID=4), we find that the models with the gate module achieve superior performance to those without it. This observation suggests that the importance of the information from the context encoder and AU varies, and that the proposed gate module is effective in adjusting the weights according to their contributions.
Moreover, we compare our model under the best setting with previous models on all three datasets in Table 3, where new state-of-the-art performance is established. The reason could be that, compared to previous studies, our model effectively alleviates the data sparsity problem in social media NER through the augmentation module that encodes augmented semantic information. Besides, the gate module can distinguish the importance of the information from the context encoder and AU according to their contributions to NER.

Performance on Unseen Named Entities
This work focuses on addressing the data sparsity problem in social media NER, where unseen NEs are one of the important factors that hurt model performance. To analyze whether our approach with attentive semantic augmentation (AU) and the gate module (GA) can address this problem, we report the recall of our approach (i.e., "+AU+GA") on the unseen NEs in the test sets of all datasets in Table 4. For reference, we also report the recall of the baseline without AU and GA, as well as our runs of previous studies (marked by " * "). It is clearly observed that our approach outperforms the baseline and previous studies on unseen NEs on all datasets, which shows that it can appropriately leverage the semantic information carried by similar words and thus alleviate the data sparsity problem.
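The unseen-entity recall reported here can be sketched as below, treating entities as plain (surface, type) tuples; the function name and data layout are illustrative, not the evaluation code:

```python
def unseen_recall(train_entities, gold_entities, pred_entities):
    """Recall restricted to gold entities whose surface form never
    appears among the training-set entities.
    """
    seen = {surface for surface, _ in train_entities}
    unseen_gold = [e for e in gold_entities if e[0] not in seen]
    if not unseen_gold:
        return 0.0
    # A hit requires both surface and type to be predicted correctly.
    hit = sum(1 for e in unseen_gold if e in pred_entities)
    return hit / len(unseen_gold)
```

For example, if "Chris" appears in training but "Mike" does not, only "Mike" counts toward the unseen denominator, so recovering it alone yields recall 1.0.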

Case Study
To demonstrate how the augmented semantic information improves NER with the attentive augmentation module and the gate module, we show the extracted augmented information for the word "Chris" and visualize the weights of each augmented term in Figure 3, where deeper color refers to higher weight. In this case, the words "steve" and "jason" receive higher weights in AU. The explanation could be that, in all cases, these two words refer to a kind of "Person". Thus, higher attention to these terms helps our model identify the correct NE label. On the contrary, the terms "anderson" and "andrew" never occur in the dataset, and therefore provide no helpful effect in this case, ending up with lower weights in AU. In addition, a model could also mislabel "Chris" as "Music-Artist", because "Chris" belongs to that NE type in most cases and the word "filming" appears in its context. However, our model with the gate module can determine that the information from semantic augmentation is more important and thus makes the correct prediction.

Table 4: The recall of our models with and without the attentive semantic augmentation (AU) and the gate module (GA) on unseen named entities (whose numbers are also reported in the first row) on all three datasets. The results of our runs of previous models (marked with " * ") are also reported for reference.

Conclusion
In this paper, we proposed a neural-based approach to enhance social media NER with semantic augmentation to alleviate the data sparsity problem. In particular, an attentive semantic augmentation module is proposed to encode semantic information, and a gate module is applied to aggregate such information for the tagging process. Experiments conducted on three benchmark datasets in English and Chinese show that our model outperforms previous studies and achieves new state-of-the-art results.

In our main experiments, we use two types of embeddings for each language: ELMo (Peters et al., 2018) and BERT-cased large (Devlin et al., 2019) for English, and Tencent Embedding (Song et al., 2018b) and ZEN (Diao et al., 2019) for Chinese. In Table 5, we report the results (F_1 scores) of our model with the best setting (i.e., the full model with semantic augmentation (AU) and the gate module (GA)) as well as the baselines without AU and GA, where either one of the two types of embeddings is used to represent the input sentence. From the results, we find that our model with AU and GA consistently outperforms the baseline models under different settings of embeddings.

Table 7: Experimental results (F_1 scores) of our best-performing models (i.e., the ones with AU and GA) using different types of pre-trained embeddings as the source to extract similar words. The results of the baseline (the one without AU and GA) are also reported.
In addition to using embeddings for input sentence representation, we also try different embedding sources (i.e., pre-trained word embeddings) to extract similar words for each token in the input sentence. For English, we use Word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014); for Chinese, we use Giga (Zhang and Yang, 2018) and Tencent Embedding (Song et al., 2018b). The experimental results of our model with the best setting (i.e., the one with AU and GA) using different sources are reported in Table 7. The result of the baseline model without AU and GA is also reported for reference. The results show that our approach consistently outperforms the baseline with different sources for finding similar words, which demonstrates the robustness of our approach. We report all values of the hyper-parameters tried for our models in Table 8, where we try different combinations of them and find the best hyper-