Weakly Supervised Attention Networks for Entity Recognition

Entity recognition has traditionally been modelled as a sequence labelling task. However, this usually requires a large amount of fine-grained data annotated at the token level, which can be expensive and cumbersome to obtain. In this work, we aim to circumvent this requirement for word-level annotated data. To achieve this, we propose a novel architecture for entity recognition from a corpus containing weak binary presence/absence labels, which are easier to obtain. We show that our proposed weakly supervised model, trained solely on a multi-label classification task, performs reasonably well on the task of entity recognition, despite not having access to any token-level ground truth data.


Introduction
Entity recognition is frequently used as a first step in numerous downstream NLP tasks (Wang and Xue, 2017; Liang et al., 2018). Traditionally, it has been posed as a sequence labeling task (Lample et al., 2016; Ma and Hovy, 2016). A key drawback of this formulation lies in its dependence on corpora annotated at the token level, which can be tedious to obtain and expensive to annotate.
One potential way of overcoming this limitation is to move towards a method that utilizes a weaker form of supervision that is easier to obtain. In this work, we focus on one such form of weak supervision: binary labels that indicate the presence of an entity type. The cognitive load of deciding whether an entity type is present or not is usually lower than that of highlighting and annotating spans with their correct entity types. It also stands to reason that providing these binary labels might be faster. Both these properties are particularly advantageous for a human-in-the-loop setup in a user-facing task, since a user is more likely to answer a yes/no question than to provide the annotated entity spans. This, in turn, facilitates cheaper and faster data collection, be it explicitly in the form of feedback questions, or implicitly from mined user logs (for example, clicked search engine results for queries related to "movies" are likely to contain the entity in question (Xu et al., 2009)).
In this work, we take first steps towards moving away from span-based corpora, relying solely on binary presence/absence classification labels for extracting entities. We propose a novel attention-based model that, though trained on a multi-label classification task, can be used for entity recognition. We show the efficacy of our proposed model on the widely used CoNLL 2003 dataset. Our model achieves reasonable performance without having access to token-level annotations. We thus show that it is possible to extract entities using a weak classification signal.

Related Work
Commonly used methods for entity extraction rely on token-level annotated corpora. Conventionally, these supervised methods learn a CRF (Lafferty et al., 2001) or a Seq2Seq (Sutskever et al., 2014) model over either hand-crafted or neural features. More recently, pre-trained embeddings from language models trained on large corpora (Peters et al., 2018; Devlin et al., 2018), when augmented with previous methods, have shown marked improvements.
A contrasting line of work has been to explore unsupervised entity extraction without using ground-truth token-level or sentence-level annotations. For example, a common paradigm for unsupervised Named Entity Recognition involves relying on a seed gazetteer, as in the case of Zhang and Elhadad (2013) and Ghiasvand and Kate (2015), both in the medical domain. In a more general setting, Carlson et al. (2009) use gazetteers to bootstrap training by labelling sequences that can be confidently annotated, and then use this partly labelled data to train their proposed Partial Perceptron algorithm. Although they make use of a gazetteer, theirs is the closest setting to our proposed approach, so we use Carlson et al. (2009) as our primary baseline. Another common technique involves bootstrapping the system with a set of rule templates, such as in Etzioni et al. (2005) and Collins and Singer (1999). However, these methods often rely on an initial seed rule-base or on the availability of gazetteers (or the effective generation of these gazetteers using an online source such as Wikipedia).
Aside from directly improving performance on various tasks, attention (Luong et al., 2015) has proven to be extremely useful when used indirectly in a wide variety of other ways (for example, for segmentation (Tang and Yang, 2018) and unsupervised speech-to-text alignment (Boito et al., 2017; Godard et al., 2018)). In addition, using attention-based models for object segmentation in a weakly supervised setting has been well explored in the vision domain (Teh et al., 2016; Zhang et al., 2018). Inspired by this, we leverage the attention weights of the model to identify entity spans.

Method
Figures 1 and 2 describe the different components of our model. Our model comprises four modules: a token representation module, an attention module, a sentence-level classifier and a token-level tagger. Concretely, given a sentence $(x_1, \dots, x_l)$, the model predicts whether a tag $t \in T$ is present in the sentence, as well as an attention distribution over the words for each tag, $(\alpha_{1,t}, \dots, \alpha_{l,t}) \; \forall t \in T$ (where $T$ is the set of entity tags), as described below.
Token Representation Module: generates an embedding for each token, representing the word's meaning and its left and right context. We use BERT embeddings (Devlin et al., 2018) for generating a word-level representation. We further use a Bidirectional GRU layer to better adapt the BERT embeddings to the task, obtaining the token representations $(e_1, \dots, e_l)$.
Attention Module: consists of an attention mechanism for each tag type. Given the token embeddings, a softmax distribution pertaining to the corresponding tag is generated, modelling the conditional probability of a word in a sentence being of that tag, given that the entire sentence contains the tag. We compute attention using one learned query vector $q_t$ per tag $t \in T$. The token embeddings $(e_1, \dots, e_l)$ are passed through a dense layer to generate keys $(k_1, \dots, k_l)$ in the query space, which together with the query $q_t$ yield a set of attention weights $(\alpha_{1,t}, \dots, \alpha_{l,t})$.
Sentence-level Classifier: generates the probability of the presence of a tag in the sentence. For each tag, the token-level representations $(e_1, \dots, e_l)$ are weighted by the attention distribution $(\alpha_{1,t}, \dots, \alpha_{l,t})$ corresponding to the tag. The weighted sum generates a sentence representation $s_t$ per tag, which is passed through a sigmoid layer to generate the probability of the tag being present ($p_t$), with $p_t > 0.5$ denoting the presence of a tag.
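The attention and sentence-level classification steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation used in the paper: the dot-product scoring between $q_t$ and $k_i$, the linear layer before the sigmoid (`w_out`, `b_out`), and all toy values are assumptions made for the example.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def tag_attention(embeddings, w_key, query):
    """Attention weights alpha_{i,t} over tokens for a single tag t."""
    # Dense layer (assumed bias-free here) maps embeddings e_i to keys k_i.
    keys = [[dot(row, e) for row in w_key] for e in embeddings]
    # Assumption: scores are dot products between the query q_t and each key.
    return softmax([dot(query, k) for k in keys])

def tag_probability(embeddings, alphas, w_out, b_out):
    """p_t: sigmoid over the attention-weighted sentence representation s_t."""
    dim = len(embeddings[0])
    s_t = [sum(a * e[d] for a, e in zip(alphas, embeddings)) for d in range(dim)]
    # Assumption: the "sigmoid layer" is a linear map followed by a sigmoid.
    return 1.0 / (1.0 + math.exp(-(dot(w_out, s_t) + b_out)))

# Toy sentence with three tokens and 2-dimensional embeddings (hypothetical values).
embeddings = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
w_key = [[1.0, 0.0], [0.0, 1.0]]
q_loc = [1.0, 0.0]  # learned query for one tag, e.g. LOC

alphas = tag_attention(embeddings, w_key, q_loc)
p_loc = tag_probability(embeddings, alphas, w_out=[1.0, 0.0], b_out=-1.0)
```

In this toy setup the third token receives most of the attention mass, and `p_loc` exceeds 0.5, so the tag would be treated as present in the sentence.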
Token Tagger Module: combines the probabilities from the Sentence-level Classifier with the attention weights obtained from the Attention Module to generate BIO tags for each token. Only the attention weights pertaining to the predicted labels $T'$ are considered (i.e., if no tag is predicted, the entire sentence is marked "O"), with the attention weights being scaled by the probability of the predicted label (i.e., $p_t \cdot (\alpha_{1,t}, \dots, \alpha_{l,t})$). A word $x_i$ is assigned the label $y_i = \arg\max_{t \in T'} (p_t \cdot \alpha_{i,t})$ if $p_{y_i} \cdot \alpha_{i,y_i}$ is greater than a small threshold $\epsilon$ and it is neither a punctuation symbol nor a stop-word (Figure 2).
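The decoding rule of the Token Tagger Module can be illustrated with the following sketch. The function name `tag_tokens`, the stop-word list, and the toy probabilities are hypothetical. The text does not specify how B- and I- prefixes are assigned, so the sketch assumes that adjacent tokens with the same predicted type form a single span.

```python
def tag_tokens(tokens, tag_probs, attention, stop_words, eps=0.01):
    """Assign BIO labels from sentence-level probabilities and attention weights."""
    # Only tags that the sentence-level classifier predicts as present (p_t > 0.5).
    predicted = [t for t, p in tag_probs.items() if p > 0.5]
    raw = []
    for i, token in enumerate(tokens):
        is_symbol = not any(ch.isalnum() for ch in token)
        if not predicted or is_symbol or token.lower() in stop_words:
            raw.append("O")
            continue
        # Pick the predicted tag with the largest scaled weight p_t * alpha_{i,t}.
        best = max(predicted, key=lambda t: tag_probs[t] * attention[t][i])
        score = tag_probs[best] * attention[best][i]
        raw.append(best if score > eps else "O")
    # Assumption: adjacent tokens with the same type belong to the same span.
    bio, previous = [], "O"
    for label in raw:
        bio.append("O" if label == "O" else ("I-" if label == previous else "B-") + label)
        previous = label
    return bio

# Hypothetical example; only tags with p_t > 0.5 need attention weights here.
STOP_WORDS = {"in", "at", "of", "and", "the"}
tokens = ["Obama", "visited", "Paris", "in", "May", "."]
tag_probs = {"PER": 0.9, "LOC": 0.8, "ORG": 0.1}
attention = {
    "PER": [0.96, 0.005, 0.01, 0.01, 0.01, 0.005],
    "LOC": [0.005, 0.005, 0.6, 0.37, 0.005, 0.015],
}
labels = tag_tokens(tokens, tag_probs, attention, STOP_WORDS)
```

Note that "in" receives a large LOC attention weight in this toy example but is left untagged because it is a stop-word, which mirrors the indicator-word behaviour discussed later in the paper.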

Experiments
Dataset: To demonstrate the feasibility of our model, we adapt the commonly used CoNLL 2003 dataset (Ratinov and Roth, 2009). The dataset contains token-level annotations of the Reuters Corpus (Lewis et al., 2004) for 4 entity types: person (PER), location (LOC), organization (ORG) and miscellaneous (MISC), with each token being tagged in the IOB format (Ramshaw and Marcus, 1999). For training and validation, we strip out the token-level annotations, instead annotating sentences to merely indicate the presence of entity types. In order to quantify the quality of the extracted entities, we then measure the span-level F-score on the test set using the gold entity spans.
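As an illustration of the label conversion described above, the following sketch derives sentence-level presence/absence labels from a sentence's IOB tags. The function name `presence_labels` and the example tags are hypothetical and not taken from the paper's code.

```python
ENTITY_TYPES = ("PER", "LOC", "ORG", "MISC")

def presence_labels(iob_tags, entity_types=ENTITY_TYPES):
    """Collapse token-level IOB tags into one binary label per entity type."""
    # "B-PER" and "I-PER" both indicate that a PER entity is present.
    present = {tag.split("-", 1)[1] for tag in iob_tags if tag != "O"}
    return {etype: int(etype in present) for etype in entity_types}

# Hypothetical tagged sentence: a two-token PER span and a one-token LOC span.
labels = presence_labels(["B-PER", "I-PER", "O", "B-LOC", "O"])
```

Only the resulting dictionary would be visible during training and validation; the positions of the entities are discarded.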
Modelling Choices and Hyperparameters: We use the BERT-Base Multilingual Cased model for the Token Representation Module, similar to the NER tagging setup in Devlin et al. (2018). We experiment both with using the embeddings as is, and with fine-tuning the top layers of the attention encoder (we only try fine-tuning the top, top-two and top-three layers due to computational constraints). For the Token Tagger Module, we find that the attention probabilities usually demarcate the tagged words quite clearly, and that an $\epsilon$ value of 0.01 performs reasonably well. We use early stopping based on the (averaged) validation accuracy of the sentence-level predictions. We also observe that fine-tuning the BERT model requires learning rates comparable in order of magnitude to those used in Devlin et al. (2018), and hence use a learning rate of 2e-7 for fine-tuning the transformer layers, and 1e-3 for the rest of the network. More details related to the hyperparameters used are presented in Appendix A. All our models have been implemented using the AllenNLP framework (Gardner et al., 2017).

Results and Analysis
Type  Model                    F-score  Acc
S     (Lample et al., 2016)    90.9     -
S     (Peters et al., 2018)    92.2     -
S     (Devlin et al., 2018)    92.8     -
US    (Carlson et al., 2009)
Table 1 shows the performance of our model. We observe that our proposed approach significantly outperforms the baseline, and performs reasonably well when compared to various state-of-the-art supervised approaches that use significantly more ground-truth annotation information.
To further investigate the impact of the different components of the model, we ablate our model components (Table 1). We observe that having a contextual GRU layer for adapting to the task has a significant impact on the performance, with the final model performing much better than BERT + PS. Further, fine-tuning and probability scaling also help improve model performance.

Impact of Stop-Word Removal
We find that the model learns to focus on indicator words that frequently occur in sentences where a particular tag is present, and tends to use them to identify the presence of entities at the sentence level. For example, the word "at" is often indicative of the existence of a location in a sentence, the name of the location itself aside, and the model tends to focus on both. This behaviour is unsurprising, given that the model is trained purely using a signal of whether or not a sentence contains an entity of that type, with no idea about the entity boundaries themselves. Based on what we commonly observe, a few of these indicator words include: prepositions such as "at" and "in" when a LOC entity is present; common titles preceding PER names (such as "President" X or "Miss" Y); conjunctions that might separate two or more entities (such as X "and" Y).
This results in the model picking out spurious words alongside the actual entity, which necessitates the use of stop-word and symbol removal. However, this removal also prevents the model from picking out such words when they occur within an entity span (for example, "Republic of Iceland"). This is particularly problematic for ORG (4.85% of spans contain stop-words) and MISC (3.15%), compared to LOC (0.6%) and PER (0.76%). This issue can potentially be mitigated with either a more sophisticated tagger module or a better stop-word/symbol removal mechanism, which we leave to future work.
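The per-type percentages above are a simple count over gold spans. A minimal sketch of how such a statistic could be computed is shown below. The function name, the stop-word list and the toy spans (including their entity types) are hypothetical, and the stop-word list used in the paper is not specified here.

```python
def stop_word_span_rate(spans, stop_words):
    """Percentage of spans per entity type that contain at least one stop-word."""
    totals, hits = {}, {}
    for etype, tokens in spans:
        totals[etype] = totals.get(etype, 0) + 1
        if any(tok.lower() in stop_words for tok in tokens):
            hits[etype] = hits.get(etype, 0) + 1
    return {etype: 100.0 * hits.get(etype, 0) / n for etype, n in totals.items()}

# Toy gold spans; the types are assigned purely for illustration.
STOP_WORDS = {"of", "the", "and", "in", "at"}
spans = [
    ("LOC", ["Republic", "of", "Iceland"]),
    ("LOC", ["Paris"]),
    ("ORG", ["Bank", "of", "England"]),
    ("PER", ["John", "Smith"]),
]
rates = stop_word_span_rate(spans, STOP_WORDS)
```

Any span counted here would be split or truncated by the tagger once its stop-words are forced to "O", and so cannot be recovered as an exact match.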

Errors due to Incorrect Entity Boundaries
[Table 2. Columns: Model, Text, Type, Micro.]
Since our model does not have access to annotated training data, it has no direct supervision for learning entity boundaries. This particularly hurts the model in the CoNLL task, since precision, recall and F-score are measured based on an exact string match. In order to investigate this, we use the metric used in MUC events (Grishman and Sundheim, 1996; Chinchor, 1998), wherein a system is scored on two axes: finding the correct text and the correct type. A text is correct if the entity boundaries are correct, regardless of the type, while a type is correct if an entity is tagged with the correct type, regardless of the boundaries, as long as there is an overlap with the gold entity. The final score is a micro-averaged F-measure (see Nadeau and Sekine (2007) for more details).
Table 2 shows the performance of the model along the two axes, as well as the MUC score. We also report the same metrics for the supervised model proposed in Peters et al. (2018). We see that the model primarily loses out on the text-based measure, and performs quite well on the type-based one. This is in accordance with our hypothesis that the model identifies the correct entities, but fails at finding the exact entity boundaries. A better boundary detection method can consequently be used alongside our model to improve the entity retrievals in a downstream task.
The examples show some of the model's shortcomings. As described in Section 5.1 (Impact of Stop-Word Removal), indicator tokens like "President" (Example 3) are usually marked as the PER entity type, which gets penalized when compared to the gold labels. Another common failure case we observe is when outside domain knowledge is necessary to disambiguate the entity type. For example, in Example 4, the model predicts "Philadelphia" as LOC, while the correct tag is ORG (referring to the Philadelphia Eagles). Similarly, in Example 5, the model classifies "Hampshire" as LOC, while the correct tag is ORG (the cricket club).
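The MUC-style scoring used in this section can be sketched as follows. This is a simplified reading of the metric as summarized in Nadeau and Sekine (2007), not the scorer used for Table 2: spans are assumed to be `(start, end, type)` triples with an exclusive end index, and the micro average pools the correct, predicted and gold counts over both axes.

```python
def f_measure(correct, predicted, gold):
    precision = correct / predicted if predicted else 0.0
    recall = correct / gold if gold else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def muc_scores(gold, pred):
    """Score predicted spans on the text axis, the type axis and their micro average."""
    # Text axis: boundaries match a gold span exactly, regardless of type.
    text_correct = sum(
        1 for s, e, _ in pred if any(s == gs and e == ge for gs, ge, _ in gold)
    )
    # Type axis: type matches a gold span that overlaps the prediction.
    type_correct = sum(
        1 for s, e, t in pred
        if any(t == gt and s < ge and gs < e for gs, ge, gt in gold)
    )
    return {
        "text": f_measure(text_correct, len(pred), len(gold)),
        "type": f_measure(type_correct, len(pred), len(gold)),
        # Micro average: pool the counts from both axes.
        "micro": f_measure(text_correct + type_correct, 2 * len(pred), 2 * len(gold)),
    }

# Hypothetical example: the PER span has the right type but a wrong left boundary.
gold = [(0, 2, "PER"), (4, 5, "LOC")]
pred = [(1, 2, "PER"), (4, 5, "LOC")]
scores = muc_scores(gold, pred)
```

In this toy case the type score is perfect while the text score is halved, which is the pattern the analysis above attributes to boundary errors.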

Conclusion
We present a novel method for entity recognition using a relatively weak supervision signal. Our proposed model, trained on a multi-label classification task, achieves reasonable entity recognition performance.
While our proposed method is simple, we demonstrate that it works surprisingly well. Various other formulations for this task are possible: for example, one might involve marginalizing over the tags and using this for predicting a label; another could perform attention over spans instead of tokens. We plan on investigating these alternate approaches in future work.