Global Attention for Name Tagging

Many name tagging approaches use local contextual information with much success, but can fail when the local context is ambiguous or limited. We present a new framework to improve name tagging by utilizing local, document-level, and corpus-level contextual information. For each word, we retrieve document-level context from other sentences within the same document and corpus-level context from sentences in other documents. We propose a model that learns to incorporate document-level and corpus-level contextual information alongside local contextual information via document-level and corpus-level attentions, which dynamically weight their respective contextual information, and gating mechanisms, which determine the influence of this information. Experiments on benchmark datasets show the effectiveness of our approach, which achieves state-of-the-art results for Dutch, German, and Spanish on the CoNLL-2002 and CoNLL-2003 datasets. We will make our code and pre-trained models publicly available for research purposes.

Additional context may be found from other sentences in the same document as the query sentence (document-level). In Figure 1, the sentences in the document-level supporting evidence provide clearer clues to tag "Zywiec" as ORG, such as the references to "Zywiec" as a "firm". A concern with leveraging this information is the amount of noise that is introduced. However, across all the data in our experiments (Section 3.1), we find that an average of 35.43% of named entity mentions in each document are repeats and, when a mention appears more than once in a document, an average of 98.78% of these mentions have the same type. Consequently, one may use the document-level context to overcome the ambiguities of the local context while introducing little noise.
Although a significant number of named entity mentions are repeated, 64.57% of the mentions are unique. In such cases, the sentences at the document-level cannot serve as a source of additional context. Nevertheless, one may find additional context from sentences in other documents in the corpus (corpus-level). Figure 1 shows some of the corpus-level supporting evidence for "Zywiec". In this example, similar to the document-level supporting evidence, the first sentence in this corpus-level evidence discusses the branding of "Zywiec", corroborating the ORG tag. Although the second sentence introduces noise because it has a different topic than the current sentence and discusses the Polish town named "Zywiec", one may filter these noisy contexts, especially when the noisy contexts are accompanied by clear contexts like the first sentence.
We propose to utilize local, document-level, and corpus-level contextual information to improve name tagging. Generally, we follow the one sense per discourse hypothesis introduced by Yarowsky (2003). Some previous name tagging efforts apply this hypothesis to conduct majority voting for multiple mentions with the same name string in a discourse through a cache model (Florian et al., 2004) or post-processing (Hermjakob et al., 2017). However, these rule-based methods require manual tuning of thresholds. Moreover, it is challenging to explicitly define the scope of discourse. We propose a new neural network framework with global attention to tackle these challenges. Specifically, for each token in a query sentence, we propose to retrieve sentences that contain the same token from the document-level and corpus-level contexts (e.g., document-level and corpus-level supporting evidence for "Zywiec" in Figure 1). To utilize this additional information, we propose a model that first produces representations for each token that encode the local context from the query sentence as well as the document-level and corpus-level contexts from the retrieved sentences. Our model uses a document-level attention and a corpus-level attention to dynamically weight the document-level and corpus-level contextual representations, emphasizing the contextual information from each level that is most relevant to the local context and filtering noise such as the irrelevant information from the mention "[LOC Zywiec]" in Figure 1. The model learns to balance the influence of the local, document-level, and corpus-level contextual representations via gating mechanisms. Our model predicts a tag using the local, gated-attentive document-level, and gated-attentive corpus-level contextual representations, which allows our model to predict the correct tag, ORG, for "Zywiec" in Figure 1.
The major contributions of this paper are as follows. First, we propose to use multiple levels of contextual information (local, document-level, and corpus-level) to improve name tagging. Second, we present two new attentions, document-level and corpus-level, which prove to be effective at exploiting extra contextual information and achieve state-of-the-art results.

Model
We first introduce our baseline model. Then, we enhance this baseline model by adding document-level and corpus-level contextual information to the prediction process via our document-level and corpus-level attention mechanisms, respectively.

Baseline
We consider name tagging as a sequence labeling problem, where each token in a sequence is tagged as the beginning (B), inside (I) or outside (O) of a name mention. The tagged names are then classified into predefined entity types. In this paper, we only use the person (PER), organization (ORG), location (LOC), and miscellaneous (MISC) types, which are the predefined types in the CoNLL-02 and CoNLL-03 name tagging datasets (Tjong Kim Sang and De Meulder, 2003).
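To make the B/I/O scheme concrete, the following is a minimal Python sketch that converts a tag sequence into typed name spans. The function name `bio_to_spans` and the example tags are illustrative assumptions, not part of the described system.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert BIO tags into (start, end_exclusive, entity_type) spans.

    Assumption for this sketch: an "I-X" tag that does not continue a
    mention of type X is ignored rather than repaired.
    """
    spans = []
    start, cur_type = None, None
    # A trailing "O" sentinel flushes a mention that ends the sentence.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and cur_type == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, cur_type))
            start, cur_type = None, None
        if tag.startswith("B-"):
            start, cur_type = i, tag[2:]
    return spans


# Illustrative tags for "Polish brewer Zywiec 's 1996 profit"
example_tags = ["O", "O", "B-ORG", "O", "O", "O"]
example_spans = bio_to_spans(example_tags)
```

Here `example_spans` holds a single ORG span covering the token at index 2.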
Our baseline model has two parts: 1) Encoding the sequence of tokens by incorporating the preceding and following contexts using a bi-directional long short-term memory (Bi-LSTM) (Graves et al., 2013), so each token is assigned a local contextual embedding. Here, following Ma and Hovy (2016a), we use the concatenation of pre-trained word embeddings and character-level word representations composed by a convolutional neural network (CNN) as input to the Bi-LSTM. 2) Using a conditional random field (CRF) output layer to render predictions for each token, which can efficiently capture dependencies among name tags (e.g., "I-LOC" cannot follow "B-ORG").
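The tag dependency mentioned above ("I-LOC" cannot follow "B-ORG") can be illustrated with a small Python check. This is only a hard-rule sketch of the constraint; a CRF layer learns soft transition scores rather than applying such a rule, and the function name `is_valid_transition` is a hypothetical helper.

```python
def is_valid_transition(prev_tag: str, next_tag: str) -> bool:
    """Return True if next_tag may follow prev_tag under the BIO scheme."""
    if not next_tag.startswith("I-"):
        # "O" and any "B-X" tag may follow any tag.
        return True
    entity_type = next_tag[2:]
    # "I-X" must continue a mention of the same type X.
    return prev_tag in ("B-" + entity_type, "I-" + entity_type)
```

For instance, `is_valid_transition("B-ORG", "I-LOC")` is false, while `is_valid_transition("B-ORG", "I-ORG")` is true.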
The Bi-LSTM CRF network is a strong baseline due to its remarkable capability of modeling contextual information and label dependencies. Many recent efforts combine the Bi-LSTM CRF network with language modeling (Liu et al., 2017; Peters et al., 2017, 2018) to boost the name tagging performance. However, they still suffer from the limited contexts within individual sequences. To overcome this limitation, we introduce two attention mechanisms to incorporate document-level and corpus-level supporting evidence.

Document-level Attention
Many entity mentions are tagged as multiple types by the baseline approach within the same document due to ambiguous contexts (14.43% of the errors in English, 18.55% in Dutch, and 17.81% in German). This type of error is challenging to address as most of the current neural network based approaches focus on evidence within the sentence when making decisions. In cases where a sentence is short or highly ambiguous, the model may either fail to identify names due to insufficient information or make wrong decisions by using noisy context. In contrast, a human in this situation may seek additional evidence from other sentences within the same document to improve judgments.
In Figure 1, the baseline model mistakenly tags "Zywiec" as PER due to the ambiguous context "whose full name is...", which frequently appears around a person's name. However, contexts from other sentences in the same document containing "Zywiec" (e.g., $s_q$ and $s_r$ in Figure 2), such as "'s 1996 profit..." and "would be boosted by its recent shedding...", indicate that "Zywiec" ought to be tagged as ORG. Thus, we incorporate the document-level supporting evidence with the following attention mechanism (Bahdanau et al., 2015).
Formally, given a document $D = \{s_1, s_2, \ldots\}$, where $s_i = \{w_{i1}, w_{i2}, \ldots\}$ is a sequence of words, we apply a Bi-LSTM to each word in $s_i$, generating local contextual representations $h_i = \{h_{i1}, h_{i2}, \ldots\}$. Next, for each $w_{ij}$, we retrieve the sentences in the document that contain $w_{ij}$ (e.g., $s_q$ and $s_r$ in Figure 2) and select the local contextual representations of $w_{ij}$ from these sentences as supporting evidence, $\tilde{h}_{ij} = \{\tilde{h}_{ij}^1, \tilde{h}_{ij}^2, \ldots\}$ (e.g., $\tilde{h}_{qj}$ and $\tilde{h}_{rk}$ in Figure 2), where $h_{ij}$ and $\tilde{h}_{ij}$ are obtained with the same Bi-LSTM. Since the representations in the supporting evidence are not all equally valuable to the final prediction, we apply an attention mechanism to weight the contextual representations of the supporting evidence. The attention weight of each supporting representation is computed from $h_{ij}$, the local contextual representation of word $j$ in sentence $s_i$, and $\tilde{h}_{ij}^k$, the $k$-th supporting contextual representation; $W_h$, $W_{\tilde{h}}$, and $b_e$ are learned parameters. We then compute the weighted average of the supporting representations, $\tilde{H}_{ij}$, which denotes the contextual representation of the supporting evidence for $w_{ij}$.
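As a minimal sketch of this weighting step, the Python below assumes the additive scoring form of Bahdanau et al. (2015): each supporting representation is scored as v · tanh(W_h h + W_s s_k + b_e), the scores are normalized with a softmax, and the supporting representations are averaged with the resulting weights. The scoring vector `v`, the function names, and the plain-list representation are assumptions made for illustration; `W_s` stands in for the parameter applied to the supporting representations.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Multiply matrix m by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def softmax(scores: Vector) -> Vector:
    """Normalize scores into weights that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(h: Vector, support: List[Vector], W_h: Matrix, W_s: Matrix,
           b_e: Vector, v: Vector) -> Tuple[Vector, Vector]:
    """Weight the supporting representations against the query representation h.

    Assumes at least one supporting representation is available.
    Returns (attention weights, weighted average of `support`).
    """
    query_part = matvec(W_h, h)
    scores = []
    for s_k in support:
        hidden = [math.tanh(q + s + b)
                  for q, s, b in zip(query_part, matvec(W_s, s_k), b_e)]
        scores.append(sum(v_d * h_d for v_d, h_d in zip(v, hidden)))
    alpha = softmax(scores)
    dim = len(support[0])
    summary = [sum(a * s_k[d] for a, s_k in zip(alpha, support))
               for d in range(dim)]
    return alpha, summary
```

In a trained model the parameters would be learned jointly with the Bi-LSTM; here they are simply passed in as arguments.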
For each word $w_{ij}$, its supporting evidence representation, $\tilde{H}_{ij}$, provides a summary of the other contexts where the word appears. Though this evidence is valuable to the prediction process, we must limit its influence, since the prediction should still be made primarily based on the query context. Therefore, we apply a gating mechanism to constrain this influence and enable the model to decide how much of the supporting evidence should be incorporated in the prediction process. All weights $W$ and biases $b$ of the gate are learned parameters, and its output, $D_{ij}$, is the gated supporting evidence representation for $w_{ij}$.
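The idea of gating can be illustrated with a deliberately simplified Python sketch: a single sigmoid gate, computed from the local representation and the evidence summary, scales each dimension of the evidence. This one-gate form, the function names, and the parameter names `W_h`, `W_H`, and `b` are assumptions for illustration and are not claimed to match the exact gate used in the model.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gate_evidence(h: Vector, H: Vector, W_h: Matrix, W_H: Matrix,
                  b: Vector) -> Vector:
    """Scale each dimension of the evidence summary H by a gate in (0, 1).

    The gate for dimension d is sigmoid(W_h[d] . h + W_H[d] . H + b[d]).
    """
    gated = []
    for d in range(len(H)):
        pre_activation = (sum(w * x for w, x in zip(W_h[d], h))
                          + sum(w * x for w, x in zip(W_H[d], H))
                          + b[d])
        gated.append(sigmoid(pre_activation) * H[d])
    return gated
```

A strongly negative bias closes the gate, so the evidence is mostly ignored, while a strongly positive one passes it through almost unchanged.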

Topic-aware Corpus-level Attention
The document-level attention fails to generate supporting evidence when the name appears only once in a single document. In such situations, we analogously select supporting sentences from the entire corpus. Unfortunately, unlike the sentences within the same document, which are naturally topically relevant, the supporting sentences from other documents may be about distinct topics or scenarios, and identical phrases may refer to various entities with different types, as in the example in Figure 1. To narrow down the search scope from the entire corpus and avoid unnecessary noise, we introduce a topic-aware corpus-level attention which clusters the documents by topic and carefully selects topically related sentences to use as supporting evidence.

[Figure: example sentences for "Zywiec".
Query sentence: So far this year Zywiec , whose full name is Zaklady Piwowarskie w Zywcu SA , has netted six million zlotys on sales of 224 million zlotys .
Document-level supporting evidence: Polish brewer Zywiec 's 1996 profit slump may last into next year due in part to hefty depreciation charges , but recent high investment should help the firm defend its 00percent market share , the firm 's chief executive said . Van Boxmeer also said Zywiec would be boosted by its recent shedding of soft drinks which only accounted for about three percent of the firm 's overall sales and for which 0.0 million zlotys in provisions had already been made .
Corpus-level supporting evidence: The two largest brands are Heineken and Amstel. The list includes Cruzcampo, Affligem and Zywiec .]
We first apply Latent Dirichlet allocation (LDA) (Blei et al., 2003) to model the topic distribution of each document and separate the documents into $N$ clusters based on their topic distributions ($N = 20$ in our experiments). As in Figure 3, we retrieve supporting sentences for each word, such as "Zywiec", from the topically related documents and apply another attention mechanism (Bahdanau et al., 2015) to the supporting contextual representations, $\hat{h}_{ij} = \{\hat{h}_{ij}^1, \hat{h}_{ij}^2, \ldots\}$ (e.g., $\hat{h}_{xi}$ and $\hat{h}_{yi}$ in Figure 3). This yields a weighted contextual representation of the corpus-level supporting evidence, $\hat{H}_{ij}$, for each $w_{ij}$, which is similar to the document-level supporting evidence representation, $\tilde{H}_{ij}$, described in Section 2.2. We use another gating mechanism to combine $\hat{H}_{ij}$ and the local contextual representation, $h_{ij}$, to obtain the corpus-level gated supporting evidence representation, $C_{ij}$, for each $w_{ij}$.
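The retrieval side of this step can be sketched in Python as follows. The sketch assumes that per-document topic distributions have already been produced by an LDA implementation, and it assumes that each document is assigned to the cluster of its most probable topic; the text does not specify the clustering rule, so this is only one plausible choice. All function names and the toy documents are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Hit = Tuple[str, int, int]  # (doc_id, sentence_index, token_index)


def cluster_by_topic(doc_topics: Dict[str, List[float]]) -> Dict[str, int]:
    """Assign each document to the index of its most probable topic (assumed rule)."""
    return {doc_id: max(range(len(dist)), key=dist.__getitem__)
            for doc_id, dist in doc_topics.items()}


def build_index(docs: Dict[str, List[List[str]]],
                clusters: Dict[str, int]) -> Dict[Tuple[int, str], List[Hit]]:
    """Map (cluster, token) to every position where the token occurs."""
    index = defaultdict(list)
    for doc_id, sentences in docs.items():
        for s_idx, sentence in enumerate(sentences):
            for t_idx, token in enumerate(sentence):
                index[(clusters[doc_id], token)].append((doc_id, s_idx, t_idx))
    return index


def corpus_support(index: Dict[Tuple[int, str], List[Hit]],
                   clusters: Dict[str, int], doc_id: str, token: str) -> List[Hit]:
    """Occurrences of `token` in other documents of the same topic cluster."""
    hits = index.get((clusters[doc_id], token), [])
    return [hit for hit in hits if hit[0] != doc_id]


# Toy corpus with made-up topic distributions over two topics.
docs = {
    "d1": [["Zywiec", "netted", "six", "million", "zlotys"]],
    "d2": [["The", "list", "includes", "Zywiec"]],
    "d3": [["Zywiec", "is", "a", "town"]],
}
doc_topics = {"d1": [0.8, 0.2], "d2": [0.7, 0.3], "d3": [0.1, 0.9]}
clusters = cluster_by_topic(doc_topics)
index = build_index(docs, clusters)
support = corpus_support(index, clusters, "d1", "Zywiec")
```

With these toy inputs, only the occurrence in `d2` is retrieved for the query in `d1`, because `d3` falls in a different topic cluster.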

Tag Prediction
For each word $w_{ij}$ of sentence $s_i$, we concatenate its local contextual representation $h_{ij}$, document-level gated supporting evidence representation $D_{ij}$, and corpus-level gated supporting evidence representation $C_{ij}$ to obtain its final representation. This representation is fed to another Bi-LSTM to further encode the supporting evidence and local contextual features into a unified representation, which is given as input to an affine-CRF layer for label prediction.

Dataset
We evaluate our methods on the CoNLL-2002 and CoNLL-2003 name tagging datasets (Tjong Kim Sang and De Meulder, 2003).

Experimental Setup
For word representations, we use 100-dimensional pre-trained word embeddings and 25-dimensional randomly initialized character embeddings. We train word embeddings using the word2vec package. English embeddings are trained on the English Gigaword version 4, which is the same corpus used by Lample et al. (2016). Dutch, Spanish, and German embeddings are trained on corresponding Wikipedia articles (2017-12-20 dumps). Word embeddings are fine-tuned during training. Table 2 shows our hyper-parameters. For each model with an attention, since the Bi-LSTM encoder must encode the local, document-level, and/or corpus-level contexts, we pre-train a Bi-LSTM CRF model for 50 epochs, add our document-level attention and/or corpus-level attention, and then fine-tune the augmented model. Additionally, Reimers and Gurevych (2017) report that neural models produce different results even with the same hyper-parameters due to the variance in parameter initialization. Therefore, we run each model ten times and report the mean as well as the maximum F1 scores.

Performance Comparison
We compare our methods to three categories of baseline name tagging methods:
• Vanilla Name Tagging: Without any additional resources and supervision, the current state-of-the-art name tagging model is the Bi-LSTM-CRF network reported by Lample et al. (2016) and Ma and Hovy (2016b), which differ in whether an LSTM or a CNN is used to encode characters. Our methods fall in this category.
• [...] (Chelba et al., 2013).

Table 3 presents the performance comparison among the baselines, the aforementioned state-of-the-art methods, and our proposed methods. Adding only the document-level attention offers an F1 gain of between 0.37% and 1.25% on Dutch, English, and German. Similarly, the addition of the corpus-level attention yields an F1 gain of between 0.46% and 1.08% across all four languages. The model with both attentions outperforms our baseline method by 1.60%, 0.56%, and 0.79% on Dutch, English, and German, respectively. Using a paired t-test between our proposed model and the baselines on 10 randomly sampled subsets, we find that the improvements are statistically significant (p ≤ 0.015) for all settings and all languages.

By incorporating the document-level and corpus-level attentions, we achieve state-of-the-art performance on the Dutch (NLD), Spanish (ESP) and German (DEU) datasets. For English, our methods outperform the state-of-the-art methods in the "Vanilla Name Tagging" category. Since the document-level and corpus-level attentions introduce redundant and topically related information, our models are compatible with the language model enhanced approaches. Integrating the two would be interesting, but we leave this to future work.

Figure 4 presents, for each language, the learning curves of the full models (i.e., with both document-level and corpus-level attentions). The learning curve is computed by averaging the F1 scores of the ten runs at each epoch. We first pre-train a baseline Bi-LSTM CRF model from epoch 1 to 50. Then, starting at epoch 51, we incorporate the document-level and corpus-level attentions to fine-tune the entire model. As shown in Figure 4, when adding the attentions at epoch 51, the F1 score drops significantly as new parameters are introduced to the model. The model gradually adapts to the new information, the F1 score rises, and the full model eventually outperforms the pre-trained model. The learning curves support the effectiveness of our proposed methods.
Code | Model | F1 (%)
 | (Gillick et al., 2015) | reported 82.84
 | (Lample et al., 2016) | reported 81
 | (Gillick et al., 2015) | reported 82.95
 | (Lample et al., 2016) | reported 85.75
 | (Yang et al., 2017) | reported 85.77
 | Our Baseline | mean 85.33, max 85.51
 | Corpus-lvl Attention | mean 85.77, max 86.01
 | ∆ | +0.50
ENG | (Luo et al., 2015) | reported 91.20
ENG | (Lample et al., 2016) | reported 90.94
ENG | (Ma and Hovy, 2016b) | reported 91.21
ENG | (Liu et al., 2017) | reported 91

We also compare our approach with a simple rule-based propagation method, where we use token-level majority voting to make labels consistent at the document level and the corpus level. Document-level propagation scores 90.21% F1 on English and corpus-level propagation scores 89.02%, both lower than the 90.97% of the Bi-LSTM CRF baseline.

Table 5 compares the name tagging results from the baseline model and our best models. All examples are selected from the development set.
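As an illustration of the rule-based propagation baseline discussed in this section, the Python sketch below applies token-level majority voting within one document: every occurrence of a token receives the label most often predicted for that token. The exact procedure used in the experiments is not specified, so the function name, the use of plain entity types instead of BIO tags, and the tie-breaking behavior are assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

TaggedSentence = List[Tuple[str, str]]  # (token, predicted type or "O")


def propagate_majority(document: List[TaggedSentence]) -> List[TaggedSentence]:
    """Relabel every token with the label most often predicted for it.

    Ties go to the label that was seen first, because Counter.most_common
    keeps first-encountered order among equal counts.
    """
    votes: Dict[str, Counter] = defaultdict(Counter)
    for sentence in document:
        for token, label in sentence:
            votes[token][label] += 1
    return [[(token, votes[token].most_common(1)[0][0]) for token, _ in sentence]
            for sentence in document]


# Toy predictions: "Zywiec" is tagged PER once and ORG twice.
predictions = [
    [("Zywiec", "PER"), ("netted", "O")],
    [("Zywiec", "ORG"), ("said", "O")],
    [("Zywiec", "ORG")],
]
consistent = propagate_majority(predictions)
```

Passing the sentences of a whole corpus instead of a single document gives the corpus-level variant of the same voting rule.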

Qualitative Analysis
In the Dutch example, "Granada" is the name of a city in Spain, but also the short name of "Granada Media". Without ORG-related context, "Granada" is mistakenly tagged as LOC by the baseline model. However, the document-level and corpus-level supporting evidence retrieved by our method contains the ORG name "Granada Media", which strongly indicates that "Granada" is an ORG in the query sentence. By adding the document-level and corpus-level attentions, our model successfully tags "Granada" as ORG.
In example 2, the OOV word "Kaczmarek" is tagged as ORG in the baseline output. In the retrieved document-level supporting sentences, PER-related contextual information, such as the pronoun "he", indicates that "Kaczmarek" is a PER. Our model correctly tags "Kaczmarek" as PER with the document-level attention.
In the German example, "Grünen" (Greens) is an OOV word in the training set. The character embedding captures the semantic meaning of the stem "Grün" (Green), which is a common non-name word, so the baseline model tags "Grünen" as O (outside of a name). In contrast, our model makes the correct prediction by incorporating the corpus-level attention because, in the related sentence from the corpus, "Bundesvorstandes der Grünen" (Federal Executive of the Greens) indicates that "Grünen" is a company name.

Remaining Challenges
Our investigation of the remaining errors shows that most of the named entity type inconsistency errors are eliminated; however, a few new errors are introduced because the model propagates labels from negative instances to positive ones. Figure 5 presents a negative example, where our model, being influenced by the prediction "[B-ORG Indianapolis]" in the supporting sentence, incorrectly predicts "Indianapolis" as ORG in the query sentence. A potential solution is to apply sentence classification (Kim, 2014; Ji and Smith, 2017) to the documents, divide each document into fine-grained clusters of sentences, and select supporting sentences within the same cluster.
In morphologically rich languages, words may have many variants. When retrieving supporting evidence, our exact query word match criterion misses potentially useful supporting sentences that contain variants of the word. Normalization and morphological analysis can be applied in this case to help fetch supporting sentences.

Table 5 (excerpt), #1 Dutch, Baseline: [B-LOC Granada] overwoog vervolgens een bod op Carlton uit te brengen, maar daar ziet het concern nu van af. (Granada then considered issuing a bid for Carlton, but the concern now sees it.)

Related Work
Name tagging methods based on sequence labeling have been extensively studied recently. Lample et al. (2016), among others, proposed a neural architecture consisting of a bi-directional long short-term memory network (Bi-LSTM) encoder and a conditional random field (CRF) output layer (Bi-LSTM CRF). This architecture has been widely explored and demonstrated to be effective for sequence labeling tasks. Other efforts incorporated character-level compositional word embeddings, language modeling, and CRF re-ranking into the Bi-LSTM CRF architecture, which improved performance (Ma and Hovy, 2016a; Liu et al., 2017; Sato et al., 2017; Peters et al., 2017, 2018). Similar to these studies, our approach is also based on a Bi-LSTM CRF architecture. However, considering the limited contexts within each individual sequence, we design two attention mechanisms to further incorporate topically related contextual information on both the document-level and corpus-level.

There have been efforts in other areas of information extraction to exploit features beyond individual sequences. Early attempts (Mikheev et al., 1998; Mikheev, 2000) on the MUC-7 name tagging dataset used document-centered approaches. A number of approaches explored document-level features (e.g., temporal and co-occurrence patterns) for event extraction (Chambers and Jurafsky, 2008; Ji and Grishman, 2008; Liao and Grishman, 2010; Do et al., 2012; McClosky and Manning, 2012; Berant et al., 2014; Yang and Mitchell, 2016). Other approaches leveraged features from external resources (e.g., Wiktionary or FrameNet) for low-resource name tagging and event extraction (Li et al., 2013; Huang et al., 2016; Liu et al., 2016; Zhang et al., 2016; Cotterell and Duh, 2017; Huang et al., 2018). Yaghoobzadeh and Schütze (2016) aggregated corpus-level contextual information of each entity to predict its type, and Narasimhan et al. (2016) incorporated contexts from external information sources (e.g., the documents that contain the desired information) to resolve ambiguities. Compared with these studies, our work incorporates both document-level and corpus-level contextual information with attention mechanisms, which is a more advanced and efficient way to capture meaningful additional features. Additionally, our model is able to learn how to regulate the influence of the information outside the local context using gating mechanisms.

Conclusions and Future Work
We propose document-level and corpus-level attentions for name tagging. The document-level attention retrieves additional supporting evidence from other sentences within the document to enhance the local contextual information of the query word. When the query word is unique in the document, the corpus-level attention searches for topically related sentences in the corpus. Both attentions dynamically weight the retrieved contextual information and emphasize the information most relevant to the query context. We present gating mechanisms that allow the model to regulate the influence of the supporting evidence on the predictions. Experiments demonstrate the effectiveness of our approach, which achieves state-of-the-art results on benchmark datasets.
We plan to apply our method to other tasks, such as event extraction, and explore integrating language modeling into this architecture to further boost name tagging performance.