Contextualized Word Representations from Distant Supervision with and for NER

We describe a special type of deep contextualized word representation that is learned from distant supervision annotations and dedicated to named entity recognition. Our extensive experiments on 7 datasets show systematic gains across all domains over strong baselines, and demonstrate that our representation is complementary to previously proposed embeddings. We report new state-of-the-art results on the CONLL and ONTONOTES datasets.

Our main contribution in this work is to revisit the work of Ghaddar and Langlais (2018a), which explores distant supervision for learning classical word representations that are later used as features for Named Entity Recognition (NER). Motivated by the recent success of pre-trained language model embeddings, we propose a contextualized word representation trained on the distant supervision material made available by the authors. We do so by training a model to predict the entity type of each word in a given sequence (e.g., a paragraph).
We run extensive experiments feeding our representation, alongside previously proposed traditional and contextualized ones, as features to a vanilla Bi-LSTM-CRF (Ma and Hovy, 2016). Results show that our contextualized representation leads to a significant boost in performance on 7 NER datasets of various sizes and domains. The proposed representation surpasses that of Ghaddar and Langlais (2018a) and is complementary to popular contextualized embeddings like ELMo (Peters et al., 2018).

Related Work
Pre-trained contextualized word embeddings have shown great success in NLP due to their ability to capture both syntactic and semantic properties. ELMo representations (Peters et al., 2018) are built from the internal states of forward and backward word-level language models. Akbik et al. (2018) showed that pure character-level language models can also be used. McCann et al. (2017) used the encoder of a machine translation model to compute contextualized representations. Recently, Devlin et al. (2018) proposed BERT, an encoder based on the Transformer architecture (Vaswani et al., 2017). To overcome the unidirectionality of the language model objective, the authors propose two novel unsupervised objectives: masked word prediction and next sentence prediction.
Ghaddar and Langlais (2018a) applied distant supervision (Mintz et al., 2009) to induce traditional word representations. They used WiFiNE (Ghaddar and Langlais, 2018b, 2017), a Wikipedia dump with a massive amount of entities automatically annotated with the fine-grained tagset proposed by Ling and Weld (2012). Making use of Fasttext (Bojanowski et al., 2016), they embedded the words and (noisy) entity types of this resource into the same space, from which they induced a 120-dimensional word representation where each dimension encodes the similarity of a word to one of the 120 types they considered. While the authors claim the resulting representation captures contextual information, they do not specifically train it to do so. Our work revisits precisely this point.

Data and Preprocessing
We leverage the entity type annotations in WiFiNE, which consists of 1.3B tokens annotated with 159.4M mentions covering 15% of the tokens. A significant number of named entities, such as person names and countries, can actually be resolved from their mention tokens only (Ghaddar and Langlais, 2016a,b). With the hope of forcing the model to rely on context, we use the fine-grained type annotations available in the resource (e.g. /person/politician). Also, inspired by the recent success of masked-word prediction (Devlin et al., 2018), we further preprocess the original annotations by (a) replacing an entity with a special [MASK] token with a probability of 0.2, and (b) replacing primary entity mentions, e.g. all mentions of Barack Obama within its dedicated article, with the mask token with a probability of 0.5. In WiFiNE, named entities that do not have a Wikipedia article (e.g. Malia Ann in Figure 2) are left unannotated, which introduces false negatives. Therefore, we mask out non-entity words when we compute the loss.
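To make this scheme concrete, the following is a minimal sketch of the masking step, assuming annotations are available as (word, entity type, is-primary-mention) triples; the data layout and function name are ours for illustration, and we mask at the token level whereas the original procedure may operate on whole mentions.

```python
import random

MASK = "[MASK]"
IGNORE = "X"  # no prediction is made for this token; it is excluded from the loss


def mask_sequence(tokens, p_entity=0.2, p_primary=0.5, rng=random):
    """Apply the masking scheme to one annotated sequence.

    `tokens` is a list of (word, entity_type, is_primary) triples, where
    entity_type is None for unannotated words and is_primary marks mentions
    of the article's main entity (e.g. Barack Obama within its own article).
    Returns (input_words, target_types).
    """
    inputs, targets = [], []
    for word, entity_type, is_primary in tokens:
        if entity_type is None:
            # unannotated words may be false negatives, so they get no target
            inputs.append(word)
            targets.append(IGNORE)
            continue
        p_mask = p_primary if is_primary else p_entity
        inputs.append(MASK if rng.random() < p_mask else word)
        targets.append(entity_type)
    return inputs, targets
```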
Although contextualized representation learning has access to arbitrarily large contexts (e.g. the document), in practice representations mainly depend on sentence-level context (Chang et al., 2019). To overcome this limitation to some extent, we use the Wikipedia layout provided in WiFiNE to concatenate sentences from the same paragraph, section, and document up to a maximum length of 512 tokens.
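A minimal sketch of this packing step under our reading of the text, where `sentences` is assumed to be the token lists of one document in layout order:

```python
def pack_sentences(sentences, max_len=512):
    """Greedily concatenate consecutive sentences (lists of tokens) from one
    document into training examples of at most `max_len` tokens."""
    examples, current = [], []
    for sent in sentences:
        if current and len(current) + len(sent) > max_len:
            examples.append(current)
            current = []
        current.extend(sent[:max_len])  # truncate a single overly long sentence
    if current:
        examples.append(current)
    return examples
```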
An illustration of the preprocessing is depicted in Figure 2, where for the sake of space a single sentence is shown. Masked entities encourage the model to learn good representations for non-entity words, even though these words do not participate in the final loss. Because our examples are sections and paragraphs, the model is forced to encode sentence- as well as document-level context. In addition, training on (longer) paragraphs is much faster and more memory efficient than batching sentences.

Learning our Representation
We use a model (Figure 1) composed of a multilayer bidirectional encoder that produces hidden states for each token in the input sequence. At the output layer, the last hidden states are fed into a softmax layer that predicts entity types. Following Strubell et al. (2017), we use as our encoder a Dilated Convolutional Neural Network (DCNN) with exponentially increasing dilation width. DCNN was first proposed by Yu and Koltun (2015) for image segmentation, and was successfully deployed for NER by Strubell et al. (2017). The authors show that stacked DCNN layers that incorporate document context perform comparably to a Bi-LSTM while being 8 times faster. With a convolution window of size 3, a DCNN needs only 8 stacked layers to cover the entire context of a 512-token sequence, compared to 255 layers for a regular CNN. This greatly reduces the number of parameters and makes training more scalable and efficient. Because our examples are paragraphs rather than sentences, we employ a self-attention mechanism on top of the DCNN output to encourage the model to focus on salient global information. We adopt the multi-head self-attention formulation of Vaswani et al. (2017). Comparatively, Transformer-based architectures (Devlin et al., 2018) require a much larger amount of resources and computation. To improve the handling of rare and unknown words, our input sequence consists of WordPiece embeddings (Wu et al., 2016).
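The architecture can be summarized in code. Below is a minimal PyTorch sketch of the encoder as we understand it (the authors implemented theirs in TensorFlow); layer sizes follow Appendix A.1, while details such as padding, activation, and the exact placement of the attention layer are our assumptions.

```python
import torch
import torch.nn as nn


class DCNNTypeTagger(nn.Module):
    """Dilated CNN encoder + multi-head self-attention + per-token type classifier."""

    def __init__(self, vocab_size, num_types, dim=384, num_layers=8,
                 kernel_size=3, num_heads=6, max_len=512, dropout=0.3):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)   # WordPiece embeddings
        self.pos_emb = nn.Embedding(max_len, dim)       # learned position embeddings
        # dilation doubles at every layer (1, 2, 4, ...), so 8 layers span ~512 tokens
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size,
                      dilation=2 ** i, padding=(kernel_size - 1) // 2 * 2 ** i)
            for i in range(num_layers))
        self.dropout = nn.Dropout(dropout)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_types)

    def forward(self, token_ids):                        # (batch, seq_len <= max_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.word_emb(token_ids) + self.pos_emb(positions)
        h = x.transpose(1, 2)                            # (batch, dim, seq_len) for Conv1d
        for conv in self.convs:
            h = self.dropout(torch.relu(conv(h)))
        h = h.transpose(1, 2)                            # back to (batch, seq_len, dim)
        h, _ = self.attn(h, h, h)                        # attention over the whole paragraph
        return self.classifier(h)                        # per-token entity-type logits
```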

Datasets
To compare with state-of-the-art models, we consider two well-established NER benchmarks: CONLL-2003 (Tjong Kim Sang and De Meulder, 2003) and ONTONOTES 5.0.

Figure 2: Sequence before and after masking, along with output tags (e.g. /person/politician, /date, /location/city, /location/province). X indicates that no prediction is made for the corresponding token.
To further determine how useful our learned representation is in other domains, we also considered three additional datasets: WNUT17, I2B2, and FIN.

Input Representations
Our NER model is a vanilla Bi-LSTM-CRF (Ma and Hovy, 2016) that we feed with various representations (hereafter described) at the input layer.
Model parameters and training details are provided in Appendix A.3.
We randomly allocate a 25-dimensional vector for each feature and learn these vectors during training.
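For example, assuming word-shape features (the vocabulary size below is hypothetical), such a learned lookup table could be sketched as:

```python
import torch.nn as nn

NUM_SHAPE_FEATURES = 20  # hypothetical number of word-shape feature values
# one randomly initialized 25-dimensional vector per feature value,
# learned jointly with the rest of the NER model
shape_embedding = nn.Embedding(NUM_SHAPE_FEATURES, 25)
```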

Traditional Word Embeddings
We use the 100-dimensional case-sensitive SSKIP (Ling et al., 2015) word embeddings. We also compare with the previously described 120-dimensional vector representation of Ghaddar and Langlais (2018a), which they call LS.

Contextualized Word Embeddings
We tested 3 publicly available contextualized embeddings: ELMo, FLAIR, and BERT. Previous work (2018) shows that concatenating such representations performs reasonably well in many NLP tasks.

Comparison to LS embeddings
Since we used the very same distant supervision material to train our contextual representation, we compare it to that of Ghaddar and Langlais (2018a). We concentrate on CONLL-2003 and ONTONOTES 5.0, the datasets most often used for benchmarking NER systems. Table 1 reports the results of 4 strong baselines that use popular embeddings (column X), further adding either the LS representation (Ghaddar and Langlais, 2018a) or ours. In all experiments, we report test results for the model performing best on the official development set of each dataset. As a point of comparison, we also report the 2018 state-of-the-art systems.
First, we observe that adding our representation to all baseline models leads to systematic improvements, even for the very strong baseline that exploits all three contextual representations (fourth line). The LS representation does not deliver such gains, which demonstrates that our way of exploiting the very same distant supervision material is more effective. Second, we see that adding our representation to the weakest baseline (line 1), while giving a significant boost, does not deliver as good performance as adding the other contextual embeddings. Nevertheless, combining all embeddings yields state-of-the-art results on both CONLL and ONTONOTES.

Table 2 reports F1 scores on the test portion of the 7 datasets we considered, for models trained with different embedding combinations. Our baseline is composed of word-shape and traditional (SSKIP) embeddings. Contextualized word representations are then added greedily, that is, the representation that yields the largest gain when considered is added first, and so forth.
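The greedy protocol behind Table 2 can be sketched as follows; `train_and_evaluate` is a hypothetical helper that trains the Bi-LSTM-CRF with a given list of input representations and returns the development-set F1, not a function from the authors' code.

```python
def greedy_embedding_selection(candidates, base, train_and_evaluate):
    """Add contextualized embeddings one at a time, always picking the one
    that yields the largest development-set F1 gain over the current set."""
    selected = list(base)              # e.g. ["word-shape", "SSKIP"]
    remaining = set(candidates)        # e.g. {"ELMo", "FLAIR", "BERT", "ours"}
    history = [(tuple(selected), train_and_evaluate(selected))]
    while remaining:
        best = max(remaining, key=lambda emb: train_and_evaluate(selected + [emb]))
        selected.append(best)
        remaining.remove(best)
        history.append((tuple(selected), train_and_evaluate(selected)))
    return history
```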

Comparing Contextualized Embeddings
As expected, ELMo is the best representation to add to the baseline configuration, with significant F1 gains on all test sets. We are pleased to observe that the next best representation to consider is ours, significantly outperforming FLAIR. This is likely because both FLAIR and ELMo embeddings are obtained by training a language model and therefore encode similar information.
Further aggregating the remaining contextual embeddings (FLAIR and BERT) leads to improvements on some datasets and degradations on others. In particular, stacking all representations leads to the best performance on only 2 datasets: ONTONOTES and I2B2. Those datasets are large, domain-diverse, and have more tags than the others. In any case, stacking word-shape, SSKIP, ELMo and our representation leads to a strong configuration across all datasets. Adding our representation to ELMo actually brings noticeable gains (over 2 absolute F1 points) in out-of-domain settings, a very positive outcome.
Surprisingly, BERT representations did not perform as expected, bringing minor (ONTONOTES) or no (CONLL) improvement. We tried to reproduce the results of the fine-tuned and feature-based approaches reported by the authors on CONLL but, like many others (see https://github.com/google-research/bert/issues?utf8=%E2%9C%93&q=NER), our results were disappointing.

Analysis
We suspect one reason for the success of our representation is that it captures document-level context. We inspected the most attended words, according to the self-attention layer, for a number of documents; an excerpt is reported in Figure 3. We observe that the attended words in a document are often related to its topic.
Figure 3 (excerpt): most attended words per document. 84 (economic): Stock, mark, Wall, Treasury, bond; 148 (sport): World, team, record, game, win; 201 (news): truck, Fire, store, hospital, arms.

We further checked whether the gains could be attributed to the fact that WiFiNE contains mentions that appear in the test sets we considered. While this of course happens (for instance, 38% of the test mentions in ONTONOTES are in the resource), the performance of our representation on those mentions is no better than its performance on other mentions.

Conclusion and Future Work
We have explored the idea of generating a contextualized word representation from distant supervision annotations coming from Wikipedia, improving over the static representation of Ghaddar and Langlais (2018a). When combined with popular contextual ones, our representation leads to state-of-the-art performance on both CONLL and ONTONOTES. We are currently analyzing the complementarity of our representation with others.
We plan to investigate tasks such as coreference resolution and non-extractive machine reading comprehension, where document-level context and entity type information are crucial. The source code and the pre-trained models we used in this work are publicly available at http://rali.iro.umontreal.ca/rali/en/wikipedia-ds-cont-emb.

A.1 Training Representation
We use 8 stacked layers of DCNN to encode input sequences with a maximum length of 512 tokens. The WordPiece and position embeddings, the number of filters in each dilated layer, and the self-attention hidden units were all set to 384. For self-attention, we use 6 attention heads and set the intermediate hidden units to 512. We apply a dropout mask (Srivastava et al., 2014) with a probability of 0.3 at the end of each DCNN layer, and at the input and output of the self-attention layer. We adopt the Adam (Kingma and Ba, 2014) optimization algorithm, set the initial learning rate to 1e-4, and use an exponential decay. We train our model for up to 1.5 million steps with a mini-batch size of 64. We implemented our system using the Tensorflow (Abadi et al., 2016) library; training requires about 5 days on a single TITAN XP GPU.

A.2 Datasets
We used the last 2 datasets to perform an out-of-domain evaluation of CONLL models. Those are small datasets extracted from Wikipedia articles and web pages respectively, and manually annotated following the CONLL-2003 annotation scheme.

A.3 NER Model Training
Our system is a single Bi-LSTM layer with a CRF decoder, with 128 hidden units for all datasets except ONTONOTES and I2B2, where we use 256 hidden units. For each learned representation (ours, ELMo, FLAIR, BERT), we use the weighted sum of all layers as input, where the weights are learned during training. For each word, we concatenate the embeddings to form the input feature of the encoder.
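A minimal PyTorch sketch of the layer mixing and concatenation, assuming each contextualizer returns all of its layers stacked in a single tensor; the softmax normalization of the weights is our assumption.

```python
import torch
import torch.nn as nn


class ScalarMix(nn.Module):
    """Learned weighted sum over the layers of one contextualized representation."""

    def __init__(self, num_layers):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layers):                     # (num_layers, batch, seq_len, dim)
        w = torch.softmax(self.weights, dim=0)     # normalization is our assumption
        return (w.view(-1, 1, 1, 1) * layers).sum(dim=0)

# per-word encoder input = concatenation of all (mixed) representations, e.g.
# inputs = torch.cat([shape_emb, sskip, elmo_mix(elmo_layers), our_mix(our_layers)], dim=-1)
```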
Training is carried out by mini-batch stochastic gradient descent (SGD) with a momentum of 0.9 and gradient clipping at 5.0. To mitigate over-fitting, we apply a dropout mask with a probability of 0.7 on the input and output vectors of the Bi-LSTM layer. The mini-batch size is 10 and the learning rate is 0.011 for all datasets. We train the models for up to 63 epochs and use early stopping based on the official development set. For FIN, we randomly sampled 10% of the training set for development.
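A minimal PyTorch sketch of one training step under these settings; the Bi-LSTM-CRF model object and its loss computation are hypothetical placeholders, and we assume norm-based gradient clipping.

```python
import torch

# optimizer = torch.optim.SGD(ner_model.parameters(), lr=0.011, momentum=0.9)

def training_step(ner_model, batch, optimizer):
    """One mini-batch SGD step with momentum and gradient clipping."""
    optimizer.zero_grad()
    loss = ner_model(batch)   # hypothetical: the Bi-LSTM-CRF returns its CRF loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(ner_model.parameters(), max_norm=5.0)
    optimizer.step()
    return loss.item()
```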