NLNDE: Enhancing Neural Sequence Taggers with Attention and Noisy Channel for Robust Pharmacological Entity Detection

Named entity recognition has been extensively studied on English news texts. However, the transfer to other domains and languages is still a challenging problem. In this paper, we describe the system with which we participated in the first subtrack of the PharmaCoNER competition of the BioNLP Open Shared Tasks 2019. Aiming at pharmacological entity detection in Spanish texts, the task provides a non-standard domain and language setting. However, we propose an architecture that requires neither language nor domain expertise. We treat the task as a sequence labeling task and experiment with attention-based embedding selection and the training on automatically annotated data to further improve our system’s performance. Our system achieves promising results, especially by combining the different techniques, and reaches up to 88.6% F1 in the competition.


Introduction
The detection and classification of pharmacological and biomedical entities in texts is especially challenging due to the domain's nature with long and complex entity names, which usually requires the design and usage of handcrafted rules and features. Natural language processing (NLP) research has focused on this topic for quite a while on English texts, e.g., the drugs and chemical names extraction challenge (CHEMDNER) (Krallinger et al., 2015) or tracks for chemical entity recognition at BioCreative (Pérez-Pérez et al., 2017). Following these tasks, the Pharmacological Substances, Compounds and Proteins and Named Entity Recognition track (PharmaCoNER) is the first competition on this topic on Spanish data (Gonzalez-Agirre et al., 2019).
Named entity recognition (NER) and classification is the first subtrack of PharmaCoNER and aims at distinguishing four entity types: PROTEINAS, NORMALIZABLES, NO-NORMALIZABLES, and UNCLEAR. Our model was trained on all four entity types, although the NO-NORMALIZABLES type was not considered during the official evaluation due to its ambiguous definition. Two annotated sample sentences from the training data are shown in Figure 1.
In this paper, we describe our submissions to the first subtrack of PharmaCoNER and their results. We address this task as a sequence labeling problem and implement a system that relies Neither on Language Nor on Domain Expertise (NLNDE). For this, we use a combination of different state-of-the-art approaches from NLP to tackle its challenges without the need for handcrafted features.
We train recurrent neural networks with conditional random field (CRF) output layers, which are state of the art for different sequence labeling tasks, such as named entity recognition (Lample et al., 2016), part-of-speech tagging (Kemos et al., 2019) and de-identification (Liu et al., 2017). In our different runs, we further explore the advantages of domain-specific fastText embeddings that have been pre-trained on SciELO and Wikipedia articles (Soares et al., 2019) to investigate the impact of domain knowledge. Note that the training of these embeddings requires only a collection of domain-specific text but no human domain expertise. Based on these models, we train an attention-based embedding selection function in order to leverage multiple different word embeddings effectively. Finally, we extend the training data with automatically annotated data, which was sampled from the same domain and annotated with information from Wikidata. 1

Methods
In this section, we present our system, the attention function for embedding selection, and the noisy channel model.

NLNDE System
Figure 2 depicts the architecture of our models, which we explain in the following.
Input Embeddings. We tokenize the input with the tokenizer provided by the shared task organizers (Intxaurrondo, 2019). We noticed that the tokenizer sometimes merges the contiguous words of a multi-word expression into a single token joined with underscores. As a result, some tokens cannot be aligned with the corresponding entity annotations. To address this, we split those tokens into their components in a postprocessing step. Then, we represent each token with the following embeddings (see bottom right box of Figure 2):
• Character embeddings: We use the concatenated last forward and backward hidden states of a bidirectional long short-term memory (BiLSTM) network (Hochreiter and Schmidhuber, 1997) over character embeddings (50 dimensions, randomly initialized, fine-tuned during training (Lample et al., 2016)).
Note that except for the character embeddings, we do not fine-tune any of the embeddings. All embeddings are concatenated into a single word representation vector.
1 https://www.wikidata.org/
Word Features. We also experiment with extending the input representations with the following features:
• Part-of-speech (POS): The POS tags are generated by the POS tagger provided by the shared task organizers (Intxaurrondo, 2019). The tags are embedded into a 20-dimensional randomly initialized embedding that is learned during training. The embedded vector is used as the representation for the POS tag.
• Length: For each word, we encode its length in a one-hot vector. Words with more than nine characters share the same vector (10 dimensions).
• Frequency: We consider the relative frequency f of each word and bin the frequencies into ten groups. The first group contains the most frequent words that have relative frequencies above 1% (f > 1%). The remaining bins are constructed in the following manner: f > 0.5%, f > 0.1%, f > 0.05%, etc. (one-hot encoded, 10 dimensions).
All features are concatenated into a single feature vector f of 40 dimensions (20 for the POS embedding, 10 for the length, and 10 for the frequency).
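As an illustration of the length and frequency features, the following minimal sketch builds the two one-hot vectors. It is not the authors' implementation: the function names are hypothetical, and the thresholds beyond the four listed in the text (everything after f > 0.05%) are an assumption that simply continues the 1-5 pattern. The learned POS embedding is left out.

```python
import numpy as np

# Thresholds on the relative frequency f. The text lists 1%, 0.5%, 0.1%, 0.05%
# and then "etc."; continuing the pattern down to 1e-6 is an assumption.
FREQ_THRESHOLDS = [1e-2, 5e-3, 1e-3, 5e-4, 1e-4, 5e-5, 1e-5, 5e-6, 1e-6]


def length_one_hot(word):
    """One-hot length vector (10 dims); words longer than nine characters share the last slot."""
    vec = np.zeros(10)
    vec[min(max(len(word), 1), 10) - 1] = 1.0
    return vec


def frequency_one_hot(rel_freq):
    """One-hot frequency bin (10 dims); the tenth bin collects all remaining (rare) words."""
    vec = np.zeros(10)
    for i, threshold in enumerate(FREQ_THRESHOLDS):
        if rel_freq > threshold:
            vec[i] = 1.0
            return vec
    vec[9] = 1.0
    return vec


# Concatenate both one-hot features for a single word (relative frequency is made up).
features = np.concatenate([length_one_hot("para"), frequency_one_hot(0.007)])
```

Both vectors are fixed functions of the word and need no training, which is why only the POS embedding adds learnable parameters to the feature vector.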
BiLSTM-CRF Layers. The input representation is fed into a BiLSTM with a conditional random field (CRF) output layer, similar to the model of Lample et al. (2016). The CRF output layer is a linear-chain CRF, i.e., it learns transition scores between the output classes. For training, the forward algorithm is used to sum the scores for all possible sequences. During decoding, the Viterbi algorithm is applied to obtain the sequence with the maximum score.
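The two dynamic programs of a linear-chain CRF can be sketched as follows. This is a generic NumPy illustration under simplifying assumptions (no dedicated start or end transitions, a single unbatched sequence), not the authors' code; `emissions` stands for the per-token label scores produced by the BiLSTM, and all names are hypothetical.

```python
import numpy as np


def log_sum_exp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)


def crf_log_partition(emissions, transitions):
    """Forward algorithm: log of the summed exponentiated scores of all label sequences.

    emissions:   (T, K) label scores per token
    transitions: (K, K) with transitions[i, j] = score of moving from label i to label j
    """
    alpha = emissions[0]
    for t in range(1, len(emissions)):
        alpha = log_sum_exp(alpha[:, None] + transitions, axis=0) + emissions[t]
    return log_sum_exp(alpha, axis=0)


def viterbi_decode(emissions, transitions):
    """Viterbi algorithm: highest-scoring label sequence and its score."""
    score = emissions[0]
    backpointers = []
    for t in range(1, len(emissions)):
        candidates = score[:, None] + transitions  # rows: previous label, cols: current label
        backpointers.append(candidates.argmax(axis=0))
        score = candidates.max(axis=0) + emissions[t]
    path = [int(score.argmax())]
    for bp in reversed(backpointers):
        path.append(int(bp[path[-1]]))
    return path[::-1], float(score.max())
```

The training loss for one sentence is then the log partition value minus the score of the gold label sequence; both functions run in O(T K^2) time instead of enumerating all K^T sequences.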

Hyperparameters and Training. The hyperparameters are the same across all runs. We use a BiLSTM hidden size of 256 and train the network with the NADAM optimizer (Dozat, 2016) using a learning rate of 0.002 and a batch size of 32. For regularization, we employ early stopping on the development set and apply dropout with probability 0.5 on the input representations.

Attention for Embedding Selection
As we are combining different word embeddings, some of them may be more beneficial for certain words than others, e.g., domain-specific embeddings for in-domain words. Kiela et al. (2018) used an attention mechanism for weighting and selecting the best embeddings for each word. We extend this idea and propose the following attention function to weight the embeddings depending on additional word features.
For the attention-based models, all $n$ embeddings $e_i$ are mapped to the same size using a linear mapping $Q_i \in \mathbb{R}^{E \times E_i}$ without bias, where $x_i = Q_i e_i \in \mathbb{R}^{E}$ is the $i$-th embedding $e_i$ mapped from its original size $E_i$ to the maximal embedding size $E = \max_m (E_m)$.
In order to allow the model to make an informed decision which embeddings to focus on, we use the word features described in Section 2.1 as an additional input to the attention function. The vector $f \in \mathbb{R}^{F}$ representing the features for each word is concatenated to each embedding $x_i$. The attention weight $a_i$ for each embedding $x_i$ is computed with the softmax function, by feeding $x_i$ and $f$ into a fully-connected hidden layer of size $H$. Finally, the embeddings $x_i$ are weighted using the attention weights $a_i$, resulting in the word representation:
$$e = \sum_{i=1}^{n} a_i x_i$$
Then, this word representation $e \in \mathbb{R}^{E}$ is fed into the BiLSTM-CRF. Compared to a concatenation of the different embeddings, this results in a lower-dimensional word representation and, thus, requires fewer parameters in the BiLSTM layer. The attention-based embedding selection is shown in the upper right box of Figure 2.
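The feature-based embedding selection can be sketched for a single word as below. This is a hedged illustration, not the authors' implementation: the text only specifies a fully-connected hidden layer of size H followed by a softmax, so the parameter names `W` and `V`, the tanh nonlinearity, and the scoring vector that reduces the hidden layer to one score per embedding are assumptions.

```python
import numpy as np


def attention_word_representation(embeddings, features, Q, W, V):
    """Weight n embeddings of one word by feature-dependent attention.

    embeddings: list of n vectors e_i with (possibly different) sizes E_i
    features:   word feature vector f of size F
    Q:          list of n projection matrices, Q[i] has shape (E, E_i), no bias
    W:          assumed hidden-layer weights of shape (H, E + F)
    V:          assumed scoring vector of shape (H,)
    """
    # Map every embedding to the common size E: x_i = Q_i e_i
    x = np.stack([Q_i @ e_i for Q_i, e_i in zip(Q, embeddings)])       # (n, E)
    # Concatenate the same feature vector f to each mapped embedding
    inputs = np.concatenate([x, np.tile(features, (len(x), 1))], axis=1)  # (n, E + F)
    # One scalar score per embedding via the hidden layer (tanh is an assumption)
    scores = np.tanh(inputs @ W.T) @ V                                  # (n,)
    # Softmax over the n embeddings gives the attention weights a_i
    exp_scores = np.exp(scores - scores.max())
    a = exp_scores / exp_scores.sum()
    # Word representation e = sum_i a_i x_i
    return a @ x, a


# Toy usage with made-up sizes: three embeddings, E = max(E_i) = 6, F = 4, H = 8.
rng = np.random.default_rng(0)
sizes, E, F, H = [3, 5, 6], 6, 4, 8
embs = [rng.normal(size=s) for s in sizes]
Q = [rng.normal(size=(E, s)) for s in sizes]
e, a = attention_word_representation(
    embs, rng.normal(size=F), Q, rng.normal(size=(H, E + F)), rng.normal(size=H)
)
```

The output `e` has size E regardless of how many embeddings are combined, which is the source of the parameter savings in the BiLSTM compared to concatenation.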

Training on Noisy Data
As shown in multiple low-resource settings (Dgani et al., 2018; Fang and Cohn, 2016; Mnih and Hinton, 2012; Paul et al., 2019; Yang et al., 2018), the performance of NER and other NLP systems can be substantially improved by training on additional noisy data that is labeled in a distantly supervised manner (Mintz et al., 2009). Such noisy data is cheap to create but also error-prone, and it can even decrease performance if used as training data without noise handling, as shown by Hedderich and Klakow (2018).
Extraction of Noisy Data. We create gazetteers for the different entity types by extracting names and aliases of possible entities from Wikidata for the following categories and their subclasses: 2
• PROTEINAS: enzyme, gene, hormone, protein.
• UNCLEAR and NO-NORMALIZABLES: The gazetteer was constructed from entity mentions in the training data that appeared at least twice and examples from the annotation guidelines.
Then, we retrieve unlabeled documents from the same domain from the SciELO archive (Packer, 1998). Finally, we use the extracted gazetteers to automatically annotate the SciELO data with the method from Lange et al. (2019). We use case-insensitive string matching for PROTEINAS and strict string matching for the other types. This allows us to create additional training instances, but at the same time introduces noise into the system.
To prevent the noisy labels from decreasing performance, we train on the noisy data with a special noise handling method adapted from Goldberger and Ben-Reuven (2016), which is described in the following.
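To make the matching rules concrete, the sketch below labels a tokenized sentence with a greedy longest-match lookup in per-type gazetteers. It is only an illustration and not the annotation method of Lange et al. (2019), whose details are not given here; the function name, the `max_len` limit, the BIO output format, and the example gazetteer entries are all hypothetical.

```python
def annotate_with_gazetteers(tokens, gazetteers,
                             case_insensitive_types=("PROTEINAS",), max_len=5):
    """Assign BIO labels by greedy longest-match gazetteer lookup.

    gazetteers maps an entity type to a set of token tuples. Entries of
    case-insensitive types are expected to be stored lowercased.
    """
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest span first so multi-word entries win over their prefixes.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = tuple(tokens[i:i + n])
            for entity_type, entries in gazetteers.items():
                if entity_type in case_insensitive_types:
                    hit = tuple(t.lower() for t in span) in entries
                else:
                    hit = span in entries  # strict string matching
                if hit:
                    match = (n, entity_type)
                    break
            if match:
                break
        if match:
            n, entity_type = match
            labels[i] = "B-" + entity_type
            for j in range(i + 1, i + n):
                labels[j] = "I-" + entity_type
            i += n
        else:
            i += 1
    return labels


# Hypothetical gazetteer entries, for illustration only.
example_gazetteers = {
    "PROTEINAS": {("cam5.2",)},
    "NORMALIZABLES": {("ácido", "fólico")},
}
example_labels = annotate_with_gazetteers(
    ["Positivas", "para", "CAM5.2", "y", "ácido", "fólico"], example_gazetteers
)
```

Because matching is purely lexical, every ambiguous or missing gazetteer entry becomes a label error, which is the kind of noise the subsequent noise handling is meant to absorb.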
Noisy Channel and Confusion Matrix. First, we annotate each word of the training data using the same method as for generating the noisy data. Thus, each word in the training data has a clean, true label y and a noisy label ŷ from which we can model the noise distribution p(ŷ = j|y = i) with a confusion matrix, as shown in Figure 3. We transform the distribution of the predicted (clean) labels to the noisy label distribution through a so-called noisy channel (Goldberger and Ben-Reuven, 2016):
$$p(\hat{y} = j \mid x) = \sum_{i=1}^{k} p(\hat{y} = j \mid y = i)\, p(y = i \mid x)$$
where k is the number of classes and p(y = i|x) is the probability of a label y having a specific class i given the feature x.
2 Wikidata identifiers used for the extraction: Q8047, Q7187, Q11364, Q8054, Q81915, Q37756, Q8066, Q79460, Q11358, Q177719, Q189720, Q11367, Q7946, Q28745, Q42962, Q2356542, Q47154513, Q172847, Q756, Q81163, Q134808.
We initialize the noisy channel weights using the learned confusion matrix on the training set, for which clean and noisy labels are available.
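The confusion matrix estimation and the noisy channel transformation can be sketched as follows. This is a minimal NumPy illustration with hypothetical function names; it shows only the forward computation and leaves out how the channel weights are updated during training. The identity fallback for classes that never occur in the clean data is an assumption.

```python
import numpy as np


def estimate_confusion_matrix(clean_labels, noisy_labels, k):
    """Row-normalized counts: entry [i, j] estimates p(noisy = j | clean = i)."""
    counts = np.zeros((k, k))
    for y, y_noisy in zip(clean_labels, noisy_labels):
        counts[y, y_noisy] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Classes without any clean example keep their label (assumption).
    return np.where(row_sums > 0, counts / np.maximum(row_sums, 1), np.eye(k))


def noisy_channel(p_clean, confusion):
    """p(noisy = j | x) = sum_i p(noisy = j | clean = i) * p(clean = i | x)."""
    return p_clean @ confusion


# Toy example with k = 3 classes and five labeled tokens.
confusion = estimate_confusion_matrix([0, 0, 1, 1, 2], [0, 1, 1, 1, 2], k=3)
p_noisy = noisy_channel(np.array([0.6, 0.3, 0.1]), confusion)
```

Since every row of the confusion matrix sums to one, the transformed vector is again a probability distribution and can be trained against the noisy labels with a standard cross-entropy loss.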
Training with Confusion Matrix. The sequence tagging model is then trained alternately on the clean data with the CRF output layer and on the noisy data with the noisy channel layer, as shown in Figure 2. The number of noisy training instances is reduced by 5% after every training epoch, down to a minimum of 100 sentences, as we observed that the noisy data helps in particular in the first epochs but decreases performance if the amount is not reduced. Note that we shuffle the noisy data after each training epoch. Thus, the model is trained on new samples of noisy sentences in every epoch.
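The shrinking noisy-data schedule can be sketched as below. The function names are hypothetical, and reading the 5% reduction as relative to the previous epoch's size (rather than to the initial size) is an assumption, as is the rounding down to whole sentences.

```python
import random


def noisy_subset_sizes(initial_size, epochs, decay_percent=5, minimum=100):
    """Number of noisy sentences per epoch: shrink by 5% each epoch, never below the minimum."""
    sizes = []
    size = initial_size
    for _ in range(epochs):
        sizes.append(size)
        # Integer arithmetic avoids floating-point rounding surprises.
        size = max(minimum, size * (100 - decay_percent) // 100)
    return sizes


def sample_noisy_epoch(noisy_sentences, size, rng):
    """Reshuffle the full noisy pool and take a fresh subset for this epoch."""
    shuffled = list(noisy_sentences)
    rng.shuffle(shuffled)
    return shuffled[:size]


# Usage sketch: alternate clean and noisy updates inside each epoch (training calls omitted).
rng = random.Random(0)
noisy_pool = [f"sentence-{i}" for i in range(1000)]
for size in noisy_subset_sizes(len(noisy_pool), epochs=3):
    noisy_batch_source = sample_noisy_epoch(noisy_pool, size, rng)
```

Drawing the subset from the reshuffled full pool, rather than truncating a fixed list, is what exposes the model to different noisy sentences in every epoch.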

Submissions
We submitted five runs to the PharmaCoNER competition. All of them are based on the architecture described in Section 2.1.
S1 (Base): Our first run, the base system for all of the following runs, uses a concatenation of three embeddings (character, BPEmb, fastText), which were all trained on Wikipedia.

Results and Analysis
This section describes our results and analysis.

Experimental Results
In Table 1, we report the results on the PharmaCoNER development and test sets using the official shared task evaluation metrics. Adding domain knowledge (S2) to the base model S1 improves the performance on the development and the test set. The training on noisy data (S3) and the attention function alone (S4) do not lead to strong improvements on the test set; the noise model S3 even decreases performance. The combination of all proposed methods (run S5 Attention+Noise) outperforms all other models.
While we can see the step-by-step improvements introduced by our methods on the development set, these improvements do not carry over one-to-one to the test set. We assume that model S5 performs best at generalizing to unseen words due to the training on additional data and the attention function based on basic word properties like word length or frequency. The other models seem to overfit on the development set, even though this set was never used for training but only for early stopping.

Analysis of Attention Weights
The attention-based models learn to focus mostly on the byte-pair-encoding embeddings, as shown in Figure 4. In particular, for words from the general domain (positivas) and stopwords (para), our model focuses on these embeddings. For domain-specific words (antitiroglobulina, CAM5.2), the model learns to focus more on the fastText embeddings and especially the domain-specific embeddings. Interestingly, the character embeddings are never assigned a noticeable weight. This may be attributed to the fact that the other embeddings are all subword embeddings and that they are able to generate meaningful vectors for out-of-vocabulary words. Moreover, the character embeddings were randomly initialized and had to be learned during training, while the other embeddings were pretrained.

Conclusions
In this paper, we described our system for the first subtrack of the PharmaCoNER competition. We trained a bidirectional long short-term memory network and explored different input representations. We proposed to use a feature-based attention function for embedding selection and training on noisy data, which in combination increased performance by more than 3 F1 points up to 88.6%. This shows that we can successfully extract these special types of entities without the need for domain or language-specific model architectures.

Figure 2: Architecture of our models. The label prefixes "B-" and "I-" show how we address the task as a sequence labeling task. The word representations are either the concatenated embeddings (in runs S1-S3) or the attention-based weighted embeddings (in runs S4-S5).

Figure 3: Confusion matrix for the automatic annotation on the training data used for the noisy channel initialization.

Figure 4: The attention weights of our model for the four embeddings. Darker color indicates higher weight.