Zero-Shot Sequence Labeling: Transferring Knowledge from Sentences to Tokens

Can attention- or gradient-based visualization techniques be used to infer token-level labels for binary sequence tagging problems, using networks trained only on sentence-level labels? We construct a neural network architecture based on soft attention, train it as a binary sentence classifier and evaluate against token-level annotation on four different datasets. Inferring token labels from a network provides a method for quantitatively evaluating what the model is learning, along with generating useful feedback in assistance systems. Our results indicate that attention-based methods are able to predict token-level labels more accurately, compared to gradient-based methods, sometimes even rivaling the supervised oracle network.


Introduction
Sequence labeling is a structured prediction task where systems need to assign the correct label to every token in the input sequence. Many NLP tasks, including part-of-speech tagging, named entity recognition, chunking, and error detection, are often formulated as variations of sequence labeling. Recent state-of-the-art models make use of bidirectional LSTM architectures (Irsoy and Cardie, 2014), character-based representations (Lample et al., 2016), and additional external features (Peters et al., 2017). Optimization of these models requires appropriate training data where individual tokens are manually labeled, which can be time-consuming and expensive to obtain for each different task, domain and target language.
In this paper, we investigate the task of performing sequence labeling without having access to any training data with token-level annotation. Instead of training the model directly to predict the label for each token, the model is optimized using a sentence-level objective and a modified version of the attention mechanism is then used to infer labels for individual words.
While this approach is not expected to outperform a fully supervised sequence labeling method, it opens possibilities for making use of text classification datasets where collecting token-level annotation is not possible or cost-effective.
Inferring token-level labels from a text classification network also provides a method for analyzing and interpreting the model. Previous work has used attention weights to visualize the focus of neural models in the input data. However, these analyses have largely been qualitative examinations, looking at only a few examples from the datasets. By formulating the task as a zero-shot labeling problem, we can provide quantitative evaluations of what the model is learning and where it is focusing. This will allow us to measure whether the features that the model is learning actually match our intuition, provide informative feedback to end-users, and guide our development of future model architectures.

Network Architecture
The main system takes as input a sentence, separated into tokens, and outputs a binary prediction as the label of the sentence. We use a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) architecture for sentence classification, with dynamic attention over words for constructing the sentence representations. Related architectures have been successful for machine translation (Bahdanau et al., 2015), sentence summarization (Rush and Weston, 2015), entailment detection (Rocktäschel et al., 2016), and error correction (Ji et al., 2017). In this work, we modify the attention mechanism and training objective in order to make the resulting network suitable for also inferring binary token labels, while still performing well as a sentence classifier. Figure 1 contains a diagram of the network architecture.
The tokens are first mapped to a sequence of word representations $[w_1, w_2, w_3, ..., w_N]$, which are constructed as a combination of regular word embeddings and character-based representations, following Lample et al. (2016). These word representations are given as input to a bidirectional LSTM which iteratively passes through the sentence in both directions. Hidden representations from each direction are concatenated at every token position, resulting in vectors $\tilde{h}_i$ that are focused on a specific word but take into account the context on both sides of that word. We also include a transformation with tanh activation, which helps map the information from both directions into a joint feature-space:
$$\overrightarrow{h_i} = \mathrm{LSTM}(w_i, \overrightarrow{h_{i-1}}) \tag{1}$$
$$\overleftarrow{h_i} = \mathrm{LSTM}(w_i, \overleftarrow{h_{i+1}}) \tag{2}$$
$$\tilde{h}_i = [\overrightarrow{h_i}; \overleftarrow{h_i}], \qquad h_i = \tanh(W_h \tilde{h}_i + b_h) \tag{3}$$
where $W_h$ is a parameter matrix and $b_h$ is a parameter vector, optimized during training.
Next, we include an attention mechanism that allows the network to dynamically control how much each word position contributes to the combined representation. In most attention-based systems, the attention amount is calculated in reference to some external information. For example, in machine translation the attention values are found based on a representation of the output that has already been generated (Bahdanau et al., 2015); in question answering, the attention weights are calculated in reference to the input question (Hermann et al., 2015). In our task there is no external information to be used, therefore we predict the attention values directly based on $h_i$, by passing it through a separate feedforward layer:
$$e_i = \tanh(W_e h_i + b_e) \tag{4}$$
$$\tilde{e}_i = W_{\tilde{e}} e_i + b_{\tilde{e}} \tag{5}$$
where $W_e$, $b_e$, $W_{\tilde{e}}$ and $b_{\tilde{e}}$ are trainable parameters and $\tilde{e}_i$ results in a single scalar value. This method is equivalent to calculating the attention weights in reference to a fixed weight vector, which is optimized during training. Shen and Lee (2016) proposed an architecture for dialogue act detection where the attention values are found based on a separate set of word embeddings. We found that the method described above was consistently equivalent or better in development experiments, while requiring a smaller number of parameters.
The values of $\tilde{e}_i$ are unrestricted and should be normalized before using them for attention, to avoid sentences of different length having representations of different magnitude. The common approach is to use an exponential function to transform the value, and then normalize by the sum of all values in the sentence:
$$a_i = \frac{\exp(\tilde{e}_i)}{\sum_{k=1}^{N} \exp(\tilde{e}_k)} \tag{6}$$
The value $a_i$ is now in the range $0 \le a_i \le 1$ and higher values indicate that the word at position $i$ is more important for predicting the sentence class. The network learns to predict informative values for $a_i$ based only on the sentence objective, without receiving token-level supervision. Therefore, we can use these attention values at each token in order to infer an unsupervised sequence labeling output.
The method in Equation 6 is well-suited for applications such as machine translation: the exponential function encourages the attention to prioritize only one word in the sentence, resulting in a word-word alignment. However, the same function is less suitable for our task of unsupervised sequence labeling, as there is no reason to assume that exactly one word has a positive label. An input sentence can contain more than one tagged token, or it can contain no tokens of interest, and this should be reflected in the predictions.
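The contrast between the two normalization choices can be illustrated with a small numerical sketch. The function names and the toy score vectors below are illustrative assumptions, not part of the described system; the sketch only shows that exponential normalization always distributes a total mass of 1 across the tokens, whereas a per-token logistic function can mark several tokens, or none, as active.

```python
import numpy as np

def softmax_attention(scores):
    """Exponential normalization: the weights always sum to 1 over the sentence."""
    z = np.exp(scores - scores.max())  # subtract the max for numerical stability
    return z / z.sum()

def sigmoid_attention(scores):
    """Per-token logistic function: each weight is independently in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-scores))

# Toy sentence with no tokens of interest: all raw scores are low.
no_cues = np.array([-4.0, -4.0, -4.0])
# Toy sentence with two tokens of interest.
two_cues = np.array([4.0, 4.0, -4.0])

for name, scores in [("no cues", no_cues), ("two cues", two_cues)]:
    print(name, softmax_attention(scores).round(3), sigmoid_attention(scores).round(3))
```

In the sentence without cues, the exponential normalization still spreads a total weight of 1 across the tokens, while the logistic values all stay below 0.5.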
Instead of the exponential function, we make use of the logistic function $\sigma$ for calculating soft attention weights:
$$\tilde{a}_i = \sigma(\tilde{e}_i), \qquad a_i = \frac{\tilde{a}_i}{\sum_{k=1}^{N} \tilde{a}_k} \tag{7}$$
where $\tilde{e}_i$ is the scalar attention score predicted for token $i$, each $\tilde{a}_i$ has an individual value in the range $0 \le \tilde{a}_i \le 1$, and $a_i$ is normalized to sum up to 1 over all values in the sentence. The normalized weights $a_i$ are used for combining the context-conditioned hidden representations from Equation 3 into a single sentence representation:
$$c = \sum_{i=1}^{N} a_i h_i \tag{8}$$
In addition, we can use the pre-normalization value $\tilde{a}_i$ as a score for sequence labeling, with a natural decision boundary of 0.5: higher values indicate that the token at position $i$ is important and should be labeled positive, whereas lower values suggest the token is largely ignored for sentence classification and can receive a negative label. Attention weights with sigmoid activation have been shown to also improve performance on classification tasks (Shen and Lee, 2016), which indicates that this architecture has the benefit of being both accurate and interpretable on the token level.
Finally, we pass the sentence representation $c$ through a feedforward layer and predict a binary label for the overall sentence:
$$d = \tanh(W_d c + b_d) \tag{9}$$
$$y = \sigma(W_y d + b_y) \tag{10}$$
where $d$ is a sentence vector and $y$ is a single value in the range $0 \le y \le 1$, with values higher than 0.5 indicating a positive class and lower values indicating a negative prediction.
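The attention and classification steps described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the released implementation: the hidden vectors are taken as given, the tanh activations in the intermediate layers are assumed, and the parameter dictionary keys (`W_e`, `W_e2`, `W_d`, `W_y` and the matching biases) are hypothetical names chosen for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_attention_forward(h, params):
    """Sketch of the soft attention composition.

    h: array of shape (N, H) holding one context-conditioned vector per token.
    params: dict of hypothetical parameter arrays:
        W_e (H, E), b_e (E,), W_e2 (E, 1), b_e2 (1,),
        W_d (H, D), b_d (D,), W_y (D,), b_y (float).
    Returns the unnormalized weights, the normalized weights, and the
    sentence-level score.
    """
    e = np.tanh(h @ params["W_e"] + params["b_e"])            # (N, E), assumed tanh layer
    e_tilde = (e @ params["W_e2"] + params["b_e2"]).ravel()   # (N,), one scalar per token
    a_tilde = sigmoid(e_tilde)                                # each value in (0, 1)
    a = a_tilde / a_tilde.sum()                               # normalized to sum to 1
    c = a @ h                                                 # (H,), weighted sum of tokens
    d = np.tanh(c @ params["W_d"] + params["b_d"])            # (D,), sentence vector
    y = sigmoid(d @ params["W_y"] + params["b_y"])            # scalar sentence score
    return a_tilde, a, float(y)

def infer_token_labels(a_tilde, threshold=0.5):
    """Zero-shot token labels: positive where the unnormalized weight exceeds 0.5."""
    return a_tilde > threshold
```

The unnormalized weights serve two purposes here: after normalization they build the sentence representation, and thresholded at 0.5 they give the token-level predictions.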
In order to optimize the model, we use several different loss functions. The first is the squared loss which optimizes the sentence-level score prediction to match the gold label in the annotation:
$$L_1 = \sum_{j} \left( y^{(j)} - \tilde{y}^{(j)} \right)^2 \tag{11}$$
where $y^{(j)}$ is the predicted score for the $j$-th sentence, and $\tilde{y}^{(j)}$ is the true binary label (0 or 1) for the $j$-th sentence.
In addition, we want to encourage the model to learn high-quality token-level labels as part of the attention weights. While the model does not have access to token-level annotation during training, there are two constraints that we can take advantage of:
1. Only some, but not all, tokens in the sentence can have a positive label.
2. There are positive tokens in a sentence only if the overall sentence is positive.
We can then construct loss functions that encourage the model to optimize for these constraints:
$$L_2 = \sum_{j} \left( \min_i(\tilde{a}_i) - 0 \right)^2 \tag{12}$$
$$L_3 = \sum_{j} \left( \max_i(\tilde{a}_i) - \tilde{y}^{(j)} \right)^2 \tag{13}$$
where $\min_i(\tilde{a}_i)$ is the minimum value of all the unnormalized attention weights in the $j$-th sentence, $\max_i(\tilde{a}_i)$ is the corresponding maximum value, and $\tilde{y}^{(j)}$ is the gold binary label of that sentence. Equation 12 optimizes the minimum unnormalized attention weight in a sentence to be 0, satisfying the constraint that not all tokens in a sentence should have a positive token-level label. Equation 13 then optimizes for the maximum unnormalized attention weight in a sentence to be equal to the gold label for that sentence, which is either 0 or 1, incentivizing the network to only assign large attention weights to tokens in positive sentences. These objectives do not provide the model with additional information, but serve to push the attention scores to a range that is suitable for binary classification. We combine all of these loss objectives together for the main optimization function:
$$L = L_1 + \gamma (L_2 + L_3) \tag{14}$$
where $\gamma$ is used to control the importance of the auxiliary objectives.
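The combination of the sentence-level loss with the two auxiliary attention objectives can be sketched as below. The function name and argument layout are assumptions made for this illustration; the sketch computes the loss values only and leaves gradient computation to whichever framework is used for training.

```python
import numpy as np

def zero_shot_loss(y_pred, y_gold, a_tilde_per_sentence, gamma):
    """Combined objective over a batch of sentences.

    y_pred: predicted sentence scores, one per sentence.
    y_gold: gold binary sentence labels (0 or 1), one per sentence.
    a_tilde_per_sentence: list of 1-D arrays with the unnormalized attention
        weights of each sentence (sentences may differ in length).
    gamma: weight of the auxiliary objectives.
    """
    y_pred = np.asarray(y_pred, dtype=float)
    y_gold = np.asarray(y_gold, dtype=float)
    mins = np.array([np.min(a) for a in a_tilde_per_sentence])
    maxs = np.array([np.max(a) for a in a_tilde_per_sentence])

    l1 = np.sum((y_pred - y_gold) ** 2)   # sentence-level squared loss
    l2 = np.sum((mins - 0.0) ** 2)        # smallest weight should be 0
    l3 = np.sum((maxs - y_gold) ** 2)     # largest weight should match the gold label
    return float(l1 + gamma * (l2 + l3))

# Toy usage: one positive sentence with two tokens.
print(zero_shot_loss([0.5], [1], [np.array([0.2, 0.6])], gamma=0.1))
```

For the toy sentence, the three terms are 0.25, 0.04 and 0.16, so the combined value is 0.25 + 0.1 * 0.20.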

Alternative Methods
We compare the attention-based system for inferring sequence labeling with 3 alternative methods.

Labeling Through Backpropagation
We experiment with an alternative method for inducing token-level labels, based on visualization methods using gradient analysis. Research in computer vision has shown that interpretable visualizations of convolutional networks can be obtained by analyzing the gradient after a single backpropagation pass through the network (Zeiler and Fergus, 2014). Denil et al. (2014) extended this approach to natural language processing, in order to find and visualize the most important sentences in a text. Recent work has also used the gradient-based approach for visualizing the decisions of text classification models on the token level (Li et al., 2016; Alikaniotis et al., 2016). In this section we propose an adaptation that can be used for sequence labeling tasks.
We first perform a forward pass through the network and calculate the predicted sentence-level score $y$. Next, we define a pseudo-label $y^* = 0$, regardless of the true label of the sentence. We then calculate the gradient of the loss function with respect to the word representation $w_i$, using this pseudo-label:
$$g_i = \left. \frac{\partial L_1}{\partial w_i} \right|_{(y^*, y)} \tag{15}$$
where $L_1$ is the squared loss function from Equation 11. The magnitude of $g_i$, $|g_i|$, can now be used as an indicator of how important that word is for the positive class. The intuition behind this approach is that the magnitude of the gradient indicates which individual words need to be changed the most in order to make the overall label of the sentence negative. These are the words that are contributing most towards the positive class and should be labeled as such individually.
An obstacle in using this score for sequence labeling comes from the fact that there is no natural decision boundary between the two classes. The magnitude of the gradient is not constrained to a specific range and can vary quite a bit depending on the sentence length and the predicted sentence-level score. In order to map this magnitude to a decision, we analyze the distribution of magnitudes in a sentence. Intuitively, we want to detect outliers, that is, scores that are larger than expected. Therefore, we map all the magnitudes in a sentence to a Gaussian distribution and set the decision boundary at 1.5 standard deviations. Any word that has a gradient magnitude higher than that will be tagged with a positive class for sequence labeling. If all the magnitudes in a sentence are very similar, none of them will cross this threshold and therefore all words will be labeled as negative.
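The outlier-based decision rule can be sketched as follows, assuming the gradient magnitudes for a sentence have already been computed. The function name is hypothetical, and the sketch assumes that the boundary lies 1.5 standard deviations above the mean magnitude of the sentence, which is one reading of the description above.

```python
import numpy as np

def label_by_gradient_outliers(grad_magnitudes, num_std=1.5):
    """Tag tokens whose gradient magnitude is an outlier within the sentence.

    grad_magnitudes: one non-negative magnitude |g_i| per token.
    num_std: distance of the decision boundary from the mean, in standard
        deviations (assumed here to be measured above the mean).
    Returns a boolean array with True for tokens labeled positive.
    """
    m = np.asarray(grad_magnitudes, dtype=float)
    threshold = m.mean() + num_std * m.std()
    # If all magnitudes are similar, nothing exceeds the threshold.
    return m > threshold

# Toy usage: the last token has a much larger gradient than the rest.
print(label_by_gradient_outliers([1, 1, 1, 1, 1, 1, 1, 10]))
```

Because the threshold is computed per sentence, very short sentences can never produce a positive label under this rule: with only two tokens, no value can lie more than one standard deviation from the mean.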
We calculate the gradient magnitude using the same network architecture as described in Section 2, at word representation w i after the character-based features have been included. The attention-based architecture is not necessary for this method, therefore we also report results using a more traditional bidirectional LSTM, concatenating the last hidden states from both directions and using the result as a sentence representation for the main objective.

Relative Frequency Baseline
The system for producing token-level predictions based on sentence-level training data does not necessarily need to be a neural network. As the initial experiment, we trained a Naive Bayes classifier with n-gram features on the annotated sentences and then used it to predict a label only based on a window around the target word. However, this did not produce reliable results: since the classifier is trained on full sentences, the distribution of features is very different and does not apply to a window of only a few words.
Instead, we calculate the relative frequency of a feature occurring in a positive sentence, normalized by the overall frequency of the feature, and calculate the geometric average over all features that contain a specific word:
$$r_k = \frac{c(X_k = 1, Y = 1)}{\sum_{z \in \{0, 1\}} c(X_k = 1, Y = z)} \tag{16}$$
$$score_i = \left( \prod_{k \in F_i} r_k \right)^{1 / |F_i|} \tag{17}$$
where $c(X_k = 1, Y = 1)$ is the number of times feature $k$ is present in a sentence with a positive label, $F_i$ is the set of n-gram features present in the sentence that involve the $i$-th word in the sentence, and $score_i$ is the token-level score for the $i$-th token in the sentence. We used unigram, bigram and trigram features, with extra special tokens to mark the beginning and end of a sentence. This method will assign a high score to tokens or token sequences that appear more often in sentences which receive a positive label. While it is not able to capture long-distance context, it can memorize important keywords from the training data, such as modal verbs for uncertainty detection or common spelling errors for grammatical error detection.
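A compact sketch of this baseline is given below. Several details are assumptions made for the illustration: a single `<s>` and `</s>` marker is added on each side of the sentence, features are counted once per sentence, and n-grams that never occurred in training are skipped when scoring a token.

```python
import math
from collections import Counter

def ngram_features(tokens, i, max_n=3):
    """All n-grams (n <= max_n) that cover token i in the padded sentence."""
    padded = ["<s>"] + list(tokens) + ["</s>"]
    p = i + 1  # position of token i after padding
    feats = set()
    for n in range(1, max_n + 1):
        for start in range(p - n + 1, p + 1):
            if start >= 0 and start + n <= len(padded):
                feats.add(tuple(padded[start:start + n]))
    return feats

def train_counts(sentences, labels):
    """Count in how many sentences, and in how many positive ones, each feature occurs."""
    pos, total = Counter(), Counter()
    for tokens, y in zip(sentences, labels):
        feats = set()
        for i in range(len(tokens)):
            feats |= ngram_features(tokens, i)
        for f in feats:
            total[f] += 1
            pos[f] += int(y)
    return pos, total

def token_scores(tokens, pos, total):
    """Geometric mean of the relative frequencies of the features covering each token."""
    scores = []
    for i in range(len(tokens)):
        ratios = [pos[f] / total[f] for f in ngram_features(tokens, i) if total[f] > 0]
        if not ratios or min(ratios) == 0.0:
            scores.append(0.0)  # no seen features, or a feature never seen in a positive sentence
        else:
            scores.append(math.exp(sum(math.log(r) for r in ratios) / len(ratios)))
    return scores

# Toy usage with two training sentences; the first one is labeled positive.
pos, total = train_counts([["this", "may", "work"], ["this", "works"]], [1, 0])
print(token_scores(["this", "may"], pos, total))
```

In the toy example, "may" only ever appears in the positive sentence and therefore scores higher than "this", which occurs in both.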

Supervised Sequence Labeling
Finally, we also report the performance of a supervised sequence labeling model on the same tasks. This serves as an indicator of an upper bound for a given dataset: how well the system is able to detect relevant tokens when directly optimized for sequence labeling and provided with token-level annotation.
We construct a bidirectional LSTM tagger, following the architectures from Irsoy and Cardie (2014), Lample et al. (2016) and Rei (2017). Character-based representations are concatenated with word embeddings, passed through a bidirectional LSTM, and the hidden states from both directions are concatenated. Based on this, a probability distribution over the possible labels is predicted and the most probable label is chosen for each word. While Lample et al. (2016) used a CRF on top of the network, we exclude it here as the token-level scores coming from that network do not necessarily reflect the individual labels, since the best label sequence is chosen globally based on the combined sentence-level score. The supervised model is optimized by minimizing cross-entropy, training directly on the token-level annotation.

Datasets
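We evaluate the performance of zero-shot sequence labeling on 3 different datasets. Since the SemEval data is converted into two separate binary tasks, this results in four evaluation settings.
In each experiment, the models are trained using only sentence-level annotation and then evaluated based on token-level annotation.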

CoNLL 2010 Uncertainty Detection
The CoNLL 2010 shared task (Farkas et al., 2010) investigated the detection of uncertainty in natural language texts. The use of uncertain language (also known as hedging) is a common tool in scientific writing, allowing scientists to guide research beyond the evidence without overstating what follows from their work. Vincze et al. (2008) showed that 19.44% of sentences in the biomedical papers of the BioScope corpus contain hedge cues. Automatic detection of these cues is important for downstream tasks such as information extraction and literature curation, as typically only definite information should be extracted and curated.
The dataset is annotated for both hedge cues (keywords indicating uncertainty) and scopes (the area of the sentence where the uncertainty applies). The cues are not limited to single tokens, and can also consist of several disjoint tokens (for example, "either ... or ..."). An example sentence from the dataset, with bold font indicating the hedge cue and curly brackets marking the scope of uncertainty: Although IL-1 has been reported to contribute to Th17 differentiation in mouse and man, it remains to be determined {whether therapeutic targeting of IL-1 will substantially affect IL-17 in RA}.
The first subtask in CoNLL 2010 was to detect any uncertainty in a sentence by predicting a binary label. The second subtask required the detection of all the individual cue tokens and the resolution of their scope. In our experiments, we train the system to detect sentence-level uncertainty, use the architecture to infer the token-level labeling and evaluate the latter on the task of detecting uncertainty cues. Since the cues are defined as keywords that indicate uncertainty, we would expect the network to detect and prioritize attention on these tokens. We use the train/test data from the second task, which contains the token-level annotation needed for evaluation, and randomly separate 10% of the training data for development.

FCE Error Detection
Error detection is the task of identifying tokens which need to be edited in order to produce a grammatically correct sentence. The task has numerous applications for writing improvement and assessment, and recent work has focused on error detection as a supervised sequence labeling task (Kaneko et al., 2017; Rei, 2017). Error detection can also be performed on the sentence level, detecting whether the sentence needs to be edited or not. Andersen et al. (2013) described a practical tutoring system that provides sentence-level feedback to language learners. The 2016 shared task on Automated Evaluation of Scientific Writing (Daudaravicius et al., 2016) also required participants to return binary predictions on whether the input sentence needs to be corrected.
We evaluate our system on the First Certificate in English (FCE; Yannakoudakis et al., 2011) dataset, containing error-annotated short essays written by language learners. While the original corpus is focused on aligned corrections, the dataset has since been converted to a sequence labeling format, which we make use of here. An example from the dataset, with bold font indicating tokens that have been annotated as incorrect given the context:
When the show started the person who was acting it was not Danny Brook and he seemed not to be an actor.
We train the network as a sentence-level error detection system, returning a binary label and a confidence score, and also evaluate how accurately it is able to recover the locations of individual errors on the token level.

SemEval Sentiment Detection in Twitter
SemEval has been running a series of popular shared tasks on sentiment analysis in text from social media (Nakov et al., 2013; Rosenthal et al., 2014, 2015). The competitions have included various subtasks, of which we are interested in two: Task A required the polarity detection of individual phrases in a tweet, and Task B required sentiment detection of the tweet as a whole. A single tweet could contain both positive and negative phrases, regardless of its overall polarity, and was therefore separately annotated on the tweet level.
In the following example from the dataset, negative phrases are indicated with a bold font and positive phrases are marked with italics, whereas the overall sentiment of the tweet is annotated as negative:
They may have a SuperBowl in Dallas, but Dallas ain't winning a SuperBowl. Not with that quarterback and owner. @S4NYC @RasmussenPoll
Sentiment analysis is a three-way task, as the system needs to differentiate between positive, negative and neutral sentences. Our system relies on a binary signal, therefore we convert this dataset into two binary tasks: one aims to detect positive sentiment, the other focuses on negative sentiment. We train the system as a sentiment classifier, using the tweet-level annotation, and then evaluate the system on recovering the individual positive or negative tokens. We use the train/dev/test splits of the original SemEval 2013 Twitter dataset, which contains phrase-level sentiment annotation.
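The conversion of the three-way annotation into two binary tasks can be sketched as below. The label strings and the `(tokens, label)` tuple layout are assumptions for the illustration and do not reflect the actual file format of the SemEval data.

```python
def to_binary_tasks(examples):
    """Split three-way sentiment data into two binary classification tasks.

    examples: list of (tokens, label) pairs, where label is assumed to be one
        of "positive", "negative" or "neutral".
    Returns two lists of (tokens, binary_label) pairs: one for detecting
    positive sentiment and one for detecting negative sentiment.
    """
    positive_task = [(tokens, int(label == "positive")) for tokens, label in examples]
    negative_task = [(tokens, int(label == "negative")) for tokens, label in examples]
    return positive_task, negative_task

# Toy usage with made-up tweets.
data = [
    (["great", "game"], "positive"),
    (["not", "winning"], "negative"),
    (["kickoff", "at", "noon"], "neutral"),
]
pos_task, neg_task = to_binary_tasks(data)
print([y for _, y in pos_task], [y for _, y in neg_task])
```

Neutral tweets become negative examples in both tasks, so each binary classifier only receives a signal for one polarity.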

Implementation Details
During pre-processing, tokens are lowercased while the character-level component still retains access to the capitalization information. Word embeddings were set to size 300, pre-loaded from publicly available GloVe (Pennington et al., 2014) embeddings. The model was implemented using TensorFlow (Abadi et al., 2016). The network weights were randomly initialized using the uniform Glorot initialization method (Glorot and Bengio, 2010) and optimization was performed using AdaDelta (Zeiler, 2012) with learning rate 1.0. Dropout (Srivastava et al., 2014) with probability 0.5 was applied to word representations $w_i$ and the composed representations $h_i$ after the LSTMs. The training was performed in batches of 32 sentences. Sentence-level performance was observed on the development data and the training was stopped if performance did not improve for 7 epochs. The best overall model on the development set was then used to report performance on the test data, both for sentence classification and sequence labeling. In order to avoid random outliers, we performed each experiment with 5 random seeds and report here the averaged results.
The code used for performing these experiments is made available online.
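The stopping criterion described in this section can be sketched independently of the model. The callables `train_epoch` and `evaluate_dev` are hypothetical placeholders for one pass over the training data and a sentence-level evaluation on the development set, and the `max_epochs` cap is an assumption added so that the loop always terminates.

```python
def train_with_early_stopping(train_epoch, evaluate_dev, patience=7, max_epochs=1000):
    """Train until the development score has not improved for `patience` epochs.

    train_epoch: callable that runs one training epoch (hypothetical).
    evaluate_dev: callable that returns the development score (hypothetical).
    Returns the best development score and the epoch at which it was reached.
    """
    best_score, best_epoch = float("-inf"), -1
    for epoch in range(max_epochs):
        train_epoch()
        score = evaluate_dev()
        if score > best_score:
            best_score, best_epoch = score, epoch   # a checkpoint would be saved here
        elif epoch - best_epoch >= patience:
            break                                   # no improvement for `patience` epochs
    return best_score, best_epoch

def average_over_seeds(run_experiment, seeds=(0, 1, 2, 3, 4)):
    """Average a scalar result over several random seeds (five assumed seed values)."""
    results = [run_experiment(seed) for seed in seeds]
    return sum(results) / len(results)
```

Keeping the stopping logic separate from the model code makes it straightforward to rerun the same procedure for each random seed and average the outcomes.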

Evaluation
Results for the experiments are presented in Tables 1 and 2. We first report the sentence-level F-measure in order to evaluate the performance on the general text classification objective. Next, we report the Mean Average Precision (MAP) at returning the active/positive tokens. This measure [...] frequency achieve high recall values, but comparatively lower precision. On the FCE dataset, the F-score is considerably lower at 28.27%; this is due to the difficulty of the task, and the supervised system also achieves only 34.76%. The attention-based system outperforms the alternatives on both of the SemEval evaluations. The task of detecting sentiment on the token level is quite difficult overall as many annotations are context-specific and require prior knowledge. For example, in order to correctly label the phrase "have Superbowl" as positive, the system will need to understand that organizing the Superbowl is a positive event for the city.
Performance on the sentence-level classification task is similar for the different architectures on the CoNLL 2010 and FCE datasets, whereas the composition method based on attention obtains an advantage on the SemEval datasets. Since the latter architecture achieves competitive performance and also allows for attention-based token labeling, it appears to be the better choice. Analysis of the token-level MAP scores shows that the attention-based sequence labeling model achieves the best performance even when ignoring classification thresholds and evaluating the task through ranking.
Figure 2 contains example outputs from the attention-based models, trained on each of the four datasets. In the first example, the uncertainty detector correctly picks up "would appreciate if" and "possible", and the error detection model focuses most on the misspelling "Definetely". Both the positive and negative sentiment models have assigned a high weight to the word "disappointing", which is something we observed in other examples as well.
The system will learn to focus on phrases that help it detect positive sentiment, but the presence of negative sentiment provides implicit evidence that the overall label is likely not positive. This is a by-product of the 3-way classification task and future work could investigate methods for extending zero-shot classification to better match this requirement.
In the second example, the system correctly labels the phrase "what would be suitable?" as uncertain, and part of the phrase "I'm not really sure" as negative. It also labels "specifying" as an error, possibly expecting a comma before it. In the third example, the error detection model labels "Internet" for the missing determiner, but also captures a more difficult error in "depended", which is an incorrect form of the word given the context.
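The ranking-based evaluation mentioned in this section can be sketched as follows. This is one common formulation of Mean Average Precision and not necessarily the exact variant used for the reported numbers; in particular, skipping sentences that contain no positive tokens is an assumption of this sketch.

```python
def average_precision(scores, gold):
    """Average precision of ranking the tokens of one sentence by score.

    scores: token-level scores, higher meaning more likely positive.
    gold: binary token labels of the same length.
    Returns None if the sentence has no positive tokens.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if gold[i]:
            hits += 1
            precisions.append(hits / rank)  # precision at each positive token
    return sum(precisions) / len(precisions) if precisions else None

def mean_average_precision(all_scores, all_gold):
    """Mean of the per-sentence average precision, over sentences with positives."""
    aps = [average_precision(s, g) for s, g in zip(all_scores, all_gold)]
    aps = [ap for ap in aps if ap is not None]
    return sum(aps) / len(aps) if aps else 0.0

# Toy usage: the positive token is ranked first in one sentence and last in the other.
print(mean_average_precision([[0.9, 0.1, 0.8], [0.1, 0.9, 0.8]],
                             [[1, 0, 0], [1, 0, 0]]))
```

Because only the ordering of the scores matters, this measure does not depend on where the decision threshold is placed, which makes it usable for methods such as gradient magnitudes that have no natural boundary.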

Conclusion
We investigated the task of performing sequence labeling without having access to any training data with token-level annotation. The proposed model is optimized as a sentence classifier and an attention mechanism is used for both composing the sentence representations and inferring individual token labels. Several alternative models were compared on three tasks: uncertainty detection, error detection and sentiment detection.
Experiments showed that the zero-shot labeling system based on attention weights achieved the best performance on all tasks. The model is able to automatically focus on the most salient areas of the sentence, and additional objective functions along with the soft attention mechanism encourage it to also perform well as a sequence labeler. The zero-shot labeling task can provide a quantitative evaluation of what the model is learning, along with offering a low-cost method for creating sequence labelers for new tasks, domains and languages.