LEAN-LIFE: A Label-Efficient Annotation Framework Towards Learning from Explanation

Successfully training a deep neural network demands a huge corpus of labeled data. However, each label only provides limited information to learn from, and collecting the requisite number of labels involves massive human effort. In this work, we introduce LEAN-LIFE, a web-based, Label-Efficient AnnotatioN framework for sequence labeling and classification tasks, with an easy-to-use UI that not only allows an annotator to provide the needed labels for a task, but also enables LearnIng From Explanations for each labeling decision. Such explanations enable us to generate useful additional labeled data from unlabeled instances, bolstering the pool of available training data. On three popular NLP tasks (named entity recognition, relation extraction, sentiment analysis), we find that using this enhanced supervision allows our models to surpass competitive baseline F1 scores by more than 5-10 percentage points, while using 2X fewer labeled instances. Our framework is the first to utilize this enhanced supervision technique and does so for three important tasks, thus providing improved annotation recommendations to users and an ability to build datasets of (data, label, explanation) triples instead of the regular (data, label) pairs.


Introduction
Deep neural networks have achieved state-of-the-art performance on a wide range of sequence labeling and classification tasks such as named entity recognition (NER) (Lample et al., 2016; Ma and Hovy, 2016), relation extraction (RE) (Zeng et al., 2015; Zhang et al., 2017; Ye et al., 2019), and sentiment analysis (SA) (Wang et al., 2016). However, they only yield such performance levels in supervised learning scenarios, and in particular when human-annotated data is abundant. As we seek to apply NLP models to a larger variety of domains, such as product reviews (Luo et al., 2018) and social media messages (Lin et al., 2017), while reducing human annotation efforts, better annotation frameworks with label-efficient learning techniques are crucial to our progress. Annotation frameworks have been explored by several previous works (Stenetorp et al., 2012; Bontcheva et al., 2014; Morton and LaCivita, 2003; de Castilho et al., 2016; Yang et al., 2017). These existing open-source sequence annotation tools mainly focus on optimizing their user interfaces, for example by providing shortcut keys that allow for faster tagging. The frameworks also attempt to provide annotation recommendations to reduce human annotation efforts. However, these recommendations are provided by a pre-trained model or via dictionary look-ups. This way of providing recommendations often proves unhelpful when little annotated data exists for pre-training, as is usually the case for natural language tasks applied to domain-specific or user-provided corpora.

Figure 1: Leveraging Labeling Explanations: 1) RE: the explanation "the phrase 'caused by' occurs between SUBJ and OBJ" can aid in weakly labeling unlabeled instances like "The burst has been caused by water hammer pressure" with the label "cause-effect"; 2) NER: Trigger spans near the labeled restaurant such as "had lunch at" and "where the food" can aid in weakly labeling unlabeled instances like "I had a dinner at McDonalds, where the food is cheap".

To resolve this issue, AlpacaTag, an annotation framework for sequence labeling (Lin et al., 2019), attempts to provide annotation recommendations from a learned sequence labeling model that is incrementally updated by batches of incoming human annotations. Its model training follows an active learning strategy (Shen et al., 2017), which has been shown to be label-efficient, and thus attempts to minimize human annotation efforts. AlpacaTag selects the most informative batches of documents for humans to annotate and thereby uses human effort more cost-effectively. While active learning allows the model to achieve higher performance earlier in the learning process, model performance could be improved further if additional supervision existed. It is imperative that the provided annotation recommendations be as accurate as possible, as inaccurate recommendations from the framework can push users towards generating noisy data, hindering instead of aiding the model training process.
Our effort to prevent this problem is centered around allowing annotators to provide additional supervision by capturing labeling explanations, while still taking advantage of the cost-effectiveness of active learning. Specifically, as shown in Fig. 1, we allow annotators to provide explanations for their decisions in natural language or by selecting triggers: nearby phrases that provide helpful context for their decisions. These enhanced annotations allow for model training over both user-provided labels and weakly labeled data created by parsing explanations into high-precision labeling rules. We therefore attempt to ameliorate the erroneous recommendation problem with a performance-boosting training strategy that incorporates both labeled and unlabeled data.
Our work is also similar to recent attempts that exploit explanations for an improved training process (Srivastava et al., 2017; Hancock et al., 2018; Zhou et al., 2020; Qin et al., 2020), but with two main differences. First, we embed this improved training process in a practical application, and second, we design task-specific architectures to incorporate the captured explanations into training.
To the best of our knowledge, there is no existing open-source, easy-to-use, recommendation-providing, online-learning annotation framework that can also capture explanations. LEAN-LIFE is the first framework to capture and leverage explanations for improved model training and performance, while still inheriting the advantages of existing tools. We summarize our contributions as follows:
• Improved Model Training: Our recommendation models use a performance-improving training process that leverages explanations to weakly label unlabeled instances. Our models improve on competitive baseline F1 scores by more than 5-10 percentage points, while using 2X less data.
• Multiple Supported Tasks: Our framework supports both sequence labeling (as in NER) and sequence classification (as in RE, SA).
• Explanation Dataset Creation: We make it easy to build a new type of dataset, one that consists of triples of text, labels, and labeling explanations. The captured data can be exported in two common data formats, CSV and JSON.
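
To make the (text, label, explanation) triple concrete, the following is a minimal sketch of how such a dataset might be serialized to JSON and CSV. The field names and the helper functions `to_json` and `to_csv` are assumptions made for illustration; they are not the framework's actual export schema.

```python
import csv
import io
import json

# Hypothetical triple; the field names are illustrative, not LEAN-LIFE's schema.
triples = [
    {
        "text": "The burst has been caused by water hammer pressure",
        "label": "cause-effect",
        "explanation": "the phrase 'caused by' occurs between SUBJ and OBJ",
    },
]


def to_json(rows):
    """Serialize the triples as a JSON array of objects."""
    return json.dumps(rows, indent=2)


def to_csv(rows):
    """Serialize the triples as CSV with one column per field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["text", "label", "explanation"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


json_text = to_json(triples)
csv_text = to_csv(triples)
```

Both formats keep the explanation next to the label it justifies, so a downstream weak labeling step can read either one.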

System Overview
As shown in Fig. 2, our framework consists of two main components: a user-friendly web UI that can capture labels and explanations for labeling decisions, and a weak supervision framework that parses explanations to create weakly labeled data. The framework then uses this weakly labeled data in conjunction with user-provided labels to train models for improved annotation recommendations. Our UI shows annotators unlabeled instances (which can be sampled using active learning), along with annotation recommendations, in an effort to reduce annotation costs. We use PyTorch to build our models and implement an API for communication between the web UI and our weak supervision framework. The learned parameters of our framework are updated in an online fashion, so recommendations improve in near real time. We first describe the annotation UI (§3) and then our weak supervision framework (§4).

UI for Capturing Human Explanation
The emphasis of our front-end design is to simplify the capture of both label and explanation for each labeling decision, while reducing annotation effort via accessible annotation recommendations. Our framework supports two forms of explanations, Triggers and Natural Language. A Trigger is a group of words in the sentence being annotated that aided the annotator's labeling decision, while Natural Language is a written explanation of the labeling decision. This section first presents the UI for capturing triggers (§3.1) and then the UI for capturing natural language explanations (§3.2).

Capturing Triggers
Fig. 3 illustrates how our framework can capture both a named entity (NE) label and triggers for the sentence "We had a fantastic lunch at Rumble Fish yesterday where the food is my favorite". The user is first presented with a piece of text to annotate (Annotating Section), the available labels that may be applied to sub-sequences (spans) of text (in the blue header), and recommendations of which spans of text should be considered NE mentions (Named Entity Recommendation Section). The user may choose to select a span of text to label, or they may click on one of the recommended spans below (Fig. 3a). If the user clicks on a recommended span, a small pop-up displaying the available labels appears, with the recommended label circled in red (Fig. 3a). Once the user selects a label for a span of text, by either clicking on the desired label button or using a predefined shortcut key (e.g., the shortcut key for Restaurant is r), a pop-up appears (Fig. 3b) asking the user to select helpful spans (triggers) from the text that provide useful context in deciding the label for the NE mention (NEM); multiple triggers may be selected. The user may cancel their decision to label a span of text by clicking the x button in the pop-up, but if the user wants to proceed and has selected at least one trigger, they finish the labeling by clicking done. Their label is then visualized in the Annotating Section by highlighting the NEM.
(a) The labels appear in the header, followed by an annotating section; tagging suggestions are shown as underlined spans at the bottom of the page. A user may hover over a tagging suggestion or select a span in order to apply a label to a substring.
(b) After clicking a label to assign to a text span, a pop-up appears asking the user to explain their decision by selecting nearby "trigger" text spans.

Capturing Natural Language
Fig. 4 illustrates how, for the sentence "Tahawwur Hussain Rana who was born in Pakistan but is a Canadian citizen", our framework can capture both a relation label between NEs and the accompanying natural language explanation. First, the user is asked to find the NEs in the sentence. After labeling at least two non-consecutive spans of text as NEs, the user may check the boxes that appear above the labeled NEs. Once two boxes have been checked, the labels in the blue header are replaced with the labels for relations. The click order of the checked boxes is displayed and is taken as the order of the relation. We also display a recommended label to the user in the header section, marked with a circle (Fig. 4a). After clicking on a label, a pop-up appears asking the user to give semantic and syntactic reasons why the labeling decision is correct. Since the natural language explanations are assumed to be made up of predefined predicates, we incrementally suggest predicates as the user types to aid the construction of an explanation (Fig. 4b). In this way, we nudge users towards writing explanations that the semantic parser is able to break down, allowing our framework to extract a useful logical form from the explanation.
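
The incremental predicate suggestion described above can be pictured as prefix matching against a fixed predicate vocabulary. The sketch below is only an illustration under that assumption: the `PREDICATES` list and the `suggest_predicates` function are hypothetical, and the actual predicate set and matching logic of the UI are not specified here.

```python
# Hypothetical predicate vocabulary; the real predicate set is not listed in the text.
PREDICATES = [
    "occurs between",
    "directly precedes",
    "directly follows",
    "is within",
    "contains the word",
]


def suggest_predicates(typed_text, predicates=PREDICATES, limit=3):
    """Suggest predicates whose beginning matches the tail of what was typed."""
    words = typed_text.lower().split()
    matches = []
    # Try longer trailing fragments first so multi-word matches rank higher.
    for start in range(len(words)):
        fragment = " ".join(words[start:])
        for predicate in predicates:
            if predicate.startswith(fragment) and predicate not in matches:
                matches.append(predicate)
    return matches[:limit]


suggestions = suggest_predicates("the phrase 'caused by' occ")
```

Here the trailing fragment "occ" is completed to "occurs between", which steers the annotator towards wording the parser can handle.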

LEAN-LIFE Framework
Our Weak Supervision Framework is composed of two main components: a weak labeling module that parses explanations to create labeling rules, and a downstream model. The framework parses user-provided explanations to generate weakly labeled data and then trains the appropriate downstream model with this augmented training data. Our weak labeling module supports both explanation formats offered to the annotator in the UI: triggers and natural language. This section first introduces how the module utilizes triggers (§4.1) and then presents how the module deals with natural language (§4.2).

Input: Trigger
When a trigger is inputted into the system, we generate weak labels for our training data via soft-matching between trigger representations and unlabeled sentences (Lin et al., 2020). Each sentence may contain one or more triggers, but each trigger is associated with only one label. Our framework jointly learns a mapping between triggers and their labels, using a linear layer with a softmax output and a log-likelihood loss, and the semantic similarity between triggers and their associated sentences, using a contrastive loss; we weight both objectives equally. Through this joint learning, our trigger representations capture label knowledge as well as semantic information. We use these representations to improve model training by generating weakly labeled data via soft matching on the unlabeled sentences. More specifically, for each unlabeled sentence, we first calculate the semantic similarity between the sentence and all collected triggers and then filter out all triggers whose distance to the sentence is larger than our fixed threshold. We then generate a trigger-aware sentence encoding for each threshold-passing trigger and feed these encodings into a downstream classifier for label inference. Finally, we take a majority vote over the outputted label sequences to finalize our weak labels for the unlabeled sentence. In this manner, we are able to train over more data, a good portion of which is weakly labeled.
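
The filter-then-vote procedure above can be sketched as follows. This is a simplified illustration, not the framework's implementation: the vectors stand in for learned representations, Euclidean distance is an assumed choice of similarity distance, and `tag_with_trigger` is a hypothetical callable standing in for the trigger-aware encoder plus downstream classifier.

```python
import math
from collections import Counter


def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def weak_label(sentence_vec, triggers, tag_with_trigger, threshold):
    """Weakly label one sentence from the triggers that soft-match it.

    triggers: list of (trigger_vector, trigger_id) pairs.
    tag_with_trigger: hypothetical callable mapping a trigger id to a
        predicted label sequence for the sentence.
    Returns None when no trigger passes the distance threshold.
    """
    passing = [tid for vec, tid in triggers
               if l2_distance(sentence_vec, vec) <= threshold]
    if not passing:
        return None
    sequences = [tag_with_trigger(tid) for tid in passing]
    # Majority vote per token position across the predicted sequences.
    return [Counter(tags).most_common(1)[0][0] for tags in zip(*sequences)]


# Toy data: three triggers lie close to the sentence, one lies far away.
toy_triggers = [
    ([0.0, 1.0], "had lunch at"),
    ([0.1, 0.9], "where the food"),
    ([0.2, 1.0], "dinner at"),
    ([5.0, 5.0], "born in"),
]
toy_outputs = {
    "had lunch at": ["O", "O", "B-RES"],
    "where the food": ["O", "O", "B-RES"],
    "dinner at": ["O", "O", "O"],
    "born in": ["B-LOC", "O", "O"],
}
weak_tags = weak_label([0.0, 0.9], toy_triggers, toy_outputs.__getitem__, 0.5)
```

In this toy run the far-away trigger is filtered out, and the vote over the three remaining sequences keeps the restaurant tag on the last token.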

Input: Natural Language
When natural language is inputted into the system, our module grows the training data via soft-matching between logical forms parsed from natural language explanations and unlabeled sentences. The module follows the Neural Execution Tree framework of Qin et al. (2020) when dealing with natural language. First, the explanation is parsed into a logical form by a semantic parser. Previous works have suggested using similar logical forms to improve model training by strict matching on the pool of unlabeled sentences to generate additional labeled data. However, Qin et al. (2020) propose an improved model training paradigm that relaxes this strict matching constraint, improving weak labeling coverage and allowing a larger pool of unlabeled data to be used for model training. Our module does assume that each NL explanation can be broken down into a logical form composed of clauses consisting of predicates from four categories, hence the auto-suggest feature in the UI. At weak labeling time, the module scores how likely a given unlabeled sentence fits each clause and then constructs an aggregate score representing the match between the logical form and the unlabeled sentence. If the final score is above configurable thresholds, we weakly label the sentence. [...] indicating the similarity between each token w_i and the keyword q ("happy" in Fig. 5). Our Distant Counting Module aims to relax the distance constraint stated in the explanation, e.g., "by no more than 5 words". If the position of keyword q strictly satisfies the constraint, the score is set to 1; otherwise, the score decreases as the constraint is less satisfied. Finally, the Deterministic Function Module deals with deterministic predicates like "LEFT" and "BETWEEN", which can only be exactly matched in terms of the keyword q. Scores are then aggregated by the Logical Calculation Module to output a final relevancy score.
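
The scoring scheme above can be illustrated with a small sketch. Several details are assumptions rather than the framework's definitions: the exponential decay in `soft_count_score`, the use of a product as the soft conjunction, and exact keyword lookup (the actual module soft-matches tokens against the keyword). All function names are hypothetical.

```python
import math


def soft_count_score(distance, max_distance, decay=0.5):
    """Return 1.0 when the distance constraint holds, else decay towards 0.

    The exponential decay is an assumed shape; the text only states that the
    score decreases as the constraint is less satisfied.
    """
    if distance <= max_distance:
        return 1.0
    return math.exp(-decay * (distance - max_distance))


def deterministic_score(condition_holds):
    """Deterministic predicates (e.g. LEFT) either match exactly or not at all."""
    return 1.0 if condition_holds else 0.0


def soft_and(scores):
    """Aggregate clause scores with a product (an assumed soft conjunction)."""
    result = 1.0
    for score in scores:
        result *= score
    return result


def score_sentence(tokens, keyword, anchor_index, max_distance):
    """Score the clause: keyword is LEFT of the anchor by at most max_distance words."""
    if keyword not in tokens:
        return 0.0
    keyword_index = tokens.index(keyword)
    is_left = deterministic_score(keyword_index < anchor_index)
    near = soft_count_score(abs(anchor_index - keyword_index), max_distance)
    return soft_and([is_left, near])


tokens = "i am happy with the very friendly staff".split()
strict_score = score_sentence(tokens, "happy", anchor_index=7, max_distance=5)
relaxed_score = score_sentence(tokens, "happy", anchor_index=7, max_distance=3)
```

With a limit of 5 words the constraint is met and the score is 1; with a limit of 3 the sentence still receives a nonzero score, which is what lets soft matching cover more unlabeled sentences than strict matching.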

Experiments
We conduct extensive experiments investigating label efficiency to demonstrate the effectiveness of our annotation models. We found that using natural language explanations for RE and SA, and trigger explanations for NER, provided the best results. For the downstream model portion of our weak supervision framework, we use a common supervised method for each task:
(1-RE) BLSTM+ATT (Bahdanau et al., 2014) adds an attention layer onto an LSTM to encode a sequence.
(2-SA) ATAE-LSTM (Wang et al., 2016) combines the aspect term information into both the embedding layer and the attention layer to help the model concentrate on different parts of a sentence.
(3-NER) BLSTM+CRF (Ma and Hovy, 2016) encodes character sequences into a vector and concatenates the vector with pre-trained word embeddings to feed into a word-level BLSTM. Then, it applies a CRF layer to predict sequence labels.
We also use these methods as baselines for comparison.
Tasks and Datasets. We test our implementation on three tasks: RE, SA, and NER. We use TACRED (Zhang et al., 2017) for RE, Restaurant reviews from SemEval 2014 Task 4 for SA, and Laptop reviews (Pontiki et al., 2016) for NER.
Label Efficiency. We claim that when starting with little to no labeled data, it is more effective to ask annotators to provide a label and an explanation for the label than to request a label alone. To support this claim, we conduct experiments to demonstrate the label efficiency of our explanation-leveraging model. We found that labeling one instance and providing an explanation takes twice as long as simply providing a label. Given this observation, we compare the performance of our improved training process with that of the traditional label-only training process while holding annotation time constant between the two trials.
This means we expose the label-only supervised model to the appropriate multiple of the labeled instances shown to the label-and-explanation supervised model (Fig. 6). Each marker on the x-axis of the plots indicates a certain interval of annotation time, which is represented by the number of label+explanation pairs our augmented model training paradigm is given versus the number of labels the traditional label-only model training is shown. We use the common F1 metric to compare performance. As shown in Fig. 6, our model is not only more time- and label-efficient than the traditional label-only training process, but also outright outperforms it. Given these results, we believe it is worthwhile to ask a user to provide both a label and an explanation for the label. Not only does the improvement in performance justify the extra time required to provide the explanation, but we can also achieve higher performance with fewer datapoints and less annotation time.
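
The equal-time comparison can be expressed as simple arithmetic: if a label plus explanation costs twice the time of a label alone, then n explained instances are compared against 2n label-only instances. The sketch below illustrates this; the function name and the instance counts are hypothetical and are not the values used in our experiments.

```python
def label_only_budget(num_explained, time_ratio=2.0):
    """Number of label-only instances that fit in the same annotation time.

    time_ratio is the cost of (label + explanation) relative to a label alone;
    2.0 reflects the doubling of annotation time reported above.
    """
    return int(round(num_explained * time_ratio))


# Hypothetical annotation-time intervals, shown only to illustrate the pairing.
schedule = [(n, label_only_budget(n)) for n in (50, 100, 200)]
```

Each pair in `schedule` corresponds to one annotation-time interval, with the explanation-based count first and the label-only count second.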

Related Works
Leveraging natural language explanations for additional supervision has been explored by many works. Srivastava et al. (2017) first demonstrated the idea of using natural language explanations for weak labeling by jointly training a task-specific semantic parser and label classifier to generate weak labels. This method is limited, though, as the parser is too tightly coupled to the already labeled data; thus their weak learning framework is not able to build a much larger dataset than the one it already has. To address this issue, Hancock et al. (2018) proposed a weak supervision framework that utilizes a more practical rule-based semantic parser. The parser constructs a logical form for an explanation that is then used as a labeling function; this resulted in a significant increase in the size of the training set. Another effort to incorporate explanations can be found in the work of Camburu et al. (2018), who extend the Stanford Natural Language Inference dataset with natural language explanations; this extension was done for the important textual entailment recognition task. They demonstrate the usefulness of explanations as an additional training signal for learning more comprehensive sentence representations. Even earlier, Andreas et al. (2016) explored breaking down natural language explanations into linguistic sub-structures for learning collections of neural modules that can be assembled into neural networks. Our framework is closely related to the above methods for weak supervision via explanation. Another approach to weak supervision is to transfer knowledge from a related source to the target domain corpus (Lin and Lu, 2018; Lan et al., 2020). Ni et al. (2017) attempt to create weakly labeled NER data for a target language via annotation projection from a comparable corpus. However, their approach regards unlabeled words as 'O', and so cannot deal with incomplete annotations, which an annotation framework must handle. Shang et al. (2018) and Yang et al. (2018) proposed using a domain-specific dictionary for matching on the unannotated target corpus. Both efforts employ Partial CRFs (Liu et al., 2014), which assign all possible labels to unlabeled words and maximize the total probability. This approach addresses the incomplete annotation problem, but relies heavily on a domain-specific seed dictionary.

Conclusion
In this paper, we propose an open-source, web-based annotation framework, LEAN-LIFE, that not only allows an annotator to provide the needed labels for a task, but can also capture an explanation for each labeling decision. Such explanations enable a significant improvement in model training while only doubling per-instance annotation time. This increase in per-instance annotation time is greatly outweighed by the benefits in model training, especially in low-resource settings, as shown by our experiments. This is an important consideration for any annotation framework: the quicker the framework is able to train annotation recommendation models to reach high performance, the sooner the user receives useful annotation recommendations, which in turn cut down on the annotation time required per instance.
Better training methods also allow us to fight the potential generation of noisy data due to inaccurate annotation recommendations. We hope that our work on LEAN-LIFE will allow researchers and practitioners alike to more easily obtain useful labeled datasets and models for the various NLP tasks they face.

Figure 3: The workflow for annotators to annotate an NE label and trigger span (Rumble Fish as Restaurant).

Figure 4: The workflow for annotators to annotate a relation label and NL explanation (per:nationality as the relation label between "Tahawwur Hussain Rana" and "Canadian").

Figure 5: Weak labeling module for exploiting natural language explanations. The keyword is 'happy'.