FITAnnotator: A Flexible and Intelligent Text Annotation System

In this paper, we introduce FITAnnotator, a generic web-based tool for efficient text annotation. Benefiting from a fully modular architecture, FITAnnotator provides a systematic solution for annotating a variety of natural language processing tasks, including classification, sequence tagging and semantic role annotation, regardless of the language. Three kinds of interfaces are developed to annotate instances, evaluate annotation quality and manage the annotation task, serving annotators, reviewers and managers, respectively. FITAnnotator also provides intelligent annotation by introducing a task-specific assistant that supports and guides annotators based on active learning and incremental learning strategies. This assistant effectively updates from annotator feedback and easily handles incremental labeling scenarios.


Introduction
Manually-labeled gold standard annotations are the first prerequisite for the training and evaluation of modern Natural Language Processing (NLP) methods. With the development of deep learning, neural networks have achieved state-of-the-art performance in a variety of NLP fields. These impressive achievements rely on large-scale training data for supervised training. However, building annotated corpora requires significant human effort, incurs high costs, and places heavy demands on human annotators to maintain annotation quality and consistency.
To improve annotation productivity and reduce the financial cost of annotation, many text annotation tools have been developed that constrain user actions and provide an effective interface. In the early days, platforms for linguistic annotation such as O'Donnell (2008), brat (Stenetorp et al., 2012) and WebAnno-13 (Yimam et al., 2013) mainly focused on providing a visual interface for the labeling process, making annotation accessible to non-expert users. Recently, integrating active learning into annotation systems to provide suggestions to users has become mainstream (TextPro (Pianta et al., 2008), WebAnno-14 (Yimam et al., 2014), Active DOP (van Cranenburgh, 2018), INCEpTION (Klie et al., 2018), etc.), but most of these works focus on English text and rarely consider the multi-lingual setting, which is necessary due to the growing demand for annotation in other languages. In addition to the interface and efficiency, incremental annotation is also necessary in real-world scenarios, since pre-defined annotation standards and rules cannot handle rapidly emerging novel classes, yet it is rarely addressed in existing annotation tools.
To address the challenges above, we propose FITAnnotator, a generic web-based tool for text annotation, which fulfills the following requirements: • Extremely flexible and configurable: our system architecture is fully modular; even the user interface is a replaceable module. This means the system is model-agnostic and supports annotation on a variety of linguistic tasks, including tagging, classification, parsing, etc.
• Active learning: learning from small amounts of data, and selecting by itself what data it would like the user to label from an unlabeled dataset. Annotators label these selected instances and add them to the training set. A new model is automatically trained on the updated training set. This process repeats and results in dramatic reductions in the amount of labeling required to train the NLP model.
• Expansible data provider: previous annotation tools assume a static corpus for annotation, which is inconvenient for annotating from scratch and for corpus expansion. FITAnnotator sets up an independent data loader and data provider, which can continuously import data into the corpus in bulk. The flexible data provider also brings new problems, such as a dynamic labeling schema, which should be solved by incremental learning.
• Incremental learning: creating a prototype for each category and keeping the prototypes of novel categories far from those of the original categories, while sample features still cluster near their corresponding category prototypes. This makes the tool suitable for annotation scenarios where new classes are added incrementally.
• Collaboration & crowdsourcing: the system is designed for the multi-user scenario, where multiple annotators can work collaboratively at the same time. When multiple users cooperate on annotation, the detachable crowdsourcing algorithm interface can be used to allocate overlapping data across the task packages of individual annotators, so that the annotation quality of each user can be evaluated. The system also provides a manual review interface, which can perform sampling inspection and evaluation of each user's annotations.
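The overlap-based allocation in the crowdsourcing setting can be sketched as follows. The function name, the `overlap` parameter and the round-robin assignment of the remaining instances are illustrative assumptions, not FITAnnotator's actual implementation:

```python
import random

def build_task_packages(instances, annotators, overlap=10, seed=0):
    """Split a batch of instances into per-annotator task packages,
    injecting a shared `overlap` subset into every package so that
    inter-annotator agreement can be measured on it later."""
    rng = random.Random(seed)
    pool = list(instances)
    rng.shuffle(pool)
    shared, rest = pool[:overlap], pool[overlap:]
    packages = {a: list(shared) for a in annotators}
    # round-robin the remaining instances so each is labeled exactly once
    for i, inst in enumerate(rest):
        packages[annotators[i % len(annotators)]].append(inst)
    return packages, shared

packages, shared = build_task_packages(range(100), ["ann1", "ann2", "ann3"])
```

Comparing each annotator's labels on the shared subset then gives a per-user quality estimate without re-labeling the whole corpus.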
Figure 1 reflects our design philosophy and comprehension of the interaction between the three major elements in our annotation system.

Related Works
In recent years, the NLP community has developed several annotation tools (Neves and Ševa, 2019). YEDDA (Yang et al., 2018b) provides an easy-to-use and lightweight GUI for collaborative text annotation, together with administrator analyses for evaluating multiple annotators. FLAT 2 introduces a generalised paradigm and the well-defined FoLiA annotation format (van Gompel, 2012), and provides a web-based annotation interface. Doccano (Nakayama et al., 2018) is an open-source, web-based text annotation tool that provides collaboration and intelligent recommendation functions, and includes a user-friendly annotation interface. INCEpTION (Klie et al., 2018) is a comprehensive text annotation system, also web-based and open-source, which integrates active learning algorithms and provides various interfaces for different annotation tasks; it continues to develop toward more tasks, greater convenience, and low-resource scenarios (Klie et al., 2020).
In addition, commercial annotation tools such as Prodigy 3 , tagtog 4 and LightTag 5 also provide powerful active learning support, team-collaboration functions and efficient user interfaces, along with related commercial solutions, and have achieved appreciable commercial success.
All of these intelligent text annotation tools share several common features: support for active learning and for a rich variety of tasks. Commercial annotation tools pay more attention to user experience and collaboration.

Architecture
The architecture of FITAnnotator is influenced by the ideas of functional programming and, in particular, by the desire to combine functional with object-oriented programming. Adhering to programming principles such as immutability and modularity, FITAnnotator is developed in Python, which supports this hybrid programming style. An overview of our system is shown in Figure 2, which has four main modules: 1. core module controls all data flow and provides the gateway for other modules. Tasks and projects are stored in the database of this module, and fields in these records specify the URI of each related module; the system relies on these URIs to transfer and process data between modules. This module also provides an administrator control panel for managing the system and database.
2. data-loader module contains the fundamental tokenizer and the data loader for a specific machine learning model. By deploying data-loader modules with different tokenizers, we can adapt the system to different languages and tasks. In addition, this module also provides a data expansion function: expanded data is cleaned here and passed to the core module.
3. intelligent annotation module acts as the assistant, providing a pre-built machine learning model according to the type of task. This model can be as simple as FastText (Joulin et al., 2017) or as complex as BERT (Devlin et al., 2019). With such a model, we can obtain automatic labeling results for unlabeled data and calculate their ranking scores according to the active learning strategy. By reordering the unlabeled data before pushing them to annotators, the annotation speed can be accelerated. Incremental learning is also implemented in this module. We describe the details of this module in Section 4.

4. interface module contains three separate web interfaces: annotator, reviewer and administrator. The annotator interface presents the unlabeled instances ranked by the recommendation score provided by the active learning module. Upon annotating a new sentence, the annotator is presented with the most probable labels recommended by the active learning model (see Figure 4). When annotators confirm the model recommendation or alter the labels, their operations are fed back to the backend system and used to update the parameters of the active learning model. In the reviewer interface, users monitor the progress of the annotation and see statistics such as the number of annotated instances and the amount of remaining unlabeled data. Reviewers can also review already annotated instances and introduce corrections if necessary. In the administrator interface (shown in Figure 3), the project manager defines the annotation standards and sets all parameters for the annotation process, including the configuration of active learning models, the management of annotators and reviewers, the assignment of tasks and so on.
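The annotator-assistant feedback loop in the annotator interface can be sketched as follows. The `ToyAssistant` class and the function signature are hypothetical stand-ins for the intelligent annotation module, not FITAnnotator's actual API:

```python
from collections import Counter

class ToyAssistant:
    """Hypothetical stand-in for the intelligent annotation module:
    it always suggests the majority label seen so far."""
    def __init__(self):
        self.majority = None

    def predict(self, instance):
        return self.majority

    def update(self, training_set):
        counts = Counter(label for _, label in training_set)
        self.majority = counts.most_common(1)[0][0]

def annotate_one(instance, assistant, training_set, user_decision):
    """One step of the loop: show the model's suggestion, let the
    annotator confirm or alter it, then refresh the model parameters."""
    suggestion = assistant.predict(instance)
    label = user_decision(instance, suggestion)   # confirm or alter
    training_set.append((instance, label))
    assistant.update(training_set)                # feed the decision back
    return label
```

In the real system the update step trains the active learning model on the backend rather than recomputing a majority label, but the data flow (suggest, decide, feed back) is the same.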
The system is written with a modular design intended to be easily modifiable. Modules and interfaces (except the core module and the administrator interface) can be replaced easily for specific requirements. This flexibility makes it easy to adapt to multiple tasks and languages. FITAnnotator currently has three built-in annotation templates: text classification, sequence tagging and semantic structure annotation, which cover the most common NLP tasks, including sentence classification, sentence pair matching, named entity recognition and semantic role annotation. Users can also migrate to other tasks through simple modification of the framework.
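As a minimal sketch of how a replaceable module enables language adaptation, the data-loader's tokenizer can be treated as a pluggable function. The class and method names here are illustrative assumptions, not FITAnnotator's actual interface:

```python
from typing import Callable, Iterable, List

class DataLoader:
    """Data-loader with a swappable tokenizer: replacing the tokenizer
    adapts the pipeline to a new language without touching other modules."""
    def __init__(self, tokenize: Callable[[str], List[str]]):
        self.tokenize = tokenize

    def load(self, raw_texts: Iterable[str]):
        # clean, tokenize, and hand instances on to the core module
        for text in raw_texts:
            text = text.strip()
            if text:  # drop empty lines during cleaning
                yield {"text": text, "tokens": self.tokenize(text)}

whitespace_loader = DataLoader(str.split)   # e.g. space-delimited English
char_loader = DataLoader(list)              # e.g. character-level segmentation
```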

Intelligent Annotation
Creating high-quality annotated corpora is a laborious process and requires experts who are highly familiar with the annotation schemes and standards. To accelerate the annotation process, we introduce an intelligent assistant that incorporates task-specific neural networks to actively assist and guide annotators. At the core of intelligent annotation are two adaptive learning mechanisms: active learning and incremental learning.

Active Learning
A framework in which a model learns from small amounts of data and optimizes the selection of the most informative or diverse samples to annotate, in order to maximize training utility, is referred to as active learning (Gal et al., 2017; Schröder and Niekler, 2020). In particular, we employ a fused active learning method as the default strategy for evaluating, re-ranking and re-sampling data, which considers uncertainty and diversity at the same time (Zhou et al., 2017; Lutnick et al., 2019). Using such a strategy, the most difficult and diverse instances are annotated first, as they are more valuable for model learning than the rest of the corpus. After the instances have been selected by active learning, the system displays them in the annotator interface with highlighted suggested labels. The annotator can then accept or modify each suggestion. The choices are stored and passed to the active learning module as new training data to update the parameters.
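A minimal sketch of such a fused strategy, combining an entropy-based uncertainty term with a distance-based diversity term; the equal weighting and the exact diversity measure are illustrative assumptions rather than FITAnnotator's precise formula:

```python
import numpy as np

def fused_ranking(probs, embeddings, alpha=0.5):
    """Rank unlabeled instances by a fused active-learning score:
    alpha * uncertainty + (1 - alpha) * diversity, highest first."""
    probs = np.asarray(probs, dtype=float)
    emb = np.asarray(embeddings, dtype=float)
    # uncertainty: entropy of the predicted label distribution
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    # diversity: mean Euclidean distance to the other candidates
    dists = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    diversity = dists.mean(axis=1)
    # normalize both terms to [0, 1] before fusing
    def norm(x):
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)
    score = alpha * norm(entropy) + (1 - alpha) * norm(diversity)
    return np.argsort(-score)  # most informative instances first
```

The returned order is exactly what the annotator interface would use to decide which instances to present first.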
For analyzing the effectiveness of active learning strategies in FITAnnotator, we conduct a simple but representative comparative experiment based on the IMDb movie review sentiment classification task (Maas et al., 2011). In this experiment, we respectively explore the effectiveness of uncertainty sampling and diversity sampling in active learning (Fu et al., 2013), and employ a random sampling strategy as the baseline method. Two kinds of popular text classification models (FastText (Joulin et al., 2017) and BERT (Devlin et al., 2019)) are respectively implemented as the backbone of active learning. We use accuracy+ as the indicator to measure the performance (Lu et al., 2019):

$$\text{accuracy}^{+} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}(H_i = M_i)$$

where N is the size of the dataset, and H and M represent the human-annotated labels and the model-predicted labels, respectively. The evaluation is carried out continuously along with the annotation of the IMDb training set: every time 100 new samples are annotated, the performance of the backbone is evaluated on the standard test set. The results are shown in Figure 6. Apparently, the BERT-based active learning method outperforms the FastText-based method. In terms of training convergence speed, the sampling strategy based on the uncertainty criterion is similar to the diversity criterion, but both are obviously faster than the random sampling baseline. After a large number of samples are labeled, the accuracies of the sampling methods converge. This observation demonstrates that our system is able to accelerate the training of the models by introducing active learning algorithms, so as to provide users with label recommendations more quickly and accurately.
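Given the definitions of N, H and M, the indicator can be computed as a plain per-instance agreement rate; this is a reconstruction from those definitions, and the exact accuracy+ formulation of Lu et al. (2019) may differ in detail:

```python
def accuracy_plus(human_labels, model_labels):
    """Fraction of instances where the model prediction M agrees with
    the human annotation H, over the N instances evaluated."""
    assert len(human_labels) == len(model_labels)
    n = len(human_labels)
    return sum(h == m for h, m in zip(human_labels, model_labels)) / n
```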

Incremental Learning
Figure 6: Results of different active learning strategies and models on the IMDb dataset. Curves start from 10 on the x-axis.

Existing annotation tools focus on labeling instances based on a fixed annotation scheme. However, the pre-defined standards may not cover all the cases met in the annotation process, especially
for the classification task with constantly updated source data. Take the case of aspect category classification (ACC). On e-commerce platforms, online reviews are valuable resources for providers to get feedback on their services. ACC aims to identify all the aspects discussed in a given review. Yet in the real world, new reviews and products are rapidly emerging, and it is impossible to define a set of aspect categories once and for all that covers every aspect (Toh and Su, 2015; Wu et al., 2018). Considering the enormous cost of re-labeling the entire corpus, in an ideal annotation system the new classes should be integrated with the existing labeled instances, sharing the previously learned parameters of the active learning model. To this end, we introduce an incremental learning mechanism into our annotation system. As shown in Figure 5, by creating a prototype for each category, the classification problem is converted into a problem of matching samples to prototypes (Yang et al., 2018a). During training, the loss function is designed to minimize the distance between each sample and its prototype and to maximize the distance between prototypes (m in Figure 5 is the minimal margin between prototypes). Thus the representation space outside the prototype clusters remains sparse and clear, and a new category prototype can be added easily (Rebuffi et al., 2017).
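The prototype-matching scheme can be sketched as follows, assuming for simplicity a fixed encoder and mean-vector prototypes (the actual system also trains the encoder with the margin loss described above):

```python
import numpy as np

class PrototypeClassifier:
    """Each category keeps one prototype vector; classification matches a
    sample to its nearest prototype, so a new class is added by simply
    creating a new prototype, with no retraining of existing classes."""
    def __init__(self):
        self.prototypes = {}  # label -> prototype vector

    def add_class(self, label, encoded_examples):
        # prototype = mean of the encoded examples of this category
        self.prototypes[label] = np.mean(np.asarray(encoded_examples, float), axis=0)

    def predict(self, x):
        x = np.asarray(x, float)
        return min(self.prototypes,
                   key=lambda c: np.linalg.norm(x - self.prototypes[c]))
```

Because existing prototypes are untouched when `add_class` is called, the negative effect of a newly introduced class on the old classes is structurally limited, which is the property the experiment below tests.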
To verify the effectiveness of FITAnnotator combined with incremental learning, we conduct experiments on the AG News dataset 6 , which is collected from a news corpus with four classes. In order to simulate the real-world scenario, we first use samples belonging to three of the four categories for annotation. After labeling 1000 samples, we import the data of the fourth category and use the class-incremental function provided by FITAnnotator to change the annotation schema. For evaluation, we construct a word-level LSTM + CNN representation model with GloVe word embeddings (Pennington et al., 2014) as the encoder, and compare our prototype-based method with a classic softmax-based classifier. The micro-F1 score is chosen as the evaluation metric. Figure 7 illustrates the experimental results. In the ordinary text classification task, the performance of the softmax-based classifier and the prototype-based classifier is comparable. After introducing the fourth (new) class, the performance of the softmax-based classifier suffers a catastrophic drop. On the contrary, the prototype-based method shows impressive results in the class-incremental scenario, and the negative effect of the newly introduced class is negligible.