Data Annealing for Informal Language Understanding Tasks

There is a huge performance gap between formal and informal language understanding tasks. The recent pre-trained models that improved formal language understanding tasks have not achieved comparable results on informal language. We propose a data annealing transfer learning procedure to bridge the performance gap on informal natural language understanding tasks, enabling a pre-trained model such as BERT to be applied successfully to informal language. In the data annealing procedure, the training set contains mainly formal text data at first; then, the proportion of informal text data is gradually increased during the training process. Our data annealing procedure is model-independent and can be applied to various tasks. We validate its effectiveness in exhaustive experiments. When BERT is trained with our learning procedure, it outperforms all the state-of-the-art models on three common informal language tasks.


Introduction and Related Work
Because of the noisy nature of informal language and the shortage of labeled data, progress on informal language has not been as rapid as on formal language. Many tasks on formal data achieve high performance thanks to deep neural models (Peters et al., 2018; Devlin et al., 2018). However, these state-of-the-art models' excellent performance usually fails to transfer directly to informal data. For example, when a BERT model is fine-tuned on informal data, its performance is less encouraging than on formal data. This is because of the domain discrepancy between the pre-training corpus used by BERT and the target data.
To solve the issues mentioned above, we propose a model-agnostic data annealing procedure. We set informal data as target data and formal data as source data. The training data at first consists mainly of source data, so the data annealing procedure takes advantage of a proper parameter initialization derived from the clean nature of formal data. The proportion of source data then decreases exponentially while the proportion of target data increases, which gives the model more freedom to explore the direction of its next update.
The philosophy behind data annealing is shared with other commonly used annealing techniques. One popular usage of annealing is learning rate annealing. A gradually decayed learning rate gives the model more freedom of exploration at the beginning and leads to better model performance (Zeiler, 2012; Yang and Zhang, 2018; Devlin et al., 2018). Another widespread application of annealing is simulated annealing (Bertsimas and Tsitsiklis, 1993). It reduces the probability of a model converging to a poor local optimum by introducing random noise into the training process. Data annealing has similar functionality to simulated annealing but replaces random noise with source data. In this way, the model explores more of the parameter space at the beginning of the training process while being guided by the knowledge learned from the source domain.
Current state-of-the-art models on informal language tasks are usually designed specifically for a particular task and do not generalize to different tasks (Kshirsagar et al., 2018; Gui et al., 2018). Data annealing is model-independent and can be employed in various informal language tasks. We validate our learning procedure with two popular neural network models in NLP, LSTM and BERT, on three popular natural language understanding tasks: named entity recognition (NER), part-of-speech (POS) tagging, and chunking on Twitter.
When BERT is fine-tuned with the data annealing procedure, it outperforms the state-of-the-art models on all three tasks while keeping the same structure across tasks. In doing so, we also set new state-of-the-art results for the three informal language understanding tasks. Experiments further validate our data annealing procedure's effectiveness when training resources in the target data are limited.

Data Annealing
A pre-trained model like BERT should avoid over-training when applied to a downstream task (Peters et al., 2019; Sun et al., 2019). In transfer learning, it is not ideal to feed in too much source data, as doing so not only prolongs training time but also confuses the model. Therefore, we propose data annealing, a transfer learning procedure that decreases the ratio of formal source data to informal target data over the course of training, addressing both the overfitting and the noisy initialization problems.
In the first stage of data annealing, most training samples are source data, so the model obtains a proper initialization from the abundant clean source data. In the second stage, as we gradually increase the proportion of target data and reduce the proportion of source data, the model explores a larger parameter space; in addition, the labeled source dataset works as an auxiliary task. In the third stage, most of the training data is target data, so the model focuses more on the target information.
We reduce the source data proportion exponentially. Let $\alpha$ be the initial proportion of source data, $t$ the current training step, $m$ the total number of batches, and $\lambda$ the exponential decay rate of $\alpha$. Let $r_t^S$ and $r_t^T$ be the proportions of source data and target data at time step $t$:

$$r_t^S = \alpha \lambda^t, \qquad r_t^T = 1 - r_t^S.$$

Let $D_S$ be the accumulated amount of source data used to train the model, and let $B$ be the batch size. We have

$$D_S = B \sum_{t=0}^{m-1} r_t^S.$$

After the model has been updated for an adequate number of batches, the geometric series converges and we can approximate $D_S$ by

$$D_S \approx \frac{B\alpha}{1-\lambda}.$$

$D_S$ can be decided empirically based on the relation between the source and target datasets. For example, the higher the similarity between the source and the target data, the more knowledge the target task can borrow from the source task, and the larger $D_S$ should be. If researchers want to simplify hyper-parameter tuning or constrain the influence of the source data, $\alpha$ can instead be set from $D_S$:

$$\alpha = \frac{D_S(1-\lambda)}{B}.$$
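To make the schedule concrete, the following is a minimal sketch of these formulas in Python, assuming the exponential schedule $r_t^S = \alpha \lambda^t$ reconstructed above; the function names and standalone script are our own illustration, not the authors' released code.

```python
# A minimal sketch of the data annealing schedule, under the assumption
# r_t^S = alpha * lambda^t; an illustration, not the authors' code.

def source_proportion(t: int, alpha: float, lam: float) -> float:
    """Annealed proportion of source-domain data at step t."""
    return alpha * lam ** t


def approx_source_budget(alpha: float, lam: float, batch_size: int) -> float:
    """Approximate accumulated source data: D_S ~= B * alpha / (1 - lambda)."""
    return batch_size * alpha / (1.0 - lam)


def alpha_from_budget(d_s: float, lam: float, batch_size: int) -> float:
    """Invert the approximation to set alpha from a source-data budget D_S."""
    return d_s * (1.0 - lam) / batch_size


if __name__ == "__main__":
    alpha, lam, B = 0.95, 0.9, 8  # values in the ranges reported in Appendix C
    for t in (0, 10, 50, 100):
        print(f"step {t:3d}: r_t^S = {source_proportion(t, alpha, lam):.4f}")
    print(f"approx. D_S = {approx_source_budget(alpha, lam, B):.1f} samples")
```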

Experimental Design
We validate the procedure with two popular models, LSTM and BERT, on three tasks: named entity recognition (NER), part-of-speech (POS) tagging, and chunking. These tasks achieve much better performance on formal text (such as news) than on informal text (such as tweets).

Datasets
We use OntoNotes-nw (Ralph Weischedel, 2013) as the source dataset and the Ritter11-NER dataset (Ritter et al., 2011) as the target dataset for the NER task. For the POS tagging task, we use the Penn Treebank (PTB) POS tagging dataset (Mitchell P. Marcus, 1999) as the source dataset and Ritter11-POS (Ritter et al., 2011) as the target dataset. For the chunking task, we use CoNLL 2000 (Sang and Buchholz, 2000) as the source dataset and Ritter11-CHUNK (Ritter et al., 2011) as the target dataset. Please refer to Appendix B for more details about the datasets.

Model Setting
We implemented BERT and LSTM to validate the effect of data annealing on all three tasks.

BERT. We implemented both the BERT$_{\text{BASE}}$ and BERT$_{\text{LARGE}}$ models. CRF has been validated as a good classifier by many researchers (Lafferty et al., 2001; Tseng et al., 2005), so we use a CRF as the decoder on top of the BERT structure. In some tasks, the source dataset and target dataset do not share the same label set; therefore, we use two separate CRF decoders, one for the source task and one for the target task.

LSTM. We used character and word embeddings as input features, following previous work (Yang and Zhang, 2018; Yang et al., 2017), and a one-layer bidirectional LSTM to process the input features.
For the same reason as in the BERT implementation, we use two separate CRF classifiers on top of the LSTM structure. We compare data annealing with two popular transfer learning paradigms, parameter initialization (INIT) and multi-task learning (MULT) (Weiss et al., 2016; Mou et al., 2016). We now introduce the training procedures used in the experiments.

Data annealing. In all data annealing experiments, the initial source data ratio $\alpha$ and the decay rate $\lambda$ are tuned in the range (0.9, 0.99). When training the BERT model, we also estimate the total amount of source data $D_S$ fed into the model using the approximation $D_S \approx B\alpha/(1-\lambda)$. By avoiding a large $D_S$, the model is less likely to suffer from catastrophic forgetting, as mentioned in Section 2.

MULT. Multi-task transfer learning optimizes an auxiliary task to improve performance on the target task. We implemented MULT on both the LSTM-CRF and BERT-CRF structures. In all MULT experiments, following Yang et al. (2017) and Collobert and Weston (2008), we tune the ratio of source data in the range (0.1, 0.9).

INIT. Parameter initialization transfer learning transfers weights from a pre-trained model to improve the performance of the target model. We implemented INIT on the BERT-CRF structure. In all INIT experiments, we train on the source data three times and transfer weights from the run that achieves the highest performance. In INIT, the target model benefits from a good initialization that contains knowledge from the source dataset.
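The training loop implied by this setup can be sketched as follows: at each step, a batch is drawn from the source or target domain according to the annealed proportion $r_t^S$, and routed to that domain's CRF head. This is an illustration under our own assumptions; `DummyModel` and `make_batches` are hypothetical stand-ins for the BERT/LSTM encoders and data loaders, not the authors' code.

```python
import random

# Schematic data annealing training loop (illustration only).

def make_batches(domain: str, n: int):
    # Stand-in for a real data loader over `domain` batches.
    return iter(f"{domain}-batch-{i}" for i in range(n))


class DummyModel:
    def train_step(self, batch, head: str):
        # In the real setting: run the encoder, decode with the CRF head
        # for this domain, and apply one optimizer update.
        print(f"update on {batch} with the {head} CRF head")


alpha, lam, total_steps = 0.95, 0.9, 20
source_batches = make_batches("source", total_steps)
target_batches = make_batches("target", total_steps)
model = DummyModel()

for t in range(total_steps):
    r_s = alpha * lam ** t              # annealed source proportion r_t^S
    if random.random() < r_s:           # draw a source batch with probability r_t^S
        model.train_step(next(source_batches), head="source")
    else:                               # otherwise draw a target batch
        model.train_step(next(target_batches), head="target")
```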

Experiment Results
The results of the three tasks are shown in Table 1. Vanilla means the model is trained without transfer learning and only utilizes the target data. DA means the model is trained with the data annealing procedure. All numbers in the table are the average of three runs. It is worth noting that the state-of-the-art results on these three tasks are achieved by different models and complicated adaptation methods, whereas our proposed data annealing procedure is applied to the same structure, without task-specific additions, across different tasks. Within our range of (0.9, 0.99) for $\alpha$ and $\lambda$, we find that data annealing consistently outperforms the other transfer learning methods and the state-of-the-art methods. In most cases, a moderate annealing speed leads to the optimal result. We note that the improvements reported in recent literature on these tasks are usually less than 0.5 in absolute value in either $F_1$ or accuracy (Gui et al., 2018; Lin and Lu, 2018); our data annealing moves the state-of-the-art performance a big step forward. For more experimental details such as hyper-parameters, please refer to Appendix C.

Named Entity Recognition (NER). Our annealing procedure outperforms the other transfer learning procedures in terms of $F_1$, meaning data annealing is especially effective at striking a balance between precision and recall in extracting named entities. A sentence usually contains more words that are not entities, so if the model is not sure whether a word is an entity, it is likely to predict it as not an entity in order to reduce the training loss. The state-of-the-art models achieved high precision but low recall using several adaptation methods. This indicates that the state-of-the-art methods achieve high performance by predicting fewer entities, while the BERT models achieve high performance both by covering more entities and by predicting them correctly.
Part-of-speech Tagging (POS tagging). All the BERT and LSTM models trained under our data annealing procedure outperform the other transfer learning procedures. The improvement over the state-of-the-art model DCNN (Gui et al., 2018) is 1.37 in accuracy on POS tagging. It is worth noting that improvement on this task was limited before our work; for example, DCNN improved accuracy by only 0.26 over the research works before it. Our method also outperforms a recent pre-training work, BERTweet (Nguyen et al., 2020), by 2.24 in accuracy.

Chunking. When LSTM, BERT$_{\text{BASE}}$, and BERT$_{\text{LARGE}}$ are trained under our data annealing procedure, they achieve better performance than under the other transfer learning paradigms. Our best model outperforms the state-of-the-art model by 3.03 in $F_1$.

Table 1: Results on the NER, POS tagging, and chunking tasks. * marks the difference between DA BERT$_{\text{LARGE}}$ and the state-of-the-art results. ** indicates that the state-of-the-art results for these three tasks are achieved by different models. The listed state-of-the-art NER and POS tagging results come from Lin and Lu (2018) and Gui et al. (2018). Since Yang et al. (2017) proposed the state-of-the-art model on the informal chunking task but experimented on a different informal text dataset, we ran their model on the Ritter11-CHUNK dataset and report the result.
The Dataset Size Influence. To further evaluate data annealing when labeled data is limited, we randomly sample 10%, 20%, and 50% of the training set of Ritter11-NER. We then compare our proposed DA BERT$_{\text{LARGE}}$ with the INIT BERT$_{\text{LARGE}}$ and Vanilla BERT$_{\text{LARGE}}$ baselines, taking the average performance of 5 runs for each model. The results in Figure 1 show that our model remains better than INIT BERT$_{\text{LARGE}}$ under limited resources and achieves a significant improvement over the Vanilla BERT$_{\text{LARGE}}$ baseline.

Error Analysis
We performed an error analysis on the Ritter11-NER dataset. We randomly sampled 30 sentences containing entities that are incorrectly predicted by DA BERT$_{\text{LARGE}}$ and list them in Appendix A. We found that a relatively large proportion of the sentences are too noisy to be predicted correctly; this noise is inherent in informal text, and we may need to explore the nature of informal language further to solve it fully. We also calculated the $F_1$ score for each of the ten predefined entity types. Compared with Vanilla BERT$_{\text{LARGE}}$ and INIT BERT$_{\text{LARGE}}$, DA BERT$_{\text{LARGE}}$ achieves a higher $F_1$ score on two frequent entity types, "PERSON" and "OTHER". "PERSON" is a frequent concept in formal data, which shows that our method learns to utilize formal data knowledge to improve "PERSON" detection. "OTHER" covers entities that do not belong to the other pre-defined entity types, and higher performance on "OTHER" suggests that DA BERT$_{\text{LARGE}}$ has a better understanding of the general concept of an entity. INIT BERT$_{\text{LARGE}}$ achieves a higher $F_1$ score on "GEO-LOC". We did not find a clear difference on the other entity types.
We also found that if a word belongs to a rarely appearing entity type, all three models are less likely to predict its entity type correctly. We suspect that a neural model implicitly learns to predict a word while being trained on other words of the same entity type, since such words can share a similar representation in the NER task. We plan to assign a larger penalty to infrequent entity types to tackle this issue in future work.

Conclusion
In this paper, we propose data annealing, a model-independent transfer learning procedure for informal language understanding tasks. It applies to various models such as LSTM and BERT. Exhaustive experiments show that it is an effective approach to transferring knowledge from formal data to informal data. When data annealing is applied with BERT, it outperforms the state-of-the-art models on several informal language understanding tasks. Since large pre-trained models are now widely used, data annealing can also serve as an effective fine-tuning method, and it remains useful when labeled resources are limited.

B Dataset Statistics
We show the statistics of all the datasets used in this paper. The three informal text datasets, Ritter11-NER, Ritter11-POS, and Ritter11-CHUNK, were all created by Ritter et al. (2011). However, different research works have used different names for these datasets; here we name each dataset as the concatenation of the most commonly used name, "Ritter11", and the name of the task.

C Hyper-parameters and Training Process
We introduce the details of the experiments in this section for reproducibility. The maximum number of training epochs is 20 for all LSTM models and 10 for all BERT models. Adam with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and L2 weight decay of 0 is used for all LSTM models; the learning rate for the LSTM models is chosen between 1e-2 and 1e-4. AdamW (Loshchilov and Hutter, 2019) with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and L2 weight decay of 0.01 is used for all BERT models. The batch size for all LSTM and BERT models is 8, and the warmup ratio is 0.1. For the INIT transfer learning setting, we pick the model that achieves the highest performance as the source model. For MULT transfer learning, the ratio of source data among the mixed data is tuned in the range (0.1, 0.9); specifically, the ratio is 0.4 for the NER task, 0.5 for the chunking task, and 0.5 for the POS tagging task. For the data annealing setting, within our range of (0.9, 0.99), data annealing consistently outperforms the other transfer learning methods and the state-of-the-art methods. We set $\alpha$ to 0.95 and $\lambda$ to 0.9 for the NER task, $\alpha$ to 0.99 and $\lambda$ to 0.95 for the chunking task, and $\alpha$ to 0.95 and $\lambda$ to 0.95 for the POS tagging task. All hyper-parameters are tuned on the development set of the corresponding dataset, and results are reported on the test set.
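For quick reference, the per-task settings above can be collected into a small configuration table; the snippet below is only a restatement of the reported values in Python, with names of our own choosing.

```python
# Data annealing hyper-parameters reported above (alpha: initial source
# proportion, lam: decay rate), plus the shared optimization settings.
DATA_ANNEALING = {
    "NER":      {"alpha": 0.95, "lam": 0.90},
    "Chunking": {"alpha": 0.99, "lam": 0.95},
    "POS":      {"alpha": 0.95, "lam": 0.95},
}

SHARED = {
    "batch_size": 8,
    "warmup_ratio": 0.1,
    "bert": {"optimizer": "AdamW", "betas": (0.9, 0.999), "weight_decay": 0.01},
    "lstm": {"optimizer": "Adam",  "betas": (0.9, 0.999), "weight_decay": 0.0},
}
```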