Low Resource Sequence Tagging using Sentence Reconstruction

This work revisits the task of training sequence tagging models with limited resources using transfer learning. We investigate several approaches introduced in recent works and suggest a new loss that relies on sentence reconstruction from normalized embeddings. Specifically, we show that adding a decoding layer for sentence reconstruction improves the performance of various baselines. We show improved results on the CoNLL02 NER and UD 1.2 POS datasets and demonstrate the power of the method for low-resource transfer learning, achieving an F1 score of 0.6 in Dutch using only one Dutch sample.


Introduction
The increased popularity of deep learning led to a giant leap in natural language processing (NLP). Tasks such as neural machine translation (Lample et al., 2018a; Gu et al., 2018), sentiment analysis (Patro et al., 2018) and question answering (Ran et al., 2019) achieved impressive results.
A major limitation of deep learning is the need for huge amounts of training data. Thus, when dealing with low resource datasets, transfer learning is a common solution. A popular approach in NLP is training a language model to obtain a good context-based word representation. Language models such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019b), ELMo (Peters et al., 2018), and XLNet (Yang et al., 2019), which are trained on very large corpora, are used by the community for different NLP tasks. This "transfer learning" across tasks within the same language relies on fine-tuning a language model for a specific task (Sun et al., 2019). This work focuses instead on transfer learning between different languages, for which several approaches have been suggested. Yang et al. (2017) proposed joint training with a large dataset as the source and a small dataset as the target. Zou et al. (2018) showed that aligning sentence representations using an adversarial loss makes it possible to transfer knowledge between two languages.
Contribution. This work analyzes the contribution of various techniques proposed for transfer learning between languages for the task of sequence tagging. In particular, we evaluate joint training and adversarial learning. Moreover, we propose a novel regularization technique: a reconstruction loss with $\ell_2$ normalization. We show that adding this loss improves the performance of various sequence tagging tasks when doing transfer learning.
Our strategy shows promising results for training models without being language-specific, which saves expensive labeling time. An important characteristic of our technique is its ability to provide good tagging in "few-shot learning" (Fei-Fei et al., 2006). We achieve this result by augmenting the small dataset with a larger corpus from another language. Our proposed loss improves the transfer of information and thus the tagging accuracy. We demonstrate our approach on the CoNLL02/03 and the Universal Dependencies (UD) 1.2 datasets.

Related Work
Solving sequence tagging tasks, such as named entity recognition (NER) or part-of-speech (POS) tagging, using statistical methods has been studied for more than two decades. Early solutions used hidden Markov models (HMMs) (Bikel et al., 1997), support-vector machines (SVMs) (Isozaki and Kazawa, 2002) and conditional random fields (CRFs) (Lafferty et al., 2001). We focus on more modern deep learning-based approaches, which significantly improve performance. Collobert et al. (2011) demonstrated the great potential of using neural networks for various NER tasks. The Bidirectional-LSTM (Bi-LSTM) CRF has also been proposed, and Lample et al. (2016) presented a promising architecture for NER by adding character embeddings to its input. Peng and Dredze (2016) used recurrent neural networks (RNNs) for NER and word segmentation in Chinese. In the context of transfer learning for sequence tagging, Yang et al. (2017) showed that by using hierarchical RNNs and joint training, it is possible to transfer knowledge between domains of different corpora and different languages. Cao et al. (2018) showed that self-attention and an adversarial loss enable transfer learning between two different domains in Chinese. Yadav et al. (2018) showed that deep affix features are beneficial to NER. Jiang et al. (2019) used DARTS neural architecture search (Liu et al., 2019a) to improve NER. Lin et al. (2018) obtained interesting results with a multi-lingual multi-task architecture. Devlin et al. (2019) introduced a new representation scheme for NLP tasks, achieving impressive NER results.
A method has also been proposed for obtaining improved representations from Bi-LSTM sentence encoders using labeled and unlabeled data. Barone and Valerio (2016) showed that using an adversarial loss (Goodfellow et al., 2014) may lead to a better word representation. In addition, Adel et al. (2018) used an adversarial loss to obtain a better sentence representation. Tzeng et al. (2017) demonstrated that aligning deep representations using an adversarial loss transfers knowledge from one domain to another. Lample et al. (2018a) applied this approach to unsupervised machine translation. Inspired by these strategies, we propose a method for transfer learning between different languages for sequence tagging. Specifically, we focus on sentence representation alignment.

Figure 1: Proposed Method. Notice that the reconstruction loss labels are taken from the embeddings lookup table. This can be replaced by context-aware embeddings. The LSTMs are language-specific and are fed by the relevant embeddings per sample. We normalize the sentence representation for all sentences and the word representation as well.

Our Approach
This section describes our sentence reconstruction approach for improving low resource sequence tagging tasks. Many successful sequence tagging models are composed of an encoder-decoder structure. We suggest adding to them a new decoder branch composed of a fully convolutional network (FCN) and an $\ell_2$ loss term for reconstructing the word embeddings of the input sentence. To analyze the effectiveness of our proposed technique, we evaluate its contribution compared to other recently proposed strategies for transfer learning across languages: weight sharing and adversarial alignment. For completeness, we briefly describe the baseline we are using and each of these methods. Then, we present our new auxiliary loss.

Table caption (truncated): … (Yang et al., 2017), using sentence reconstruction (L2), using weight sharing based transfer learning (TL), using the adversarial loss and combining them all together.

Baseline
Our base model follows Lample et al. (2016). Specifically, we run an LSTM (Hochreiter and Schmidhuber, 1997) on the character tokens, concatenate the output to the word embeddings and run an additional LSTM. We then feed its output, denoted z, to another LSTM with a CRF at its end, which produces the sequence tagging, whether it is POS or NER. See Fig. 4 for our baseline.

Weight sharing
Yang et al. (2017) have shown that sharing weights between architectures that correspond to different languages leads to transferring knowledge between them. Our joint training model is inspired by their "Cross Lingual Transfer", with the difference that we use a single CRF that is applied to the output of both LSTMs. See Fig. 3 for a schematic of our modified version.

Adversarial loss
The baseline described above essentially learns a hidden sentence representation, $z$. To align representations from different languages, we feed this feature vector to a 1D CNN that encodes it and outputs a softmax class, acting as a discriminator. We add a switch layer at the input that arbitrates between feeding sentences from the source and target language (each uses its respective word embedding). We train the discriminator on the normalized hidden representation of each sentence, $Z = z / \|z\|_2$. Thus, given the possible labels $l_i$, $l_j$ of the predicted language, for an input with label $l_i$ (respectively $l_j$), the discriminator tries to predict $l_i$ ($l_j$), whereas the generator tries to fool the discriminator and cause it to predict the opposite label, $l_j$ ($l_i$). Following Lample et al. (2018a), the adversarial loss is the sum of the discriminator loss and the generator loss, $L_{adv} = L_D + L_G$. In the definitions of these losses, $s_i$ is the input sentence, $e(\cdot)$ the encoder function, and $\theta_D$ and $\theta_{enc}$ are the discriminator's and the encoder's parameters, respectively.

Method                    ES      NL      EN
(Gillick et al., 2015)    82.95   82.84   86.50
(Luo et al., 2015)        -       -       91.20
(Lample et al., 2016)     85.75   81.74   90.94
(Yang et al., 2017)       85.77   85.19   91.26
(Lin et al., 2018)        85.88   86.55   -
(Yadav et al., 2018)      87 …
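
The adversarial objective in this section can be illustrated with a minimal sketch. It assumes a two-language discriminator trained with cross-entropy, in the spirit of Lample et al. (2018a). The function names, the averaging over the batch, and the binary formulation are assumptions made for illustration and are not taken from the paper. In training, $L_D$ would typically be minimized with respect to $\theta_D$ and $L_G$ with respect to $\theta_{enc}$.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit l2 norm, i.e. Z = z / ||z||_2."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def adversarial_losses(p_source, is_source, eps=1e-12):
    """Cross-entropy losses for a two-language discriminator (illustrative).

    p_source:  discriminator probabilities that each sentence comes from
               the source language (computed on normalized representations).
    is_source: true language labels (True = source, False = target).
    Returns (L_D, L_G, L_adv) averaged over the batch.
    """
    n = len(p_source)
    loss_d = 0.0
    loss_g = 0.0
    for p, src in zip(p_source, is_source):
        p_true = p if src else 1.0 - p   # probability of the correct language
        p_flip = 1.0 - p_true            # probability of the opposite language
        loss_d -= math.log(max(p_true, eps))  # discriminator wants the true label
        loss_g -= math.log(max(p_flip, eps))  # generator wants the flipped label
    loss_d /= n
    loss_g /= n
    return loss_d, loss_g, loss_d + loss_g
```

When the discriminator is maximally confused (all probabilities equal to 0.5), both terms equal $\log 2$, which is the equilibrium the encoder is pushed toward.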

Reconstruction loss
An adversarial training scheme can still reach trivial representations, meaning the generator produces sentence representations that do not contain meaningful information about the original sentences. Therefore, we propose using the $\ell_2$ loss for reconstructing the input sentence (word embeddings). We do so by applying to the hidden representation $z$ a 1D FCN with 5 layers, convolution kernels of size 3 and the ReLU non-linearity. Notice that $z$ is a sequence of embedding vectors. Thus, the output of the FCN is also a sequence of vectors, each of which estimates the embedding of the corresponding word in the input sentence. If the generated sentence is of a different length than the input, we use the padding embedding vector to make them equal in length. We train this decoder together with the encoder using an $\ell_2$ reconstruction loss, in which $\theta_{dec}$ are the FCN parameters, $e_i$ is the embedding of the $i$th word in the input sentence and $\hat{e}_i$ is the corresponding reconstructed embedding, which we normalize. The reconstruction loss acts as a regularization term, which improves results even when used by itself (see the ablation study). We would like to emphasize the importance of normalizing the representation vectors. The motivation is that projecting the vectors onto a unit sphere causes the model to learn to maximize the similarity between sentences and words. Figure 1 presents a model with all the discussed regularization techniques. Notice that each component in this model can be applied separately. For example, we may apply our new reconstruction loss alone, or as an additional branch to the adversarial branch with or without weight sharing.

Table 4: Low resource testing for part of speech on UD 1.2 dataset. For each language we ran 3 random seeds and report the mean and std for the baseline and the proposed method.
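
The normalized reconstruction loss of this section can be sketched as follows. This is only an illustration under stated assumptions: both the target and the reconstructed embeddings are normalized here, the loss is averaged over positions, and the names `reconstruction_loss` and `pad_emb` are hypothetical. The sketch also checks the identity behind the unit-sphere motivation: for unit vectors $a$ and $b$, $\|a - b\|_2^2 = 2 - 2\cos(a, b)$, so minimizing the $\ell_2$ loss is equivalent to maximizing cosine similarity.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit l2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def cosine(u, v, eps=1e-12):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / max(nu * nv, eps)


def reconstruction_loss(target_embs, recon_embs, pad_emb):
    """Mean squared l2 distance between normalized embeddings (illustrative).

    target_embs: word embeddings e_i of the input sentence.
    recon_embs:  decoder outputs, one vector per position.
    pad_emb:     padding embedding used to equalize the two lengths.
    """
    length = max(len(target_embs), len(recon_embs))
    target = list(target_embs) + [pad_emb] * (length - len(target_embs))
    recon = list(recon_embs) + [pad_emb] * (length - len(recon_embs))
    total = 0.0
    for e, e_hat in zip(target, recon):
        e_n, e_hat_n = l2_normalize(e), l2_normalize(e_hat)
        total += sum((a - b) ** 2 for a, b in zip(e_n, e_hat_n))
    return total / length
```

Because of the normalization, a reconstruction that points in the right direction but has the wrong magnitude incurs no loss; only the angle between the two embeddings matters.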

Experiments
We follow the experiments of Yang et al. (2017) to evaluate our approach for transfer learning between languages. We compare our proposed regularization to joint training and the adversarial loss. We start by evaluating the impact of each strategy alone, and then gradually combine the losses with each other. Our source-target pairs consist of English and a selected target language (Spanish, Dutch or Romanian). In NER, we test both directions of transfer learning, i.e., English to Spanish and Spanish to English. In POS, English is always the source language. We focus on using word embeddings that are aligned across different languages, specifically "MUSE" (Lample et al., 2018b). Our motivation for choosing it is to leverage the word alignment, which makes the impact of the sentence alignment clearer.
Loss analysis. To understand the impact of our approach, we test it with and without the other techniques for transfer learning between languages. We also compare to each of them applied separately. Table 1 summarizes our results. Notice that our proposed loss improves the performance when combined with other methods and even when applied alone. Also, we have found that the improvement gained by the adversarial loss is marginal. Therefore, we do not use it in the final model used in the next experiments, which consists of only weight sharing and our proposed $\ell_2$ reconstruction loss.
Results. We evaluate our model on three tasks: (i) NER transfer learning compared to leading methods; (ii) NER transfer learning on a subset of the target data; and (iii) POS transfer. We achieve competitive results on CoNLL02 Dutch/Spanish. To test how competitive our approach is, we also compare to state-of-the-art methods. Moreover, we perform experiments on subsets of the data similar to Yang et al. (2017). These experiments exhibit the advantage of our model, especially when training on scarce data. For example, we show that using only nine samples in Spanish (0.001 of the data) we get an F1 score of 0.59 (compared to the 0.16 transfer learning result of Yang et al. (2017)). Table 2 shows the NER results, where we get competitive results on CoNLL02 and improve our baseline on English CoNLL03. Table 4 shows that our method generalizes well to low resource transfer learning in POS. Notice the great improvement between our baseline, shown in Fig. 4, and our method, shown in Fig. 1. Table 3 demonstrates the performance on POS, where we get the largest improvement on Romanian, which is a low resource language (with fewer labels). We also observe the advantage of our regularization for few-shot learning compared to Yang et al. (2017) and Lin et al. (2018). Finally, Table 6 and Table 7 present the results of our approach for "one-shot" learning compared to Lin et al. (2018) and for "zero-shot" learning. A major improvement compared to our baseline is apparent here as well. For few-shot and one-shot learning, we found that it is better to share the base BiLSTM, since a language-specific BiLSTM does not see enough target examples to train.
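
The low-resource subsets used in these experiments can be drawn with a routine like the sketch below. It is an assumption-laden illustration: the function name `subsample`, the rounding rule, and the dataset size of 9,000 sentences in the test are hypothetical and chosen only so that a fraction of 0.001 yields nine samples. The paper does not specify how its subsets were drawn.

```python
import random


def subsample(sentences, fraction, seed):
    """Draw a reproducible random subset of a labeled dataset (illustrative).

    sentences: list of training examples from the target language.
    fraction:  share of the data to keep, e.g. 0.001.
    seed:      random seed, so that each run can be repeated.
    At least one example is always kept, which covers one-shot settings.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(fraction * len(sentences)))
    rng = random.Random(seed)  # local generator, leaves global state untouched
    return rng.sample(sentences, k)
```

Fixing the seed per run makes it straightforward to repeat an experiment over several seeds and report the mean and standard deviation, as done for the low-resource POS results.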

Conclusion
This work demonstrates the power of sentence reconstruction for transferring knowledge from a rich dataset to a sparse one. It achieves competitive results with a relatively simple baseline. We also show its strength in few-shot and one-shot learning.
We believe that the proposed $\ell_2$ sentence reconstruction may serve as an auxiliary loss for other tasks. Also, we have demonstrated our model with MUSE, since it provides word alignment across languages. Yet, our approach can also be applied with other, more recent language models that have stronger context-based embeddings.