Character-level Supervision for Low-resource POS Tagging

Neural part-of-speech (POS) taggers are known to perform poorly with little training data. As a step towards overcoming this problem, we present an architecture for learning more robust neural POS taggers by jointly training a hierarchical, recurrent model and a recurrent character-based sequence-to-sequence network supervised using an auxiliary objective. This way, we introduce stronger character-level supervision into the model, which enables better generalization to unseen words and provides regularization, making our encoding less prone to overfitting. We experiment with three auxiliary tasks: lemmatization, character-based word autoencoding, and character-based random string autoencoding. Experiments with minimal amounts of labeled data on 34 languages show that our new architecture outperforms a single-task baseline and, surprisingly, that, on average, raw text autoencoding can be as beneficial for low-resource POS tagging as using lemma information. Our neural POS tagger closes the gap to a state-of-the-art POS tagger (MarMoT) in low-resource scenarios by 43%, and even outperforms it by some margin on languages with templatic morphology, e.g., Arabic, Hebrew, and Turkish.


Introduction
POS tagging, i.e., assigning syntactic categories to tokens in context, is an important first step when developing language technology for low-resource languages. POS tags can provide an efficient inductive bias for modeling downstream tasks, especially if training data for these tasks are limited.
However, POS tagging can be very challenging if only a few labeled sentences are available. Previous work on POS tagging with limited or no annotated data comes in three flavors (Yarowsky et al., 2001; Goldwater and Griffiths, 2007; Li et al., 2012; Biemann, 2012; Täckström et al., 2013; Dong et al., 2015; Agić et al., 2015): unsupervised POS induction, cross-lingual transfer, or, if some suitable data are available, supervised induction from small labeled corpora or dictionaries. This work focuses on the last of these: we explore the effect of multi-task learning for building robust POS taggers for low-resource languages from small amounts of annotated data.
In low-resource settings, neural POS taggers have been observed to perform poorly compared to log-linear models. This is unfortunate, since neural POS taggers have other advantages, including being easily integrable into multi-task learning architectures, sidestepping feature engineering, and providing compact word-level and sentence-level representations. In this paper, we therefore take steps to bridge the gap to state-of-the-art taggers in such scenarios.
Specifically, we consider training neural POS taggers from 478 annotated tokens (the size of the smallest treebank in UD 2.0). In such a setting, it is often useful to leverage data from other, related tasks, if available. However, since for many low-resource languages such data are hard to find, we consider multi-task learning scenarios with no other sequence labeling data at hand: (i) a scenario in which type-based morphological information is available, e.g., word-lemma pairs as can be found in standard dictionaries or UniMorph, (ii) a scenario where we only rely on raw text corpora in the language, and (iii) a scenario where we do not assume any additional data, but construct a synthetic auxiliary task instead.
In order to include secondary information such as word-lemma pairs into our model, we integrate a character-based recurrent sequence-to-sequence model into a hierarchical long short-term memory (LSTM) sequence tagger (cf. Figure 1). By formulating suitable auxiliary tasks (lemmatization, word autoencoding or random string autoencoding, respectively), we can include additional character-level supervision into our model via multi-task training.
Contributions. We present a novel architecture for inducing more robust neural POS taggers from small samples of annotated data in low-resource languages, combining a hierarchical, deep bi-LSTM sequence tagger with a character-based sequence-to-sequence model. Furthermore, we experiment with different choices of external resources and corresponding auxiliary tasks and show that autoencoding can be as effective an auxiliary task for low-resource POS tagging as lemmatization. Finally, we evaluate our models on 34 typologically diverse languages.

POS Tagging with Subword-level Supervision
Hierarchical POS tagging LSTMs that receive both word-level and subword-level input are known to perform well on unseen words. This is due to their ability to associate subword-level patterns with POS tags. However, hierarchical LSTMs are also very expressive, and thus prone to overfitting. We believe that using subword-level auxiliary tasks to regularize the character-level encoding in hierarchical LSTMs is a flexible and efficient way to get the best of both worlds: such a model is still able to make predictions about unknown words, but the subword-level auxiliary task should prevent it from overfitting.

Hierarchical LSTMs with Character-level Decoding
Our proposed multi-task architecture is shown in Figure 1. For the hierarchical sequence labeling LSTM, we follow prior work.

[Figure 1: Our multi-task architecture, consisting of a shared character LSTM (bottom), as well as a sequence labeling (top) and a sequence-to-sequence (right) part.]

Our subword-level LSTM is bi-directional and operates on the character level (Ling et al., 2015; Ballesteros et al., 2015). Its input is the character sequence of each input word, represented by the embedding sequence $c_1, c_2, \ldots, c_m$. The final character-based representation $v_{c,i}$ of the $i$-th word is the concatenation of the two last LSTM hidden states:

$$v_{c,i} = \mathrm{conc}\big(\mathrm{LSTM}_{c,f}(c_{1:m}), \mathrm{LSTM}_{c,b}(c_{m:1})\big)$$

Second, a context bi-LSTM operates on the word level. We use the term "context bi-LSTM" to denote a bidirectional LSTM which, in order to generate a representation for input element $i$, encodes all elements up to position $i$ with a forward LSTM and all elements from $n$ to $i$ with a backward LSTM. For each sentence represented by the word embeddings $w_1, w_2, \ldots, w_n$, its input is the sequence of concatenations of the word embeddings with the outputs of the subword-level LSTM: $\mathrm{conc}(w_1, v_{c,1}), \mathrm{conc}(w_2, v_{c,2}), \ldots, \mathrm{conc}(w_n, v_{c,n})$. The final representation is again the concatenation of the two last LSTM hidden states:

$$v_{w,i} = \mathrm{conc}\big(\mathrm{LSTM}_{w,f}(x_{1:i}), \mathrm{LSTM}_{w,b}(x_{n:i})\big), \quad x_j = \mathrm{conc}(w_j, v_{c,j})$$

This is then passed on to a classification layer.
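To make this two-level encoder concrete, below is a minimal sketch in PyTorch (our choice; the paper does not specify a framework), using the embedding and hidden sizes reported in the Hyperparameters section. All class and variable names, and the per-word batching, are illustrative assumptions, not the authors' implementation.

    # Minimal sketch of the hierarchical encoder (assumptions noted above).
    import torch
    import torch.nn as nn

    class HierarchicalEncoder(nn.Module):
        def __init__(self, n_chars, n_words, n_tags,
                     char_dim=300, word_dim=64, hidden_dim=100):
            super().__init__()
            self.char_emb = nn.Embedding(n_chars, char_dim)
            self.word_emb = nn.Embedding(n_words, word_dim)
            # Character-level bi-LSTM, shared with the seq2seq encoder.
            self.char_lstm = nn.LSTM(char_dim, hidden_dim,
                                     bidirectional=True, batch_first=True)
            # Word-level context bi-LSTM over conc(w_i, v_{c,i}).
            self.word_lstm = nn.LSTM(word_dim + 2 * hidden_dim, hidden_dim,
                                     bidirectional=True, batch_first=True)
            self.classifier = nn.Linear(2 * hidden_dim, n_tags)

        def char_encode(self, char_ids):
            # char_ids: (1, m) character indices of one word.
            _, (h_n, _) = self.char_lstm(self.char_emb(char_ids))
            # Concatenate last forward and backward states -> v_{c,i}.
            return torch.cat([h_n[0], h_n[1]], dim=-1)   # (1, 2*hidden_dim)

        def forward(self, word_ids, char_ids_per_word):
            # word_ids: (n,) indices; char_ids_per_word: list of (1, m_i) tensors.
            v_c = torch.cat([self.char_encode(c) for c in char_ids_per_word], dim=0)
            x = torch.cat([self.word_emb(word_ids), v_c], dim=-1).unsqueeze(0)
            out, _ = self.word_lstm(x)               # (1, n, 2*hidden_dim)
            return self.classifier(out.squeeze(0))   # (n, n_tags) tag scores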

Character-based Decoding
We extend the network with a new component, a character-based sequence-to-sequence model. It consists of a bidirectional LSTM encoder which is connected to an LSTM decoder (Cho et al., 2014;Sutskever et al., 2014).
Encoding. The encoder corresponds to the character-level bi-directional LSTM described above and thus yields the representation $v_{c,i} = \mathrm{conc}\big(\mathrm{LSTM}_{c,f}(c_{1:m}), \mathrm{LSTM}_{c,b}(c_{m:1})\big)$ for an input word embedded as $c_1, c_2, \ldots, c_m$. Parameters of the character-level LSTM are shared between the sequence labeling and the sequence-to-sequence parts of our model.
Decoding. The decoder receives the concatenation of the last hidden states, $v_{c,i}$, as input. In particular, we do not use an attention mechanism (Bahdanau et al., 2015), since our goal is not to improve performance on the auxiliary task, but to encourage the encoder to learn better word representations. The decoder is trained to predict each output character $y_t$ conditioned on $v_{c,i}$ and the previous predictions $y_1, \ldots, y_{t-1}$:

$$p(y_t \mid \{y_1, \ldots, y_{t-1}\}, v_{c,i}) = g(y_{t-1}, s_t, v_{c,i})$$

for a non-linear function $g$ and the LSTM hidden state $s_t$. The final softmax output layer is calculated over the character vocabulary of the language.
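The decoder can be sketched analogously; feeding $v_{c,i}$ into every step (rather than using attention) follows the description above, while the exact wiring and names are our assumptions.

    # Minimal sketch of the character-level decoder conditioned on v_{c,i}.
    import torch
    import torch.nn as nn

    class CharDecoder(nn.Module):
        def __init__(self, n_chars, char_dim=300, hidden_dim=100, enc_dim=200):
            super().__init__()
            self.char_emb = nn.Embedding(n_chars, char_dim)
            # Input at step t: previous character y_{t-1} and the encoding v_{c,i}.
            self.lstm = nn.LSTM(char_dim + enc_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, n_chars)

        def forward(self, v_c, prev_chars):
            # v_c: (1, enc_dim); prev_chars: (1, T) gold characters, shifted right.
            T = prev_chars.size(1)
            cond = v_c.unsqueeze(1).expand(1, T, -1)   # repeat v_c at each step
            x = torch.cat([self.char_emb(prev_chars), cond], dim=-1)
            s, _ = self.lstm(x)                        # hidden states s_t
            return self.out(s)                         # (1, T, n_chars) logits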
Joint model. Figure 1 shows how parameters are shared between the sequence labeling and the sequence-to-sequence components of our network. All model parameters, including all embeddings, are updated during training. Our model architecture is "symmetric", i.e., it does not distinguish between main and auxiliary tasks. However, we use early stopping on the development set of the main task, such that convergence is not guaranteed for the auxiliary tasks.

Multi-task Learning
We want to train our neural model jointly on (i) a low-resource main task, i.e., POS tagging, and (ii) an exchangeable auxiliary task (cf. §3). We therefore maximize the following joint log-likelihood:

$$\mathcal{L}(\theta) = \sum_{(s,l) \in D_{POS}} \log p_\theta(l \mid s) + \sum_{(in,out) \in D_{aux}} \log p_\theta(out \mid in)$$

Here, $D_{POS}$ denotes the POS tagging training data, with $s$ being the input sentence and $l$ the corresponding label sequence. $D_{aux}$ is a placeholder for our auxiliary task training data, with examples consisting of input $in$ and output $out$. We experiment with three different auxiliary tasks, which are described in the next section. The set of model parameters $\theta$ is the union of the parameters of the sequence labeling and the sequence-to-sequence parts. Parameters of the character LSTM are shared between the main and the auxiliary task.
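A joint update maximizing this objective might look as follows, reusing the encoder and decoder sketches from above. Summing both negative log-likelihoods in one backward pass is one possible realization; the authors' actual batching and task-sampling schedule is not specified.

    # Sketch of one joint training step (assumptions noted above).
    import torch.nn.functional as F

    def joint_step(encoder, decoder, optimizer, pos_batch, aux_batch):
        optimizer.zero_grad()
        loss = 0.0
        # Main task: POS tagging negative log-likelihood.
        for word_ids, char_ids, tag_ids in pos_batch:
            scores = encoder(word_ids, char_ids)
            loss = loss + F.cross_entropy(scores, tag_ids)
        # Auxiliary task: character-level seq2seq NLL; the character LSTM
        # inside encoder.char_encode is shared between both tasks.
        for in_chars, out_chars in aux_batch:
            v_c = encoder.char_encode(in_chars)        # in_chars: (1, m)
            # out_chars: (1, T) gold output prefixed with a BOS symbol.
            logits = decoder(v_c, out_chars[:, :-1])   # teacher forcing
            loss = loss + F.cross_entropy(logits.squeeze(0), out_chars[0, 1:])
        loss.backward()
        optimizer.step()
        return loss.item()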

(Un)supervised Auxiliary Tasks
In this section, we will describe our three auxiliary tasks in more detail.

Random String Autoencoding
Random string autoencoding is a synthetic auxiliary task created for a setting in which we have no additional resources available. It consists of reconstructing, given a random character sequence as input, the same sequence in the output. Concretely, given the alphabet $A_L$ of a language $L$, the task is to learn a mapping $r \rightarrow r$ for $r \in A_L^+$. Note that the random string $r$ is in most cases not a valid word in $L$. Additionally, we prepend a special symbol $S_r$ to the input which indicates the current task to the encoder, e.g., "OUT=AE" or "OUT=POS".

Word Autoencoding
Word autoencoding is a special case of the previous auxiliary task, in that we now use actual words in the language, e.g., from unlabeled corpora or dictionaries. As for random string autoencoding, the task consists of reproducing a given input character sequence in the output. As before, we additionally feed a special symbol $S_w$ into the model which signals the current task to the encoder. Our final training examples are of the form $(S_w; w) \rightarrow w$, where $w \in V_L$ for the vocabulary $V_L$ of $L$. Word autoencoding has been used as an auxiliary task before (e.g., Vosoughi et al., 2016).
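For illustration, such examples could be constructed as in the following sketch; the task symbol name "OUT=WAE" is our own placeholder (the paper only gives "OUT=AE" and "OUT=POS" as examples).

    # Sketch: build word autoencoding examples (S_w; w) -> w.
    def word_ae_examples(words):
        # Map the task-symbol-prefixed character sequence of each word
        # to the word's own character sequence.
        return [(["OUT=WAE"] + list(w), list(w)) for w in words]

    # word_ae_examples(["cats"]) ->
    #   [(['OUT=WAE', 'c', 'a', 't', 's'], ['c', 'a', 't', 's'])]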

Lemmatization
Lemmatization is a task from the area of inflectional morphology; in particular, it can be seen as a special case of morphological reinflection. Its goal is to map a given inflected word form to its lemma, e.g., Spanish sueño → soñar.
Sequence-to-sequence models have shown strong performance on morphological inflection (Aharoni et al., 2016; Kann and Schütze, 2016; Makarov et al., 2017). Therefore, when morphological dictionaries are available, we can easily combine a neural model for lemmatization with a POS tagger, using our architecture. Our intuition for this auxiliary task is that it should make it possible to include morphological information in our character-based word representations. Formally, the task can be described as follows. Let $A_L$ be a discrete alphabet for language $L$ and let $T_L$ be a set of morphological tags for $L$. The morphological paradigm of a lemma $w$ in $L$ is a set of pairs $\pi(w) = \{(f_k[w], t_k)\}_{k \in T(w)}$, where $f_k[w] \in A_L^+$ is an inflected form, $t_k \in T_L$ is its morphological tag, and $T(w)$ is the respective set of paradigm slots. Lemmatization consists of predicting the lemma $w$ for an inflected form $f_k[w]$ in $\pi(w)$.
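Assuming UniMorph-style TSV data (lemma, inflected form, morphological tag per line), training pairs for this task could be built as in the following sketch; the file layout and the task symbol "OUT=LEM" are our assumptions.

    # Sketch: build lemmatization examples from UniMorph-style TSV.
    def load_lemma_examples(path):
        examples = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                lemma, form, _tag = line.split("\t")
                # Map the characters of the inflected form to those of the lemma.
                examples.append((["OUT=LEM"] + list(form), list(lemma)))
        return examples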

Experimental Setup
In this section, we will describe our experiments, including data, baselines, and hyperparameters.

Data
POS. The data for our POS tagging main task come from the Universal Dependencies (UD) 2.0 collection (Nivre et al., 2017). We use the provided train/dev/test splits.
Since we use the official datasets from the SIGMORPHON 2017 shared task on universal morphological reinflection (Cotterell et al., 2017) for the lemmatization auxiliary task, we limit ourselves to the languages featured there. We simulate a low-resource setting by reducing all training sets to 478 tokens, which, among our languages, is the size of the smallest training set in UD 2.0.
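One plausible way to perform this reduction is to keep whole sentences from the start of the treebank until the token budget is exhausted; whether the original experiments truncated exactly this way is not stated, so the sketch below is an assumption.

    # Sketch: simulate the 478-token low-resource setting.
    def truncate_to_budget(sentences, budget=478):
        kept, total = [], 0
        for sent in sentences:          # each sentence is a list of tokens
            if total + len(sent) > budget:
                break
            kept.append(sent)
            total += len(sent)
        return kept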
Lemmatization. For the lemmatization auxiliary task, we make use of the word-lemma pairs in the training sets released for the SIGMORPHON 2017 shared task (Cotterell et al., 2017), which are subsets of the UniMorph data. In particular, there are three settings with different training sets per language: low (100 examples), medium (1,000 examples) and high (10,000 examples).
Word autoencoding. For the word autoencoding task, we use the inflected forms from the SIGMORPHON 2017 shared task dataset for each respective setting. Due to identical forms for different slots in the morphological paradigm of some lemmas, we might have duplicate examples in those datasets.
Random string autoencoding. For the random string autoencoding auxiliary task, we generate random character sequences to be used as training instances for our model's encoder-decoder part. In order to have the same number of unique characters as for the other two auxiliary tasks, we use the character sets from the SIGMORPHON shared task vocabulary for each respective setting. We then draw characters uniformly from these sets and form strings of random lengths between 3 and 20 characters.
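A sketch of this generator, following the stated parameters (uniform character draws, lengths between 3 and 20; the task symbol follows the "OUT=AE" convention from above):

    # Sketch: generate random-string autoencoding examples.
    import random

    def random_string_examples(charset, n_examples, seed=0):
        rng = random.Random(seed)
        examples = []
        for _ in range(n_examples):
            r = [rng.choice(charset) for _ in range(rng.randint(3, 20))]
            examples.append((["OUT=AE"] + r, r))   # reconstruct r from r
        return examples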

Baselines
TreeTagger. Since low-resource settings like the one considered here are known to be challenging for neural models, we employ TreeTagger (Schmid, 1995), a non-neural Markov model tagger, as our first baseline.
MarMoT. Our second non-neural baseline is MarMoT (Müller et al., 2013), a state-of-the-art CRF-based morphological tagger.
Single-task hierarchical LSTM. We compare all our results to a single-task baseline, which largely corresponds to the POS tagging architecture our model extends. We modify the original code by adding character dropout with a coefficient of 0.25 to improve regularization and to make the baseline more comparable to, and competitive with, our models.
Multi-task baseline. We further compare to a multi-task architecture suggested in previous work, which jointly learns to predict the POS tag and the log frequency of a word. The intuition described by its authors is that the auxiliary loss, being predictive of word frequency, can improve the representations of rare words. Note that this baseline can easily be combined with our architecture; we leave the exploration of such a combination for future work.

Hyperparameters
For all networks, we use 300-dimensional character embeddings, 64-dimensional word embeddings and 100-dimensional LSTM hidden states. Encoder and decoder LSTMs have 1 hidden layer each. For training, we use ADAM (Kingma and Ba, 2014), as well as word dropout and character dropout, each with a coefficient of 0.25 (Kiperwasser and Goldberg, 2016). Gaussian noise is added to the concatenation of the last states of the character LSTMs for POS tagging. All models are trained using early stopping, with a minimum number of 75 (single-task and low), 30 (medium) or 20 (high) epochs and a maximum number of 300 epochs, which is never reached. We stop training if we obtain no improvement for 10 consecutive epochs. The best model on the development set is used for testing.
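For reference, these hyperparameters can be summarized in a single configuration; the field names are ours, the values are taken from the text above.

    # Hyperparameters from the text, gathered in one place (field names ours).
    CONFIG = {
        "char_emb_dim": 300,
        "word_emb_dim": 64,
        "lstm_hidden_dim": 100,
        "lstm_layers": 1,
        "optimizer": "adam",
        "word_dropout": 0.25,
        "char_dropout": 0.25,
        "min_epochs": {"single_task": 75, "low": 75, "medium": 30, "high": 20},
        "max_epochs": 300,
        "early_stopping_patience": 10,   # epochs without improvement
    }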

Results
The test results for all languages and settings are presented in Table 1.
Our first observation is that using 100 words of auxiliary task data seems to be sufficient, as we do not see consistent gains from adding more auxiliary task instances. This might be related to the very limited amount of POS tagging data we assume available; a too low ratio of main to auxiliary task data probably inhibits further gains.
Second, we find that lemmatization and word autoencoding bring similar gains on average over all languages; differences lie between only 0.0013 (medium) and 0.0021 (high) absolute accuracy. Comparing word and random string autoencoding, two observations can be made: in the low setting, differences are small, while random string autoencoding is the only task which performs worse in the high setting than in the low setting. Thus, the gap between the two autoencoding tasks grows for larger amounts of auxiliary task data. This might be explained by random string autoencoding helping the model to distinguish characters more clearly, while at the same time destroying its ability to pick up on beneficial similarities between characters.
Our third observation is that lemmatization and word autoencoding consistently outperform the baseline auxiliary task of predicting log frequencies, with up to 0.0081 (POS+AE, low/high) higher absolute accuracy; random string autoencoding performs 0.0079 better in the low setting. We may thus conclude that, in our setting, auxiliary tasks with additional character-level supervision are more beneficial.
Fourth, both non-neural baselines outperform the single-task neural model. Adding auxiliary tasks, however, leads to higher performance (averaged over languages) than TreeTagger. MarMoT is the overall best performing model; for some individual languages, though, the neural model obtains higher accuracies, e.g., for Bulgarian, Dutch, or Romanian. In particular, our approach is stronger for languages with templatic morphology, e.g., Arabic, Hebrew, or Turkish. This emphasizes the importance of neural approaches for the task.
Finally, we look at differences between auxiliary tasks for individual languages. Here, we notice that autoencoders often outperform lemmatization for agglutinative languages. An explanation for this might be that agglutinative morphology is harder to learn, and the chance of overfitting on a small sample is therefore higher.


Analysis

Error Analysis
Table 2 lists the F1 scores of our models across POS tags, compared to the single-task baseline.
Our first observation is that the decrease in performance from training on more random strings is relatively uniform across tags, with the exception of DET, PUNCT, and X, whose tokens consist of very few, fixed characters. We also note that all our models with character-level supervision get worse at predicting numerals. In contrast, ADP, AUX, CCONJ, and PUNCT always benefit from a character-based auxiliary task. Generally, the POS taggers trained on small amounts of data are challenged by rare syntactic categories such as interjections and the miscellaneous category X.

[Table 1: Averaged accuracies and standard deviations over 5 training runs on UD 2.0 test sets, with 478 tokens of POS-annotated data and varying amounts of data for the auxiliary task (low, medium, and high). Best result for each language in bold. Autoencoding and lemmatization are on par across the board, and with 100 training examples (low), random autoencoding is also competitive.]

Why does Random String Autoencoding Help?
In the low setting, i.e., when using only 100 auxiliary task examples, autoencoding, especially of random strings, works as well as or better than lemmatization for highly agglutinative languages such as Basque, Finnish, Hungarian, and Turkish. Further, while random string autoencoding is in general less effective than word autoencoding or lemmatization, it performs on par with these auxiliary tasks in the setup with the least auxiliary task data. This raises the question why random string autoencoding works at all for a POS tagging main task. We offer two potential explanations.
General properties of the auxiliary tasks. Bingel and Søgaard (2017) showed that multi-task learning is more likely to be helpful when the auxiliary loss does not plateau earlier than the main loss. Figure 2 presents the loss curves of one model for each of four randomly selected languages (the corresponding plots for the remaining languages look similar). They show exactly the patterns found to be predictive of multi-task learning gains by Bingel and Søgaard (2017), who offer the explanation that when the auxiliary loss does not plateau before the main loss, it can help the model out of local minima during training.
Preventing character collisions. A random string autoencoder needs to memorize the input string, i.e., it must encode which characters occur at which positions in the input sequence. Jointly learning a random string autoencoder thus forces a model to make characters easy to differentiate, pushing them apart in vector space. See Table 3 for the average character distances and Table 4 for the minimum character distances across languages for our three systems (low setting) and our single-task baseline. For each system, the score is obtained by first calculating, for each language, the average distance between all characters or, respectively, the minimum distance between any two characters, and then averaging across all languages.
In small-sample regimes, pushing individual characters further apart is a potential advantage, since character collisions can be harmful at inference time. We note that this is analogous to feature swamping of covariate features, as described in Sutton et al. (2006), who use a group lasso regularizer to prevent feature swamping. In the same way, we could detect distributionally similar characters and use a group lasso regularizer to prevent covariate characters from swamping each other. However, pushing characters apart in an uninformed way can also hurt performance: we conjecture that it makes it impossible for the model to learn useful similarities between characters (random string autoencoding in the high setting has a minimum distance of 0.104, compared to 0.018 for the single-task model). This might explain the performance gap between random string autoencoding and the other two auxiliary tasks in the high setting.
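The distance statistics could be computed as in the following sketch; the text does not specify the distance metric, so Euclidean distance is our assumption.

    # Sketch: mean and minimum pairwise distance between character embeddings
    # for one language (Euclidean distance assumed).
    import itertools
    import torch

    def char_distance_stats(char_embeddings):
        # char_embeddings: (n_chars, dim) tensor of learned embeddings.
        dists = [torch.dist(char_embeddings[i], char_embeddings[j]).item()
                 for i, j in itertools.combinations(range(char_embeddings.size(0)), 2)]
        return sum(dists) / len(dists), min(dists)   # (average, minimum)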

Related Work
POS tagging. POS tagging and other NLP sequence labeling tasks have been successfully approached with bidirectional LSTMs (Yang et al., 2016). Although previous work using such architectures often relies on massive datasets, it has been shown that bi-LSTMs in particular are not as reliant on large amounts of data in a sequence labeling scenario as previously assumed. Furthermore, the hierarchical bi-LSTM tagger we build on is itself a multi-task model, trained jointly to predict the POS tag and the log frequency of a word, and obtained state-of-the-art results for POS tagging in several languages. Hence, in the low-resource setting considered here, we build upon this architecture and extend it to a multi-task setup involving sequence-to-sequence learning. Note, though, that in contrast to our setup, both of its tasks are sequence labeling tasks and use the same input.
The same holds true for the multi-task model by Rei (2017), which is used to investigate how an additional language modeling objective can improve performance for sequence labeling without any need for additional training data. He reported gains for all investigated tasks, including POS tagging. Finally, Gillick et al. (2016) present a multilingual model based on ideas from multi-task training, with each language constituting a separate task.
Multi-task learning in NLP. Neural networks make multi-task learning via (hard) parameter sharing particularly easy; thus, different task combinations have been investigated extensively. For sequence labeling, many combinations of tasks have been explored, e.g., by Martínez Alonso and Plank (2017); Bjerva et al. (2016); Bjerva (2017a,b); Augenstein and Søgaard (2018). Other work analyzes which task combinations are beneficial, presents more flexible architectures which learn what to share between the main and auxiliary tasks, or combines multi-task learning with semi-supervised learning for strongly related tasks with different output spaces.
However, work on combining sequence labeling main tasks with sequence-to-sequence auxiliary tasks is harder to find. Dai and Le (2015) pretrain an LSTM as part of a sequence autoencoder on unlabeled data to obtain better performance on a sequence classification task, but report poor results for joint training. We obtain different results: even simple sequence-to-sequence tasks can indeed be beneficial for the sequence labeling task of low-resource POS tagging. This might be due to differences in the architectures or tasks.
Cross-lingual learning. Even though we do not employ cross-lingual learning in this work, we consider it highly relevant for low-resource settings and, thus, want to mention some important work here. Cross-lingual approaches have been used for a large variety of tasks, e.g., automatic speech recognition (Huang et al., 2013), entity recognition (Wang and Manning, 2014), language modeling (Tsvetkov et al., 2016), or parsing (Cohen et al., 2011; Søgaard, 2011; Naseem et al., 2012; Ammar et al., 2016). In the realm of sequence-to-sequence models, most work has been done for machine translation (Dong et al., 2015; Zoph and Knight, 2016; Ha et al., 2016; Johnson et al., 2017). Another example is a character-based approach by Kann et al. (2017) for morphological generation.

Conclusion
We explored multi-task setups for training robust POS taggers for low-resource languages from small amounts of annotated data. In order to introduce additional character-level supervision into a hierarchical recurrent neural model, we proposed a novel network architecture. We considered different available types of external resources (word-lemma pairs, unlabeled corpora, or none) and employed corresponding auxiliary tasks (lemmatization, word autoencoding, or the synthetic task of random string autoencoding), as well as varying amounts of auxiliary task data. While we did not find systematically superior performance for models trained with lemmatization as an auxiliary task, the results confirmed our hypothesis that additional subword-level supervision improves POS taggers for resource-poor languages.