Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss

Bidirectional long short-term memory (bi-LSTM) networks have recently proven successful for various NLP sequence modeling tasks, but little is known about their reliance to input representations, target languages, data set size, and label noise. We address these issues and evaluate bi-LSTMs with word, character, and unicode byte embeddings for POS tagging. We compare bi-LSTMs to traditional POS taggers across languages and data sizes. We also present a novel bi-LSTM model, which combines the POS tagging loss function with an auxiliary loss function that accounts for rare words. The model obtains state-of-the-art performance across 22 languages, and works especially well for morphologically complex languages. Our analysis suggests that bi-LSTMs are less sensitive to training data size and label corruptions (at small noise levels) than previously assumed.

We consider using bi-LSTMs for POS tagging.Previous work on using deep learning-based methods for POS tagging has focused either on a single language (Collobert et al., 2011;Wang et al., 2015) or a small set of languages (Ling et al., 2015;Santos and Zadrozny, 2014).Instead we evaluate our models across 22 languages.In addition, we compare performance with representations at different levels of granularity (words, characters, and bytes).These levels of representation were previously introduced in different efforts (Chrupała, 2013;Zhang et al., 2015;Ling et al., 2015;Santos and Zadrozny, 2014;Gillick et al., 2016;Kim et al., 2015), but a comparative evaluation was missing.
Moreover, deep networks are often said to require large volumes of training data.We investigate to what extent bi-LSTMs are more sensitive to the amount of training data and label noise than standard POS taggers.
Finally, we introduce a novel model, a bi-LSTM trained with auxiliary loss.The model jointly predicts the POS and the log frequency of the next word.The intuition behind this model is that the auxiliary loss, being predictive of word frequency, helps to differentiate the representations of rare and common words.We indeed observe performance gains on rare and out-of-vocabulary words.These performance gains transfer into general improvements for morphologically rich languages.

Contributions
In this paper, we a) evaluate the effectiveness of different representations in bi-LSTMs, b) compare these models across a large set of languages and under varying conditions (data size, label noise) and c) propose a novel bi-LSTM model with auxiliary loss (LOGFREQ).

arXiv:1604.05529v2 [cs.CL] 19 May 2016
2 Tagging with bi-LSTMs Recurrent neural networks (RNNs) (Elman, 1990) allow the computation of fixed-size vector representations for word sequences of arbitrary length.An RNN is a function that reads in n vectors x 1 , ..., x n and produces an output vector h n , that depends on the entire sequence x 1 , ..., x n .The vector h n is then fed as an input to some classifier, or higher-level RNNs in stacked/hierarchical models.The entire network is trained jointly such that the hidden representation captures the important information from the sequence for the prediction task.
A bidirectional recurrent neural network (bi-RNN) (Graves and Schmidhuber, 2005) is an extension of an RNN that reads the input sequence twice, from left to right and right to left, and the encodings are concatenated.The literature uses the term bi-RNN to refer to two related architectures, which we refer to here as "context bi-RNN" and "sequence bi-RNN".In a sequence bi-RNN (bi-RNN seq ), the input is a sequence of vectors x 1:n and the output is a concatenation (•) of a forward (f ) and reverse (r) RNN each reading the sequence in a different directions: In a context bi-RNN (bi-RNN ctx ), we get an additional input i indicating a sequence position, and the resulting vectors v i result from concatenating the RNN encodings up to i: Thus, the state vector v i in this bi-RNN encodes information at position i and its entire sequential context.Another view of the context bi-RNN is of taking a sequence x 1:n and returning the corresponding sequence of state vectors v 1:n .
LSTMs (Hochreiter and Schmidhuber, 1997) are a variant of RNNs that replace the cells of RNNs with LSTM cells that were designed to prevent vanishing gradients.Bidirectional LSTMs are the bi-RNN counterpart based on LSTMs.
Our basic bi-LSTM tagging model is a context bi-LSTM taking as input word embeddings w.We incorporate subtoken information using an hierarchical bi-LSTM architecture (Ling et al., 2015;Ballesteros et al., 2015).We compute subtokenlevel (either characters c or unicode byte b) embeddings of words using a sequence bi-LSTM at the lower level.This representation is then concatenated with the (learned) word embeddings vector w which forms the input to the context bi-LSTM at the next layer.This model, illustrated in Figure 1 (lower part in left figure), is inspired by Ballesteros et al. (2015).We also test models in which we only keep sub-token information, e.g., either both byte and character embeddings (Figure 1, right) or a single (sub-)token representation alone.In our novel model, cf. Figure 1 left, we train the bi-LSTM tagger to predict both the tags of the sequence, as well as a label that represents the log frequency of the next token as estimated from the training data.Our combined cross-entropy loss is now: L( ŷt , y t ) + L( ŷa , y a ), where t stands for a POS tag and a is the log frequency label, i.e., a = int(log(f req train (w)).Combining this log frequency objective with the tagging task can be seen as an instance of multi-task learning in which the labels are predicted jointly.The idea behind this model is to make the representation predictive for frequency, which encourages the model to not share representations between common and rare words, thus benefiting the handling of rare tokens.

Experiments
All bi-LSTM models were implemented in CNN/pycnn, 1 a flexible neural network library.For all models we use the same hyperparameters, which were set on English dev, i.e., SGD training with cross-entropy loss, no mini-batches, 20 epochs, default learning rate (0.1), 128 dimensions for word embeddings, 100 for character and byte embeddings, 100 hidden states and Gaussian noise with σ=0.2.As training is stochastic in nature, we use a fixed seed throughout.Embeddings are not initialized with pre-trained embeddings, except when reported otherwise.In that case we use offthe-shelf polyglot embeddings (Al-Rfou et al., 2013). 2 No further unlabeled data is considered in this paper.The code is released at: https: //github.com/bplank/bilstm-auxTaggers We want to compare POS taggers under varying conditions.We hence use three different types of taggers: our implementation of a bi-LSTM; TNT (Brants, 2000)-a second order HMM with suffix trie handling for OOVs.We use TNT as it was among the best performing taggers evaluated in Horsmann et al. (2015). 3We complement the NN-based and HMM-based tagger with a CRF tagger, using a freely available implementation (Plank et al., 2014) based on crfsuite.

Datasets
For the multilingual experiments, we use the data from the Universal Dependencies project v1.2 (Nivre et al., 2015) (17 POS) with the canonical data splits.For languages with token segmentation ambiguity we use the provided gold segmentation.If there is more than one treebank per language, we use the treebank that has the canonical language name (e.g., Finnish instead of Finnish-FTB).We consider all languages that have at least 60k tokens and are distributed with word forms, resulting in 22 languages.We also report accuracies on WSJ (45 POS) using the standard splits (Collins, 2002;Manning, 2011).The overview of languages is provided in Table 1.

Results
Our results are given in Table 2. First of all, notice that TNT performs remarkably well across the 22 languages, closely followed by CRF.The bi-LSTM tagger ( w) without lower-level bi-LSTM for subtokens falls short, outperforms the traditional taggers only on 3 languages.The bi-LSTM 2 https://sites.google.com/site/rmyeid/projects/polyglot 3 They found TreeTagger was closely followed by Hun-Pos, a re-implementation of TnT, and Stanford and ClearNLP were lower ranked.In an initial investigation, we compared Tnt, HunPos and TreeTagger and found Tnt to be consistently better than Treetagger, Hunpos followed closely but crashed on some languages (e.g., Arabic).2016) (last column).However, note that these results are not strictly comparable as they use the earlier UD v1.1 version.The overall best system is the multi-task bi-LSTM FREQBIN (it uses w + c and POLYGLOT initialization for w).While on macro average it is on par with bi-LSTM w + c, it obtains the best results on 12/22 languages, and it is successful in predicting POS for OOV tokens (cf.Table 2 OOV ACC columns), especially for languages like Arabic, Farsi, Hebrew, Finnish.
We examined simple RNNs and confirm the finding of Ling et al. (2015) that they performed worse than their LSTM counterparts.Finally, the bi-LSTM tagger is competitive on WSJ, cf.Table 3.

Rare words
In order to evaluate the effect of modeling sub-token information, we examine accuracy rates at different frequency rates.Figure 2 shows absolute improvements in accuracy of bi-LSTM w + c over mean log frequency, for different language families.We see that especially for Slavic and non-Indoeuropean languages, having high morphologic complexity, most of the improvement is obtained in the Zipfian tail.Rare tokens benefit from the sub-token representations.Data set size Prior work mostly used large data sets when applying neural network based approaches (Zhang et al., 2015).We evaluate how brittle such models are with respect to their more traditional counterparts by training bi-LSTM ( w + c without Polyglot embeddings) for increas-
The bi-LSTM model performs already surprisingly well after only 500 training sentences.For non-Indoeuropean languages it is on par and above the other taggers with even less data (100 sentences).This shows that the bi-LSTMs often needs more data than the generative markovian model, but this is definitely less than what we expected.
Label Noise We investigated the susceptibility of the models to noise, by artificially corrupting training labels.Our initial results show that at low noise rates, bi-LSTMs and TNT are affected similarly, their accuracies drop to a similar degree.Only at higher noise levels (more than 30% corrupted labels), bi-LSTMs are less robust, showing higher drops in accuracy compared to TNT.This is the case for all investigated language families.

Related Work
Character embeddings were first introduced by Sutskever et al. (2011) for language modeling.Early applications include text classification (Chrupała, 2013;Zhang et al., 2015).Recently, these representations were successfully applied to a range of structured prediction tasks.For POS tagging, Santos and Zadrozny (2014) were the first to propose character-based models.They use a convolutional neural network (CNN; or convnet) and evaluated their model on English (PTB) and Portuguese, showing that the model achieves state-of-the-art performance close to taggers using carefully designed feature templates.Ling et al. (2015) extend this line and compare a novel bi-LSTM model, learning word representations through character embeddings.They evaluate their model on a language modeling and POS tagging setup, and show that bi-LSTMs outperform the CNN approach of Santos and Zadrozny (2014).Similarly, Labeau et al. (2015) evaluate character embeddings for German.Bi-LSTMs for POS tagging are also reported in Wang et al. (2015), however, they only explore word embeddings, orthographic information and evaluate on WSJ only.A related study is Cheng et al. (2015) who propose a multi-task RNN for named entity recognition by jointly predicting the next token and current token's name label.Our model is simpler, it uses a very coarse set of labels rather then integrating an entire language modeling task which is computationally more expensive.An interesting recent study is Gillick et al. (2016), they build a single byte-to-span model for multiple languages based on a sequence-to-sequence RNN (Sutskever et al., 2014) achieving impressive results.We would like to extend this work in their direction.

Conclusions
We evaluated token and subtoken-level representations for neural network-based part-of-speech tagging across 22 languages and proposed a novel multi-task bi-LSTM with auxiliary loss.The auxiliary loss is effective at improving the accuracy of rare words.Subtoken representations are necessary to obtain a state-of-the-art POS tagger, and character embeddings are particularly helpful for non-Indoeuropean and Slavic languages.
Combining them with word embeddings in a hierarchical network provides the best representation.The bi-LSTM tagger is as effective as the CRF and HMM taggers with already as little as 500 training sentences, but is less robust to label noise (at higher noise rates).

Figure 1 :
Figure 1: Right: bi-LSTM, illustrated with b + c (bytes and characters), for w + c replace b with words w.Left: FREQBIN, our multi-task bi-LSTM that predicts at every time step the tag and the frequency class for the next token.

Figure 2 :
Figure 2: Absolute improvements of bi-LSTM ( w + c) over TNT vs mean log frequency.

Figure 3 :
Figure 3: Amount of training data (number of sentences) vs tagging accuracy.