Robust Multilingual Part-of-Speech Tagging via Adversarial Training

Adversarial training (AT) is a powerful regularization method for neural networks, aiming to achieve robustness to input perturbations. Yet, the specific effects of the robustness obtained from AT are still unclear in the context of natural language processing. In this paper, we propose and analyze a neural POS tagging model that exploits AT. In our experiments on the Penn Treebank WSJ corpus and the Universal Dependencies (UD) dataset (27 languages), we find that AT not only improves the overall tagging accuracy, but also 1) prevents over-fitting well in low resource languages and 2) boosts tagging accuracy for rare / unseen words. We also demonstrate that 3) the improved tagging performance by AT contributes to the downstream task of dependency parsing, and that 4) AT helps the model to learn cleaner word representations. 5) The proposed AT model is generally effective in different sequence labeling tasks. These positive results motivate further use of AT for natural language tasks.


Introduction
Recently, neural network-based approaches have become popular in many natural language processing (NLP) tasks including tagging, parsing, and translation (Chen and Manning, 2014; Bahdanau et al., 2015; Ma and Hovy, 2016). However, it has been shown that neural networks tend to be locally unstable and even tiny perturbations to the original inputs can mislead the models (Szegedy et al., 2014). Such maliciously perturbed inputs are called adversarial examples. Adversarial training (Goodfellow et al., 2015) aims to improve the robustness of a model to input perturbations by training on both unmodified examples and adversarial examples. Previous work (Goodfellow et al., 2015; Shaham et al., 2015) on image recognition has demonstrated the enhanced robustness of their models to unseen images via adversarial training and has provided theoretical explanations of the regularization effects.
Figure 1: Illustration of our architecture for adversarial POS tagging. Given a sentence, we input the normalized word embeddings ($w_1, w_2, w_3$) and character embeddings (showing $c_1, c_2, c_3$ for $w_1$). Each word is represented by concatenating its word embedding and its character-level BiLSTM output. They are fed into the main BiLSTM-CRF network for POS tagging. In adversarial training, we compute and add the worst-case perturbation η to all the input embeddings for regularization.
Despite its potential as a powerful regularizer, adversarial training (AT) has yet to be explored extensively in natural language tasks. Recently, Miyato et al. (2017) applied AT on text classification, achieving state-of-the-art accuracy. Yet, the specific effects of the robustness obtained from AT are still unclear in the context of NLP. For example, research studies have yet to answer questions such as 1) how can we interpret perturbations or robustness on natural language inputs? 2) how are they related to linguistic factors like vocabulary statistics? 3) are the effects of AT language-dependent? Answering such questions is crucial to understand and motivate the application of adversarial training on natural language tasks.
In this paper, spotlighting a well-studied core problem of NLP, we propose and carefully analyze a neural part-of-speech (POS) tagging model that exploits adversarial training. With a BiLSTM-CRF model (Huang et al., 2015; Ma and Hovy, 2016) as our baseline POS tagger, we apply adversarial training by considering perturbations to input word/character embeddings. In order to demystify the effects of adversarial training in the context of NLP, we conduct POS tagging experiments on multiple languages using the Penn Treebank WSJ corpus (English) and the Universal Dependencies dataset (27 languages), with thorough analyses of the following points:
• Effects on different target languages
• Vocabulary statistics and tagging accuracy
• Influence on downstream tasks
• Representation learning of words
In our experiments, we find that our adversarial training model consistently outperforms the baseline POS tagger, and even achieves state-of-the-art results on 22 languages. Furthermore, our analyses reveal the following insights into adversarial training in the context of NLP:
• The regularization effects of adversarial training (AT) are general across different languages. AT can prevent overfitting especially well when training examples are scarce, providing an effective tool to process low resource languages.
• AT can boost the tagging performance for rare/unseen words and increase the sentence-level accuracy. This positively affects the performance of downstream tasks such as dependency parsing, where low sentence-level POS accuracy can be a bottleneck (Manning, 2011).
• AT helps the network learn cleaner word embeddings, showing stronger correlations with their POS tags.
We argue that the effects of AT can be interpreted from the perspective of natural language. Finally, we demonstrate that the proposed AT model is generally effective across different sequence labeling tasks. This work therefore provides a strong motivation and basis for utilizing adversarial training in NLP tasks.
Related Work

POS Tagging
Part-of-speech (POS) tagging is a fundamental NLP task that facilitates downstream tasks such as syntactic parsing. While current state-of-the-art POS taggers (Ling et al., 2015; Ma and Hovy, 2016) yield accuracy over 97.5% on PTB-WSJ, there still remain issues. The per token accuracy metric is easy since taggers can easily assign correct POS tags to highly unambiguous tokens, such as punctuation (Manning, 2011). Sentence-level accuracy serves as a more realistic metric for POS taggers but it still remains low. Another problem with current POS taggers is that their accuracy deteriorates drastically on low resource languages and rare words (Plank et al., 2016). In this work, we demonstrate that adversarial training (AT) can mitigate these issues.
It is empirically shown that POS tagging performance can greatly affect downstream tasks such as dependency parsing (Dozat et al., 2017). In this work, we also demonstrate that the improvements obtained from our AT POS tagger actually contribute to dependency parsing. Nonetheless, parsing with gold POS tags still yields better results, bolstering the view that POS tagging is an essential task in NLP that needs further development.

Adversarial Training
The concept of adversarial training (Szegedy et al., 2014; Goodfellow et al., 2015) was originally introduced in the context of image classification to improve the robustness of a model by training on input images with malicious perturbations. Previous work (Goodfellow et al., 2015; Shaham et al., 2015; Wang et al., 2017) has provided a theoretical framework to understand adversarial examples and the regularization effects of adversarial training (AT) in image recognition.
Recently, Miyato et al. (2017) applied AT to a natural language task (text classification) by extending the concept of adversarial perturbations to word embeddings. Wu et al. (2017) further explored the possibility of AT in relation extraction. Both report improved performance on their tasks via AT, but the specific effects of AT have yet to be analyzed. In our work, we aim to address this issue by providing detailed analyses on the effects of AT from the perspective of NLP, such as different languages, vocabulary statistics, word embedding distribution, and aim to motivate future research that exploits AT in NLP tasks.
AT is related to other regularization methods that add noise to data such as dropout (Srivastava et al., 2014) and its variant for NLP tasks, word dropout (Iyyer et al., 2015). Xie et al. (2017) discuss various data noising techniques for language modeling. While these methods produce random noise, AT generates perturbations that the current model is particularly vulnerable to, and thus is claimed to be effective (Goodfellow et al., 2015).
It should be noted that while related in name, adversarial training (AT) differs from Generative Adversarial Networks (GANs) (Goodfellow et al., 2014). GANs have already been applied to NLP tasks such as dialogue generation (Li et al., 2017) and transfer learning (Kim et al., 2017; Gui et al., 2017). Adversarial training also differs from adversarial evaluation, recently proposed for reading comprehension tasks (Jia and Liang, 2017).

Method
In this section, we introduce our baseline POS tagging model and explain how we implement adversarial training on top.

Baseline POS Tagging Model
Following the recent top-performing models for sequence labeling tasks (Plank et al., 2016; Lample et al., 2016; Ma and Hovy, 2016), we employ a Bi-directional LSTM-CRF model as our baseline (see Figure 1 for an illustration).
Character-level BiLSTM. Prior work has shown that incorporating character-level representations of words can boost POS tagging accuracy by capturing morphological information present in each language. Major neural character-level models include the character-level CNN (Ma and Hovy, 2016) and (Bi)LSTM (Dozat et al., 2017). A Bi-directional LSTM (BiLSTM) (Hochreiter and Schmidhuber, 1997; Schuster and Paliwal, 1997) processes each sequence both forward and backward to capture sequential information, while preventing the vanishing / exploding gradient problem. We observed that the character-level BiLSTM outperformed the CNN by 0.1% on the PTB-WSJ development set, and hence in all of our experiments we use the character-level BiLSTM. Specifically, we generate a character-level representation for each word by feeding its character embeddings into the BiLSTM and obtaining the concatenated final states.
Word-level BiLSTM. Each word in a sentence is represented by concatenating its word embedding and its character-level representation. They are fed into another level of BiLSTM (word-level BiLSTM) to process the entire sentence.

CRF. In sequence labeling tasks it is beneficial to consider the correlations between neighboring labels and jointly decode the best chain of labels for a given sentence. With this motivation, we apply a conditional random field (CRF) (Lafferty et al., 2001) on top of the word-level BiLSTM to perform POS tag inference with global normalization, addressing the "label bias" problem. Specifically, given an input sentence, we pass the output sequence of the word-level BiLSTM to a first-order chain CRF to compute the conditional probability of the target label sequence:
$$p(y \mid s; \theta)$$
where θ represents all of the model parameters (in the BiLSTMs and CRF), and s and y denote the input embeddings and the target POS tag sequence, respectively, for the given sentence.
For training, we minimize the negative log-likelihood (loss function)
$$L(\theta; s, y) = -\log p(y \mid s; \theta)$$
with respect to the model parameters. Decoding searches for the POS tag sequence $y^*$ with the highest conditional probability using the Viterbi algorithm. For more detail about the BiLSTM-CRF formulation, refer to Ma and Hovy (2016).
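To make the CRF training loss and the Viterbi decoding step concrete, the following is a minimal pure-Python sketch of a first-order chain CRF. It is an illustration under stated assumptions, not the implementation used in the experiments: the emission scores stand in for the word-level BiLSTM outputs, the transition scores stand in for the CRF parameters, and all function names are hypothetical.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def sequence_score(emissions, transitions, tags):
    """Unnormalized log-score of one tag sequence.

    emissions[t][k]: score of tag k at position t (assumed BiLSTM output).
    transitions[i][j]: score of moving from tag i to tag j (assumed CRF parameter).
    """
    score = emissions[0][tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    return score

def log_partition(emissions, transitions):
    """Forward algorithm: log of the summed exp-scores over all tag sequences."""
    num_tags = len(emissions[0])
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [
            log_sum_exp([alpha[i] + transitions[i][j] for i in range(num_tags)])
            + emissions[t][j]
            for j in range(num_tags)
        ]
    return log_sum_exp(alpha)

def negative_log_likelihood(emissions, transitions, tags):
    """-log p(y | s; theta) for the globally normalized chain CRF."""
    return log_partition(emissions, transitions) - sequence_score(emissions, transitions, tags)

def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag sequence."""
    num_tags = len(emissions[0])
    best = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_best, pointers = [], []
        for j in range(num_tags):
            i_best = max(range(num_tags), key=lambda i: best[i] + transitions[i][j])
            pointers.append(i_best)
            new_best.append(best[i_best] + transitions[i_best][j] + emissions[t][j])
        best = new_best
        backpointers.append(pointers)
    path = [max(range(num_tags), key=lambda j: best[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

Because the normalizer sums over entire tag sequences rather than over per-position choices, the probabilities of all sequences for a sentence sum to one, which is the global normalization referred to above.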

Adversarial Training
Adversarial training (Goodfellow et al., 2015) is a powerful regularization method, primarily explored in image recognition to improve the robustness of classifiers to input perturbations. Given a classifier, we first generate input examples that are very close to original inputs (so should yield the same labels) yet are likely to be misclassified by the current model. Specifically, these adversarial examples are generated by adding small perturbations to the inputs in the direction that significantly increases the loss function of the classifier (worst-case perturbations). Then, the classifier is trained on the mixture of clean examples and adversarial examples to improve the stability to input perturbations. In this work, we incorporate adversarial training into our baseline POS tagger, aiming to achieve better regularization effects and to provide their interpretations in the context of NLP.
Generating adversarial examples. Adversarial training (AT) considers continuous perturbations to inputs, so we define perturbations at the level of dense word / character embeddings rather than one-hot vector representations, similarly to Miyato et al. (2017). Specifically, given an input sentence, we consider the concatenation of all the word / character embeddings in the sentence:
$$s = [w_1, w_2, \dots, c_1, c_2, \dots]$$
To prepare an adversarial example, we aim to generate the worst-case perturbation of a small bounded norm ε that maximizes the loss function L of the current model:
$$\eta = \arg\max_{\eta' : \|\eta'\|_2 \le \epsilon} L(\hat{\theta}; s + \eta', y)$$
where $\hat{\theta}$ is the current value of the model parameters, treated as a constant, and y denotes the target labels. Since the exact computation of such η is intractable in complex neural networks, we employ the Fast Gradient Method (Liu et al., 2017; Miyato et al., 2017), i.e., a first-order approximation, to obtain an approximate worst-case perturbation of norm ε by a single gradient computation:
$$\eta = \epsilon \, \frac{g}{\|g\|_2}, \quad \text{where } g = \nabla_s L(\hat{\theta}; s, y)$$
Here, ε is a hyperparameter to be determined on the development dataset. Note that the perturbation η is generated in the direction that significantly increases the loss L. We find such η against the current model parameterized by $\hat{\theta}$ at each training step, and construct an adversarial example by
$$s_{adv} = s + \eta$$
However, if we do not restrict the norm of word / character embeddings, the model could trivially learn embeddings of large norms to make the perturbations insignificant. To prevent this issue, we normalize word/character embeddings so that they have mean 0 and variance 1 for every entry, as in Miyato et al. (2017). The normalization is performed every time we feed input embeddings into the LSTMs and generate adversarial examples. To ensure a fair comparison, we also normalize input embeddings in our baseline model.
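The single-gradient approximation of the worst-case perturbation can be sketched as follows. This is a minimal illustration, not the tagger itself: the BiLSTM-CRF loss is replaced by a hypothetical toy logistic loss on the concatenated embedding s so that the gradient with respect to s can be written analytically, and all function names are assumptions made for this example.

```python
import math

def l2_norm(vector):
    return math.sqrt(sum(x * x for x in vector))

def fast_gradient_perturbation(gradient, epsilon):
    """Return eta = epsilon * g / ||g||_2 (the zero vector if the gradient vanishes)."""
    norm = l2_norm(gradient)
    if norm == 0.0:
        return [0.0] * len(gradient)
    return [epsilon * g / norm for g in gradient]

# ASSUMPTION: toy stand-in for the tagger loss L(theta; s, y), used only for
# illustration. It is the logistic loss of a fixed linear scorer `weights`
# applied to the concatenated input embedding s, with label in {-1, +1}.
def toy_loss(weights, s, label):
    margin = label * sum(w * x for w, x in zip(weights, s))
    return math.log1p(math.exp(-margin))

def toy_loss_gradient(weights, s, label):
    """Gradient of toy_loss with respect to the input s (weights held constant)."""
    margin = label * sum(w * x for w, x in zip(weights, s))
    scale = -label / (1.0 + math.exp(margin))
    return [scale * w for w in weights]

def make_adversarial_example(weights, s, label, epsilon):
    """Construct s_adv = s + eta with one gradient computation."""
    g = toy_loss_gradient(weights, s, label)
    eta = fast_gradient_perturbation(g, epsilon)
    return [x + e for x, e in zip(s, eta)]
```

The perturbation always has norm ε regardless of the gradient magnitude, which is why the embeddings themselves must be normalized: otherwise growing the embedding norms would make a fixed-size perturbation negligible.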
While Miyato et al. (2017) set the norm of a perturbation ε (Eq. 2) to be a fixed value for all input sentences, to generate adversarial examples for an entire sentence of variable length and to include character embeddings besides word embeddings, we make the perturbation size ε adaptive to the dimension of the concatenated input embedding $s \in \mathbb{R}^D$. We set ε to be α√D (i.e., proportional to √D), as the expected squared norm of s after the embedding normalization is D. The scaling factor α is selected from {0.001, 0.005, 0.01, 0.05, 0.1} based on the development performance in each treebank. We used 0.01 for PTB-WSJ and UD-Spanish, and 0.05 for the rest. Note that α = 0 would generate no noise (identical to the baseline); if α = 1, the generated adversarial perturbation would have a norm comparable to the original embedding, which could change the semantics of the input sentence (Wu et al., 2017). Hence, the optimal perturbation scale α should lie in between and be small enough to preserve the semantics of the original input.
The adversarial training objective is $\tilde{L} = \gamma L(\theta; s, y) + (1 - \gamma) L(\theta; s_{adv}, y)$, where L(θ; s, y) and L(θ; s_adv, y) represent the loss from a clean example and the loss from its adversarial example, respectively, and γ determines the weighting between them. We used γ = 0.5 in all our experiments. This objective function can be optimized with respect to the model parameters θ, in the same manner as the baseline model.
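The adaptive perturbation norm and the weighted objective reduce to two small formulas, sketched below. The function names are hypothetical, and the embedding sizes in the usage note are assumed values chosen only to make the arithmetic concrete.

```python
import math

def perturbation_norm(alpha, embedding_dims):
    """epsilon = alpha * sqrt(D), where D is the total dimension of all
    word and character embeddings concatenated for one sentence."""
    total_dim = sum(embedding_dims)
    return alpha * math.sqrt(total_dim)

def adversarial_training_loss(clean_loss, adversarial_loss, gamma=0.5):
    """gamma * L(theta; s, y) + (1 - gamma) * L(theta; s_adv, y)."""
    return gamma * clean_loss + (1.0 - gamma) * adversarial_loss
```

For example, a sentence of four words with (assumed) 100-dimensional word embeddings has D = 400 from the word embeddings alone, so α = 0.01 gives ε = 0.01 · 20 = 0.2; longer sentences and the added character embeddings increase D and therefore ε, keeping the perturbation proportional to the expected norm of s.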

Experiments
To fully analyze the effects of adversarial training, we train and evaluate our baseline/adversarial POS tagging models on both a standard English dataset and a multilingual dataset.

Datasets
As a standard English dataset, we use the Wall Street Journal (WSJ) portion of the Penn Treebank (PTB) (Marcus et al., 1993), containing 45 different POS tags. We adopt the standard split: sections 0-18 for training, 19-21 for development and 22-24 for testing (Collins, 2002; Manning, 2011). For multilingual POS tagging experiments, to compare with prior work, we use treebanks from Universal Dependencies (UD) v1.2 (Nivre et al., 2015) (17 POS tags) with the given data splits. We experiment on languages for which pre-trained Polyglot word embeddings (Al-Rfou et al., 2013) are available, resulting in 27 languages listed in Table 2. We regard languages with less than 60k tokens of training data as low-resource (Table 2, bottom), as in Plank et al. (2016).
[…] Ma and Hovy (2016). We apply dropout (Srivastava et al., 2014) to input embeddings and BiLSTM outputs for both baseline and adversarial training, with dropout rate 0.5.
Optimization. We train the model parameters and word/character embeddings by mini-batch stochastic gradient descent (SGD) with batch size 10, momentum 0.9, initial learning rate 0.01 and decay rate 0.05. We also use gradient clipping of 5.0 (Pascanu et al., 2012). The models are trained with early stopping (Caruana et al., 2001) based on the development performance.
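The optimization settings above can be sketched in a few lines. Several details are assumptions, since the text gives only the hyperparameter values: the learning rate is assumed to follow an inverse-time decay per epoch, the clipping is assumed to rescale the gradient by its global L2 norm, and the momentum update uses the classical formulation. All function names are hypothetical.

```python
import math

def decayed_learning_rate(initial_rate, decay_rate, epoch):
    # ASSUMPTION: inverse-time decay, rate_t = rate_0 / (1 + decay_rate * t),
    # with t the number of completed epochs. The text states only the values
    # 0.01 (initial rate) and 0.05 (decay rate), not the schedule's form.
    return initial_rate / (1.0 + decay_rate * epoch)

def clip_by_global_norm(gradient, max_norm):
    # ASSUMPTION: clipping rescales the whole gradient when its L2 norm
    # exceeds max_norm (rather than clipping each value independently).
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        return list(gradient)
    return [g * max_norm / norm for g in gradient]

def sgd_momentum_step(params, velocity, gradient, learning_rate,
                      momentum=0.9, max_norm=5.0):
    """One SGD step with classical momentum and gradient clipping."""
    clipped = clip_by_global_norm(gradient, max_norm)
    new_velocity = [momentum * v - learning_rate * g
                    for v, g in zip(velocity, clipped)]
    new_params = [p + v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity
```

Under the assumed schedule, the learning rate halves after 20 epochs (0.01 / (1 + 0.05 · 20) = 0.005).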
Evaluation. We evaluate per token tagging accuracy on test sets. We repeat the experiment three times and report the statistical significance.

Results
PTB-WSJ dataset. […]
[…] We suspect that their joint training of word rarity may be of particular help in processing morphologically complex words. Additionally, we see that our AT model achieves notably large improvements over the baseline in resource-poor languages (the bottom of Table 2), with an average improvement of 0.35%, as compared to 0.20% for resource-rich languages. To further visualize the regularization effects, we present the learning curves for three representative languages, English (WSJ), French (UD-fr) and Romanian (UD-ro, low-resource), based on the development loss (see Figure 2). For all the three languages, we can observe that the AT model (red solid line) prevents overfitting better than the baseline (black dotted line), and this advantage is more significant in low resource languages. For example, in Romanian, the baseline model starts to increase development loss after 1,000 iterations even with dropout, whereas the AT model keeps improving until 2,500 iterations, achieving notably lower development loss (0.4 down). These results illustrate that AT can prevent overfitting especially well on small datasets and can augment the regularization power beyond dropout. AT can also be viewed as an effective means of data augmentation, where we generate and train with new examples the current model is particularly vulnerable to at every time step, enhancing the robustness of the model. AT can therefore be a promising tool to process low resource languages.

Analysis
In the previous sections, we demonstrated the regularization power of adversarial training (AT) on different languages, based on the overall POS tagging performance and learning curves. In this section, we conduct further analyses on the robustness of AT from NLP specific aspects such as word statistics, sequence modeling, downstream tasks, and word representation learning. We find that AT can boost tagging accuracy on rare words and neighbors of unseen words (§5.1). Furthermore, this robustness against rare / unseen words leads to better sentence-level accuracy and downstream dependency parsing (§5.2). We illustrate these findings using two major languages, English (WSJ) and French (UD), which have substantially large training and testing data to discuss vocabulary statistics and sentence-level performance. Finally, we study the effects of AT on word representation learning (§5.3), and the applicability of AT to different sequential tasks (§5.4).

Word-level Analysis
Poor tagging accuracy on rare/unseen words is one of the bottlenecks in current POS taggers (Manning, 2011; Plank et al., 2016). Aiming to reveal the effects of AT on rare / unseen words, we analyze tagging performance at the word level, considering vocabulary statistics.
Word frequency. To define rare / unseen words, we consider each word's frequency of occurrence in the training set. We categorize all words in the test set based on this frequency and study the test tagging accuracy for each group (see Table 3). In both languages, the AT model achieves large improvements over the baseline on rare words (e.g., frequency 1-10 in training), as opposed to more frequent words. This result again corroborates the data augmentation power of AT under small training examples. On the other hand, we did not observe meaningful improvements on unseen words (frequency 0 in training). A possible explanation is that AT can facilitate the learning of words with at least a few occurrences in training (rare words), but is not particularly effective in inferring the POS tags of words for which no training examples are given (unseen words).
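The frequency-based breakdown can be computed as in the following sketch. Only the "0" and "1-10" groups are named in the text; the remaining bucket boundaries, and the function names, are hypothetical choices made for illustration.

```python
from collections import Counter

def frequency_bucket(count):
    """Map a training-set frequency to a group label.

    ASSUMPTION: the boundaries above 10 are illustrative only.
    """
    if count == 0:
        return "0"
    if count <= 10:
        return "1-10"
    if count <= 100:
        return "11-100"
    return ">100"

def accuracy_by_training_frequency(train_tokens, test_tokens, gold_tags, predicted_tags):
    """Per-group tagging accuracy on the test set, grouped by how often
    each test word occurred in the training set."""
    train_counts = Counter(train_tokens)
    correct, total = Counter(), Counter()
    for word, gold, predicted in zip(test_tokens, gold_tags, predicted_tags):
        bucket = frequency_bucket(train_counts[word])  # unseen words count 0
        total[bucket] += 1
        correct[bucket] += int(gold == predicted)
    return {bucket: correct[bucket] / total[bucket] for bucket in total}
```

Running the same function on the baseline and AT predictions yields the two accuracy rows that such a comparison needs, with the per-group token counts available from the same pass.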
Neighboring words. One important characteristic of natural language tasks is the sequential nature of inputs (i.e., sequence of words), where each word influences the function of its neighboring words. Since our model uses BiLSTM-CRF for that reason, we also study the tagging performance on the neighbors of rare/unseen words, and analyze the effects of AT with the sequence model in mind. In Table 4, we cluster all words in the test set based on their frequency in training again, and consider the tagging accuracy on the neighbors (left and right) of these words in the test text. We observe that AT tends to achieve large improvements over the baseline on the neighbors of unseen words (training frequency 0), while the improvements on the neighbors of more frequent words remain moderate. Our AT model thus exhibits strong stability to uncertain neighbors, as compared to the baseline. We suspect that because we generate adversarial examples against entire input sentences, training with adversarial examples makes the model more robust not only to perturbations in each word but also to perturbations in its neighboring words, leading to greater stability to uncertain neighbors.
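The neighbor-based breakdown differs from the word-level one only in which token's prediction is scored: each test word is grouped by its own training frequency, but the accuracy is measured on its left and right neighbors. A minimal sketch follows, in which the bucket boundaries above 10 and all function names are hypothetical.

```python
from collections import Counter

def frequency_bucket(count):
    """ASSUMPTION: illustrative group boundaries; only 0 and 1-10 are named in the text."""
    if count == 0:
        return "0"
    if count <= 10:
        return "1-10"
    if count <= 100:
        return "11-100"
    return ">100"

def neighbor_accuracy_by_frequency(train_tokens, test_sentences):
    """Tagging accuracy on the neighbors of each test word, grouped by the
    training frequency of the word itself (not of the neighbor).

    test_sentences: list of sentences, each a list of
    (word, gold_tag, predicted_tag) triples.
    """
    train_counts = Counter(train_tokens)
    correct, total = Counter(), Counter()
    for sentence in test_sentences:
        for i, (word, _, _) in enumerate(sentence):
            bucket = frequency_bucket(train_counts[word])
            for j in (i - 1, i + 1):  # left and right neighbor
                if 0 <= j < len(sentence):
                    _, gold, predicted = sentence[j]
                    total[bucket] += 1
                    correct[bucket] += int(gold == predicted)
    return {bucket: correct[bucket] / total[bucket] for bucket in total}
```

Neighbors are looked up within a sentence only, so the first and last words contribute a single neighbor each.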

Sentence-level & Downstream Analysis
In the word-level analysis, we showed that AT can boost tagging accuracy on rare words and the neighbors of unseen words, enhancing overall robustness on rare/unseen words. In this section, we discuss the benefit of our improved POS tagger in a major downstream task, dependency parsing. Most of the recent state-of-the-art dependency parsers take predicted POS tags as input (e.g., Chen and Manning (2014); Andor et al. (2016); Dozat and Manning (2017)). Dozat et al. (2017) empirically show that their dependency parser gains significant improvements by using POS tags predicted by a Bi-LSTM POS tagger, while POS tags predicted by the UDPipe tagger (Straka et al., 2016) do not contribute to parsing performance as much. This observation illustrates that POS tagging performance has a great influence on dependency parsing, motivating the hypothesis that the POS tagging improvements gained from our adversarial training help dependency parsing.
To test the hypothesis, we consider three settings in dependency parsing of English and French: using POS tags predicted by the baseline model, using POS tags predicted by the AT model, and using gold POS tags. For English (PTB-WSJ), we first convert the treebank into Stanford Dependencies (SD) using Stanford CoreNLP (ver 3.8.0) (Manning et al., 2014), and then apply two well-known dependency parsers: Stanford Parser (ver 3.5.0) (Chen and Manning, 2014) and Parsey McParseface (SyntaxNet) (Andor et al., 2016). For French (UD), we use Parsey Universal from SyntaxNet. The three parsers are all publicly available and pre-trained on corresponding treebanks.
Table 5 shows the results of the experiments. We can observe improvements in both languages by using the POS tags predicted by our AT POS tagger. As Manning (2011) points out, when predicted POS tags are used for downstream dependency parsing, a single bad mistake in a sentence can greatly damage the usefulness of the POS tagger. The robustness of our AT POS tagger against rare/unseen words helps to mitigate such an issue. This advantage can also be observed from the AT POS tagger's notably higher sentence-level accuracy than the baseline (see Table 5 left). Nonetheless, gold POS tags still yield better parsing results as compared to the baseline/AT POS taggers, supporting the claim that POS tagging needs further improvement for downstream tasks.
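The difference between per-token and sentence-level accuracy can be made concrete with a short sketch. The function names and the toy tag sequences in the usage note are hypothetical.

```python
def token_accuracy(gold_sentences, predicted_sentences):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for gold, predicted in zip(gold_sentences, predicted_sentences):
        for gold_tag, predicted_tag in zip(gold, predicted):
            total += 1
            correct += int(gold_tag == predicted_tag)
    return correct / total

def sentence_accuracy(gold_sentences, predicted_sentences):
    """Fraction of sentences in which every token is tagged correctly."""
    exact = sum(
        int(list(gold) == list(predicted))
        for gold, predicted in zip(gold_sentences, predicted_sentences)
    )
    return exact / len(gold_sentences)
```

With two toy sentences of two and three tokens and a single wrong tag in the second, the token accuracy is 4/5 while the sentence-level accuracy is only 1/2, which illustrates why one mistake per sentence weighs so heavily on a downstream parser.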

Effects on Representation Learning
Next, we perform an analysis on representation learning of words (word embeddings) for the English (PTB-WSJ) and French (UD) experiments. We hypothesize that adversarial training (AT) helps to learn better word embeddings so that the POS tag prediction of a word cannot be influenced by a small perturbation in the input embedding.
To verify this hypothesis, we cluster all words in the test set based on their correct POS tags and evaluate the tightness of the word vector distribution within each cluster. We compare this clustering quality among the three settings: 1) beginning (initialized with GloVe or Polyglot), 2) after baseline training (50 epochs), and 3) after adversarial training (50 epochs), to study the effects of AT on word representation learning.
For evaluating the tightness of word vector distribution, we employ the cosine similarity metric, which is widely used as a measure of the closeness between two word vectors (e.g., Mikolov et al. (2013); Pennington et al. (2014)). To measure the tightness of each cluster, we compute the cosine similarity for every pair of words within, and then take the average. We also report the average tightness across all the clusters.
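The tightness measure can be sketched as follows. The function names are hypothetical, and two details are assumptions of this sketch rather than statements from the text: clusters with fewer than two words are skipped, and word vectors are assumed to be non-zero.

```python
import math
from itertools import combinations

def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cluster_tightness(vectors):
    """Average cosine similarity over every pair of word vectors in a cluster.

    Returns None for clusters with fewer than two vectors (an assumption).
    """
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return None
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)

def average_tightness(clusters):
    """Mean tightness across clusters; `clusters` maps a POS tag to the
    list of embedding vectors of the test words carrying that tag."""
    scores = [cluster_tightness(vectors) for vectors in clusters.values()]
    scores = [score for score in scores if score is not None]
    return sum(scores) / len(scores)
```

The pairwise computation is quadratic in the cluster size, so for large clusters such as nouns an implementation would typically use matrix operations or sampling instead of this direct loop.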
The evaluation results are summarized in Table 6. We report the tightness scores for the four major clusters: noun, verb, adjective, and adverb (from left to right). As can be seen from the table, for both languages, adversarial training (AT) results in cleaner word embedding distributions than the baseline, with a higher cosine similarity within each POS cluster, and with a clear advantage in the average tightness across all the clusters. In other words, the learned word vectors show stronger correlations with their POS tags. This result confirms that training with adversarial examples can help to learn cleaner word embeddings so that the meaning / grammatical function of a word cannot be altered by a small perturbation in its embedding. This analysis provides a means to interpret the robustness to input perturbations, from the perspective of NLP.
Relation with perturbation size ε. We also study how the size of added perturbations influences word representation learning in adversarial training. Recall that we set the norm of a perturbation to be α√D, where D is the dimension of the concatenated input embeddings (see §3.2). For instance, α = 0 would produce no noise; α = 1 would generate a perturbation of a norm equivalent to the original word embeddings. We hypothesize that AT facilitates word representation learning when α is small enough to preserve the semantics of input words, but can hinder the learning when α is too large. To test the hypothesis, we repeat the clustering evaluation for word embeddings trained with varied perturbation scale α: 0, 0.001, 0.01, 0.05, 0.1, 0.5 (see Table 7). We observe that the quality of learned word embedding distribution keeps improving as α goes up from 0 to 0.1, but starts to drop around α = 0.5.
We also find that this optimal α in word embedding learning (i.e., 0.1) is larger than the α which yielded the best tagging performance on development sets (i.e., 0.01 or 0.05). A possible explanation is that while word embeddings can adapt to relatively large α (e.g., 0.1) during training, as adversarial perturbations are generated at the embedding level, such α could change the semantics of the input from the current tagging model's perspective and hinder the training of tagging.

Other Sequence Labeling Tasks
Finally, to further confirm the applicability of AT, we experiment with our BiLSTM-CRF AT model in different sequence labeling tasks: chunking and named entity recognition (NER).
Chunking can be performed as a sequence labeling task that assigns a chunking tag (B-NP, I-VP, etc.) to each word. We conduct experiments on the CoNLL 2000 shared task with the standard data split: PTB-WSJ Sections 15-18 for training and 20 for testing. We use Section 19 as the development set and employ the IOBES tagging scheme, following Hashimoto et al. (2017). NER aims to assign an entity type to each word, such as person, location, organization, and misc.
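The IOBES scheme refines the B-/I- chunk tags by additionally marking the last word of a multi-word chunk (E-) and single-word chunks (S-). A minimal conversion sketch follows; it assumes the input is in the IOB2 convention, where every chunk begins with a B- tag, and the function name is hypothetical.

```python
def iob_to_iobes(tags):
    """Convert a sentence's IOB2 tags (e.g. B-NP, I-NP, O) to IOBES.

    ASSUMPTION: the input follows IOB2, i.e. every chunk starts with "B-".
    """
    iobes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            iobes.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        chunk_continues = next_tag == "I-" + label
        if prefix == "B":
            # A chunk of one word becomes S-, otherwise B- is kept.
            iobes.append(("B-" if chunk_continues else "S-") + label)
        else:
            # The last word inside a chunk becomes E-, otherwise I- is kept.
            iobes.append(("I-" if chunk_continues else "E-") + label)
    return iobes
```

The same conversion applies to NER tags, since entity spans use the same B-/I- prefixes with entity types in place of chunk types.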
The results are summarized in Tables 8 and 9. AT improved the F1 score from the baseline BiLSTM-CRF model's 95.18 to 95.25 for chunking, and from 91.22 to 91.56 for NER, also significantly outperforming Ma and Hovy (2016). These improvements made by AT are larger than that for English POS tagging, most likely due to the larger room for improvement in chunking and NER. The improvements are again statistically significant, with p-value < 0.05 on the t-test. The experimental results suggest that the proposed adversarial training scheme is generally effective across different sequence labeling tasks.
Our BiLSTM-CRF AT model did not reach the performance of Hashimoto et al. (2017)'s multi-task model and Peters et al. (2017)'s state-of-the-art system that incorporates pretrained language models. It would be interesting future work to combine the strengths of these joint models (e.g., syntactic and semantic aids) and adversarial training (e.g., robustness).

Conclusion
We proposed and carefully analyzed a POS tagging model that exploits adversarial training (AT). In our multilingual experiments, we find that AT achieves substantial improvements on all the languages tested, especially on low resource ones. AT also enhances the robustness to rare/unseen words and sentence-level accuracy, alleviating the major issues of current POS taggers, and contributing to the downstream task, dependency parsing. Furthermore, our analyses on different languages, word / neighbor statistics and word representation learning reveal the effects of AT from the perspective of NLP. The proposed AT model is applicable to general sequence labeling tasks. This work therefore provides a strong basis and motivation for utilizing AT in natural language tasks.

Figure 2:

Adversarial training. At each training step, we generate adversarial examples against the current model, and train on the mixture of clean examples and adversarial examples to achieve robustness to input perturbations. To this end, we define the loss function for adversarial training as:
$$\tilde{L} = \gamma L(\theta; s, y) + (1 - \gamma) L(\theta; s_{adv}, y)$$
[…] are available, resulting in 27 languages listed in Table 2. We regard languages with less than 60k tokens of training data as low-resource (Table 2, bottom), as in Plank et al. (2016).

Table 1
[…] except Ling et al. (2015). The improvement over the baseline is statistically significant, with p-value < 0.05 on the t-test. We provide additional analysis on this result in later sections.

Table 2: POS tagging accuracy (test) for 27 UD v1.2 treebanks, with other recent works, Plank et al. (2016), Berend (2017) and Nguyen et al. (2017). The first column shows languages and the rest show tagging accuracy of different models. For Plank et al. (2016), we include the traditional baselines TNT and CRF, and their state-of-the-art model that employs a multi-task BiLSTM. Languages with • are morphologically rich, and those at the bottom ('el' to 'ta') are low-resource, containing less than 60k tokens in their training sets.

Table 3: POS tagging accuracy (test) on different subsets of words, categorized by their frequency of occurrence in training. The second row shows the number of tokens in the test set that are in each category. The third and fourth rows show the performance of our two models. Better scores are underlined. The biggest improvement is in bold.

Table 4: POS tagging accuracy (test) on neighboring words. We cluster all words in the test set in the same way as Table 3 and consider the tagging performance on the neighbors (left and right) of these words in the test text.

Table 5: Sentence-level accuracy and downstream dependency parsing performance by our baseline/adversarial POS taggers.

[…] For French (UD), we use Parsey Universal from SyntaxNet. The three parsers are all publicly available […]

Table 7: Average cluster tightness for word embeddings trained with varied perturbation scale α (0 indicates baseline training).

Table 8: Chunking F1 scores on the CoNLL-2000 task, with other top performing models.

Table 9: NER F1 scores on the CoNLL-2003 (English) task, with other top performing models.