Adversarial Training for Cross-Domain Universal Dependency Parsing

We describe our submission to the CoNLL 2017 shared task, which exploits the knowledge of a language that is shared across different domains via a domain adaptation technique. Our approach is an extension of the recently proposed adversarial training technique for domain adaptation, which we apply on top of a graph-based neural dependency parsing model based on bidirectional LSTMs. In our experiments, we find that our baseline graph-based parser already outperforms the official baseline model (UDPipe) by a large margin. Further, by applying our technique to treebanks of the same language with different domains, we observe an additional performance gain, in particular for the domains with less training data.


Introduction
In the CoNLL 2017 shared task (Zeman et al., 2017), the data for some languages is available in more than one treebank, typically from different annotation projects. While the treebanks differ in many respects, such as the genre and the source of the text (i.e., original or translated text), the most notable difference is that the size of the treebanks often varies significantly. For example, there are three English treebanks: en, en lines, and en partut. The largest dataset, en, contains 12,543 training sentences, while en lines and en partut contain only 2,738 and 1,090 sentences, respectively.
In this paper, we describe our approach to improving parser performance for the treebanks with less training data (e.g., en lines and en partut) by jointly learning with the dominant treebank of the same language (e.g., en). We formulate our approach as a kind of domain adaptation, in which we treat the dominant treebank as the source domain and the others as the target domains.
Our approach to domain adaptation, which we call SharedGateAdvNet, is an extension of the recently proposed neural architecture for domain adaptation (Ganin and Lempitsky, 2015) with adversarial training (Goodfellow et al., 2014), which learns domain-invariant feature representations through an adversarial domain classifier. We extend this architecture with an additional neural layer for each domain, which captures domain-specific feature representations. To our knowledge, this is the first study to apply adversarial training-based domain adaptation to parsing.
We utilize this architecture to obtain the representation of each token of a sentence, and feed it into a graph-based dependency parsing model where each dependency arc score is calculated using bilinear attention (Dozat and Manning, 2017). Specifically, we obtain the domain-specific and domain-invariant feature representations for each token via separate bidirectional LSTMs (Bi-LSTMs), and then combine them via a gated mechanism.
Our baseline method is our reimplementation of the graph-based dependency parser with LSTMs (Dozat and Manning, 2017) trained on a single treebank. First, we observe that this model is already much stronger than the official baseline model of UDPipe (Straka et al., 2016) on most treebanks. We then apply our domain adaptation technique to the set of treebanks of the same language, and in most cases we observe a clear improvement of the scores, especially for the treebanks with less training data. We also try our architecture across multiple languages, i.e., a high-resource language with a large treebank, such as English, and a low-resource language with a small data set. Interestingly, even though the mixed languages are completely different, we observe some score improvements in low-resource languages with this approach. Finally, our system ranked 6th in the main result of the shared task.

System overview
The CoNLL 2017 shared task aims at parsing Universal Dependencies 2.0 (Nivre et al., 2017). While its concern is parsing in the wild, i.e., from the raw input text, in this work we focus only on the dependency parsing layer that receives the tokenized and POS-tagged input text. For all preprocessing from sentence splitting to POS tagging, we use the provided UDPipe pipeline. To build the training treebanks, we keep the gold word segmentation while assigning the POS tags predicted by UDPipe.
We did a simple model selection with the development data for choosing the final submitted models. First, we trained our baseline graph-based LSTM parsing model (Section 3) independently for each treebank. Then, for some languages with more than one treebank, or low-resource languages with only a small treebank, we applied our proposed domain adaptation technique (Section 4) and obtained additional models. For treebanks for which we trained several models, we selected the model that performed best on the development set in terms of LAS. For the other treebanks, we submitted our baseline graph-based parser.
For the parallel test sets (e.g., en pud) with no training data, we use the model trained on the largest treebank of the target language. We did not pay much attention to the surprise languages. For Buryat (bxr), we simply ran the Russian (ru) model. For the other three languages, we ran the English (en) model.
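The fallback rules in this paragraph amount to a small lookup. The following is a minimal sketch, assuming a hypothetical helper `fallback_model` and treebank codes written with underscores. The table of largest treebanks contains only the English entry named in the text, and the codes of the three other surprise languages are those listed in the evaluation section.

```python
# Hypothetical lookup for test sets that have no training data of their own.
# LARGEST_TREEBANK maps a language code to its largest training treebank.
# Only the English entry comes from the text; others would be filled in.
LARGEST_TREEBANK = {"en": "en"}

# Surprise languages: Buryat uses the Russian model, the other three English.
SURPRISE_FALLBACK = {"bxr": "ru", "kmr": "en", "hsb": "en", "sme": "en"}


def fallback_model(test_set: str) -> str:
    """Return the code of the trained model used to parse `test_set`."""
    if test_set in SURPRISE_FALLBACK:
        return SURPRISE_FALLBACK[test_set]
    # A parallel test set such as "en_pud" falls back to its language.
    language = test_set.split("_")[0]
    return LARGEST_TREEBANK[language]
```

For instance, `fallback_model("bxr")` selects the Russian model, while the parallel English test set is parsed with the model trained on en.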

Biaffine Attention model
Our baseline model is the biaffine attention model (Dozat and Manning, 2017), which is an extension to the recently proposed dependency parsing method calculating the score of each arc independently from the representations of two tokens obtained by Bi-LSTMs (Kiperwasser and Goldberg, 2016). For labeled dependency parsing, this model first predicts the best unlabeled dependency structure, and then assigns a label to each predicted arc with another classifier. For the first step, receiving the word and POS tag sequences as an input, the model calculates the score of every possible dependency arc. To obtain a well-formed dependency tree, these scores are given to the maximum spanning tree (MST) algorithm (Pemmaraju and Skiena, 2003), which finds the tree with the highest total score of all arcs. The overview of the biaffine model is shown in Figure 1.
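Let $w_t$ be the $t$-th token in the sentence. As input, the model receives the word embedding $\mathbf{w}_t \in \mathbb{R}^{d_{word}}$ and the POS embedding $\mathbf{p}_t \in \mathbb{R}^{d_{pos}}$ for each $w_t$, which are concatenated into a vector $x_t$. This input is mapped by Bi-LSTMs to a hidden vector $r_t$, which is then fed into an extension of the bilinear transformation called a biaffine function to obtain the score for an arc from $w_i$ (head) to $w_j$ (dependent), following Dozat and Manning (2017):

\begin{align}
r_t &= \text{Bi-LSTM}(x_t), \tag{1} \\
h_i^{(arc\text{-}head)} &= \text{MLP}^{(arc\text{-}head)}(r_i), \tag{2} \\
h_j^{(arc\text{-}dep)} &= \text{MLP}^{(arc\text{-}dep)}(r_j), \tag{3} \\
s_{i,j}^{(arc)} &= h_i^{(arc\text{-}head)\top} U^{(arc)} h_j^{(arc\text{-}dep)} + h_i^{(arc\text{-}head)\top} u^{(arc)}, \tag{4}
\end{align}

where MLP is a multi-layer perceptron. The weight matrix $U^{(arc)}$ determines the strength of a link from $w_i$ to $w_j$, while $u^{(arc)}$ is used in the bias term, which controls the prior headedness of $w_i$. After obtaining the best unlabeled tree from these scores, we assign the best label to every arc according to $s_{i,j}^{(label)}$, in which the $k$-th element corresponds to the score of the $k$-th label:

\begin{align}
h_i^{(label\text{-}head)} &= \text{MLP}^{(label\text{-}head)}(r_i), \\
h_j^{(label\text{-}dep)} &= \text{MLP}^{(label\text{-}dep)}(r_j), \\
s_{i,j}^{(label)} &= h_i^{(label\text{-}head)\top} U^{(label)} h_j^{(label\text{-}dep)} + W^{(label)} \left( h_i^{(label\text{-}head)} \oplus h_j^{(label\text{-}dep)} \right) + u^{(label)},
\end{align}

where $U^{(label)}$ is a third-order tensor, $W^{(label)}$ is a weight matrix, $u^{(label)}$ is a bias vector, and $\oplus$ denotes vector concatenation.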
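The arc scoring step described above can be illustrated with a small sketch in plain Python. It assumes that the MLP outputs for the head and dependent roles have already been computed. The function names (`dot`, `matvec`, `biaffine_arc_scores`) and the toy numbers are illustrative assumptions, not the authors' Chainer implementation.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, vector) for row in matrix]


def biaffine_arc_scores(heads, deps, U, u):
    """Return S with S[i][j] = heads[i]^T U deps[j] + heads[i]^T u.

    heads[i] is the head-role vector of token i and deps[j] the
    dependent-role vector of token j (both assumed to be MLP outputs).
    """
    scores = []
    for h in heads:
        bias = dot(h, u)  # prior headedness of this candidate head
        scores.append([dot(h, matvec(U, d)) + bias for d in deps])
    return scores


# Toy example with two tokens and two-dimensional vectors.
heads = [[1.0, 0.0], [0.0, 1.0]]
deps = [[1.0, 2.0], [3.0, 4.0]]
U = [[1.0, 0.0], [0.0, 1.0]]  # identity, so the bilinear term is a dot product
u = [0.5, -0.5]
S = biaffine_arc_scores(heads, deps, U, u)
# S == [[1.5, 3.5], [1.5, 3.5]]
```

In the parser, a score matrix of this kind is what the MST algorithm receives. The sketch stops at the scores and does not perform tree decoding.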

Domain Adaptation Techniques with Adversarial Training
Here we describe our network architectures for domain adaptation. We present two different networks, both using adversarial training; the main difference between them is whether we use a domain-specific feature representation for each domain. The basic architecture with adversarial training (Section 4.1) is an application of the existing domain adaptation technique (Ganin and Lempitsky, 2015) that does not employ domain-specific representations. We then extend this architecture to add a domain-specific component with a gated mechanism (Section 4.2). In Section 5 we compare the empirical performances of these two approaches as well as several ablated settings.

Adversarial Training
Figure 2 describes the application of adversarial training (Ganin and Lempitsky, 2015) to the biaffine model. In this architecture, the models for all domains are parameterized by the same LSTMs (Shared Bi-LSTMs), which output $r_t$ (Eqn. 1) that are fed into the biaffine model. The key to this approach is a domain classifier, which also receives $r_t$ and tries to classify the domain of the input. During training, the classifier is trained to correctly classify the input domain. At the same time, we train the shared Bi-LSTM layers so that the domain classification task becomes harder. By this adversarial mechanism, the model is encouraged to find shared parameters that are, as far as possible, not specific to a particular domain. As a result, we expect the target domain model (with less training data) to be trained to utilize the knowledge of the source domain effectively. Note that the domain classifier in Figure 2 is applied to every token in the input sentence.
This domain adaptation technique can be implemented by introducing the gradient reversal layer (GRL). The GRL has no parameters associated with it (apart from the hyper-parameter λ, which is not updated by backpropagation). During forward propagation the GRL acts as an identity transform, while during backpropagation it takes the gradient from the subsequent layer, multiplies it by −λ, and passes it to the preceding layer.
The parameter λ controls the trade-off between the two objectives (the domain classifier and the biaffine model) that shape the feature representation during training. The GRL $R_\lambda(x)$ is defined for the forward and backward propagation as follows (Ganin and Lempitsky, 2015):

\begin{align}
R_\lambda(x) &= x, \\
\frac{dR_\lambda}{dx} &= -\lambda I,
\end{align}

where $I$ is the identity matrix.

Shared Gated Adversarial Networks
Now we present our extension to the adversarial training described above, which we call the shared gated adversarial networks (SharedGateAdvNet). Figure 3 shows the overall architecture. The largest difference from Figure 2 is the existence of the domain-specific Bi-LSTMs (Not-shared Bi-LSTMs), which we expect to capture the representations that do not fit into the shared LSTMs and are specialized to a particular domain. The model comprises the following three components.

Bi-LSTMs
The domain-specific LSTMs exist for each domain. Figure 3 shows the case when two treebanks (domains) are trained at the same time. When training with three treebanks, there exist three domain-specific Bi-LSTMs, one for each domain.

Gated connection
The gated connection selects which information to use between the domain-invariant and domain-specific feature representations (the outputs of the shared Bi-LSTMs and of the domain-specific Bi-LSTMs). We obtain the combined representation $r^{gate}_t$ from these two vectors through a gate that uses the sigmoid function σ, where $r^{share}$ is the output of the shared Bi-LSTMs and $r^{dom}$ is the output of the domain-specific Bi-LSTMs.
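One way such a gate could look is sketched below. The form used here, a sigmoid gate computed from the concatenation of the two vectors followed by an element-wise convex combination, is an assumption for illustration and is not taken from the text. The parameter names `W` and `b` and the function `gated_combine` are likewise hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def gated_combine(r_share, r_dom, W, b):
    """Mix two d-dimensional vectors with an element-wise sigmoid gate.

    Assumed form: g = sigmoid(W [r_share; r_dom] + b),
                  r_gate = g * r_share + (1 - g) * r_dom.
    W has d rows of length 2d and b has d entries.
    """
    concat = list(r_share) + list(r_dom)
    gate = [
        sigmoid(sum(w * x for w, x in zip(row, concat)) + bias)
        for row, bias in zip(W, b)
    ]
    return [g * s + (1.0 - g) * d for g, s, d in zip(gate, r_share, r_dom)]


# With zero weights and biases every gate value is 0.5, so the result is
# the average of the shared and domain-specific vectors.
r_gate = gated_combine(
    [1.0, 2.0], [3.0, 0.0],
    W=[[0.0] * 4, [0.0] * 4],
    b=[0.0, 0.0],
)
```

Under this assumed form, a gate value near 1 passes the shared (domain-invariant) representation through, and a value near 0 passes the domain-specific one.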

Settings
For the settings of the biaffine models, we follow the same network settings as Dozat and Manning (2017): 3-layer, 400-dimensional LSTMs for each direction, a 500-dimensional MLP for arc prediction, and a 100-dimensional MLP for label prediction. We use 100-dimensional word embeddings pre-trained with word2vec (Mikolov et al., 2013) and 100-dimensional randomly initialized POS tag embeddings. For the model with adversarial training, we fix λ to 0.5. We apply dropout (Srivastava et al., 2014) with a rate of 0.33 at the input and output layers. For optimization, we use Adam (Kingma and Ba, 2014) with a batch size of 128 and gradient clipping of 5. We use early stopping (Caruana et al., 2001) based on the performance on the development set.
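For reference, the settings above can be collected into a single configuration. This is only a sketch: the key names are illustrative and do not correspond to any particular library, and the derived output size assumes that the two LSTM directions are concatenated.

```python
# Hyper-parameters as listed in the settings paragraph (key names are
# illustrative assumptions, not options of an existing toolkit).
SETTINGS = {
    "lstm_layers": 3,
    "lstm_hidden_per_direction": 400,
    "arc_mlp_dim": 500,
    "label_mlp_dim": 100,
    "word_embedding_dim": 100,  # pre-trained with word2vec
    "pos_embedding_dim": 100,   # randomly initialized
    "grl_lambda": 0.5,          # fixed, used only with adversarial training
    "dropout_rate": 0.33,       # applied at the input and output layers
    "optimizer": "adam",
    "batch_size": 128,
    "gradient_clipping": 5,
}

# Input to the Bi-LSTMs: word and POS embeddings concatenated.
INPUT_DIM = SETTINGS["word_embedding_dim"] + SETTINGS["pos_embedding_dim"]

# Output of the Bi-LSTMs, assuming both directions are concatenated.
BILSTM_OUTPUT_DIM = 2 * SETTINGS["lstm_hidden_per_direction"]
```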

Preliminary Experiment
Before selecting the final submitted model for each treebank (Section 5.3), we perform a small experiment on selected languages (English and French) to see the effectiveness of our domain adaptation techniques.
English experiment. First, for English, we compare the performances of several domain adaptation techniques as well as the baselines without adaptation, to see which technique performs better. We compare the following six systems:
• UDPipe: The official baseline parser (Straka et al., 2016). This is a transition-based parser that selects each action using neural networks.
• Biaffine: Our reimplementation of the graph-based parser of Dozat and Manning (2017). We use Chainer (Tokui et al., 2015) for our implementation. We train this model independently on each treebank.
• Biaffine-MIX: A single biaffine model trained on the mixture of the treebanks of all domains. We obtain this by removing the domain classification component in Figure 2.
• Biaffine-MIX + Adv: The model in Figure 2, which shares the same parameters across multiple domains, with adversarial training facilitating the learning of domain-invariant representations.
• SharedGateNet: A simpler version of our proposed architecture (Figure 3), which does not have the adversarial component but has the gated unit controlling the relative strength of the domain-invariant and domain-specific Bi-LSTMs.
• SharedGateAdvNet: Our full architecture (Figure 3) with both adversarial training and the gated unit.
The result is shown in Table 1. First, we find that our baseline biaffine parser already outperforms the official baseline parser (UDPipe) by a large margin (e.g., for English, 82.45 vs. 80.13 LAS), which suggests the strength of graph-based parsing with Bi-LSTMs that enable the model to capture the entire sentence as context. By just mixing the training treebanks (Biaffine-MIX), we observe a score improvement for the domains with less data, en lines and en partut, which contain only 2,738 and 1,090 sentences, respectively. We also observe an additional small gain with adversarial training (Biaffine-MIX + Adv). Compared with these, our proposed architectures (SharedGateNet and SharedGateAdvNet) perform better. This shows the importance of having the domain-specific network layers. Our final architecture, SharedGateAdvNet, slightly outperforms SharedGateNet, indicating that the adversarial technique also has its own advantage.
Since SharedGateAdvNet consistently outperforms the others in English, we only try this method in the following experiments.
English and French experiment. To see the effects of our approach when combining completely different data, i.e., treebanks of different languages, we perform a small experiment using two languages: English and French. The French treebanks are also divided into three domains and are also imbalanced: fr (14,553 sentences), fr partut (620 sentences), and fr sequoia (2,231 sentences). We compare the models trained within each language (three domains each) with the model trained on all six treebanks of English and French. The result is shown in Table 2. Interestingly, especially for fr partut and fr sequoia, we observe a small score improvement by jointly learning two languages. The best model for fr is the biaffine model without joint training. Note also that for en, the effect of the adaptation technique is very small or negative, which suggests that our approach may be ineffective for a treebank that already contains a sufficient amount of data. Due to time constraints, we were unable to try many language pairs for joint training, but this result suggests that the parser may benefit from training across different languages. For the final model selection experiment below, we try other pairs for some languages, and select those models when they perform better.

Model Selection
As summarized in Section 2, we perform a simple model selection for each language with the development data in order to select the final submitted models. Besides a biaffine model trained on a single treebank, for some treebanks we additionally train other models with our SharedGateAdvNet. Our approach is divided into the following three strategies according to the languages:
(i) Training multiple domains within a single language. We try this for many languages, such as English (en), Czech (cs), Spanish (es), etc.
(ii) Training multiple domains across different languages. Based on the positive result of our preliminary experiment, we try this only for obtaining French models (training jointly with English treebanks).
(iii) Training two treebanks in different languages. We only try this for two very small treebanks: Ukrainian (uk), which we train with en, and Uyghur (ug), which we train with Russian (ru), as we find that this performs better than training with en.
See Table 3 for the results. Again, our baseline biaffine parser outperforms UDPipe on most treebanks. Training multiple domains in one language often brings performance gains, in particular for smaller treebanks, as in the case of English. For example, for Galician, LAS on gl treegal improves substantially from 73.63 to 78.08 with our joint training. The largest gain is obtained in Italian (it partut), from 78.72 to 87.34, an improvement of nearly 9 points in LAS.
From these results, for example, we select our SharedGateAdvNet model for cs cac while selecting the biaffine model for cs, which does not benefit from joint training.

Evaluation on Test data
The main result of the CoNLL 2017 shared task on the test data is shown in Table 4. In addition to the official baseline (UDPipe) and our system, we also report the scores of the winning system by the Stanford team. See Zeman et al. (2017) for an overview of the other participating systems.
Our system outperforms UDPipe on 69 out of 81 test treebanks. Many of the cases in which UDPipe performs better are those where the training treebank is very small, e.g., Kazakh (kk), Ukrainian (uk), and Uyghur (ug), or not available at all, i.e., the surprise languages: Buryat (bxr), Kurmanji (kmr), Upper Sorbian (hsb), and North Sámi (sme), for which our approach is somewhat naive (Section 2) and UDPipe always performs better. We can also see that for some treebanks (e.g., et, fi pud, and hu), our system performs better in UAS but worse in LAS. This may be due to the design of the baseline biaffine model, which determines the best unlabeled tree before assigning the labels (Section 3), i.e., it does not perform labeled parsing as a single task.
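The difference between the two metrics can be seen in a small sketch. UAS counts tokens whose predicted head is correct, while LAS additionally requires the correct dependency label. The function `attachment_scores` and the toy sentence are illustrative assumptions; the official evaluation also has to align system tokens with gold tokens, which is omitted here.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) in percent.

    gold and pred are equal-length lists with one (head_index, label)
    pair per token; identical tokenization is assumed.
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    total = len(gold)
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    labeled_hits = sum(g == p for g, p in zip(gold, pred))
    return 100.0 * head_hits / total, 100.0 * labeled_hits / total


gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (3, "punct")]
uas, las = attachment_scores(gold, pred)
# Three of four heads are correct (UAS 75.0); only two tokens also have
# the correct label (LAS 50.0).
```

A parser that fixes the unlabeled tree first can therefore score well on UAS and still lose LAS points at the labeling step, as in the third token of this example.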
Our system (NAIST-SATO) achieves an overall average LAS of 70.13, ranking 6th among the 33 participants in the shared task. UDPipe (68.35) ranks 13th.

Related Work
An approach related to ours in parsing is that of Ammar et al. (2016), where a single multilingual dependency parser parses sentences in several languages. Unlike our final architecture, their model shares all parameters across different languages. In this study, we found that it is important to model language-specific syntactic features explicitly with separate Bi-LSTMs.
Our network architecture for domain adaptation is an extension of Ganin and Lempitsky (2015), which applies adversarial training (Goodfellow et al., 2014) to domain adaptation. There is little prior work applying this adversarial domain adaptation technique to NLP tasks; Chen et al. (2016) use it for cross-lingual sentiment classification, in which the adversarial component has a classifier that tries to classify the language of an input sentence. To our knowledge, this is the first study applying adversarial training to parsing. In addition to the simple application, we also proposed an extended architecture with domain-specific LSTMs and demonstrated their importance.

Conclusion
We have proposed a domain adaptation technique with adversarial training for parsing. By applying it to the recent state-of-the-art graph-based dependency parsing model with Bi-LSTMs, we obtained a consistent score improvement, especially for the treebanks with less training data. For the architecture design, we found that it is important to include a network layer that captures the domain-specific representation. We also performed a small experiment on training across multiple languages and obtained an encouraging result. In this work, we have not investigated incorporating information on language families, so a natural future direction would be to investigate whether typological knowledge helps to select good combinations of languages for training multilingual models.