Transductive Learning of Neural Language Models for Syntactic and Semantic Analysis

In transductive learning, an unlabeled test set is used for model training. Although this setting deviates from the common assumption of a completely unseen test set, it is applicable in many real-world scenarios, wherein the texts to be processed are known in advance. However, despite its practical advantages, transductive learning is underexplored in natural language processing. Here we conduct an empirical study of transductive learning for neural models and demonstrate its utility in syntactic and semantic tasks. Specifically, we fine-tune language models (LMs) on an unlabeled test set to obtain test-set-specific word representations. Through extensive experiments, we demonstrate that despite its simplicity, transductive LM fine-tuning consistently improves state-of-the-art neural models in in-domain and out-of-domain settings.


Introduction
In supervised learning, a model is trained on a training set and its generalization performance is evaluated on an unseen test set. In this setting, the model has no access to the test set during training. However, the assumption of a completely unseen test set is not always necessary. In many cases, certain aspects of the test set are already known at training time. For example, a company may want to annotate a large number of existing documents automatically (Section 3). In such a scenario, the texts to be processed are known in advance, and using the model trained on the texts themselves to process them can be more efficient. Using an unlabeled test set in this way is the key idea behind transductive learning.
In transductive learning (Vapnik, 1998), an unlabeled test set is given in the training phase. That is, the inputs of the test set, i.e., the raw texts, can be used during training, but the labels are never used. In the test phase, the trained model is evaluated on the same test set. Despite its practical advantages, transductive learning has received little attention in natural language processing (NLP). After the pioneering work of Joachims (1999), who proposed a transductive support vector machine for text classification, transductive methods for linear models have been investigated in only a few tasks, such as lexical acquisition (Duh and Kirchhoff, 2006) and machine translation (Ueffing et al., 2007). In particular, transductive learning with neural networks is underexplored.
Here, we investigate the impact of transductive learning on state-of-the-art neural models in syntactic and semantic tasks, namely syntactic chunking and semantic role labeling (SRL). Specifically, inspired by recent findings that language model (LM)-based word representations yield large performance improvements (Devlin et al., 2019), we fine-tune Embeddings from Language Models (ELMo; Peters et al., 2018) on an unlabeled test set and use the resulting representations in each task-specific model. Typically, LMs are trained on a large-scale corpus whose word distributions differ from those of the test set. By contrast, transductive learning allows us to fit LMs directly to the distributions of the test set. Our experiments show the effectiveness of transductive LM fine-tuning.
In summary, our main contributions are:
• This work is the first to introduce an LM fine-tuning method to transductive learning.
• Through extensive experiments in both in-domain and out-of-domain settings, we demonstrate that transductive LM fine-tuning consistently improves state-of-the-art neural models in syntactic and semantic tasks.
Related Work

Unsupervised domain adaptation. Transductive learning is related to unsupervised domain adaptation, in which models are adapted to a target domain by using unlabeled target-domain texts (Ben-David et al., 2010; Shi and Sha, 2012). This setting does not allow models to access the test set, which is the main difference between unsupervised domain adaptation and transductive learning. Various unsupervised adaptation methods have been proposed for linear models (Blitzer et al., 2006; Jiang and Zhai, 2007; Tsuboi et al., 2009; Søgaard, 2013). In the context of neural models, adversarial domain adaptation (Ganin and Lempitsky, 2015; Ganin et al., 2016; Guo et al., 2018), importance weighting (Wang et al., 2017), structural correspondence learning (Ziser and Reichart, 2017), self-, tri-, and co-training (Saito et al., 2017; Ruder and Plank, 2018), and other techniques orthogonal to transductive LM fine-tuning have been applied successfully in unsupervised domain adaptation. Integrating these methods with transductive LM fine-tuning is an interesting direction for future research.
LM-based word representations. Recently, LM-based word representations pre-trained on unlabeled data have gained considerable attention (Radford et al., 2018; Devlin et al., 2019). The method most closely related to ours is Universal Language Model Fine-tuning (ULMFiT), which pre-trains an LM on a large general-domain corpus and fine-tunes it on the target task (Howard and Ruder, 2018). Inspired by these studies, we introduce LM-based word representations into transductive learning.
Figure 1: Training procedure. (1) LM pre-training: the LM is first pre-trained on a large-scale unlabeled corpus. (2) Transductive LM fine-tuning: the LM is then fine-tuned on the unlabeled test set; note that the test set used for training is identical to the one used in evaluation. (3) Task-specific model training: the task-specific model is trained on the training set. L denotes the loss function.

Neural Transductive Learning
Motivation. Suppose that a company has received a vast amount of customer reviews and wants to process these reviews automatically and as accurately as possible, even if doing so takes some time. For this purpose, they do not have to build a model that works well on new, unseen reviews. Instead, they want a model that works well only on the reviews at hand. In this situation, using these reviews themselves to train the model can be more efficient. This is the key motivation for developing effective and practical transductive learning methods. Toward this goal, we develop transductive methods for state-of-the-art neural models.
Problem formulation. In the training phase, a training set D_train = {(X_i, Y_i)} and an unlabeled test set D_test = {X_j} are used for model training, where X_i is an input, e.g., a sentence, and Y_i represents target labels, e.g., labels from a set of syntactic or semantic annotations. Only the inputs of the test set are available during training; its labels are never used. In the test phase, the trained model predicts labels and is evaluated on the same test set D_test.
Method. We present a simple transductive method for neural models. Specifically, we fine-tune an LM on an unlabeled test set. Figure 1 illustrates the training procedure, which consists of the following steps: (1) LM pre-training, (2) transductive LM fine-tuning, and (3) task-specific model training. We first train an LM on a large-scale unlabeled corpus D_large and then fine-tune the LM on an unlabeled test set D_test. Finally, we use the fine-tuned LM as the embedding layer of each task-specific model and train the model on a training set D_train.
Formally, the three steps can be written as:

  Θ̂ = argmin_Θ L_lm(Θ; D_large)   (1)
  Θ̃ = argmin_Θ L_lm(Θ; D_test)   (2)
  Φ̂ = argmin_Φ L_task(Φ; Θ̃, D_train)   (3)

where the optimization in Eq. 2 is initialized with the pre-trained parameters Θ̂. Here, L_lm and L_task are the loss functions for the LM and the task-specific model, respectively. In the LM pre-training and fine-tuning phases (Eqs. 1 and 2), we first train the initial LM parameters and then fine-tune the pre-trained parameters, yielding Θ̃. In the task-specific training phase (Eq. 3), we fix the fine-tuned LM parameters Θ̃, which serve as the embedding layer of the task-specific model, and train only the task-specific model parameters Φ.
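The three steps can be sketched end-to-end with a toy model. In the sketch below, a unigram count model stands in for the ELMo LM and a relative-frequency feature stands in for its embeddings; all names (train_lm, embed, the corpora) are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter

def train_lm(corpus, init=None):
    """Toy 'LM': unigram counts. Passing `init` continues training from
    pre-trained parameters, which models the fine-tuning step (Eq. 2)."""
    counts = Counter(init) if init is not None else Counter()
    for sentence in corpus:
        counts.update(sentence.split())
    return counts

def embed(word, lm):
    """1-d stand-in for an LM embedding: the word's relative
    frequency under the (frozen) fine-tuned LM."""
    total = sum(lm.values())
    return lm[word] / total

# (1) LM pre-training on a large unlabeled corpus D_large
d_large = ["the man kept a cat", "the cat sat on the mat"]
theta_hat = train_lm(d_large)

# (2) Transductive fine-tuning on the unlabeled test set D_test:
# the same sentences that will later be evaluated; labels are never used.
d_test = ["the dog kept a bone"]
theta_tilde = train_lm(d_test, init=theta_hat)

# Fine-tuning shifts the LM toward the test-set distribution:
# "dog" was unseen during pre-training but is now in the vocabulary.
assert theta_hat["dog"] == 0 and theta_tilde["dog"] == 1

# (3) Task-specific training: theta_tilde is frozen and only the task
# model (here reduced to feature extraction) would be learned on D_train.
features = [embed(w, theta_tilde) for w in "the dog".split()]
```

The point of the sketch is the data flow, not the model: the test-set inputs influence only the LM parameters, never the task labels.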

Experiments
Tasks. To investigate the effectiveness of transductive LM fine-tuning for syntactic and semantic analysis, we conduct experiments in syntactic chunking (Ramshaw and Marcus, 1999; Sang and Buchholz, 2000; Ponvert et al., 2011) and SRL (Gildea and Jurafsky, 2002; Palmer et al., 2005; Carreras and Màrquez, 2005). The goal of syntactic chunking is to divide a sentence into non-overlapping phrases that consist of syntactically related words. The goal of SRL is to identify semantic arguments for each predicate. For example, consider the following sentence:

  The man kept a cat
  SYNCHUNK: [ The man ]_NP kept [ a cat ]_NP
  SEMROLE:  [ The man ]_A0 kept [ a cat ]_A1

In syntactic chunking, given the input sentence, systems have to recognize "The man" and "a cat" as noun phrases (NP). In SRL, given the input sentence and the target predicate "kept", systems have to recognize "The man" as the A0 argument and "a cat" as the A1 argument. For syntactic chunking, we adopted the experimental protocol of Ponvert et al. (2011), and for SRL, we followed Ouchi et al. (2018).

Results. Table 2 shows the F1 scores on each test set. All reported F1 scores are the average of five distinct trials using different random seeds. In each cell, the left-hand side denotes the F1 score of the baseline (using a base LM without fine-tuning) and the right-hand side denotes the F1 score of the transductive model (using an LM fine-tuned on each test set). In both the in-domain (same source/target domains, e.g., BC→BC) and out-of-domain (different source/target domains, e.g., BC→NW) settings, all transductive models consistently outperformed the baselines, which suggests that transductive LM fine-tuning improves the performance of neural models. Although the improvements were modest (around 1.0 F1 gain), these consistent improvements are valuable empirical results given the difficulty of unsupervised and low-resource adaptation settings.
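Phrase and argument annotations of this kind are conventionally encoded as per-token BIO tags, which is what sequence-labeling models for chunking and SRL predict. A minimal sketch of the encoding for the example sentence (the helper spans_to_bio is illustrative, not code from the paper):

```python
def spans_to_bio(n_tokens, spans):
    """Convert labeled (start, end, label) spans, with `end` exclusive,
    into per-token BIO tags; tokens outside any span get 'O'."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = "B-" + label          # first token of the phrase
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining tokens of the phrase
    return tags

tokens = ["The", "man", "kept", "a", "cat"]
# Chunking: "The man" and "a cat" are noun phrases (NP).
chunk_tags = spans_to_bio(len(tokens), [(0, 2, "NP"), (3, 5, "NP")])
# SRL for the predicate "kept": "The man" is A0, "a cat" is A1.
role_tags = spans_to_bio(len(tokens), [(0, 2, "A0"), (3, 5, "A1")])
```

Here `chunk_tags` is `["B-NP", "I-NP", "O", "B-NP", "I-NP"]`, so both tasks reduce to predicting one tag per token (per predicate, in the SRL case).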

Analysis
Comparison between unsupervised domain adaptation and transduction. In unsupervised domain adaptation, target-domain unlabeled data (texts whose domain is the same as that of the test set) is used for adaptation. Although the domain is identical between the target-domain data and the test set, their word distributions are somewhat different. In transductive learning, because the unlabeled test set itself can be used for training, it is possible to adapt LMs directly to the word distributions of the test set. Here, we investigate whether adapting LMs directly to each test set is more effective than adapting LMs to target-domain unlabeled data. Similarly to our transductive method shown in Figure 1, we first train LMs on the large-scale unlabeled corpus (the 1B word benchmark corpus) and then fine-tune them on the unlabeled target-domain data. In addition, we control the sizes of the target-domain unlabeled data and the test sets: we use the same number of sentences from the unlabeled data of each target domain as in each test set. Table 3 shows the F1 scores averaged across all the target domains. The transductive models (T) consistently outperformed the domain-adapted models (CU). This demonstrates that adapting LMs directly to test sets is more effective than adapting them to target-domain unlabeled data.

Combination of unsupervised domain adaptation and transduction. In real-world situations, large-scale unlabeled data of target domains is sometimes available. In such cases, LMs can be trained on both the target-domain unlabeled data and the test sets. Here, we investigate the effectiveness of using both datasets. Table 4 shows the F1 scores averaged across all the target domains. Fine-tuning the LMs on the target-domain unlabeled data as well as each test set (U + T) showed better performance than fine-tuning them only on the target-domain unlabeled data (U). This combination of transduction with unsupervised domain adaptation further improves performance.
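The settings compared in this analysis can be sketched as a small corpus-assembly helper: "T" fine-tunes on the test set itself, "U" on target-domain unlabeled data size-matched to the test set (the controlled comparison), and "U+T" on both. The function and its names are illustrative assumptions, not the authors' code.

```python
import random

def build_finetune_corpus(target_unlabeled, test_sentences, setting, seed=0):
    """Assemble the sentences used for LM fine-tuning under each setting."""
    if setting == "T":
        # Transduction: fine-tune on the unlabeled test set itself.
        return list(test_sentences)
    if setting == "U":
        # Domain adaptation, size-matched: sample as many target-domain
        # sentences as there are test sentences.
        rng = random.Random(seed)
        k = min(len(test_sentences), len(target_unlabeled))
        return rng.sample(list(target_unlabeled), k)
    if setting == "U+T":
        # Combination: all target-domain unlabeled data plus the test set.
        return list(target_unlabeled) + list(test_sentences)
    raise ValueError("unknown setting: " + setting)

unlabeled = ["u1", "u2", "u3", "u4"]
test = ["t1", "t2"]
corpus_u = build_finetune_corpus(unlabeled, test, "U")    # 2 sentences
corpus_ut = build_finetune_corpus(unlabeled, test, "U+T") # 6 sentences
```

The size-matching in the "U" branch is what makes the T-versus-CU comparison controlled: both settings fine-tune on the same number of sentences, so any gain for T is attributable to fitting the test set directly rather than to seeing more data.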
Effects on standard benchmarks. Some studies have indicated that when promising new techniques are evaluated only on very basic models, it can be difficult to determine how much (if any) improvement will carry over to stronger models (Denkowski and Neubig, 2017; Suzuki et al., 2018). Motivated by such studies, we provide results in standard benchmark settings. For syntactic chunking, we use the CoNLL-2000 dataset (Sang and Buchholz, 2000) and follow the standard experimental protocol (Hashimoto et al., 2017). For SRL, we use the CoNLL-2005 (Carreras and Màrquez, 2005) and CoNLL-2012 (Pradhan et al., 2012) datasets and follow the standard experimental protocol (Ouchi et al., 2018). Table 5 shows the F1 scores of our models and those of existing models. The results of the baseline model were comparable with those of the state-of-the-art models, and the transductive model consistently outperformed the baseline model. Note that we cannot fairly compare the transductive and existing models due to the difference in settings. These results, however, demonstrate that transductive LM fine-tuning improves state-of-the-art chunking and SRL models.

Conclusion
In this study, we investigated the impact of transductive learning on state-of-the-art neural models in syntactic and semantic tasks. Specifically, we fine-tuned an LM on an unlabeled test set.
Through extensive experiments, we demonstrated that, despite its simplicity, transductive LM fine-tuning contributes to consistent performance improvements of state-of-the-art syntactic and semantic models in cross-domain settings. One interesting line of future work is to explore effective transductive methods for task-dependent (neural) layers. For instance, as some unsupervised domain adaptation methods can be applied to transductive learning, integrating them with transductive LM fine-tuning may further improve performance. Another line of future work is to apply these transductive methods to various NLP tasks and investigate their performance.