Semi-Supervised Semantic Role Labeling with Cross-View Training

The successful application of neural networks to a variety of NLP tasks has provided strong impetus to develop end-to-end models for semantic role labeling which forego the need for extensive feature engineering. Recent approaches rely on high-quality annotations which are costly to obtain, and mostly unavailable in low resource scenarios (e.g., rare languages or domains). Our work aims to reduce the annotation effort involved via semi-supervised learning. We propose an end-to-end SRL model and demonstrate that it can effectively leverage unlabeled data under the cross-view training modeling paradigm. Our LSTM-based semantic role labeler is jointly trained with a sentence learner, which performs POS tagging, dependency parsing, and predicate identification, tasks which we argue are critical to learning directly from unlabeled data without recourse to external pre-processing tools. Experimental results on the CoNLL-2009 benchmark dataset show that our model outperforms the state of the art in English, and consistently improves performance in other languages, including Chinese, German, and Spanish.


Introduction
Semantic role labeling, the task of automatically identifying and labeling the semantic roles conveyed by sentential constituents, has enjoyed renewed interest in the last few years thanks to the popularity of neural network models and their ability to learn continuous representations which forego the need for extensive feature engineering. Recent modeling developments aside, semantic role labeling (SRL) has been generally recognized as a core task in NLP with relevance for applications ranging from machine translation (Aziz et al., 2011; Marcheggiani et al., 2018), to information extraction (Christensen et al., 2011), and summarization (Khan et al., 2015).
State-of-the-art semantic role labelers (He et al., 2018b) rely on high-quality annotations (of semantic predicates and their arguments) for use in training. These annotations are costly to obtain, and mostly unavailable in low resource scenarios (e.g., rare languages or domains), motivating the need for effective semi-supervised methods that leverage unlabeled examples. Cross-View Training (CVT; Clark et al., 2018) is a recently proposed semi-supervised learning algorithm that improves representation learning using a mix of labeled and unlabeled data. The main idea behind CVT is to train a model to produce consistent predictions across different restricted views of the input with the aid of auxiliary prediction tasks. Clark et al. (2018) demonstrate performance gains when applying CVT to sequence tagging tasks, machine translation, and dependency parsing.
Unfortunately, applying CVT to semantic role labeling is fraught with difficulty. This is partly due to the nature of the task, which relies on various syntactic features even when conceptualized as a sequence labeling task (He et al., 2018b). The reliance on syntactic features is problematic for semi-supervised training, since these would have to be extracted from large amounts of unlabeled data. In addition, any semantic role labeler would need to identify (and disambiguate) predicates prior to labeling their arguments; predicates might be given during training (e.g., as in the CoNLL-2009 dataset), but would still have to be detected on unlabeled data. Resorting to various pre-processing tools for semi-supervised training almost defeats the purpose of using unlabeled data, which would have to be annotated, albeit automatically, with pre-trained models which require labeled data of their own, and might not be portable across languages and domains. Moreover, the usage of external tools often leads to pipeline-style architectures where errors propagate to later processing stages, affecting model performance.
In this paper we aim to render semi-supervised learning for semantic role labeling as simple as possible, by eliminating the reliance on multiple external pre-processing tools. We develop a sentence learner which is able to perform all tasks subsidiary to semantic role labeling (i.e., POS tagging, dependency parsing, and predicate detection) on labeled and unlabeled data. The sentence learner is jointly trained with a semantic role labeler, and its outputs are fed to the semantic role labeler during supervised and semi-supervised learning. Aside from building a self-sufficient semantic role labeler which can be directly applied to plain text, an added benefit of the proposed approach is that the sentence learner naturally provides multiple hidden layers (for the various subtasks) from which "multi-task hidden features" (Peters et al., 2018) can be extracted.
In addition to overcoming the difficulty of utilizing plain text for semi-supervised SRL, we show that applying CVT to SRL requires special attention over and above the sequence tagging and dependency parsing tasks discussed in previous work. We investigate how to best formulate different views for CVT, focusing on semantic predicates, which have proven very useful in recent SRL models. Experimental results on the CoNLL-2009 benchmark datasets show that our model outperforms the state of the art in English, and improves SRL performance in other languages, including Chinese, German, and Spanish.

Model Description
Figure 1 provides a schematic overview of our model which has two main components, namely a sentence learner and a semantic role labeler. The sentence learner consists of:
• look-ups of word embeddings and character embeddings;
• a bidirectional LSTM (BiLSTM) encoder which takes as input the representation of each word in a sentence and produces context-dependent representations;
• a multi-task prediction module to perform POS tagging, dependency parsing, and predicate identification.
The semantic role labeler consists of:
• an input layer which combines multi-source representations of the input sentence;
• a primary bidirectional LSTM (BiLSTM) encoder which takes as input the representation of each word in a sentence and produces context-dependent embeddings;
• a high-level BiLSTM encoder which takes as input the hidden states of the primary BiLSTM and multi-task hidden features;
• a biaffine classifier for calculating the score of each semantic role for each word.

Sentence Learner
The sentence learner (see Figure 1b) operates over sentences to perform the intermediate tasks of POS tagging, dependency parsing, and predicate identification, which are subsequently used to inform the decisions of the semantic role labeler.
Sentence Encoder As we expect the sentence encoder to be applied directly to plain text, we represent words as the concatenation of character- and word-level features. We learn character-level representations x_cr by feeding character embeddings to a convolutional neural network module. We represent words with randomly initialized word embeddings x_re ∈ R^dw, pre-trained word embeddings x_pe ∈ R^dw estimated on an external text collection, and pre-trained ELMo embeddings x_elmo (Peters et al., 2018). The final word representation is the concatenation:

x = [x_re; x_pe; x_cr; x_elmo]

Following previous work, sentences are represented using a two-layer bidirectional LSTM (Hochreiter and Schmidhuber, 1997); the BiLSTM receives at time step t a representation x for each word and recursively computes two hidden states, one for the forward pass (→h_t), and another one for the backward pass (←h_t). Each word is represented by the concatenation of its forward and backward LSTM state vectors:

h_t = [→h_t; ←h_t]

Multi-task Learning After obtaining word representations with the sentence encoder, the input is POS tagged, dependency parsed, and predicates are identified. Given sentence s, the probability distribution of POS tags for word w is obtained using a one hidden-layer neural network applied to the corresponding encoder output h_w^2:

p(y_w^pos | s) = softmax(MLP_pos(h_w^2))

For dependency parsing, words in a sentence are treated as nodes in a graph. In particular, each word w in sentence s receives exactly one in-going edge (u, w, r) from head word u to its dependent w with relation r. We use a graph-based dependency parser similar to those presented in previous work, which treats dependency parsing as a classification task whose goal is to predict which in-going edge (u, w, r) connects to each word w. Mathematically, the probability of an edge is:

p((u, w, r) | s, w) = softmax_{u', r'} score(u', w, r')

where "score" is the scoring function:

score(u, w, r) = g_u^T W_r g_w + g_u^T W g_w

The bilinear classifier uses a weight matrix W_r specific to the candidate relation and a weight matrix W shared across all dependency relations.
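The edge-scoring scheme above can be sketched as follows. This is a minimal NumPy illustration, not the authors' PyTorch implementation: the tensor shapes, the variable names, and the joint softmax over candidate (head, relation) pairs are assumptions made for the sketch.

```python
import numpy as np

def biaffine_edge_scores(H, W_rel, W_shared):
    """Score every candidate edge (u, w, r).

    H        : (T, d)    encoder outputs, one row per word
    W_rel    : (R, d, d) one bilinear matrix per dependency relation
    W_shared : (d, d)    bilinear matrix shared across all relations
    Returns  : (T, T, R) scores[u, w, r] for head u, dependent w, relation r
    """
    shared = H @ W_shared @ H.T                      # (T, T): relation-independent term
    rel = np.einsum('ud,rde,we->uwr', H, W_rel, H)   # (T, T, R): relation-specific term
    return rel + shared[:, :, None]

def edge_probabilities(scores):
    """For each dependent w, normalise over all candidate heads and relations."""
    T = scores.shape[0]
    flat = scores.transpose(1, 0, 2).reshape(T, -1)  # (T, T*R)
    flat = flat - flat.max(axis=1, keepdims=True)    # numerical stability
    p = np.exp(flat)
    return p / p.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
T, d, R = 5, 8, 3                                    # toy sentence length / dims
H = rng.normal(size=(T, d))
scores = biaffine_edge_scores(H, rng.normal(size=(R, d, d)), rng.normal(size=(d, d)))
P = edge_probabilities(scores)                       # rows sum to one
```

In the real model the head and dependent representations would first pass through separate hidden layers; the sketch scores the raw encoder outputs directly to keep it short.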
With regard to predicate identification, we introduce a virtual root, following previous work which models the entire SRL task as word pair classification. Similarly to the dependency parsing module described above, representations produced by the encoder for the virtual root and words are passed through separate hidden layers, and a biaffine classifier is applied to produce a score for each word.

Semantic Role Labeler
Word Representation For our SRL model, words are represented by a vector x which is the concatenation of four types of features: predicate-specific, character-level, word-level, and multi-task features. Following previous work (Marcheggiani et al., 2017), we leverage a predicate-specific indicator embedding x_ie rather than directly using a binary flag. Character- and word-level features are shared with the sentence learner.
With regard to multi-task features, for each word w in input sentence s, we employ a probability-weighted POS tag embedding e_pos and dependency relation embedding e_dep:

e_pos = Σ_l p(y^pos = l | s, w) e_l
e_dep = Σ_r p(y^dep = r | s, u) e_r

where e_l and e_r are the embeddings of POS tags and dependency relations, respectively, and p(y^dep = r | s, u) is the probability of relation r given its predicted dependency head u.
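The probability-weighted embedding amounts to taking the expectation of the label embeddings under the tagger's posterior, rather than committing to the 1-best label. A short sketch (shapes are illustrative assumptions):

```python
import numpy as np

def soft_embedding(posterior, embeddings):
    """Expected label embedding for each word.

    posterior  : (T, K) distribution over K labels for each of T words
    embeddings : (K, d) one embedding per label
    Returns    : (T, d) probability-weighted embeddings
    """
    return posterior @ embeddings

rng = np.random.default_rng(0)
T, K, d = 4, 6, 5
logits = rng.normal(size=(T, K))
posterior = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
E_pos = rng.normal(size=(K, d))       # POS tag embedding table
e_pos = soft_embedding(posterior, E_pos)   # (T, d) soft POS features
```

When the posterior is a one-hot distribution, this reduces exactly to a conventional hard embedding look-up, so the soft variant is a strict generalization.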
In order to incorporate more syntactic information, we adopt as an additional feature x_pr, the probability of linking a word to the candidate predicate, which we obtain from the sentence learner. x_pr consists of two scalar values, p_head and p_dep, which represent the probability of a word being the syntactic head or dependent of the current predicate, respectively.

Multi-task Hidden Features
Drawing inspiration from ELMo (Peters et al., 2018), a recently proposed model for generating word representations based on bidirectional LSTMs trained with a coupled language modeling objective, we extract various hidden features via multi-task learning. ELMo representations are deep, essentially a linear combination of representations learnt at all layers of an LSTM instead of just the final layer. Compared with unsupervised ELMo representations, our sentence learner takes advantage of labeled data: it attempts to learn representations tailored to multiple SRL-related tasks rather than generally effective ones.
In order to utilize all hidden layers in the sentence learner, we collapse them into a single vector. Although we could simply concatenate these layers or select the top one, we compute multi-task hidden features h_MT as a weighting of the BiLSTM layers, followed by a non-linear projection:

h_MT = f(γ Σ_{j=1}^{L} β_j h_j)

where L is the depth of the sentence learner, β are softmax-normalized weights for h_j, and the scalar parameter γ is of practical importance to aid optimization (Peters et al., 2018).
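The layer-weighting scheme can be sketched as follows. This is an illustrative NumPy version, not the authors' code; the choice of tanh as the non-linear projection f is an assumption of the sketch.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def multitask_hidden_features(layers, beta, gamma, W_proj):
    """ELMo-style scalar mix of the sentence learner's layers.

    layers : list of L arrays, each (T, d) - hidden states per layer
    beta   : (L,) unnormalised layer weights (softmax-normalised below)
    gamma  : scalar scale, learned to aid optimisation
    W_proj : (d, d') projection applied after the weighted sum
    Returns: (T, d') multi-task hidden features h_MT
    """
    w = softmax(beta)
    mix = gamma * sum(wj * hj for wj, hj in zip(w, layers))  # weighted sum over layers
    return np.tanh(mix @ W_proj)                             # non-linear projection f

rng = np.random.default_rng(0)
layers = [rng.normal(size=(4, 6)) for _ in range(3)]         # L=3 toy layers
h_mt = multitask_hidden_features(layers, np.zeros(3), 1.0, rng.normal(size=(6, 5)))
```

With beta initialised to zeros, the softmax yields uniform layer weights, which is a common starting point before the weights are learned.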
Biaffine Role Scorer After the high-level BiLSTM encoder produces representations h for each word, we perform two distinct non-linear transformations for the currently considered predicate and its candidate arguments, respectively:

h_pred = ReLU(W_pred h + b_pred)
h_arg = ReLU(W_arg h + b_arg)

where h_pred and h_arg are hidden representations for the predicate and candidate arguments. The score s_role of a semantic role between a predicate and its arguments is calculated as:

s_role = h_pred^T W_role h_arg + U_role (h_pred ⊕ h_arg) + b_role

where W_role, U_role, and b_role are parameters updated during training.
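A compact sketch of the biaffine role scorer, combining the bilinear term with the linear term over the concatenated representations. Again this is an illustrative NumPy version under assumed shapes, not the authors' implementation:

```python
import numpy as np

def role_scores(h_pred, H_args, W_role, U_role, b_role):
    """Score every semantic role for every candidate argument of one predicate.

    h_pred : (d,)       predicate representation (after its own MLP)
    H_args : (T, d)     candidate-argument representations (after their MLP)
    W_role : (d, R, d)  bilinear tensor, one slice per role
    U_role : (2d, R)    linear weights over [h_pred; h_arg]
    b_role : (R,)       role bias
    Returns: (T, R)     score of each of R roles for each candidate argument
    """
    bilinear = np.einsum('d,dre,te->tr', h_pred, W_role, H_args)
    pairs = np.concatenate([np.tile(h_pred, (H_args.shape[0], 1)), H_args], axis=1)
    return bilinear + pairs @ U_role + b_role

rng = np.random.default_rng(1)
d, R, T = 6, 4, 5
scores = role_scores(rng.normal(size=d), rng.normal(size=(T, d)),
                     rng.normal(size=(d, R, d)), rng.normal(size=(2 * d, R)),
                     rng.normal(size=R))
```

A softmax over the role axis would then give the per-argument role distribution used at training time.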

Cross-view Training for SRL
CVT works by improving representation learning for a model. Let D_ul = {x_1, x_2, ..., x_N} represent an unlabeled dataset. We use p_θ(y | x_i) to denote the output distribution over classes produced by the model with parameters θ. CVT adds multiple different auxiliary prediction modules to a model, which are used when learning on unlabeled examples. Each prediction module takes as input an intermediate representation h^j(x_i) produced by a primary BiLSTM encoder and outputs a distribution over all possible classes p_θ^j(y | x_i). Each h^j is chosen such that it can only see parts of the input.
Given an unlabeled example, the model first produces soft targets p_θ(y | x_i) by performing inference. CVT then trains the auxiliary prediction modules to match the teacher prediction module on the unlabeled data by minimizing:

L_CVT = (1 / |D_ul|) Σ_{x_i ∈ D_ul} Σ_j D(p_θ(y | x_i), p_θ^j(y | x_i))

where D is a distance function between probability distributions (we use KL divergence). During training, we keep predictions p_θ(y | x_i) from the teacher module fixed, so that the auxiliary modules learn to imitate the teacher, but not vice versa. As the auxiliary modules train, the representations they use as input improve, so they are useful for making predictions even when some of the model's inputs are not available. This in turn improves the primary prediction module, which is built on top of the same shared representations.

We apply CVT to the primary BiLSTM encoder of our semantic role labeler, while utilizing the output of the sentence learner on the unlabeled data. Given unlabeled sentence s = w_1, ..., w_T, the primary BiLSTM encoder produces hidden representations h_pri for each word, while the semantic role labeler produces the teacher prediction. The sentence learner may recognize more than one word as a predicate in sentence s, in which case we randomly choose one as the target predicate w_p.
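The consistency loss can be sketched in a few lines. In the PyTorch model, keeping the teacher fixed corresponds to detaching its predictions from the computation graph; in this NumPy sketch (an illustration, not the authors' code) the teacher array is simply treated as a constant target.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Row-wise KL(p || q) between probability distributions; eps avoids log(0)."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def cvt_loss(teacher_probs, aux_probs_list):
    """CVT consistency loss for one unlabeled example.

    teacher_probs  : (T, C) full-view predictions, held fixed as soft targets
    aux_probs_list : list of (T, C) predictions from restricted-view modules
    Returns        : scalar loss (mean KL per auxiliary module, summed)
    """
    return float(sum(kl(teacher_probs, q).mean() for q in aux_probs_list))

teacher = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.8, 0.1]])
student = np.array([[0.5, 0.3, 0.2],
                    [0.2, 0.6, 0.2]])
loss_match = cvt_loss(teacher, [teacher.copy()])   # identical views: ~0
loss_diverge = cvt_loss(teacher, [student])        # restricted view disagrees: > 0
```

Minimizing this quantity drives the restricted-view modules, and therefore the shared encoder beneath them, toward the teacher's predictions.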
The auxiliary prediction modules take →h_t^pri and ←h_t^pri as input. Specifically, we add the following four auxiliary prediction modules to the model. The "forward" module makes predictions without seeing the right context of the current word.
[Figure 2 (caption fragment): ... training set. The current predicate is "believe", while "commission" and "in" are two candidate arguments.]
The "future" module makes predictions without seeing the right context and the current word itself. The "backward" and "past" modules are defined analogously for left contexts. Figure 2 illustrates the auxiliary modules and the types of context they see. Unlike the biaffine role scorer, the auxiliary modules do not explicitly use the hidden state of the target predicate, so as to encourage the primary Bi-LSTM encoder to capture long-distance relations between the predicate and its context.
We empirically observed (see Section 3) that applying CVT on representations which do not "see" the target predicate is not ideal for SRL. We therefore devised a strategy which applies different auxiliary modules to each word depending on its relative position to the target predicate. We only apply "backward" and "past" modules to words preceding the predicate, while "forward" and "future" modules apply to words following the predicate. This way, we ensure that each word is aware of the current predicate when performing CVT. In the example in Figure 2, "backward" and "past" views would be applied to "commission", and "forward" and "future" views to "in".
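The view-assignment rule is simple enough to state as code. The sketch below (hypothetical helper, not from the paper) makes the key invariant explicit: every word gets only views whose visible context contains the predicate.

```python
def views_for_word(t, p):
    """Auxiliary views applied to the word at position t, for a target
    predicate at position p (0-based). "backward"/"past" hide the left
    context, so they still see a predicate lying to the right; "forward"/
    "future" hide the right context, so they still see a predicate to the
    left. The predicate itself gets no restricted view (an assumption of
    this sketch)."""
    if t < p:
        return ("backward", "past")
    elif t > p:
        return ("forward", "future")
    return ()

# For the Figure 2 example ("commission" before "believe", "in" after):
before = views_for_word(0, 2)   # ("backward", "past")
after = views_for_word(4, 2)    # ("forward", "future")
```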
We also apply CVT to the first hidden layer of the sentence learner to further improve the performance of the auxiliary tasks, utilizing the views introduced by Clark et al. (2018) for sequence tagging and dependency parsing.

Training Objectives
For both supervised learning and cross-view training, our model makes predictions on labeled and unlabeled examples across all tasks (SRL and auxiliary tasks). During supervised learning, the model is trained on labeled data and its objective is the sum of cross-entropy losses for all tasks. With respect to multi-task CVT, the model takes unlabeled examples as input and calculates the CVT loss described above. The semi-supervised objective is the sum of the CVT losses for all auxiliary modules across all tasks.
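The two objectives compose by plain summation, which a short sketch makes concrete (the nesting of per-task, per-module losses is an assumption about bookkeeping, not a detail given in the paper):

```python
def supervised_loss(task_ce_losses):
    """Supervised objective: sum of cross-entropy losses over all tasks
    (SRL, POS tagging, dependency parsing, predicate identification)."""
    return sum(task_ce_losses)

def semi_supervised_loss(cvt_losses_per_task):
    """Semi-supervised objective: sum of CVT consistency losses over all
    auxiliary modules across all tasks. Input is a list of per-task lists,
    one entry per auxiliary module."""
    return sum(sum(per_module_losses) for per_module_losses in cvt_losses_per_task)

total_sup = supervised_loss([1.0, 2.0, 0.5])          # three tasks
total_cvt = semi_supervised_loss([[0.1, 0.2], [0.3]]) # two tasks, 2+1 modules
```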

Experiments
We implemented our model in PyTorch and evaluated it on the English CoNLL-2009 benchmark following the standard training, testing, and development set splits. To evaluate whether our model generalizes to other languages, we also report experiments on Chinese, German, and Spanish, again using standard CoNLL-2009 splits. This subset of languages has been commonly used in previous work (Björkelund et al., 2010; Roth and Lapata, 2016; Lei et al., 2015) and allows us to compare our model against a wide range of alternative approaches. The benchmarks contain gold-standard dependency annotations, and also gold lemmas, part-of-speech tags, and morphological features. With regard to unlabeled datasets, we used the 1 Billion Word Language Model Benchmark (Chelba et al., 2013). For experiments on English, we used pre-trained word embeddings learned using the structured skip n-gram approach. We also used a convolutional neural network (Chiu and Nichols, 2016; Ma and Hovy, 2016) to learn character-level representations. For Chinese, Spanish, and German, word embeddings were pre-trained on Wikipedia using fastText (Bojanowski et al., 2017).
The BiLSTM encoders in our model used recurrent dropout (Gal and Ghahramani, 2016) with an 80% keep probability between time-steps and layers during supervised training; the keep probability was set to 90% when applying the model to unlabeled data. We used the Adam optimizer (Kingma and Ba, 2014) and performed hyperparameter tuning and model selection on the English development set; optimal hyperparameter values (for all languages) are shown in Table 1.

Results
[Footnote: catalog.ldc.upenn.edu/LDC99T41]
[Table 3: CoNLL-2009 out-of-domain results (English; Brown test set). Differences in F1 between our models and previous systems are statistically significant (p < 0.05) using stratified shuffling (Noreen, 1989).]
Our results on the English (in-domain) test set are summarized in Table 2. We compared our system against previous models which employ external tools to obtain required features. We also report the results of various ensemble SRL models
(second block). Most comparisons involve neural systems which are based on BiLSTMs (He et al., 2018b) or use neural networks for learning SRL-specific embeddings (FitzGerald et al., 2015; Roth and Lapata, 2016). We also report the results of two strong symbolic models based on tensor factorization (Lei et al., 2015) and a pipeline of modules that carry out tokenization, lemmatization, part-of-speech tagging, dependency parsing, and semantic role labeling (Björkelund et al., 2010). As can be seen in Table 2, our supervised model outperforms previously published single and ensemble models. With cross-view training, our model achieves 91.2% F1 (the difference over the supervised model is statistically significant at p < 0.05), which is an absolute improvement of 1.4% over the state of the art.
Results on the out-of-domain English test set are presented in Table 3. We include comparisons with the same models as in the in-domain case. Again, our end-to-end model significantly outperforms previously published single and ensemble models, even without taking unlabeled data into account. We achieve a relatively higher improvement with CVT on out-of-domain data (F1 increases from 81.6% to 82.5%, and the difference is significant at p < 0.05). This suggests that semi-supervised training indeed increases the robustness of our model, leading to more accurate predictions for both SRL and auxiliary tasks.

Table 4 presents the results of our experiments (without ELMo) on Chinese, German, and Spanish. Although we have not performed detailed parameter selection for these languages (i.e., we used the same parameters as in English), our model achieves state-of-the-art performance across all three languages.

Ablation Studies
To investigate the contribution of the sentence learner and cross-view training, we conducted a series of ablation studies on the English development set without predicate disambiguation.
Our experiments are summarized in Table 5. The first block shows the performance of the full model. In the second block, we assess the effect of different kinds of representations used in our model. Interestingly, the impact of ELMo (about 0.6 in F1) is slightly smaller than that of multi-task hidden features (about 0.7 in F1). This suggests that multi-task hidden features provide information for SRL that is at least as useful as pre-trained representations. We next eliminate the sentence learner and have the semantic role labeler use the predicted POS tags and dependency labels provided in the CoNLL-2009 dataset. As can be seen, this leads to a substantial drop in performance over the full model (1.4% in F1).
In the third block, we remove cross-view training from our model, and observe a 0.8% drop in F1 over the full model. Finally, we apply the auxiliary modules to the full sentence instead of treating the words preceding and following the target predicate differently, and observe a 0.3% drop in F1 over the supervised model. This is not surprising, as the predicate indicator plays an important role in improving the performance of the semantic role labeler.

CVT Analysis
In Table 6, we briefly explore which auxiliary prediction modules are more important for CVT when applied to SRL. We apply two types of auxiliary modules, both of which take care not to "see" the target predicate directly. The "forward/backward" module does not see the right/left context of the current word, while the "future/past" module does not see the right/left context or the current token itself. Both kinds of modules improve performance (over a supervised model without CVT; see the second row in Table 5); future and past modules are slightly better, corroborating previous CVT results on sequence tagging. Overall, the results in Table 6 suggest that more restricted views of the input are beneficial.

We further explore how the strategy of selecting the target predicate (in sentences containing multiple candidates) influences performance. For each unlabeled sentence, we adopt the strategy of randomly selecting a predicate amongst those words identified as predicate candidates by the sentence learner. We could also select the predicate with the highest predicted score, or a random word from the sentence. The experiments in Table 7 confirm that the adopted strategy works best, delivering a 0.8% improvement in F1 over a supervised model without CVT (second row in Table 5). Selecting the most confident predicate is the worst possible strategy, decreasing F1 by 0.6% relative to a supervised model without CVT (see Table 5); the model concentrates on a few predicates with very high scores (these tend to be common verbs such as say, is, and have), while ignoring nominal predicates and less frequent verbs. The strategy of randomly selecting a word from the sentence performs better, precisely because it pays attention to a wider range of predicates.

Related Work
Our model resonates with the recent trend of developing neural network models for semantic role labeling using relatively simple architectures based on bidirectional LSTMs (Strubell et al., 2018). It also agrees with previous work in adopting multi-task learning as a means of improving a main task by jointly learning one or more related auxiliary tasks (Collobert et al., 2011; Søgaard and Goldberg, 2016; Swayamdipta et al., 2017; Peng et al., 2017; Strubell et al., 2018).
The idea of resorting to semi-supervised learning as a means of reducing the annotation effort involved in creating labeled data for SRL is by no means new. Fürstenau and Lapata (2012) propose to augment a labeled dataset with unlabeled examples whose roles are inferred automatically via annotation projection. Other work uses a language model to learn word similarities from unlabeled texts (Croce et al., 2010;Deschacht and Moens, 2009) or constructs an informed prior from labeled data in order to learn a generative model from unlabeled data (Titov and Klementiev, 2012).
More recently, Mehta et al. (2018) have proposed a semi-supervised method for constituency-based SRL. Their work builds upon a state-of-the-art neural model (He et al., 2018b; Peters et al., 2018) whose training objective they augment with a syntactic inconsistency loss component. Their hypothesis is that by leveraging syntactic structure during training, the SRL model may become more robust in low resource scenarios. This method is very much geared towards improving constituent-based SRL, where syntactic constraints are widely used during decoding (He et al., 2017, 2018a), and requires a robust syntactic parser to analyze (out-of-domain) unlabeled sentences for consistency. Our model does not rely on external tools, and is generally applicable across semantic role representations based on dependencies or constituents (i.e., phrases or spans). However, we leave experiments on the latter for future work.
Although the focus of this work has been on semi-supervised learning, we have developed a competitive SRL system which could be used on its own, after being trained on labeled data. Following previous work (Strubell et al., 2018), we proposed an end-to-end model, which is able to distinguish predicates and label their arguments while learning a POS tagger and a dependency parser. Importantly, our model can be simultaneously used for supervised and semi-supervised learning without modification. Some previous end-to-end models do not use any syntactic information, which is critical when dealing with unlabeled data. Strubell et al. (2018) directly predict POS tags and predicates on top of the lower layers of their model; while this information is fed to the final SRL classifier, it is not propagated through the network, and is not shared with their multi-head self-attention layers.

Conclusions
In this paper we developed an end-to-end SRL model and demonstrated that it can effectively leverage unlabeled data under the cross-view training modeling paradigm. Experiments on the CoNLL-2009 benchmark datasets show that our model delivers state-of-the-art performance in English, Chinese, German, and Spanish. Directions for future work are many and varied. We would like to apply the proposed model in low-resource settings, e.g., to transfer roles from English to another language via annotation projection, or to learn an SRL model from weak supervision where only annotations for dependency labels are available.