What can we learn from Semantic Tagging?

We investigate the effects of multi-task learning using the recently introduced task of semantic tagging. We employ semantic tagging as an auxiliary task for three different NLP tasks: part-of-speech tagging, Universal Dependency parsing, and Natural Language Inference. We compare full neural network sharing, partial neural network sharing, and what we term the learning what to share setting where negative transfer between tasks is less likely. Our findings show considerable improvements for all tasks, particularly in the learning what to share setting which shows consistent gains across all tasks.


Introduction
Multi-task learning (MTL) is a recently resurgent approach to machine learning in which multiple tasks are simultaneously learned. By optimising the multiple loss functions of related tasks at once, multi-task learning models can achieve superior results compared to models trained on a single task. The key principle is summarized by Caruana (1998) as "MTL improves generalization by leveraging the domain-specific information contained in the training signals of related tasks". Neural MTL has become an increasingly successful approach by exploiting similarities between Natural Language Processing (NLP) tasks (Collobert and Weston, 2008; Søgaard and Goldberg, 2016; Plank et al., 2016). Our work builds upon Bjerva et al. (2016), who demonstrate that employing semantic tagging as an auxiliary task for Universal Dependency (McDonald et al., 2013) part-of-speech tagging can lead to improved performance.
The objective of this paper is to investigate whether learning to predict lexical semantic categories can be beneficial to other NLP tasks. To achieve this we augment single-task models (ST) 1 with an additional classifier to predict semantic tags and jointly optimize for both the original task and the auxiliary semantic tagging task. Our hypothesis is that learning to predict semantic tags as an auxiliary task can improve performance of single-task systems. We believe that this is, among other factors, due to the following:
• Providing the main task's model with a useful inductive bias, encouraging it to prefer representations that lead to semantically plausible hypotheses over those that are not.
• Putting the focus of the main task model's attention on features that actually matter by providing additional evidence for the relevance or irrelevance of those features.
• Reducing the risk of overfitting by minimizing the model's Rademacher Complexity 2. Representations which are learned for multiple tasks have been shown to generalize better (Baxter et al., 2000).
1 We replicate models which perform at or close to the state-of-the-art. Our choice of models is based on replicability.
2 The ability to fit random noise.
We test our hypothesis on three disparate NLP tasks: (i) Universal Dependency part-of-speech tagging (UPOS), (ii) Universal Dependency parsing (UD DEP), a complex syntactic task; and (iii) Natural Language Inference (NLI), a complex task requiring deep natural language understanding.

Background and Related work

Semantic Tagging
Semantic tagging (Bjerva et al., 2016) is the task of assigning language-neutral semantic categories to words. It is designed to overcome the lack of semantic information in syntax-oriented part-of-speech tagsets, such as the Penn Treebank tagset (Marcus et al., 1993). Such tagsets exclude important semantic distinctions, such as negation and modals, types of quantification, named entity types, and the contribution of verbs to tense, aspect, or event.

Figure 1: Our three multi-task learning settings: (A) fully shared networks, (B) partially shared networks, and (C) Learning What to Share. Layers are mathematically denoted by vectors and the connections between them, represented by arrows, are mathematically denoted by matrices of weights. S indicates a shared layer, P a private layer, and X a layer with shared and private subspaces.

The semantic tagset is language-neutral, abstracts over part-of-speech and named-entity classes, and includes fine-grained semantic information. The tagset consists of 80 semantic tags grouped in 13 coarse-grained classes. The tagset originated in the Parallel Meaning Bank (PMB) project, where it contributes to compositional semantics and cross-lingual projection of semantic representations. Recent work has highlighted the utility of the tagset as a conduit for evaluating the semantics captured by vector representations (Belinkov et al., 2018), or employed it in an auxiliary tagging task (Bjerva et al., 2016), as we do in this work.

Learning What to Share
Recently, there has been an increasing interest in the development of models which are trained to learn what to (and what not to) share between a set of tasks, with the general aim of preventing negative transfer when the tasks are not closely related (Meyerson and Miikkulainen, 2017; Ruder et al., 2017; Lu et al., 2017; Misra et al., 2016). Our Learning What to Share setting is based on this idea and closely related to Liu et al. (2016)'s shared layer architecture.
Specifically, a layer $h_X$ which is shared between the main task and the auxiliary task is split into two subspaces: a shared subspace $h_{X_S}$ and a private subspace $h_{X_P}$. The interaction between the shared subspaces is modulated via a sigmoidal gating unit applied to a set of learned weights, as seen in Equations (1) and (2), where $h_{X_S(main)}$ and $h_{X_S(aux)}$ are the main and auxiliary tasks' shared layers, $W_{a \to m}$ and $W_{m \to a}$ are learned weights, and $\sigma$ is a sigmoidal function. Unlike Liu et al. (2016)'s Shared-Layer Architecture, in our setup each task has its own shared subspace rather than one common shared layer. This enables the sharing of different parameters in each direction (i.e., from main to auxiliary task and from auxiliary to main task), allowing each task to choose what to learn from the other, rather than having "one shared layer to capture the shared information for all the tasks" as in Liu et al. (2016).
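Equations (1) and (2) are not reproduced in this text. One reading consistent with the description above is sketched here. The additive update and the placement of the learned weights inside the sigmoid are assumptions, and this is not necessarily the authors' exact formulation.

```latex
\begin{align*}
h_{X_S(\mathrm{main})} &\leftarrow h_{X_S(\mathrm{main})} + \sigma\left(W_{a \to m}\, h_{X_S(\mathrm{aux})}\right) \\
h_{X_S(\mathrm{aux})} &\leftarrow h_{X_S(\mathrm{aux})} + \sigma\left(W_{m \to a}\, h_{X_S(\mathrm{main})}\right)
\end{align*}
```

Under this reading each direction of transfer has its own weight matrix, which matches the statement that different parameters are shared in each direction.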

Multi-Task Learning Settings
We implement three neural MTL settings, shown in Figure 1. They differ in the way the network's parameters are shared between the tasks:
• Fully shared network (FSN): All hidden layers are entirely shared among the tasks; each task has a separate output layer. The transformation of our input vector $x$ into a shared hidden layer $h_S$ is described by Equation (3).
• Partially shared network (PSN): A subset of hidden layers is shared among the tasks; each task has at least one private hidden layer and a separate output layer. The transformation of a shared hidden layer $h_S$ into private hidden layers, denoted by $h_{P(main)}$ and $h_{P(aux)}$, is described in Equations (4) and (5).
• Learning What to Share (LWS): Each task has a dedicated set of hidden layers. For sharing, a hidden layer is split into a shared subspace and a private subspace. A gating unit modulates the transfer of information between the shared subspaces as shown in Equations (1) and (2).
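The three settings can be contrasted in a minimal NumPy sketch. It is only an illustration under stated assumptions: plain tanh layers stand in for the bi-LSTM layers used in the experiments, the layer sizes and all function and parameter names are hypothetical, and the additive sigmoid gate in `lws` is one assumed form of the gating unit rather than the authors' exact equations.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_H = 8, 6  # hypothetical input and hidden sizes


def init(in_dim, out_dim):
    """Random (weights, bias) for one dense layer; illustrative only."""
    return rng.normal(scale=0.1, size=(in_dim, out_dim)), np.zeros(out_dim)


def layer(x, params):
    """One tanh layer standing in for a bi-LSTM layer."""
    weights, bias = params
    return np.tanh(x @ weights + bias)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fsn(x, shared):
    """Fully shared: both tasks read the same hidden layer h_S."""
    h_s = layer(x, shared)
    return h_s, h_s


def psn(x, shared, private_main, private_aux):
    """Partially shared: a shared layer feeds one private layer per task."""
    h_s = layer(x, shared)
    return layer(h_s, private_main), layer(h_s, private_aux)


def lws(x, p):
    """Learning What to Share: private and shared subspaces per task."""
    h_p_main, h_s_main = layer(x, p["private_main"]), layer(x, p["shared_main"])
    h_p_aux, h_s_aux = layer(x, p["private_aux"]), layer(x, p["shared_aux"])
    # Assumed gate: each shared subspace receives a gated signal from the other.
    gated_main = h_s_main + sigmoid(h_s_aux @ p["w_a2m"])
    gated_aux = h_s_aux + sigmoid(h_s_main @ p["w_m2a"])
    return (np.concatenate([h_p_main, gated_main], axis=-1),
            np.concatenate([h_p_aux, gated_aux], axis=-1))


x = rng.normal(size=(3, D_IN))  # three token vectors
fsn_main, fsn_aux = fsn(x, init(D_IN, D_H))
psn_main, psn_aux = psn(x, init(D_IN, D_H), init(D_H, D_H), init(D_H, D_H))
lws_params = {name: init(D_IN, D_H)
              for name in ("private_main", "shared_main", "private_aux", "shared_aux")}
lws_params["w_a2m"] = rng.normal(scale=0.1, size=(D_H, D_H))
lws_params["w_m2a"] = rng.normal(scale=0.1, size=(D_H, D_H))
lws_main, lws_aux = lws(x, lws_params)
```

In each case the two returned representations would be passed to the task-specific output layers. Only in the LWS sketch does each task keep parameters that the other task never updates directly while still exchanging information through the gate.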

Data
In the UPOS tagging experiments, we utilize the UD 2.0 English corpus (Nivre et al., 2017) for the POS tagging and the semantically tagged PMB release 0.1.0 (sem-PMB) 3 for the MTL settings. Note that there is no overlap between the two datasets. Conversely, for the UD DEP and NLI experiments there is a complete overlap between the datasets of main and auxiliary tasks, i.e., each instance is labeled with both the main task's labels and semantic tags. We use the Stanford POS Tagger (Toutanova et al., 2003) trained on sem-PMB to tag the UD corpus and NLI datasets with semantic tags, and then use those assigned tags for the MTL settings of our dependency parsing and NLI models. We find that this approach leads to better results when the main task is only loosely related to the auxiliary task. The UD DEP experiments use the English UD 2.0 corpus, and the NLI experiments use the SNLI (Bowman et al., 2015) and SICK-E 4 datasets (Marelli et al., 2014). The provided train, development, and test splits are used for all datasets. For sem-PMB, the silver and gold parts are used for training and testing respectively.

Experiments
We run four experiments for each of the four tasks (UPOS, UD DEP, SNLI, SICK-E), one using the ST model and one for each of the three MTL settings. Each experiment is run five times, and the average of the five runs is reported. Due to a lack of space, we briefly describe the ST models and refer the reader to the original work for further details. For reproducibility, detailed diagrams of the MTL models for each task and their hyperparameters can be found in Appendix A.
3 http://pmb.let.rug.nl/data.php
4 SICK-E refers to the entailment part of the SICK dataset.

Universal Dependency POS Tagging
Our tagging model uses a basic contextual one-layer bi-LSTM (Hochreiter and Schmidhuber, 1997) that takes in word embeddings and produces a sequence of recurrent states which can be viewed as contextualized representations. The recurrent state $r_n$ from the bi-LSTM corresponding to each time-step $t_n$ is passed through a dense layer with a softmax activation to predict the token's tag.
In each of the MTL settings a softmax classifier is added to predict a token's semantic tag and the model is then jointly trained on the concatenation of the sem-PMB and UPOS tagging data to minimize the sum of softmax cross-entropy losses of both the main (UPOS tagging) and auxiliary (semantic tagging) tasks.
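The joint objective described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names and the `aux_weight` parameter are hypothetical, and passing `None` for one label set is an assumed way of handling batches drawn from a corpus that carries only one of the two annotations, as is the case for the non-overlapping UPOS and sem-PMB data.

```python
import numpy as np


def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy for logits of shape (n, classes) and integer labels (n,)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


def joint_loss(main_logits, main_labels, aux_logits, aux_labels, aux_weight=1.0):
    """Main-task loss plus weighted auxiliary loss.

    A label array of None means the batch has no annotation for that task,
    so that term is skipped.
    """
    loss = 0.0
    if main_labels is not None:
        loss += softmax_cross_entropy(main_logits, main_labels)
    if aux_labels is not None:
        loss += aux_weight * softmax_cross_entropy(aux_logits, aux_labels)
    return loss


# Toy batch: 2 tokens, 17 UPOS classes, 80 semantic tag classes, all-zero logits.
upos_logits, sem_logits = np.zeros((2, 17)), np.zeros((2, 80))
upos_labels, sem_labels = np.array([3, 5]), np.array([10, 42])
both = joint_loss(upos_logits, upos_labels, sem_logits, sem_labels, aux_weight=0.1)
main_only = joint_loss(upos_logits, upos_labels, sem_logits, None)
```

With uniform predictions each cross-entropy term equals the log of the number of classes, which makes the effect of the auxiliary weight easy to check by hand. Setting `aux_weight` to zero removes the auxiliary signal while leaving the architecture unchanged.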

Universal Dependency Parsing
We employ a parsing model that is based on Dozat and Manning (2016). The model's embeddings layer is a concatenation of randomly initialized word embeddings and character-based word representations added to pre-trained word embeddings, which are passed through a 4-layer stacked bi-LSTM. Unlike Dozat and Manning (2016), our model jointly learns to perform UPOS tagging and parsing, instead of treating them as separate tasks. Therefore, instead of tag embeddings, we add a softmax classifier to predict UPOS tags after the first bi-LSTM layer. The outputs from that layer and the UPOS softmax prediction vectors are both concatenated to the original embedding layer and passed to the second bi-LSTM layer. The output of the last bi-LSTM is then used as input for four dense layers with a ReLU activation, producing four vector representations: a word as a dependent seeking its head; a word as a head seeking all its dependents; a word as a dependent deciding on its label; a word as a head deciding on the labels of its dependents. These representations are then passed to biaffine and affine softmax classifiers to produce a fully-connected labeled probabilistic dependency graph (Dozat and Manning, 2016). Finally, a non-projective maximum spanning tree parsing algorithm (Chu, 1965; Edmonds, 1967) is used to obtain a well-formed dependency tree.
Similarly to UPOS tagging, an additional softmax classifier is used to predict a token's semantic tag in each of the MTL settings, as both tasks are jointly learned. In the FSN setting, the 4-layer stacked bi-LSTM is entirely shared. In the PSN setting, the semantic tags are predicted from the second layer's hidden states, and the final two layers are devoted to the parsing task. In the LWS setting, the first two layers of the bi-LSTM are split into a private bi-LSTM (bi-LSTM_private) and a shared bi-LSTM (bi-LSTM_shared) for each of the tasks, with the interaction between the shared subspaces being modulated via a gating unit. Then, two bi-LSTM layers that are devoted to parsing only are stacked on top.
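The biaffine arc classifier mentioned above can be illustrated with a small NumPy sketch in the spirit of Dozat and Manning (2016). The function names, dimensions, and random inputs are hypothetical, the sketch covers unlabeled arc scoring only, and it does not include the maximum spanning tree decoding step.

```python
import numpy as np


def biaffine_arc_scores(dep, head, U, u):
    """scores[i, j] is the score of token j being the head of token i.

    dep, head: (n, d) dependent-role and head-role representations
    U: (d, d) bilinear weights; u: (d,) head bias vector
    """
    # Bilinear term plus a per-head bias broadcast over all dependents.
    return dep @ U @ head.T + head @ u


def row_softmax(scores):
    """Turn each dependent's row of scores into a distribution over heads."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
n_tokens, dim = 5, 4  # hypothetical sentence length and representation size
dep = rng.normal(size=(n_tokens, dim))
head = rng.normal(size=(n_tokens, dim))
U = rng.normal(size=(dim, dim))
u = rng.normal(size=dim)

scores = biaffine_arc_scores(dep, head, U, u)
head_probs = row_softmax(scores)
```

Picking the highest-probability head for each token independently is not guaranteed to yield a tree, which is why a spanning tree algorithm is applied to the score matrix as a final step.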

Natural Language Inference
We base our NLI model on Chen et al. (2017)'s Enhanced Sequential Inference Model (ESIM), which uses a bi-LSTM to encode the premise and hypothesis, computes a soft alignment between the premise and hypothesis representations using an attention mechanism, and employs an inference composition bi-LSTM to compose local inference information sequentially. The MTL settings are implemented by adding a softmax classifier to predict semantic tags at the level of the encoding bi-LSTM, with the rest of the model unaltered. In the FSN setting, the hidden states of the encoding bi-LSTM are directly passed as input to the softmax classifier. In the PSN setting, an earlier bi-LSTM layer is used to predict the semantic tags and the output from that is passed on to the encoding bi-LSTM which is stacked on top. This follows Hashimoto et al. (2016)'s hierarchical approach. In the LWS setting, a bi-LSTM layer with private and shared subspaces is used for semantic tagging and for the ESIM model's encoding layer. In all MTL settings, the bi-LSTM used for semantic tagging is pre-trained on the sem-PMB data.
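The soft-alignment step of the model can be sketched in NumPy as dot-product attention between the encoded premise and hypothesis. This is a simplified illustration: the function names and the toy inputs are hypothetical, and the encoding and inference composition bi-LSTMs are left out.

```python
import numpy as np


def softmax(x, axis):
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def soft_align(premise, hypothesis):
    """Align two encoded sentences with dot-product attention.

    premise: (m, d) encoded premise tokens
    hypothesis: (n, d) encoded hypothesis tokens
    Returns a hypothesis summary for each premise token, shape (m, d),
    and a premise summary for each hypothesis token, shape (n, d).
    """
    energies = premise @ hypothesis.T  # (m, n) similarity matrix
    premise_aligned = softmax(energies, axis=1) @ hypothesis
    hypothesis_aligned = softmax(energies, axis=0).T @ premise
    return premise_aligned, hypothesis_aligned


rng = np.random.default_rng(0)
premise = rng.normal(size=(4, 3))     # 4 premise tokens, dimension 3
hypothesis = np.full((1, 3), 2.0)     # a single hypothesis token
premise_aligned, hypothesis_aligned = soft_align(premise, hypothesis)
```

With a single hypothesis token, every premise token can only attend to that token, so each row of `premise_aligned` equals it. In the MTL settings the auxiliary semantic tag classifier would read the encoder states that enter this step.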

Results and Discussion
Results for all tasks are shown in Table 1. In line with Bjerva et al. (2016)'s findings, the FSN setting leads to an improvement for UPOS tagging. POS tagging, a sequence labeling task, can be seen as the most closely related to semantic tagging; therefore negative transfer is minimal and the full sharing of parameters is beneficial. Surprisingly, the FSN setting also leads to improvements for UD DEP. Indeed, for UD DEP, all of the MTL models outperform the ST model by increasing margins. For the NLI tasks, however, there is a clear degradation in performance.
The PSN setting shows mixed results and does not show a clear advantage over FSN for UPOS and UD DEP. This suggests that adding task-specific layers after fully-shared ones does not always enable sufficient task specialization. For the NLI tasks, however, PSN is clearly preferable to FSN, especially for the small-sized SICK-E dataset where the FSN model fails to adequately learn.

Table 1: Results for the single-task models (ST) and the three MTL settings: fully shared networks (FSN), partially shared networks (PSN), and learning what to share (LWS). All scores are reported as accuracy, except UD DEP for which we report LAS/UAS F1 score.

As a sentence-level task, NLI is functionally dissimilar to semantic tagging. However, it is a task which requires deep understanding of natural language semantics and can therefore conceivably benefit from the signal provided by semantic tagging. Our results demonstrate that it is possible to leverage this signal given a selective sharing setup where negative transfer can be minimized. Indeed, for the NLI tasks, only the LWS setting leads to improvements over the ST models. The improvement is larger for the SICK-E task, which has a much smaller training set and therefore stands to learn more from the semantic tagging signal. For all tasks, it can be observed that the LWS models outperform the rest of the models. This is in line with our expectations and with the findings from previous work (Ruder et al., 2017; Liu et al.,

Analysis
In addition to evaluating performance directly, we attempt to qualify how semantic tags (semtags) affect performance with respect to each of the SNLI MTL settings.

Qualitative analyses
The fact that NLI is a sentence-level task, while semantic tags are word-level annotations, presents a difficulty in measuring the effect of semantic tags on the systems' performance, as there is no one-to-one correspondence between a correct label and a particular semantic tag. We therefore employ the following method in order to assess the contribution of semantic tags. Given the performance ranking of all our systems (FSN < ST < PSN < LWS), we make a pairwise comparison between the output of a superior system S_sup and an inferior system S_inf. This involves taking the pairs of sentences that every S_sup classifies correctly, but some S_inf does not. FSN is the worst performing system and, as such, has no 'worse' system for comparison.
To gain insight as to where a given system S_sup performs better than a given S_inf, we then sort each comparison sentence set by the frequency of semtags predicted therein, which are normalized by dividing by their frequency in the full SNLI test set. We notice interesting patterns, visible in Figure 2. Specifically, PSN appears markedly better at sentences with named entities (ART, PER, GEO, ORG) and temporal entities (DOM) than both ST and FSN. Marginal improvements are also observed for sentences with negation and reflexive pronouns. The LWS setting continues this pattern, with additional improvements observable for sentences with the HAP tag for names of events, SST for subsective attributes, and the ROL tag for role nouns.
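The comparison procedure above can be sketched in a few lines of Python. The data structures, function names, and toy values are hypothetical and serve only to illustrate the idea: collect the pairs that a superior system gets right and an inferior system gets wrong, then rank semantic tags by their frequency in that subset divided by their frequency in the full test set.

```python
from collections import Counter


def improved_pairs(sup_correct, inf_correct):
    """Ids of pairs classified correctly by the superior system only."""
    return sorted(set(sup_correct) - set(inf_correct))


def normalised_tag_frequencies(pair_ids, tags_by_pair):
    """Rank tags by (count in the selected pairs) / (count in the full set)."""
    full = Counter(tag for tags in tags_by_pair.values() for tag in tags)
    subset = Counter(tag for pair_id in pair_ids for tag in tags_by_pair[pair_id])
    ratios = {tag: subset[tag] / full[tag] for tag in subset}
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


# Toy test set: pair id -> semantic tags predicted for its tokens.
tags_by_pair = {
    0: ["DEF", "CON", "GEO"],
    1: ["DEF", "CON"],
    2: ["PER", "CON"],
    3: ["PER", "GEO"],
}
sup_correct = {0, 2, 3}  # pairs the superior system classifies correctly
inf_correct = {0, 1}     # pairs the inferior system classifies correctly

selected = improved_pairs(sup_correct, inf_correct)
ranking = normalised_tag_frequencies(selected, tags_by_pair)
```

In this toy example, pairs 2 and 3 are the ones gained by the superior system, and PER ranks first because every occurrence of it falls inside those pairs.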

Contribution of semantic tagging
To assess the contribution of the semantic tagging auxiliary task independent of model architecture and complexity, we run three additional SNLI experiments, one for each MTL setting, where the model architectures are unchanged but the auxiliary tasks are assigned no weight (i.e., do not affect the learning). The results confirm our previous findings that, for NLI, the semantic tagging auxiliary task only improves performance in a selective sharing setting, and hurts it otherwise: (i) the FSN system, which had performed below ST, improves to equal it, and (ii) the PSN and LWS settings both see a drop to ST-level performance.

Conclusions
We present a comprehensive evaluation of MTL using a recently proposed task of semantic tagging as an auxiliary task. Our experiments span three types of NLP tasks and three MTL settings. The results of the experiments show that employing semantic tagging as an auxiliary task leads to improvements in performance for UPOS tagging and UD DEP in all MTL settings. For the SNLI tasks, requiring understanding of phrasal semantics, the selective sharing setup we term Learning What to Share holds a clear advantage. Our work offers a generalizable framework for the evaluation of the utility of an auxiliary task.

A MTL setting Diagrams, Preprocessing, and Hyperparameters
UPOS Tagging. Figure 3a shows the three MTL models used for UPOS. All hyperparameters were tuned with respect to loss on the English UD 2.0 UPOS validation set. We trained for 20 epochs with a batch size of 128 and optimized using Adam (Kingma and Ba, 2014) with a learning rate of 0.0001. We weight the auxiliary semantic tagging loss with λ = 0.1. The pre-trained word embeddings we used are GloVe embeddings (Pennington et al., 2014) of dimension 100 trained on 6 billion tokens of Wikipedia 2014 and Gigaword 5. We applied dropout and recurrent dropout with a probability of 0.3 to all bi-LSTMs.
Figure 3b shows the three MTL models for UD DEP. We use the gold tokenization. All hyperparameters were tuned with respect to loss on the English UD 2.0 UD validation set. We trained for 15 epochs with a batch size of 50 and optimized using Adam with a learning rate of 2e-3. We weight the auxiliary semantic tagging loss with λ = 0.5. The pre-trained word embeddings we use are GloVe embeddings of dimension 100 trained on 6 billion tokens of Wikipedia 2014 and Gigaword 5. We applied dropout with a probability of 0.33 to all bi-LSTM layers, embedding layers, and non-output dense layers.
Figure 3c shows the three MTL models for NLI. All hyperparameters were tuned with respect to loss on the SNLI and SICK-E validation datasets (separately). For the SNLI experiments, we trained for 37 epochs with a batch size of 128. For the SICK-E experiments, we trained for 20 epochs with a batch size of 8. Note that the ESIM model was designed for the SNLI dataset; therefore performance is non-optimal for SICK-E. For both sets of experiments: we optimized using Adam with a learning rate of 0.00005; we weight the auxiliary semantic tagging loss with λ = 0.1; the pre-trained word embeddings we use are GloVe embeddings of dimension 300 trained on 840 billion tokens of Common Crawl; and we applied dropout and recurrent dropout with a probability of 0.3 to all bi-LSTM and non-output dense layers.

Table 4: Per-label precision (left) and recall (right) for all models.
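For convenience, the training hyperparameters reported in this appendix can be gathered into a small configuration object. The class and field names below are hypothetical, and the values are transcribed from the text above; settings the text does not state are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    epochs: int
    batch_size: int
    learning_rate: float    # Adam learning rate
    aux_loss_weight: float  # lambda, weight of the semantic tagging loss
    embedding_dim: int      # dimension of the pre-trained GloVe vectors
    dropout: float


CONFIGS = {
    "UPOS": TrainConfig(epochs=20, batch_size=128, learning_rate=1e-4,
                        aux_loss_weight=0.1, embedding_dim=100, dropout=0.3),
    "UD_DEP": TrainConfig(epochs=15, batch_size=50, learning_rate=2e-3,
                          aux_loss_weight=0.5, embedding_dim=100, dropout=0.33),
    "SNLI": TrainConfig(epochs=37, batch_size=128, learning_rate=5e-5,
                        aux_loss_weight=0.1, embedding_dim=300, dropout=0.3),
    "SICK_E": TrainConfig(epochs=20, batch_size=8, learning_rate=5e-5,
                          aux_loss_weight=0.1, embedding_dim=300, dropout=0.3),
}
```

Keeping the per-task values side by side makes the differences easy to see, for example the larger auxiliary loss weight used for UD DEP.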

B SNLI model output analysis
Premise-hypothesis pairs, with the ST prediction and the LWS prediction (which matches the gold label):

ST: N; LWS/GOLD: E
P: The/DEF gentleman/CON is/NOW speaking/EXS while/SUB the/DEF others/ALT are/NOW listening/EXS
H: The/DEF man/CON is/NOW being/EXS given/EXS respect/CON

ST: C; LWS/GOLD: E
P: Men/CON wearing/EXG hats/CON walk/EXS on/REL the/DEF street/CON
H: The/DEF men/CON having/EXS hats/CON on/REL their/HAS head/CON

ST: N; LWS/GOLD: C
P: Three/QUC men/CON in/REL orange/IST suits/CON are/NOW doing/EXG street/CON repairs/CON at/REL night/CON
H: Three/QUC men/CON in/REL orange/IST suits/CON escaped/EPS from/REL prison/CON

ST: E; LWS/GOLD: C
P: A/DIS toddler/CON sits/ENS on/REL a/DIS stone/CON wall/CON surrounded/EXS by/REL fallen/EXS leaves/CON
H: An/DIS child/CON is/NOW throwing/EXG stones/CON at/REL a/DIS leaf/CON wall/CON

ST: C; LWS/GOLD: N
P: An/DIS old/IST shoemaker/CON in/REL his/HAS factory/CON
H: The/DEF shoemaker/CON is/NOW wealthy/IST

ST: E; LWS/GOLD: N
P: A/DIS kid/CON slides/CON down/IST a/DIS yellow/COL slide/CON into/REL a/DIS swimming/CON pool/CON
H: The/DEF kid/CON is/NOW playing/EXS at/REL the/DEF waterpark/CON

Table 2: Examples of the entailment problems from SNLI which are incorrectly classified by the ST model but correctly classified by the LWS model. Automatically assigned semantic tags follow each word after a slash.