Multi-Label Transfer Learning for Multi-Relational Semantic Similarity

Multi-relational semantic similarity datasets define the semantic relations between two short texts in multiple ways, e.g., similarity, relatedness, and so on. Yet, all the systems to date designed to capture such relations target one relation at a time. We propose a multi-label transfer learning approach based on LSTM to make predictions for several relations simultaneously and aggregate the losses to update the parameters. This multi-label regression approach jointly learns the information provided by the multiple relations, rather than treating them as separate tasks. Not only does this approach outperform the single-task approach and the traditional multi-task learning approach, but it also achieves state-of-the-art performance on all but one relation of the Human Activity Phrase dataset.


Introduction
Semantic similarity, or relating short texts or sentences 1 in a semantic space -be those phrases, sentences or short paragraphs -is a task that requires systems to determine the degree of equivalence between the underlying semantics of the two sentences. Although relatively easy for humans, this task remains one of the most difficult natural language understanding problems. The task has been receiving significant interest from the research community. For instance, from 2012 to 2017, the International Workshop on Semantic Evaluation (SemEval) has been holding the Semantic Textual Similarity (STS) shared tasks (Agirre et al., 2012(Agirre et al., , 2013b(Agirre et al., , 2015(Agirre et al., , 2016Cer et al., 2017), dedicated to tackling this problem, with close to 100 team submissions each year.
In some semantic similarity datasets, an example consists of a sentence pair and a single annotated similarity score, while in others, each pair 1 In this work, we do not consider word level similarity. comes with multiple annotations. We refer to the latter as multi-relational semantic similarity tasks. The inclusion of multiple annotations per example is motivated by the fact that there can be different relations, namely different types of similarity between two sentences. So far, these relations have been treated as separate tasks, where a model trains and tests on one relation at a time while ignoring the rest. However, we hypothesize that each relation may contain useful information about the others, and training on only one relation inevitably neglects some relevant information. Thus, training jointly on multiple relations may improve performance on one or more relations.
We propose a joint multi-label transfer learning setting based on LSTM, and show that it can be an effective solution for the multi-relational semantic similarity tasks. Due to the small size of multirelational semantic similarity datasets and the recent success of LSTM-based sentence representations (Wieting and Gimpel, 2018;Conneau et al., 2017), the model is pre-trained on a large corpus and transfer learning is applied using fine-tuning. In our setting, the network is jointly trained on multiple relations by outputting multiple predictions (one for each relation) and aggregating the losses during back-propagation. This is different from the traditional multi-task learning setting where the model makes one prediction at a time, switching between the tasks. We treat the multi-task setting and the single-task setting (i.e., where a separate model is learned for each relation) as baselines, and show that the multi-label setting outperforms them in many cases, achieving state-of-the-art performance on all but one relation of the Human Activity Phrase dataset (Wilson and Mihalcea, 2017).
In addition to success on multi-relational semantic similarity tasks, the multi-label transfer learning setting that we propose can easily be paired with other neural network architectures and applied to any dataset with multiple annotations available for each training instance.

Multi-Label Transfer Learning
We introduce a multi-label transfer learning setting by modifying the architecture of the LSTMbased sentence encoder, specifically designed for multi-relational semantic similarity tasks.

Architecture
We employ the "hard-parameter sharing" setting (Caruana, 1998), where some hidden layers are shared across multiple tasks while each task has its own specific output layer. As shown in Figure 1, using an example of a semantic similarity dataset with two relations, sentence L and sentence R in a pair are first mapped to word vector sequences and then encoded as sentence embeddings. Up to this step, the choice of the word embedding matrix and sentence encoder is flexible, and we outline our choice in the sections to follow. For each relation that has been annotated with a ground-truth label, a dedicated output dense layer takes the two sentence embeddings as input and outputs a probability distribution across the range of possible scores. The output dense layers follow the methods of Tai et al. (2015).
With two such dense output layers, two losses are calculated, one for each relation. The total loss is calculated as the sum of the two losses for backpropagation which updates all parameters in the end-to-end network.

Model
We use InferSent (Conneau et al., 2017) as the sentence encoder due to its outstanding performances reported on various semantic similarity tasks.
Due to the small sizes of the evaluation datasets, we use the sentence encoder pre-trained on the Stanford Natural Language Inference corpus (Bowman et al., 2015) and Multi-Genre Natural Language Inference corpus (Williams et al., 2018), and transfer to the semantic similarity tasks using fine-tuning. In this process, the output layers for multi-label learning discussed above are stacked on top of the InferSent network, forming an end-to-end model for training and testing on semantic similarity tasks.

Comparison with Multi-Task Learning
Neither multi-task nor multi-label learning have been used for multi-relational semantic similarity datasets. For these datasets, either multi-task or multi-label learning can be achieved by treating each relation as a "task." The key differences between the two are the relations involved in each forward-backward pass and the timing of the parameter updates.
Consider a training step in the two-relation example in Figure 1: A multi-task learning model would pick a batch of sentences pairs, only consider Label L, only calculate Loss L, and all parameters except those of dense layer d R are updated. Then, within the same batch, 2 the model would only consider Label R, only calculate Loss R, and all parameters except those of dense layer d L are updated.
A multi-label learning model (our model) would pick a batch of sentences pairs, consider both Label L and Label R, calculate Loss L and Loss R, aggregate them as the total loss, and update all parameters.

Experiments
To show the effectiveness of the multi-label transfer learning setting, we experiment on three semantic similarity datasets with multiple relations annotated, and use one LSTM-based sentence encoder that has been very successful in many downstream tasks.

Datasets
We study three semantic similarity datasets with multiple relations with texts of different lengths, spanning phrases, sentences, and short paragraphs.
Human Activity Phrase (Wilson and Mihalcea, 2017): a collection of pairs of phrases regarding human activities, annotated with the following four different relations.
• Similarity (SIM): The degree to which the two activity phrases describe the same thing, semantic similarity in a strict sense. Example of high similarity phrases: to watch a film and to see a movie.
• Relatedness (REL): The degree to which the activities are related to one another, a general semantic association between two phrases. Example of strongly related phrases: to give a gift and to receive a present.
• Motivational alignment (MA): The degree to which the activities are (typically) done with similar motivations. Example of phrases with potentially similar motivations: to eat dinner with family members and to visit relatives.
• Perceived actor congruence (PAC): The degree to which the activities are expected to be done by the same type of person. An example of a pair with a high PAC score: to pack a suitcase and to travel to another state.
The phrases are generated, paired and scored on Amazon Mechanical Turk. 3 The annotated input. 3 https://www.mturk.com/ scores range from 0 to 4 for SIM, REL and MA, and −2 to 2 for PAC. The evaluation is based on the Spearman's ρ correlation coefficient between the systems' predicted scores and the human annotations. There are 1,000 pairs in the dataset. We also use the supplemental 1,373 pairs from Zhang et al. (2018) in which 1,000 pairs are randomly selected for training and the rest are used for development. We then treat the original 1,000 pairs as a held-out test set so that our results are directly comparable with those previously reported.
SICK (Marelli et al., 2014b,a): the Sentences Involving Compositional Knowledge benchmark, which includes a large number of sentence pairs that are rich in the lexical, syntactic and semantic phenomena. Each pair of sentences is annotated in two dimensions: relatedness and entailment. The relatedness score ranges from 1 to 5, and Pearson's r is used for evaluation; the entailment relation is categorical, consisting of entailment, contradiction, and neutral. There are 4439 pairs in the train split, 495 in the trial split used for development and 4906 in the test split. The sentence pairs are generated from image and video caption datasets before being paired up using some algorithm. Due to the lack of human supervision in the process, some sentence pairs display minimal difference in semantic components, making the SICK tasks simpler than the others we study.
Typed-Similarity (Agirre et al., 2013b): a collection of meta-data describing books, paintings, films, museum objects and archival records taken from Europeana, 4 presented as the pilot track in the SemEval 2013 STS shared task. Typically, the items consist of title, subject, description, and so on, describing a cultural heritage item and, sometimes, a thumbnail of the item itself. For the purpose of measuring semantic similarity, we concatenate all the textual entries such as title, creator, subject and description into a short paragraph that is used as input, although the annotations might be informed of the image aspects of the meta-data. Each pair of items is annotated on eight dimensions of similarity: general similarity, author, people involved, time, location, event or action involved, subject and description. There are 750 pairs in the train split, of which we randomly sample 500 for training and 250 for development, and 721 in the test split. Pearson's r is used for evaluation.

Baselines
We compare the multi-label setting with two baselines: • Single-task, where each relation is treated as an individual task. For each relation, a model with only one output dense layer is trained and tested, ignoring the annotations of all other relations.
• Multi-task, where only one relation is involved during each round of feed-forward and back-propagation.

Experimental Details
In each experiment, we use stochastic gradient descent and a batch size of 16. We tune the learning rate over {0.1, 0.5, 1, 5} and number of epochs over {10, 20}. For each dataset discussed above, we tune these hyperparameters on the development set. All other hyperparameters maintain their values from the original code. 5 In the single-task setting, the model is trained and tested on each relation, ignoring the annotations of other relations. In the multi-task settings, the model is trained and tested on all the relations in a dataset. In the multitask setting, relations are presented to the model in the order they are listed in the result tables within each batch.

Evaluation
The results are shown in Tables 1, 2 and 3. For every experiment (represented by a cell in the tables), 30 runs with different random seeds are recorded and the average is reported. For each relation (each column in the tables), let the true mean performance of multi-label learning, singletask baseline and multi-task baseline be µ MLL , µ single , µ MTL , respectively. Two one-sided Student's t-tests are conducted to test if multi-label learning outperforms the baselines for that relation. The significance level is chosen to be 0.05. A down-arrow ↓ indicates that our proposed multilabel learning underperforms a baseline, while an up-arrow ↑ indicates that our proposed multi-label learning outperforms a baseline. 5 https://github.com/facebookresearch/InferSent 5 Discussion

Results
For the Human Activity Phrase dataset, the singletask setting already achieves state-of-the-art performances on SIM, REL and PAC relations, surpassing the previous best results reported by Zhang et al. (2018), which achieved Spearman's correlation coefficient of .710 in SIM, .715 in REL, .690 in MA and .549 in PAC. This approach is based on fine-tuning a bi-directional LSTM with average-pooling pre-trained on translated texts (Wieting and Gimpel, 2018). Using multi-label learning, our model is able to gain a statistically significant improvement in the performance of REL compared to the single-task setting, while maintaining performance for the other relations. The traditional multi-task setting, however, performs significantly worse than the other settings.
For the entailment task on the SICK dataset, our multi-label setting outperforms the singletask baseline and the previous best results of In-ferSent. These best results consisted of an accuracy of 86.3% achieved using a logistic regression classifier and sentence embeddings generated by pre-trained InferSent as features (Conneau et al., 2017). In the relatedness task, this setting achieved a Pearson's correlation coefficient of .885, which even our our multi-label setting is unable to beat. However, the multi-label setting does have a statistically significant performance gain compared to the single-task setting in the relatedness task, while the traditional multi-task setting underperforms the other settings.
For the Typed-Similarity dataset, the previous best results are achieved using rich feature engineering without the use of sentence embeddings, with a different scoring scheme for each relation (Agirre et al., 2013a). While this method yielded better results than all of the transfer learning approaches we compare, it should be noted that this approach is specific to tackling this dataset, unlike the transfer learning settings that are generalizable to other scenarios. One potential reason for the discrepancy in performance is that some relations such as time, people involved, or events may be easily or sometimes trivially captured by information retrieval techniques such as named entity recognition. Using sentence embeddings and transfer learning for all the relations, though simpler, may face greater challenge in the rela-   tions mentioned above. Among the three transfer learning approaches, our multi-label setting is still superior, outperforming the single-task setting in over half of the relations, and outperforming the multi-task setting in all relations.

Empirical Recommendation
While our results above show that multi-label learning is almost always the most effective way to transfer sentence embeddings in multi-relational semantic similarity tasks, in some situations simply training with one relation might yield better performance (such as the general similarity relation in the Typed-Similarity dataset). This suggests that the choice of multi-label learning or single-task learning can be tuned as a hyperparameter empirically for the optimal performance on a task.

Other Considerations and Discussions
In the multi-label setting, we calculate the total loss by summing the loss from each dimension. We also explore weighting the loss from each di-mension by factors of 2, 5 and 10, but doing so hurts the performance for all dimensions. In the multi-task setting, we attempt different ordering of the dimensions when presenting them to the model within a batch of examples, but the difference in performance is not statistically significant. Furthermore, the multi-task setting takes about n times longer to train than the multi-label setting, where n is number of dimensions of annotations.

Conclusions
We introduced a multi-label transfer learning setting designed specifically for semantic similarity tasks with multiple relations annotations. By experimenting with a variety of relations in three datasets, we showed that the multi-label setting can outperform single-task and traditional multitask settings in many cases.
Future work includes exploring the performance of this setting with other sentence encoders, as well as multi-label datasets outside of the domain of semantic similarity. This may include NLP datasets annotated with author information for multiple dimensions, or computer vision datasets with multiple annotations for scenes.