DCU-SEManiacs at SemEval-2016 Task 1: Synthetic Paragram Embeddings for Semantic Textual Similarity

We experiment with learning word representations designed to be combined into sentence-level semantic representations, using an objective function which does not directly make use of the supervised scores provided with the training data, instead opting for a simpler objective which encourages similar phrases to be close together in the embedding space. This simple objective lets us start with high-quality embeddings trained using the Paraphrase Database (PPDB) (Wieting et al., 2015; Ganitkevitch et al., 2013), and then tune these embeddings using the official STS task training data, as well as synthetic paraphrases for each test dataset, obtained by pivoting through machine translation. Our submissions include runs which only compare the similarity of phrases in the embedding space, directly using the similarity score to produce predictions, as well as a run which uses vector similarity in addition to a suite of features we investigated for our 2015 SemEval submission. For the crosslingual task, we simply translate the Spanish sentences to English and use the same system we designed for the monolingual task.

The main ideas we investigate in our systems are:
1. Using a margin-based objective function to train high-quality sentence embeddings without using supervised scores
2. Creating new synthetic training data using machine translation to generate artificial paraphrases
3. Using ensemble models to combine features generated by our embedding networks with features obtained from other sources

Task Description
The SemEval Semantic Textual Similarity (STS) task provides participants with training data consisting of pairs of sentences annotated with gold-standard semantic similarity scores. The crowd-sourced similarity scores are given on a scale from 0 (no relation) to 5 (semantic equivalence). Thus, our aim is to use the training data to learn a model which predicts a score between 0 and 5 for unseen input pairs (Nakov et al., 2015). The monolingual STS task has been organized each year since 2012, and most approaches have viewed the learning task as a regression problem, where real-valued model output is clipped to the range 0 ≤ ŷ ≤ 5 (Agirre et al., 2013; Agirre et al., 2014; Nakov et al., 2015). For two of our three STS systems, we take a novel approach to this task, and directly use the similarity scores produced by the embedding networks as the predicted score. When training the embedding networks, we use the gold scores only to reduce the task-internal data to segments with high semantic similarity; embeddings are then learned using a simplified training objective which only makes use of training pairs which are "perfect" paraphrases (see section 2.1). Interestingly, these models perform very well without access to the gold-standard scores.
The 2016 edition of the STS task also introduced a pilot crosslingual STS task in addition to the monolingual STS task. The crosslingual task is similar to monolingual STS, except that either member of each sentence pair may be in Spanish (language identification is not provided with the data). In order to use our monolingual STS system in the crosslingual task, we first automatically identify sentences which are probably in Spanish, use machine translation to translate them into English, and then approach the crosslingual task as another monolingual task.
Although our systems performed well in both the crosslingual and monolingual STS tasks, we also discuss some possible shortcomings of our approach, and opportunities for improvement.
The rest of the paper is organized as follows: section 2 discusses the main novelties of our submissions, and presents the task-internal and external datasets we leverage for training our systems, section 3 gives a detailed discussion of each of our submitted systems, including information on hyperparameters and training configuration, section 4 gives a summary of experimental results, and section 5 discusses the advantages and disadvantages of our approach, and proposes avenues for future work.

Paragram Vectors and PPDB
Wieting et al. (2015) introduced Paragram-phrase embeddings, which use a novel training objective designed to learn robust sentence-level embeddings which are simple bag-of-words averages of the embeddings in each sequence. They discuss several possible means of encoding a sequence into a vector, using shallow and deep feedforward and recurrent networks. Surprisingly, the best-performing model is a single-layer word embedding matrix, where sentence vectors are constructed by taking the mean of the token embeddings (equation 1).
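The averaging composition of equation 1 can be sketched as follows; the tiny embedding table is an illustrative stand-in, not the actual Paragram vectors, and the skip-unknown behavior is a simplification (our actual handling of unknown tokens is described in the training configuration section):

```python
import numpy as np

def embed_sentence(tokens, embeddings, dim=300):
    """Compose a sentence vector as the mean of its token embeddings
    (the single-layer averaging model of equation 1). Unknown tokens
    are simply skipped in this sketch."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# Toy 3-dimensional embedding table (illustrative only).
emb = {"a": np.array([1.0, 0.0, 0.0]),
       "cat": np.array([0.0, 1.0, 0.0]),
       "sat": np.array([0.0, 0.0, 1.0])}
v = embed_sentence(["a", "cat", "sat"], emb, dim=3)
```

Because the composition is a plain mean, the sentence representation is invariant to word order, which is relevant to the discussion of partial similarity below.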
In preliminary experiments, we also tried deeper feedforward models, as well as mono-directional and bi-directional Long Short-Term Memory (LSTM) recurrent models (Hochreiter and Schmidhuber, 1997) in place of the simple averaging approach; however, we did not observe an improvement in performance, which supports the results presented by Wieting et al. (2015).
We also experimented with objective functions that are more representative of the task objective, such as Kullback-Leibler divergence (Tai et al., 2015; Wieting et al., 2015); however, we found that the simple margin-based training objective outperforms cost functions which take the score into account. We hypothesize that this is because the notion of "partial similarity" is mostly captured by the bag-of-words averaging of all token embeddings to compose the vector representation of a sequence, and because the semantic similarity scores for the STS task are sufficiently coarse that the bulk of their semantic content can be efficiently captured even when all structural information has been discarded.
Equation 2 shows the objective function for the Paragram-phrase model. This function pushes similar examples together and dissimilar examples apart, driven by the margin δ; g(x) is some differentiable function which transforms a sequence of tokens into a fixed-size vector. This model is simple to implement, and very fast to train.¹ An additional advantage of the margin-based objective function is that the model can learn from any dataset containing pairs of phrases which are semantically equivalent, enabling the use of unsupervised paraphrase data during training. We exploit this flexibility to greatly improve the performance of our models by tuning the Paragram vectors with new data (section 3.2).
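A minimal sketch of this margin-based loss for one training pair is shown below; the margin value 0.4 and the toy vectors are illustrative assumptions, and the vectors stand in for the outputs of g(x):

```python
import numpy as np

def cos(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def margin_loss(x1, x2, t1, t2, delta=0.4):
    """Margin-based objective for one pair: the paraphrase pair (x1, x2)
    should be more similar to each other than to the negative examples
    t1 and t2 by at least the margin delta (hinge on each negative)."""
    sim = cos(x1, x2)
    return (max(0.0, delta - sim + cos(x1, t1))
            + max(0.0, delta - sim + cos(x2, t2)))

x = np.array([1.0, 0.0])
y = np.array([1.0, 0.0])
neg = np.array([0.0, 1.0])
loss_easy = margin_loss(x, y, neg, neg)  # negatives orthogonal to the pair
loss_hard = margin_loss(x, y, y, x)      # negatives identical to the pair
```

Note that the gold similarity score never appears in the loss; only the pairing itself (plus the sampled negatives) supervises the embeddings.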
The word embeddings are the only parameters of this network. Intuitively, tokens whose embeddings have a high L2 norm in this space contribute more to the semantics of a sentence than those whose norm is low. This simple parameterization has the added advantage that it is very fast to train, relative to other possible architectures such as mono- or bi-directional LSTMs or Gated Recurrent Units (GRUs) (Chung et al., 2015). However, the main advantage of this objective function for our work is that models can be trained with any dataset consisting of pairs of sequences which are semantically equivalent; thus datasets such as PPDB can be used to train high-quality embeddings. We start with the 300-dimensional vectors provided by Wieting et al. (2015), which were trained using the XXL version of PPDB (Ganitkevitch et al., 2013). The sentence embeddings obtained by averaging the raw Paragram vectors are used as the development baseline for our systems, and we look for ways to tune the model for the STS task without changing the training objective.
This training objective assumes that each pair is "sufficiently similar": there is no explicit way to represent partial similarity, since the margin δ dictates only that positive pairs be separated from their negative examples by at least δ. Therefore, we filter the SemEval STS 2012-2014 training data to contain only those pairs whose similarity score is ≥ 3.8.²
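This filtering step can be sketched as follows; the threshold 3.8 comes from the text, while the toy data and function name are illustrative:

```python
def filter_near_paraphrases(pairs, threshold=3.8):
    """Keep only STS training pairs whose gold similarity score meets
    the threshold, so they can be treated as 'perfect' paraphrases under
    the margin objective. The gold score is then discarded: only the
    pairing itself is used during training."""
    return [(s1, s2) for s1, s2, score in pairs if score >= threshold]

# Toy STS-style training data (sentence1, sentence2, gold score 0-5).
data = [("a man plays guitar", "a man is playing a guitar", 4.6),
        ("a dog runs", "a cat sleeps", 0.8)]
train_pairs = filter_near_paraphrases(data)
```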

Generating Negative Examples
Wieting et al. (2015) discuss two ways of selecting negative examples for the Paragram vector training objective. The first is to compute the similarity of x1 and x2 with every segment in the training minibatch, choosing the most similar segment t1 to x1 and the most similar segment t2 to x2 that are not members of the current pair (x1, x2), and to use these as the negative training examples for the pair. The second is to alternate between choosing a random negative example and choosing the most similar phrase. Although choosing the most similar negative example is theoretically satisfying, it carries heavy computational costs: it requires at least N vector comparisons for each example in each pair, where N is the size of the minibatch, and the comparisons must be repeated in every training epoch, since the most similar segment may have changed since the previous epoch. Due to this overhead, we opt instead to use randomly chosen segments as the negative examples. Because the random negative examples are re-selected for each epoch, the model also views more data: each time a training pair is seen, the negative examples t1 and t2 selected for x1 and x2 are different. Intuitively, this should contribute to desirable invariance in the learned semantic embeddings; however, we did not validate this empirically.
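The random strategy we adopt can be sketched as below; the exact batching and sampling details of our implementation may differ, so treat this as an assumption-laden illustration:

```python
import random

def sample_negatives(batch, rng):
    """For each pair (x1, x2) in the minibatch, draw random negative
    segments t1 and t2 from the other segments in the batch. Re-running
    this every epoch means a pair sees different negatives over time."""
    segments = [s for pair in batch for s in pair]
    out = []
    for x1, x2 in batch:
        candidates = [s for s in segments if s not in (x1, x2)]
        t1 = rng.choice(candidates)
        t2 = rng.choice(candidates)
        out.append((x1, x2, t1, t2))
    return out

batch = [("a", "b"), ("c", "d")]
quads = sample_negatives(batch, random.Random(0))
```

Compared with max-similarity sampling, this costs O(1) comparisons per example rather than O(N), at the price of occasionally trivial negatives.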

Synthetic Data Generation
We believe that the requirement for human annotation is the major bottleneck for producing more training data for the STS task. Inspired by the bilingual pivoting used to construct PPDB (Ganitkevitch et al., 2013), we generate synthetic paraphrases using machine translation: each English input e is translated into a pivot language f, and the result is translated back into English, yielding a new segment ê which is paired with e as a synthetic "perfect" paraphrase. This approach is obviously dependent upon the quality of the machine translation output: if translation from e → f → ê outputs exactly the input e, the new synthetic training example is of little use. Therefore, the MT systems used for synthetic generation should ideally produce fluent output ê which paraphrases the original input e, but is lexically diverse with respect to it.
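The round-trip generation step can be sketched as follows. The `translate` function is a stand-in for a real MT system (we used the Google Translate API); the mock table below is purely illustrative:

```python
def pivot_paraphrase(sentence, translate):
    """Generate a synthetic paraphrase by round-trip translation
    e -> f -> e_hat through a pivot language. Pairs where e_hat is
    identical to e are discarded, since they add no new signal."""
    pivot = translate(sentence, src="en", tgt="es")
    back = translate(pivot, src="es", tgt="en")
    return None if back == sentence else (sentence, back)

# Tiny mock MT lookup standing in for a real API (illustrative only).
TABLE = {("the dog is big", "es"): "el perro es grande",
         ("el perro es grande", "en"): "the dog is large"}

def mock_translate(s, src, tgt):
    return TABLE.get((s, tgt), s)

pair = pivot_paraphrase("the dog is big", mock_translate)
```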
In order to validate that adding synthetic data actually improves performance, we generated synthetic paraphrases for the 2015 Images dataset, and compared performance against the Paragram baseline and against a model trained with only the task-internal data. These experiments confirmed that synthetic data generation can significantly improve performance. During development, we did not test performance on all of the 2015 data, because the process of generating paraphrases is time-consuming, and because we wanted to keep our usage of the Google Translate API within the free credit allocated for test usage of the API, to ensure that our results can be easily replicated.

Semantic Textual Similarity
We submit three systems to the monolingual STS task. The first system is an ensemble of features from our 2015 submission, together with two features produced by the embedding systems. The second system uses the task-internal data from SemEval 2012-2015 to tune the Paragram embeddings for the STS task. The third system includes one synthetic paraphrase for each sentence in each test dataset, generated by first translating the sentence into Spanish, then back into English. Note that, due to time constraints, we did not tune a separate model for each test dataset; instead, we used one model trained with all synthetic paraphrases from all test datasets. We believe that training a separate model for each test dataset with synthetic data for only that dataset would improve performance somewhat. Because the scores of the embedding models are in the range 0-1, we scale the outputs by a factor of 5 to match the SemEval scoring system.
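Prediction in the two embedding-only systems then reduces to a similarity computation and a rescaling, roughly as below; the clipping of cosine similarity into [0, 1] is our assumption for the sketch (the paper only states that model scores lie in 0-1):

```python
import numpy as np

def predict_sts_score(v1, v2):
    """Map embedding-space similarity directly to the SemEval 0-5 scale:
    cosine similarity, clipped into [0, 1], scaled by 5. No regression
    model or gold scores are involved at prediction time."""
    sim = float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
    return 5.0 * min(1.0, max(0.0, sim))

# Identical sentence vectors receive the maximum score.
score_same = predict_sts_score(np.array([1.0, 0.0]), np.array([1.0, 0.0]))
# Orthogonal sentence vectors receive the minimum score.
score_diff = predict_sts_score(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```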

Ensemble Model
In order to test the use of embedding model similarity scores as downstream features, we train an ensemble system with all features from our 2015 submission, along with the similarity scores generated by both the task-internal and synthetic models. This ensemble, which we call "fusion", was our best system overall (see table 2). We used a gradient boosting regressor³ trained over the combined SemEval STS datasets from 2012-2015.
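A sketch of this fusion setup is shown below with scikit-learn. The feature matrix, target, and default hyperparameters are stand-ins; our actual 2015 feature set and training configuration are described in the cited system paper:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Toy feature matrix: each row holds four "hand-crafted" features plus
# the two embedding-model similarity scores (task-internal, synthetic).
# All values here are random stand-ins for the real STS features.
rng = np.random.default_rng(0)
X = rng.random((200, 6))
# Synthetic gold scores correlated with the two embedding features.
y = np.clip(2.5 * (X[:, 4] + X[:, 5]), 0.0, 5.0)

model = GradientBoostingRegressor(random_state=0).fit(X, y)
preds = np.clip(model.predict(X), 0.0, 5.0)
# model.feature_importances_ gives the Gini importances discussed in
# the results section.
```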
The details of this system are described by Arora et al. (2015).

Cross-lingual
For our submission to the cross-lingual STS task, we leverage the Google Translate API⁴ in three ways: the language identification API is used to detect which segments are in Spanish, the translation API is used to translate Spanish sentences into English, and the pivoting method for generating synthetic paraphrases discussed in section 3.2 is used to generate one new paraphrase for each segment in each test instance. We then apply our monolingual embedding methodology to the translated text with no modification.
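The normalization step can be sketched as follows; `detect` and `translate` stand in for the real language-identification and translation APIs and are mocked here for illustration:

```python
def prepare_crosslingual_pair(s1, s2, detect, translate):
    """Normalize a crosslingual pair to English: any segment identified
    as Spanish is machine-translated to English, after which the
    monolingual system is applied unchanged."""
    out = []
    for s in (s1, s2):
        if detect(s) == "es":
            s = translate(s, src="es", tgt="en")
        out.append(s)
    return tuple(out)

# Mock detection/translation (illustrative stand-ins for the real API).
def mock_detect(s):
    return "es" if s == "el gato duerme" else "en"

def mock_translate(s, src, tgt):
    return "the cat sleeps" if s == "el gato duerme" else s

pair = prepare_crosslingual_pair("the cat sleeps", "el gato duerme",
                                 mock_detect, mock_translate)
```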

Training Configuration
For the final systems, we use all existing task-internal training data from the SemEval STS task from 2012-2015. The 2015 datasets were used as validation data to find the best system settings, and then included in the training data for the final systems. The pretrained Paragram vector index contains 42,091 tokens; we do not add any new tokens to the index. Table 5 gives the total percentage of each 2016 test dataset which is unknown with respect to our index. Because the baseline Paragram vector index does not contain a special "UNKNOWN" token, we randomly choose a low-frequency token to assign as unknown. Some experimentation showed that using a rare token instead of a stopword results in a small performance improvement.
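The unknown-token handling amounts to the lookup sketched below; the toy vocabulary and the choice of rare token are illustrative:

```python
def tokens_to_ids(tokens, vocab, unk_token):
    """Map tokens to embedding indices, sending out-of-vocabulary tokens
    to a designated low-frequency token (rather than a stopword), since
    the pretrained index has no dedicated UNKNOWN entry."""
    unk_id = vocab[unk_token]
    return [vocab.get(t, unk_id) for t in tokens]

# Toy vocabulary; "aardwolf" plays the role of the chosen rare token.
vocab = {"the": 0, "cat": 1, "sat": 2, "aardwolf": 3}
ids = tokens_to_ids(["the", "zyzzyva", "sat"], vocab, unk_token="aardwolf")
```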

Results
Our monolingual STS systems all performed better than the median system, with the fusion system slightly outperforming the embedding model trained with synthetic data (see tables 2 and 3).
For all systems in both the monolingual and crosslingual STS tasks, we observe an overall improvement over the Paragram baseline when using task-internal training data, and a further improvement when we incorporate synthetic training examples. This result validates the utility of synthetic paraphrases generated by machine translation, and encourages us to explore this avenue further.
For our ensemble-based approach, we analyzed the features using Gini importance⁵ (Singh et al., 2010). Table 4 shows the importance of the top 10 features in our ensemble model. The relative impact of our two paraphrase features is very high, confirming the utility of the Paragram embedding model for the STS task.

Conclusions
We have presented a method of fine-tuning Paragram-phrase vectors for the STS task using both task-internal and synthetic paraphrases. Our embedding models achieve surprisingly good performance on the STS task without directly taking advantage of the gold-standard scores during training. We have also introduced a novel method of generating synthetic paraphrases for test instances using machine translation. Finally, we have shown that a combination of traditional features with the similarity scores learned by our embedding models outperforms each individual system. Future work will focus on increasing the diversity of the synthetic data, and on incorporating multiple objective functions into different stages of the training process.