ASOBEK at SemEval-2016 Task 1: Sentence Representation with Character N-gram Embeddings for Semantic Textual Similarity

A growing body of research has recently been conducted on semantic textual similarity using a variety of neural network models. While recent research focuses on word-based representation for phrases, sentences and even paragraphs, this study considers an alternative approach based on character n-grams. We generate embeddings for character n-grams using a continuous-bag-of-n-grams neural network model. Three different sentence representations based on n-gram embeddings are considered. Results are reported for experiments with bigram, trigram and 4-gram embeddings on the STS Core dataset for SemEval-2016 Task 1.


Introduction
This paper presents an approach for finding the degree of semantic similarity between sentence pairs. Semantic textual similarity (STS) is of relevance to many NLP applications. Recent tasks in recognizing textual entailment, sentence completion and paraphrase identification are closely related. The approach described here makes use of a neural network (NN) algorithm (word2vec) that is typically used to generate word embeddings (Mikolov et al., 2013a). Rather than generating vector representations for words, however, we propose a character-n-gram-to-vector approach. A sentence is then represented as a vector generated through a combination of character n-gram embeddings.
The use of character-level vectors has been proposed in a number of recent studies. Subword language models that use a combination of characters, syllables and frequent words have been explored by Mikolov et al. (2012). Character-level language modeling has been performed for modeling OOV words, where using words as the atomic units of the model would not be sufficient to assign a probability score. In Ling et al. (2015), word representations are composed of vectors of characters, called character to word (C2W). The C2W vectors are used successfully for language modeling and POS tagging without any handcrafted features. The resulting model is competitive with the Stanford POS tagger's word lookup tables on English POS tagging and also produces a notable improvement in results for morphologically rich languages such as Turkish. Kim et al. (2015) apply a simple convolutional neural network model, which uses character-level inputs for word representations. Again, this method outperforms models that use word/morpheme-level features in morphologically rich languages, while also achieving competitive results in English. Huang et al. (2013) introduce a word hashing technique using character n-grams to scale up training of deep NN models for large-scale web search applications.
Our motivation for exploring character n-grams derives in part from previous work we have conducted on paraphrase identification (PI). The PI task is that of deciding whether two sentences have the same meaning. We have shown that sentence representations based on bags or sets of character n-gram features can perform well at this task (Eyecioglu and Keller, 2015). It is hypothesized that n-grams are useful for capturing lexical similarity and perform a role similar to lemmatization whilst preserving differences. Thus, sentence representations based on collections of n-grams as features may offer some advantages over representations based on words as features.
The current study aims to extend this earlier work by working with n-gram embeddings. Mikolov et al. (2013a) introduced two new NN architectures that were applied to learning word embeddings: continuous-bag-of-words (CBOW) and skip-gram (SG). These NN models have been shown to perform well in many NLP areas such as STS and PI (Zarrella et al., 2015; Yin and Schütze, 2015; He et al., 2015). Recent research has taken steps to extend these word vectors to sentences, paragraphs and documents (Le and Mikolov, 2014).
We introduce an alternative approach to obtaining sentence level embedding vectors that make use of character n-grams rather than words. We adopt a model architecture that is analogous to CBOW, which we call continuous-bag-of-n-grams and notate as CBOnG throughout the paper. In keeping with our earlier work on paraphrase identification (Eyecioglu and Keller, 2015), preprocessing is kept to a minimum and no use is made of any manually constructed semantic or syntactic processing tools or resources.
Operationally, STS is similar to paraphrase identification. The two tasks differ in that STS sentence pairs are assigned a degree of semantic equivalence instead of a binary paraphrase label. STS shared tasks have produced a sizable amount of research on sentence similarity (Agirre et al., 2012, 2013, 2014, 2015).

The Task
SemEval-2016 Task 1: Semantic Textual Similarity (Agirre et al., 2016) requires systems to determine the degree of semantic similarity between pairs of sentences. Similarity scores are on a scale from 0 (completely dissimilar) to 5 (semantically equivalent). We participated in the monolingual STS Core subtask. This subtask includes English language evaluation data from multiple sources organized into 5 distinct evaluation datasets: Plagiarism Detection, Q&A Question-Question, Q&A Answer-Answer, Post-Edited Machine Translations and Headlines. The evaluation data have a gold standard similarity score based on human judgements collected using crowdsourcing. The Pearson correlation between similarity scores assigned by the systems and the human judgements is used to assess task performance.

Approach
The sentences within the STS pairs are split into character n-grams. No preprocessing is applied besides lowercasing of the text and removing punctuation.
Our training procedure was unsupervised, using only a large unlabelled dataset drawn from Wikipedia. These data are used to train a CBOnG model. We explore using three different methods for constructing sentence level vector representations from the character embeddings. STS scores for the sentence pairs are computed as the cosine similarity of the resulting sentence level embedding vectors. Three different cosine similarity scores, one from each representation, are obtained.

Wikipedia Dataset
We used a dump of English Wikipedia articles that includes 3,831,719 articles and 8,179,596 unique words. Wikipedia provides a large and accessible collection of text consisting of well-formed sentences that is suitable for training purposes. We obtained a plain text representation of the documents by removing the XML tags of the data dump using the script provided in the Gensim package (Rehurek and Sojka, 2010). Matching the minimal pre-processing performed on the STS pairs, the training data are only lowercased and cleared of punctuation. The Wikipedia dataset is split into adjacent n-grams, with spaces between words replaced by the "-" symbol. For example, the following sequence of trigrams would be produced from the text amazon gift.
-am ama maz azo zon on- n-g -gi gif ift ft-
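The tokenization step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' released code; the function name `char_ngrams` is our own, and the input is assumed to already be lowercased and stripped of punctuation, as described in the paper.

```python
def char_ngrams(text, n=3):
    """Split text into overlapping character n-grams.

    Spaces are replaced with "-" and the string is padded with "-"
    at both ends so word boundaries are marked, as in the paper's
    "amazon gift" example. Assumes punctuation is already removed.
    """
    s = "-" + text.lower().replace(" ", "-") + "-"
    return [s[i:i + n] for i in range(len(s) - n + 1)]


# Reproduces the trigram sequence shown above:
# char_ngrams("amazon gift")
# -> ['-am', 'ama', 'maz', 'azo', 'zon', 'on-', 'n-g', '-gi', 'gif', 'ift', 'ft-']
```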

Constructing a CBOnG Model
Our CBOnG is constructed and trained identically to a CBOW model (Mikolov et al., 2013a), but substituting character n-grams in place of words. The quality of such a model is affected by a number of hyper-parameters, such as the size of the character n-gram vectors (embeddings), the size of the training window, and the cut-off point for less frequent n-grams. Although experiments by Mikolov et al. (2013b) describe how to choose appropriate settings for a word similarity task, the same settings might not be ideal here. Moreover, our model introduces an additional modeling parameter: the size of the character n-grams.
Embedding models scale linearly with the number of unique unfiltered tokens in the training data. Each token has a corresponding fixed-size embedding vector. Mikolov et al. (2013a) explored using embedding vector sizes ranging from 20 to 600. In general, the results showed that increasing the dimensionality of the vectors and the size of the training data improved accuracy on a semantic-syntactic word relationship task, with all other settings held constant.
For our experiments, we make use of 400-dimension embedding vectors. Our CBOnG model was trained using surrounding n-grams within a window size of 5. Character n-grams that occur fewer than 5 times are filtered. Our model is trained using the Gensim package (Rehurek and Sojka, 2010) and its support for CBOW (word-level) models, but over data tokenized into character n-grams.

Compositions of Vector Representations
The construction of embedding vectors for phrases and sentences directly from word embeddings is still an active area of research. As noted in Section 2, there have been various efforts to build good embedding representations of text beyond those tied to individual words. We believe the diversity of methods for constructing textual embedding representations beyond the word level reflects the fact that the appropriateness of a given representation is highly task dependent.
One simple approach is to construct textual embedding representations using either point-wise addition or multiplication of the embedding vectors representing individual words and phrases. The resulting representations have been shown to work well for phrase similarity and PI tasks (Blacoe and Lapata, 2012).
We make use of a similar composition algorithm for our sentence level embeddings, but using character n-gram rather than word embeddings. We describe three different methods for combining n-gram embeddings. The first is formed by addition of the embeddings of the n-gram tokens in a sentence, effectively weighting the embeddings by their frequency. The second and third are based on n-gram types, and formed by concatenation and weighted addition, respectively.

Sentence Representations
For the STS task, the core experimental unit is a pair of sentences. A target sentence s, consisting of a sequence of n tokens (i.e. character n-grams) (t_1, t_2, ..., t_n), is paired with another sentence s' consisting of a sequence of m tokens (t'_1, t'_2, ..., t'_m). Note that the numbers of tokens in the two sentences need not be equal. In the following, the vector embedding associated with a token t is notated v_t. We consider three different vector-based sentence representations built from n-gram embeddings.
For a given sentence s, the first representation is produced through element-wise addition of the vectors associated with each token in s: v_s = v_{t_1} + v_{t_2} + ... + v_{t_n}. The result of this operation is a vector v_s of the specified embedding size, which is fixed in advance of building the model (i.e. 400 in this case).
For the second approach, s is represented as a matrix of size 400 x V, where V is the size of the model vocabulary (i.e. the number of unique n-grams used in training). Concatenating the V columns in order forms the representation: the i-th column is the embedding v_{g_i} of the i-th term g_i in the model vocabulary if g_i occurs in s, and is the null vector if the i-th term does not occur in s.
Finally, for the third representation, element-wise addition is computed over the embeddings of the distinct n-gram types occurring in s, each contributing once. We again obtain a vector of the embedding dimension, specified as 400. We note that the difference between Run1 and Run3 is that for Run1, the contributions of the n-gram embeddings are weighted by token frequency.
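The three representations can be sketched with NumPy as follows. This is an illustrative reconstruction from the descriptions above, not the authors' code; the function names (`rep_token_sum`, `rep_type_indicator`, `rep_type_sum`) and the dictionary-based embedding lookup are our own assumptions.

```python
import numpy as np

def rep_token_sum(tokens, emb):
    """Run1: element-wise sum over n-gram *tokens* in the sentence.

    Repeated n-grams contribute once per occurrence, so embeddings
    are effectively weighted by token frequency.
    """
    return np.sum([emb[t] for t in tokens if t in emb], axis=0)

def rep_type_indicator(tokens, emb, vocab):
    """Run2: dim x |V| matrix over the model vocabulary.

    Column i holds the embedding of the i-th vocabulary term if it
    occurs in the sentence, and the null vector otherwise. The matrix
    is flattened so cosine similarity can be applied directly.
    """
    dim = len(next(iter(emb.values())))
    M = np.zeros((dim, len(vocab)))
    present = set(tokens)
    for i, g in enumerate(vocab):
        if g in present:
            M[:, i] = emb[g]
    return M.ravel()

def rep_type_sum(tokens, emb):
    """Run3: element-wise sum over distinct n-gram *types*.

    Each n-gram contributes once regardless of how often it occurs.
    """
    return np.sum([emb[t] for t in set(tokens) if t in emb], axis=0)
```

Note that Run1 and Run3 differ only in whether duplicate n-grams are summed once or once per occurrence, matching the distinction drawn above.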

Experiments
To obtain semantic similarity scores for each pair of sentences s and s', the cosines of the corresponding vector representations are computed. Previous work on the use of character n-grams for PI has shown that trigrams perform well. The three runs chosen for submission to SemEval-2016's STS task therefore use sentence representations constructed from trigram embeddings. Wikipedia articles were pruned using the Gensim package with default parameters; a total of 108,452 articles are used to construct a character-trigram model for the experiments. The statistical properties of the Wikipedia-trained model using character trigrams are shown in Table 1. Further experiments were conducted for sentence representations based on bigrams and 4-grams. Although these were not submitted for the task, their results are also reported in the following section.
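The cosine scoring step can be sketched as follows; guarding against null sentence vectors (e.g. when no n-gram of a sentence survives the frequency cut-off) is our own precaution, not something specified in the paper.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two sentence vectors.

    Returns 0.0 when either vector is null, which can occur if none
    of a sentence's n-grams are in the trained model's vocabulary.
    """
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return float(np.dot(u, v) / (nu * nv))
```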

Results
The Pearson correlation results from our three different runs obtained using trigram embeddings are presented in Table 2. These results represent the ASOBEK submission to SemEval-2016 Task 1. Run2 and Run3 generally appear to outperform Run1, with Run2 performing best overall. The best individual result is obtained on the Post-editing dataset; this result is ranked above the median for results reported on this subtask. The correlation scores evidence variable performance across the individual datasets. Most notable is that the results from each of the runs applied on the Question-Question dataset are much lower relative to the other categories. In spite of this, the overall performance is 0.6178 with Run2. Experiments were also conducted for bigrams and 4-grams; results are shown in Table 3. Highlighted scores indicate cases where performance exceeds that obtained using trigram embeddings. As for trigrams, better results are generally obtained for Run2 and Run3. For 4-grams, the overall performance of Run3 actually exceeds that based on trigrams. Across all of the systems, results on the Question-Question dataset depress the overall performance significantly. This is especially evident for Run2 of the 4-gram system, where there is little correlation evident between system scores and human judgements. In contrast, using bigrams, Run2 produces our best result for this dataset. Examination of the data indicates that this may be due to the particular form of the questions; character n-grams generated from such question pairs contain many tokens that are not informative in discriminating meaning.

Conclusions
A method for STS based on embeddings of character n-grams generated by a CBOnG model was introduced. To our knowledge, this is the first study to utilize embeddings of character n-grams to build representations of sentences. The study presents preliminary results showing that the approach can successfully help identify the semantic similarity of sentence pairs. Using our method, we observe significant variations in performance across the STS Core datasets. In particular, performance is generally poor for the Question-Question dataset. This suggests that weighting the contributions of the embeddings according to the informativeness of the associated n-grams may improve performance. We intend to consider this in future experiments.