Ebiquity: Paraphrase and Semantic Similarity in Twitter using Skipgrams

We describe the system we developed to participate in SemEval 2015 Task 1, Paraphrase and Semantic Similarity in Twitter. We create similarity vectors from two-skip trigrams of preprocessed tweets and measure their semantic similarity using our UMBC-STS system. We submitted two runs. The best run ranked eleventh out of eighteen teams, with an F1 score of 0.599.


Introduction
In this task (Wei, et al., 2015), participants were given pairs of text sequences from Twitter trends and produced a binary judgment for each pair stating whether or not the two are paraphrases (i.e., semantically the same) and, optionally, a graded score (0.0 to 1.0) measuring their degree of semantic equivalence. For example, for the trending topic "A Walk to Remember" (a film released in 2002), the pair "A Walk to Remember is the definition of true love" and "A Walk to Remember is on and Im in town and Im upset" might be judged as not paraphrases with a score of 0.2, whereas the pair "A Walk to Remember is the definition of true love" and "A Walk to Remember is the cutest thing" could be judged as paraphrases with a score of 0.6.
Many methods have been proposed to solve the paraphrase detection problem. Early approaches were often based on lexical matching techniques, e.g., word n-gram overlap (Barzilay and Lee, 2003) or predicate argument tuple matching (Qiu, et al., 2006). Other approaches go beyond simple lexical matching. For example, Mihalcea, et al. (2006) estimated the semantic similarity of sentence pairs with word-to-word similarity measures and a word specificity measure. Zhang and Patrick (2005) use text canonicalization to transform texts of similar meaning into the same surface text with a higher probability than texts of different meaning.
Many of these approaches adopt distributional semantic models but are limited to the word level. To extend distributional semantic models beyond words, several researchers have learned phrase or sentence representations by composing the representations of individual words (Mitchell and Lapata, 2010; Baroni and Zamparelli, 2010). An alternative approach by Socher et al. (2011) represents phrases and sentences with fixed-size matrices consisting of pooled word and phrase pairwise similarities. Le and Mikolov (2014) learn representations of sentences directly by predicting context, without composing word representations.
In our work, we judge two sentences to be paraphrases if they have a high degree of semantic similarity. We use the UMBC Semantic Textual Similarity system, which provides highly accurate semantic similarity measurement. The remainder of this paper is organized as follows. Section 2 describes the task and the details of our method. Section 3 presents our results and a brief discussion. The last section offers conclusions.

Our Method
To decide whether two tweets are paraphrases, we use a measurement based on semantic similarity values. If two tweets are semantically similar, they are judged to be paraphrases; otherwise they are not. The steps of our method are described below.

Preprocessing
Generally, tweets are informal text sequences that include abbreviations, neologisms, emoticons and slang terms as well as genre-specific elements such as hashtags, URLs and @mentions of other Twitter accounts. This is due to both the informal nature of the medium and the requirement to limit content to at most 140 characters. Thus, before measuring semantic similarity, we replace abbreviations and slang terms with their readable versions. We collected about 685 popular abbreviations and slang terms from several Web resources and combined these with the provided Twitter normalization lexicon developed by Bo Han and Timothy Baldwin (2011).
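As a minimal sketch of the replacement step, the code below expands abbreviations and slang by dictionary lookup. It assumes a small hypothetical lexicon (the entries are illustrative and not taken from our collected resources), lower-casing, and a simple regular-expression tokenizer introduced only for illustration. It does not treat hashtags, URLs or @mentions specially.

```python
import re

# Hypothetical miniature lexicon for illustration only; the lexicon
# described in the text is far larger and its entries are not shown here.
SLANG_LEXICON = {
    "im": "i am",
    "u": "you",
    "gr8": "great",
}

# Assumption: a token is a run of letters, digits or apostrophes.
TOKEN_RE = re.compile(r"[A-Za-z0-9']+")


def normalize_tweet(text, lexicon=SLANG_LEXICON):
    """Lower-case and tokenize a tweet, expanding tokens found in the lexicon."""
    tokens = TOKEN_RE.findall(text.lower())
    expanded = [lexicon.get(tok, tok) for tok in tokens]
    return " ".join(expanded)


print(normalize_tweet("A Walk to Remember is on and Im in town"))
```

Tokens that are not in the lexicon pass through unchanged, so the quality of this step depends entirely on lexicon coverage.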
After replacing abbreviations and slang terms, we remove all stop words to obtain the final processed tweets. We then produce a set of two-skip trigrams for each tweet, which we refer to as its trigram set. We adapted the skip-gram technique from (Guthrie, et al., 2006). Take the tweet "Google Now for iOS simply beautiful" as an example. After removing stop words, we get 'Google Now iOS simply beautiful'. Then a two-skip trigram set is produced: {'Google Now iOS', 'Now iOS simply', 'iOS simply beautiful', 'Google iOS simply', 'Google simply beautiful', 'Now simply beautiful', 'Google Now beautiful', 'Google Now simply', 'Now iOS beautiful'}. We transform every raw tweet into its processed version and then into the corresponding trigram set.
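The trigram construction can be sketched as follows. This is an illustration under one reading of the k-skip-n-gram definition of Guthrie, et al. (2006), in which a trigram may skip at most k tokens in total. The stop word list and the function names are hypothetical.

```python
from itertools import combinations

# Hypothetical stop word list for illustration; the actual list is not shown.
STOP_WORDS = {"for", "the", "a", "is", "and", "to"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lower-cased form is in the stop word list."""
    return [t for t in tokens if t.lower() not in stop_words]


def skip_trigrams(tokens, k=2):
    """Return all in-order token triples that skip at most k tokens in total."""
    grams = []
    for i, j, m in combinations(range(len(tokens)), 3):
        # Tokens lying between the first and last position, minus the
        # middle token that is actually used.
        skipped = (m - i) - 2
        if skipped <= k:
            grams.append(" ".join((tokens[i], tokens[j], tokens[m])))
    return grams


tokens = remove_stop_words("Google Now for iOS simply beautiful".split())
trigram_set = set(skip_trigrams(tokens, k=2))
print(sorted(trigram_set))
```

Under this reading the sketch yields ten trigrams for the example tweet, one more ('Google iOS beautiful') than the nine listed above, so the exact skip rule used in our system may differ from the one assumed here.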

LSA Word Similarity Model
Our LSA word similarity model is a revised version of the one we used in the 2013 and 2014 SemEval semantic text similarity tasks (Kashyap et al., 2014). LSA relies on the fact that semantically similar words (e.g., cat and feline or nurse and doctor) are more likely to occur near one another in text. Thus evidence for word similarity can be computed from a statistical analysis of a large text corpus. We extract raw word co-occurrence statistics from a portion of a 2007 Stanford WebBase dataset (Stanford, 2001).
We performed part-of-speech tagging and lemmatization on the corpus using the Stanford POS tagger (Toutanova et al., 2000). Word/term co-occurrences were counted with a sliding window of fixed size over the entire corpus. We generated two co-occurrence models using window sizes ±1 and ±4. The smaller window provides more precise context, which is better for comparing words of the same part of speech, while the larger one is more suitable for computing the semantic similarity between words of different syntactic categories. Our word co-occurrence models are based on a predefined vocabulary of 22,000 common English open-class words and noun phrases, extended with about 2,000 verb phrases from WordNet. The final dimensions of our word/phrase co-occurrence matrices are 29,000×29,000 when words/phrases are POS tagged. We apply singular value decomposition to the word/phrase co-occurrence matrices (Burgess, 1998) after transforming the raw word/phrase co-occurrence counts into their log frequencies, and select the 300 largest singular values.
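The windowed counting and log transformation can be illustrated with a small sketch. It assumes a toy corpus and vocabulary, omits POS tagging, lemmatization and the SVD step, and assumes log(1 + count) as the transform, since the exact form of the log frequency is not specified above. All names are hypothetical.

```python
import math
from collections import Counter


def cooccurrence_counts(tokens, vocabulary, window):
    """Count ordered (target, context) pairs within +/- window positions."""
    counts = Counter()
    for i, target in enumerate(tokens):
        if target not in vocabulary:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] in vocabulary:
                counts[(target, tokens[j])] += 1
    return counts


def log_transform(counts):
    """Map raw counts to log(1 + count); the exact transform is an assumption."""
    return {pair: math.log(1 + c) for pair, c in counts.items()}


corpus = "the nurse helped the doctor and the doctor thanked the nurse".split()
vocab = {"nurse", "doctor", "helped", "thanked"}

narrow = cooccurrence_counts(corpus, vocab, window=1)
wide = cooccurrence_counts(corpus, vocab, window=4)
print(narrow[("nurse", "doctor")], wide[("nurse", "doctor")])
```

In this toy corpus 'nurse' and 'doctor' never co-occur within the ±1 window but do within the ±4 window, which illustrates how the wider window captures looser associations.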
The LSA similarity between two words/phrases is then defined as the cosine similarity of their corresponding LSA vectors generated by the SVD transformation.
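For reference, cosine similarity between two vectors can be computed as in the sketch below. The vectors shown are toy values, not actual LSA vectors, and returning 0.0 for a zero vector is a convention assumed here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for zero vectors
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors; real LSA vectors would have 300 dimensions.
print(cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))
```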
To compute the semantic similarity of two text sequences, we use the simple align-and-penalize algorithm with a few improvements. These improvements include some sets of common disjoint concepts and an enhanced stop word list.

Features
For two trigram sets, we compute the semantic similarity of every possible pair of trigrams in these two sets using the UMBC Semantic Textual Similarity system. For each pair of tweets (T1 and T2), six features are produced:
• Feature1 = semantic similarity value between the pair of tweets (whole sentences with abbreviations and slang replaced, and stop words removed)
• ... = the weighted average, by tweet length, of the two averages above.
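The pairwise comparison of two trigram sets can be sketched as below. This is only an illustration: a token-overlap (Jaccard) score stands in for the UMBC STS system, and the maximum and mean are shown as example aggregates rather than as the exact six features. All function names are hypothetical.

```python
from itertools import product


def token_overlap(a, b):
    """Stand-in similarity (Jaccard over tokens); NOT the UMBC STS system."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def pairwise_stats(trigrams1, trigrams2, sim=token_overlap):
    """Score every cross pair of trigrams and aggregate the scores."""
    scores = [sim(a, b) for a, b in product(trigrams1, trigrams2)]
    if not scores:
        return {"max": 0.0, "mean": 0.0}
    return {"max": max(scores), "mean": sum(scores) / len(scores)}


stats = pairwise_stats(
    ["Google Now iOS", "Now iOS simply"],
    ["Google Now iOS", "iOS simply beautiful"],
)
print(stats)
```

Any similarity function that maps two strings to a score can be passed as `sim`, so the stand-in could be swapped for a real STS scorer without changing the aggregation.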

Training
We used the LIBSVM system (Chang and Lin, 2011) to train a logistic regression model and a support vector regression model. We ran a grid search to find the best parameters for both models. All training data (13,063 pairs of tweets) were used to train the models without discarding any debatable data. We tested the contribution of each feature through ablation experiments on the development data, in which one feature was removed in each experimental run. Since the performance of the two systems was almost the same, we decided to submit one run from each.
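The ablation procedure can be outlined with the following sketch. The `evaluate` callable is hypothetical: in practice it would retrain a model on the given features and return a score on the development data. The toy scorer and feature names below use made-up numbers purely so that the example runs.

```python
def ablation_study(feature_names, evaluate):
    """Remove one feature at a time and record the drop in the score."""
    baseline = evaluate(list(feature_names))
    drops = {}
    for name in feature_names:
        kept = [f for f in feature_names if f != name]
        drops[name] = baseline - evaluate(kept)
    return baseline, drops


# Made-up additive contributions, for illustration only (not real results).
TOY_GAIN = {"feat_a": 0.30, "feat_b": 0.20, "feat_c": 0.05}


def toy_evaluate(features):
    """Hypothetical scorer; a real one would train and test a model."""
    return sum(TOY_GAIN[f] for f in features)


baseline, drops = ablation_study(list(TOY_GAIN), toy_evaluate)
most_useful = max(drops, key=drops.get)
print(most_useful)
```

A large drop when a feature is removed suggests that the feature contributes information the remaining features do not provide.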

Results and Discussions
We submitted two runs: Run 1 (Logistic Regression) obtained an F1 score of 0.599, a precision of 0.651 and a recall of 0.554, and Run 2 (Support Vector Regression) obtained an F1 score of 0.590, a precision of 0.646 and a recall of 0.543. Our runs ranked eighteenth (Run 1) and nineteenth (Run 2) out of the 38 submitted runs. The top-ranked run has an F1 score of 0.674. The full distribution of F1 scores is shown in Figure 1. The relatively low ranking of our system might be the result of several factors.
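As a consistency check, the reported F1 scores can be recomputed from the reported precision and recall using the standard harmonic-mean definition, F1 = 2PR / (P + R). The helper name below is ours.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


run1 = f1_score(0.651, 0.554)  # Run 1, logistic regression
run2 = f1_score(0.646, 0.543)  # Run 2, support vector regression
print(f"{run1:.3f} {run2:.3f}")  # prints "0.599 0.590"
```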
The first factor is the prevalence of neologisms, misspellings, informal slang and abbreviations in tweets. Better preprocessing to make the tweets closer to normal text might improve our results.
Another factor is the UMBC STS system. Examples of input on which the UMBC STS system performs poorly are shown in Table 3. We can group these into two sets, each associated with a problem in performing the paraphrase task.
The first problem is that a slang word may have different meanings when it is used in different genres. As we can see in the first example in Table 3, 'bombs' does not mean 'a container filled with explosive' but is a synonym of 'home runs' when mentioned in a sports or baseball context. We can recognize this meaning by reading sports articles, but it is not included in any dictionary or in WordNet. Thus our system predicts that the two tweets, one containing 'bombs' and the other 'home runs', have low semantic similarity and thus are not paraphrases.
The second problem involves out-of-vocabulary words, such as the named entities found in the examples in Table 3. Tweet 2 of the second example, 'NOW YOU SEE ME and AFTER EARTH Cant Outpace FAST FURIOUS 6', is full of movie names whose meanings our STS system cannot recognize. We could solve this problem by adding named entity recognition to the system. Another potential solution would be to adopt a simple string-matching component. With string matching, we could handle out-of-vocabulary situations similar to those in the third and fourth examples: we can match 'orr' and 'chara' between the two tweets of the third example and 'new ciroc' in the fourth example.
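A simple string-matching component of this kind could be sketched as follows. It assumes whitespace tokenization, lower-casing and exact token equality. The vocabulary and the two tweets are invented for illustration and are not the actual inputs from Table 3.

```python
def shared_oov_tokens(tweet1, tweet2, vocabulary):
    """Return tokens that occur in both tweets but are not in the vocabulary."""
    tokens1 = set(tweet1.lower().split())
    tokens2 = set(tweet2.lower().split())
    return {tok for tok in tokens1 & tokens2 if tok not in vocabulary}


# Hypothetical vocabulary and tweets, for illustration only.
vocab = {"hits", "hard", "and", "collide", "on", "the", "ice"}
matches = shared_oov_tokens(
    "Orr hits Chara hard",
    "Chara and Orr collide on the ice",
    vocab,
)
print(sorted(matches))
```

The number of shared out-of-vocabulary tokens could then serve as an additional signal alongside the semantic similarity score.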
To improve the performance of our STS system, which was trained on a corpus consisting mostly of reasonably well-written narrative text, we need to expand the training corpus. Training an LSA model on a collection of tweets or on a mixture of tweets and narrative text, and adding a named entity recognition step, may lead to better results.

Table 3. Examples of input pairs on which our system performed poorly

Conclusion
We described the system we submitted to SemEval 2015 Task 1, Paraphrase and Semantic Similarity in Twitter. We preprocess tweets, produce sets of two-skip trigrams from them, and measure their semantic similarity using the UMBC-STS system. We compute statistics, namely the maximum and average, for each pair and use two regression models: logistic regression and support vector regression. Our best performing run achieved an F1 score of 0.599 and was ranked eleventh out of eighteen teams.