Unsupervised Sparse Vector Densification for Short Text Similarity

Sparse representations of text such as bag-of-words models or extended explicit semantic analysis (ESA) representations are commonly used in many NLP applications. However, for short texts, the similarity between two such sparse vectors is not accurate due to the small term overlap. While there have been multiple proposals for dense representations of words, measuring similarity between short texts (sentences, snippets, paragraphs) requires combining these token-level similarities. In this paper, we propose to combine ESA representations and word2vec representations as a way to generate denser representations and, consequently, a better similarity measure between short texts. We study three densification mechanisms that involve aligning sparse representations via many-to-many, many-to-one, and one-to-one mappings. We then show the effectiveness of these mechanisms on measuring similarity between short texts.


Introduction
The bag-of-words model has been used in many applications as a state-of-the-art method for tasks such as document classification and information retrieval. It represents each text as a bag of words, and computes the similarity, e.g., the cosine value, between two sparse vectors in the high-dimensional space. When the contextual information is insufficient, e.g., due to the short length of the document, explicit semantic analysis (ESA) has been used as a way to enrich the text representation (Gabrilovich and Markovitch, 2006; Gabrilovich and Markovitch, 2007). Instead of using only the words in a document, ESA uses a bag of concepts retrieved from Wikipedia to represent the text. The similarity between two texts can then be computed in this enriched concept space.
Both bag-of-words and bag-of-concepts models suffer from the sparsity problem. Because both models use sparse vectors to represent text, the similarity between two pieces of text can be zero even when the snippets are highly related but use different vocabulary. ESA, despite augmenting the lexical space with relevant Wikipedia concepts, still suffers from this problem. We illustrate it with the following simple experiment, done by choosing documents from the "rec.autos" group in the 20-newsgroups data set. For each document and for the label description "cars" (here we follow the description shown in (Chang et al., 2008; Song and Roth, 2014)), we computed 500 concepts using ESA. Then we identified the concepts that appear both in the document ESA representation and in the label ESA representation. The average sizes of this intersection (the number of overlapping concepts in the document and label representations) are shown in Table 1. In addition to the original documents, we also split each document into 2, 4, 8, and 16 equal-length parts, computed the ESA representation of each part, and then computed its intersection with the ESA representation of the label. Table 1 shows that the number of concepts shared by the label and the document representation decreases significantly, even if not as significantly as the drop in the document size. For example, there are on average 8 concepts in the intersection of two vectors with 500 non-zero concepts when we split each document into 16 parts. When two pieces of text have few overlapping terms, the comparison can produce mismatches or biased matches and become less accurate.
In this paper, we propose to use unsupervised approaches to improve the representation, along with a corresponding similarity approach between these representations. Our contribution is twofold. First, we incorporate the popular word2vec (Mikolov et al., 2013b) representations into the ESA representation, and show that incorporating semantic relatedness between Wikipedia titles can indeed help the similarity measure between short texts. Second, we propose and evaluate three mechanisms for comparing the resulting representations. We verify the superiority of the proposed methods using three different NLP tasks.

Sparse Vector Densification
In this section, we introduce a way to compute the similarity between two sparse vectors by augmenting the original similarity measure, i.e., cosine similarity. Suppose we have two vectors $x = (x_1, \ldots, x_V)^T$ and $y = (y_1, \ldots, y_V)^T$, where $V$ is the vocabulary size. Traditional cosine similarity computes the dot product between these two vectors and normalizes it by their norms: $\cos(x, y) = \frac{x^T y}{\|x\| \cdot \|y\|}$. This requires each dimension of $x$ to be aligned with the same dimension of $y$. Note that for sparse vectors $x$ and $y$, most of the elements can be zero. Aligning only identical indices can result in zero similarity even though the two pieces of text are related. Thus, we propose to align different indices of $x$ and $y$ together to increase the similarity value.
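To make the alignment problem concrete, the following is a minimal sketch of standard cosine similarity over sparse vectors stored as dictionaries. The concept names and weights are made-up toy values, not taken from the experiments; the point is only that two related texts with no shared index get a similarity of exactly zero.

```python
import math

def cosine(x, y):
    """Cosine similarity between sparse vectors stored as {index: weight} dicts."""
    # Only indices present in both vectors contribute to the dot product.
    dot = sum(w * y[i] for i, w in x.items() if i in y)
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)

# Hypothetical toy vectors keyed by concept name (illustrative weights only).
doc = {"Automobile": 0.8, "Engine": 0.6}
label = {"Car": 1.0}
overlapping_doc = {"Automobile": 0.8, "Car": 0.6}

no_overlap = cosine(doc, label)                 # related concepts, no shared index
some_overlap = cosine(overlapping_doc, label)   # one shared index
```

Here `doc` and `label` are clearly related, yet their cosine is zero because "Automobile" and "Car" are different dimensions.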
We can rewrite the vectors $x$ and $y$ as $x = \{x_{a_1}, \ldots, x_{a_{n_x}}\}$ and $y = \{y_{b_1}, \ldots, y_{b_{n_y}}\}$, where $a_i$ and $b_j$ are the indices of the non-zero terms in $x$ and $y$ ($1 \le a_i, b_j \le V$), and $x_{a_i}$ and $y_{b_j}$ are the weights associated with the corresponding terms in the vocabulary. Suppose there are $n_x$ and $n_y$ non-zero terms in $x$ and $y$, respectively. Then cosine similarity can be rewritten as:
$$\cos(x, y) = \frac{\sum_{i=1}^{n_x} \sum_{j=1}^{n_y} \delta(a_i - b_j)\, x_{a_i} y_{b_j}}{\|x\| \cdot \|y\|} \qquad (1)$$
where $\delta(\cdot)$ is the delta function with $\delta(0) = 1$ and $\delta(z) = 0$ otherwise. Suppose we can compute the similarity between terms $a_i$ and $b_j$, denoted as $\phi(a_i, b_j)$. The problem is then how to aggregate the similarities between all $a_i$'s and $b_j$'s to augment the original cosine similarity.

Similarity Augmentation
The most intuitive way to integrate the similarities between terms is averaging them:
$$S_A(x, y) = \frac{\sum_{i=1}^{n_x} \sum_{j=1}^{n_y} x_{a_i} y_{b_j}\, \phi(a_i, b_j)}{n_x \|x\| \cdot n_y \|y\|} \qquad (2)$$
This similarity averages all the pairwise similarities between terms $a_i$ and $b_j$. However, we can expect many of the similarities $\phi(a_i, b_j)$ to be close to zero. In this case, in addition to capturing the relatedness between non-identical terms, the average will also introduce noise. Therefore, we also consider an alignment mechanism that we implement greedily via a maximum matching mechanism:
$$S_M(x, y) = \frac{\sum_{i=1}^{n_x} x_{a_i} y_{b_j}\, \phi(a_i, b_j)}{\|x\| \cdot \|y\|} \qquad (3)$$
For each $i$, we choose $j$ as $\arg\max_j \phi(a_i, b_j)$ and substitute the similarity $\phi(a_i, b_j)$ between terms $a_i$ and $b_j$ into the final similarity between $x$ and $y$. Note that this similarity is not symmetric. Thus, if one needs a symmetric similarity, it can be computed by averaging the two similarities $S_M(x, y)$ and $S_M(y, x)$.
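The averaging and max-matching similarities can be sketched as follows. This is a minimal illustration, not the authors' implementation: vectors are dictionaries from term to weight, `phi` is any term-similarity function supplied by the caller, and the normalization (dividing by n_x·||x||·n_y·||y|| for averaging and by ||x||·||y|| for max matching) follows the equations as read here. The lookup table `TOY_PHI` and all weights are hypothetical.

```python
import math

def norm(v):
    """Euclidean norm of a sparse vector stored as {term: weight}."""
    return math.sqrt(sum(w * w for w in v.values()))

def sim_average(x, y, phi):
    """Many-to-many similarity: weighted average over all term pairs."""
    if not x or not y:
        return 0.0
    total = sum(xw * yw * phi(a, b) for a, xw in x.items() for b, yw in y.items())
    return total / (len(x) * norm(x) * len(y) * norm(y))

def sim_max(x, y, phi):
    """Max matching: each term of x is aligned with its most similar term in y."""
    if not x or not y:
        return 0.0
    total = 0.0
    for a, xw in x.items():
        best = max(y, key=lambda b: phi(a, b))
        total += xw * y[best] * phi(a, best)
    return total / (norm(x) * norm(y))

def sim_max_symmetric(x, y, phi):
    """Symmetric variant: average of the two directed max-matching scores."""
    return 0.5 * (sim_max(x, y, phi) + sim_max(y, x, phi))

# Hypothetical term similarities; identical terms get similarity 1.
TOY_PHI = {frozenset(("Automobile", "Car")): 0.9}

def toy_phi(a, b):
    return 1.0 if a == b else TOY_PHI.get(frozenset((a, b)), 0.0)

x = {"Automobile": 1.0, "Engine": 1.0}
y = {"Car": 1.0}
score_max = sim_max(x, y, toy_phi)
score_avg = sim_average(x, y, toy_phi)
```

In this toy case the averaged score is half the max-matching score, because the unrelated pair ("Engine", "Car") dilutes the average while contributing nothing to it.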
The above two similarity measurements are simple and intuitive. We can think of $S_A(x, y)$ as leveraging a many-to-many term mapping, while $S_M(x, y)$ uses only a many-to-one term mapping. $S_A(x, y)$ can introduce small and noisy similarity values between terms. While $S_M(x, y)$ essentially aligns each term in $x$ with its best match in $y$, we run the risk that multiple components of $x$ will select the same element in $y$. To ensure that all the non-zero terms in $x$ and $y$ are matched, we propose to constrain this metric by disallowing many-to-one mappings. We do that by using a similarity metric based on the Hungarian method (Papadimitriou and Steiglitz, 1982). The Hungarian method is a combinatorial optimization algorithm that solves the bipartite graph matching problem by finding an optimal assignment matching the two sides of the graph on a one-to-one basis. Assume that we run the Hungarian method on the pair $\{x, y\}$, and let $h(a_i) = b_j$ denote the outcome of the algorithm, that is, $a_i$ is aligned with $b_j$. (We assume here, for simplicity, that $n_x = n_y$; we can always achieve that by adding some zero-weighted terms that are not aligned.) Then we define the similarity as:
$$S_H(x, y) = \frac{\sum_{i=1}^{n_x} x_{a_i} y_{h(a_i)}\, \phi(a_i, h(a_i))}{\|x\| \cdot \|y\|} \qquad (4)$$
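A one-to-one alignment of this kind can be illustrated with a small sketch. Two assumptions are made here that the text does not state: the assignment is chosen to maximize the total term similarity φ, and the optimum is found by exhaustive search over permutations rather than by the Hungarian method itself. For tiny inputs the exhaustive search returns the same optimal assignment, but it is only usable for a handful of terms. The similarity table and weights are hypothetical.

```python
import math
from itertools import permutations

def norm(v):
    """Euclidean norm of a sparse vector stored as {term: weight}."""
    return math.sqrt(sum(w * w for w in v.values()))

def sim_one_to_one(x, y, phi):
    """One-to-one alignment similarity (brute-force stand-in for the Hungarian method)."""
    if not x or not y:
        return 0.0
    xs, ys = list(x), list(y)
    # Pad the shorter side with None, playing the role of zero-weight unaligned terms.
    n = max(len(xs), len(ys))
    xs += [None] * (n - len(xs))
    ys += [None] * (n - len(ys))

    def pair_sim(a, b):
        return 0.0 if a is None or b is None else phi(a, b)

    # Assumption: the best assignment maximizes the summed term similarity.
    best = max(permutations(range(n)),
               key=lambda p: sum(pair_sim(xs[i], ys[p[i]]) for i in range(n)))
    total = sum(x[xs[i]] * y[ys[best[i]]] * pair_sim(xs[i], ys[best[i]])
                for i in range(n)
                if xs[i] is not None and ys[best[i]] is not None)
    return total / (norm(x) * norm(y))

# Hypothetical term similarities: both terms of x prefer "Car".
TABLE = {("Automobile", "Car"): 0.9, ("Automobile", "Truck"): 0.3,
         ("Vehicle", "Car"): 0.8, ("Vehicle", "Truck"): 0.7}

def toy_phi(a, b):
    return TABLE.get((a, b), 0.0)

x = {"Automobile": 1.0, "Vehicle": 1.0}
y = {"Car": 1.0, "Truck": 1.0}
score = sim_one_to_one(x, y, toy_phi)
```

Under max matching both "Automobile" and "Vehicle" would select "Car"; the one-to-one constraint instead pairs "Vehicle" with "Truck". For realistic sizes, a library routine such as SciPy's `scipy.optimize.linear_sum_assignment` would be the natural replacement for the permutation search.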

Term Similarity Measure
To evaluate the term similarity $\phi(\cdot, \cdot)$, we use local contextual similarity based on distributed representations. We adopt the word2vec (Mikolov et al., 2013a; Mikolov et al., 2013b) approach to obtain a dense representation of words. The representation of each word is predicted based on the context word distribution in a window around it. We trained word2vec on the Wikipedia dump data using the default parameters (CBOW model with a window size of five). For each word, we obtained a 200-dimensional vector. If the term is a phrase, we simply average the vectors of the words in the phrase to obtain its representation, following the original word2vec approach (Mikolov et al., 2013b). We use two vectors $a$ and $b$ to represent the vectors for the two terms. To evaluate the similarity between two terms, for the average approach in Eq. (2), we use the RBF kernel over the two vectors, $\exp\{-\|a - b\|^2 / (0.03 \cdot \|a\| \cdot \|b\|)\}$, as the similarity for all the experiments, since it has the desirable property of suppressing terms with small similarities. For the max and Hungarian approaches in Eqs. (3) and (4), we simply use the cosine similarity between the two word2vec vectors. In addition, we cut off all similarities below a threshold $\gamma$ and map them to zero.
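The term-similarity functions described above can be sketched in a few lines. This is an illustrative sketch over plain Python lists standing in for word2vec vectors; the function names are hypothetical, the two-dimensional vectors are toy values, and only the constant 0.03 and the thresholding rule come from the text.

```python
import math

def dense_norm(v):
    return math.sqrt(sum(c * c for c in v))

def phrase_vector(word_vectors):
    """Represent a phrase by the average of its word vectors."""
    n = len(word_vectors)
    return [sum(column) / n for column in zip(*word_vectors)]

def rbf_similarity(a, b, scale=0.03):
    """RBF kernel exp(-||a - b||^2 / (scale * ||a|| * ||b||)), used with averaging."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    denom = scale * dense_norm(a) * dense_norm(b)
    if denom == 0.0:
        return 0.0
    return math.exp(-sq_dist / denom)

def cosine_similarity(a, b):
    """Cosine between two dense vectors, used with max and Hungarian matching."""
    denom = dense_norm(a) * dense_norm(b)
    if denom == 0.0:
        return 0.0
    return sum(ai * bi for ai, bi in zip(a, b)) / denom

def apply_threshold(similarity, gamma):
    """Map every similarity below the threshold gamma to zero."""
    return similarity if similarity >= gamma else 0.0

# Toy 2-dimensional vectors standing in for 200-dimensional word2vec vectors.
new_york = phrase_vector([[1.0, 0.0], [0.0, 1.0]])
```

Because the RBF kernel decays exponentially with the squared distance, pairs of dissimilar terms receive values very close to zero, which limits the noise they add to the average in Eq. (2).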

Experiments
We experiment on three data sets. We use dataless classification (Chang et al., 2008; Song and Roth, 2014) over the 20-newsgroups data set to verify our argument about the short text problem, and use two short text data sets to evaluate document similarity measurement and event classification for sentences.

Dataless Classification
Dataless classification uses the similarity between documents and labels in an enriched "semantic" space to determine the category of a given document. In this experiment, we used the label descriptions provided by (Chang et al., 2008). It has been shown that ESA outperforms other representations for dataless classification (Chang et al., 2008; Song and Roth, 2014). Thus, we chose ESA as our baseline method. To demonstrate how the length of documents affects the classification result, we used both full documents and the 16 split parts (the parts are associated with the same label as the original document). To demonstrate the impact of densification, we selected two problems as an illustration: "rec.autos vs. sci.electronics" and "rec.autos vs. rec.motorcycles." While the former problem is relatively easy since the two classes belong to different super-classes, the latter problem is more difficult since they are under the same super-class. The value of threshold $\gamma$ for max matching and Hungarian based densification is set to 0.85 empirically.
Figure 2: Boxplot of similarity scores for "rec.autos vs. sci.electronics" (easy, left) and "rec.autos vs. rec.motorcycles" (difficult, right). For each method of ESA and Dense-ESA with max matching in Eq. (3), we compute $S(d, l_1)$ and $S(d, l_2)$ between a document $d$ and the labels $l_1$ and $l_2$. Then we compute $S(d) = S(d, l_1) - S(d, l_2)$. For each ground truth label, we draw the distribution of $S(\cdot)$ with outliers in the figures. For example, "ESA:autos" shows the distribution of $S(\cdot)$ for the data with label "rec.autos." The t-test results show that the distributions of different labels are significantly different (99%). We can see that Dense-ESA pulls apart the distributions of different labels and that the separation is more significant for the more difficult problem (right).
Figure 1 shows the results of the dataless classification using ESA and ESA with densification (Dense-ESA) with different numbers of Wikipedia concepts as the representation dimensionality. We can see that Dense-ESA significantly improves the dataless classification results. As shown in Table 2, the max matching and Hungarian matching based methods are typically the best metrics, and the improvements are more significant for shorter documents and for more difficult problems. Figure 2 highlights this observation.
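The dataless classification step and the score difference plotted in Figure 2 can be sketched as follows. This is a minimal illustration under stated assumptions: the document is assigned to the label with the highest similarity, plain sparse cosine is used as the default similarity (any densified similarity could be passed in instead), and the concept vectors are invented toy values rather than real ESA output.

```python
import math

def cosine(x, y):
    """Cosine similarity between sparse vectors stored as {concept: weight}."""
    dot = sum(w * y[c] for c, w in x.items() if c in y)
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)

def classify(doc, labels, similarity=cosine):
    """Return the name of the label whose representation is most similar to doc."""
    return max(labels, key=lambda name: similarity(doc, labels[name]))

def score_difference(doc, label_1, label_2, similarity=cosine):
    """S(d) = S(d, l1) - S(d, l2); positive values favor the first label."""
    return similarity(doc, label_1) - similarity(doc, label_2)

# Hypothetical concept vectors for two labels and one document.
labels = {
    "rec.autos": {"Car": 1.0, "Automobile": 0.5},
    "sci.electronics": {"Circuit": 1.0, "Transistor": 0.5},
}
doc = {"Car": 0.7, "Engine": 0.7}

predicted = classify(doc, labels)
diff = score_difference(doc, labels["rec.autos"], labels["sci.electronics"])
```

No labeled training documents are used at any point: the only supervision is the label description itself.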

Document Similarity
We used the data set provided by Lee et al. (Lee et al., 2005) to evaluate pairwise short document similarity. There are 50 documents and the average number of words is 80.2. We averaged all the human annotations for the same document pair as the similarity score. After computing the scores for pairs of documents, we used Spearman's correlation to evaluate the results. A larger correlation score means that the similarity is more consistent with human annotation. The best word-level similarity result is close to 0.5 (Lee et al., 2005). We tried the cosine similarity between ESA representations and also Dense-ESA. The value of $\gamma$ for max matching based densification is set to 0.95, and for Hungarian based densification it is set to 0.89. We can see from Table 3 that ESA is better than the word-based method, and that all versions of Dense-ESA outperform the original ESA.
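The evaluation protocol described above (average the human ratings per pair, then correlate with system scores) can be sketched with the standard library alone. Spearman's correlation is computed here as the Pearson correlation of average ranks; the ratings and system scores are made-up toy numbers, and the helper names are hypothetical.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(u, v):
    """Pearson correlation; assumes neither list is constant."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    dev_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    dev_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (dev_u * dev_v)

def spearman(u, v):
    return pearson(average_ranks(u), average_ranks(v))

# Toy data: several human ratings per document pair, one system score per pair.
human_ratings = [[1, 2, 1], [3, 3, 4], [5, 4, 5], [2, 2, 3]]
human_scores = [sum(r) / len(r) for r in human_ratings]
system_scores = [0.10, 0.55, 0.90, 0.30]

rho = spearman(human_scores, system_scores)
```

Because only ranks matter, the system scores do not need to be on the same scale as the human ratings. SciPy's `scipy.stats.spearmanr` provides the same statistic if that library is available.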

Event Classification
In this experiment, we chose the ACE2005 data set to test how well we can classify sentences into event types without any training. There are eight types of events: life, movement, conflict, contact, etc. We chose all the sentences that contain event information as the data set. Following the dataless classification protocol, we compare the similarity between sentences and label descriptions to determine the event types. There are 3,644 unique sentences with events, including 2,712 sentences having only one event type, 421 having two event types, and 30 having three event types. Thus, this is a multi-label classification problem. The average length of the sentences is 23.71. To test the approaches, we used five-fold cross validation to select the thresholds for each class to classify whether the sentence belongs to an event type. The value of threshold $\gamma$ for both max matching and Hungarian based densification is also set to 0.85 empirically. We then report the mean and standard deviation over five runs. The results are shown in Table 4. We can see that Dense-ESA also outperforms ESA.
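The per-class thresholding used for this multi-label setting can be illustrated with a small sketch. The text does not say which criterion the cross validation optimizes, so choosing the threshold that maximizes F1 on held-in data is an assumption made here purely for illustration; the scores, labels, and function names are likewise hypothetical.

```python
def f1_score(predicted, gold):
    """F1 for one class, given parallel lists of booleans."""
    tp = sum(p and g for p, g in zip(predicted, gold))
    fp = sum(p and not g for p, g in zip(predicted, gold))
    fn = sum(g and not p for p, g in zip(predicted, gold))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def select_threshold(scores, gold):
    """Pick the observed score that, used as a cut-off, gives the best F1 (assumed criterion)."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_score([s >= t for s in scores], gold))

def predict_event_types(sentence_scores, thresholds):
    """Assign every event type whose similarity reaches that type's threshold."""
    return {label for label, score in sentence_scores.items()
            if score >= thresholds[label]}

# Toy similarities between four sentences and the "conflict" label description.
conflict_scores = [0.1, 0.4, 0.6, 0.9]
conflict_gold = [False, False, True, True]
conflict_threshold = select_threshold(conflict_scores, conflict_gold)

thresholds = {"conflict": conflict_threshold, "life": 0.5}
predicted = predict_event_types({"conflict": 0.7, "life": 0.2}, thresholds)
```

A sentence can receive zero, one, or several event types, since each class is thresholded independently.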

Conclusion
In this paper, we studied mechanisms for combining two popular representations of text, i.e., ESA and word2vec, to improve the computation of short text similarity. Furthermore, we proposed three different mechanisms to compute the similarity between these representations, and demonstrated, using three different data sets, that the proposed method outperforms the traditional ESA.