TrWP: Text Relatedness using Word and Phrase Relatedness

Text is composed of words and phrases. In the bag-of-words model, phrases in texts are split into words. This may discard the inner semantics of phrases, which in turn may give inconsistent relatedness scores between two texts. TrWP, an unsupervised text relatedness approach, combines both word and phrase relatedness. The word relatedness is computed using an existing unsupervised co-occurrence based method. The phrase relatedness is computed using an unsupervised phrase relatedness function f that adopts the Sum-Ratio technique, based on the statistics in the Google n-gram corpus of overlapping n-grams associated with the two input phrases. The second run of TrWP ranked 30th out of 73 runs in SemEval-2015 Task 2a (English STS).


Introduction
Generally, a phrase is an ordered sequence of multiple words that together refer to a particular meaning (Zamir and Etzioni, 1999). Phrase relatedness quantifies how two phrases relate to each other. It plays an important role in different text mining tasks; for instance, document similarity, classification and clustering are performed on documents composed of phrases. Several document clustering methods use phrase similarity to determine the similarity between documents so as to improve the clustering result (Chim and Deng, 2008; Shrivastava et al., 2013). SpamED (Pera and Ng, 2009) uses the bi-gram and tri-gram phrase similarity between an incoming e-mail message and a previously marked spam message to enhance the accuracy of spam detection.
Most work on text relatedness can be abstracted as a function of word relatedness (Ho et al., 2010). The classical Bag-of-Word (BoW) text relatedness methods split phrases into words and then compute text-pair relatedness from word-pair relatedness (Islam and Inkpen, 2008; Islam et al., 2012; Tsatsaronis et al., 2010). TrWP considers text as a Bag-of-Word-and-Phrase (BoWP). It considers a (word, bi-gram) or (bi-gram, bi-gram) pair as a phrase-pair and computes text relatedness using both word and phrase relatedness.
Some phrase relatedness methods use a compositional distributional semantic (CDS) model (Annesi et al., 2012; Hartung and Frank, 2011). Others use different tools and knowledge-based resources (Han et al., 2013; Tsatsaronis et al., 2010). These methods split phrases into words without considering the word order, which might change the meaning of phrases and lead to inconsistent phrase relatedness scores (Turney and Pantel, 2010). For example, if we split the phrases "boat house" and "house boat" into words, we get a relatedness score of one; nonetheless, taken as whole units, these two phrases do not have exactly the same meaning (Turney and Pantel, 2010). To preserve the phrase meaning, TrWP uses the phrase relatedness function f, which considers a phrase as a single unit.

Terminology used in Phrase Relatedness
The terms used in measuring phrase relatedness are described below.

Bi-gram Context
A bi-gram context is a bi-gram extracted by placing a phrase in the leftmost, middle and rightmost positions within the Google n(=3,4)-grams. Sample bi-gram contexts for the bi-gram phrase "large number" are shown in Table 1.

Table 1: Positions of the bi-gram phrase ("large number") in Google 4-grams and corresponding bi-gram contexts marked bold.

Overlapping Bi-gram Context
The overlapping bi-gram context is a bi-gram which is overlapped between two Google n(=3,4)-grams that contain two target phrases at the same position. Consider two Google 4-grams "large number of death" and "vast amount of death" where "large number" and "vast amount" are the target phrases and "of death" is an overlapping bi-gram context.

Sum-Ratio (SR)
Sum-Ratio refers to the product of the sum of two numbers and the ratio between their minimum (min) and maximum (max). The Sum-Ratio of two numbers indicates the strength of association between them by maximizing their sum with respect to their ratio. The objective of Sum-Ratio is to capture the strength of association between two overlapping Google n(=3,4)-grams. Given two numbers a and b, the Sum-Ratio of a and b is defined as follows.

$$\mathrm{SR}(a, b) = \frac{\min(a, b)}{\max(a, b)} \times (a + b)$$
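As a minimal sketch of the Sum-Ratio definition above (the sum of two numbers multiplied by their min/max ratio), the following Python function may help make the behaviour concrete. The function name `sum_ratio` and the handling of two zero inputs are assumptions made for illustration; the counts are made-up values, not corpus statistics.

```python
def sum_ratio(a: float, b: float) -> float:
    """Sum-Ratio of two non-negative numbers: (min / max) * (a + b)."""
    largest = max(a, b)
    if largest == 0:
        # Assumption: two zero counts carry no association strength.
        return 0.0
    return min(a, b) / largest * (a + b)


# Equal counts keep the full sum; unbalanced counts are damped by the ratio.
balanced = sum_ratio(100, 100)    # ratio 1.0  -> 200.0
unbalanced = sum_ratio(1000, 250)  # ratio 0.25 -> 312.5
```

Under this definition, a pair with a large sum but a small ratio is penalized, so a high score requires both counts to be large and of similar magnitude.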

Relatedness Strength
Relatedness strength is the strength of association between two phrases $P_1$ and $P_2$, computed using the Sum-Ratio values between the counts of any two Google n(=3,4)-grams that contain $P_1$ and $P_2$, respectively, and an overlapping bi-gram context.

Phrase Detection
Given a specific text, we elicit bi-grams of interest as candidate phrases if they are highly frequent in the Google bi-gram corpus, as asserted in the Google Book-Ngram-Viewer (books.google.com/ngrams/info). We adopt a naive approach to detect the bi-gram phrases using the mean ($u_{bg}$) and standard deviation ($sd_{bg}$) of all Google bi-gram frequencies, which are computed once. First, the whole text is split by stop-words, producing a list of c-grams. Then, for each c-gram, the following two steps are executed.
Step 1: If the c-gram is a bi-gram and its frequency is greater than $u_{bg} + sd_{bg}$, then we add it to the list of bi-gram phrases.
Step 2: If the length of the c-gram is greater than two, we generate an array of bi-grams from the c-gram and find the most frequent bi-gram ($mf_{bg}$) among them. If the frequency of $mf_{bg}$ is greater than $u_{bg} + sd_{bg}$, then we add $mf_{bg}$ to the list of bi-gram phrases and split the c-gram into two parts (i.e., left and right) around $mf_{bg}$. After splitting, we apply Step 1 and Step 2 recursively to each of the left and right parts.
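The two steps above can be sketched in Python as follows. This is an illustrative reading of the procedure, not the authors' implementation: the function name, the whitespace tokenization, the dictionary `bigram_freq` standing in for the Google bi-gram corpus, and the precomputed `threshold` (playing the role of the mean plus one standard deviation of all bi-gram frequencies) are all assumptions.

```python
def detect_bigram_phrases(text, stop_words, bigram_freq, threshold):
    """Return bi-gram phrases whose (hypothetical) corpus frequency exceeds threshold."""
    # Split the text at stop-words to obtain c-grams (lists of consecutive words).
    cgrams, current = [], []
    for word in text.lower().split():
        if word in stop_words:
            if current:
                cgrams.append(current)
            current = []
        else:
            current.append(word)
    if current:
        cgrams.append(current)

    phrases = []

    def process(cgram):
        if len(cgram) < 2:
            return  # a single word cannot form a bi-gram phrase
        if len(cgram) == 2:
            # Step 1: the c-gram itself is a bi-gram.
            bigram = (cgram[0], cgram[1])
            if bigram_freq.get(bigram, 0) > threshold:
                phrases.append(bigram)
            return
        # Step 2: find the most frequent bi-gram inside a longer c-gram.
        candidates = [(bigram_freq.get((cgram[i], cgram[i + 1]), 0), i)
                      for i in range(len(cgram) - 1)]
        freq, i = max(candidates, key=lambda item: item[0])
        if freq > threshold:
            phrases.append((cgram[i], cgram[i + 1]))
            process(cgram[:i])       # left part
            process(cgram[i + 2:])   # right part

    for cgram in cgrams:
        process(cgram)
    return phrases


# Made-up frequencies for illustration only.
freqs = {("large", "number"): 1000, ("house", "boat"): 500, ("boat", "owners"): 50}
found = detect_bigram_phrases("the large number of house boat owners",
                              {"the", "of", "a"}, freqs, threshold=100)
```

In this example, "large number" is accepted by Step 1, while "house boat owners" goes through Step 2, which keeps "house boat" and leaves the single word "owners" as a non-phrase token.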

Computing Phrase Relatedness
The phrase relatedness function f computes the relatedness strength between two phrases $P_1$ and $P_2$ using the Google n-gram corpus (Brants and Franz, 2006). This strength is then normalized between 0 and 1 using NGD (Cilibrasi and Vitanyi, 2007) in conjunction with NGD′ (Gracia et al., 2006).

Lexical Pruning on the Bi-gram Contexts
First, the bi-gram contexts of the phrases are extracted. However, some phrases along with their bi-gram contexts do not convey meaningful insight due to the improper positioning of stop-words within the bi-gram contexts. Therefore, lexical pruning is performed based on the position of stop-words inside the bi-gram contexts. When the target phrase is in the leftmost position, the Google n(=3,4)-gram is pruned if its rightmost word is a stop-word; when the target phrase is in the rightmost position, the n-gram is pruned if its leftmost word is a stop-word. When the phrase is in the middle, surrounded by two context words, the Google n(=3,4)-gram is pruned if both surrounding context words are stop-words. After performing lexical pruning, we have two sets of non-pruned Google n(=3,4)-grams containing the bi-gram contexts of the two phrases, respectively.
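The lexical pruning rule can be sketched as a small predicate. This is a hedged illustration: the function names, the tuple representation of n-grams, and the tiny stop-word set are assumptions, and for the middle position the sketch checks the two words immediately adjacent to the phrase.

```python
def find_phrase(ngram, phrase):
    """Return the start index of phrase inside ngram, or -1 if absent."""
    m = len(phrase)
    for start in range(len(ngram) - m + 1):
        if ngram[start:start + m] == phrase:
            return start
    return -1


def is_pruned(ngram, phrase, stop_words):
    """Apply the position-based stop-word rule to one n-gram (tuples of words)."""
    start = find_phrase(ngram, phrase)
    if start < 0:
        raise ValueError("phrase does not occur in n-gram")
    end = start + len(phrase)
    if start == 0:
        # Phrase is leftmost: prune if the rightmost word is a stop-word.
        return ngram[-1] in stop_words
    if end == len(ngram):
        # Phrase is rightmost: prune if the leftmost word is a stop-word.
        return ngram[0] in stop_words
    # Phrase is in the middle: prune only if both neighbours are stop-words.
    return ngram[start - 1] in stop_words and ngram[end] in stop_words


STOP = {"a", "the", "of"}          # illustrative stop-word list
PHRASE = ("large", "number")
kept = not is_pruned(("large", "number", "of", "death"), PHRASE, STOP)
```

Here "large number of death" is kept because its rightmost word is not a stop-word, whereas "large number of the" would be pruned.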

Finding Overlapping Bi-gram Contexts
We find the overlapping bi-gram contexts between the two sets of non-pruned Google n(=3,4)-grams. The Google n(=3,4)-grams having overlapping bi-gram contexts are separated from the Google n(=3,4)-grams that have no overlapping contexts.
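One way to pair n-grams that share a bi-gram context at the same phrase position is to index one set by a (position, context) key and look up the other set against it. The sketch below assumes n-grams are stored as word tuples mapped to counts; the function names and the counts are hypothetical.

```python
def context_key(ngram, phrase):
    """Return (phrase start index, remaining context words), or None if phrase is absent."""
    m = len(phrase)
    for start in range(len(ngram) - m + 1):
        if ngram[start:start + m] == phrase:
            return start, ngram[:start] + ngram[start + m:]
    return None


def overlapping_pairs(ngrams1, phrase1, ngrams2, phrase2):
    """Pair n-grams of the two phrases that share a context at the same position."""
    index = {}
    for ngram, count in ngrams2.items():
        key = context_key(ngram, phrase2)
        if key is not None:
            index[key] = (ngram, count)
    pairs = []
    for ngram, count in ngrams1.items():
        key = context_key(ngram, phrase1)
        if key is not None and key in index:
            pairs.append(((ngram, count), index[key]))
    return pairs


# Made-up counts for illustration only.
g1 = {("large", "number", "of", "death"): 120,
      ("large", "number", "of", "people"): 900}
g2 = {("vast", "amount", "of", "death"): 60,
      ("vast", "amount", "of", "money"): 400}
pairs = overlapping_pairs(g1, ("large", "number"), g2, ("vast", "amount"))
```

Only the pair sharing the context "of death" is returned; the n-grams ending in "of people" and "of money" have no overlapping context and are set aside.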

Statistical Pruning on the Overlapping Bi-gram Contexts
Each Google n(=3,4)-gram pair with an overlapping bi-gram context possesses a strength of association. We presume that if most of the Google n(=3,4)-gram pairs have higher strengths of association, the relatedness score between the two phrases tends to be higher, and vice versa. However, some strengths of association fall outside the group that contains most of the values. These are outliers, and because of them the relatedness score between two phrases becomes inconsistent. Hence, we apply statistical pruning on the strengths of association to prune the outliers. To find the group containing most of the strengths of association and prune the outliers, we adopt the Normal Distribution (Bohm and Zech, 2010). It has been shown that in a Normal Distribution most of the samples lie within the mean ± standard deviation. We divide each Google n(=3,4)-gram count (frequency) within a pair by the count of its corresponding n(=1,2)-gram phrase, resulting in a normalized count. For each Google n(=3,4)-gram pair, the minimum and maximum of the two normalized counts are determined, and we calculate the ratio between them (i.e., minimum/maximum). Then, for each Google n(=3,4)-gram pair, we multiply the ratio by the sum of the two Google n(=3,4)-gram counts, producing the strength of association. Finally, we compute the mean ($u_{sr}$) and standard deviation ($sd_{sr}$) of the strengths of association of the Google n(=3,4)-gram pairs. If a strength of association is within $u_{sr} \pm sd_{sr}$, the pair is kept; otherwise it is pruned.
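The statistical pruning step can be sketched as follows. The text does not say whether the population or the sample standard deviation is used; this sketch assumes the population form (`statistics.pstdev`). The function names and all counts are hypothetical.

```python
from statistics import mean, pstdev


def association_strength(c1, c2, phrase_count1, phrase_count2):
    """Ratio of the normalized counts times the sum of the raw n-gram counts."""
    n1 = c1 / phrase_count1          # n-gram count normalized by its phrase count
    n2 = c2 / phrase_count2
    ratio = min(n1, n2) / max(n1, n2)
    return ratio * (c1 + c2)


def statistical_prune(count_pairs, phrase_count1, phrase_count2):
    """Keep only pairs whose strength lies within mean +/- one standard deviation."""
    strengths = [association_strength(c1, c2, phrase_count1, phrase_count2)
                 for c1, c2 in count_pairs]
    mu = mean(strengths)
    sd = pstdev(strengths)           # assumption: population standard deviation
    return [pair for pair, s in zip(count_pairs, strengths)
            if mu - sd <= s <= mu + sd]


# Made-up counts: three similar pairs and one extreme pair.
example_pairs = [(10, 10), (12, 12), (11, 11), (1000, 1000)]
survivors = statistical_prune(example_pairs, 100, 100)
```

In this example the strengths are 20, 24, 22 and 2000; the last value lies above the mean plus one standard deviation, so the pair (1000, 1000) is pruned as an outlier.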

Computing Relatedness Strength
The relatedness strength between $P_1$ and $P_2$ is computed by multiplying the relatedness strengths obtained from the overlapping bi-gram contexts and from all bi-gram contexts.

Relatedness Strength using Overlapping Bi-gram Contexts
For each non-pruned Google n(=3,4)-gram pair having an overlapping bi-gram context, the strength of association is calculated following the Sum-Ratio technique. We sum the two Google n(=3,4)-gram counts and find the minimum and maximum of them. We then calculate the ratio between them (i.e., minimum/maximum). The Sum-Ratio value is calculated by multiplying the sum by the ratio, which signifies the strength of association for a Google n(=3,4)-gram pair. By summing the strengths of association of all Google n(=3,4)-gram pairs, we get the relatedness strength between the phrases $P_1$ and $P_2$, denoted by $\mathrm{RSOB}(P_1, P_2)$, as shown in Eq. 1. $GP_1$ and $GP_2$ are the Google n(=3,4)-grams that contain $P_1$ and $P_2$, respectively, and an overlapping bi-gram context. $C(GP_1)$ and $C(GP_2)$ are the counts of $GP_1$ and $GP_2$, respectively. $k$ is the number of non-pruned Google n(=3,4)-gram pairs, and the sum runs over these $k$ pairs.

$$\mathrm{RSOB}(P_1, P_2) = \sum_{i=1}^{k} \frac{\min(C(GP_1), C(GP_2))}{\max(C(GP_1), C(GP_2))} \times \mathrm{sum}(C(GP_1), C(GP_2)) \tag{1}$$
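As a small worked sketch of Eq. 1, the function below accumulates the Sum-Ratio value over a list of count pairs. The name `rsob` and the input format (a list of `(C(GP1), C(GP2))` tuples for the non-pruned pairs) are assumptions, and the counts are made up.

```python
def rsob(count_pairs):
    """Sum of (min / max) * (c1 + c2) over all non-pruned n-gram count pairs."""
    total = 0.0
    for c1, c2 in count_pairs:
        total += min(c1, c2) / max(c1, c2) * (c1 + c2)
    return total


# Two overlapping pairs with hypothetical counts:
#   (120, 60)  -> 0.5 * 180 = 90.0
#   (100, 100) -> 1.0 * 200 = 200.0
strength = rsob([(120, 60), (100, 100)])
```

The two contributions add up to 290.0, which would be the relatedness strength from overlapping bi-gram contexts for this toy input.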

Relatedness Strength using all Bi-gram Contexts
All bi-gram contexts of a phrase $P_1$ include both non-pruned overlapping and non-overlapping bi-gram contexts, extracted from the Google n(=3,4)-grams where $P_1$ appears. Two vectors $V_1$ and $V_2$ in the Vector Space Model are constructed for $P_1$ and $P_2$, respectively, using all of their corresponding bi-gram contexts. The elements of $V_1$ and $V_2$ are binary and reflect the presence or absence of a bi-gram context belonging to the phrases $P_1$ and $P_2$, correspondingly. The relatedness strength between $P_1$ and $P_2$ using all bi-gram contexts is designated as $\mathrm{cosSim}(V_1, V_2)$ and computed as the cosine similarity between $V_1$ and $V_2$, defined in Eq. 2.

$$\mathrm{cosSim}(V_1, V_2) = \frac{V_1 \cdot V_2}{\lVert V_1 \rVert \, \lVert V_2 \rVert} \tag{2}$$
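Because the vectors are binary, the cosine similarity reduces to the number of shared contexts divided by the square root of the product of the two context-set sizes. The sketch below uses Python sets in place of explicit vectors; the function name and the example contexts are illustrative assumptions.

```python
import math


def cos_sim_binary(contexts1, contexts2):
    """Cosine similarity of two binary vectors given as sets of bi-gram contexts."""
    if not contexts1 or not contexts2:
        return 0.0  # assumption: an empty context set yields zero similarity
    shared = len(contexts1 & contexts2)   # dot product of the binary vectors
    return shared / math.sqrt(len(contexts1) * len(contexts2))


# Hypothetical context sets: four contexts each, two in common.
ctx1 = {("of", "death"), ("of", "people"), ("in", "the"), ("a", "very")}
ctx2 = {("of", "death"), ("of", "money"), ("in", "the"), ("is", "spent")}
similarity = cos_sim_binary(ctx1, ctx2)   # 2 / sqrt(4 * 4)
```

With two shared contexts out of four on each side, the similarity is 0.5.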

Multiplying Relatedness Strengths from Overlapping and all Bi-gram Contexts
We multiply the relatedness strengths $\mathrm{RSOB}(P_1, P_2)$ and $\mathrm{cosSim}(V_1, V_2)$, obtained from the overlapping and all bi-gram contexts, respectively, to compute the overall relatedness strength $f(P_1, P_2)$ between the phrases $P_1$ and $P_2$, defined in Eq. 3. The purpose of multiplying these two strengths is to weight $\mathrm{RSOB}(P_1, P_2)$ by $\mathrm{cosSim}(V_1, V_2)$.

$$f(P_1, P_2) = \mathrm{RSOB}(P_1, P_2) \times \mathrm{cosSim}(V_1, V_2) \tag{3}$$

Normalizing Overall Relatedness Strength
The relatedness between phrases $P_1$ and $P_2$ is computed by normalizing the overall relatedness strength between 0 and 1 using NGD in conjunction with NGD′, as defined in Eq. 4. $C(P)$ is the count of phrase $P$, where $P$ is a Google n(=1,2)-gram. $N$ is the total number of web documents used in the Google n-gram corpus.

Computing Text Relatedness
First, punctuation is removed from the texts. The phrases are extracted using the phrase detection algorithm. The rest of the text, other than the phrases, is split into non-stop-words. The relatedness between two texts is calculated from the word-pair and phrase-pair relatedness, following the notion of text relatedness in (Islam et al., 2012). Word-pair relatedness is computed by the word relatedness method in (Islam et al., 2012).
Step 1: We assume that the two texts $A = \{a_1, a_2, \ldots, a_p\}$ and $B = \{b_1, b_2, \ldots, b_q\}$ have $p$ and $q$ tokens, respectively, and $p \le q$. Otherwise we switch A and B. A token is a word or bi-gram phrase.
Step 2: We count the number of common tokens ($\delta$) in both A and B, where $\delta \le p$. Common tokens are determined by applying PorterStemmer (Porter, 1980) on each token pair. Common tokens are removed from A and B, so $A = \{a_1, a_2, \ldots, a_{p-\delta}\}$ and $B = \{b_1, b_2, \ldots, b_{q-\delta}\}$. If all tokens match, i.e., $p - \delta = 0$, go to Step 5.
Step 3: We construct a $(p - \delta) \times (q - \delta)$ 'semantic relatedness matrix' (say, $M = (\alpha_{ij})_{(p-\delta) \times (q-\delta)}$) using the following process. We set $\alpha_{ij} \leftarrow \mathrm{relatedness}(a_i, b_j) \times w^2$, where $i = 1 \ldots p - \delta$, $j = 1 \ldots q - \delta$, and $w$ is a weighting factor to boost the relatedness score. The value of $w$ is the average number of words within a word-pair or phrase-pair. The reason for boosting is that a given relatedness score for a phrase-pair should carry more weight than the same score for a word-pair. If $(a_i, b_j)$ is a word-pair, $\mathrm{relatedness}(a_i, b_j)$ is the word-pair relatedness (Islam et al., 2012); otherwise it is the phrase-pair relatedness from Eq. 4.
Step 4: For each row, we compute the mean ($u$) and standard deviation ($sd$) of the relatedness scores and select the scores that are larger than $u + sd$. The idea is to find, for each of the $(p - \delta)$ tokens, the most related tokens among the $(q - \delta)$ tokens. The average of the selected scores is computed for each row, giving $(p - \delta)$ averages for the $(p - \delta)$ rows. We sum the $(p - \delta)$ average values and denote the result by SAvg.
Step 5: To compute relatedness between the texts A and B, we use the normalization in (Islam et al., 2012) with minor modification, given in Eq. 5.
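Steps 3 and 4 above can be sketched in Python as follows. The sketch stops at SAvg and does not attempt the normalization of Step 5. The function names are hypothetical, tokens are assumed to be whitespace-separated strings, the population standard deviation is assumed, and a row in which no score exceeds the mean plus standard deviation is assumed to contribute zero.

```python
from statistics import mean, pstdev


def weighted_score(relatedness, token_a, token_b):
    """Step 3: boost a relatedness score by w^2, w = average words per token in the pair."""
    w = (len(token_a.split()) + len(token_b.split())) / 2
    return relatedness * w ** 2


def s_avg(matrix):
    """Step 4: per row, average the scores above mean + sd, then sum the row averages."""
    total = 0.0
    for row in matrix:
        mu = mean(row)
        sd = pstdev(row)             # assumption: population standard deviation
        selected = [score for score in row if score > mu + sd]
        if selected:                 # assumption: rows with no selection add nothing
            total += mean(selected)
    return total


# A word-pair keeps its score (w = 1); a bi-gram pair is boosted by 4 (w = 2).
word_pair = weighted_score(0.5, "car", "automobile")
phrase_pair = weighted_score(0.5, "large number", "vast amount")

# Toy 2 x 3 matrix of already weighted scores.
result = s_avg([[0.1, 0.2, 0.9], [0.5, 0.5, 0.5]])
```

In the toy matrix, only 0.9 exceeds the first row's mean plus standard deviation, and the second row has no score above its threshold, so the sum of row averages is 0.9 under the stated assumptions.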

Run1
In the first run, we consider words, phrases and numbers as tokens. After removing punctuation and stop-words, if any sentence within a pair has no tokens, then the relatedness of that sentence pair is 0.

Run2
The tokens are the same as in the first run. After removing punctuation and stop-words, if any sentence within a pair has no tokens, then we keep the stop-words.

Run3
We consider words and phrases as tokens. The remaining steps are the same as in the first run.

Result
The results from the three different runs of TrWP are shown in Table 2.

Conclusion
TrWP is an unsupervised text relatedness method that combines both word and phrase relatedness. Both the word and phrase relatedness are computed in an unsupervised manner. The word relatedness is computed using the co-occurrences of two words in the Google 3-gram corpus. To compute phrase relatedness, TrWP uses an unsupervised function f based on the Sum-Ratio technique along with statistical pruning. Unlike other phrase relatedness methods based on word relatedness, f considers the whole phrase as a single unit without losing the inner semantic meaning of the phrase.