Measuring Similarity from Word Pair Matrices with Syntagmatic and Paradigmatic Associations

Two types of semantic similarity are usually distinguished: attributional and relational similarity. These similarities measure the degree of semantic relatedness between words or between word pairs. Attributional similarity is bidirectional, while relational similarity is one-directional. It is possible to compute such similarities from the occurrences of words in actual sentences. Inside sentences, syntagmatic associations and paradigmatic associations can be used to characterize the relations between words or word pairs. In this paper, we propose a vector space model built from syntagmatic and paradigmatic associations to measure relational similarity between word pairs from the sentences contained in a small corpus. We conduct two experiments with different datasets: SemEval-2012 task 2 and 400 SAT word analogy quizzes. The experimental results show that our proposed method is effective when using a small corpus.


Introduction
Semantic similarity is a complex concept which has been widely discussed in many research domains (e.g., linguistics, philosophy, information theory, communication, or artificial intelligence). In natural language processing (NLP), two types of semantic similarity are identified: attributional and relational similarity. To date, many researchers have reported methods for measuring these similarities.
Attributional similarity consists in comparing the semantic attributes of each word. For example, the two words car and automobile share many attributes and, consequently, their attributional similarity is high, whereas the attributional similarity between car and drive is low. A high attributional similarity means that the words are structurally similar. Indeed, car and automobile are considered synonyms because they share almost all of their structural attributes. Attributional similarity is not confined to synonymy but also covers relations such as hypernymy/hyponymy.
Relational similarity compares the semantic relations between pairs of words. For example, fish : fins :: bird : wings asserts that fish is to fins as bird is to wings: i.e., the semantic relations between fish and fins are highly similar to the semantic relations between bird and wings. To find the relational similarity between two words, knowledge resources such as WordNet (Miller, 1995) or ontologies (Suchanek et al., 2007) are generally used. Lexical syntactic patterns between two words also help in identifying relational similarity. For instance, the lexical syntactic pattern 'is a' helps to identify hypernyms (Hearst, 1992; Snow et al., 2004).
To measure the attributional similarity between words or the relational similarity between word pairs, Vector Space Models (VSM) are mainly used (Turney and Littman, 2005; Turney, 2006). The expressiveness of a vector space model differs in the way its matrices are built, and the construction of the matrices can be based on two types of associations. In this paper, we use two types of associations which are well known in linguistics: syntagmatic associations and paradigmatic associations.
Syntagmatic associations originate from word co-occurrences in texts. Latent Semantic Analysis (LSA) relies on such syntagmatic associations. It has been successful at simulating a wide range of psychological and psycholinguistic phenomena, including judgments of semantic similarity (Landauer and Dumais, 1997). Paradigmatic associations, in contrast, reflect the semantic attributes of words more directly. Hyperspace Analogue to Language (HAL) (Lund and Burgess, 1996) is related to LSA, but also makes use of paradigmatic associations by capitalizing on positional similarities between words across contexts. LSA and HAL simply consider different types of space built from texts, and the differences are reflected in the structural representations formed by each model (Jones and Mewhort, 2007).
In this paper, we propose a vector space model with both syntagmatic and paradigmatic associations to measure relational similarity between word pairs. The dimensions for each word pair in our proposed model capture the distribution of the words around and between the two words of the pair. To avoid data sparseness in these dimensions, we apply a word clustering method in a preprocessing step. We then build our proposed model with syntagmatic and paradigmatic associations on the results of the clustering step. To evaluate our model, we conduct two relational similarity experiments: SemEval-2012 task 2 and Scholastic Assessment Test (SAT) analogy quizzes.
The rest of the paper is organized as follows. We describe related research in Section 2. Our proposed vector space model capturing syntagmatic and paradigmatic associations is presented in Section 3. The experimental results and evaluations on relational similarity and SAT analogy quizzes are shown in Section 4. We present our conclusions in Section 5.

Related work
A popular approach with vector space models for measuring similarities between words is to compare the distributions of words in large text data. The underlying assumption is the distributional hypothesis (Harris, 1954): words with similar distributions in language should have similar meanings. The two main approaches for producing word spaces, LSA and HAL, differ in the way context vectors are produced. LSA, with its term-document matrices, has a great potential for measuring semantic similarity between words: it capitalizes on a word's contextual co-occurrence, but not on how a word is used in that context. HAL's co-occurrence matrix is a sparse word-word matrix: in HAL, words that appear in similar positions around the same words tend to develop similar vector representations. HAL is thus related to LSA, but HAL can be said to insist more on paradigmatic associations and LSA more on syntagmatic associations.
Bound Encoding of the AGgregate Language Environment (BEAGLE) (Jones and Mewhort, 2007) is a model that combines syntagmatic and paradigmatic associations. The BEAGLE model has two matrices for representing word meanings: one for order information and another one for contextual information. By combining the order information and the contextual information, the BEAGLE model can express both syntagmatic and paradigmatic associations. These models are built from word-word co-occurrences and word-document (context) co-occurrences, and thus measure only attributional similarity between words. We claim, however, that attributional similarity between words is of little value here. For example, the attributional similarity between "fish" and "fins" is weak, and the same holds between "bird" and "wings". In terms of relational similarity, however, there is a high similarity between "fish:fins" and "bird:wings". This suggests that there is more potential in comparing word pairs than in comparing single words.

Turney and Littman (2005) used an approach called Latent Relational Analysis (LRA) in which a vector space of distributional features is derived from a large Web corpus and then reduced using singular value decomposition (SVD). For measuring relational similarity, the similarity between two pairs is calculated as the cosine of the angle between the vectors that represent the two pairs. The main difference between LSA and LRA is the way the semantic space is built. In LSA, word-document matrices are built for measuring attributional similarity between words, as mentioned above. In LRA, pair-pattern matrices are built for measuring relational similarity between word pairs. As an extension, Turney (2008) designed the Latent Relation Mapping Engine (LRME), combining ideas from the Structure Mapping Engine (SME) (Gentner, 1983) and LRA, to remove the requirement for hand-coded representations in SME. Here, we consider that syntagmatic and paradigmatic associations can be adapted to pair-pattern matrices for measuring relational similarity: the pair-feature matrices of our proposed model are an extension of such pair-pattern matrices.

Proposed model
In this section, we describe our proposed pair-feature matrices which capture syntagmatic and paradigmatic associations. To build the pair-feature matrices, we consider that syntagmatic associations between words are co-occurrences and paradigmatic associations are substitutions between words in the same contexts. The direct use of such features leads to a large number of dimensions, which may result in data sparseness. Section 3.1 will be dedicated to the solution we propose to avoid this problem. We show how to build our pair-feature matrices with syntagmatic and paradigmatic associations in Section 3.2.

Data sparseness
A critical problem in statistical natural language processing is data sparseness. One way to reduce this problem is to group words into equivalence classes. Typically, word classes are used in language modeling to reduce the problem of data sparseness.
The practical goal of our proposal is to achieve reasonable performance in measuring relational similarity and solving semantic proportional analogies from a small corpus. We will show that even small corpora have great potential for measuring similarity in actual tasks. Building pair-feature matrices in such a setting obviously leads to sparseness, since word pairs rarely co-occur in the sentences of small corpora. To reduce this problem, we use clustering to group words into equivalence classes. Here, we make use of monolingual word clustering (Och, 1999), a method based on maximum-likelihood estimation with a Markov model. We build our proposed pair-feature model, described in Section 3.2, on the results of word clustering.
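As a concrete illustration, the following minimal Python sketch loads a word-to-class mapping produced by a clustering tool such as the one of Och (1999) and annotates sentences with word classes. The two-column file format and the file name are assumptions for illustration, not the tool's actual output format.

```python
# Minimal preprocessing sketch: load a word-to-class mapping and annotate
# tokenized sentences with word classes. We assume a "word<TAB>class" file;
# adjust the parsing to the output format of the actual clustering tool.
def load_word_classes(path):
    word2class = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            word, cls = line.rstrip("\n").split("\t")
            word2class[word] = cls
    return word2class

def annotate(tokens, word2class, unk="cUNK"):
    """Replace each token by its word class; unknown words get a catch-all class."""
    return [word2class.get(tok, unk) for tok in tokens]

# Hypothetical usage:
# word2class = load_word_classes("clusters.tsv")
# annotate("tropical fish with huge fanlike pectoral fins".split(), word2class)
```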

Vector Space Model (VSM)
VSM (Salton et al., 1975) is an algebraic model for representing any object as a vector of identifiers. There are many ways to build a semantic space, like term-document, term-context, and pair-pattern matrices (Turney and Pantel, 2010). Turney (2006) showed that pair-pattern matrices are suited to measuring the similarity of semantic relations between pairs of words; that is, relational similarity. Conversely, word-context matrices are suited to measuring attributional similarity.
In this paper, we build a pair-feature vector space after preprocessing the training corpus with a word clustering method. In a pair-feature matrix, row vectors correspond to pairs of words, such as "fish:fins" and "bird:wings", and column vectors correspond to the features grouped by the word clustering method. We set the column vector size to 3 × N, where N is the number of word classes produced by the word clustering method described in the previous section. The reason for setting the vector size to three times the number of features is to represent syntagmatic and paradigmatic associations in our proposed model. Our main original idea is to build a column vector of affixes. A sentence containing a word pair is divided into three parts (see the sketch after this list):

• a prefix part, which consists of the word classes found around the first word of the word pair in the sentence, in a window of a given size called the context window;

• an infix part, which consists of the word classes of the words found between the two words of the word pair in the sentence;

• a suffix part, which consists of the word classes found around the second word of the word pair in the sentence, in a window of a given size (context window).

We suppose that prefixes and suffixes are paradigmatic features and that infixes are syntagmatic features. The paradigmatic features indirectly capture similar words around the first and the second words. In contrast, the syntagmatic features directly capture the syntactic pattern between a word pair. These features also characterize the syntactic structure of sentences, so the model delivers similar features for word pairs appearing in sentences exhibiting similar syntactic patterns. By combining syntagmatic and paradigmatic features, our proposed model expresses both associations in one vector space.
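The sketch below implements the affix decomposition just described, under stated assumptions: sentences are tokenized word lists, the resulting parts would then be mapped to word classes, and the single-occurrence case (paradigmatic parts only) is omitted for brevity.

```python
# Minimal sketch of the prefix/infix/suffix decomposition. `tokens` is a
# tokenized sentence; `window` is the context window size for the
# paradigmatic parts (2 in the experiments of Section 4).
def affix_parts(tokens, w1, w2, window=2):
    """Return (prefix, infix, suffix) word lists for the pair w1:w2,
    or None if the two words do not co-occur in the sentence."""
    if w1 not in tokens or w2 not in tokens:
        return None
    i, j = tokens.index(w1), tokens.index(w2)
    if i > j:                      # normalize occurrence order for simplicity
        i, j = j, i
    prefix = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
    infix = tokens[i + 1:j]        # syntagmatic part: words between the pair
    suffix = tokens[max(0, j - window):j] + tokens[j + 1:j + 1 + window]
    return prefix, infix, suffix

# affix_parts("tropical fish with huge fanlike pectoral fins for underwater gliding".split(),
#             "fish", "fins")
```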
We show below an example of how to build our pair-feature matrix representation. Let us consider the three following sentences:

(i) diurnal bird of prey typically having short rounded wings and a long tail,
(ii) tropical fish with huge fanlike pectoral fins for underwater gliding,
(iii) the occupation of catching fish for a living.

The words in the three sentences are clustered by the word clustering tool as indicated in Table 1. Given the sentences annotated with the word classes, we add up weights for each class c for each feature part in the pair-feature matrix (see Table 5) according to the following formula.
$$\mathrm{weight}(c) = \begin{cases} -f(c)\,\log p(c), & \text{if } w_1 \text{ and } w_2 \text{ co-occur in the sentence} \\ f(c), & \text{if only one of } w_1 \text{ or } w_2 \text{ occurs in the sentence} \\ 0, & \text{if neither } w_1 \text{ nor } w_2 \text{ occurs in the sentence} \end{cases} \quad (1)$$

Here, c is the class of a word (e.g., c1, c2, or c3) and f(c) is the frequency of c, i.e., the number of times the class c appears in the sentence considered for each feature part (prefix, infix, suffix). The proportion p(c) of a class is the relative proportion of occurrences of this class computed over the entire corpus. We show how to compute each feature in Table 2.

Table 2: Computation of weights for a given c and a given word pair "w1:w2" for a given sentence.

                              prefix (around w1)   infix (between w1 and w2)   suffix (around w2)
w1 and w2 co-occur            -f(c) log p(c)       -f(c) log p(c)              -f(c) log p(c)
only one of w1 or w2 occurs   f(c)                 0                           f(c)

If the two words of the pair co-occur in a sentence, the weight is modified by the self-information -log p(c). If one word of the pair occurs alone in a sentence, we compute only the paradigmatic feature parts (the syntagmatic feature part, the infix, is 0). All the weights coming from all the sentences are added up for each class for each feature part in the final vector corresponding to one word pair. In classical VSMs, the weighting scheme would be point-wise mutual information or TF-IDF; Formula 1 plays this role here.
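A minimal sketch of Formula 1 follows. The class labels and the proportion p(c1) ≈ 0.167 (so that -log p(c1) = 1.79, as in Table 1) are taken from the running example; everything else is an illustration.

```python
import math
from collections import Counter

# Sketch of the weighting in Formula 1. `class_prop` holds p(c), the relative
# proportion of each class over the whole corpus; `part` is the class sequence
# of one feature part (prefix, infix, or suffix) for one sentence.
def part_weights(part, class_prop, cooccur):
    """Return {class: weight} for one feature part of one sentence."""
    freqs = Counter(part)                      # f(c) per class in this part
    weights = {}
    for c, f in freqs.items():
        if cooccur:                            # both words of the pair occur
            weights[c] = f * -math.log(class_prop[c])
        else:                                  # one word alone: paradigmatic
            weights[c] = f                     # parts keep the raw frequency
    return weights

# Example: f(c1) = 2 and -log p(c1) = 1.79 give the weight 3.58 of Table 5.
# part_weights(["c1", "c1", "c3"], {"c1": 0.167, "c3": 0.25}, cooccur=True)
```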
For example, given the word pair "fish:fins", consider the three sentences again. In sentence (ii), the words between fish and fins form the syntagmatic (infix) feature part, the only one in this example, and the words around fish and fins form the paradigmatic (prefix and suffix) feature parts. In sentence (iii), only the prefix part applies, because fish, the first word of the word pair "fish:fins", occurs alone. We show the computation of f in Table 3 for the word pair "fish:fins": following our main idea, the prefixes are the words around fish, the infixes are the words between fish and fins, and the suffixes are the words around fins.

Let us now consider a word pair which is not found in any sentence, e.g., the word pair "fish:eat". The computation of f in this case is shown in Table 4. The word fish occurs in sentence (iii), but the word eat does not appear in any sentence; consequently, the frequency of each class is 0 in the suffix feature part.

Table 5 shows the pair-feature matrix computed from the three above sentences for three word pairs. Each cell in Table 5 is computed using the results given in Tables 1-4. For example, for "fish:fins" the value for c1 in the prefix is 3.58 (computed according to Formula 1 using -log p(c1) = 1.79 from Table 1 and f(c1) = 2 from Table 3). The infix cells corresponding to "fish:eat" are all 0.0 because of the null values for each class in Table 4.

After building the pair-feature space, we make use of SVD to induce an approximation space. SVD is used to reduce the noise and compensate for the zero vectors in the model. The decomposition is as follows:

$$M = U \Sigma V^{\top} \quad (2)$$

Here, M is the pair-feature matrix (dimensions: n × m), U is the pair matrix (dimensions: n × r), Σ is a diagonal matrix of singular values (dimensions: r × r), and V is the feature matrix (dimensions: m × r); n is the number of word pairs, m is the number of classes grouped by the word clustering method, and r is the rank of M. If M is of rank r, then Σ is also of rank r. We can choose a value k ≤ r and use Formula 3 instead of the full decomposition.
Let Σ_k, where k ≤ r, be the diagonal matrix formed from the top k singular values, and let U_k and V_k be the matrices produced by selecting the corresponding columns of U and V:

$$\hat{M} = U_k \Sigma_k V_k^{\top} \quad (3)$$

We determine k (the latent size) for our experiments empirically. This formula amounts to removing noise from the matrix M by dimension reduction. Section 4 shows how we set the parameters in our experiments.
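The following sketch shows Formulas 2-3 with NumPy, under the assumption that M is the dense pair-feature matrix built above; k = 40 matches the experiments of Section 4.

```python
import numpy as np

# Sketch of Formulas 2-3: SVD of the pair-feature matrix M followed by a
# rank-k truncation. For cosine comparisons between rows, the rows of
# U_k * diag(s_k) yield the same similarities as the rows of the
# reconstructed matrix M_hat = U_k S_k V_k^T, since V_k has orthonormal columns.
def reduce_space(M, k=40):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)  # M = U diag(s) V^T
    k = min(k, len(s))
    return U[:, :k] * s[:k]        # reduced row vectors, one per word pair

# pair_vectors = reduce_space(M)   # M has shape (n_pairs, 3 * n_classes)
```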

Relational and attributional similarity
In our proposed framework, relational similarity can be measured by using the distributions over two word pairs. After building the new space M̂ according to Formula 3, we measure the relational similarity between word pairs such as "A:B" and "C:D" in a classical way by computing their cosine:

$$\mathrm{relsim}(i, j) = \frac{\hat{M}_i \cdot \hat{M}_j}{\|\hat{M}_i\|\,\|\hat{M}_j\|} \quad (4)$$

Here, i and j are word pair indexes and ||M̂_i|| is the norm of the corresponding row vector. It is usually thought that attributional similarity can be deduced from relational similarity, i.e., that the two notions are interrelated. For instance, Bollegala et al. (2012) showed how to measure the degree of synonymy between words using relational similarity. Their formula for measuring the attributional similarity between words A and B using relational similarity between word pairs is as follows:

$$\mathrm{attsim}(A, B) = \frac{1}{|T|} \sum_{(C:D) \in T} \mathrm{relsim}(A{:}B,\, C{:}D) \quad (5)$$

Here, T is a set of synonymous word pairs collected from WordNet and |T| is the cardinality of T. If the relation between A and B is highly similar to that between synonymous words, this means that A and B themselves must also be synonymous.
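A minimal sketch of Formulas 4-5 follows, reusing the reduced `pair_vectors` from the SVD sketch above; `pair_index`, which maps a pair string such as "fish:fins" to its row index, is an assumed helper structure.

```python
import numpy as np

# Sketch of Formula 4 (relational similarity as a cosine in the reduced space)
# and of the Bollegala et al. (2012)-style averaging of Formula 5.
def relsim(pair_vectors, i, j):
    vi, vj = pair_vectors[i], pair_vectors[j]
    denom = np.linalg.norm(vi) * np.linalg.norm(vj)
    return float(vi @ vj / denom) if denom else 0.0

def attsim(pair_vectors, pair_index, a_b, synonym_pairs):
    """Average relational similarity of the pair "A:B" to the synonym set T."""
    i = pair_index[a_b]
    sims = [relsim(pair_vectors, i, pair_index[t]) for t in synonym_pairs]
    return sum(sims) / len(sims)
```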
To test measures of attributional similarity between words, the Miller-Charles dataset (Miller and Charles, 1991) is commonly used. The dataset consists of 30 word pairs such as "gem:jewel", all of them nouns. The relatedness of each word pair was rated by 38 human subjects, using a scale from 0 to 4. It should be said that the application of our proposed model to this task delivers results (0.28) which are far below the usually reported scores (around 0.87). This is explained by the fact that our model is not designed for attributional similarity, but aims directly at measuring relational similarity. The results indicate that the paradigmatic features are not useful for measuring the attributional similarity between words in our proposed model. As another method for measuring the attributional similarity between words, point-wise mutual information is generally used.
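For completeness, a hedged evaluation sketch for this experiment follows; the text does not state which correlation coefficient is used, so rank correlation is an assumption here.

```python
from scipy.stats import spearmanr

# Hedged sketch of the Miller-Charles evaluation: correlate model scores with
# the human ratings (0-4) over the 30 word pairs. The choice of Spearman rank
# correlation is an assumption, not stated in the text.
def miller_charles_eval(model_scores, human_ratings):
    rho, _ = spearmanr(model_scores, human_ratings)
    return rho
```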

Experiments and results
We perform two experiments on two datasets to validate our proposed model for the purpose it was designed for: measuring relational similarity. In both experiments, we make use of one corpus which contains about 150,000 sentences and about one million tokens. We set the latent size k of Formula 3 to 40 to remove the noise in the matrices. The context window size is 2 for the paradigmatic features (prefixes and suffixes). The range of the syntagmatic features (infixes) is from 1 to 5 words.
The first experiment, in Section 4.1, directly outputs a measure of relational similarity. The second experiment, on SAT analogy quizzes in Section 4.2, uses relational similarity to rank candidates. In both experiments, we apply neither stemming nor stop-word removal.

Direct measure of relational similarity
To test our measure of relational similarity between word pairs, we make use of SemEval-2012 task 2 (Jurgens et al., 2012). Jurgens et al. (2012) constructed a dataset of prototypical ratings for 3,218 word pairs in 79 different relation categories with the help of Amazon Mechanical Turk.
There are two phases for measuring the degree of relational similarity in this task. The first phase generates pairs of a given relation; we do not perform this phase here. The second phase rates word pairs against given example pairs: the task is to select the least and the most illustrative word pairs among four word pairs ("oak:tree", "vegetable:carrot", "tree:oak", "currency:dollar") based on several given word pairs ("flower:tulip", "emotion:rage", "poem:sonnet"). To rate word pairs, the task makes use of the MaxDiff technique (Louviere and Woodworth, 1991). The set of 79 word relations was randomly split into training and testing sets: the training set contains 10 relations and the test set contains 69 relations. For each relation, about one hundred questions were created.
We now present how to determine the least and most illustrative word pairs among the four word pairs. The formula for rating a word pair is as follows:

$$\mathrm{score}(A{:}B) = \frac{1}{|T|} \sum_{(C:D) \in T} \mathrm{relsim}(A{:}B,\, C{:}D) \quad (6)$$

Here, relsim is the same as in Section 3.3, T is the set of given word pairs, and |T| is the number of given word pairs. The word pair with the highest score is the most illustrative and the one with the lowest score is the least illustrative of the four. This formula rates a word pair against several given word pairs by using relational similarity, since the relation between the given word pairs is proportional to that of a targeted word pair.
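A minimal sketch of this rating step follows, reusing `relsim` and `pair_index` from the sketches above; it assumes all candidate and example pairs are present in the pair-feature space.

```python
# Sketch of the SemEval-2012 task 2 rating step: score each candidate pair by
# its average relational similarity to the given example pairs T (Formula 6)
# and return the extremes as the most and least illustrative pairs.
def rate_candidates(pair_vectors, pair_index, candidates, examples):
    def score(p):
        i = pair_index[p]
        return sum(relsim(pair_vectors, i, pair_index[e])
                   for e in examples) / len(examples)
    ranked = sorted(candidates, key=score)
    return ranked[-1], ranked[0]   # (most illustrative, least illustrative)

# rate_candidates(pair_vectors, pair_index,
#                 ["oak:tree", "vegetable:carrot", "tree:oak", "currency:dollar"],
#                 ["flower:tulip", "emotion:rage", "poem:sonnet"])
```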
The overall score obtained with our proposed model is 35.1. Compared with the other methods reported on the ACL wiki in Table 6, our method scores lower, but higher than UTD-SVM. We also detail the results for each category in Table 7: our method obtains its lowest score for the PART-WHOLE category (30.4), and all the other category scores are lower than those of UTD-NB. We consider that syntagmatic and paradigmatic associations are easier to capture in our proposed model for the CLASS-INCLUSION category than for the PART-WHOLE category. Our pair-feature matrices are influenced by paradigmatic features when word pairs do not co-occur in any similar context. From these results, we consider that syntagmatic and paradigmatic associations are sufficient for measuring relational similarity in our model.

SAT analogy quizzes
We use 400 SAT analogy quizzes from a set of 501 (Dermott, 2002). 101 SAT analogy quizzes were discarded because they concern named entities (e.g., Van Buren : 8th :: Lincoln : 16th), symbolic or notational variants (e.g., V : X :: L : C), or the like, which are obviously out of the reach of our proposed model. Quizzes such as Van Buren : 8th :: Lincoln : 16th and V : X :: L : C are domain-specific cases, in that domain-specific knowledge is needed to solve them; no specific domain knowledge is needed to solve fish : fins :: bird : wings. We show an example of the resolution of a proportional analogy quiz in Table 8. To select a candidate out of the four, we rank the candidates using the relational similarity of the pair formed with each candidate as the fourth word of the quiz; the rank is computed using Formula 4. As an example, Table 8 gives the degrees of relational similarity for this quiz: the selected answer is furnish, and the semantic relation between the word pairs is synonymy.

The results on the 400 SAT analogy quizzes are given in Table 9, along with the accuracy of other methods.

Table 9: Evaluation compared with other methods.

Algorithm                          Accuracy
Random                             0.22
Word2vec (Mikolov et al., 2013a)   0.20
Ours                               0.27

We obtain the highest score with our proposed model against another model, Word2vec (Mikolov et al., 2013a), and a baseline model that draws a solution at random. It should be noticed that, here, word pairs do not only involve noun-noun pairs but also noun-verb pairs. Our model is effective in answering the proportional analogy quizzes by using syntagmatic and paradigmatic associations from a small corpus: it achieves this with a training corpus of about 10 megabytes to build the pair-feature vector space. By contrast, Word2vec requires 100 megabytes of training corpus and still fails at building a word space precise enough to beat random selection. This clearly shows that word clustering can make up for corpus size and yields better accuracy.

The SAT analogy quizzes and the SemEval-2012 task 2 are separate tasks. To assess the quality of proportional analogies, two aspects are needed: the vertical and horizontal dimensions. In the example fish : fins :: bird : wings, the vertical dimension is between "fish:bird" and "fins:wings" and the horizontal dimension is between "fish:fins" and "bird:wings". In all generality, we should examine the score function of proportional analogies on both vertical and horizontal dimensions, but in practice the vertical dimension is not so important in SAT analogy quizzes.
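A minimal sketch of the quiz-answering step follows, reusing `relsim` and `pair_index` from the sketches above; the candidate words in the usage comment are hypothetical, and unseen pairs are assumed to be present in the space.

```python
# Sketch of answering a SAT analogy quiz with Formula 4: each candidate word
# forms a pair with the third word of the quiz, and the candidate whose pair
# is most relationally similar to the stem pair is selected.
def answer_quiz(pair_vectors, pair_index, stem_pair, third_word, candidates):
    return max(candidates,
               key=lambda w: relsim(pair_vectors,
                                    pair_index[stem_pair],
                                    pair_index[third_word + ":" + w]))

# Hypothetical usage:
# answer_quiz(pair_vectors, pair_index, "fish:fins", "bird",
#             ["wings", "beak", "feathers", "claws"])
```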

Conclusion
Attributional similarity and relational similarity are usually distinguished in the study of semantic similarity. Many researchers have proposed a variety of vector space models to measure the attributional similarity between words or the relational similarity between word pairs. Such similarities are commonly used to solve semantic problems on words, phrases, or sentences in the NLP literature.
In this paper, we presented a pair-feature matrix model with syntagmatic and paradigmatic associations for measuring relational similarity. By dividing each sentence containing a word pair into three parts, we represented the syntagmatic and paradigmatic associations for each word pair. We made use of a word clustering method in a preprocessing step to cope with data sparseness. We performed two experiments with different datasets: SemEval-2012 task 2 and SAT analogy quizzes. These experiments show that the pair-feature matrix model with syntagmatic and paradigmatic associations is effective for measuring relational similarity. In future work, we propose to make use of stemming and stop-word removal to further reduce the noise that decreases the performance of the word clustering step we introduced to deal with data sparseness.