Neural Latent Relational Analysis to Capture Lexical Semantic Relations in a Vector Space

Capturing the semantic relations of words in a vector space contributes to many natural language processing tasks. One promising approach exploits lexico-syntactic patterns as features of word pairs. In this paper, we propose a novel model of this pattern-based approach, neural latent relational analysis (NLRA). NLRA can generalize co-occurrences of word pairs and lexico-syntactic patterns, and obtain embeddings of the word pairs that do not co-occur. This overcomes the critical data sparseness problem encountered in previous pattern-based models. Our experimental results on measuring relational similarity demonstrate that NLRA outperforms the previous pattern-based models. In addition, when combined with a vector offset model, NLRA achieves a performance comparable to that of the state-of-the-art model that exploits additional semantic relational data.


Introduction
Information on the semantic relations of words is important for many natural language processing tasks, such as recognizing textual entailment, discourse classification, and question answering. There are two main approaches to obtain the distributed relational representations of word pairs.
One is the vector offset method (Mikolov et al., 2013a,b). This approach represents word pairs as the vector offsets of their word embeddings. Another approach exploits lexico-syntactic patterns to obtain word pair representations. As a pioneering work, Turney (2005) introduced latent relational analysis (LRA), based on the latent relation hypothesis, which states that word pairs that co-occur in similar lexico-syntactic patterns tend to have similar semantic relations (Turney, 2008b; Turney and Pantel, 2010). LRA is expected to complement the vector offset model because word embeddings do not contain information on the lexico-syntactic patterns that connect word pairs in a corpus (Shwartz et al., 2016).
However, LRA cannot obtain the representations of word pairs that do not co-occur in a corpus. Even with a large corpus, observing co-occurrences of all semantically related word pairs is nearly impossible because of Zipf's law, which implies that most content words occur only rarely. This data sparseness problem is a major bottleneck of pattern-based models such as LRA.
In this paper, we propose neural latent relational analysis (NLRA) to solve this data sparseness problem. NLRA learns, in an unsupervised manner, the embeddings of target word pairs and co-occurring patterns from a corpus. In addition, it jointly learns a mapping from the word embedding space to the word-pair embedding space. Through this mapping, NLRA can generalize the co-occurrences of word pairs and patterns and obtain relational embeddings for arbitrary word pairs, even if they do not co-occur in the corpus.
Our experimental results on the task of measuring relational similarity show that NLRA significantly outperforms LRA and that it can also capture the semantic relations of word pairs without co-occurrences. Moreover, we show that combining NLRA and the vector offset model improves performance and leads to results competitive with those of the state-of-the-art method that exploits additional semantic relational data.

Vector Offset Model
The vector offset model (Mikolov et al., 2013a,b; Levy and Goldberg, 2014) obtains word embeddings from a corpus and represents each word pair (a, b) as the vector offset of their embeddings:

v_(a,b) = v_b − v_a,

where v_a and v_b are the word embeddings of a and b, respectively. This method regards relational information as the change in multiple topicality dimensions from one word to the other in the word embedding space (Zhila et al., 2013). Meanwhile, it does not contain the information of lexico-syntactic patterns, which were shown to capture information complementary to word embeddings in previous studies on lexical semantic relation detection (Levy et al., 2015; Shwartz et al., 2016).
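As a toy illustration of this method, the following sketch computes offsets from hand-made embeddings; the vectors and helper functions here are illustrative only, and a real system would load pre-trained embeddings such as GloVe.

```python
import numpy as np

# Hand-made toy embeddings; a real system would load pre-trained
# vectors such as GloVe. All values here are illustrative.
emb = {
    "king":  np.array([0.9, 0.1, 0.4]),
    "queen": np.array([0.9, 0.8, 0.4]),
    "man":   np.array([0.5, 0.1, 0.3]),
    "woman": np.array([0.5, 0.8, 0.3]),
}

def offset(a, b):
    """Represent the pair (a, b) as the vector offset v_b - v_a."""
    return emb[b] - emb[a]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Pairs holding the same relation should have similar offsets.
sim = cosine(offset("man", "woman"), offset("king", "queen"))
```

Here `sim` is close to 1 because both pairs differ along the same dimension in this toy space.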

Latent Relational Analysis
LRA takes a set of word pairs as input and generates the distributed representations of those word pairs based on their co-occurring patterns.
Given target word pairs W = {(a_1, b_1), ..., (a_n, b_n)}, LRA constructs a list of lexico-syntactic patterns that connect those pairs, such as "is a" or "in the", from the corpus for each word pair. Those patterns are then generalized by replacing any, all, or none of the intervening words with wildcards. As a feature selection step, the generalized patterns generated from many word pairs are used as features. We define the set of these target feature patterns as C = {p_1, ..., p_m}. Then, the 2n × 2m matrix M is constructed. The rows of M correspond to the pairs (a_i, b_i) and the reversed pairs (b_i, a_i). The columns of M correspond to the patterns "X p_i Y" and the swapped patterns "Y p_i X", where X and Y are slots for the words of the word pairs. The value of M_ij represents the strength of the association between the corresponding word pair and pattern, calculated using a weighting scheme such as positive pointwise mutual information (PPMI). Finally, singular value decomposition (SVD) is applied to M, and the resulting vector v_(a,b) is assigned to each word pair (a, b).
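The PPMI-and-SVD core of this pipeline can be sketched on a toy pair-by-pattern count matrix as follows; the pairs, patterns, and counts are invented for illustration, and LRA's pattern search and generalization steps are omitted.

```python
import numpy as np

# Toy co-occurrence counts: rows are word pairs, columns are patterns.
counts = np.array([
    [4.0, 1.0, 0.0],   # e.g., (cat, animal)
    [3.0, 2.0, 0.0],   # e.g., (dog, animal)
    [0.0, 0.0, 5.0],   # e.g., (wheel, car)
])

def ppmi(m):
    """Positive pointwise mutual information weighting of a count matrix."""
    total = m.sum()
    p_ij = m / total
    p_i = m.sum(axis=1, keepdims=True) / total
    p_j = m.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_ij / (p_i * p_j))
    return np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

M = ppmi(counts)
# Truncated SVD: keep k latent dimensions as the pair representations.
U, S, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
pair_vecs = U[:, :k] * S[:k]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))
```

Pairs that co-occur in similar patterns (the first two rows) end up with similar vectors, while the third pair does not.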
Although pattern-based approaches such as LRA have achieved promising results in some semantic relational tasks (Turney, 2008a,b), they have a crucial problem: co-occurrences of all semantically related word pairs cannot be observed, because of Zipf's law, which states that the frequency distribution of words has a long tail. In other words, most words occur very rarely (Hanks, 2009). For word pairs without co-occurrences, LRA cannot obtain vector representations.

Neural Latent Relational Analysis
We introduce NLRA, based on the latent relation hypothesis. NLRA represents the target word pairs and lexico-syntactic patterns as embeddings. Similar to the skip-gram model (Mikolov et al., 2013a), NLRA updates those representations in an unsupervised manner, such that the inner products of word pairs and the patterns in which they co-occur in a corpus have high values. Through this learning, word pairs that co-occur in similar patterns obtain similar embeddings. Moreover, NLRA can generalize the co-occurrences of word pairs and patterns by constructing the embeddings of word pairs from their word embeddings, thus solving the data sparseness problem of word co-occurrences. Therefore, NLRA can provide representations that capture the information of lexico-syntactic patterns even for word pairs that do not co-occur in a sentence. Figure 1 is an illustration of our model. NLRA encodes a word pair (a, b) into a dense vector as follows:

v_(a,b) = MLP([v_a ; v_b ; v_b − v_a]),

where [v_a ; v_b ; v_b − v_a] is the concatenation of the word embeddings of a and b and their vector offset, and MLP is a multilayer perceptron with nonlinear activation functions. A pattern p is a sequence of words w_1, ..., w_k. The sequence of the corresponding word embeddings is encoded using a long short-term memory (LSTM) network (Hochreiter and Schmidhuber, 1997), and the final output vector v_p is used as the pattern embedding.
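A minimal numerical sketch of the two encoders follows. For brevity it substitutes a two-layer MLP without batch normalization for the paper's three-layer MLP, and a mean-of-embeddings pattern encoder for the LSTM; all embeddings and weights are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # toy embedding size; the paper uses 300-dimensional GloVe

# Toy word embeddings (illustrative).
vocab = ["cat", "animal", "is", "a"]
emb = {w: rng.normal(size=d) for w in vocab}

def pair_encoder(a, b, W1, W2):
    """MLP over [v_a ; v_b ; v_b - v_a]; a 2-layer stand-in for the
    paper's 3-layer MLP with batch normalization."""
    x = np.concatenate([emb[a], emb[b], emb[b] - emb[a]])
    h = np.tanh(W1 @ x)
    return np.tanh(W2 @ h)

def pattern_encoder(words):
    """Mean of word embeddings -- a simple stand-in for the LSTM
    pattern encoder used in the paper."""
    return np.mean([emb[w] for w in words], axis=0)

W1 = rng.normal(size=(d, 3 * d)) * 0.1
W2 = rng.normal(size=(d, d)) * 0.1

v_pair = pair_encoder("cat", "animal", W1, W2)
v_pat = pattern_encoder(["is", "a"])
score = float(v_pair @ v_pat)   # inner product fed to the sigmoid
```

The inner product `score` is what the training objective pushes up for observed (pair, pattern) co-occurrences.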
For unsupervised learning, we use the negative sampling objective (Mikolov et al., 2013a). Given a set of observed triples (a, b, p) ∈ D, where a and b are words such that (a, b) ∈ W or (b, a) ∈ W, and p is a pattern co-occurring with them in a corpus, the objective is as follows:

∑_{(a,b,p) ∈ D} ( log σ(v_(a,b) · v_p) + ∑_{p′ ∈ D′} log σ(−v_(a,b) · v_p′) ),

where D′ is a set of randomly generated negative samples and σ is the sigmoid function. We sampled 10 negative patterns for each word pair. This objective is maximized using stochastic gradient descent.
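The objective for a single observed triple can be sketched as follows; the embeddings are random stand-ins for the encoder outputs, and the 10 negative pattern vectors play the role of D′.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # toy dimensionality

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Stand-ins for the pair embedding, the observed pattern embedding,
# and 10 sampled negative pattern embeddings.
v_pair = rng.normal(size=d)
v_pos = rng.normal(size=d)
v_negs = rng.normal(size=(10, d))

# log sigma(v_pair . v_p) + sum over negatives of log sigma(-(v_pair . v_p'))
objective = np.log(sigmoid(v_pair @ v_pos)) + np.sum(
    np.log(sigmoid(-(v_negs @ v_pair)))
)
```

Training maximizes this quantity summed over all observed triples, raising the score of observed patterns and lowering the scores of sampled negatives.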
After unsupervised learning, we can obtain the representation v_(a,b) of an arbitrary word pair (a, b) directly from the trained encoder: v_(a,b) = MLP([v_a ; v_b ; v_b − v_a]).


Evaluation

Dataset
In our evaluation, we used the SemEval-2012 Task 2 dataset (Jurgens et al., 2012) for the task of measuring relational similarity. This dataset contains a collection of 79 fine-grained semantic relations. For each relation, there are a few prototypical word pairs and a set of several dozen target word pairs. The task is to rank the target pairs by the extent to which they exhibit the relation. In our experiment, we calculated the score of a target word pair as the average cosine similarity between it and the prototypical word pairs.
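This scoring rule can be sketched as follows; the vectors are toy stand-ins for pair embeddings, and the helper names are ours.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def relation_score(target_vec, prototype_vecs):
    """Average cosine similarity between a target pair embedding and
    the embeddings of a relation's prototypical pairs."""
    return float(np.mean([cosine(target_vec, p) for p in prototype_vecs]))

# Toy vectors (illustrative); in the paper these are pair embeddings.
protos = [np.array([1.0, 0.0]), np.array([0.9, 0.1])]
good = relation_score(np.array([1.0, 0.1]), protos)  # aligned with prototypes
bad = relation_score(np.array([0.0, 1.0]), protos)   # orthogonal to them
```

Target pairs are then ranked by this score within each relation.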
The models are evaluated in terms of MaxDiff accuracy and Spearman's correlation. Following previous work (Rink and Harabagiu, 2012; Zhila et al., 2013), we used the test set, which includes 69 semantic relations, to evaluate performance.

Baselines
VecOff. We used the 300-dimensional pre-trained GloVe embeddings (Pennington et al., 2014) and represented word pairs as described in Section 2.1.
LRA. We implemented LRA as described in Section 2.2. We set W to the lemmatized word pairs of the dataset and used the English Wikipedia as the corpus. For each word pair, we searched for patterns of one to three words. When searching for patterns, the words immediately to the left and right of each pattern were lemmatized to ignore inflections. Following Turney (2008b), we selected the top 20|W| generalized patterns as C. Then, M was constructed using PPMI weighting, and its dimensionality was reduced to 300 using SVD.
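The wildcard generalization step described in Section 2.2 can be sketched as follows (the function name is ours):

```python
from itertools import product

def generalize(pattern_words):
    """Replace any, all, or none of the intervening words with a
    wildcard '*', as in LRA's pattern generalization step."""
    options = [(w, "*") for w in pattern_words]
    return sorted({" ".join(p) for p in product(*options)})

# A three-word pattern yields 2^3 = 8 generalized patterns.
pats = generalize(["is", "a", "kind"])
```

Each extracted pattern thus contributes up to 2^k generalized variants, from which the most frequent ones are kept as features.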

Our methods
NLRA. For each word pair in the dataset, co-occurring patterns were extracted from the same corpus in the same manner as for LRA, resulting in D. For word embeddings, we used the same pre-trained GloVe as for VecOff; these embeddings were updated during training. For the MLP, we used three affine transformations, each followed by batch normalization (Ioffe and Szegedy, 2015) and tanh activation. The size of each hidden layer of the MLP was 300. To encode the patterns, we used an LSTM with a 300-dimensional hidden state. The objective was optimized with AdaGrad (Duchi et al., 2011) with a learning rate of 0.01. We trained the model for 50 epochs.
NLRA+VecOff. This method combines NLRA and VecOff by averaging their scores for a target word pair.

Result and Analysis
Table 1 displays the overall result.

NLRA vs. LRA
First, NLRA outperformed LRA in terms of both average accuracy and correlation. These differences were statistically significant (p < 0.01) according to a paired t-test. These results indicate that generalizing patterns with an LSTM is more effective than generalizing them with wildcards. Moreover, NLRA can successfully calculate relational similarity for word pairs that do not co-occur in the corpus. Table 2 shows an example of the Reference-Express relation, where the middle-score pair handshake:cordiality and the low-score pair friendliness:wink have no co-occurring pattern. In these cases, LRA could neither obtain the representations of those word pairs nor correctly assign their scores. By contrast, NLRA could do both because it generalizes the co-occurrences of word pairs and patterns.

NLRA+VecOff vs. Other Models
Second, NLRA+VecOff outperformed the other models. These differences were statistically significant (the correlation difference between NLRA+VecOff and NLRA: p < 0.05; the other differences: p < 0.01). These results indicate that lexico-syntactic patterns and the vector offsets of word embeddings capture complementary information for measuring relational similarity. This is inconsistent with the findings of Zhila et al. (2013), who combined heterogeneous models, such as the vector offset model and a pattern-based model, and stated, based on an ablation study, that the pattern-based model was less significant than the vector offset model. We believe this was because their pattern-based model neither generalized patterns with wildcards nor selected useful features, and thus suffered from a sparse feature space. In our experiment, NLRA helped VecOff, for example, for the Part-Whole, Cause-Purpose, and Space-Time relations, where there seemed to be prototypical patterns indicating those relations. Meanwhile, VecOff helped NLRA for the Attribute relation, where the relational patterns seemed to be diverse. These results show that the combined model is robust.

Table 3: Comparison with other published results.

Model                       Accuracy  Correlation
Rink and Harabagiu (2012)   0.394     0.229
Mikolov et al. (2013b)      0.418     0.275
Levy and Goldberg (2014)    0.452     -
Zhila et al. (2013)         0.452     0.353
Iacobacci et al. (2015)     0.…       …

Comparison to other systems
We compared the results of our models with other published results; Table 3 displays them. Rink and Harabagiu (2012) is a pattern-based model with naive Bayes. Mikolov et al. (2013b), Levy and Goldberg (2014), and Iacobacci et al. (2015) are vector offset models. Zhila et al. (2013) is a model composed of various features. Turney (2013) extracts statistical features of two word pairs from a word-context co-occurrence matrix and trains a classifier with additional semantic relational data to assign a relational similarity to two word pairs. NLRA+VecOff achieved performance competitive with the state-of-the-art method of Turney (2013). Note that our method learns in an unsupervised manner and does not exploit additional resources, and that the method of Turney (2013) cannot obtain distributed representations of word pairs.
In a work similar to ours, Bollegala et al. (2015) represented lexico-syntactic patterns as the vector offsets of co-occurring word pairs and updated the vector offsets of word pairs such that word pairs co-occurring in similar patterns have similar offsets. They evaluated their model on all 79 semantic relations of the dataset and achieved 0.449 accuracy. In their setting, NLRA+VecOff achieved 0.47 accuracy, outperforming their model.


Related Work

Word Pairs and Co-occurring Patterns
Hearst (1992) detected the hypernymy relation of word pairs in a corpus using several handcrafted lexico-syntactic patterns. Turney and Littman (2005) used 64 handcrafted lexico-syntactic patterns as features of word pairs to represent them as vectors. To obtain word-pair embeddings, Turney (2005) extended the method of Turney and Littman (2005) into LRA. Our work is a neural extension of LRA. Washio and Kato (2018) proposed a method similar to ours for lexical semantic relation detection. Their neural method modeled the co-occurrences of word pairs and the dependency paths connecting the two words to alleviate the data sparseness problem of pattern-based lexical semantic relation detection. While they assigned randomly initialized embeddings to each dependency path, our work encodes co-occurring patterns with an LSTM for better generalization. Jameel et al. (2018) embedded word pairs using the context words occurring around them instead of lexico-syntactic patterns. Their method cannot obtain embeddings of word pairs that do not co-occur in a corpus, because it directly assigns an embedding to each word pair. By contrast, NLRA can obtain embeddings for those word pairs.
In another research area, relation extraction, several works have explored an idea similar to the latent relation hypothesis (Riedel et al., 2013; Toutanova et al., 2015; Verga et al., 2017). They factorized a matrix of entity pairs and co-occurring patterns, although they focused on named entity pairs instead of word pairs and did not consider co-occurrence frequencies.

Relation to Knowledge Graph Embedding
Knowledge graph embedding (KGE) embeds the entities and relations of a knowledge graph (KG), where entities and relations correspond to nodes and edges, respectively (Nickel et al., 2011; Bordes et al., 2013; Socher et al., 2013; Wang et al., 2014; Lin et al., 2015; Nickel et al., 2016; Trouillon et al., 2016; Liu et al., 2017; Wang et al., 2017; Ishihara et al., 2018). By considering words and lexico-syntactic patterns as nodes and edges, respectively, a corpus can be viewed as a graph, i.e., a corpus graph (CG). Thus, NLRA can be regarded as a corpus graph embedding (CGE) model based on the latent relation hypothesis. Although KGE models can easily be applied to a CG, several differences exist between KGs and CGs. First, the nodes and edges of a CG are (sequences of) linguistic expressions, such as tokens, lemmas, and phrases. Thus, the nodes and edges of a CG may exhibit compositionality and ambiguity, whereas a KG does not have those properties. Second, the edges of a CG have weights based on co-occurrence frequencies, unlike the edges of a KG. Finally, a CG may have a large number of edge types, whereas the number of KG edge types is at most several thousand. An interesting research direction is exploring models suitable for CGE that capture the properties of linguistic expressions and their relations in the embedding space.

Conclusion
We presented NLRA, which learns distributed representations of word pairs that capture semantic relational information through co-occurring patterns encoded by an LSTM. This model jointly learns the mapping from the word embedding space into the word-pair embedding space to generalize the co-occurrences of word pairs and patterns. Our experiment on measuring relational similarity demonstrated that NLRA outperforms LRA and can successfully address the data sparseness problem of word co-occurrences, which is a major bottleneck in pattern-based approaches. Moreover, combining the vector offset model and NLRA yielded performance competitive with the state-of-the-art method, even though our method relies only on unsupervised learning. This combined model exploits the complementary information of lexico-syntactic patterns and word embeddings.
In our future work, we will apply word-pair embeddings from NLRA to various downstream tasks related to lexical relational information.