Filling Missing Paths: Modeling Co-occurrences of Word Pairs and Dependency Paths for Recognizing Lexical Semantic Relations

Recognizing lexical semantic relations between word pairs is an important task for many applications of natural language processing. One of the mainstream approaches to this task is to exploit the lexico-syntactic paths connecting two target words, which reflect the semantic relations of word pairs. However, this method requires that the considered words co-occur in a sentence. This requirement is hardly satisfied because of Zipf's law, which states that most content words occur very rarely. In this paper, we propose novel methods with a neural model of P(path|w1, w2) to solve this problem. Our proposed model of P(path|w1, w2) can be learned in an unsupervised manner and can generalize the co-occurrences of word pairs and dependency paths. This model can be used to augment the path data of word pairs that do not co-occur in the corpus and to extract features capturing relational information from word pairs. Our experimental results demonstrate that our methods improve on previous neural approaches based on dependency paths and successfully solve the focused problem.


Introduction
The semantic relations between words are important for many natural language processing tasks, such as recognizing textual entailment (Dagan et al., 2010) and question answering (Yang et al., 2017). Moreover, these relations have also been used as features for neural methods in machine translation (Sennrich and Haddow, 2016) and relation extraction (Xu et al., 2015). This type of information is provided by manually created semantic taxonomies, such as WordNet (Fellbaum, 1998). However, these resources are expensive to expand manually and have limited domain coverage. Thus, the automatic detection of lexico-semantic relations has been studied for several decades.
One of the most popular approaches is based on patterns that encode a specific kind of relationship (synonym, hypernym, etc.) between adjacent words. This type of approach is called a path-based method. Lexico-syntactic patterns between two words provide information on semantic relations. For example, if we see the pattern "animals such as a dog" in a corpus, we can infer that animal is a hypernym of dog. On the basis of this assumption, Hearst (1992) detected the hypernymy relation of two words from a corpus based on several handcrafted lexico-syntactic patterns, e.g., X such as Y. Snow et al. (2004) used indicative dependency paths, in which target word pairs co-occurred, as features and trained a classifier with these data to detect hypernymy relations.
In recent studies, Shwartz et al. (2016) proposed a neural path-based model that encoded dependency paths between two words into low-dimensional dense vectors with recurrent neural networks (RNNs) for hypernymy detection. This method avoids a sparse feature space and generalizes indicative dependency paths for detecting lexico-semantic relations. Their model outperformed the previous state-of-the-art path-based method. Moreover, they demonstrated that these dense path representations capture information complementary to word embeddings, which contain individual word features. This was indicated by the experimental result showing that the combination of path representations and word embeddings improved classification performance. In addition, Shwartz and Dagan (2016) showed that the neural path-based approach, combined with word embeddings, is effective in recognizing multiple semantic relations.
Although path-based methods can capture the relational information between two words, these methods can obtain clues only for word pairs that co-occur in a corpus. Even with a very large corpus, it is almost impossible to observe a co-occurrence of arbitrary word pairs. Thus, path-based methods are still limited in terms of the number of word pairs that can be correctly classified.
To address this problem, we propose a novel method that models P(path|w1, w2) in a neural, unsupervised manner, where w1 and w2 are the two target words and path is a dependency path that can connect w1 and w2. A neural model of P(path|w1, w2) can generalize co-occurrences of word pairs and dependency paths, and infer plausible dependency paths connecting two words that do not co-occur in a corpus. After unsupervised learning, this model can be used in two ways:

• Path data augmentation through predicting the dependency paths that are most likely to co-occur with a given word pair.
• Feature extraction of word pairs, capturing the information of dependency paths as contexts where two words co-occur.
While previous supervised path-based methods used only a small portion of a corpus, combining our models makes it possible to use an entire corpus for the learning process. Experimental results for four common datasets of multiple lexico-semantic relations show that our methods improve the classification performance of supervised neural path-based models.

Supervised Lexical Semantic Relation Detection
Supervised lexical semantic relation detection represents word pairs (w1, w2) as feature vectors v and trains a classifier with these vectors based on training data. For the word pair representations v, we can use the distributional information of each word and the path information in which the two words co-occur.
Several methods exploit word embeddings (Mikolov et al., 2013; Levy and Goldberg, 2014; Pennington et al., 2014) as distributional information. These methods use a combination of each word's embedding, such as vector concatenation (Baroni et al., 2012; Roller and Erk, 2016) or vector difference (Roller et al., 2014; Weeds et al., 2014; Vylomova et al., 2016), as the word pair representation. While these distributional supervised methods do not require co-occurrences of two words in a sentence, Levy et al. (2015) note that these methods do not learn the relationship between two words but rather the separate properties of each word, i.e., whether or not each word tends to have a target relation.
In contrast, supervised path-based methods can capture relational information between two words. These methods represent a word pair as the set of lexico-syntactic paths which connect the two target words in a corpus (Snow et al., 2004). However, these methods suffer from a sparse feature space, as they cannot capture the similarity between indicative lexico-syntactic paths, e.g., X is a species of Y and X is a kind of Y.

Neural Path-based Method
A neural path-based method can avoid the sparse feature space of the previous path-based methods (Shwartz et al., 2016; Shwartz and Dagan, 2016). Instead of treating an entire dependency path as a single feature, this model encodes the sequence of edges of a dependency path into a dense vector using a long short-term memory (LSTM) network (Hochreiter and Schmidhuber, 1997).
A dependency path connecting two words can be extracted from the dependency tree of a sentence.
For example, given the sentence "A dog is a mammal," with X = dog and Y = mammal, the dependency path connecting the two words is X/NOUN/nsubj/> be/VERB/ROOT/- Y/NOUN/attr/<. Each edge of a dependency path is composed of a lemma, part of speech (POS), dependency label, and dependency direction. Shwartz et al. (2016) represent each edge as the concatenation of its component embeddings:

e = [v_l; v_pos; v_dep; v_dir]  (1)

where v_l, v_pos, v_dep, and v_dir represent the embedding vectors of the lemma, POS, dependency label, and dependency direction, respectively. This edge vector e is the input of the LSTM at each time step. Here, h_t, the hidden state at time step t, is abstractly computed as:

h_t = LSTM(h_{t-1}, e_t)  (2)

where LSTM computes the current hidden state given the previous hidden state h_{t-1} and the current input edge vector e_t, following the LSTM architecture. The final hidden state vector o_p is treated as the representation of the dependency path p.
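As a concrete sketch, the edge and path encoding above can be illustrated in numpy; this is a minimal mock-up that substitutes a plain tanh RNN for the LSTM, with toy vocabularies and embedding sizes chosen for illustration (the paper's actual dimensions differ):

```python
import numpy as np

# Hypothetical toy vocabularies; sizes are illustrative only.
rng = np.random.default_rng(0)
lemma_emb = {w: rng.normal(size=4) for w in ["X", "be", "Y"]}
pos_emb   = {p: rng.normal(size=2) for p in ["NOUN", "VERB"]}
dep_emb   = {d: rng.normal(size=2) for d in ["nsubj", "ROOT", "attr"]}
dir_emb   = {d: rng.normal(size=1) for d in [">", "-", "<"]}

def edge_vector(lemma, pos, dep, direction):
    # Equation (1): e = [v_l; v_pos; v_dep; v_dir]
    return np.concatenate([lemma_emb[lemma], pos_emb[pos],
                           dep_emb[dep], dir_emb[direction]])

# "A dog is a mammal" -> X/NOUN/nsubj/> be/VERB/ROOT/- Y/NOUN/attr/<
path = [("X", "NOUN", "nsubj", ">"),
        ("be", "VERB", "ROOT", "-"),
        ("Y", "NOUN", "attr", "<")]

d_in, d_h = 9, 6                      # 4+2+2+1 input dims, toy hidden size
W_xh = rng.normal(scale=0.1, size=(d_h, d_in))
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))

def encode_path(edges):
    h = np.zeros(d_h)
    for edge in edges:                # h_t = f(h_{t-1}, e_t); tanh RNN stands in for the LSTM
        h = np.tanh(W_xh @ edge_vector(*edge) + W_hh @ h)
    return h                          # o_p: final hidden state

o_p = encode_path(path)
```

A real implementation would use a trained LSTM (the paper uses a two-layer LSTM); the sketch only shows how edge components are concatenated and consumed step by step.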
When classifying a word pair (w1, w2), the word pair is represented as the weighted average of the dependency path vectors that connect the two words in the corpus:

v_(w1,w2) = v_paths(w1,w2) = ( Σ_{p ∈ paths(w1,w2)} f_{p,(w1,w2)} · o_p ) / ( Σ_{p ∈ paths(w1,w2)} f_{p,(w1,w2)} )  (3)

where paths(w1, w2) is the set of dependency paths that connect w1 and w2 in the corpus, and f_{p,(w1,w2)} is the frequency of p in paths(w1, w2). The final output of the network is calculated as follows:

y = softmax(W v_(w1,w2) + b)  (4)

where W ∈ R^{|c|×d} is a linear transformation matrix, b ∈ R^{|c|} is a bias parameter, |c| is the number of output classes, and d is the size of v_(w1,w2). This neural path-based model can be combined with distributional methods. Shwartz et al. (2016) concatenated v_paths(w1,w2) with the word embeddings of w1 and w2, redefining v_(w1,w2) as:

v_(w1,w2) = [v_w1; v_paths(w1,w2); v_w2]  (5)

where v_w1 and v_w2 are the word embeddings of w1 and w2, respectively. This integrated model, named LexNET, exploits both path information and distributional information, and has high generalization performance for lexical semantic relation detection.
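The frequency-weighted averaging of Equation (3) and the softmax classification of Equation (4) can be sketched as follows; the path vectors, frequencies, and classifier weights are toy values, not learned ones:

```python
import numpy as np

# paths(w1, w2): hypothetical encoded path vectors o_p with corpus frequencies f_p.
paths = {"X/be/Y": (np.array([0.2, 0.5]), 3),     # (o_p, f_p)
         "X/such_as/Y": (np.array([0.8, -0.1]), 1)}

# Equation (3): frequency-weighted average of the path vectors.
num = sum(f * o for o, f in paths.values())
den = sum(f for _, f in paths.values())
v_paths = num / den

def softmax(z):
    z = z - z.max()                  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Equation (4): linear layer + softmax over |c| = 3 hypothetical relation classes.
rng = np.random.default_rng(1)
W = rng.normal(size=(3, 2))
b = np.zeros(3)
probs = softmax(W @ v_paths + b)

# LexNET (Equation 5) would instead classify [v_w1; v_paths; v_w2].
```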

Missing Path Problem
All path-based methods, including the neural ones, suffer from data sparseness, as they depend on word pair co-occurrences in a corpus. However, we cannot observe all co-occurrences of semantically related words even with a very large corpus because of Zipf's law, which states that the frequency distribution of words has a long tail; in other words, most words occur very infrequently (Hanks, 2009). In this paper, we refer to this phenomenon as the missing path problem. Because of the missing path problem, path-based models cannot find any clues for two words that do not co-occur. Thus, in the neural path-based method, paths(w1, w2) for these word pairs is padded with an empty path, like UNK-lemma/UNK-POS/UNK-dep/UNK-dir.
However, this process makes path-based classifiers unable to distinguish between semantically related pairs with no co-occurrences and pairs that have no semantic relation.
In an attempt to solve this problem, Necsulescu et al. (2015) proposed a method that used a graph representation of a corpus. In this graph, words and dependency relations are denoted as nodes and labeled directed edges, respectively. From this graph representation, paths linking two target words can be extracted through bridging words, even if the two words do not co-occur in the corpus. They represented word pairs as the sets of paths linking them on the graph and trained a support vector machine classifier with training data, thereby improving recall. However, the authors reported that this method still suffered from data sparseness.
In this paper, we address this missing path problem, which generally restricts path-based methods, by modeling P(path|w1, w2) with a neural network.

Method
We present a novel method for modeling P(path|w1, w2). The purpose of this method is to address the missing path problem by generalizing the co-occurrences of word pairs and dependency paths. To model P(path|w1, w2), we used the context-prediction approach (Collobert and Weston, 2008; Mikolov et al., 2013; Levy and Goldberg, 2014; Pennington et al., 2014), which is widely used for learning word embeddings. In our proposed method, word pairs and dependency paths are represented as embeddings that are updated with unsupervised learning through predicting path from w1 and w2 (Section 3.1).
After the learning process, our model can be used to (1) augment path data by predicting the plausibility of the co-occurrence of two words and a dependency path (Section 3.2); and to (2) extract useful features from word pairs, which reflect the information of co-occurring dependency paths (Section 3.3).

Unsupervised Learning
There are many possible ways to model P(path|w1, w2). In this paper, we present a straightforward and efficient architecture similar to the skip-gram model with negative sampling (Mikolov et al., 2013).
Figure 1: An illustration of our network for modeling P(path|w1, w2). Given a word pair (dog, animal), our model makes h of (dog, animal) similar to v_path of the observed co-occurring dependency path X/NOUN/nsubj/> be/VERB/ROOT/- Y/NOUN/attr/< and dissimilar to v_path of unobserved paths, such as X/NOUN/nsubj/> use/VERB/ROOT/- Y/NOUN/dobj/<, through unsupervised learning.
Figure 1 depicts our network structure, which is described below.

Data and Network Architecture
We are able to extract many triples (w1, w2, path) from a corpus after dependency parsing. We denote the set of these triples as D. These triples are the instances used for the unsupervised learning of P(path|w1, w2). Given (w1, w2, path), our model learns through predicting path from w1 and w2.
We encode word pairs into dense vectors as follows:

h′_(w1,w2) = f(W1 [v_w1; v_w2] + b1)  (6)
h_(w1,w2) = f(W2 h′_(w1,w2) + b2)  (7)

where [v_w1; v_w2] is the concatenation of the word embeddings of w1 and w2; W1, b1, W2, and b2 are the parameter matrices and bias parameters of the two linear transformations; f is a nonlinear activation function; and h_(w1,w2) is the representation of the word pair.
We associate each path with an embedding v_path, initialized randomly. While we use this simple way of representing dependency paths in this paper, an LSTM could be used to encode each path in the way described in Section 2.2. If an LSTM were used, learning time would increase, but similarities among paths would be captured.
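A sketch of this word pair encoder; tanh is an assumed nonlinearity between the two linear transformations (the paper does not name the exact activation), and the dimensions follow the experimental setup (50-dimensional word embeddings, 100-dimensional hidden layers):

```python
import numpy as np

rng = np.random.default_rng(0)
d_w, d_h = 50, 100                         # word-embedding and hidden sizes
W1 = rng.normal(scale=0.05, size=(d_h, 2 * d_w)); b1 = np.zeros(d_h)
W2 = rng.normal(scale=0.05, size=(d_h, d_h));     b2 = np.zeros(d_h)

def encode_pair(v_w1, v_w2):
    # Two transformations over the concatenation [v_w1; v_w2];
    # tanh is an assumption standing in for the unspecified activation.
    x = np.concatenate([v_w1, v_w2])
    h_prime = np.tanh(W1 @ x + b1)         # h'_(w1,w2)
    return np.tanh(W2 @ h_prime + b2)      # h_(w1,w2)

h = encode_pair(rng.normal(size=d_w), rng.normal(size=d_w))
```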

Objective
We used the negative sampling objective for training (Mikolov et al., 2013). Given the word pair representations h_(w1,w2) and the dependency path representations v_path, our model was trained to distinguish real (w1, w2, path) triples from incorrect ones. The log-likelihood objective is as follows:

L = Σ_{(w1,w2,path) ∈ D} log σ(v_path · h_(w1,w2)) + Σ_{(w1,w2,path′) ∈ D′} log σ(−v_path′ · h_(w1,w2))  (8)

where σ is the sigmoid function and D′ is the set of randomly generated negative samples.
We constructed n triples (w1, w2, path′) for each (w1, w2, path) ∈ D, where n is a hyperparameter and each path′ is drawn according to its unigram distribution raised to the 3/4 power. The objective L was maximized using the stochastic gradient descent algorithm.
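The per-triple negative-sampling loss and the 3/4-power sampling distribution can be sketched as follows; the path counts and vectors are hypothetical toy values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def triple_loss(h_pair, v_path, v_neg_paths):
    # One term of the objective, negated so SGD can minimize it:
    # -( log sigma(v_path . h) + sum_neg log sigma(-v_path' . h) )
    pos = np.log(sigmoid(v_path @ h_pair))
    neg = sum(np.log(sigmoid(-v @ h_pair)) for v in v_neg_paths)
    return -(pos + neg)

def negative_sampler(path_counts, n, rng):
    # Draw n negative paths from the unigram distribution ^ (3/4).
    paths = list(path_counts)
    p = np.array([path_counts[q] for q in paths], dtype=float) ** 0.75
    p /= p.sum()
    return rng.choice(paths, size=n, p=p)

rng = np.random.default_rng(0)
negs = negative_sampler({"X/be/Y": 100, "X/use/Y": 10, "X/eat/Y": 1},
                        n=5, rng=rng)
loss = triple_loss(np.array([1.0, 0.0]),          # h_(w1,w2)
                   np.array([2.0, 0.0]),          # observed path embedding
                   [np.array([-1.0, 0.0])])       # one negative path
```

Raising the counts to the 3/4 power flattens the distribution, so frequent paths are sampled less dominantly than under the raw unigram distribution.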

Path Data Augmentation
After the unsupervised learning described above, our model of P(path|w1, w2) can assign a plausibility score σ(v_path · h_(w1,w2)) to the co-occurrence of a word pair and a dependency path. We can then append the plausible dependency paths to paths(w1, w2), the set of dependency paths that connect w1 and w2 in the corpus, based on these scores.
We calculate the score of each dependency path given (X = w1, Y = w2) and append the k dependency paths with the highest scores to paths(w1, w2), where k is a hyperparameter. We perform the same process given (X = w2, Y = w1), with the exception of swapping X and Y in the dependency paths to be appended. As a result, we add 2k dependency paths to the set of dependency paths for each word pair. Through this data augmentation, we can obtain plausible dependency paths even when word pairs do not co-occur in the corpus. Note that we retain the empty path indicators of paths(w1, w2), as we believe that this information contributes to classifying two unrelated words.
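A minimal sketch of this scoring and top-k selection; the string-based X/Y swap is an illustrative simplification of swapping the slots in a full dependency path, and the path embeddings are toy values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def top_k_paths(h_pair, path_embs, k):
    # Score every candidate path by sigma(v_path . h) and keep the k best.
    scored = sorted(path_embs.items(),
                    key=lambda kv: sigmoid(kv[1] @ h_pair), reverse=True)
    return [p for p, _ in scored[:k]]

def augment(h_12, h_21, path_embs, k):
    # k paths scored with (X = w1, Y = w2), plus k paths scored with
    # (X = w2, Y = w1) whose X and Y slots are swapped before appending.
    swap = lambda p: p.replace("X", "#").replace("Y", "X").replace("#", "Y")
    return top_k_paths(h_12, path_embs, k) + \
           [swap(p) for p in top_k_paths(h_21, path_embs, k)]

# Toy demonstration with two candidate paths and 2-d embeddings.
embs = {"X/be/Y": np.array([1.0, 0.0]), "X/use/Y": np.array([0.0, 1.0])}
aug = augment(np.array([1.0, 0.0]), np.array([0.0, 1.0]), embs, k=1)
```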

Feature Extractor of Word Pairs
Our model can be used as a feature extractor for word pairs. We can exploit h_(w1,w2) to represent the word pair (w1, w2). This representation captures the information of the co-occurring dependency paths of (w1, w2) in a generalized fashion. Thus, h_(w1,w2) is used to construct the pseudo-path representation v_p-paths(w1,w2) of the word pair (w1, w2). This representation can be used for word pair classification tasks, such as lexical semantic relation detection.

Experiment
In this section, we examine how our method improves path-based models on several datasets for recognizing lexical semantic relations. In this paper, we focus on major noun relations, such as hypernymy, co-hyponymy, and meronymy.

Dataset
We relied on the datasets used in Shwartz and Dagan (2016): K&H+N (Necsulescu et al., 2015), BLESS (Baroni and Lenci, 2011), EVALution (Santus et al., 2015), and ROOT09 (Santus et al., 2016). These datasets were constructed with knowledge resources (e.g., WordNet, Wikipedia), crowd-sourcing, or both. We used the noun pair instances of these datasets. Table 1 displays the relations in each dataset used in our experiments. Note that, following Shwartz and Dagan (2016), we removed from EVALution the two relations Entails and MemberOf, which have few instances. For data splitting, we used the pre-split train/val/test sets from Shwartz and Dagan (2016) after removing all but the noun pairs from each set.

Corpus and Dependency Parsing
For path-based methods, we used the June 2017 Wikipedia dump as a corpus and extracted (w1, w2, path) triples of noun pairs using the dependency parser of spaCy to construct D. In this process, w1 and w2 were lemmatized with spaCy. We only used the dependency paths which occurred at least five times, following the implementation of Shwartz and Dagan (2016). Table 2 displays the number of instances and the proportion of the instances for which at least one dependency path was obtained.

(We focused only on noun pairs to shorten the unsupervised learning time, though this restriction is not necessary for our methods, and the unsupervised learning remains tractable without it.)

Baseline
We conducted experiments with three neural path-based methods. The implementation details below follow those in Shwartz and Dagan (2016). We implemented all models using Chainer.

Neural Path-Based Model (NPB). We implemented and trained the neural path-based model described in Section 2.2. We used a two-layer LSTM with 60-dimensional hidden units. An input vector was composed of embedding vectors of the lemma (50 dims), POS (4 dims), dependency label (5 dims), and dependency direction (1 dim). Regularization was applied by dropout on each of the component embeddings (Iyyer et al., 2015; Kiperwasser and Goldberg, 2016).
LexNET. We implemented and trained the integrated model LexNET as described in Section 2.2. The LSTM details are the same as in the NPB model.
LexNET_h. This model, a variant of LexNET, has an additional hidden layer between the output layer and v_(w1,w2) of Equation (5). Because of this additional hidden layer, this model can take into account the interaction of the path information and the distributional information of the two word embeddings. The size of the additional hidden layer was set to 60.

Following Shwartz and Dagan (2016), we optimized each model using Adam (with a learning rate of 0.001) while tuning the dropout rate dr among {0.0, 0.2, 0.4} on the validation set. The minibatch size was set to 100. We initialized the lemma embeddings of the LSTM and the concatenated word embeddings of LexNET with the pretrained 50-dimensional GloVe vectors. Training was stopped if performance on the validation set did not improve for seven epochs, and the best model for test evaluation was selected based on the score on the validation set.

Our Method
We implemented and trained our model of P(path|w1, w2), described in Section 3.1, as follows. We used the 30,000 most frequent paths connecting nouns as the context paths for unsupervised learning. We initialized the word embeddings with the same pretrained GloVe vectors as the baseline models. For the unsupervised learning data, we extracted from D the triples (w1, w2, path) whose w1 and w2 are included in the vocabulary of the GloVe vectors and whose path is included in the context paths. The number of these triples was 217,737,765.
We set the sizes of h′_(w1,w2), h_(w1,w2), and v_path for the context paths to 100. The negative sampling size n was set to 5. We trained our model for five epochs using Adam (with a learning rate of 0.001). The minibatch size was 100. To preserve the distributional regularity of the pretrained word embeddings, we did not update the input word embeddings during the unsupervised learning.
With our trained model, we applied the two methods described in Sections 3.2 and 3.3 to the NPB and LexNET models as follows:

+Aug. We added the most plausible 2k paths to each paths(w1, w2) as in Section 3.2. We tuned k ∈ {1, 3, 5} on the validation set.
+Rep. We concatenated the pseudo-path representation v_p-paths(w1,w2) from Section 3.3 with the penultimate layer. To focus on the pure contribution of unsupervised learning, we did not update this component during supervised learning.
Figure 2 illustrates +Aug and +Rep applied to LexNET in the case where the two target words, w 1 and w 2 , do not co-occur in the corpus.

Result
In this section, we examine how our methods improved the baseline models. Following previous research (Shwartz and Dagan, 2016), the performance metric was the "averaged" F1 of scikit-learn (Pedregosa et al., 2011), which computes the F1 for each relation and reports their average weighted by the number of true instances of each relation.
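For reference, this weighted-average F1 can be computed as follows, mirroring scikit-learn's f1_score with average="weighted":

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    # Per-relation F1, averaged with weights equal to the number of
    # true instances of each relation (scikit-learn's "weighted" average).
    support = Counter(y_true)
    total = 0.0
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        total += support[c] * f1
    return total / len(y_true)
```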

Path-based Model and Path Data Augmentation
We examined whether or not our path data augmentation method +Aug contributes to the neural path-based method. The results are displayed in Table 3. Applying our path data augmentation method improved the classification performance on each dataset. Especially for K&H+N, a large dataset in which three-fourths of the word pairs had no paths, our method significantly improved performance. This result shows that our path data augmentation effectively solves the missing path problem. Moreover, the model with our method outperforms the baseline on EVALution, in which nearly all word pairs co-occur in the corpus. This indicates that the predicted paths provide useful information and enhance path-based classification. We examine the paths predicted by our model of P(path|w1, w2) in Section 6.1.

Integrated Model and Our Methods
We investigated how our methods based on modeling P(path|w1, w2) improved the baseline integrated model, LexNET. Table 4 displays the results.
Our proposed methods, +Aug and +Rep, improved the performance of LexNET on each dataset. Moreover, the best score on each dataset was achieved by a model to which our methods were applied. These results show that our methods are also effective for integrated models based on both path information and distributional information.
The table also shows that LexNET+Rep outperforms LexNET_h, though the former has fewer parameters to be tuned during supervised learning than the latter. This indicates that the word pair representations of our model capture information beyond the interaction of two word embeddings. We investigate the properties of our word pair representations in Section 6.2.
Finally, we found that applying both methods did not necessarily yield the best performance. A possible explanation is that applying both methods is redundant, as both +Aug and +Rep depend on the same model of P(path|w1, w2).

Analysis
In this section, we investigate the properties of the predicted dependency paths and word pair representations of our model.

Predicted Dependency Paths
We extracted the word pairs of BLESS without co-occurring dependency paths and predicted the plausible dependency paths of those pairs with our model of P(path|w1, w2). The top three predicted paths for example pairs are displayed in Table 5. Paths that we believe to be indicative or representative of a given relationship are shown in bold.
Our model predicted plausible and indicative dependency paths for each relation, although the predicted paths also contained some implausible or unindicative ones. For hypernymy, our model predicted variants of the is-a path according to domain, such as X is Y manufactured in the clothing domain and X is a species of Y in the animal domain. For (owl, rump), which is a meronymy pair, the top predicted path was X that Y represent. This is not plausible for (owl, rump) but is indicative of meronymy, particularly member-of relations. Moreover, domain-independent paths which indicate meronymy, such as all X have Y, were predicted. For (mug, plastic), one of the predicted paths, X is made from Y, is also a domain-independent indicative path for meronymy. For co-hyponymy, our model predicted domain-specific paths which indicate that two nouns are of the same kind. For example, given X leaf and Y and X specie and Y for (carrot, beans), we can infer that both X and Y are plants or vegetables. Likewise, given play X, guitar, and Y for (cello, kazoo), we can infer that both X and Y are musical instruments. These examples show that our path data augmentation is effective for the missing path problem and enhances path-based models.

Visualizing Word Pair Representations
We visualized the word pair representations v_p-paths(w1,w2) to examine their specific properties. In BLESS, every pair is annotated with one of 17 domain class labels. For each domain, we reduced the dimensionality of the representations using t-SNE (Maaten and Hinton, 2008) and plotted the data points of the hypernyms, co-hyponyms, and meronyms. We compared our representations with the concatenation of two word embeddings (pretrained 50-dimensional GloVe). Examples are displayed in Figure 3.
We found that our representations (the top row in Figure 3) group the word pairs according to their semantic relation in some specific domains, based only on unsupervised learning. This property is desirable for the lexical semantic relation detection task. In contrast, the concatenation of word embeddings (the bottom row in Figure 3) shows little or no such tendency in any domain; its data points are scattered or jumbled. This is because the concatenation of word embeddings cannot capture the relational information of word pairs but only the distributional information of each word (Levy et al., 2015).
This visualization further shows that our word pair representations can be used as pseudo-path representations to alleviate the missing path problem.

Conclusion
In this paper, we proposed novel methods based on modeling P(path|w1, w2) to solve the missing path problem. Our neural model of P(path|w1, w2) can be learned from a corpus in an unsupervised manner and can generalize co-occurrences of word pairs and dependency paths. We demonstrated that this model can be applied in two ways: (1) to augment path data by predicting plausible paths for a given word pair, and (2) to extract from word pairs useful features capturing co-occurring path information. Finally, our experiments demonstrated that our methods improve upon previous models and successfully address the missing path problem.
In future work, we will explore unsupervised learning with a neural path encoder. Our model yields not only word pair representations but also dependency path representations as context vectors. Thus, we intend to apply these representations to various tasks to which path representations contribute.

Figure 3 :
Figure 3: Visualization of our word pair representations v_p-paths(w1,w2) (top row) and the concatenation of two word embeddings (bottom row) using t-SNE in some domains. The two axes of each plot, x and y, are the reduced dimensions obtained with t-SNE.

Table 1 :
The relation types in each dataset:
K&H+N: hypernym, meronym, co-hyponym, random
BLESS: hypernym, meronym, co-hyponym, random
ROOT09: hypernym, co-hyponym, random
EVALution: hypernym, meronym, attribute, synonym, antonym, holonym, substance meronym

Table 2 :
The number and proportion of instances in each dataset for which at least one dependency path was obtained.

Figure 2: Illustration of +Aug and +Rep applied to LexNET. +Aug predicts plausible paths from two word embeddings, and these paths are fed into the LSTM path encoder. +Rep concatenates the pseudo-path representation v_p-paths(w1,w2) with the penultimate layer of LexNET.

Table 3 :
Classification performance of the neural path-based model (NPB) and the same model with path data augmentation (NPB+Aug).

Table 4 :
Classification performance of the integrated models, LexNET and LexNET_h, and those with our methods, +Aug and +Rep.

Table 5 :
Predicted paths with our model for a word pair of each relation in BLESS.