Task-Oriented Learning of Word Embeddings for Semantic Relation Classification

We present a novel learning method for word embeddings designed for relation classification. Our word embeddings are trained by predicting words between noun pairs using lexical relation-specific features on a large unlabeled corpus. This allows us to explicitly incorporate relation-specific information into the word embeddings. The learned word embeddings are then used to construct feature vectors for a relation classification model. On a well-established semantic relation classification task, our method significantly outperforms a baseline based on a previously introduced word embedding method, and compares favorably to previous state-of-the-art models without syntactic information or manually constructed external resources. Furthermore, when incorporating external resources, our method outperforms the previous state of the art.


Introduction
Automatic classification of semantic relations has a variety of applications, such as information extraction and the construction of semantic networks (Girju et al., 2007; Hendrickx et al., 2010). A traditional approach to relation classification is to train classifiers using various kinds of features with class labels annotated by humans. Carefully crafted features derived from lexical, syntactic, and semantic resources play a significant role in achieving high accuracy for semantic relation classification (Rink and Harabagiu, 2010).
In recent years there has been an increasing interest in using word embeddings as an alternative to traditional hand-crafted features. Word embeddings are represented as real-valued vectors and capture syntactic and semantic similarity between words. For example, word2vec (Mikolov et al., 2013b) is a well-established tool for learning word embeddings. Although word2vec has successfully been used to learn word embeddings, these kinds of word embeddings capture only co-occurrence relationships between words (Levy and Goldberg, 2014). While simply adding word embeddings trained using window-based contexts as additional features to existing systems has proven valuable (Turian et al., 2010), more recent studies have focused on how to tune and enhance word embeddings for specific tasks (Bansal et al., 2014; Boros et al., 2014; Chen et al., 2014; Guo et al., 2014; Nguyen and Grishman, 2014), and we continue this line of research for the task of relation classification.
In this work we present a learning method for word embeddings, specifically designed to be useful for relation classification. The overview of our system and the embedding learning process is shown in Figure 1. First we train word embeddings by predicting words between noun pairs using lexical relation-specific features on a large unlabeled corpus. We then use the word embeddings to construct lexical feature vectors for relation classification. Lastly, the feature vectors are used to train a relation classification model.
We evaluate our method on a well-established semantic relation classification task and compare it to a baseline based on word2vec embeddings and to previous state-of-the-art models that rely on manually crafted features, syntactic parses, or external semantic resources. Our method significantly outperforms the word2vec-based baseline and compares favorably with previous state-of-the-art models, despite relying only on lexical-level features and no external annotated resources. By incorporating external resources, our method outperforms the previous state of the art. Furthermore, our qualitative analysis of the learned embeddings shows that n-grams of our embeddings capture salient syntactic patterns similar to semantic relation types.

Previous Work
A traditional approach to relation classification is to train classifiers in a supervised fashion using a variety of features. These features include lexical bag-of-words features and features based on syntactic parse trees. For syntactic parse trees, the paths between the target entities on constituency and dependency trees have been demonstrated to be useful (Bunescu and Mooney, 2005). On the shared task introduced by Hendrickx et al. (2010), Rink and Harabagiu (2010) achieved the best score using a variety of hand-crafted features which were then used to train a Support Vector Machine (SVM).
Recently, word embeddings have been proposed as an alternative to hand-crafted features (Collobert et al., 2011). However, one of the limitations is that word embeddings are usually learned by predicting a target word in its context, leading to only local co-occurrence information being captured (Levy and Goldberg, 2014). Thus, several recent studies have focused on overcoming this limitation. Le and Mikolov (2014) integrated paragraph information into a word2vec-based model, which allowed them to capture paragraph-level information.
For dependency parsing, Bansal et al. (2014) and Chen et al. (2014) found ways to improve performance by integrating dependency-based context information into their embeddings. Bansal et al. (2014) trained embeddings by defining parent and child nodes in dependency trees as contexts. Chen et al. (2014) introduced the concept of feature embeddings induced by parsing a large unannotated corpus and then learning embeddings for the manually crafted features.
For information extraction, Boros et al. (2014) trained word embeddings relevant for event role extraction, and Nguyen and Grishman (2014) employed word embeddings for domain adaptation of relation extraction. Another kind of task-specific word embeddings was proposed by Tang et al. (2014), who used sentiment labels on tweets to adapt word embeddings for a sentiment analysis task. However, such an approach is only feasible when a large amount of labeled data is available.

Relation Classification Using Word Embedding-based Features
We propose a novel method for learning word embeddings designed for relation classification. The word embeddings are trained by predicting each word between noun pairs, given the corresponding low-level features for relation classification. In general, to classify relations between pairs of nouns the most important features come from the pairs themselves and the words between and around the pairs (Hendrickx et al., 2010). For example, in the sentence in Figure 1 (b) there is a cause-effect relationship between the two nouns conflicts and players. To classify the relation, the most common features are the noun pair (conflicts, players), the words between the noun pair (are, caused, by), the words before the pair (the, external), and the words after the pair (playing, tiles, to, ...). As shown by Rink and Harabagiu (2010), the words between the noun pairs are the most effective among these features. Our main idea is to treat the most important features (the words between the noun pairs) as the targets to be predicted and other lexical features (noun pairs, words outside them) as their contexts. Due to this, we expect our embeddings to capture relevant features for relation classification better than previous models which only use window-based contexts.
In this section we first describe the learning process for the word embeddings, focusing on lexical features for relation classification (Figure 1 (b)). We then propose a simple and powerful technique to construct features which serve as input for a softmax classifier. The overview of our proposed system is shown in Figure 1 (a).

Learning Word Embeddings
Assume that there is a noun pair $\mathbf{n} = (n_1, n_2)$ in a sentence with $M_{in}$ words between the pair and $M_{out}$ words before and after the pair:
• $\mathbf{w}_{in} = (w^{in}_1, \ldots, w^{in}_{M_{in}})$,
• $\mathbf{w}_{bef} = (w^{bef}_1, \ldots, w^{bef}_{M_{out}})$, and
• $\mathbf{w}_{aft} = (w^{aft}_1, \ldots, w^{aft}_{M_{out}})$.
Our method predicts a target word $w^{in}_i \in \mathbf{w}_{in}$ using $\mathbf{n}$, the words around $w^{in}_i$ in $\mathbf{w}_{in}$, and the words in $\mathbf{w}_{bef}$ and $\mathbf{w}_{aft}$. Words are embedded in a $d$-dimensional vector space and we refer to these vectors as word embeddings. To discriminate between words in $\mathbf{n}$ and those in $\mathbf{w}_{in}$, $\mathbf{w}_{bef}$, and $\mathbf{w}_{aft}$, we have two sets of word embeddings: $\mathbf{N} \in \mathbb{R}^{d \times |\mathcal{N}|}$ and $\mathbf{W} \in \mathbb{R}^{d \times |\mathcal{W}|}$. $\mathcal{W}$ is a set of words and $\mathcal{N}$ is also a set of words but contains only nouns. Hence, the noun apple has two embeddings: one in $\mathbf{N}$ and another in $\mathbf{W}$.
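A feature vector $\mathbf{f} \in \mathbb{R}^{2d(2+c) \times 1}$ is constructed to predict $w^{in}_i$ by concatenating word embeddings:

$$\mathbf{f} = \Big[\mathbf{N}(n_1);\ \mathbf{N}(n_2);\ \mathbf{W}(w^{in}_{i-1});\ \ldots;\ \mathbf{W}(w^{in}_{i-c});\ \mathbf{W}(w^{in}_{i+1});\ \ldots;\ \mathbf{W}(w^{in}_{i+c});\ \frac{1}{M_{out}} \sum_{j=1}^{M_{out}} \mathbf{W}(w^{bef}_j);\ \frac{1}{M_{out}} \sum_{j=1}^{M_{out}} \mathbf{W}(w^{aft}_j)\Big]. \quad (1)$$

$\mathbf{N}(\cdot)$ and $\mathbf{W}(\cdot) \in \mathbb{R}^{d \times 1}$ correspond to each word and $c$ is the context size. A special NULL token is used if $i - j$ is smaller than 1 or $i + j$ is larger than $M_{in}$ for each $j \in \{1, 2, \ldots, c\}$.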
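The concatenation in Eq. (1) can be illustrated with a small plain-Python sketch. It is not the authors' implementation: the function name `build_feature`, the dictionary-based lookup tables, the toy vocabulary, and the ordering of the concatenated blocks are all assumptions, chosen so that the result has the stated size of 2d(2+c). For simplicity the sketch averages over however many outside words are supplied.

```python
def build_feature(N, W, nouns, w_in, w_bef, w_aft, i, c, d):
    """Build f for predicting w_in[i] (0-based index i)."""
    null = W["NULL"]  # embedding of the special padding token

    def ctx(pos):
        # Context word embedding, or NULL when pos falls outside w_in.
        return W[w_in[pos]] if 0 <= pos < len(w_in) else null

    def avg(words):
        # Element-wise mean of the embeddings of `words`.
        if not words:
            return [0.0] * d
        return [sum(W[w][t] for w in words) / len(words) for t in range(d)]

    f = N[nouns[0]] + N[nouns[1]]          # noun pair: 2d entries
    for j in range(1, c + 1):              # left context: c*d entries
        f = f + ctx(i - j)
    for j in range(1, c + 1):              # right context: c*d entries
        f = f + ctx(i + j)
    return f + avg(w_bef) + avg(w_aft)     # outside words: 2d entries


# Toy (hypothetical) embeddings with d = 2 and context size c = 1.
d, c = 2, 1
words = ["NULL", "are", "caused", "by", "the", "external", "playing", "tiles"]
W = {w: [float(k), k + 0.5] for k, w in enumerate(words)}
N = {"conflicts": [1.0, 2.0], "players": [3.0, 4.0]}
f = build_feature(N, W, ("conflicts", "players"), ["are", "caused", "by"],
                  ["the", "external"], ["playing", "tiles"], i=0, c=c, d=d)
print(len(f))  # number of entries, 2 * d * (2 + c)
```

Because the target here is the first word between the nouns, its left context falls outside the span and is filled with the NULL embedding.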
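Our method then estimates a conditional probability $p(w \mid \mathbf{f})$ that the target word is a word $w$, using a logistic regression model:

$$p(w \mid \mathbf{f}) = \sigma\big(\tilde{\mathbf{W}}(w) \cdot \mathbf{f} + b(w)\big), \quad (2)$$

where $\tilde{\mathbf{W}}(w) \in \mathbb{R}^{2d(2+c) \times 1}$ is the weight vector corresponding to $w$, $b(w)$ is the corresponding bias, and $\sigma$ is the logistic function. When training we employ several procedures introduced by Mikolov et al. (2013b), namely, negative sampling, a modified unigram noise distribution, and subsampling. For negative sampling the model parameters $\mathbf{N}$, $\mathbf{W}$, $\tilde{\mathbf{W}}$, and $\mathbf{b}$ are learned by maximizing the objective function $J_{unlabeled}$:

$$J_{unlabeled} = \sum_{\mathbf{n}} \sum_{i=1}^{M_{in}} \Big( \log p(w^{in}_i \mid \mathbf{f}) + \sum_{j=1}^{k} \log\big(1 - p(w'_j \mid \mathbf{f})\big) \Big), \quad (3)$$

where $w'_j$ is a word randomly drawn from the unigram noise distribution weighted by an exponent of 0.75 (Mikolov et al., 2013b). Maximizing $J_{unlabeled}$ means that our method can discriminate between each target word and $k$ noise words given the target word's context.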
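As a rough illustration of the negative-sampling objective, the sketch below computes the contribution of a single training sample: the log-probability of the true target plus the log-probabilities of rejecting each noise word. It assumes a logistic model with one weight vector and one bias per word; the names `noise_distribution` and `sample_objective` and the toy values are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def noise_distribution(counts, power=0.75):
    """Unigram counts raised to `power`, then normalised to sum to one."""
    weights = {w: n ** power for w, n in counts.items()}
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}


def sample_objective(f, W_tilde, b, target, noise_words):
    """log p(target | f) + sum over noise words of log(1 - p(noise | f))."""
    def p(w):
        score = sum(a * x for a, x in zip(W_tilde[w], f)) + b[w]
        return sigmoid(score)

    return math.log(p(target)) + sum(math.log(1.0 - p(w)) for w in noise_words)


# Toy counts: the frequent word is down-weighted relative to raw frequency.
probs = noise_distribution({"the": 16, "tiles": 1})

# With all-zero parameters every probability is 0.5.
f = [0.5, -1.0]
W_tilde = {w: [0.0, 0.0] for w in ("caused", "the", "tiles")}
b = {w: 0.0 for w in W_tilde}
value = sample_objective(f, W_tilde, b, "caused", ["the", "tiles"])
```

Raising counts to the power 0.75 flattens the distribution: with raw counts the frequent word would be drawn 16 times as often as the rare one, whereas here the ratio is 8 to 1.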
To reduce redundancy during training we use subsampling. A training sample whose target word is $w$ is discarded with the probability $P_d(w) = 1 - \sqrt{t / p(w)}$, where $t$ is a threshold and $p(w)$ is the relative frequency of $w$ in the training corpus (Mikolov et al., 2013b). Subsampling is also applied to noun pairs: given two random numbers $r_1$ and $r_2$, a training sample whose noun pair is $(n_1, n_2)$ is discarded if $P_d(n_1)$ is larger than $r_1$ or $P_d(n_2)$ is larger than $r_2$. This subsampling dramatically reduces the training time.
Since the feature vector $\mathbf{f}$ is constructed as defined in Eq. (1), at each training step $\tilde{\mathbf{W}}(w)$ is updated based on information about what pair of nouns surrounds $w$, what word n-grams appear in a small window around $w$, and what words appear outside the noun pair. Hence, the weight vector $\tilde{\mathbf{W}}(w)$ captures rich information regarding the target word $w$.
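The subsampling step might look like the following sketch. It assumes the discard probability of Mikolov et al. (2013b), one minus the square root of a threshold divided by the word's relative frequency. The threshold value `t=1e-5`, the function names, and the example frequencies are assumptions for illustration only.

```python
import math
import random


def discard_prob(freq, t=1e-5):
    """Assumed word2vec-style discard probability, clipped at zero."""
    return max(0.0, 1.0 - math.sqrt(t / freq))


def keep_sample(freq_target, freq_n1, freq_n2, rng, t=1e-5):
    """Return True if the training sample survives subsampling."""
    # Discard on the target word with probability P_d(target).
    if rng.random() < discard_prob(freq_target, t):
        return False
    # Discard on the noun pair if either P_d exceeds its random number.
    r1, r2 = rng.random(), rng.random()
    return not (discard_prob(freq_n1, t) > r1 or discard_prob(freq_n2, t) > r2)


rng = random.Random(0)
# Count how many of 10,000 samples with frequent words are kept.
kept = sum(keep_sample(1e-3, 1e-4, 1e-4, rng) for _ in range(10000))
```

Words rarer than the threshold get a discard probability of zero, so samples made up only of rare words are always kept.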

Constructing Feature Vectors
Once the word embeddings are trained, we can use them for relation classification. Given a noun pair $\mathbf{n} = (n_1, n_2)$ with its context words $\mathbf{w}_{in}$, $\mathbf{w}_{bef}$, and $\mathbf{w}_{aft}$, we construct a feature vector to classify the relation between $n_1$ and $n_2$ by concatenating three kinds of feature vectors: $\mathbf{g}_n$, the word embeddings of the noun pair; $\mathbf{g}_{in}$, the averaged n-gram embeddings between the pair; and $\mathbf{g}_{out}$, the averaged word embeddings outside the pair.
Words between the noun pair contribute to classifying the relation, and one of the most common ways to incorporate an arbitrary number of words is treating them as a bag of words. However, word order information is lost for bag-of-words features such as averaged word embeddings. To incorporate the word order information, we first define n-gram embeddings $\mathbf{h}_i \in \mathbb{R}^{4d(1+c) \times 1}$ between the noun pair:

$$\mathbf{h}_i = \big[\mathbf{W}(w^{in}_{i-1});\ \ldots;\ \mathbf{W}(w^{in}_{i-c});\ \mathbf{W}(w^{in}_{i+1});\ \ldots;\ \mathbf{W}(w^{in}_{i+c});\ \tilde{\mathbf{W}}(w^{in}_i)\big]. \quad (4)$$

Note that $\tilde{\mathbf{W}}$ can also be used and that the value used for $n$ is $(2c+1)$. We then compute the feature vector $\mathbf{g}_{in}$ by averaging $\mathbf{h}_i$:

$$\mathbf{g}_{in} = \frac{1}{M_{in}} \sum_{i=1}^{M_{in}} \mathbf{h}_i.$$

The words before and after the noun pair are sometimes important in classifying the relation. For example, in the phrase "pour $n_1$ into $n_2$", the word pour should be helpful in classifying the relation. As with Eq. (1), we use the averaged word embeddings of words outside the noun pair to compute the feature vector $\mathbf{g}_{out} \in \mathbb{R}^{2d \times 1}$:

$$\mathbf{g}_{out} = \frac{1}{M_{out}} \Big[\sum_{j=1}^{M_{out}} \mathbf{W}(w^{bef}_j);\ \sum_{j=1}^{M_{out}} \mathbf{W}(w^{aft}_j)\Big]. \quad (5)$$

As described above, the overall feature vector $\mathbf{e} \in \mathbb{R}^{4d(2+c) \times 1}$ is constructed by concatenating $\mathbf{g}_n$, $\mathbf{g}_{in}$, and $\mathbf{g}_{out}$. We would like to emphasize that we only use simple operations: averaging and concatenating the learned word embeddings. The feature vector $\mathbf{e}$ is then used as input for a softmax classifier, without any complex transformation such as matrix multiplication with non-linear functions.
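The construction of the classification feature vector can be sketched as follows. The composition of each n-gram embedding (the 2c context embeddings followed by the prediction weight vector of the centre word) is an assumption inferred from the stated size 4d(1+c). The function names and toy values are hypothetical.

```python
def mean_vectors(vectors, dim):
    """Element-wise mean of equally sized vectors (zeros if none given)."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[t] for v in vectors) / len(vectors) for t in range(dim)]


def classification_features(N, W, W_tilde, nouns, w_in, w_bef, w_aft, c, d):
    null = W["NULL"]

    def ctx(pos):
        # Context word embedding, or NULL when pos falls outside w_in.
        return W[w_in[pos]] if 0 <= pos < len(w_in) else null

    h = []
    for i, w in enumerate(w_in):
        h_i = []
        for j in range(1, c + 1):
            h_i = h_i + ctx(i - j)          # left context
        for j in range(1, c + 1):
            h_i = h_i + ctx(i + j)          # right context
        h.append(h_i + W_tilde[w])          # weight vector of the centre word
    g_n = N[nouns[0]] + N[nouns[1]]
    g_in = mean_vectors(h, 4 * d * (1 + c))
    g_out = (mean_vectors([W[w] for w in w_bef], d)
             + mean_vectors([W[w] for w in w_aft], d))
    return g_n + g_in + g_out


# Toy (hypothetical) parameters with d = 2 and c = 1.
d, c = 2, 1
words = ["NULL", "are", "caused", "by", "the", "external", "playing", "tiles"]
W = {w: [float(k), k + 0.5] for k, w in enumerate(words)}
W_tilde = {w: [1.0] * (2 * d * (2 + c)) for w in ("are", "caused", "by")}
N = {"conflicts": [1.0, 2.0], "players": [3.0, 4.0]}
e = classification_features(N, W, W_tilde, ("conflicts", "players"),
                            ["are", "caused", "by"], ["the", "external"],
                            ["playing", "tiles"], c, d)
print(len(e))  # number of entries, 4 * d * (2 + c)
```

Only concatenation and averaging are involved, so the cost of building the vector is linear in the number of words around the noun pair.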

Supervised Learning
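Given a relation classification task we train a softmax classifier using the feature vector $\mathbf{e}$ described in Section 3.2. For each $k$-th training sample with a corresponding label $l_k$ among $L$ predefined labels, we compute a conditional probability given its feature vector $\mathbf{e}_k$:

$$p(l_k \mid \mathbf{e}_k) = \frac{\exp(o(l_k))}{\sum_{i=1}^{L} \exp(o(i))},$$

where $\mathbf{o} = \mathbf{S}\mathbf{e}_k + \mathbf{s} \in \mathbb{R}^{L \times 1}$, with a weight matrix $\mathbf{S} \in \mathbb{R}^{L \times 4d(2+c)}$ and a bias vector $\mathbf{s} \in \mathbb{R}^{L \times 1}$. We then define the objective function as:

$$J_{labeled} = \sum_{k=1}^{K} \log p(l_k \mid \mathbf{e}_k) - \frac{\lambda}{2} \|\theta\|^2.$$

$K$ is the number of training samples and $\lambda$ controls the L2 regularization. $\theta = (\mathbf{N}, \mathbf{W}, \tilde{\mathbf{W}}, \mathbf{S}, \mathbf{s})$ is the set of parameters and $J_{labeled}$ is maximized using AdaGrad (Duchi et al., 2011). We have found that dropout (Hinton et al., 2012) is helpful in preventing our model from overfitting. Concretely, elements in $\mathbf{e}$ are randomly omitted with a probability of 0.5 at each training step.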
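The pieces of the supervised step (a linear softmax classifier, dropout on the input features, and AdaGrad updates) can be sketched in plain Python as below. This is a minimal illustration under assumptions: the function names are hypothetical, dropout is applied without rescaling the kept elements, and `eps` is a numerical-stability constant. None of these details is specified in the text.

```python
import math
import random


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def dropout(e, rate, rng):
    """Zero each element of e independently with probability `rate`."""
    return [0.0 if rng.random() < rate else x for x in e]


def predict_proba(S, s, e):
    """Class probabilities from a linear layer (S, s) followed by softmax."""
    return softmax([sum(a * x for a, x in zip(row, e)) + bias
                    for row, bias in zip(S, s)])


def adagrad_ascent(param, grad, hist, lr, eps=1e-8):
    """In-place AdaGrad step that increases the objective."""
    for t, g in enumerate(grad):
        hist[t] += g * g
        param[t] += lr * g / (math.sqrt(hist[t]) + eps)


rng = random.Random(0)
e = dropout([0.2, -0.4, 1.0, 0.3], rate=0.5, rng=rng)
S = [[0.0] * 4 for _ in range(3)]   # zero-initialised weights, L = 3
s = [0.0, 0.0, 0.0]
probs = predict_proba(S, s, e)       # uniform while S and s are zero
```

With zero-initialized weights the classifier starts from a uniform distribution over the labels, whatever the input vector is.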
In what follows, we refer to the above method as RelEmb.
While RelEmb uses only low-level features, a variety of useful features have been proposed for relation classification.
Among them, we use dependency path features (Bunescu and Mooney, 2005;Rink and Harabagiu, 2010;Yu et al., 2014) based on the untyped binary dependencies of the Stanford parser to find the shortest path between target nouns. The dependency path features are computed by averaging word embeddings from W on the shortest path, and are then concatenated to the feature vector e. Furthermore, we directly incorporate semantic information using word-level semantic features from Named Entity (NE) tags and WordNet hypernyms, as used in previous work (Rink and Harabagiu, 2010;Socher et al., 2012;Yu et al., 2014). We refer to this extended method as RelEmb FULL . Concretely, RelEmb FULL uses the same binary features as in Socher et al. (2012). The features come from NE tags and WordNet hypernym tags of target nouns provided by a sense tagger (Ciaramita and Altun, 2006).

Training Data
For pre-training we used a snapshot of the English Wikipedia from November 2013. First, we extracted 80 million sentences from the original Wikipedia file, and then used Enju (Miyao and Tsujii, 2008) to automatically assign part-of-speech tags. From the part-of-speech tags we used NN, NNS, NNP, or NNPS to locate noun pairs in the corpus. We then collected training data by listing pairs of nouns and the words between, before, and after the noun pairs. A noun pair was omitted if the number of words between the pair was larger than 10, and we consequently collected 1.4 billion pairs of nouns and their contexts. We used the 300,000 most frequent words and the 300,000 most frequent nouns and treated out-of-vocabulary words as a special UNK token.
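The data collection step can be sketched as follows for a single POS-tagged sentence. This is an illustration under assumptions rather than the actual pipeline: it enumerates all noun pairs in a sentence, skips pairs with no words in between (nothing to predict), and uses hypothetical names (`extract_samples`, `m_out`) and a made-up example sentence.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def extract_samples(tagged, max_between=10, m_out=5):
    """List noun pairs with the words between, before, and after them."""
    tokens = [word for word, _ in tagged]
    noun_pos = [i for i, (_, tag) in enumerate(tagged) if tag in NOUN_TAGS]
    samples = []
    for a, i in enumerate(noun_pos):
        for j in noun_pos[a + 1:]:
            between = tokens[i + 1:j]
            if len(between) > max_between:
                break  # later nouns are even further away
            if not between:
                continue  # assumption: skip pairs with nothing to predict
            samples.append({
                "nouns": (tokens[i], tokens[j]),
                "in": between,
                "bef": tokens[max(0, i - m_out):i],
                "aft": tokens[j + 1:j + 1 + m_out],
            })
    return samples


tagged = [("the", "DT"), ("external", "JJ"), ("conflicts", "NNS"),
          ("are", "VBP"), ("caused", "VBN"), ("by", "IN"), ("players", "NNS")]
samples = extract_samples(tagged)
```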

Initialization and Optimization
We initialized the embedding matrices $\mathbf{N}$ and $\mathbf{W}$ with zero-mean Gaussian noise with a variance of $1/d$. $\tilde{\mathbf{W}}$ and $\mathbf{b}$ were zero-initialized. The model parameters were optimized by maximizing the objective function in Eq. (3) using stochastic gradient ascent. The learning rate was set to $\alpha$ and linearly decreased to 0 during training, as described in Mikolov et al. (2013a). The hyperparameters are the embedding dimensionality $d$, the context size $c$, the number of negative samples $k$, the initial learning rate $\alpha$, and $M_{out}$, the number of words outside the noun pairs. For hyperparameter tuning, we first fixed $\alpha$ to 0.025 and $M_{out}$ to 5, and then varied $d$ over {50, 100, 300}, $c$ over {1, 2, 3}, and $k$ over {5, 15, 25}.
At the supervised learning step, we initialized $\mathbf{S}$ and $\mathbf{s}$ with zeros. The hyperparameters, namely the learning rate for AdaGrad, $\lambda$, $M_{out}$, and the number of iterations, were determined via 10-fold cross validation on the training set for each setting. Note that $M_{out}$ can be tuned at the supervised learning step, adapting to a specific dataset.
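A minimal sketch of the initialization, the linearly decaying learning rate, and the hyperparameter grid described above is given below. The helper names and the dictionary representation of the embedding matrices are assumptions made for illustration.

```python
import itertools
import random


def init_embeddings(vocab, d, rng):
    """Zero-mean Gaussian initialisation with variance 1/d."""
    std = (1.0 / d) ** 0.5
    return {w: [rng.gauss(0.0, std) for _ in range(d)] for w in vocab}


def learning_rate(alpha, step, total_steps):
    """Linearly decay the learning rate from alpha to 0."""
    return alpha * max(0.0, 1.0 - step / total_steps)


# All (d, c, k) combinations from the tuning ranges given in the text.
grid = list(itertools.product([50, 100, 300], [1, 2, 3], [5, 15, 25]))
N = init_embeddings(["conflicts", "players"], d=50, rng=random.Random(0))
```

The grid contains 27 settings in total, each of which requires a separate pre-training run.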

Evaluation Dataset
We evaluated our method on the SemEval 2010 Task 8 data set (Hendrickx et al., 2010), which involves predicting the semantic relations between noun pairs in their contexts. Training example (a) is classified as Cause-Effect($E_1$, $E_2$), which denotes that $E_2$ is an effect caused by $E_1$, while training example (b) is classified as Cause-Effect($E_2$, $E_1$), which is the inverse of Cause-Effect($E_1$, $E_2$). We report the official macro-averaged F1 scores and accuracy.
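For readers unfamiliar with the metric, the sketch below computes a plain macro-averaged F1 over a given list of labels. It is a simplified illustration only: the official task scorer applies additional rules that are not reproduced here, and the function name and toy labels are hypothetical.

```python
def macro_f1(gold, pred, labels):
    """Unweighted mean of per-label F1 scores."""
    f1_scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


gold = ["A", "A", "B", "B"]
pred = ["A", "B", "B", "B"]
score = macro_f1(gold, pred, ["A", "B"])
```

In this toy case label A has precision 1 and recall 1/2 (F1 of 2/3) and label B has precision 2/3 and recall 1 (F1 of 0.8), so the macro average is their mean.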

Models
To empirically investigate the performance of our proposed method we compared it to several baselines and previously proposed models.

Random and word2vec Initialization
Rand-Init. The first baseline is RelEmb itself, but without applying the learning method on the unlabeled corpus. In other words, we train the softmax classifier from Section 3.3 on the labeled training data with randomly initialized model parameters.

W2V-Init.
The second baseline is RelEmb using word embeddings learned by word2vec. More specifically, we initialize the embedding matrices $\mathbf{N}$ and $\mathbf{W}$ with the word2vec embeddings. Related to our method, word2vec has a set of weight vectors similar to $\tilde{\mathbf{W}}$ when trained with negative sampling, and we use these weight vectors as a replacement for $\tilde{\mathbf{W}}$. We trained the word2vec embeddings using the CBOW model with subsampling on the full Wikipedia corpus. As with our experimental settings, we fix the learning rate to 0.025 and investigate several hyperparameter settings. For hyperparameter tuning we set the embedding dimensionality $d$ to {50, 100, 300}, the context size $c$ to {1, 3, 9}, and the number of negative samples $k$ to {5, 15, 25}.

SVM-Based Systems
A simple approach to the relation classification task is to use SVMs with standard binary bag-of-words features. The bag-of-words features included the noun pairs and words between, before, and after the pairs, and we used LIBLINEAR as our classifier.

Neural Network Models
Socher et al. (2012) used Recursive Neural Network (RNN) models to classify the relations. The MV-RNN model used pre-trained word embeddings and semantic features. Subsequently, Hashimoto et al. (2013) proposed an RNN model to better handle the relations using rich syntactic information. Since these methods are recursively compositional, they rely on syntactic parse trees. Better results were achieved using a Convolutional Neural Network (CNN) model with WordNet hypernyms (Zeng et al., 2014). Notably, unlike the RNN-based methods, the CNN model does not rely on parse trees. Both the RNN and CNN models used the SENNA word embeddings, trained on an English Wikipedia snapshot.
The current state-of-the-art results were achieved by Yu et al. (2014) using their novel Factor-based Compositional Model (FCM). In their work Yu et al. (2014) presented results from several model variants, the best performing being FCM EMB and FCM FULL . The former only uses word embedding information and the latter, which achieved the best score, relies on dependency path and NE features, in addition to word embeddings.

Results and Discussion
The scores on the test set for SemEval 2010 Task 8 are shown in Table 1. RelEmb achieves an F1 score of 82.8%, which is better than that of almost all compared models and comparable to that of the previous state of the art. Note that RelEmb does not rely on external semantic features or syntactic parse features. Furthermore, RelEmb FULL achieves a better score (an F1 score of 83.5%) than the previous state of the art. We calculated a confidence interval (82.0, 84.9) (p < 0.05) using bootstrap resampling (Noreen, 1989).
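A confidence interval of this kind can be obtained by resampling the test set with replacement and recomputing the metric on each resample. The sketch below uses a simple percentile bootstrap with accuracy as the metric. Both choices, as well as the function names, the number of resamples, and the toy data, are assumptions and do not necessarily match the procedure used for the interval reported above.

```python
import random


def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def bootstrap_ci(gold, pred, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `metric` over resampled test sets."""
    rng = random.Random(seed)
    n = len(gold)
    scores = []
    for _ in range(n_resamples):
        # Draw n test items with replacement and score the resample.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([gold[i] for i in idx], [pred[i] for i in idx]))
    scores.sort()
    lower = scores[int(alpha / 2 * n_resamples)]
    upper = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


gold = ["A", "B"] * 50
pred = ["A", "B"] * 40 + ["A", "A"] * 10   # 90 of 100 predictions correct
low, high = bootstrap_ci(gold, pred, accuracy)
```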
Comparison with the baselines. RelEmb significantly outperforms not only the Rand-Init baseline, but also the W2V-Init baseline. These results show that our task-specific word embeddings are more useful than those trained using window-based contexts. A point that we would like to emphasize is that the baselines are unexpectedly strong.
As was noted by Wang and Manning (2012), we should carefully implement strong baselines and see whether complex models can outperform these baselines.

Comparison with SVM-based systems.
RelEmb performs much better than the bag-of-words-based SVM. This is not surprising given that we use a large unannotated corpus and embeddings with a large number of parameters. RelEmb also outperforms the SVM system of Rink and Harabagiu (2010), which demonstrates the effectiveness of our task-specific word embeddings, despite our only requirement being a large unannotated corpus and a POS tagger.

Comparison with neural network models.
RelEmb outperforms the RNN models. In our preliminary experiments, we have found some undesirable parse trees when computing vector representations using RNN-based models and such parsing errors might hamper the performance of the RNN models.
The previous state of the art was achieved by FCM FULL , which relies on dependency path and NE features. Without such features, RelEmb outperforms FCM EMB by a large margin. By incorporating external resources, RelEmb FULL outperforms FCM FULL . Yu et al. (2014) did not report positive results when using WordNet features for their best model. Another point which we would like to emphasize is that our method at the supervised learning step is simpler than FCM, which is based on a log-quadratic model. When using a log-linear model, FCM's F1 score drops to 82.0% despite the use of dependency path and WordNet features (Yu et al., 2014).

Analysis on Training Settings
We perform analysis of the training procedure focusing on RelEmb.
Effects of tuning hyperparameters. In Tables 2 and 3, we show how tuning the hyperparameters of our method and word2vec affects the classification results using 10-fold cross validation on the training set. The same split is used for each setting, so all results are comparable to each other. The best settings for the cross validation are used to produce the results reported in Table 1. Table 2 shows F1 scores obtained by RelEmb. The results for d = 50, 100 show that RelEmb benefits from relatively large context sizes. The n-gram embeddings in RelEmb capture richer information by setting c to 3 compared to setting c to 1. Relatively large numbers of negative samples also slightly boost the scores. As opposed to these trends, the score does not improve using d = 300. We use the best setting (c = 3, d = 100, k = 25) for the remaining analysis. We note that RelEmb FULL achieves an F1 score of 82.5.
We also performed similar experiments for the W2V-Init baseline, and the results are shown in Table 3. In this case, the number of negative samples does not affect the scores, and the best score is achieved by c = 1. As discussed in Bansal et al. (2014), the small context size captures the syntactic similarity between words rather than the topical similarity. This result indicates that syntactic similarity is more important than topical similarity for this task. Compared to the word2vec embeddings, our embeddings capture not only local context information using word order, but also long-range co-occurrence information by being tailored for the specific task.
Ablation tests. As described in Section 3.2, we concatenate three kinds of feature vectors, $\mathbf{g}_n$, $\mathbf{g}_{in}$, and $\mathbf{g}_{out}$, for supervised learning. Table 4 shows classification scores for ablation tests using 10-fold cross validation. We also provide a score using a simplified version of $\mathbf{g}_{in}$, where the feature vector $\mathbf{g}'_{in}$ is computed by averaging the word embeddings $[\mathbf{W}(w^{in}_i); \tilde{\mathbf{W}}(w^{in}_i)]$ of the words between the noun pairs. This feature vector $\mathbf{g}'_{in}$ then serves as a bag-of-words feature. Table 4 clearly shows that the averaged n-gram embeddings contribute the most to the semantic relation classification performance. The difference between the scores of $\mathbf{g}_{in}$ and $\mathbf{g}'_{in}$ shows the effectiveness of our averaged n-gram embeddings.

Effects of dropout.
At the supervised learning step we use dropout to regularize our model. Without dropout, the F1 score drops from 82.2% to 81.3% on the training set.
Training Time. Table 5 shows the training times for the embeddings on the Wikipedia corpus using a single CPU core of an Intel Xeon CPU X5680 3.33GHz processor and clearly shows that subsampling is effective in terms of both the training time and classification accuracy. It should also be noted that pre-training is the largest contributor to computational time for our method and that the supervised learning step only takes approximately three minutes.

Qualitative Analysis on the Embeddings
Using the n-gram embeddings $\mathbf{h}_i$ in Eq. (4), we inspect which n-grams are relevant for each relation class after the supervised learning step of RelEmb. When the context size c is 3, we can use at most 7-grams. The learned weight matrix $\mathbf{S}$ in Section 3.3 is used to detect the most relevant n-grams for each class. More specifically, for each n-gram embedding (n = 1, 3) in the training set, we compute the dot product between the n-gram embedding and the corresponding components in $\mathbf{S}$. We then select the pairs of n-grams and class labels with the highest scores. In Table 6 we show the top five n-grams for six classes. These results clearly show that the n-gram embeddings capture salient syntactic patterns which are useful for the relation classification task.
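The ranking procedure can be sketched as follows. Here `S_in` stands for the rows of the classifier weight matrix restricted to the components that multiply the n-gram part of the feature vector. That slicing, the function name `top_ngrams`, and the toy embeddings and weights are assumptions for illustration.

```python
def top_ngrams(ngram_embeddings, S_in, class_names, top=5):
    """Rank (n-gram, class) pairs by the dot product of embedding and weights."""
    scored = []
    for ngram, h in ngram_embeddings.items():
        for name, row in zip(class_names, S_in):
            score = sum(a * x for a, x in zip(row, h))
            scored.append((score, ngram, name))
    scored.sort(reverse=True)  # highest score first
    return scored[:top]


# Toy two-dimensional n-gram embeddings and class weights.
ngram_embeddings = {"are caused by": [1.0, 0.0], "poured into": [0.0, 1.0]}
S_in = [[2.0, -1.0], [-1.0, 3.0]]
class_names = ["Cause-Effect", "Entity-Destination"]
best = top_ngrams(ngram_embeddings, S_in, class_names, top=2)
```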

Conclusion and Future Work
We have presented a word embedding learning method specifically designed for relation classification. The word embeddings are trained using large unlabeled corpora to capture lexical features for relation classification. On a well-established semantic relation classification task our method significantly outperforms the baseline based on word2vec. Our method also compares favorably to previous state-of-the-art models that rely on syntactic parsers and external semantic resources, despite our method requiring only access to an unannotated corpus and a part-of-speech tagger. Furthermore, our method achieves the new state of the art by incorporating external resources. For future work, we seek to investigate how relation labels can help when learning embeddings in a semi-supervised learning setting.