Improved Relation Extraction with Feature-rich Compositional Embedding Models

Compositional embedding models build a representation (or embedding) for a linguistic structure based on its component word embeddings. We propose a Feature-rich Compositional Embedding Model (FCM) for relation extraction that is expressive, generalizes to new domains, and is easy to implement. The key idea is to combine (unlexicalized) hand-crafted features with learned word embeddings. The model directly tackles the difficulties met by traditional compositional embedding models, such as handling arbitrary types of sentence annotations and utilizing global information for composition. We test the proposed model on two relation extraction tasks, and demonstrate that it outperforms both previous compositional models and traditional feature-rich models on the ACE 2005 relation extraction task and the SemEval 2010 relation classification task. The combination of our model and a log-linear classifier with hand-crafted features gives state-of-the-art results.


Introduction
Two common NLP feature types are lexical properties of words and unlexicalized linguistic/structural interactions between words. Prior work on relation extraction has extensively studied how to design such features by combining discrete lexical properties (e.g. the identity of a word, its lemma, its morphological features) with aspects of a word's linguistic context (e.g. whether it lies between two entities or on a dependency path between them). While these help learning, they make generalization to unseen words difficult. An alternative approach to capturing lexical information relies on continuous word embeddings, which are representative of words but generalizable to new words. Embedding features have improved many tasks, including NER, chunking, dependency parsing, semantic role labeling, and relation extraction (Miller et al., 2004; Turian et al., 2010; Koo et al., 2008; Roth and Woodsend, 2014; Sun et al., 2011; Plank and Moschitti, 2013; Nguyen and Grishman, 2014). Embeddings can capture lexical information, but alone they are insufficient: in state-of-the-art systems, they are used alongside features of the broader linguistic context.
$^{*}$ Gormley and Yu contributed equally.
$^{1}$ https://github.com/mgormley/pacaya
In this paper, we introduce a compositional model that combines unlexicalized linguistic context and word embeddings for relation extraction, a task in which contextual feature construction plays a major role in generalizing to unseen data. Our model allows for the composition of embeddings with arbitrary linguistic structure, as expressed by hand-crafted features. In the following sections, we begin with a precise construction of compositional embeddings using word embeddings in conjunction with unlexicalized features. Various feature sets used in prior work (Turian et al., 2010; Nguyen and Grishman, 2014; Hermann et al., 2014; Roth and Woodsend, 2014) are captured as special cases of this construction. Adding these compositional embeddings directly to a standard log-linear model yields a special case of our full model. We then treat the word embeddings as parameters, giving rise to our powerful, efficient, and easy-to-implement log-bilinear model. The model capitalizes on arbitrary types of linguistic annotations by better utilizing features associated with substructures of those annotations, including global information. We choose features to promote different properties and to distinguish different functions of the input words.
A feature that depends on the embedding for a context word (e.g. "driving", which can indicate the ART relation) could generalize to other lexical indicators of the same relation (e.g. "operating") that don't appear with ART during training. But lexical information alone is insufficient; relation extraction requires the identification of lexical roles: where a word appears structurally in the sentence. In example (2) of Table 1, the word "of" between "suburbs" and "Baghdad" suggests that the first entity is part of the second, yet the earlier occurrence after "direction" is of no significance to the relation. Even finer information can be expressed by a word's role on the dependency path between entities. In example (3) we can distinguish the word "died" from other irrelevant words that don't appear between the entities.
The full model involves three stages. First, it decomposes the annotated sentence into substructures (i.e. a word and associated annotations). Second, it extracts features for each substructure (word), and combines them with the word's embedding to form a substructure embedding. Third, it sums over the substructure embeddings to form a composed annotated sentence embedding, which a final softmax layer uses to predict the output label (relation).
The result is a state-of-the-art relation extractor for unseen domains from ACE 2005 (Walker et al., 2006) and the relation classification dataset from SemEval-2010 Task 8 (Hendrickx et al., 2010).
Contributions This paper makes several contributions, including:
1. We introduce the FCM, a new compositional embedding model for relation extraction.
2. We obtain the best reported results on ACE-2005 for coarse-grained relation extraction in the cross-domain setting, by combining FCM with a log-linear model.
3. We obtain results on SemEval-2010 Task 8 competitive with the best reported results.
Note that other work has already been published that builds on the FCM, such as Hashimoto et al. (2015) and dos Santos et al. (2015). Additionally, we have extended FCM to incorporate a low-rank embedding of the features, which focuses on fine-grained relation extraction for ACE and ERE. This paper obtains better results than the low-rank extension on ACE coarse-grained relation extraction.
$^{3}$ In ACE 2005, ART refers to a relation between a person and an artifact, such as a user, owner, inventor, or manufacturer relationship.

Relation Extraction
In relation extraction we are given a sentence as input with the goal of identifying, for all pairs of entity mentions, what relation exists between them, if any. For each pair of entity mentions in a sentence $S$, we construct an instance $(y, x)$, where $x = (M_1, M_2, S, A)$. $S = \{w_1, w_2, ..., w_n\}$ is a sentence of length $n$ that expresses a relation of type $y$ between two entity mentions $M_1$ and $M_2$, where $M_1$ and $M_2$ are sequences of words in $S$. $A$ is the set of annotations associated with $S$, such as part-of-speech tags, a dependency parse, and named entities. We consider directed relations: for a relation type Rel, $y = \mathrm{Rel}(M_1, M_2)$ and $y' = \mathrm{Rel}(M_2, M_1)$ are different relations. Table 1 shows ACE 2005 relations; the dataset has a strong label bias towards negative examples. We also consider the task of relation classification (SemEval), where the number of negative examples is artificially reduced.
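The instance construction described above can be sketched as follows. This is a minimal illustration, assuming one candidate instance per pair of mentions with the direction encoded in the label set; the function names, the (start, end) span encoding, the "NONE" label, and the toy sentence are hypothetical and not taken from the paper.

```python
from itertools import combinations


def directed_label_set(relation_types):
    """Expand each relation type into its two directed labels, plus a negative label."""
    labels = ["NONE"]  # hypothetical name for "no relation"
    for rel in relation_types:
        labels.append(f"{rel}(M1,M2)")
        labels.append(f"{rel}(M2,M1)")
    return labels


def candidate_instances(sentence, mentions, annotations):
    """Build one x = (M1, M2, S, A) per pair of entity mentions in the sentence."""
    return [(m1, m2, sentence, annotations) for m1, m2 in combinations(mentions, 2)]


# Toy input (hypothetical): mentions are (start, end) token spans, end exclusive.
sentence = ["a", "man", "driving", "a", "taxicab", "in", "Baghdad"]
mentions = [(0, 2), (3, 5), (6, 7)]
instances = candidate_instances(sentence, mentions, annotations={})
labels = directed_label_set(["ART"])

print(len(instances))  # 3 mentions give 3 pairs
print(labels)
```

Most pairs in a real sentence would receive the negative label, which is the label bias mentioned above.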
Embedding Models Word embeddings and compositional embedding models have been successfully applied to a range of NLP tasks; however, the applications of these embedding models to relation extraction are still limited. Prior work on relation classification (e.g. SemEval 2010 Task 8) has focused on short sentences with at most one relation per sentence (Socher et al., 2012; Zeng et al., 2014). For relation extraction, where negative examples abound, prior work has assumed that only the named entity boundaries and not their types were available (Plank and Moschitti, 2013). Other work has assumed that the order of two entities in a relation is given while the relation type itself is unknown (Nguyen and Grishman, 2014). The standard relation extraction task, as adopted by ACE 2005 (Walker et al., 2006), uses long sentences containing multiple named entities with known types and unknown relation directions. We are the first to apply neural language model embeddings to this task.

Motivation and Examples
Whether a word is indicative of a relation depends on multiple properties, which may relate to its context within the sentence. For example, whether the word is in between the entities, on the dependency path between them, or to their left or right may provide additional complementary information. Illustrative examples are given in Table 1 and provide the motivation for our model. In the next section, we will show how we develop informative representations capturing both the semantic information in word embeddings and the contextual information expressing a word's role relative to the entity mentions. We are the first to incorporate all of this information at once. The closest work is that of Nguyen and Grishman (2014), who use a log-linear model for relation extraction with embeddings as features for only the entity heads. Such embedding features are insensitive to the broader contextual information and, as we show, are not sufficient to elicit the word's role in a relation.

A Feature-rich Compositional Embedding Model for Relations
We propose a general framework to construct an embedding of a sentence with annotations on its component words. While we focus on the relation extraction task, the framework applies to any task that benefits from both embeddings and typical hand-engineered lexical features.

Combining Features with Embeddings
We begin by describing a precise method for constructing substructure embeddings and annotated sentence embeddings from existing (usually unlexicalized) features and embeddings. Note that these embeddings can be included directly in a log-linear model as features; doing so results in a special case of our full model presented in the next subsection. An annotated sentence is first decomposed into substructures. The type of substructures can vary by task; for relation extraction we consider one substructure per word. For each substructure in the sentence we have a hand-crafted feature vector $f_{w_i}$ and a dense embedding vector $e_{w_i}$. We represent each substructure as the outer product $\otimes$ between these two vectors to produce a matrix, herein called a substructure embedding: $h_{w_i} = f_{w_i} \otimes e_{w_i}$. The features $f_{w_i}$ are based on the local context in $S$ and annotations in $A$, which can include global information about the annotated sentence. These features allow the model to promote different properties and to distinguish different functions of the words. Feature engineering can be task specific, as relevant annotations can change with regard to each task. In this work we utilize unlexicalized binary features common in relation extraction. Figure 1 depicts the construction of a sentence's substructure embeddings.
We further sum over the substructure embeddings to form an annotated sentence embedding:

$$e_x = \sum_{i=1}^{n} h_{w_i} = \sum_{i=1}^{n} f_{w_i} \otimes e_{w_i}$$

When both the hand-crafted features and word embeddings are treated as inputs, as has previously been the case in relation extraction, this annotated sentence embedding can be used directly as the features of a log-linear model. In fact, we find that the feature sets used in prior work for many other NLP tasks are special cases of this simple construction (Turian et al., 2010; Nguyen and Grishman, 2014; Hermann et al., 2014; Roth and Woodsend, 2014). This highlights an important connection: when the word embeddings are constant, our constructions of substructure and annotated sentence embeddings are just specific forms of polynomial (specifically quadratic) feature combination, hence their commonality in the literature. Our experimental results suggest that such a construction is more powerful than directly including embeddings into the model.
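The construction above (one outer product per word, then a sum over words) can be sketched in a few lines. This is a minimal illustration in plain Python, assuming small dense lists for readability; the function names and the toy feature and embedding values are hypothetical.

```python
def outer(f, e):
    """Substructure embedding: outer product of feature vector f and embedding e."""
    return [[fk * ej for ej in e] for fk in f]


def add_matrices(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def sentence_embedding(features, embeddings):
    """Annotated sentence embedding: sum over words of outer(f_i, e_i)."""
    rows, cols = len(features[0]), len(embeddings[0])
    total = [[0.0] * cols for _ in range(rows)]
    for f, e in zip(features, embeddings):
        total = add_matrices(total, outer(f, e))
    return total


# Two words, three binary features each, 2-d embeddings (toy values).
features = [[1, 0, 1], [0, 1, 0]]
embeddings = [[0.5, -1.0], [2.0, 0.25]]
e_x = sentence_embedding(features, embeddings)
print(e_x)
```

Each row of the result collects the embeddings of the words for which the corresponding feature fires, which is why the construction behaves like a quadratic feature combination when the embeddings are held constant.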

The Log-Bilinear Model
Our full log-bilinear model first forms the substructure and annotated sentence embeddings from the previous subsection. The model uses its parameters to score the annotated sentence embedding and uses a softmax to produce an output label. We call the entire model the Feature-rich Compositional Embedding Model (FCM).

Figure 1: Example construction of substructure embeddings (example sentence: "M1 driving what appeared to be [a taxicab] M2"). Each substructure is a word $w_i$ in $S$, augmented by the target entity information and related information from annotation $A$ (e.g. a dependency tree). We show the factorization of the annotated sentence into substructures (left), the concatenation of the substructure embeddings for the sentence (middle), and a single substructure embedding from that concatenation (right). The annotated sentence embedding (not shown) would be the sum of the substructure embeddings, as opposed to their concatenation.
Our task is to determine the label $y$ (relation) given the instance $x = (M_1, M_2, S, A)$. We formulate this as a probability:

$$P(y \mid x; T, e) = \frac{\exp\left(\sum_{i=1}^{n} T_y \odot (f_{w_i} \otimes e_{w_i})\right)}{Z(x)}$$

where $\odot$ is the 'matrix dot product' or Frobenius inner product of the two matrices. The normalizing constant, which sums over all possible output labels $y' \in L$, is given by $Z(x) = \sum_{y' \in L} \exp\left(\sum_{i=1}^{n} T_{y'} \odot (f_{w_i} \otimes e_{w_i})\right)$. The parameters of the model are the word embeddings $e$ for each word type and a list of weight matrices $T = [T_y]_{y \in L}$, which is used to score each label $y$. The model is log-bilinear$^{6}$ (i.e. log-quadratic) since we recover a log-linear model by fixing either $e$ or $T$. We study both the full log-bilinear model and the log-linear model obtained by fixing the word embeddings.
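The scoring rule above can be sketched as follows: each label's score is the Frobenius inner product of its weight matrix with every substructure embedding, summed over words, and a softmax turns the scores into probabilities. This is a minimal sketch, assuming dense Python lists and a dict from label to weight matrix; the names and toy numbers are hypothetical.

```python
import math


def outer(f, e):
    """Outer product of a feature vector and an embedding vector."""
    return [[fk * ej for ej in e] for fk in f]


def frobenius(a, b):
    """Frobenius inner product: sum of element-wise products of two matrices."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def fcm_probabilities(T, features, embeddings):
    """Return P(y | x; T, e) for every label y in T (a dict: label -> matrix)."""
    scores = {
        y: sum(frobenius(T_y, outer(f, e)) for f, e in zip(features, embeddings))
        for y, T_y in T.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exp_scores = {y: math.exp(s - top) for y, s in scores.items()}
    Z = sum(exp_scores.values())
    return {y: v / Z for y, v in exp_scores.items()}


# Toy example (hypothetical numbers): 2 features, 2-d embeddings, 2 labels.
T = {
    "ART": [[1.0, 0.0], [0.0, 0.0]],
    "NONE": [[0.0, 0.0], [0.0, 0.0]],
}
features = [[1, 0]]        # one word with only the first feature active
embeddings = [[2.0, 1.0]]  # its embedding
probs = fcm_probabilities(T, features, embeddings)
print(probs)
```

Fixing `embeddings` and learning only `T` corresponds to the log-linear special case; learning both corresponds to the log-bilinear model.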

Discussion of the Model
Substructure Embeddings Similar words (i.e. those with similar embeddings) with similar functions in the sentence (i.e. those with similar features) will have similar matrix representations. To understand our selection of the outer product, consider the example in Fig. 1. The word "driving" can indicate the ART relation if it appears on the dependency path between $M_1$ and $M_2$. Suppose the third feature in $f_{w_i}$ indicates this on-path feature. Our model can now learn parameters which give the third row a high weight for the ART label. Other words with embeddings similar to "driving" that appear on the dependency path between the mentions will similarly receive high weight for the ART label. On the other hand, if a word has a similar embedding but is not on the dependency path, it will have 0 weight. Thus, our model generalizes its model parameters across words with similar embeddings only when they share similar functions in the sentence.
$^{6}$ Other popular log-bilinear models are the log-bilinear language models (Mnih and Hinton, 2007; Mikolov et al., 2013).

Smoothed Lexical Features
Another intuition about the selection of the outer product is that it is actually a smoothed version of the traditional lexical features used in classical NLP systems. Consider a lexical feature $f = u \wedge w$, which is a conjunction (logical and) between a non-lexical property $u$ and a lexical part (word) $w$. If we represent $w$ as a one-hot vector, then the outer product exactly recovers the original feature $f$. If we then replace the one-hot representation with the word's embedding, we get the current form of our FCM. Therefore, our model can be viewed as a smoothed version of lexical features, which keeps the expressive strength and uses embeddings to generalize to low-frequency features.
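The one-hot argument above can be checked with a tiny example. This is a minimal sketch, assuming a three-word vocabulary and two non-lexical properties; the vocabulary, the property order, and the dense embedding values are hypothetical.

```python
def outer(f, e):
    """Outer product of a feature vector and a word representation."""
    return [[fk * ej for ej in e] for fk in f]


vocab = ["driving", "of", "died"]


def one_hot(word):
    """One-hot vector over the toy vocabulary."""
    return [1 if w == word else 0 for w in vocab]


# Non-lexical properties u: [in-between, on-path]; here only on-path fires.
u = [0, 1]

# With a one-hot word vector, exactly one cell is 1: the conjunction
# "on-path AND word=driving", i.e. the classical lexicalized feature.
lexical = outer(u, one_hot("driving"))
print(lexical)  # [[0, 0, 0], [1, 0, 0]]

# With dense embeddings (toy values), similar words yield similar rows,
# so weights learned for "driving" carry over to "operating".
dense = {"driving": [0.9, 0.1], "operating": [0.8, 0.2]}
smoothed_driving = outer(u, dense["driving"])
smoothed_operating = outer(u, dense["operating"])
print(smoothed_driving[1], smoothed_operating[1])
```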
Time Complexity Inference in FCM is much faster than both CNNs (Collobert et al., 2011) and RNNs (Socher et al., 2013b; Bordes et al., 2012). FCM requires $O(snd)$ products on average with sparse features, where $s$ is the average number of per-word non-zero feature values, $n$ is the length of the sentence, and $d$ is the dimension of the word embeddings. In contrast, CNNs and RNNs usually have complexity $O(C \cdot n d^2)$, where $C$ is a model-dependent constant.
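The sparse cost can be seen in a sketch that never materializes the outer product: for binary features, each active feature index selects one row of the weight matrix, which is then dotted with the word embedding. This is a minimal illustration for a single label's weight matrix, assuming binary features stored as lists of active indices; the names and numbers are hypothetical.

```python
def dense_score(T_y, features, embeddings):
    """Reference score: sum over words, features, and dimensions of T_y[k][j] * f[k] * e[j]."""
    return sum(
        T_y[k][j] * f[k] * e[j]
        for f, e in zip(features, embeddings)
        for k in range(len(f))
        for j in range(len(e))
    )


def sparse_score(T_y, active, embeddings):
    """Same score using only active feature indices: about s * n * d products."""
    total = 0.0
    for active_idx, e in zip(active, embeddings):
        for k in active_idx:  # s indices per word on average
            total += sum(t * x for t, x in zip(T_y[k], e))  # d products
    return total


T_y = [[1.0, 2.0], [0.5, -1.0], [0.0, 3.0]]
features = [[1, 0, 1], [0, 1, 0]]
embeddings = [[0.5, -1.0], [2.0, 0.25]]
# Active indices derived from the binary feature vectors.
active = [[k for k, v in enumerate(f) if v] for f in features]

print(dense_score(T_y, features, embeddings), sparse_score(T_y, active, embeddings))
```

Both functions return the same value; only the sparse version avoids touching rows of `T_y` whose features are inactive.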

Hybrid Model
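We present a hybrid model which combines the FCM with an existing log-linear model. We do so by defining a new model:

$$P_{\text{FCM+loglin}}(y \mid x) = \frac{1}{Z} P_{\text{FCM}}(y \mid x; T, e) \, P_{\text{loglin}}(y \mid x; \theta)$$

The log-linear model has the usual form: $P_{\text{loglin}}(y \mid x; \theta) \propto \exp(\theta^{\top} f(x, y))$, where $\theta$ are the model parameters and $f(x, y)$ is a vector of features. The integration treats each model as providing a score, and the scores are multiplied together. The constant $Z$ ensures a normalized distribution.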
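The combination step can be sketched as a product of two label distributions followed by renormalization. This is a minimal illustration, assuming both sub-models have already produced a probability for every label; the function name and the toy probabilities are hypothetical.

```python
def hybrid_distribution(p_fcm, p_loglin):
    """Multiply the two models' label probabilities and renormalize by Z."""
    unnormalized = {y: p_fcm[y] * p_loglin[y] for y in p_fcm}
    Z = sum(unnormalized.values())
    return {y: v / Z for y, v in unnormalized.items()}


# Toy distributions over two labels (hypothetical values).
p_fcm = {"ART": 0.6, "NONE": 0.4}
p_loglin = {"ART": 0.9, "NONE": 0.1}
p_hybrid = hybrid_distribution(p_fcm, p_loglin)
print(p_hybrid)
```

Because the product is taken in probability space, a label must score well under both sub-models to receive high probability in the hybrid.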

Training
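FCM training optimizes a cross-entropy objective:

$$\ell(D; T, e) = \sum_{(x, y) \in D} \log P(y \mid x; T, e)$$

where $D$ is the set of all training data and $e$ is the set of word embeddings. To optimize the objective, for each instance $(y, x)$ we perform stochastic training on the loss function $\ell = \ell(y, x; T, e) = \log P(y \mid x; T, e)$. The gradients of the model parameters are obtained by backpropagation (i.e. repeated application of the chain rule). We define the vector $s = \left[\sum_{i} T_y \odot (f_{w_i} \otimes e_{w_i})\right]_{y \in L}$, which yields

$$\frac{\partial \ell}{\partial s} = \left[ \left( I[y = y'] - P(y' \mid x; T, e) \right)_{y' \in L} \right]^{\top}$$

where the indicator function $I[x]$ equals 1 if $x$ is true and 0 otherwise. We have the following gradients: $\frac{\partial \ell}{\partial T} = \frac{\partial \ell}{\partial s} \otimes \sum_{i=1}^{n} f_{w_i} \otimes e_{w_i}$, which is equivalent to:

$$\frac{\partial \ell}{\partial T_{y'}} = \left( I[y = y'] - P(y' \mid x; T, e) \right) \cdot \sum_{i=1}^{n} f_{w_i} \otimes e_{w_i}$$

When we treat the word embeddings as parameters (i.e. the log-bilinear model), we also fine-tune the word embeddings with the FCM:

$$\frac{\partial \ell}{\partial e_w} = \sum_{y' \in L} \left( I[y = y'] - P(y' \mid x; T, e) \right) \cdot \sum_{i=1}^{n} I[w_i = w] \, T_{y'}^{\top} f_{w_i}$$

As is common in deep learning, we initialize these embeddings from a neural language model and then fine-tune them for our supervised task. The training process for the hybrid model (§ 4) is also easily done by backpropagation since each sub-model has separate parameters.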
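The gradient with respect to the weight matrices can be sanity-checked numerically. The sketch below computes the analytic gradient, (I[y = y'] - P(y'|x)) times the annotated sentence embedding, compares one entry against a central finite difference, and takes one stochastic step. It is a minimal illustration, assuming labels are indexed by integers and embeddings are held fixed; the names, toy numbers, and learning rate are hypothetical.

```python
import math


def label_scores(T, feats, embs):
    """s_y = sum_i Frobenius(T_y, outer(f_i, e_i)) for every label index y."""
    return [
        sum(T_y[k][j] * f[k] * e[j]
            for f, e in zip(feats, embs)
            for k in range(len(f))
            for j in range(len(e)))
        for T_y in T
    ]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(v - top) for v in scores]
    z = sum(exps)
    return [v / z for v in exps]


def log_prob(T, feats, embs, y):
    return math.log(softmax(label_scores(T, feats, embs))[y])


def grad_T(T, feats, embs, y):
    """Analytic gradient of log P(y|x) w.r.t. each T_{y'}."""
    probs = softmax(label_scores(T, feats, embs))
    n_feat, dim = len(feats[0]), len(embs[0])
    # Annotated sentence embedding: sum_i outer(f_i, e_i).
    e_x = [[sum(f[k] * e[j] for f, e in zip(feats, embs)) for j in range(dim)]
           for k in range(n_feat)]
    return [[[((1.0 if yp == y else 0.0) - probs[yp]) * e_x[k][j]
              for j in range(dim)] for k in range(n_feat)]
            for yp in range(len(T))]


# Toy problem: 2 labels, 2 features, 2-d embeddings.
T = [[[0.1, -0.2], [0.3, 0.0]], [[0.0, 0.5], [-0.4, 0.2]]]
feats = [[1, 0], [1, 1]]
embs = [[0.5, -1.0], [2.0, 0.25]]
gold = 0

analytic = grad_T(T, feats, embs, gold)

# Central finite difference on a single parameter, T[1][0][1].
eps = 1e-6
T[1][0][1] += eps
up = log_prob(T, feats, embs, gold)
T[1][0][1] -= 2 * eps
down = log_prob(T, feats, embs, gold)
T[1][0][1] += eps  # restore the original value
numeric = (up - down) / (2 * eps)
print(abs(numeric - analytic[1][0][1]) < 1e-6)

# One stochastic gradient ascent step on log P(y|x).
lr = 0.05
before = log_prob(T, feats, embs, gold)
T = [[[T[y][k][j] + lr * analytic[y][k][j] for j in range(2)] for k in range(2)]
     for y in range(2)]
after = log_prob(T, feats, embs, gold)
print(before, after)
```

The step increases the log-probability of the gold label here; fine-tuning the embeddings would add an analogous update for each word's vector.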

Experimental Settings
Features Our FCM features (Table 2) use a feature vector $f_{w_i}$ over the word $w_i$, the two target entities $M_1$, $M_2$, and their dependency path.
Here $h_1$, $h_2$ are the indices of the two head words of $M_1$, $M_2$, $\times$ refers to the Cartesian product between two sets, $t_{h_1}$ and $t_{h_2}$ are entity types (named entity tags for ACE 2005 or WordNet supertags for SemEval 2010) of the head words of the two entities, and $\phi$ stands for the empty feature. $\oplus$ refers to the conjunction of two elements. The In-between features indicate whether a word $w_i$ is in between the two target entities, and the On-path features indicate whether the word is on the dependency path between the two entities, where $P$ is the set of words on that path. We also use the target entity type as a feature. Combining this with the basic features results in more powerful compound features, which can help us better distinguish the functions of word embeddings for predicting certain relations. For example, if we have a person and a vehicle, we know it will be more likely that they have an ART relation. For the ART relation, we introduce a corresponding weight vector, which is closer to lexical embeddings similar to the embedding of "drive".
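The feature description above can be sketched as a small function that returns the active binary features for one word. This is a minimal illustration covering only the head-word, in-between, and on-path indicators crossed with the entity-type set; Table 2 is not reproduced here, so the feature names, the string encoding, and the toy dependency path are all hypothetical.

```python
def fcm_word_features(i, h1, h2, path, t1, t2):
    """Return names of the active binary features for word index i.

    h1, h2: indices of the head words of M1 and M2.
    path:   set of word indices on the dependency path between the entities.
    t1, t2: entity types of the two head words.
    """
    base = []
    if i == h1:
        base.append("head_of_M1")
    if i == h2:
        base.append("head_of_M2")
    if min(h1, h2) < i < max(h1, h2):
        base.append("in_between")
    if i in path:
        base.append("on_path")
    # Cartesian product with {empty, t_h1, t_h2, t_h1 conjoined with t_h2}.
    type_parts = ["", f"t1={t1}", f"t2={t2}", f"t1={t1}&t2={t2}"]
    return [b if not t else f"{b}|{t}" for b in base for t in type_parts]


# Toy sentence; the dependency path below is made up for illustration.
tokens = ["man", "driving", "what", "appeared", "to", "be", "a", "taxicab"]
h1, h2 = 0, 7
path = {0, 1, 7}
feats_driving = fcm_word_features(1, h1, h2, path, "PER", "VEH")
feats_appeared = fcm_word_features(3, h1, h2, path, "PER", "VEH")
print(feats_driving)
print(feats_appeared)
```

In this toy case "driving" activates both in-between and on-path features, while "appeared" activates only in-between features, so the two words are composed differently even if their embeddings were similar.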
All linguistic annotations needed for features (POS, chunks, parses) are from Stanford CoreNLP (Manning et al., 2014). Since SemEval does not have gold entity types, we obtained WordNet and named entity tags using Ciaramita and Altun (2006). For all experiments we use 200-d word embeddings trained on the NYT portion of the Gigaword 5.0 corpus (Parker et al., 2011) with word2vec (Mikolov et al., 2013). We use the CBOW model with negative sampling (15 negative words). We set a window size c=5 and remove types occurring fewer than 5 times.
Among the models we compare, the log-linear baseline (3) uses the features of Zhou et al. (2005) plus several additional carefully-chosen features that have been highly tuned for ACE-style relation extraction over years of research. We exclude the Country gazetteer and WordNet features from Zhou et al. (2005). The two remaining methods are hybrid models that integrate FCM as a submodel within the log-linear model (§ 4). We consider two combinations. (4) The feature set of Nguyen and Grishman (2014), obtained by using the embeddings of the heads of the two entity mentions (+HeadOnly). (5) Our full FCM model (+FCM). All models use L2 regularization tuned on dev data.

ACE 2005
We evaluate our relation extraction system on the English portion of the ACE 2005 corpus (Walker et al., 2006).$^{8}$ There are 6 domains: Newswire (nw), Broadcast Conversation (bc), Broadcast News (bn), Telephone Speech (cts), Usenet Newsgroups (un), and Weblogs (wl). Following prior work we focus on the domain adaptation setting, where we train on one set (the union of the news domains, bn+nw), tune hyperparameters on a dev domain (half of bc), and evaluate on the remainder (cts, wl, and the remainder of bc) (Plank and Moschitti, 2013; Nguyen and Grishman, 2014). We assume that gold entity spans and types are available for train and test. We use all pairs of entity mentions to yield 43,518 total relations in the training set. We report precision, recall, and F1 for relation extraction. While it is not our focus, for completeness we include results with unknown entity types following Plank and Moschitti (2013) (Appendix 1).

SemEval 2010 Task 8
We evaluate on the SemEval 2010 Task 8 dataset$^{9}$ (Hendrickx et al., 2010) to compare with other compositional models and highlight the advantages of FCM. This task is to determine the relation type (or no relation) between two entities in a sentence. We adopt the setting of Socher et al. (2012). We use 10-fold cross validation on the training data to select hyperparameters and do regularization by early stopping. The learning rates for FCM with/without fine-tuning are 5e-3 and 5e-2 respectively. We report macro-F1 and compare to previously published results.
$^{8}$ Many relation extraction systems evaluate on the ACE 2004 corpus (Mitchell et al., 2005). Unfortunately, the most common convention is to use 5-fold cross validation, treating the entirety of the dataset as both train and evaluation data. Rather than continuing to overfit this data by perpetuating the cross-validation convention, we instead focus on ACE 2005.
$^{9}$ http://docs.google.com/View?docid=dfvxd49s_36c28v9pmw
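For readers unfamiliar with the metric, macro-F1 averages per-class F1 scores so that every class counts equally regardless of its frequency. The sketch below is a generic illustration only, not the official SemEval scorer, whose conventions may differ; the function name, the `exclude` parameter, and the toy labels are hypothetical.

```python
def macro_f1(gold, pred, exclude=()):
    """Unweighted mean of per-class F1 over all classes not in `exclude`."""
    labels = sorted((set(gold) | set(pred)) - set(exclude))
    f1_scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores) if f1_scores else 0.0


# Toy labels: classes "A" and "B" are scored, "Other" is left out of the average.
gold = ["A", "A", "B", "Other"]
pred = ["A", "B", "B", "Other"]
score = macro_f1(gold, pred, exclude=("Other",))
print(score)
```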

Results
ACE 2005 Despite FCM's (1) simple feature set, it is competitive with the log-linear baseline (3) on out-of-domain test sets (Table 3). In the typical gold entity spans and types setting, both Plank and Moschitti (2013) and Nguyen and Grishman (2014) found that they were unable to obtain improvements by adding embeddings to baseline feature sets. By contrast, we find that on all domains the combination baseline + FCM (5) obtains the highest F1 and significantly outperforms the other baselines, yielding the best reported results for this task. We found that fine-tuning of embeddings (2) did not yield improvements on our out-of-domain development set, in contrast to our results below for SemEval. We suspect this is because fine-tuning allows the model to overfit the training domain, which then hurts performance on the unseen ACE test domains. Accordingly, Table 3 reports FCM only as a log-linear model (i.e. without fine-tuning).
Finally, we highlight an important contrast between FCM (1) and the log-linear model (3): the latter uses over 50 feature templates based on a POS tagger, dependency parser, chunker, and constituency parser. FCM uses only a dependency parse but still obtains better results (Avg. F1).

SemEval 2010 Task 8
Table 4 shows FCM compared to the best reported results from the SemEval-2010 Task 8 shared task and several other compositional models.
For the FCM we considered two feature sets. We found that using NE tags instead of WordNet tags helps with fine-tuning but hurts without. This may be because the set of WordNet tags is larger, which makes the model more expressive but also introduces more parameters. When the embeddings are fixed, the extra tags can help to better distinguish different functions of embeddings. But when fine-tuning, it becomes easier to overfit. Alleviating overfitting is a subject for future work (§ 9).
With either WordNet or NER features, FCM achieves better performance than the RNN and MVRNN. With NER features and fine-tuning, it outperforms a CNN (Zeng et al., 2014) and also the combination of an embedding model and a traditional log-linear model (RNN/MVRNN + linear) (Socher et al., 2012). As with ACE, FCM uses fewer linguistic resources than many close competitors (Rink and Harabagiu, 2010).
We also compared to concurrent work on enhancing the compositional models with task-specific information for relation classification, including Hashimoto et al. (2015) (RelEmb), which trained task-specific word embeddings, and dos Santos et al. (2015) (CR-CNN), which proposed a task-specific ranking-based loss function. Our hybrid methods (FCM + linear) get results comparable to theirs. Note that their base compositional model results without any task-specific enhancements, i.e. RelEmb with word2vec embeddings and CR-CNN with log-loss, are still lower than the best FCM result. We believe that FCM can also be improved with these task-specific enhancements, e.g. replacing the word embeddings with the task-specific ones from Hashimoto et al. (2015) increases the result to 83.7% (see §7.2 for details). We leave the application of ranking-based loss to future work.
Finally, concurrent work (Liu et al., 2015) proposes DepNN, which builds representations for the dependency path (and its attached subtrees) between two entities by applying recursive and convolutional neural networks successively. Our FCM achieves results comparable to their model. Of note, FCM and RelEmb are also the most efficient of the compositional models above, since they have linear time complexity with respect to the dimension of the embeddings.

Effects of the embedding sub-models
We next investigate the effects of different types of features on FCM using ablation tests on ACE 2005 (

Effects of the word embeddings
Good word embeddings are critical for both FCM and other compositional models. In this section, we show the results of FCM with embeddings used to initialize other recent state-of-the-art models. Those embeddings include the 300-d baseline embeddings trained on English Wikipedia (w2v-enwiki-d300) and the 100-d task-specific embeddings (task-specific-d100)$^{10}$ from the RelEmb paper (Hashimoto et al., 2015), and the 400-d embeddings from the CR-CNN paper (dos Santos et al., 2015). Moreover, we list the best result (DepNN) in Liu et al. (2015), which uses the same embeddings as ours. Table 6 shows the effects of word embeddings on FCM and provides relative comparisons between FCM and the other state-of-the-art models. We use the same hyperparameters and number of iterations as in Table 4. The results show that using different embeddings to initialize FCM can improve F1 beyond our previous results. We also find that increasing the dimension of the word embeddings does not necessarily lead to better results, due to the problem of overfitting (e.g. w2v-enwiki-d400 vs. w2v-enwiki-d300). With the same initial embeddings and without any changes to the hyperparameters, FCM usually gets better results than the competing model, further confirming the advantage of FCM at the model level as discussed under Table 4. The only exception is the DepNN model, which gets a better result than FCM on the same embeddings. The task-specific embeddings from Hashimoto et al. (2015) lead to the best performance (an improvement of 0.7%). This observation suggests that the other compositional models may also benefit from the work of Hashimoto et al. (2015).
$^{10}$ In the task-specific setting, FCM will represent entity words and context words with separate sets of embeddings.

[Table 6. Columns: Embeddings, Model, F1.]

Related Work
Compositional Models for Sentences In order to build a representation (embedding) for a sentence based on its component word embeddings and structural information, recent work on compositional models (stemming from the deep learning community) has designed model structures that mimic the structure of the input. For example, these models could take into account the order of the words (as in Convolutional Neural Networks (CNNs)) (Collobert et al., 2011) or build off of an input tree (as in Recursive Neural Networks (RNNs) or the Semantic Matching Energy Function) (Socher et al., 2013b;Bordes et al., 2012).
While these models work well on sentence-level representations, the nature of their designs also limits them to fixed types of substructures from the annotated sentence, such as chains for CNNs and trees for RNNs. Such models cannot capture arbitrary combinations of linguistic annotations available for a given task, such as the word order, dependency tree, and named entities used for relation extraction. Moreover, these approaches ignore the differences in functions between words appearing in different roles. This does not suit more general substructure labeling tasks in NLP, e.g. these models cannot be directly applied to relation extraction since they will output the same result for any pair of entities in the same sentence.
Compositional Models with Annotation Features To tackle the problem of traditional compositional models, Socher et al. (2012) made the RNN model specific to relation extraction tasks by working on the minimal sub-tree which spans the two target entities. However, these specializations to relation extraction do not generalize easily to other tasks in NLP. There are two ways to achieve such specialization in a more general fashion: 1. Enhancing Compositional Models with Features. A recent trend enhances compositional models with annotation features. Such an approach has been shown to significantly improve over pure compositional models. For example, Hermann et al. (2014) and Nguyen and Grishman (2014) gave different weights to words with different syntactic context types or to entity head words with different argument IDs. Zeng et al. (2014) use concatenations of embeddings as features in a CNN model, according to their positions relative to the target entity mentions. Belinkov et al. (2014) enrich embeddings with linguistic features before feeding them forward to an RNN model. Socher et al. (2013a) and Hermann and Blunsom (2013) enhanced RNN models by refining the transformation matrices with phrase types and CCG super tags.
2. Engineering of Embedding Features. A different approach to combining traditional linguistic features and embeddings is hand-engineering features with word embeddings and adding them to log-linear models. Such approaches have achieved state-of-the-art results in many tasks including NER, chunking, dependency parsing, semantic role labeling, and relation extraction (Miller et al., 2004;Turian et al., 2010;Koo et al., 2008;Roth and Woodsend, 2014;Sun et al., 2011;Plank and Moschitti, 2013). Roth and Woodsend (2014) considered features similar to ours for semantic role labeling.
However, in prior work both of the above approaches are only able to utilize limited information, usually one property for each word. Yet there may be different useful properties of a word which can contribute to performance on the task. By contrast, our FCM can easily utilize these features without changing the model structure.
In order to better utilize the dependency annotations, recent work has built models according to the dependency paths (Ma et al., 2015; Liu et al., 2015), which share similar motivations to the usage of On-path features in our work.
Task-Specific Enhancements for Relation Classification An orthogonal direction of improving compositional models for relation classification is to enhance the models with task-specific information. For example, Hashimoto et al. (2015) trained task-specific word embeddings, and dos Santos et al. (2015) proposed a ranking-based loss function for relation classification.

Conclusion
We have presented FCM, a new compositional model for deriving sentence-level and substructure embeddings from word embeddings. Compared to existing compositional models, FCM can easily handle arbitrary types of input and handle global information for composition, while remaining easy to implement. We have demonstrated that FCM alone attains near state-of-the-art performance on several relation extraction tasks, and in combination with traditional feature-based log-linear models it obtains state-of-the-art results.
Our next steps in improving FCM focus on enhancements based on task-specific embeddings or loss functions as in Hashimoto et al. (2015) and dos Santos et al. (2015). Moreover, as the model provides a general idea for representing both sentences and sub-structures in language, it has the potential to contribute useful components to various tasks, such as dependency parsing, SRL and paraphrasing. Also, as kindly pointed out by one anonymous reviewer, our FCM can be applied to the TAC-KBP (Ji et al., 2010) tasks by replacing the training objective with a multi-instance multi-label one (e.g. Surdeanu et al. (2012)). We plan to explore the above applications of FCM in the future.