Global Textual Relation Embedding for Relational Understanding

Pre-trained embeddings such as word embeddings and sentence embeddings are fundamental tools facilitating a wide range of downstream NLP tasks. In this work, we investigate how to learn a general-purpose embedding of textual relations, defined as the shortest dependency path between entities. Textual relation embedding provides a level of knowledge between word/phrase level and sentence level, and we show that it can facilitate downstream tasks requiring relational understanding of the text. To learn such an embedding, we create the largest distant supervision dataset by linking the entire English ClueWeb09 corpus to Freebase. We use global co-occurrence statistics between textual and knowledge base relations as the supervision signal to train the embedding. Evaluation on two relational understanding tasks demonstrates the usefulness of the learned textual relation embedding. The data and code can be found at https://github.com/czyssrs/GloREPlus


Introduction
Pre-trained embeddings such as word embeddings (Mikolov et al., 2013; Pennington et al., 2014; Peters et al., 2018; Devlin et al., 2018) and sentence embeddings (Le and Mikolov, 2014; Kiros et al., 2015) have become fundamental NLP tools. Learned with large-scale (e.g., up to 800 billion tokens (Pennington et al., 2014)) open-domain corpora, such embeddings serve as a good prior for a wide range of downstream tasks by endowing task-specific models with general lexical, syntactic, and semantic knowledge.
Inspecting the spectrum of granularity, a representation between the lexical (and phrasal) level and the sentence level is missing. Many tasks require relational understanding of the entities mentioned in the text, e.g., relation extraction and knowledge base completion. The textual relation (Bunescu and Mooney, 2005), defined as the shortest path between two entities in the dependency parse tree of a sentence, has been widely shown to be the main bearer of relational information in text and has proved effective in relation extraction tasks (Xu et al., 2015; Su et al., 2018). If we can learn a general-purpose embedding for textual relations, it may facilitate many downstream relational understanding tasks by providing general relational knowledge.
Similar to language modeling for learning general-purpose word embeddings, distant supervision (Mintz et al., 2009) is a promising way to acquire supervision, at no cost, for training a general-purpose embedding of textual relations. Recently, Su et al. (2018) proposed to leverage global co-occurrence statistics of textual and KB relations to learn embeddings of textual relations, and showed that this can effectively combat the wrong labeling problem of distant supervision (see Figure 1 for an example). While their method, named GloRE, achieves state-of-the-art performance on the popular New York Times (NYT) dataset (Riedel et al., 2010), the scope of their study is limited to relation extraction with small-scale in-domain training data.
In this work, we take the GloRE approach further and apply it to large-scale, domain-independent data labeled with distant supervision, with the goal of learning general-purpose textual relation embeddings. Specifically, we create the largest-ever distant supervision dataset by linking the entire English ClueWeb09 corpus (half a billion web documents) to the latest version of Freebase (Bollacker et al., 2008), which contains 45 million entities and 3 billion relational facts. After filtering, we get a dataset with over 5 million unique textual relations and around 9 million co-occurring textual and KB relation pairs. We then train textual relation embeddings on the collected dataset in a way similar to Su et al. (2018), but using the Transformer (Vaswani et al., 2017) instead of a vanilla RNN as the encoder for better training efficiency.
To demonstrate the usefulness of the learned textual relation embedding, we experiment on two relational understanding tasks, relation extraction and knowledge base completion. For relation extraction, we use the embedding to augment PCNN+ATT (Lin et al., 2016) and improve the precision of the top 1000 predictions from 83.9% to 89.8%. For knowledge base completion, we replace the neural network in Toutanova et al. (2015) with our pre-trained embedding followed by a simple projection layer, and gain improvements on both MRR and HITS@10 measures. Our major contributions are summarized as follows:
• We propose the novel task of learning a general-purpose embedding of textual relations, which has the potential to facilitate a wide range of relational understanding tasks.
• To learn such an embedding, we create the largest distant supervision dataset by linking the entire English ClueWeb09 corpus to Freebase. The dataset is publicly available.
• Based on the global co-occurrence statistics of textual and KB relations, we learn a textual relation embedding on the collected dataset and demonstrate its usefulness on relational understanding tasks.

Related Work
Distant supervision methods (Mintz et al., 2009) for relation extraction have been studied in a number of works (Riedel et al., 2010; Hoffmann et al., 2011; Surdeanu et al., 2012; Zeng et al., 2015; Lin et al., 2016; Ji et al., 2017; Wu et al., 2017). Su et al. (2018) use global co-occurrence statistics of textual and KB relations to effectively combat the wrong labeling problem. But the global statistics in their work are limited to the NYT dataset, capturing only domain-specific distributions.
Another line of research related to ours is the universal schema (Riedel et al., 2013) for relation extraction and KB completion, as well as its extensions (Toutanova et al., 2015; Verga et al., 2016). The wrong labeling problem still exists there, since the embeddings are learned based on individual relation facts. In contrast, we use the global co-occurrence statistics as an explicit supervision signal.

Textual Relation Embedding
In this section, we describe how to collect large-scale data via distant supervision (§3.1) and train the textual relation embedding (§3.2).

Global Co-Occurrence Statistics from Distant Supervision
To construct a large-scale distant supervision dataset, we first obtain the English ClueWeb09 corpus (Callan et al., 2009), which contains 500 million web documents. We employ the FACC1 dataset (Gabrilovich et al., 2013), which links entity mentions in ClueWeb09 to Freebase. For each sentence containing a pair of linked entities, we extract the textual relation t (the shortest dependency path) between them; if a KB relation r between the two entities exists in the KB, then we count it as a co-occurrence of t and r. We count the total number of co-occurrences of each pair of textual and KB relations across the entire corpus. We then normalize the global co-occurrence statistics such that each textual relation has a valid probability distribution over all the KB relations, which presumably captures the semantics of the textual relation. In the end, a bipartite relation graph is constructed, with one node set being the textual relations, the other node set being the KB relations, and the weighted edges representing the normalized global co-occurrence statistics.
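To make the construction concrete, below is a minimal Python sketch (not the released pipeline) of turning distant-supervision co-occurrences into the normalized bipartite relation graph; the input format and all names are illustrative assumptions.

```python
from collections import defaultdict

def build_relation_graph(corpus_matches):
    """corpus_matches: iterable of (textual_relation, kb_relation) pairs,
    one per co-occurrence found in the aligned corpus (hypothetical format)."""
    counts = defaultdict(lambda: defaultdict(float))
    for t, r in corpus_matches:
        counts[t][r] += 1.0

    # Normalize per textual relation so each row is a probability
    # distribution over KB relations (the edge weights of the graph).
    graph = {}
    for t, r_counts in counts.items():
        total = sum(r_counts.values())
        graph[t] = {r: c / total for r, c in r_counts.items()}
    return graph

# Toy example: two sentences support ("born_in", place_of_birth) and one
# supports ("born_in", place_lived); the edge weights become 2/3 and 1/3.
toy = [("born_in", "place_of_birth"), ("born_in", "place_of_birth"),
       ("born_in", "place_lived")]
print(build_relation_graph(toy))
```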
Filtering. When aligning the text corpus with the KB, we apply a number of filters to ensure data quality and training efficiency: (1) We only use the KB relations in Freebase Commons, 70 domains that are manually verified to be of release quality. (2) Only textual relations with at most 10 tokens (counting both lexical tokens and dependency relations) are kept. (3) Only non-symmetric textual relations are kept, because symmetric ones typically arise from conjunctions like "and" or "or", which are less of interest. (4) Only textual relations with at least two occurrences are kept. After filtering, we end up with a relation graph with 5,559,176 unique textual relations, 1,925 KB relations, and 8,825,731 edges with non-zero weight. It is worth noting that these filters are quite conservative, and we can easily increase the scale of the data by relaxing some of them.
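As a rough illustration, the following sketch applies filtering rules (2)-(4). The token encoding for dependency edges (e.g. "<-dobj-", "-nsubj->") and the symmetry check are assumptions made for this example, not the exact format used in our pipeline.

```python
from typing import List

def flip(tok: str) -> str:
    # Flip the direction of a dependency-edge token, e.g. "<-dobj-" <-> "-dobj->".
    if tok.startswith("<-") and tok.endswith("-"):
        return "-" + tok[2:] + ">"
    if tok.startswith("-") and tok.endswith("->"):
        return "<-" + tok[1:-2] + "-"
    return tok  # lexical tokens are unchanged

def is_symmetric(tokens: List[str]) -> bool:
    # A path is symmetric if reversing it and flipping every edge yields the
    # same path, e.g. "<-conj- and -conj->" (typical of "and"/"or" conjunctions).
    return [flip(t) for t in reversed(tokens)] == list(tokens)

def keep_textual_relation(tokens: List[str], occurrences: int, max_len: int = 10) -> bool:
    if len(tokens) > max_len:   # rule (2): at most 10 tokens (lexical + dependency)
        return False
    if is_symmetric(tokens):    # rule (3): drop symmetric paths
        return False
    if occurrences < 2:         # rule (4): at least two occurrences
        return False
    return True

print(keep_textual_relation(["<-conj-", "and", "-conj->"], occurrences=5))      # False (symmetric)
print(keep_textual_relation(["<-nsubj-", "founded", "-dobj->"], occurrences=3)) # True
```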

Embedding Training
Considering both effectiveness and efficiency, we employ the Transformer encoder (Vaswani et al., 2017) to learn the textual relation embedding. It has been shown to excel at learning general-purpose representations (Devlin et al., 2018).
The embedded textual relation token sequence is fed as input. For example, for the textual relation "←dobj− founded −nsubj→", the input is the embedded sequence {<-dobj>, founded, <nsubj>}. We project the output of the encoder to a vector z as the resulting embedding. Given a textual relation t_i and its embedding z_i, denote by {r_1, r_2, ..., r_n} the set of all KB relations, and by p(r_j|t_i) the global co-occurrence distribution, i.e., the weight of the edge between textual relation t_i and KB relation r_j in the relation graph. The training objective is to minimize the cross-entropy loss
$$\mathcal{L} = -\sum_{i}\sum_{j} p(r_j \mid t_i)\,\log \hat{p}(r_j \mid t_i), \qquad (1)$$
where
$$\hat{p}(r_j \mid t_i) = \big(\mathrm{softmax}(W z_i + b)\big)_j. \qquad (2)$$
W and b are trainable parameters. We use the filtered relation graph in §3.1 as our training data. To assess how well the model generalizes to unseen textual relations, we hold out 5% of the training data as a validation set. Word embeddings are initialized with GloVe (Pennington et al., 2014) vectors. Dependency relation embeddings are initialized randomly.
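A minimal PyTorch sketch of this objective is shown below, assuming mean pooling over the encoder outputs to obtain z_i and toy dimensions; it is meant to illustrate Eq. (1)-(2), not to reproduce our exact configuration.

```python
import torch
import torch.nn as nn

class RelationEmbedder(nn.Module):
    def __init__(self, vocab_size, num_kb_relations, d_model=300, n_layers=6, n_heads=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.proj = nn.Linear(d_model, num_kb_relations)  # W, b in Eq. (2)

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))   # (batch, seq, d_model)
        z = h.mean(dim=1)                         # pooled relation embedding z_i (assumed pooling)
        return self.proj(z)                       # logits over KB relations

model = RelationEmbedder(vocab_size=50000, num_kb_relations=1925)
tokens = torch.randint(0, 50000, (4, 10))             # a toy batch of textual relations
target = torch.softmax(torch.rand(4, 1925), dim=-1)   # stand-in for p(r|t) edge weights
log_q = torch.log_softmax(model(tokens), dim=-1)
loss = -(target * log_q).sum(dim=-1).mean()           # Eq. (1): cross-entropy with soft targets
loss.backward()
```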
For the Transformer model, we use 6 layers and 6 attention heads per layer. We use the Adam optimizer (Kingma and Ba, 2015) with the parameter settings suggested by the original Transformer paper (Vaswani et al., 2017). We train for at most 200 epochs and take the checkpoint with the minimum validation loss.
We also compare with the vanilla RNN used in GloRE (Su et al., 2018). We denote the embedding trained with the Transformer as GloRE++, standing for both new data and a different model, and the one trained with the RNN as GloRE+, standing for new data only. We observe that, in the early stage of training, the validation loss of the RNN decreases faster than that of the Transformer, but it soon starts to overfit.

Experiments
In this section, we evaluate the usefulness of the learned textual relation embedding on two popular relational understanding tasks, relation extraction and knowledge base completion. We do not fine-tune the embedding, and only use in-domain data to train a single feedforward layer that projects the embedding to the target relations of the task. We compare this with models that are specifically designed for those tasks and trained with in-domain data. Achieving comparable or better results would demonstrate that the general-purpose embedding captures useful information for downstream tasks.

Relation Extraction
We experiment on the popular New York Times (NYT) relation extraction dataset (Riedel et al., 2010). Following GloRE (Su et al., 2018), we aim at augmenting existing relation extractors with the textual relation embeddings. We first average the textual relation embeddings of all contextual sentences of an entity pair, and project the average embedding to the target KB relations. We then construct an ensemble model as a weighted combination of the predictions from the base model and from the textual relation embedding.
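A hedged sketch of this augmentation step is given below; the mixing weight alpha, the NumPy stand-ins for the projection parameters, and the number of target relations are illustrative assumptions, not the released implementation.

```python
import numpy as np

def ensemble_scores(base_probs, sentence_embeddings, W, b, alpha=0.7):
    """base_probs: (num_relations,) predictions from the base model (e.g. PCNN+ATT).
    sentence_embeddings: (num_sentences, dim) pre-trained textual relation
    embeddings of the entity pair's contextual sentences."""
    z = sentence_embeddings.mean(axis=0)                  # average embedding
    logits = W @ z + b                                    # single projection layer
    emb_probs = np.exp(logits - logits.max())
    emb_probs /= emb_probs.sum()                          # softmax over KB relations
    return alpha * base_probs + (1 - alpha) * emb_probs   # weighted combination

rng = np.random.default_rng(0)
W, b = rng.standard_normal((53, 300)), np.zeros(53)       # illustrative number of target relations
base = rng.dirichlet(np.ones(53))                         # toy base-model distribution
sents = rng.standard_normal((4, 300))                     # 4 contextual sentence embeddings
print(ensemble_scores(base, sents, W, b).shape)           # (53,)
```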
Same as Su et al. (2018), we use PCNN+ATT (Lin et al., 2016) as our base model. GloRE++ improves its best F1-score from 42.7% to 45.2%, slightly outperforming the previous state of the art (GloRE, 44.7%). As shown in previous work (Su et al., 2018), on the NYT dataset the precision-recall curve on the held-out set may not be an accurate measure of performance due to a significant number of false negatives. Therefore, we mainly rely on manual evaluation. We invite graduate students to check the top 1000 predictions of each method. They are presented with the entity pair, the prediction, and all the contextual sentences of the entity pair. Each prediction is examined by two students until they reach an agreement after discussion. The students are not aware of the source of the predictions. Table 1 shows the manual evaluation results. Both GloRE+ and GloRE++ improve over GloRE. GloRE++ obtains the best results for the top 700, 900, and 1000 predictions.

Knowledge Base Completion
We experiment on another relational understanding task, knowledge base (KB) completion, on the popular FB15k-237 dataset (Toutanova et al., 2015). The goal is to predict missing relation facts based on a set of known entities, KB relations, and textual mentions. Toutanova et al. (2015) use a convolutional neural network (CNN) to model textual relations. We replace their CNN with our pre-trained embedding followed by a single feedforward projection layer.
As in Toutanova et al. (2015), we use the best-performing DISTMULT and E+DISTMULT as the base models. DISTMULT (Yang et al., 2015) learns a latent vector for each entity and each relation type, while model E (Riedel et al., 2013) learns two latent vectors for each relation type, associated with its subject and object entities respectively. E+DISTMULT is a combination model that ensembles the predictions of the individual models and is trained jointly. We conduct experiments using only KB relations (KB only), using their CNN to model textual relations (Conv), and using our embedding to model textual relations (Emb).
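For reference, the scoring functions of DISTMULT and model E referenced above can be sketched as follows; the embeddings here are toy values, not trained parameters.

```python
import numpy as np

def distmult_score(e_s, w_r, e_o):
    # Trilinear dot product: sum_d e_s[d] * w_r[d] * e_o[d]
    return np.sum(e_s * w_r * e_o)

def model_e_score(e_s, r_subj, e_o, r_obj):
    # Model E keeps two vectors per relation, one matched against the
    # subject embedding and one against the object embedding.
    return np.dot(e_s, r_subj) + np.dot(e_o, r_obj)

e_s, w_r, e_o = np.random.rand(3, 50)   # toy 50-dimensional embeddings
print(distmult_score(e_s, w_r, e_o))
print(model_e_score(e_s, w_r, e_o, w_r))
```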
The models are tested on predicting the object entities of a set of KB triples disjoint from the training set, given the subject entity and the relation type. Table 2 shows the performance of all models measured by the mean reciprocal rank (MRR) of the correct entity and HITS@10 (the percentage of test instances for which the correct entity is ranked within the top 10 predictions). We also show the performance on the two subsets of the test set, with and without textual mentions. The pre-trained embedding achieves comparable or better results than the CNN model trained with in-domain data.
Visualization. We apply t-SNE (Maaten and Hinton, 2008) to the learned embeddings of the ClueWeb validation set. We filter out infrequent textual relations and assign a label to a textual relation when it co-occurs more than half of the time with a KB relation. The visualization of the GloRE++ embeddings associated with the top-10 most frequent KB relations is shown in Figure 2. As we can see, similar textual relations are grouped together while dissimilar ones are separated. This implies that the embedding model can produce good representations for unseen textual relations, and can potentially serve as relational features to help tasks in unsupervised settings.
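The visualization step can be reproduced roughly with scikit-learn's t-SNE as sketched below; the arrays are random stand-ins for the learned embeddings and KB-relation labels.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

embeddings = np.random.rand(500, 300)    # stand-in for learned textual relation embeddings
labels = np.random.randint(0, 10, 500)   # stand-in for top-10 KB relation labels

points = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10", s=5)
plt.title("t-SNE of textual relation embeddings (toy data)")
plt.show()
```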
Case Study. To show that the embedding model generalizes to unseen textual relations by capturing crucial textual sub-patterns, we randomly pick textual relations that appear in the NYT training set but not in the ClueWeb training set, and compare each with its top-5 nearest neighbors in the ClueWeb training set, based on the similarity of the learned embeddings. A case study is shown in Table 3. We can see that the KB relation place of birth often collocates with the preposition "in", indicating that the object is of a location type, and with key words like "born". Together, the sub-structure "born in" serves as a strong indicator for the place of birth relation. There is almost always some redundant information in a textual relation, for example path fragments such as "→ nov." that do not carry crucial information indicating the target relation. A good textual relation embedding model should be capable of learning to attend to the crucial semantic patterns.
Figure 1: The wrong labeling problem of distant supervision. Left: The Ford Motor Company is both founded by and named after Henry Ford; the KB relations founder and named after are thus both mapped to all of the sentences containing the entity pair, resulting in many wrong labels (red dashed arrows). Right: Global co-occurrence statistics from our distant supervision dataset, which clearly distinguish the two textual relations.

Table 1: Relation extraction manual evaluation results: precision of the top 1000 predictions.

Table 3: Case study: the textual relation embedding model generalizes well to unseen textual relations by capturing common shared sub-structures.