Cross-Sentence N-ary Relation Extraction using Lower-Arity Universal Schemas

Most existing relation extraction approaches target binary relations exclusively, and n-ary relation extraction remains relatively unexplored. The current state-of-the-art n-ary relation extraction method is based on supervised learning and may therefore suffer from a lack of sufficient relation labels. In this paper, we propose a novel approach to cross-sentence n-ary relation extraction based on universal schemas. To alleviate the sparsity problem and to leverage the inherent decomposability of n-ary relations, we propose to learn relation representations of the lower-arity facts that result from decomposing higher-arity facts. The proposed method computes the score of a new n-ary fact by aggregating the scores of its decomposed lower-arity facts. We conduct experiments on datasets for ternary relation extraction and empirically show that our method improves n-ary relation extraction performance compared to previous methods.


Introduction
Relation extraction is a core natural language processing task which is concerned with the extraction of relations between entities from text. It has numerous applications ranging from question answering (Xu et al., 2016) to automated knowledge base construction (Dong et al., 2014).
While the vast majority of existing research focuses on extracting binary relations, only a few recent approaches extract n-ary relations, that is, relations among n ≥ 2 entities (Li et al., 2015; Ernst et al., 2018). In n-ary relation extraction, relation mentions tend to span multiple sentences more frequently as n increases. Thus, Peng et al. (2017) recently extended the problem to cross-sentence n-ary relation extraction, in which n-ary relations are extracted from multiple sentences. As a motivating example, consider the following text from Wikipedia: "Revis started off the 2009 season matched up against some of football's best wide receivers. In Week 1, he helped limit Houston Texans Pro-bowler Andre Johnson to four receptions for 35 yards." Here, the two sentences collectively describe that Andre Johnson was a player of the football team the Texans during the 2009 season, so we need cross-sentence information to correctly extract this ternary interaction among the three entities, i.e., Player(Andre Johnson, Texans, 2009 season).
Previous methods (Peng et al., 2017; Song et al., 2018) capture cross-sentence n-ary relation mentions by representing texts with a document graph, which consists of both intra- and cross-sentence links between words. With this graphical representation, they applied graph neural networks to predict ternary relations in the medical domain. However, these methods train the neural networks in a supervised manner using distant supervision (Mintz et al., 2009) and may therefore suffer from a lack of sufficient positive labels when a well-populated knowledge base is not available.
On the other hand, for binary relation extraction, the problem of insufficient positive labels can be mitigated with universal schemas. In the universal schema approach, textual representations (surface patterns) of entities and their relations are encoded into the same vector space as the canonical knowledge base relations. Semantically similar surface patterns can thus share relation label information in a semi-supervised manner, which reduces the amount of labeled training data required. Applying the universal schema approach to n-ary (n > 2) relation extraction is, however, not straightforward, due to the sparsity of higher-order relation mentions among a specific set of n > 2 entities. This is because the universal schema approach and its extensions (Toutanova et al., 2015; Verga et al., 2016, 2017) utilize co-occurring patterns of relation types between a specific pair of entities. Also, prior work has only addressed binary relations, and it is not trivial to define surface patterns among n > 2 entities and to encode these patterns into a vector representation. To mitigate the aforementioned sparsity problem and to utilize existing encoders for binary and unary surface patterns, we propose to train universal schema models on denser lower-arity (unary and binary) facts instead of the original sparse n-ary facts. Since most n-ary relations can be decomposed into a set of k-ary relations (k = 1, 2) which are implied by the n-ary relation, 2 we can easily acquire lower-arity facts by decomposing n-ary facts. Our model learns representations of these lower-arity relations using the universal schema framework, and predicts new n-ary facts by aggregating the scores of lower-arity facts.
To evaluate the proposed method, we create new cross-sentence n-ary relation extraction datasets with multiple ternary relations. 3 The new datasets contain more entity tuples with known relational facts in a knowledge base than the existing dataset (Peng et al., 2017), and can therefore be used to more reliably evaluate methods which predict relation labels for each individual entity tuple. We show empirically that by jointly training lower-arity models and an n-ary score aggregation model, the proposed method improves the performance of n-ary relation extraction. To the best of our knowledge, this is the first attempt to apply universal schemas to n-ary relation extraction, taking advantage of the compositionality of higher-arity facts.

Task Definition and Notation
The cross-sentence n-ary relation extraction task (Peng et al., 2017) is defined as follows. Let E be a set of entities, R_KB be a set of relation types of an external knowledge base KB, and O_KB = {⟨r, (e_1, ..., e_n)⟩ : r(e_1, ..., e_n) ∈ KB, r ∈ R_KB, e_i ∈ E} be the set of facts in KB. We collect a set of candidate entity tuples among which a KB relation r ∈ R_KB possibly holds. Here, all entities in each candidate tuple (e_1, ..., e_n) are mentioned in the same text section T in a given set of documents. We define the set of these entity mentions as O_text = {⟨T, (e_1, ..., e_n)⟩}. A text section T is a (short) span in a document which can describe relational facts among entities; in the cross-sentence n-ary relation extraction task, a text section can contain multiple sentences. Following Peng et al. (2017), we define a text section as M consecutive sentences (M ≥ 1) which contain the n target entities. We use the term "relation" to refer to both relations r ∈ R_KB and sections T.

The goal of the cross-sentence n-ary relation extraction task is to predict new facts ⟨r, (e_1, ..., e_n)⟩ ∉ O_KB for the candidate entity tuples mentioned in O_text.

2 For example, the ternary relation AwardedFor(director, movie, award) can be decomposed into the binary relations DirectorOf(director, movie) and WonAward(director, award). Note that a similar idea is introduced in (Ernst et al., 2018) as partial facts or partial patterns.
3 Our code and datasets are available at https://github.com/aurtg/nary-relation-extraction-decomposed.
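To make the notation concrete, the following minimal sketch (with hypothetical class names, using the Player example from the introduction) shows how facts in O_KB and mentions in O_text might be represented:

```python
from typing import NamedTuple, Tuple

class KBFact(NamedTuple):
    relation: str                # r in R_KB
    entities: Tuple[str, ...]    # (e_1, ..., e_n)

class TextMention(NamedTuple):
    section: str                 # text section T (up to M consecutive sentences)
    entities: Tuple[str, ...]

# A known KB fact and a textual mention of the same entity tuple.
O_KB = {KBFact("Player", ("Andre Johnson", "Texans", "2009 season"))}
O_text = {TextMention(
    "Revis started off the 2009 season ... he helped limit Houston Texans "
    "Pro-bowler Andre Johnson to four receptions for 35 yards.",
    ("Andre Johnson", "Texans", "2009 season"))}

# The task: score <r, tuple> pairs for tuples mentioned in O_text and
# predict new facts that are not already in O_KB.
```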
Proposed Method

Lower-Arity Facts
To alleviate the sparsity problem of facts among n > 2 entities and to utilize well-studied encoders for binary and unary surface patterns, we decompose the set of original n-ary facts, O, into a set of unary facts O_1 and a set of binary facts O_2 (Figure 1).
Unary Facts: Given an n-ary fact ⟨r, (e_1, ..., e_n)⟩ ∈ O, we decompose it into a set of n unary facts {⟨r^(k), e_k⟩ : k = 1, ..., n}, where r^(k) is a tentative unary relation w.r.t. the k-th argument of the original relation r. If r is a KB relation, we define the unary relation r^(k) as a new canonicalized relation. If r is a section T, we define the unary relation r^(k) as a tuple r^(k) = (T, pos(e_k)), where pos(e_k) is the set of word position indices of entity e_k in section T (Figure 2). We denote the set of all decomposed unary facts by O_1. Intuitively, these unary relations represent semantic roles or types of the corresponding arguments of the original relation r.
Binary Facts: Given an n-ary fact ⟨r, (e_1, ..., e_n)⟩ ∈ O, we decompose it into a set of n(n − 1) binary facts {⟨r^(k,l), (e_k, e_l)⟩ : k, l = 1, ..., n, k ≠ l}, where r^(k,l) is a tentative binary relation between the k-th and l-th arguments of the original relation r. If r is a KB relation, we define the binary relation r^(k,l) as a new canonicalized relation. If r is a section T, we represent it by the shortest path between e_k and e_l on the document graph of T (Figure 2), and denote it by path(T; e_k, e_l). We denote the set of all decomposed binary facts by O_2.
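The decomposition above can be sketched directly in code (a simplified illustration: here the tentative relations r^(k) and r^(k,l) are represented by index tuples, whereas the actual model uses canonicalized relations or encoded surface patterns):

```python
from itertools import permutations

def decompose(relation, entities):
    """Decompose an n-ary fact <r, (e_1, ..., e_n)> into
    n unary facts <r^(k), e_k> and n(n-1) binary facts <r^(k,l), (e_k, e_l)>."""
    n = len(entities)
    unary = [((relation, k), entities[k]) for k in range(n)]
    binary = [((relation, k, l), (entities[k], entities[l]))
              for k, l in permutations(range(n), 2)]
    return unary, binary

unary, binary = decompose("Player", ("Andre Johnson", "Texans", "2009 season"))
# n = 3 gives 3 unary facts and 3 * 2 = 6 binary facts.
```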

Lower-Arity Relation Representations
We learn a vector representation v(r) ∈ R^{d_r} for each unary or binary relation in O_1 or O_2. For an r^(k) or r^(k,l) derived from a KB relation, we represent it by a trainable parameter vector. For one derived from a textual relation, we use the following encoders to compute its representation.
Unary encoder: For a unary textual relation r^(k) = (T, pos(e_k)), we represent the section T as a sequence of word vectors and use a bidirectional LSTM (Bi-LSTM) (Schuster and Paliwal, 1997) to compute a hidden representation h_l ∈ R^{d_r} at each word position l. Following recent work (He et al., 2018; Lee et al., 2017), we aggregate the h_l within the phrase of entity e_k to compute v(r^(k)), using the element-wise mean as the aggregation function:

v(r^(k)) = mean({h_l : l ∈ pos(e_k)}). (1)

Binary encoder: For a binary textual relation r^(k,l) = path(T; e_k, e_l), we represent each token (word or edge label) in path(T; e_k, e_l) by an embedding vector (Toutanova et al., 2015; Verga et al., 2016). We use a Bi-LSTM to compute a hidden representation h_l ∈ R^{d_r} at each token position l, and max-pool along the path to compute the relation representation:

v(r^(k,l)) = max({h_l : l = 1, ..., L}). (2)
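The two pooling operations in (1) and (2) can be sketched as follows. This is a stand-in illustration: the matrix `h` here is random and merely simulates the Bi-LSTM hidden states h_l (one d_r-dimensional vector per token), which in the actual model come from a Bi-LSTM over word or edge-label embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d_r = 6, 8
# Stand-in for the Bi-LSTM hidden states h_l over a text section or path.
h = rng.standard_normal((L, d_r))

# Eq. (1): unary relation vector v(r^(k)) = element-wise mean of h_l
# over the word positions of entity e_k, e.g. pos(e_k) = {2, 3}.
pos_e_k = [2, 3]
v_unary = h[pos_e_k].mean(axis=0)

# Eq. (2): binary relation vector v(r^(k,l)) = element-wise max of h_l
# over the L tokens of the shortest dependency path between e_k and e_l.
v_binary = h.max(axis=0)
```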

Learning Relation Representations
We follow Verga et al. (2017) to train the relation representations (§3.2). We define a score θ_{r,p} for each lower-arity fact ⟨r, p⟩ ∈ O_1 ∪ O_2, and minimize the loss (3) below for each arity i = 1, 2.
Here, the placeholder p refers to either an entity (if ⟨r, p⟩ ∈ O_1) or an entity pair (if ⟨r, p⟩ ∈ O_2); we refer to both simply as entity tuples. The loss function contrasts the score of an observed fact ⟨r, p^+⟩ ∈ O_i with those of K sampled negative facts ⟨r, p_k^-⟩ ∉ O_i. We sample negative facts by randomly replacing the entity tuple p^+ in the observed fact with different entity tuples p_k^-.
L_i = − Σ_{⟨r,p^+⟩ ∈ O_i} Σ_{k=1}^{K} log σ(θ_{r,p^+} − θ_{r,p_k^-}). (3)

The score of a fact ⟨r, p⟩ is defined as θ_{r,p} = v(r)^T v(p; r). The entity tuple representation v(p; r) is computed as a weighted average of the representations {v(r') : r' ∈ V(p)} of the relations observed with p, as shown in (4) and (5), where a(r', r; V(p)) is the attention weight of each relation r' ∈ V(p):

v(p; r) = Σ_{r' ∈ V(p)} a(r', r; V(p)) v(r'), (4)
a(r', r; V(p)) = exp(v(r')^T v(r)) / Σ_{r'' ∈ V(p)} exp(v(r'')^T v(r)). (5)
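The scoring and training machinery above can be sketched numerically. This is a simplified illustration under stated assumptions: the dot-product/softmax attention form and the logistic contrastive form of the per-fact loss follow the surrounding description, and the vectors are random stand-ins for learned representations.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def tuple_repr(v_r, V_p):
    """Eqs. (4)-(5): attention-weighted average of the representations of the
    relations observed with entity tuple p. V_p is a matrix whose rows are
    v(r') for r' in V(p); dot-product/softmax attention is assumed here."""
    a = softmax(V_p @ v_r)           # a(r', r; V(p))
    return a @ V_p                   # v(p; r)

def score(v_r, V_p):
    """theta_{r,p} = v(r)^T v(p; r)."""
    return float(v_r @ tuple_repr(v_r, V_p))

def fact_loss(pos_score, neg_scores):
    """Sketch of loss (3) for one observed fact: contrast the positive score
    against the K sampled negative scores (logistic form assumed)."""
    sig = lambda x: 1.0 / (1.0 + np.exp(-x))
    return float(-np.log(sig(pos_score - np.asarray(neg_scores))).sum())

rng = np.random.default_rng(0)
v_r = rng.standard_normal(8)         # representation of query relation r
V_p = rng.standard_normal((4, 8))    # representations of relations in V(p)
theta = score(v_r, V_p)
```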

Aggregating Lower-Arity Scores
To predict n-ary facts of a KB relation r ∈ R_KB, we compute the score θ_{r,(e_1,...,e_n)} by aggregating lower-arity scores as in (6), where each w_r^(·) is a positive scalar weight defined per KB relation, and the weights sum to one: Σ_k w_r^(k) + Σ_{k≠l} w_r^(k,l) = 1.

θ_{r,(e_1,...,e_n)} = Σ_{k=1}^{n} w_r^(k) θ_{r^(k), e_k} + Σ_{k,l=1,...,n; k≠l} w_r^(k,l) θ_{r^(k,l), (e_k,e_l)}. (6)

We can set all weights w_r^(k) and w_r^(k,l) to 1/n^2, or train these weights to give higher scores to positive n-ary facts by minimizing an additional loss function L_n. Note that L_n directly contrasts n-ary scores associated with KB relations r ∈ R_KB in a more supervised manner than L_1 and L_2. The overall loss function is then L = L_1 + L_2 + αL_n. By changing α, we can balance the semi-supervised effect of the lower-arity universal schemas (L_1, L_2) against the supervision from n-ary relation labels (L_n).
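The aggregation in (6) can be sketched as follows, with the uniform 1/n^2 setting of the weights (there are n unary and n(n − 1) binary terms, i.e. n^2 weights in total); the example scores are hypothetical:

```python
from itertools import permutations

def nary_score(unary_scores, binary_scores, w_unary, w_binary):
    """Eq. (6): aggregate lower-arity scores into an n-ary score.
    unary_scores[k]       = theta_{r^(k), e_k}
    binary_scores[(k, l)] = theta_{r^(k,l), (e_k, e_l)}"""
    s = sum(w_unary[k] * unary_scores[k] for k in unary_scores)
    s += sum(w_binary[kl] * binary_scores[kl] for kl in binary_scores)
    return s

n = 3
# Untrained setting: all n + n(n-1) = n^2 weights equal to 1/n^2.
w_unary = {k: 1 / n**2 for k in range(n)}
w_binary = {kl: 1 / n**2 for kl in permutations(range(n), 2)}
assert abs(sum(w_unary.values()) + sum(w_binary.values()) - 1.0) < 1e-9

unary_scores = {0: 0.9, 1: 0.5, 2: 0.4}              # hypothetical scores
binary_scores = {kl: 0.6 for kl in permutations(range(n), 2)}
theta_nary = nary_score(unary_scores, binary_scores, w_unary, w_binary)
```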

Dataset
The cross-sentence n-ary relation extraction dataset of Peng et al. (2017) contains only 59 distinct ternary KB facts across the train and test sets. Since our proposed method and the universal schema baselines predict KB relations for each entity tuple instead of each surface pattern, the number of known facts of KB relations is crucial for reliably evaluating and comparing these methods. Thus, we created two new cross-sentence n-ary relation extraction datasets (dubbed Wiki-90k and WF-20k) that contain more known facts retrieved from public knowledge bases.
To create the Wiki-90k and WF-20k datasets, we used Wikidata and Freebase, respectively, as external knowledge bases. Since these knowledge bases store only binary relational facts, we defined multiple ternary relations by combining a few binary relations. For both datasets, we collected paragraphs from the English Wikipedia, and used Stanford CoreNLP (Manning et al., 2014) to extract dependency and co-reference links. Entity mentions were detected using DBpedia Spotlight (Daiber et al., 2013). We followed Peng et al. (2017) to extract co-occurring entity tuples and their surface patterns; that is, we selected tuples which occurred in a minimal span within at most M ≤ 3 consecutive sentences. Entity tuples without a known KB relation were subsampled, since the number of such tuples is too large. We randomly partitioned all entity tuples into train, development (dev), and test sets.

Baselines

Graph State LSTM (Song et al., 2018): The state-of-the-art cross-sentence n-ary relation extraction method of Song et al. (2018) represents each surface pattern by the concatenation of entity vectors from the last layer of a Graph State LSTM, a variant of a graph neural network. The concatenated vector is then fed into a classifier to predict the relation label. Since this method directly predicts a relation label for each surface pattern, it is more robust to the sparsity of surface patterns among a specific higher-arity entity tuple. However, due to its purely supervised training objective, its performance may degrade if the number of available training labels is small.

Universal schemas: We also compared our method with semi-supervised methods based on universal schemas (Toutanova et al., 2015; Verga et al., 2017). In our experiments, we used the same encoder as Song et al. (2018) to encode each surface pattern. 9 We tested two types of scoring functions, Model F and Model E, as in Toutanova et al. (2015). 10,11
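The window-based candidate extraction used to build both datasets (entity tuples whose members co-occur within at most M = 3 consecutive sentences) can be sketched as follows; the minimal-span filtering of Peng et al. (2017) is omitted here for brevity:

```python
from itertools import combinations

def candidate_tuples(sentence_entities, n=3, max_window=3):
    """Collect n-entity tuples whose members all occur within at most
    `max_window` consecutive sentences. `sentence_entities[i]` is the set
    of entities mentioned in sentence i."""
    candidates = set()
    for start in range(len(sentence_entities)):
        window = set()
        for end in range(start, min(start + max_window, len(sentence_entities))):
            window |= sentence_entities[end]
            # All n-entity combinations visible in the current window.
            candidates.update(combinations(sorted(window), n))
    return candidates

sents = [{"Andre Johnson", "Texans"}, {"2009 season"}]
tuples3 = candidate_tuples(sents)
# The three entities co-occur within a two-sentence window.
```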

Evaluation
We compared the methods in the held-out evaluation of Mintz et al. (2009) and report (weighted) mean average precision (MAP). Unless otherwise noted, reported values are averages over six runs with randomly initialized network parameters. All reported p-values are calculated with the Wilcoxon rank sum test (Wilcoxon, 1945) with multiple-test adjustment using Holm's method (Holm, 1979).

9 Using a linear projection instead of simple concatenation did not improve performance in our preliminary experiments.
10 It is not trivial to apply the Model E scoring function with the method of Verga et al. (2017), since their aggregation method calculates a representation for each row, i.e., entity tuple.
11 The DistMult scoring function in (Toutanova et al., 2015) showed poor performance compared to the other two scoring functions in our preliminary experiments.

Method                                            Wiki-90k             WF-20k
                                                  average  weighted    average  weighted
Proposed                                          0.584    0.634       0.821    0.842
(Song et al., 2018)                               0.471    0.536       0.639    0.680
(Toutanova et al., 2015) w/ Graph State LSTM      —        —           —        —
(Verga et al., 2017) w/ Graph State LSTM          —        —           —        —

Table 2 illustrates the performance of various settings of our proposed method. U, B, and N stand for using the loss functions L_1, L_2, and αL_n, respectively. U+B performs significantly better (p < 0.005) than U and B alone, which shows the effectiveness of combining the scores of both unary and binary facts. On the other hand, there was no significant difference between U+B+N and N (p > 0.9). Note that we used all positive labels in this experiment, that is, a sufficient amount of positive labels was available for calculating the loss L_n.
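The (weighted) mean average precision metric reported above can be sketched as follows; the per-relation weighting by number of positive tuples is an assumption about the weighted variant:

```python
def average_precision(ranked_labels):
    """AP for one KB relation: `ranked_labels` are the gold 0/1 labels of
    candidate entity tuples sorted by predicted score (descending)."""
    hits, precisions = 0, []
    for rank, y in enumerate(ranked_labels, start=1):
        if y:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0

def mean_average_precision(per_relation, weighted=False):
    """MAP over relations; the weighted variant is assumed to weight each
    relation by its number of positive tuples."""
    aps = {r: average_precision(ls) for r, ls in per_relation.items()}
    if not weighted:
        return sum(aps.values()) / len(aps)
    pos = {r: sum(ls) for r, ls in per_relation.items()}
    return sum(aps[r] * pos[r] for r in aps) / sum(pos.values())

ap = average_precision([1, 0, 1, 0])  # (1/1 + 2/3) / 2 = 5/6
```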

Results
Data efficiency: We also investigated the influence of the training data size (the number of positive labels) on our proposed method and the baseline methods. 13 Here, α = ∞ stands for optimizing L_n alone instead of L_1 + L_2 + αL_n. As shown in Figure 3, α = 1 achieved higher performance than α = ∞, showing that introducing the lower-arity semi-supervised loss (L_1 + L_2) improves performance on datasets with few positive labels. On the other hand, the lower performance of α = 0 compared to α = 0.1 and α = 1 suggests that the information about higher-arity facts introduced by L_n is beneficial for n-ary relation extraction.

Conclusion and Future Work
We proposed a new method for cross-sentence n-ary relation extraction that decomposes sparse n-ary facts into dense unary and binary facts. Experiments on two datasets with multiple ternary relations show that our proposed method yields statistically significant improvements over previous work, which suggests the effectiveness of using unary and binary interactions among entities in surface patterns.

However, as Fatemi et al. (2019) suggest, there exist cases in which reconstructing n-ary facts from decomposed binary facts induces false positives. Addressing this issue is an important direction for future research.

12 For the proposed method, we set α = 10.
13 In this experiment, we conducted four experiments per setting and set K = 10.