Simultaneously Self-Attending to All Mentions for Full-Abstract Biological Relation Extraction

Most work in relation extraction forms a prediction by looking at a short span of text within a single sentence containing a single entity pair mention. This approach often does not consider interactions across mentions, requires redundant computation for each mention pair, and ignores relationships expressed across sentence boundaries. These problems are exacerbated by the document- (rather than sentence-) level annotation common in biological text. In response, we propose a model which simultaneously predicts relationships between all mention pairs in a document. We form pairwise predictions over entire paper abstracts using an efficient self-attention encoder. All-pairs mention scores allow us to perform multi-instance learning by aggregating over mentions to form entity pair representations. We further adapt to settings without mention-level annotation by jointly training to predict named entities and adding a corpus of weakly labeled data. In experiments on two Biocreative benchmark datasets, we achieve state of the art performance on the Biocreative V Chemical Disease Relation dataset for models without external KB resources. We also introduce a new dataset that is an order of magnitude larger than existing human-annotated biological information extraction datasets and more accurate than distantly supervised alternatives.


Introduction
With few exceptions (Swampillai and Stevenson, 2011; Peng et al., 2017), nearly all work in relation extraction focuses on classifying a short span of text within a single sentence containing a single entity pair mention. However, relationships between entities are often expressed across sentence boundaries or otherwise require a larger context to disambiguate. For example, 30% of relations in the Biocreative V CDR dataset (§3.1) are expressed across sentence boundaries, such as in the following excerpt expressing a relationship between the chemical azathioprine and the disease fibrosis: Treatment of psoriasis with azathioprine. Azathioprine treatment benefited 19 (66%) out of 29 patients suffering from severe psoriasis. Haematological complications were not troublesome and results of biochemical liver function tests remained normal. Minimal cholestasis was seen in two cases and portal fibrosis of a reversible degree in eight. Liver biopsies should be undertaken at regular intervals if azathioprine therapy is continued so that structural liver damage may be detected at an early and reversible stage.
Though the entities' mentions never occur in the same sentence, the above example expresses that the chemical entity azathioprine can cause the side effect fibrosis. Relation extraction models which consider only within-sentence relation pairs cannot extract this fact without knowledge of the coreference relationship between eight and azathioprine treatment, a relationship which, absent features from a complicated pre-processing pipeline, cannot be learned by a model that considers entity pairs in isolation. Making separate predictions for each mention pair also obstructs multi-instance learning (Riedel et al., 2010; Surdeanu et al., 2012), a technique which aggregates entity representations from mentions in order to improve robustness to noise in the data. Like the majority of relation extraction data, most annotation for biological relations is distantly supervised, and so we could benefit from a model which is amenable to multi-instance learning.
In addition to this loss of cross-sentence and cross-mention reasoning capability, traditional mention pair relation extraction models typically introduce computational inefficiencies by independently extracting features for and scoring every pair of mentions, even when those mentions occur in the same sentence and thus could share representations. In the CDR training set, this requires separately encoding and classifying each of the 5,318 candidate mention pairs independently, versus encoding each of the 500 abstracts once. Though abstracts are longer than e.g. the text between mentions, many sentences contain multiple mentions, leading to redundant computation.
However, encoding long sequences in a way which effectively incorporates long-distance context can be prohibitively expensive. Long Short Term Memory (LSTM) networks (Hochreiter and Schmidhuber, 1997) are among the most popular token encoders due to their capacity to learn high-quality representations of text, but their ability to leverage the fastest computing hardware is thwarted by their computational dependence on the length of the sequence: each token's representation requires as input the representation of the previous token, limiting the extent to which computation can be parallelized. Convolutional neural networks (CNNs), in contrast, can be executed entirely in parallel across the sequence, but the amount of context incorporated into a single token's representation is limited by the depth of the network, and very deep networks can be difficult to learn (Hochreiter, 1998). These problems are exacerbated by longer sequences, limiting the extent to which previous work has explored full-abstract relation extraction.
To facilitate efficient full-abstract relation extraction from biological text, we propose Bi-affine Relation Attention Networks (BRANs), a combination of network architecture, multi-instance and multi-task learning designed to extract relations between entities in biological text without requiring explicit mention-level annotation. We synthesize convolutions and self-attention, a modification of the Transformer encoder introduced by Vaswani et al. (2017), over sub-word tokens to efficiently incorporate into token representations rich context between distant mention pairs across the entire abstract. We score all pairs of mentions in parallel using a bi-affine operator, and aggregate over mention pairs using a soft approximation of the max function in order to perform multi-instance learning. We jointly train the model to predict relations and entities, further improving robustness to noise and lack of gold annotation at the mention level.
In extensive experiments on two benchmark biological relation extraction datasets, we achieve state of the art performance for a model using no external knowledge base resources in experiments on the Biocreative V CDR dataset, and outperform comparable baselines on the Biocreative VI ChemProt dataset. We also introduce a new dataset which is an order of magnitude larger than existing gold-annotated biological relation extraction datasets while covering a wider range of entity and relation types and with higher accuracy than distantly supervised datasets of the same size. We provide a strong baseline on this new dataset, and encourage its use as a benchmark for future biological relation extraction systems.1

Model
We designed our model to efficiently encode long contexts spanning multiple sentences while forming pairwise predictions without the need for mention pair-specific features. To do this, our model first encodes input token embeddings using self-attention. These embeddings are used to predict both entities and relations. The relation extraction module converts each token to a head and tail representation. These representations are used to form mention pair predictions using a bi-affine operation with respect to learned relation embeddings. Finally, these mention pair predictions are pooled to form entity pair predictions, indicating whether each relation type holds for each entity pair.

Inputs
Our model takes in a sequence of N token embeddings in R^d. Because the Transformer has no innate notion of token position, the model relies on positional embeddings which are added to the input token embeddings.2 We learn the position embedding matrix P ∈ R^{m×d}, which contains a separate d-dimensional embedding for each position, limited to m possible positions. Our final input representation for token x_i is

b_i^(0) = s_i + p_i

where s_i is the token embedding for x_i and p_i is the positional embedding for the ith position. If i exceeds m, we use a randomly initialized vector in place of p_i.
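To make the input layer concrete, the following is a minimal numpy sketch. All dimensions, the random initialization, and the variable names are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

# Illustrative sketch of the input layer: learned token embeddings plus
# learned position embeddings. All shapes here are toy values.
d, m, vocab_size = 8, 6, 100           # embedding dim, max positions, vocab size
rng = np.random.default_rng(0)

S = rng.normal(size=(vocab_size, d))   # token embedding table
P = rng.normal(size=(m, d))            # position embedding matrix, one row per position
unk_pos = rng.normal(size=(d,))        # fallback vector for positions i >= m

def embed(token_ids):
    """Input representation for token x_i: its token embedding s_i plus the
    positional embedding p_i (or the fallback vector when i exceeds m)."""
    rows = []
    for i, t in enumerate(token_ids):
        p_i = P[i] if i < m else unk_pos
        rows.append(S[t] + p_i)
    return np.stack(rows)

X = embed([5, 17, 3, 99, 0, 1, 2, 7])  # a sequence longer than m
print(X.shape)                          # (8, 8)
```

Positions beyond m all share a single randomly initialized vector, as described above.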
We tokenize the text using byte pair encoding (BPE) (Gage, 1994; Sennrich et al., 2015). The BPE algorithm constructs a vocabulary of sub-word pieces, beginning with single characters. Then, the algorithm iteratively merges the most frequent co-occurring tokens into a new token, which is added to the vocabulary. This procedure continues until a pre-defined vocabulary size is met.
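The merge loop described above can be sketched as follows. This is a toy, hypothetical helper rather than the tokenizer used in our experiments; real implementations cache pair counts for efficiency:

```python
from collections import Counter

def byte_pair_encode(corpus_words, target_vocab):
    """Toy BPE: start from characters, then repeatedly merge the most
    frequent adjacent symbol pair until the vocabulary budget is reached."""
    # represent each word as a tuple of symbols plus an end-of-word marker
    words = Counter(tuple(w) + ("</w>",) for w in corpus_words)
    vocab = {c for w in words for c in w}
    merges = []
    while len(vocab) < target_vocab:
        pairs = Counter()
        for w, freq in words.items():
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]   # most frequent co-occurring pair
        merges.append((a, b))
        vocab.add(a + b)
        new_words = Counter()
        for w, freq in words.items():
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return vocab, merges

vocab, merges = byte_pair_encode(["low", "low", "lower", "newest", "newest", "newest"], 15)
print(merges[0])   # ('w', 'e')
```

Rare words are thus decomposed into frequent sub-word pieces rather than mapped to a single unknown token.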
BPE is well suited for biological data for the following reasons. First, biological entities often have unique mentions made up of meaningful subcomponents, such as 1,2-dimethylhydrazine. Additionally, tokenization of chemical entities is challenging, lacking a universally agreed upon algorithm (Krallinger et al., 2015). As we demonstrate in §3.3.2, the subword representations produced by BPE allow the model to formulate better predictions, likely due to better modeling of rare and unknown words.

Figure 1: Inputs are contextually encoded using the Transformer (Vaswani et al., 2017), made up of B layers of multi-head attention and convolution subcomponents. Each transformed token is then passed through a head and tail MLP to produce two position-specific representations. A bi-affine operation is performed between each head and tail representation with respect to each relation's embedding matrix, producing a pair-wise relation affinity tensor. Finally, the scores for cells corresponding to the same entity pair are pooled with a separate LogSumExp operation for each relation to get a final score. The colored tokens illustrate calculating the score for a given pair of entities; the model is only given entity information when pooling over mentions.

Transformer
We base our token encoder on the Transformer self-attention model (Vaswani et al., 2017). The Transformer is made up of B blocks. Each Transformer block, which we denote Transformer_k, has its own set of parameters and is made up of two subcomponents: multi-head attention and a series of convolutions.3 The output for token i of block k is

b_i^(k) = Transformer_k(b_i^(k−1))

Multi-head Attention
Multi-head attention applies self-attention multiple times over the same inputs using separately normalized parameters (attention heads) and combines the results, as an alternative to applying one pass of attention with more parameters. The intuition behind this modeling decision is that dividing the attention into multiple heads makes it easier for the model to learn to attend to different types of relevant information with each head. The self-attention updates input b_i^(k−1) by performing a weighted sum over all tokens in the sequence, weighted by their importance for modeling token i.
Each input is projected to a key k, value v, and query q, using separate affine transformations with ReLU activations (Glorot et al., 2011). Here, k, v, and q are each in R^{d/H}, where H is the number of heads. The attention weights a_ijh for head h between tokens i and j are computed using scaled dot-product attention:

a_ijh = σ_j( q_ih^T k_jh / √(d/H) )
o_ih = Σ_j a_ijh ⊙ v_jh

with ⊙ denoting element-wise multiplication and σ_j indicating a softmax along the jth dimension. The scaled attention is meant to aid optimization by flattening the softmax and better distributing the gradients (Vaswani et al., 2017). The outputs of the individual attention heads are concatenated, denoted [·; ·], into o_i = [o_i1; ...; o_iH]. All layers in the network use residual connections between the output of the multi-headed attention and its input. Layer normalization (Ba et al., 2016), denoted LN(·), is then applied to the output.
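The mechanism can be sketched in a few lines of numpy. This is a simplified illustration: weight shapes, the omission of biases and layer normalization, and the exact scaling constant are assumptions for the sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(B_prev, Wq, Wk, Wv):
    """One multi-head self-attention pass over an (N, d) sequence.
    Wq, Wk, Wv: (H, d, d/H) per-head projection weights (biases omitted)."""
    H, d, dh = Wq.shape
    heads = []
    for h in range(H):
        # per-head projections with ReLU activations, as described in the text
        q = np.maximum(B_prev @ Wq[h], 0)            # (N, d/H)
        k = np.maximum(B_prev @ Wk[h], 0)
        v = np.maximum(B_prev @ Wv[h], 0)
        # scaled dot-product attention over all token pairs
        a = softmax(q @ k.T / np.sqrt(dh), axis=-1)  # (N, N), rows sum to 1
        heads.append(a @ v)                          # weighted sum of values
    o = np.concatenate(heads, axis=1)                # concatenate heads -> (N, d)
    return B_prev + o                                # residual connection (LN omitted)

rng = np.random.default_rng(1)
B0 = rng.normal(size=(5, 8))        # N=5 tokens, d=8
Wq = rng.normal(size=(2, 8, 4))     # H=2 heads, d/H=4
Wk = rng.normal(size=(2, 8, 4))
Wv = rng.normal(size=(2, 8, 4))
out = multi_head_attention(B0, Wq, Wk, Wv)
print(out.shape)                    # (5, 8)
```

Because every token attends to every other token in one step, context from anywhere in the abstract can reach a mention's representation without the sequential dependence of an LSTM.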

Convolutions
The second part of our Transformer block is a stack of convolutional layers. The sub-network used in Vaswani et al. (2017) uses two width-1 convolutions. We add a third middle layer with kernel width 5, which we found to perform better. Many relations are expressed concisely by the immediate local context, e.g. Michele's husband Barack, or labetalol-induced hypotension. Adding this explicit n-gram modeling is meant to ease the burden on the model to learn to attend to local features. We use C_w(·) to denote a convolutional operator with kernel width w. Then the convolutional portion of the Transformer block is given by:

t_i^(k) = C_1( C_5( C_1( m_i^(k) ) ) )
b_i^(k) = LN( m_i^(k) + t_i^(k) )

where m_i^(k) denotes the output of the block's multi-head attention sub-layer.
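A minimal sketch of this width-1, width-5, width-1 stack follows. The presence of ReLUs between the layers and all parameter shapes are assumptions of the sketch, not taken from the paper:

```python
import numpy as np

def conv1d(X, W):
    """Width-w 1-D convolution over an (N, d_in) sequence with zero padding,
    so the output keeps length N. W has shape (w, d_in, d_out)."""
    w, d_in, d_out = W.shape
    pad = w // 2
    Xp = np.concatenate([np.zeros((pad, d_in)), X, np.zeros((pad, d_in))])
    return np.stack([sum(Xp[i + j] @ W[j] for j in range(w))
                     for i in range(X.shape[0])])

def conv_sublayer(M, W1, W5, W3):
    """Sketch of the convolutional sub-block: width-1, width-5, width-1
    convolutions, with assumed ReLU nonlinearities between them."""
    t = np.maximum(conv1d(M, W1), 0)  # width-1: position-wise projection
    t = np.maximum(conv1d(t, W5), 0)  # width-5: explicit local n-gram modeling
    return conv1d(t, W3)              # width-1: project back to model dimension

rng = np.random.default_rng(3)
M = rng.normal(size=(4, 6))
W1 = rng.normal(size=(1, 6, 12))
W5 = rng.normal(size=(5, 12, 12))
W3 = rng.normal(size=(1, 12, 6))
print(conv_sublayer(M, W1, W5, W3).shape)   # (4, 6)
```

The width-5 middle layer is the only change from the position-wise feed-forward network of Vaswani et al. (2017); a width-1 convolution is just a per-token linear map.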

Bi-affine Pairwise Scores
We project each contextually encoded token b_i^(B) through two separate MLPs to generate two new versions of each token, corresponding to whether it will serve as the first (head) or second (tail) argument of a relation:

e_i^head = W_head^(2) ReLU( W_head^(1) b_i^(B) )
e_i^tail = W_tail^(2) ReLU( W_tail^(1) b_i^(B) )

We use a bi-affine operator to calculate an N×L×N tensor A of pairwise affinity scores, scoring each (head, relation, tail) triple:

A_irj = (e_i^head)^T L_r e_j^tail

where L is a d×L×d tensor, a learned embedding matrix for each of the L relations. In subsequent sections we will assume we have transposed the dimensions of A to N×N×L for ease of indexing.
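The full affinity tensor can be computed in a single einsum. This is a sketch with assumed dimension names; affine bias terms are omitted:

```python
import numpy as np

def biaffine_scores(E_head, E_tail, L):
    """Pairwise affinity tensor A with A[i, r, j] = e_i_head . L_r . e_j_tail.
    E_head, E_tail: (N, d) head/tail representations of every token;
    L: (d, R, d), one learned d x d bilinear matrix per relation."""
    return np.einsum("ia,arb,jb->irj", E_head, L, E_tail)

rng = np.random.default_rng(2)
N, d, R = 3, 4, 2
Eh = rng.normal(size=(N, d))
Et = rng.normal(size=(N, d))
L = rng.normal(size=(d, R, d))
A = biaffine_scores(Eh, Et, L)   # scores all (head, relation, tail) triples at once
print(A.shape)                    # (3, 2, 3)
```

Scoring all N² token pairs in one tensor contraction is what lets the model avoid re-encoding the abstract once per candidate mention pair.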

Entity Level Prediction
Our data is weakly labeled in that there are labels at the entity level but not the mention level, making the problem a form of strong-distant supervision (Mintz et al., 2009). In distant supervision, edges in a knowledge graph are heuristically applied to sentences in an auxiliary unstructured text corpus, often by applying the edge label to all sentences containing the subject and object of the relation. Because this process is imprecise and introduces noise into the training data, methods like multi-instance learning were introduced (Riedel et al., 2010; Surdeanu et al., 2012). In multi-instance learning, rather than looking at each distantly labeled mention pair in isolation, the model is trained over the aggregate of these mentions and a single update is made. More recently, the weighting function over the instances has been expressed as neural network attention (Verga and McCallum, 2016; Lin et al., 2016; Yaghoobzadeh et al., 2017).
We aggregate over all representations for each mention pair in order to produce per-relation scores for each entity pair. For each entity pair (p_head, p_tail), let P_head denote the set of indices of mentions of the entity p_head, and let P_tail denote the indices of mentions of the entity p_tail. Then we use the LogSumExp function to aggregate the relation scores from A across all pairs of mentions of p_head and p_tail:

scores(p_head, p_tail) = log Σ_{i∈P_head, j∈P_tail} exp( A_ij )

The LogSumExp scoring function is a smooth approximation to the max function and has the benefits of aggregating information from multiple predictions and propagating dense gradients as opposed to the sparse gradient updates of the max (Das et al., 2017).
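The pooling step above can be sketched directly from the affinity tensor. A numerically stable form subtracts the per-relation maximum before exponentiating (index names are assumptions of the sketch):

```python
import numpy as np

def entity_pair_scores(A, head_idx, tail_idx):
    """LogSumExp-pool mention-pair scores into one score per relation.
    A: (N, R, N) affinity tensor; head_idx/tail_idx: token indices of the
    mentions of the head and tail entities."""
    sub = A[np.ix_(head_idx, list(range(A.shape[1])), tail_idx)]  # (|P_head|, R, |P_tail|)
    m = sub.max(axis=(0, 2), keepdims=True)                        # stabilize the exponentials
    return (m + np.log(np.exp(sub - m).sum(axis=(0, 2), keepdims=True))).ravel()

rng = np.random.default_rng(4)
A = rng.normal(size=(4, 3, 4))              # N=4 tokens, R=3 relations
s = entity_pair_scores(A, [0, 2], [1, 3])   # one pooled score per relation
print(s.shape)                              # (3,)
```

With a single mention on each side the pooled score reduces to the raw bi-affine score, and in general it is a smooth upper bound on the maximum over mention pairs.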

Named Entity Recognition
In addition to pairwise relation predictions, we use the Transformer output b_i^(B) to make entity type predictions. We feed b_i^(B) as input to a linear classifier which predicts the entity label for each token with per-class scores c_i:

c_i = W^(NER) b_i^(B)

We augment the entity type labels with the BIO encoding to denote entity spans. We apply tags to the byte-pair tokenization by treating each subword within a mention span as an additional token with a corresponding B- or I-label.
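The projection of BIO tags onto byte-pair subwords can be sketched as below. The exact projection rule (only the first subword of a B-tagged word keeping the B-label) is our reading of the text, stated here as an assumption:

```python
def expand_bio_tags(word_tags, subword_counts):
    """Project word-level BIO tags onto byte-pair subwords: every subword
    inside a mention receives a tag, and only the first subword of a
    B-tagged word keeps the B- prefix."""
    out = []
    for tag, n in zip(word_tags, subword_counts):
        if tag.startswith("B-"):
            out.append(tag)
            out.extend(["I-" + tag[2:]] * (n - 1))
        else:                      # "O" or an I- tag applies to every piece
            out.extend([tag] * n)
    return out

tags = expand_bio_tags(["B-Chemical", "I-Chemical", "O"], [2, 1, 3])
print(tags)  # ['B-Chemical', 'I-Chemical', 'I-Chemical', 'O', 'O', 'O']
```

Here `subword_counts` gives the number of byte-pair pieces each original word was split into.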

Training
We train both the NER and relation extraction components of our network to perform multi-class classification using maximum likelihood, where NER classes y_i or relation classes r_i are conditionally independent given deep features produced by our model, with probabilities given by the softmax function. In the case of NER, features are given by the per-token output of the Transformer:

(1/N) Σ_{i=1}^N log p( y_i | b_i^(B) )

In the case of relation extraction, the features for each entity pair are given by the LogSumExp over pairwise scores described in §2.4. For E entity pairs, the likelihood of the relations r_i is given by:

(1/E) Σ_{i=1}^E log p( r_i | scores(p_i^head, p_i^tail) )

We train the NER and relation objectives jointly, sharing all embeddings and Transformer parameters. To trade off the two objectives, we penalize the named entity updates with a hyperparameter λ.
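A sketch of the joint objective follows. Whether the terms are summed or averaged, and exactly where λ enters, are assumptions of the sketch:

```python
import numpy as np

def softmax_nll(logits, target):
    """Negative log-likelihood of class `target` under a softmax over `logits`."""
    z = logits - logits.max()          # shift for numerical stability
    return -(z[target] - np.log(np.exp(z).sum()))

def joint_loss(ner_logits, ner_labels, rel_scores, rel_labels, lam):
    """Joint objective: relation NLL plus a lambda-weighted NER NLL.
    ner_logits: (N, n_tags) per-token scores c_i; rel_scores: (E, n_relations)
    LogSumExp-pooled scores per entity pair."""
    ner = sum(softmax_nll(l, y) for l, y in zip(ner_logits, ner_labels))
    rel = sum(softmax_nll(s, r) for s, r in zip(rel_scores, rel_labels))
    return rel + lam * ner

# toy example: uniform logits give loss log(n_classes) per instance
total = joint_loss(np.zeros((2, 3)), [0, 1], np.zeros((1, 5)), [4], lam=0.5)
```

Because both heads share the encoder, gradients from the NER objective also shape the representations used for relation scoring.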

Results
We evaluate our model on three datasets: The Biocreative V Chemical Disease Relation benchmark (CDR), which models relations between chemicals and diseases ( §3.1); the Biocreative VI ChemProt benchmark (CPR), which models relations between chemicals and proteins ( §3.2); and a new, large and accurate dataset we describe in §3.3 based on the human curation in the Chemical Toxicology Database (CTD), which models relationships between chemicals, proteins and genes.
The CDR dataset is annotated at the level of paper abstracts, requiring consideration of long-range, cross-sentence relationships; evaluation on this dataset thus demonstrates that our model is capable of such reasoning. We also evaluate our model's performance in the more traditional setting which does not require cross-sentence modeling by performing experiments on the CPR dataset, for which all annotations are between two entity mentions in a single sentence. Finally, we present a new dataset constructed using strong-distant supervision (§2.4), with annotations at the document level. This dataset is significantly larger than the others, contains more relation types, and requires reasoning across sentences.

Chemical Disease Relations Dataset
The Biocreative V chemical disease relation extraction (CDR) dataset4 (Li et al., 2016a) was derived from the Comparative Toxicogenomics Database (CTD), which curates interactions between genes, chemicals, and diseases (Davis et al., 2008). CTD annotations are only at the document level and do not contain mention annotations. The CDR dataset is a subset of these original annotations, supplemented with human-annotated, entity-linked mention annotations. The relation annotations in this dataset are also at the document level only.

Data Preprocessing
The CDR dataset is concerned with extracting only chemically-induced disease relationships (drug-related side effects and adverse reactions) concerning the most specific entity in the document. For example, tobacco causes cancer could be marked as false if the document contained the more specific lung cancer. This can cause true relations to be labeled as false, harming evaluation performance. To address this we follow Gu et al. (2016, 2017) and filter hypernyms according to the hierarchy in the MeSH controlled vocabulary.5 All entity pairs within the same abstract that do not have an annotated relation are assigned the NULL label. In addition to the gold CDR data, prior work adds 15,448 PubMed abstracts annotated in the CTD dataset. We consider this same set of abstracts as additional training data (which we subsequently denote +Data). Since this data does not contain entity annotations, we take the annotations from Pubtator (Wei et al., 2013), a state of the art biological named entity tagger and entity linker. See §A.1 for additional data processing details. In our experiments we evaluate only relation extraction performance, and all models (including baselines) use gold entity annotations for predictions.
The byte pair vocabulary is generated over the training dataset: we use a budget of 2,500 tokens when training on the gold CDR data, and a larger budget of 10,000 tokens when including the extra data described above. Additional implementation details are included in Appendix A.

Baselines
We compare against the previous best reported results on this dataset not using knowledge base features.6 Each of the baselines is an ensemble method for within- and cross-sentence relations that makes use of additional linguistic features (syntactic parse and part-of-speech). Gu et al. (2017) encode mention pairs using a CNN while Zhou et al. (2016a) use an LSTM. Both make cross-sentence predictions with featurized classifiers.

Results
In Table 2 we show results outperforming the baselines despite using no linguistic features. We show performance averaged over 20 runs with 20 random seeds, as well as an ensemble of their averaged predictions. We see a further boost in performance from adding weakly labeled data. Table 3 shows the effects of ablating pieces of our model: 'CNN only' removes the multi-head attention component from the Transformer block, 'no width-5' replaces the width-5 convolution of the feed-forward component of the Transformer with a width-1 convolution, and 'no NER' removes the named entity recognition multi-task objective (§2.5).

Chemical Protein Relations Dataset
To assess our model's performance in settings where cross-sentence relationships are not explicitly evaluated, we perform experiments on the Biocreative VI ChemProt dataset (CPR) (Krallinger et al., 2017). This dataset is concerned with classifying chemical-protein interactions into six relation types, with nearly all annotated relationships occurring within the same sentence.

Baselines
We compare our models against those competing in the official Biocreative VI competition (Liu et al., 2017). We compare to the top performing team whose model is directly comparable with ours, i.e. one that used a single (non-ensemble) model trained only on the training data (many teams used the development set as additional training data). The baseline models are standard state of the art relation extraction models: CNNs and Gated RNNs with attention. Each of these baselines uses mention-specific features encoding the relative position of each token to the two target entities being classified, whereas our model aggregates over all mention pairs in each sentence. It is also worth noting that these models use a large vocabulary of pre-trained word embeddings, giving them the advantage of far more model parameters, as well as additional information from unsupervised pre-training.

Results
In Table 4 we see that even though our model forms all predictions simultaneously between all pairs of entities within the sentence, we are able to outperform state of the art models classifying each mention pair independently. The scores shown are averaged across 10 runs with 10 random seeds. Interestingly, our model appears to have higher recall and lower precision, while the baseline models are both precision-biased, with lower recall. This suggests that combining these styles of model could lead to further gains on this task.

Data
Existing biological relation extraction datasets, including both CDR (§3.1) and CPR (§3.2), are relatively small, typically consisting of hundreds or a few thousand annotated examples. Distant supervision datasets apply document-independent, entity-level annotations to all sentences, leading to a large proportion of incorrect labels. Evaluations on this data involve either very small (a few hundred) gold annotated examples or cross validation to predict the noisy, distantly applied labels (Mallory et al., 2015; Peng et al., 2017). We address these issues by constructing a new dataset using strong-distant supervision containing document-level annotations. The Comparative Toxicogenomics Database (CTD) curates interactions between genes, chemicals, and diseases. Each relation in the CTD is associated with a disambiguated entity pair and a PubMed article where the relation was observed.
To construct this dataset, we collect the abstracts for each of the PubMed articles with at least one curated relation in the CTD database. As in §3.1, we use PubTator to automatically tag and disambiguate the entities in each of these abstracts. If both entities in the relation are found in the abstract, we take the (abstract, relation) pair as a positive example. (The evidence for the curated relation could occur anywhere in the full-text article, not just the abstract.) Abstracts with no recovered relations are discarded. All other entity pairs with valid types and without an annotated relation that occur in the remaining abstracts are considered negative examples and assigned the NULL label. We additionally remove abstracts containing greater than 500 tokens.7 This limit removed about 10% of the total data, including numerous extremely long abstracts; the average token length of the remaining data was 230 tokens. With this procedure, we are able to collect 166,474 positive examples over 13 relation types, with more detailed statistics of the dataset listed in Table 5. We consider relations between chemical-disease, chemical-gene, and gene-disease entity pairs downloaded from CTD.8 We remove inferred relations (those without an associated PubMed ID) and consider only human-curated relationships. Some chemical-gene entity pairs were associated with multiple relation types in the same document; we consider each of these relation types as a separate positive example.
The chemical-gene relation data contains over 100 types organized in a shallow hierarchy. Many of these types are extremely infrequent, so we map all relations to the highest parent in the hierarchy, resulting in 13 relation types. Most of these chemical-gene relations have an increase and a decrease version, such as increase_expression and decrease_expression. In some cases, there is also an affects relation (affects_expression) which is used when the directionality is unknown. If the affects version is more common, we map decrease and increase to affects. If affects is less common, we drop the affects examples and keep the increase and decrease examples as distinct relations, resulting in the final set of 10 chemical-gene relation types.
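The folding rule for the affects variants can be sketched as follows. Treating "more common" as "more frequent than either directional variant" is our reading of the text, stated here as an assumption:

```python
from collections import Counter

def fold_affects(examples):
    """Collapse increase_/decrease_/affects_ variants of each chemical-gene
    relation. `examples` is a list of (doc_id, relation) pairs."""
    counts = Counter(rel for _, rel in examples)
    out = []
    for doc, rel in examples:
        kind, _, base = rel.partition("_")
        if kind in ("increase", "decrease"):
            aff = "affects_" + base
            if counts[aff] > counts["increase_" + base] and counts[aff] > counts["decrease_" + base]:
                rel = aff                  # affects dominates: fold direction into it
            out.append((doc, rel))
        elif kind == "affects":
            if counts[rel] > counts["increase_" + base] and counts[rel] > counts["decrease_" + base]:
                out.append((doc, rel))     # affects is the common variant: keep it
            # otherwise drop the affects examples entirely
        else:
            out.append((doc, rel))
    return out

ex = ([("d1", "affects_expression")] * 3
      + [("d1", "increase_expression"), ("d1", "decrease_expression")]
      + [("d2", "affects_binding")] + [("d2", "increase_binding")] * 2)
folded = Counter(rel for _, rel in fold_affects(ex))
print(folded["affects_expression"], folded["increase_binding"])   # 5 2
```

In the toy example, affects_expression dominates its directional variants and absorbs them, while the rarer affects_binding examples are dropped.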

Results
In Table 7 we list precision, recall and F1 achieved by our model on the CTD dataset, both overall and by relation type. Our model predicts each of the relation types effectively, with higher performance on relations with more support.
In Table 8 we see that our sub-word BPE model outperforms the model using the Genia tokenizer (Kulick et al., 2012) even though our vocabulary size is one-fifth as large. We see a 1.7 F1 point boost in predicting Pubtator NER labels for BPE. This may be explained by the lower out-of-vocabulary (OOV) rate for named entities: the word-tokenized training data has a 3.01 percent OOV rate for tokens within an entity, while the byte pair-encoded data has an OOV rate of 2.48 percent. Note that in both the word-tokenized and byte pair-tokenized data, we replace tokens that occur fewer than five times with a learned UNK token.

Figure 2 depicts the model's performance on relation extraction as a function of distance between entities. For example, the blue bar depicts performance when removing all entity pair candidates (positive and negative) whose closest mentions are more than 11 tokens apart. We consider removing entity pair candidates with distances of 11, 25, 50, 100 and 500 (the maximum document length); the average sentence length is 22 tokens. We see that the model is not simply relying on short-range relationships, but is leveraging information about distant entity pairs, with accuracy increasing as the maximum distance considered increases. Note that all results are taken from the same model trained on the full unfiltered training set.

7 We include scripts to generate the unfiltered set of data as well, to encourage future research.
8 http://ctdbase.org/downloads/

Related work
Relation extraction is a heavily studied area in the NLP community. Most work focuses on news and web data (Doddington et al., 2004; Riedel et al., 2010; Hendrickx et al., 2009). Recent neural network approaches to relation extraction have focused on CNNs (dos Santos et al., 2015; Zeng et al., 2015) or LSTMs (Miwa and Bansal, 2016; Verga et al., 2016a; Zhou et al., 2016b) and on replacing stage-wise information extraction pipelines with a single end-to-end model (Miwa and Bansal, 2016; Ammar et al., 2017; Li et al., 2017). These models all consider mention pairs separately. There is also a considerable body of work specifically geared towards supervised biological relation extraction, including protein-protein (Pyysalo et al., 2007; Poon et al., 2014; Mallory et al., 2015), drug-drug (Segura-Bedmar et al., 2013), and chemical-disease (Gurulingappa et al., 2012; Li et al., 2016a) interactions, as well as more complex events (Kim et al., 2008; Riedel et al., 2011). Our work focuses on modeling relations between chemicals, diseases, genes and proteins, where available annotation is often at the document or abstract level, rather than the sentence level.

Figure 2: Performance on the CTD dataset when restricting candidate entity pairs by distance. The x-axis shows the coarse-grained relation type. The y-axis shows F1 score. Different colors denote maximum distance cutoffs.
Some previous work exists on cross-sentence relation extraction. Swampillai and Stevenson (2011), among others, consider featurized classifiers over cross-sentence syntactic parses. Most similar to our work is that of Peng et al. (2017), which uses a variant of an LSTM to encode document-level syntactic parse trees. Our work differs in three key ways. First, we operate over raw tokens, obviating the need for part-of-speech or syntactic parse features which can lead to cascading errors. Second, we use a feed-forward neural architecture which encodes long sequences far more efficiently than the graph LSTM network of Peng et al. (2017). Finally, our model considers all mention pairs simultaneously rather than a single mention pair at a time.
We employ a bi-affine function to form pairwise predictions between mentions. Such models have also been used for knowledge graph link prediction (Nickel et al., 2011), with variations such as restricting the bilinear relation matrix to be diagonal (Yang et al., 2015) or diagonal and complex (Trouillon et al., 2016). Our model is similar to recent approaches to graph-based dependency parsing, where bilinear parameters are used to score head-dependent compatibility (Kiperwasser and Goldberg, 2016; Dozat and Manning, 2017).

Conclusion
We present a bi-affine relation attention network that simultaneously scores all mention pairs within a document. Our model performs well on three datasets, including two standard benchmark biological relation extraction datasets and a new, large and high-quality dataset introduced in this work. Our model outperforms the previous state of the art on the Biocreative V CDR dataset despite using no additional linguistic resources or mention pair-specific features.
Our current model predicts only into a fixed schema of relations given by the data. However, this could be ameliorated by integrating our model into open relation extraction architectures such as Universal Schema (Riedel et al., 2013;Verga et al., 2016b). Our model also lends itself to other pairwise scoring tasks such as hypernym prediction, co-reference resolution, and entity resolution. We will investigate these directions in future work.

A Implementation Details
The model is implemented in TensorFlow (Abadi et al., 2015) and trained on a single TitanX GPU. The number of Transformer block repeats is B = 2. We optimize the model using Adam (Kingma and Ba, 2015), with the hyperparameters ε, β1, and β2 chosen on the development set. The learning rate is set to 0.0005 and the batch size to 32. In all of our experiments we set the number of attention heads to H = 4.
We clip the gradients to norm 10 and apply noise to the gradients (Neelakantan et al., 2015). We tune the decision threshold for each relation type separately and perform early stopping on the development set. We apply dropout (Srivastava et al., 2014) to the input layer, randomly replacing words with a special UNK token with keep probability 0.85. We additionally apply dropout to the input representation (word embedding + position embedding), interior layers, and final state. At each step, we randomly sample a positive or negative (NULL class) minibatch with probability 0.5.

A.1 Chemical Disease Relations Dataset
Token embeddings are pre-trained using skip-gram (Mikolov et al., 2013) over a random subset of 10% of all PubMed abstracts, with window size 10 and 20 negative samples. We merge the train and development sets and randomly take 850 abstracts for training and 150 for early stopping. Our reported results are averaged over 10 runs using different splits. All baselines train on both the train and development set. Models took between 4 and 8 hours to train.
ε was set to 1e-4, β1 to 0.1, and β2 to 0.9. Gradient noise η = 0.1. Dropout was applied to the word embeddings with keep probability 0.85, internal layers with 0.95, and the final bilinear projection with 0.35 for the standard CDR dataset experiments. When adding the additional weakly labeled data: word embeddings with keep probability 0.95, internal layers with 0.95, and the final bilinear projection with 0.5.

A.2 Chemical Protein Relations Dataset
We construct our byte-pair encoding vocabulary using a budget of 7,500. The dataset contains annotations for a larger set of relation types than are used in evaluation; we train on only the relation types in the evaluation set and set the remaining types to the NULL relation. The embedding dimension is set to 200 and all embeddings are randomly initialized. ε was set to 1e-8, β1 to 0.1, and β2 to 0.9. Gradient noise η = 1.0. Dropout was applied to the word embeddings with keep probability 0.5, internal layers with 1.0, and the final bilinear projection with 0.85 for the CPR dataset experiments.

A.3 Full CTD Dataset
We tune separate decision boundaries for each relation type on the development set. For each prediction, the relation type with the maximum probability is assigned; if that probability is below the relation-specific threshold, the prediction is set to NULL. We use embedding dimension 128, with all embeddings randomly initialized. Our byte pair encoding vocabulary is constructed with a budget of 50,000. Models took 1 to 2 days to train.
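The per-relation decision rule amounts to a thresholded argmax, sketched below (function and variable names are illustrative):

```python
import numpy as np

def predict_relation(probs, thresholds, relation_names):
    """Assign the argmax relation, backing off to NULL when its probability
    falls below that relation's tuned decision threshold."""
    i = int(np.argmax(probs))
    return relation_names[i] if probs[i] >= thresholds[i] else "NULL"

names = ["increase_expression", "decrease_expression", "affects_binding"]
print(predict_relation(np.array([0.2, 0.7, 0.1]), [0.5, 0.6, 0.5], names))
# decrease_expression
print(predict_relation(np.array([0.2, 0.7, 0.1]), [0.5, 0.8, 0.5], names))
# NULL
```

Tuning one threshold per relation lets rare relation types trade precision for recall independently of frequent ones.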
ε was set to 1e-4, β1 to 0.1, and β2 to 0.9. Gradient noise η = 0.1. Dropout was applied to the word embeddings with keep probability 0.95, internal layers with 0.95, and the final bilinear projection with 0.5.