Systematic Comparison of Neural Architectures and Training Approaches for Open Information Extraction

The goal of open information extraction (OIE) is to extract facts from natural language text, and to represent them as structured triples of the form ⟨subject, predicate, object⟩. For example, given the sentence »Beethoven composed the Ode to Joy.«, we are expected to extract the triple ⟨Beethoven, composed, Ode to Joy⟩. In this work, we systematically compare different neural network architectures and training approaches, and improve the performance of the currently best models on the OIE16 benchmark (Stanovsky and Dagan, 2016) by 0.421 F1 score and 0.420 AUC-PR, respectively, in our experiments (i.e., by more than 200% in both cases). Furthermore, we show that appropriate problem and loss formulations often affect the performance more than the network architecture.


Introduction
The field of information extraction (IE) focuses on the automatic acquisition of information from text in a structured format (Jurafsky and Martin, 2009; Niklaus et al., 2018), and methods from this field are used to automatically collect desired information from large corpora of text. Traditionally, IE methods were developed for specific domains with homogeneous corpora and fixed sets of predicates and entities (Niklaus et al., 2018). Such methods are unable to generalize beyond their domains, as they are limited by their predefined collections of entities and predicates.
The area of open information extraction (OIE) was introduced as an alternative approach to IE (Banko et al., 2007), where predicates and entities are not restricted to a specific domain. More formally, the aim of an OIE algorithm is to extract all facts entailed by the input text in the format of ⟨subject, predicate, object⟩ triples. Alternative formulations allow for longer tuples; however, most of the work (including ours) focuses on binary predicates only. Given a sentence »Sam succeeded in convincing John.«, an OIE system should extract the triple ⟨Sam, succeeded in convincing, John⟩. The relation phrase »succeeded in convincing« indicates the semantic relationship between »Sam« and »John«. OIE plays a key role in several downstream natural language processing (NLP) applications, such as knowledge base construction from text (Soderland et al., 2010), question answering (Fader et al., 2014), information retrieval (Etzioni, 2011), text comprehension, and natural language understanding (Mausam, 2016).
In this work, we focus on the extraction of binary relations, and introduce several novel neural approaches to OIE. We construct our models from three blocks, namely, an embedding block, an encoding block, and a prediction block, and introduce several possible implementations of each of them. Furthermore, we exhaustively test all possible combinations of their use, and show that output encoding and loss function strongly influence the results. To that end, we introduce and test three different training schemes, and investigate their influence on the performance of a model.
All experiments have been conducted on the OIE16 benchmark (Stanovsky and Dagan, 2016). As part of the systematic architecture search, several existing neural architectures for OIE (Stanovsky et al., 2018; Zhan and Zhao, 2019; Cui et al., 2018) have been tested and compared under equal conditions.
The main contributions of this work are:
• We introduce and compare different possible training schemes for OIE as a sequence-tagging problem.
• We provide a large-scale study of existing and new OIE models, and compare them under the various introduced training schemes.
• We obtain our best results with a novel model based on transformers (Vaswani et al., 2017) and long short-term memories (LSTMs), and improve the current state of the art on the OIE16 benchmark (Stanovsky et al., 2018) by 0.421 F1 score and 0.420 AUC-PR, respectively, i.e., by more than 200% in both cases.

Related Work
TextRunner (Banko et al., 2007) stands out as the first OIE system. After TextRunner, a number of OIE systems have been developed, such as ClauseIE (Corro and Gemulla, 2013), PropS, and OpenIE4 (Angeli et al., 2015). These systems are mostly rule-based, and use the language syntax to extract triples from sentences. For example, ClauseIE makes use of linguistic knowledge of the grammar of a sentence to detect clauses, and to identify the type of each of them. OpenIE4 uses semantic role labeling to extract tuples from a sentence, and PropS relies on the proposition structure of syntax and dependency trees of the input sentence. However, rule-based systems suffer from severe error propagation when applied to examples outside of the expected language patterns (Cui et al., 2018).
Recently, several approaches based on neural network (NN) architectures have been introduced (Stanovsky et al., 2018; Cui et al., 2018; Zhan and Zhao, 2019; Jia and Xiang, 2019). As one of the first neural methods, Stanovsky and Dagan (2016) treat OIE as a sequence-tagging task, and use a bidirectional long short-term memory (BiLSTM) architecture with a softmax layer to predict a BIO tag (Ratinov and Roth, 2009), as defined below, for each word in the input sentence. In contrast to this, Cui et al. (2018) consider OIE as a problem of neural machine translation (NMT) from English into a triple, and approach it with an attention-based sequence-to-sequence model (Bahdanau et al., 2015). Jia and Xiang (2019) use the same sequence-tagging formulation as Stanovsky et al. (2018), and extend their work with systematic tests of various NN architectures, such as (Bi)LSTMs, convolutional neural networks (CNNs), and their combinations with conditional random fields (CRFs). Finally, they develop a hybrid model that combines BiLSTMs, CNNs, and CRFs to achieve maximal performance.
Zhan and Zhao (2019) introduce a span model for n-ary OIE to replace the previously adopted sequence-tagging formulation. Their BiLSTM-based model finds predicate spans first, and uses a separate BiLSTM module to find arguments (entities), given a predicate as input.

Task Formulation
As defined earlier, we aim to extract binary relations from single sentences. Each fact is represented as a triple ⟨subject, predicate, object⟩, and its elements have to be non-overlapping contiguous phrases in the sentence.
Following Stanovsky et al. (2018), we treat OIE as a sequence-tagging task, where each word is labelled with a BIO tag. To that end, each element of a triple is implicitly extracted by labeling the corresponding sequence of words with a Beginning-tag followed by an arbitrary number of Inside-tags, while words that are not part of any extracted phrase are marked with the Outside-tag. Tags are prefixed with A0 if they refer to a triple's subject, with P if they refer to the predicate, and with A1 if they refer to the object of the extracted triple. We use a total of seven different labels, denoted as L = {A0-B, A0-I, P-B, P-I, A1-B, A1-I, O}, and Figure 1 illustrates how they are used to extract triples from text.
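To make the tagging scheme concrete, the following sketch (with a helper name of our own, not from the paper) decodes a BIO-labelled sentence back into a triple:

```python
# Minimal sketch of decoding a BIO-tagged sentence into a triple.
# The tag set L and the A0/P/A1 prefixes follow the paper; the helper
# name `decode_triple` is our own.

def decode_triple(tokens, tags):
    """Collect the words labelled A0/P/A1 into (subject, predicate, object)."""
    spans = {"A0": [], "P": [], "A1": []}
    for token, tag in zip(tokens, tags):
        if tag == "O":
            continue                      # Outside-tag: not part of the triple
        prefix, _ = tag.split("-")        # e.g. "A0-B" -> ("A0", "B")
        spans[prefix].append(token)
    return tuple(" ".join(spans[k]) for k in ("A0", "P", "A1"))

tokens = ["Beethoven", "composed", "the", "Ode", "to", "Joy", "."]
tags   = ["A0-B",      "P-B",      "O",   "A1-B", "A1-I", "A1-I", "O"]
```

Note that a sentence that induces several extractions is represented by one such tag sequence per triple.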
More recently, Zhan and Zhao (2019) introduced an alternative formulation that is based on spans rather than sequence tags. A span is a subsequence of the sentence, and an extracted triple is a set of three disjoint spans, corresponding to subject, predicate, and object. They generate all possible candidate triples and use a model to score them. To restrict the exponentially growing number of possible triples, additional restrictions are put into place, such as a maximum length or syntactic requirements (cf. Zhan and Zhao, 2019).
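Restricting candidates by a maximum span length only, the candidate generation can be sketched as follows (a minimal sketch; the additional syntactic restrictions of Zhan and Zhao (2019) are omitted):

```python
def enumerate_spans(n, max_len):
    """All (start, end) spans (end exclusive) of up to max_len tokens."""
    return [(i, j) for i in range(n)
                   for j in range(i + 1, min(i + max_len, n) + 1)]

def disjoint(a, b):
    """Two spans are disjoint iff one ends before the other starts."""
    return a[1] <= b[0] or b[1] <= a[0]

def candidate_triples(n, max_len):
    """All (subject, predicate, object) combinations of pairwise
    disjoint spans; grows exponentially without length restrictions."""
    spans = enumerate_spans(n, max_len)
    return [(s, p, o) for s in spans for p in spans for o in spans
            if disjoint(s, p) and disjoint(s, o) and disjoint(p, o)]
```

A scoring model then assigns a score to each candidate, and the scores are normalized into a distribution.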
Unlike its formal counterpart, semantic parsing, the task of OIE is defined rather vaguely, and hence leaves some room for interpretation. For instance, one could generally consider both ⟨Ludwig van Beethoven, was, a world-famous composer⟩ and ⟨Ludwig van Beethoven, was, a world-famous composer of classical music⟩ valid extractions for the first example provided in Figure 1, but ultimately, this is a design choice that depends on the concrete application. Also, there is no fixed schema of relations to be extracted. This introduces another level of complexity, as memorization of encountered patterns for a given set of relations is insufficient for good performance and generalization.
We highlight that single sentences frequently induce more than one triple. As an example of this, consider the lower part of Figure 1, which illustrates a sentence that consolidates two different pieces of information.
Based on the preceding deliberations, we can view any problem instance as a tuple (X, Y), where X = (w_1, w_2, …, w_m) describes a sentence as a sequence of words, and Y = {y_1, y_2, …, y_n} defines a set of valid extractions. Depending on the problem definition being used, the elements of Y are either sequences of BIO tags, i.e., y_i ∈ L^m for 1 ≤ i ≤ n, or triples specified in terms of spans. Independently of the employed input and output formulation, our aim in this work is to model the conditional distribution P{y | X}.

Methodology
In this work, we consider several variations and extensions of existing architectures for OIE, viewed both as a sequence-tagging and as a span-prediction task. To that end, we subdivide models conceptually into three blocks, which we call embedding, encoding, and prediction (listed from bottom to top), and investigate the impact of different NN modules for each of them below. The blocks serve the purposes indicated by their designations, and are described in detail in the rest of this section.
Embedding. The embedding block represents the bottom part of our models, and serves the purpose of mapping text, given as a sequence of tokens, to a sequence of embedding vectors. Traditionally, embedding vectors were (pre-)trained for a fixed vocabulary of tokens, and used directly in place of the actual tokens in a processed input sequence (Pennington et al., 2014). More recently, however, different models for computing contextualized vector representations of tokens in an input sequence have been used to represent text input (Alec et al., 2018; Devlin et al., 2018; Howard and Ruder, 2018), which induced notable progress on a multitude of NLP tasks.
We consider both approaches, and use either plain word-piece embeddings (Wu et al., 2016), which we refer to as the simple embedding block, or ALBERT (Lan et al., 2019) for computing contextualized representations.
For the task of OIE, it is common practice to make use of part-of-speech (PoS) tags in addition to the actual input text (Stanovsky et al., 2018; Jia and Xiang, 2019; Zhan and Zhao, 2019). We follow this practice, and append an embedding representing the corresponding PoS tag to every vector produced by the used embedding block. In the case of word pieces, each sub-word token is attributed the same PoS tag as the full word it belongs to.
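The concatenation of token and PoS embeddings can be sketched as follows (the dimensions and the lookup tables are hypothetical; the paper does not state the size of the PoS embeddings):

```python
import numpy as np

def embed_with_pos(tokens, pos_tags, word_emb, pos_emb):
    """Concatenate word(-piece) and PoS embeddings position by position.
    `word_emb` and `pos_emb` are lookup tables from token/tag to vector."""
    return np.stack([np.concatenate([word_emb[t], pos_emb[p]])
                     for t, p in zip(tokens, pos_tags)])
```

For word pieces, every sub-word token would simply be looked up with the PoS tag of its full word.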
Encoding. The encoding block constitutes the middle part of the considered models, which processes an embedded input sequence, as provided by the module used in the embedding block, and outputs an encoded sequence of equal length. In this work, we make use of three different NN modules for encoding embedded sequences: a BiLSTM, the encoder part of a transformer, and a BiLSTM combined with a CNN, as introduced by Jia and Xiang (2019). For the simple BiLSTM encoder, we concatenate the top-layer hidden states of both directions, and use these as encodings of the respective tokens in the input sequence. The LSTM-CNN encoder processes a provided sequence in parallel with a BiLSTM and a CNN that are independent of each other. The used CNN consists of just one convolutional layer followed by max-pooling, and maps the entire embedded input sequence, viewed as matrix, to a single output vector. This vector is then concatenated with every step in the encoded sequence that is provided by the BiLSTM to yield the final output. Again, the encoding provided by the BiLSTM consists of the concatenated top-layer hidden states.
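Assuming precomputed BiLSTM states, the CNN branch and the final concatenation of the LSTM-CNN encoder can be sketched as follows (kernel shapes and function names are our own):

```python
import numpy as np

def cnn_sentence_vector(embedded, kernel):
    """One convolutional layer followed by max-pooling over time: maps
    the whole embedded sentence (m, d) to a single vector (f,)."""
    m, _ = embedded.shape
    k = kernel.shape[0]                              # kernel: (k, d, f)
    feats = np.stack([np.einsum('kd,kdf->f', embedded[i:i + k], kernel)
                      for i in range(m - k + 1)])
    return feats.max(axis=0)                         # max-pool over time

def lstm_cnn_encode(bilstm_states, embedded, kernel):
    """Append the pooled CNN vector to every BiLSTM state."""
    s = cnn_sentence_vector(embedded, kernel)
    return np.concatenate(
        [bilstm_states, np.tile(s, (bilstm_states.shape[0], 1))], axis=1)
```

Note that the same sentence-level CNN vector is appended to every time step, so the output sequence has the same length as the input.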
Prediction. The top block of a model takes an encoded sequence as input, and computes a probability distribution over all possible triples to extract.
We consider four different model architectures as prediction blocks, three for the sequence-tagging setting, based on LSTMs, CRFs, and multi-layer perceptrons (MLPs), and the recently introduced SpanOIE model (Zhan and Zhao, 2019), which considers OIE as a span-prediction task.
We use LSTMs for predicting label sequences left-to-right, such that every step models the conditional distribution of the next label given the encoded input sequence up to the current step as well as all previously predicted labels, i.e., P{y_t | e_1, …, e_t, y_1, …, y_{t−1}} for 1 ≤ t ≤ m, where e_1, …, e_m is an encoded sequence, as provided by the used encoding block, and (y_1, …, y_m) ∈ L^m a sequence of labels.
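While the experiments later decode with beam search, a greedy left-to-right loop suffices to illustrate the autoregressive formulation (the `step_fn` interface is our own abstraction of the LSTM step):

```python
def greedy_decode(step_fn, encoded):
    """Predict one label per position, feeding back previous labels.
    `step_fn(prefix, labels)` must return a dict mapping each tag in L
    to its conditional probability given the encodings seen so far and
    all previously predicted labels."""
    labels = []
    for t in range(len(encoded)):
        dist = step_fn(encoded[: t + 1], labels)
        labels.append(max(dist, key=dist.get))   # most probable next tag
    return labels
```

Beam search replaces the single running hypothesis with the k best partial label sequences at every step.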
CRFs are commonly used in combination with sequential models, such as recurrent neural networks (RNNs), since they provide a convenient way of modeling the joint distribution of entire label sequences rather than just step-wise conditional distributions. Furthermore, using CRFs on top of a recurrent model frequently results in an increased prediction accuracy (e.g., Huang et al., 2015). For our purposes, we employ a standard linear-chain CRF as prediction block.
MLPs are another family of NN modules that is frequently used for predicting labels in the context of OIE. In contrast to the previously mentioned predictors, however, they compute labels independently for each step of an input sequence, disregarding those computed for any surrounding positions. Hence, employing MLPs is based on the simplifying assumption that P{y | X} = Π_{t=1}^{m} P{y_t | e_t}, using the same terminology as above.
Finally, we use the SpanOIE model as the only prediction block that is based on the span formulation of OIE. For an encoded input sequence, this model computes scores independently for each triple in a set of candidates to extract, which are then normalized to yield a probability distribution. We refer the interested reader to Zhan and Zhao (2019) for further details.
Training Loss. Since prediction blocks aim to model the probability distribution over all candidate triples to extract, the negative log-likelihood (NLL) lends itself as the natural loss function to use. If we consider OIE as a sequence-tagging task, however, then optimizing the NLL requires us to deal with one problematic aspect. Among all BIO tags, the O-tag, indicating tokens that are not part of subject, predicate, or object, tends to appear at a much higher frequency than the other tags. This, in turn, means that the ability to correctly predict O-tags has the strongest direct influence on the NLL, which encourages models to focus too much on this label.
To account for this issue, we explore three novel training schemes, which disregard the probabilities of certain positions in a sequence when computing the NLL, as illustrated in Figure 2. A straightforward attempt to decrease the influence of O-tags on the loss term is to disregard them entirely (cf. Figure 2a). Since we model probability distributions over tag sequences, as opposed to computing unnormalized scores, the law of total probability still allows a network to learn when to predict O-tags, even though it is not explicitly trained to do so. Informally, this means that P{O} = 1 − P{B} − P{I}, which is why the trained network learns to predict O-tags whenever the probabilities of all remaining tags are small.
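A masked NLL implementing this first scheme might look as follows (a minimal sketch, assuming per-position log-probabilities given as dictionaries; the helper names are our own):

```python
import math

def masked_nll(log_probs, gold, keep):
    """NLL of the gold tags, averaged over the positions kept by a
    training scheme; log_probs[t] maps each tag to its log-probability."""
    terms = [-lp[g] for lp, g, k in zip(log_probs, gold, keep) if k]
    return sum(terms) / max(len(terms), 1)

def mask_no_o(gold):
    """Scheme (a): disregard every position whose gold tag is O."""
    return [g != "O" for g in gold]
```

The other two schemes only differ in the mask they produce, so the same loss function can be reused.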
The second training scheme is based on the observation that the critical aspect of predicting a sequence of tags is identifying transitions between a block of O-tags and an element of the triple, i.e., the subject, the predicate, and the object, or directly between two of the latter. Hence, this training scheme disregards all probabilities except for those of positions that appear right before or right after a transition (cf. Figure 2b). Intuitively, this makes sense, as all the disregarded tags can be determined immediately, once we know the boundaries of the subject, the predicate, and the object.
Combining the two previously outlined ideas results in our third training scheme, which optimizes only the first and the last position of every element of the extracted triple (cf. Figure 2c). Again, the remaining tags can be determined from the knowledge of these tags alone.
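The resulting mask over gold label sequences for this third scheme can be sketched as follows (the implementation details are our own):

```python
def mask_first_last(gold):
    """Scheme (d): keep only the first and last position of every
    subject/predicate/object span; all other positions are ignored.
    For single-token spans, first and last position coincide."""
    keep = [False] * len(gold)
    i = 0
    while i < len(gold):
        if gold[i].endswith("-B"):
            elem = gold[i].split("-")[0]             # A0, P, or A1
            j = i
            while j + 1 < len(gold) and gold[j + 1] == elem + "-I":
                j += 1                               # extend to span end
            keep[i] = keep[j] = True                 # span boundaries only
            i = j + 1
        else:
            i += 1
    return keep
```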
Finally, notice that the presented schemes are used during the training of a model only. For inference and sampling, we make use of all probabilities, including those of O-tags.

Experimental Evaluation
Dataset. For our experiments, we made use of the OIE16 benchmark dataset (Stanovsky and Dagan, 2016). With a total of 5,078 training samples, the dataset is rather small, though, making it hard to train models that generalize to unseen problem instances. For this work, we thus augmented the OIE16 training data with samples from another dataset created by Cui et al. (2018). The latter is a huge dataset, consisting of more than 36M training samples that have been generated using OpenIE4 (Angeli et al., 2015), and contains examples of lower quality than the OIE16 data. Furthermore, there is usually just one target triple to extract per sentence in this dataset, while OIE models are generally expected to find all of them.
To make effective use of the low-quality dataset, we had to perform a number of preprocessing steps (described in detail in Appendix A), which left us with a total of 1.7M training samples that were combined with the training partition of the OIE16 benchmark dataset.

Experimental Setup. We conducted a large-scale study in which we trained and evaluated all combinations of embedding, encoding, and prediction blocks introduced above. For those models containing an ALBERT embedding block, we made use of the pretrained ALBERT model (albert-base-v1) provided by Wolf et al. (2019). Due to resource constraints, we had to freeze the parameters of the ALBERT blocks to speed up the training of our models. For models containing a simple embedding block, we used word-piece embeddings (Wu et al., 2016) rather than word embeddings, since these yielded better results in our experiments, and initialized them with the embedding vectors that were pretrained with the same ALBERT model used in the ALBERT embedding blocks.
Notice that the architectures considered in our study also cover all the currently most important methods of OIE that are based on deep learning (DL) (Angiras, 2018; Cui et al., 2018; Stanovsky et al., 2018; Zhan and Zhao, 2019), as well as a slightly modified version of the model by Jia and Xiang (2019), as described in Appendix C. We do not compare to any rule-based OIE systems, though, as they have been shown to consistently achieve inferior results compared to DL-based approaches in related work (Angiras, 2018; Cui et al., 2018).
Models that use either the LSTM or the MLP prediction block were trained and evaluated for all the training schemes presented above. Other predictors cannot be used with any of the newly introduced schemes, though, and were thus trained by minimizing the standard NLL only.
All considered models were evaluated on the test partition of the OIE16 dataset with respect to both F1 score and the area under the precision-recall curve (AUC-PR). To that end, we computed the top-20 predictions for each of the samples in the test data, using either beam search or, in the case of the CRF predictor, the Viterbi algorithm, and considered all predictions with a probability above a certain threshold as extractions of the evaluated model. As the prediction threshold, we chose, separately for each model, the one that achieved the maximum F1 score on the validation partition of the OIE16 dataset. Notice that the prediction threshold was well below 0.5 in all our experiments, usually around 0.1, and thus allowed for extracting multiple triples per sample.
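The per-model threshold selection can be sketched as a simple sweep over candidate thresholds on the validation data (a hypothetical helper, not the paper's code):

```python
def sweep_threshold(scored, n_gold):
    """Pick the probability threshold maximising F1 on validation data.
    `scored` is a list of (probability, is_correct) pairs for all
    candidate extractions; `n_gold` is the number of gold triples."""
    best_f1, best_thr = 0.0, 0.0
    for thr in sorted({p for p, _ in scored}):
        kept = [ok for p, ok in scored if p >= thr]
        tp = sum(kept)
        prec = tp / len(kept) if kept else 0.0
        rec = tp / n_gold
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        if f1 > best_f1:
            best_f1, best_thr = f1, thr
    return best_thr
```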
To determine whether a predicted triple matches any of the target triples in the test data, we employed the evaluation scheme that is typically used in the context of OIE (He et al., 2015). To that end, a predicted triple is considered correct if each of its elements, i.e., subject, predicate, and object, contains the syntactic head of the corresponding element in the target triple. For instance, if the test set contains a triple ⟨Donald Trump, is, president of the U.S.⟩, then the prediction ⟨Trump, is, president of the U.S.⟩ is considered correct, as »Trump« is the syntactic head of the subject phrase »Donald Trump«. Occasionally, OIE models are evaluated with predicate-head hinting (Stanovsky et al., 2018), which means that the beginning of the predicate in the target triple is marked as part of the input. More precisely, a special token is inserted into the input sequence right before the first token that is part of the predicate to extract, the so-called predicate head. This, obviously, makes the task of OIE considerably easier, and does not reflect the typical problem scenario. Nevertheless, we ran all experiments a second time with predicate-head hinting, and provide the according results in Appendix C.
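A minimal sketch of the lenient matching criterion, assuming the syntactic heads of the gold elements are already available (in practice they come from a syntactic parser):

```python
def lenient_match(predicted, gold_heads):
    """A predicted triple counts as correct if each of its elements
    contains the syntactic head of the corresponding gold element.
    `predicted` is a (subject, predicate, object) triple of phrases,
    `gold_heads` the corresponding head words of the gold triple."""
    return all(head in element.split()
               for element, head in zip(predicted, gold_heads))
```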
Due to the large number of experiments that have been conducted, it was not possible for us to perform grid search for every one of them separately. Instead, we chose a set of hyperparameters that we found to work well across a multitude of initial training runs, and kept them constant over all performed experiments. The exact hyperparameter values are reported in Appendix B. Furthermore, the code that was written for our experimental evaluation is available for download on GitHub (https://github.com/phohenecker/emnlp2020-oie).

Results. Table 1 summarizes the results of our comparative study, and provides a number of interesting insights. First and foremost, we see that the model with the transformer encoding and the LSTM prediction block achieved the best F1 score both with and without an ALBERT embedding block. The former is also the overall best model with respect to both F1 score and AUC-PR, and outperformed all the considered state-of-the-art approaches by a significant margin of at least 0.421 F1 score and 0.420 AUC-PR, respectively. While one might have expected to see the transformer encoding block at the top of the score board, it is surprising that the CRF predictor performed significantly worse than the LSTM and MLP prediction blocks, as CRFs have previously achieved strong results on different tasks of sequence prediction (e.g., Huang et al., 2015). Less surprising, though, is that using an ALBERT embedding block led to increased performance in almost all cases, which is in line with the existing research on BERT and related models. Another important insight is that the newly introduced training schemes for sequence-tagging models helped to boost performance significantly in comparison with the usual way of computing the training loss. In this context, optimizing only the first and the final tokens of subject, predicate, and object proved particularly useful (column (d) in the table), and led to the best results for almost all encoding and prediction blocks.
At this point, we want to emphasize that the results that we have achieved for models from related work differ significantly from those reported in the respective papers, which is easily explained by the difference in how models were evaluated therein. In the context of OIE, it is common practice to pre-select candidate triples during inference, and subsequently use a model to score each of them. While this might make sense for optimizing a system in a production environment, it obfuscates a model's true ability to some extent, which is why we decided to not use any kind of pre-selection at all.

Figure 3: The average improvements caused by using an ALBERT embedding block in place of a simple one and by adding an encoding block to a model in terms of F1 score, computed across all experiments performed (simple→ALBERT: +0.012 F1; +BiLSTM: +0.051 F1; +BiLSTM-CNN: +0.061 F1; +Transformer: +0.063 F1).
Analysis and Ablation Studies. In this section, we further analyze the results presented above, including the ablations that have been performed as part of our comparative study already. Furthermore, we present additional ablation studies for the bestperforming model.
At the bare minimum, a model consists of a simple embedding block as well as one of the prediction blocks presented above, while encoding blocks are generally optional. Hence, we first investigate the effect of moving from a simple to an ALBERT embedding block on the one hand, and how adding an encoding block influences a model's performance on the other; Figure 3 summarizes our insights. First and foremost, we observed that encoding blocks, on average, cause much higher improvements than using an ALBERT embedding block in place of a simple one. More precisely, ALBERT caused just a small mean improvement of 0.012 F1. Encoding blocks, however, pushed performance notably, by 0.058 F1 on average, and, as expected, transformers performed best among all considered encoding blocks.
Next, we compare the various prediction blocks used in our experiments. To that end, Figure 4 illustrates the mean performance of each prediction block, contrasted with the mean F1 score computed over all experiments performed. Surprisingly, the LSTM prediction block performed, on average, by far the best among all predictors considered, and outperformed all other prediction blocks for each of the training schemes considered. Another surprise is that even the MLP predictor achieved better results than the CRF prediction block for all training schemes except (a). This contrasts with previous findings on the use of CRFs for sequence-tagging (Huang et al., 2015), and suggests that the training scheme, which is analyzed next, has a much bigger impact on a model's performance than certain choices about its architecture for the task of OIE. Therefore, when trained in the right way, a simple predictor, such as an MLP, can outperform a more powerful one, such as a CRF, which does not allow for using the introduced training schemes. Finally, the mean performance of the SpanOIE prediction block was found to lie between the values observed for the CRF and the MLP predictors.
Finally, Figure 5 compares the mean performance achieved for each of the different training schemes with the overall mean F1 score computed over all experiments with sequence-tagging models (as schemes (b) to (d) can be used with these models only). As illustrated in the figure, we observed significant differences among the training schemes. First and foremost, we see that training scheme (d), which optimizes only the first and last positions of subjects, predicates, and objects, clearly outperformed all other schemes. Intuitively, this makes sense, as this view reduces the problem of OIE to the absolute minimum, which is the question of where elements start and end, respectively. In contrast to this, optimizing the NLL on entire label sequences, i.e., scheme (a), led to the worst mean performance. This suggests that O-tags, which appear most frequently among all labels, are attributed too much importance in general, and steer attention away from important aspects such as the boundaries of triples' elements, which are emphasized by the other training schemes. Schemes (b) and (c) resulted in similar mean F1 scores, close to the overall mean performance, and hence performed significantly worse than (d). Since the parts of a label sequence considered by each of these three schemes allow for reconstructing the entire sequence, one possible explanation for this is, once again, that training scheme (d) reduces the task to the absolute minimum, which might also render the according learning task as easy as possible.
We want to emphasize that the lenient evaluation scheme that is used in the context of OIE allows models to »cheat«, in the sense that they can learn to not predict O-tags at all in order to improve the chances that very long subjects, predicates, and objects actually cover the syntactic heads of some target triples.
To show that this does not happen with our models, we computed how often the top model predicted an O-tag correctly (as an O-tag) and how often it labelled one as part of a subject, predicate, or object, respectively. Table 2 summarizes our findings, and shows that the model is highly accurate on subjects and predicates, with less than one O-tag prepended and appended to any target subject or predicate on average. For objects, we find similar numbers for prepended, but slightly higher ones for appended O's. Manual inspection of the data reveals, however, that this is justified in most cases. A good example of this is the one that has been used above: if the target object »a composer« is predicted as »a composer of classical music«, then »of classical music« is a suffix that is represented as a series of O-tags in the dataset, but which may very well be considered part of the object. Overall, the model predicted on average 78.7% of all O-tags correctly as such for each sample in the test data, which together with the other statistics supports the conclusion that it did not learn to cheat by generating overly long elements.
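The O-tag accuracy reported above can be computed along these lines (a sketch; the exact bookkeeping in the paper may differ):

```python
def o_tag_accuracy(gold, pred):
    """Fraction of gold O-tags that the model also labels as O."""
    at_gold_o = [p for g, p in zip(gold, pred) if g == "O"]
    return sum(t == "O" for t in at_gold_o) / len(at_gold_o)
```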
For our best-performing model, we conducted additional ablation studies, looking at how its performance changes when the number of layers in the encoder (1 in the top model) and the hidden-layer size used in both encoder and predictor (512 in the top model) are varied, as summarized in Table 3.

Table 3: Results achieved for different configurations of our best-performing model, which consists of an ALBERT embedding block, a transformer encoding block, and an LSTM prediction block. The best configuration is printed in boldface.
We noted that increasing the number of layers in the transformer encoding block resulted in lower values of both F1 score and AUC-PR. The same was the case for both increasing and decreasing the used hidden size. Since our training data consists of a total of only about 1.7M samples, a likely explanation is that there is a balance between under- and overfitting the data, which our top model seems to strike well. Finally, we noticed a similar pattern with respect to the different training schemes as the one discussed above.
Some of the existing works on neural OIE employ a correction of malformed predicted label sequences (Stanovsky et al., 2018). To that end, orphan intermediate labels, i.e., ones that are preceded neither by the corresponding beginning-label nor by the same intermediate label, are corrected to O-tags. In our experiments, this approach did not lead to any improvements, and is thus not further elaborated on. Finally, Appendix C provides additional insights gained in our experiments with predicate-head hinting, which led to an average improvement of 0.074 F1 score and 0.077 AUC-PR, respectively.
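The orphan-label correction described above can be sketched as follows (the helper name is our own):

```python
def fix_orphans(tags):
    """Turn orphan I-tags, i.e., ones preceded neither by the matching
    B-tag nor by the same I-tag, into O-tags."""
    out, prev = [], "O"
    for t in tags:
        if t.endswith("-I") and prev.split("-")[0] != t.split("-")[0]:
            t = "O"                # orphan: no matching span started
        out.append(t)
        prev = t
    return out
```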

Conclusion
In this work, we systematically compared a range of different NN architectures for OIE, as well as different schemes for training them. In doing so, we improved the state of the art on the OIE16 benchmark by 0.421 F1 and 0.420 AUC-PR, respectively (i.e., by more than 200% in both cases), with a novel model consisting of an ALBERT embedding block, a transformer encoding block, and an LSTM prediction block, which was trained by means of a training scheme using a newly introduced loss formulation. Subsequent analysis revealed that choosing the right training scheme is as important as selecting the neural model architecture, as the standard NLL loss attributes too much importance to non-essential aspects of the data, and consistently leads to inferior results.

A Data Preprocessing
For this work, we augmented the OIE16 training data with samples from another dataset created by Cui et al. (2018). The latter is a huge dataset, consisting of more than 36M training samples that have been generated using OpenIE4 (Angeli et al., 2015), and contains examples with lower quality than the OIE16 data. Furthermore, there is usually just one target triple to extract per sentence in this dataset, while OIE models are generally expected to find all of them.
To make effective use of the low-quality dataset, we had to perform a number of preprocessing steps. First and foremost, some samples make use of additional tags for specifying label sequences. Following Cui et al. (2018), we discarded these tags, which results in samples that are still valid and compatible with the standard set of BIO tags, as presented above. Furthermore, Angiras (2018) observed that models trained on the huge low-quality dataset tend to saturate in the first training epoch already, usually resulting in low training, but high test, error. To obtain better results, we thus filtered the dataset with respect to the distribution of predicates in the target triples. We observed that only about 8K out of more than 71K predicates appear at least 50 times, and pruned all training samples with predicates below this threshold, which left us with a dataset of about 28M sentences. Next, we observed highly uneven occurrence statistics among the remaining predicates, and, following Cui et al. (2018), further down-sampled the data to avoid training on a dataset with highly imbalanced target predicates. In doing so, we proceeded as follows:
• for predicates appearing less than 500 times,
Sampling was performed uniformly at random, and reduced the training data to a total of 1.7M training samples, which were combined with the training partition of the OIE16 benchmark dataset. The resulting dataset can be downloaded from our GitHub repository.

B Hyperparameters
Table 4 summarizes the hyperparameters that were used in our experiments. These were determined over a number of initial experiments, and kept constant throughout all training runs conducted for this paper.
C Analysis with Predicate-head Hinting
Table 5 summarizes the results of our experiments with predicate-head hinting, and, as expected, we observed an average improvement of 0.089 F 1 and 0.078 AUC-PR, respectively. Surprisingly, however, the top model in terms of F 1 score in this setting was the one that uses an ALBERT embedding block, a BiLSTM encoding block, and an LSTM prediction block, followed by the model that performed best without predicate-head hinting, which uses a transformer encoding block instead. The latter was once again the top model in terms of AUC-PR, though. In the remainder of this appendix, we take a closer look at these results, which provide a few more interesting insights.
First and foremost, we observed that adding an embedding block to a model led, on average, to a much larger improvement than was the case without predicate-head hinting, as illustrated in Figure 6. More precisely, we found an average improvement of 0.107 F 1 , which is about 0.047 F 1 greater than without predicate-head hinting, while the differences among the considered embedding blocks remained about the same. This suggests that all embedding blocks are able to effectively leverage the provided details about the predicate head and, in turn, create better encodings of an input sequence. Furthermore, we see that the transformer encoding block still yields the greatest average improvement. In contrast, using ALBERT instead of a simple embedding block induced an equally small improvement as before.
Figure 7 provides a comparison of the different prediction blocks, and, interestingly, we see that the differences among the predictors are smaller with predicate-head hinting. One possible explanation is that, in this setting, the encoding block accounts for additional details of an input sequence as part of the computed encoding, which becomes possible because the predicate head is provided as input to the model. This, in turn, reduces the impact of the employed prediction block. Furthermore, we observed that, unlike before, the CRF predictor performed, on average, slightly better than the SpanOIE prediction block, whereas LSTM and MLP predictors performed notably better, albeit with a slightly smaller gap between them than without predicate-head hinting.
Finally, Figure 8 provides a comparison of the different training schemes, and we see that their impact has become larger than it was without predicate-head hinting. This is somewhat surprising, but emphasizes once again that the introduced schemes are more effective than optimizing the standard NLL, as they focus on relevant aspects of the considered problem while paying less attention to incidental details.

D Adjustments of the Model by Jia and Xiang (2019)
As indicated in the main text, our BiLSTM-CNN encoding block is a slightly modified version of the hybrid BiLSTM-CNN network introduced by Jia and Xiang (2019). In the original work, the word embeddings were passed through a BiLSTM network to give [L_f, L_b], where L_f is the collection of hidden states from the forward pass of the BiLSTM, and L_b are the hidden states from the corresponding backward pass. In addition, the word representations were also passed through a CNN to compute a representation vector C. Then, all vectors in L_f and L_b as well as C were concatenated into a single vector and fed into a dense layer with a softmax activation to compute predictions. In this work, however, we concatenate the CNN output with every pair of forward and backward hidden states from the BiLSTM separately, resulting in a sequence of hidden representations of the same length as the processed input sequence. These hidden representations are then fed into the subsequent prediction block.
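The modified concatenation can be sketched at the shape level with NumPy (a minimal sketch with zero placeholders instead of learned weights; all dimension sizes are illustrative):

```python
import numpy as np

def concat_per_timestep(l_f, l_b, c):
    """Concatenate a sentence-level CNN vector c to every pair of
    forward/backward BiLSTM hidden states, yielding one hidden
    representation per input token (shapes only, no learned weights).

    l_f, l_b: (seq_len, hidden) forward/backward hidden states
    c:        (cnn_dim,) sentence-level CNN representation
    """
    seq_len = l_f.shape[0]
    c_tiled = np.tile(c, (seq_len, 1))  # repeat c for every timestep
    return np.concatenate([l_f, l_b, c_tiled], axis=-1)

# toy shapes: 5 tokens, hidden size 8, CNN vector of size 16
h = concat_per_timestep(np.zeros((5, 8)), np.zeros((5, 8)), np.zeros(16))
# h has shape (5, 32): one 8+8+16 vector per token
```

In contrast, the original formulation collapses the whole sequence into a single concatenated vector, which is why it cannot feed a per-token prediction block directly.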