IMoJIE: Iterative Memory-Based Joint Open Information Extraction

While traditional systems for Open Information Extraction were statistical and rule-based, recently neural models have been introduced for the task. Our work builds upon CopyAttention, a sequence generation OpenIE model (Cui et al., 2018). Our analysis reveals that CopyAttention produces a constant number of extractions per sentence, and its extracted tuples often express redundant information. We present IMoJIE, an extension to CopyAttention, which produces the next extraction conditioned on all previously extracted tuples. This approach overcomes both shortcomings of CopyAttention, resulting in a variable number of diverse extractions per sentence. We train IMoJIE on training data bootstrapped from extractions of several non-neural systems, which have been automatically filtered to reduce redundancy and noise. IMoJIE outperforms CopyAttention by about 18 F1 pts, and a BERT-based strong baseline by 2 F1 pts, establishing a new state of the art for the task.


Introduction
Extracting structured information from unstructured text has been a key research area within NLP. The paradigm of Open Information Extraction (OpenIE) (Banko et al., 2007) uses an open vocabulary to convert natural text to semi-structured representations, by extracting a set of (subject, relation, object) tuples. OpenIE has found wide use in many downstream NLP tasks (Mausam, 2016) like multi-document question answering and summarization (Fan et al., 2019), event schema induction (Balasubramanian et al., 2013) and word embedding generation (Stanovsky et al., 2015).
Traditional OpenIE systems are statistical or rule-based. They are largely unsupervised in nature, or bootstrapped from extractions made by earlier systems. They often consist of several components, such as POS tagging and syntactic parsing. To bypass error accumulation in such pipelines, end-to-end neural systems have been proposed recently.
Recent neural OpenIE methods belong to two categories: sequence labeling, e.g., RnnOIE, and sequence generation, e.g., CopyAttention (Cui et al., 2018). In principle, generation is more powerful because it can introduce auxiliary words or change word order. However, our analysis of CopyAttention reveals that it suffers from two drawbacks. First, it does not naturally adapt the number of extractions to the length or complexity of the input sentence. Second, it is susceptible to stuttering: extraction of multiple triples bearing redundant information.
These limitations arise because its decoder has no explicit mechanism to remember what parts of the sentence have already been 'consumed' or what triples have already been generated. Its decoder uses a fixed-size beam for inference. However, beam search can only ensure that the extractions are not exact duplicates.
In response, we design the first neural OpenIE system that uses sequential decoding of tuples conditioned on previous tuples. We achieve this by adding every generated extraction so far to the encoder. This iterative process stops when the EndOfExtractions tag is generated by the decoder, allowing it to produce a variable number of extractions. We name our system Iterative MemOry Joint Open Information Extraction (IMOJIE).
CopyAttention uses a bootstrapping strategy, where the extractions from OpenIE-4 (Pal and Mausam, 2016) are used as training data. However, we believe that training on extractions of multiple systems is preferable. For example, OpenIE-4 benefits from high precision compared to ClausIE (Del Corro and Gemulla, 2013), which offers high recall. By aggregating extractions from both, IMOJIE could potentially obtain a better precision-recall balance.

Sentence: Greek and Roman pagans , who saw their relations with the gods in political and social terms , scorned the man who constantly trembled with fear at the thought of the gods , as a slave might fear a cruel and capricious master .
OpenIE-4: ( the man ; constantly trembled ; )
IMOJIE:   ( a slave ; might fear ; a cruel and capricious master )
          ( Greek and Roman pagans ; scorned ; the man who ... capricious master )
          ( the man ; constantly trembled ; with fear at the thought of the gods )
          ( Greek and Roman pagans ; saw ; their relations with the gods in political and social terms )

However, simply concatenating extractions from multiple systems does not work well, as it leads to redundancy as well as exaggerated noise in the dataset. We devise an unsupervised Score-and-Filter mechanism to automatically select a subset of these extractions that are non-redundant and expected to be of high quality. Our approach scores all extractions with a scoring model, followed by filtering to reduce redundancy.
We compare IMOJIE against several neural and non-neural systems, including our extension of CopyAttention that uses BERT (Devlin et al., 2019) instead of an LSTM at encoding time, which forms a very strong baseline. On the recently proposed CaRB metric, which penalizes redundant extractions (Bhardwaj et al., 2019), IMOJIE outperforms CopyAttention by about 18 F1 pts and the BERT-based baseline by about 2 F1 pts.
Recently, to reduce error accumulation in these pipeline systems, neural OpenIE models have been proposed. They belong to one of two paradigms: sequence labeling or sequence generation. Sequence Labeling involves tagging each word in the input sentence as belonging to the subject, predicate, object or other. The final extraction is obtained by collecting labeled spans into different fields and constructing a tuple. RnnOIE is a labeling system that first identifies the relation words and then uses sequence labeling to get their arguments. It is trained on the OIE2016 dataset, which postprocesses SRL data for OpenIE. SenseOIE (Roy et al., 2019) improves upon RnnOIE by using the extractions of multiple OpenIE systems as features in a sequence labeling setting. However, its training requires manually annotated gold extractions, which is not scalable for the task. This restricts SenseOIE to training on a dataset of 3,000 sentences. In contrast, our proposed Score-and-Filter mechanism is unsupervised and needs no annotated data, so it can scale to much larger corpora. Jiang et al. (2019) is another labeling system that better calibrates extractions across sentences.
SpanOIE (Zhan and Zhao, 2020) uses a span selection model, a variant of the sequence labelling paradigm. Firstly, the predicate module finds the predicate spans in a sentence. Subsequently, the argument module outputs the arguments for this predicate. However, SpanOIE cannot extract nominal relations. Moreover, it bootstraps its training data over a single OpenIE system only. In contrast, IMOJIE overcomes both of these limitations.
Sequence Generation uses a Seq2Seq model to generate output extractions one word at a time. The generated sequence contains field demarcators, which are used to convert the generated flat sequence to a tuple. CopyAttention (Cui et al., 2018) is a neural generator trained over bootstrapped data generated from OpenIE-4 extractions on a large corpus. During inference, it uses beam search to get the predicted extractions. It uses a fixed-size beam, limiting it to output a constant number of extractions per sentence. Moreover, our analysis shows that CopyAttention extractions severely lack diversity, as illustrated in Table 1. The Logician model is a restricted sequence generation model for extracting tuples from Chinese text. Logician relies on coverage attention and gated-dependency attention, a language-specific heuristic for Chinese. Using coverage attention, the model also tackles generation of multiple extractions while being globally aware.
We compare against Logician's coverage attention as one of the approaches for increasing diversity.
Sequence-labeling based models lack the ability to change the sentence structure or introduce new auxiliary words while uttering predictions. For example, they cannot extract (Trump, is the President of, US) from "US President Trump", since 'is', 'of' are not in the original sentence. On the other hand, sequence-generation models are more general and, in principle, need not suffer from these limitations.
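The limitation of labeling can be made concrete with a minimal sketch. It assumes a hypothetical tagger that labels each word as subject (S), relation (R), object (O) or other (X); the label names and the `tuple_from_labels` helper are illustrative and not taken from any of the systems above.

```python
from typing import List, Tuple

def tuple_from_labels(words: List[str], labels: List[str]) -> Tuple[str, ...]:
    """Collect words tagged S, R, O into (subject, relation, object)."""
    fields = {"S": [], "R": [], "O": []}
    for word, label in zip(words, labels):
        if label in fields:  # any other label (e.g. "X") is ignored
            fields[label].append(word)
    return tuple(" ".join(fields[key]) for key in ("S", "R", "O"))

words = ["US", "President", "Trump"]
labels = ["O", "R", "S"]  # hypothetical tagger output
labeled = tuple_from_labels(words, labels)
print(labeled)  # ('Trump', 'President', 'US')
```

Every word in the output is copied from the input, so 'is' and 'of' can never appear, whatever labels the tagger assigns.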
Evaluation: All neural models have shown improvements over the traditional systems using the OIE2016 benchmark. However, recent work shows that the OIE2016 dataset is quite noisy, and that its evaluation does not penalize highly redundant extractions (Léchelle et al., 2018). In our work, we use the latest CaRB benchmark, which crowdsources a new evaluation dataset, and also provides a modified evaluation framework to downscore near-redundant extractions (Bhardwaj et al., 2019).

Sequential Decoding
We now describe IMOJIE, our generative approach that can output a variable number of diverse extractions per sentence. The architecture of our model is illustrated in Figure 1. At a high level, the next extraction from a sentence is best determined in context of all other tuples extracted from it so far. Hence, IMOJIE uses a decoding strategy that generates extractions in a sequential fashion, one after another, each one being aware of all the ones generated prior to it.
This kind of sequential decoding is made possible by the use of an iterative memory. Each generated extraction is added to the memory so that the next iteration of decoding has access to all of the previous extractions. We simulate this iterative memory with the help of a BERT encoder, whose input includes the [CLS] token and the original sentence, followed by the extractions generated so far, each preceded by a [SEP] token. IMOJIE uses an LSTM decoder, which is initialized with the embedding of the [CLS] token. The contextualized embeddings of all the word tokens are used for the Copy (Gu et al., 2016) and Attention (Bahdanau et al., 2015) modules. The decoder generates the tuple one word at a time, producing rel and obj tokens to indicate the start of relation and object respectively. The iterative process continues until the EndOfExtractions token is generated.
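As a small illustration of how the flat decoder output maps back to a tuple, the sketch below splits a token sequence at the relation and object markers. The surface forms `<rel>` and `<obj>` and the function name `to_tuple` are assumptions made for this example; the text only states that such marker tokens exist.

```python
from typing import List, Tuple

REL, OBJ = "<rel>", "<obj>"  # assumed surface forms of the marker tokens

def to_tuple(tokens: List[str]) -> Tuple[str, str, str]:
    """Split a flat token sequence into (subject, relation, object)."""
    if REL not in tokens or OBJ not in tokens:
        raise ValueError("sequence must contain both marker tokens")
    rel_at, obj_at = tokens.index(REL), tokens.index(OBJ)
    if obj_at < rel_at:
        raise ValueError("object marker must follow relation marker")
    subject = " ".join(tokens[:rel_at])
    relation = " ".join(tokens[rel_at + 1:obj_at])
    obj = " ".join(tokens[obj_at + 1:])
    return subject, relation, obj

example = to_tuple("I <rel> ate <obj> an apple".split())
print(example)  # ('I', 'ate', 'an apple')
```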
The overall process can be summarized as:
1. Pass the sentence through the Seq2Seq architecture to generate the first extraction.
2. Concatenate the generated extraction with the existing input and pass it again through the Seq2Seq architecture to generate the next extraction.
3. Repeat Step 2 until the EndOfExtractions token is generated.
IMOJIE is trained using a cross-entropy loss between the generated output and the gold output.
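The iterative procedure can be sketched as a simple loop. This is not the released implementation: `generate` stands in for the trained Seq2Seq model, `toy_generate` is a canned lookup used only to make the example runnable, and the `max_steps` cap is an added safety assumption.

```python
from typing import Callable, List

END = "EndOfExtractions"
SEP = "[SEP]"

def iterative_decode(sentence: str,
                     generate: Callable[[str], str],
                     max_steps: int = 10) -> List[str]:
    """Generate extractions one at a time, feeding each back into the input."""
    extractions: List[str] = []
    model_input = sentence
    for _ in range(max_steps):  # cap is an assumption, not part of the method
        output = generate(model_input)
        if output == END:
            break
        extractions.append(output)
        model_input = f"{model_input} {SEP} {output}"
    return extractions

def toy_generate(model_input: str) -> str:
    """Stand-in model: answers from a fixed list based on how many
    extractions are already present in the input."""
    canned = ["I; ate; an apple", "I; ate; an orange"]
    done = model_input.count(SEP)
    return canned[done] if done < len(canned) else END

result = iterative_decode("I ate an apple and an orange.", toy_generate)
print(result)  # ['I; ate; an apple', 'I; ate; an orange']
```

Because the stopping decision is itself a model output, the number of extractions varies with the sentence rather than being fixed by a beam size.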

Single Bootstrapping System
To train generative neural models for the task of OpenIE, we need a set of sentence-extraction pairs. It is ideal to curate such a training dataset via human annotation, but that is impractical, considering the scale of training data required for a neural model. We follow Cui et al. (2018) and use bootstrapping: using extractions from a pre-existing OpenIE system as 'silver'-labeled (as distinct from 'gold'-labeled) instances to train the neural model. We first order all extractions in decreasing order of the confidences output by the original system. We then construct training data in IMOJIE's input-output format, assuming that this is the order in which it should produce its extractions.

Multiple Bootstrapping Systems
Different OpenIE systems have diverse quality characteristics. For example, the human-estimated (precision, recall) of OpenIE-4 is (61, 43) while that of ClausIE is (40, 50). Thus, by using their combined extractions as the bootstrapping dataset, we might potentially benefit from the high precision of OpenIE-4 and high recall of ClausIE.
However, simply pooling all extractions would not work, because of the following serious hurdles.
No calibration: Confidence scores assigned by different systems are not calibrated to a comparable scale.
Redundant extractions: Beyond exact duplicates, multiple systems produce similar extractions with low marginal utility.
Wrong extractions: Pooling inevitably pollutes the silver data and can amplify incorrect instances, forcing the downstream open IE system to learn poor-quality extractions.
We solve these problems using a Score-and-Filter framework, shown in Figure 2.
Scoring: All systems are applied on a given sentence, and the pooled set of extractions is scored such that good (correct, informative) extractions generally achieve higher values compared to bad (incorrect) and redundant ones. In principle, this score may be estimated by the generation score from IMOJIE, trained on a single system. In practice, such a system is likely to consider extractions similar to its bootstrapping training data as good, while disregarding extractions of other systems, even though those extractions may also be of high quality. To mitigate this bias, we use an IMOJIE model pre-trained on a random bootstrapping dataset. The random bootstrapping dataset is generated by picking extractions for each sentence randomly from any one of the bootstrapping systems being aggregated. We assign a score to each extraction in the pool based on the confidence value given to it by this IMOJIE (Random) model.
Filtering: We now filter this set of extractions for redundancy. Given the set of ranked extractions in the pool, we wish to select the subset of extractions that have the best confidence scores (assigned by the random-bootstrap model), while having minimum similarity to the other selected extractions.
We model this goal as the selection of an optimal subgraph from a suitably designed complete weighted graph. Each node in the graph corresponds to one extraction in the pool. Every pair of nodes (u, v) are connected by an edge. Every edge has an associated weight R(u, v) signifying the similarity between the two corresponding extractions. Each node u is assigned a score f (u) equal to the confidence given by the random-bootstrap model.
Given this graph $G = (V, E)$ of all pooled extractions of a sentence, we aim to select a subgraph $G' = (V', E')$ with $V' \subseteq V$, such that the most significant extractions are selected, whereas extractions redundant with respect to already-selected ones are discarded. Our objective is

$$\max_{G' \subseteq G} \; \sum_{i=1}^{|V'|} f(u_i) \;-\; \sum_{j=1}^{|V'|-1} \sum_{k=j+1}^{|V'|} R(u_j, u_k)$$

where $u_i$ represents node $i \in V'$. We compute $R(u, v)$ as the ROUGE2 score between the serialized triples represented by nodes $u$ and $v$. We can intuitively understand the first term as the aggregated sum of significance of all selected triples and the second term as the redundancy among these triples.
If $G$ has $n$ nodes, we can pose the above objective as:

$$\max_{x \in \{0,1\}^n} \; x^\top f \;-\; x^\top R\, x$$

where $f \in \mathbb{R}^n$ represents the node scores, i.e., $f_i = f(u_i)$; $R \in \mathbb{R}^{n \times n}$ holds the pairwise similarities, with $R_{jk} = R(u_j, u_k)$ for $j < k$ and $R_{jk} = 0$ otherwise, so that each pair is counted once as in the objective above; and $x$ is the decision vector, with $x_i$ indicating whether a particular node $u_i \in V'$ or not. This is an instance of Quadratic Boolean Programming and is NP-hard, but in our application $n$ is modest enough that this is not a concern. We use the QPBO (Quadratic Pseudo Boolean Optimizer) solver (Rother et al., 2007) to find the optimal $x^*$ and recover $V'$.
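To make the filtering objective tangible, the following sketch maximizes it by exhaustive enumeration over all subsets, which is feasible only for a handful of extractions. It is not the QPBO solver used in the paper. Two further simplifications are assumptions of this example: `bigram_f1` is a rough stand-in for ROUGE2, and the scores and extractions in the usage lines are made up.

```python
from collections import Counter
from itertools import combinations, product
from typing import List, Sequence

def bigram_f1(a: str, b: str) -> float:
    """Simplified stand-in for ROUGE2: F1 over shared word bigrams."""
    def grams(text: str) -> Counter:
        toks = text.lower().split()
        return Counter(zip(toks, toks[1:]))
    ga, gb = grams(a), grams(b)
    overlap = sum((ga & gb).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gb.values())
    recall = overlap / sum(ga.values())
    return 2 * precision * recall / (precision + recall)

def select(extractions: Sequence[str], scores: Sequence[float]) -> List[str]:
    """Brute-force maximiser of (sum of scores) - (sum of pairwise similarity)."""
    n = len(extractions)
    best_x, best_val = (0,) * n, 0.0
    for x in product((0, 1), repeat=n):  # all 2^n candidate subsets
        chosen = [i for i in range(n) if x[i]]
        value = sum(scores[i] for i in chosen) - sum(
            bigram_f1(extractions[j], extractions[k])
            for j, k in combinations(chosen, 2))
        if value > best_val:
            best_x, best_val = x, value
    return [extractions[i] for i in range(n) if best_x[i]]

pool = ["the man ; constantly trembled ; with fear",
        "the man ; trembled ; with fear",
        "a slave ; might fear ; a cruel master"]
selected = select(pool, [0.9, 0.6, 0.7])  # illustrative scores
print(selected)
```

In this toy pool the second extraction is dropped: its score of 0.6 is outweighed by its similarity (about 0.77) to the first, while the third shares no bigrams with either and is kept.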

Training Data Construction
We obtain our training sentences by scraping Wikipedia, because Wikipedia is a comprehensive source of informative text from diverse domains, rich in entities and relations. Using sentences from Wikipedia ensures that our model is not biased towards data from any single domain. We run OpenIE-4, ClausIE and RnnOIE on these sentences to generate a set of OpenIE tuples for every sentence, which are then ranked and filtered using our Score-and-Filter technique. These tuples are further processed to generate training instances in IMOJIE's input-output format.
Each sentence contributes multiple (input, output) pairs for the IMOJIE model. The first training instance contains the sentence itself as input and the first tuple as output, for example, ("I ate an apple and an orange.", "I; ate; an apple"). The next training instance contains the sentence concatenated with the previous tuple as input and the next tuple as output: ("I ate an apple and an orange. [SEP] I; ate; an apple", "I; ate; an orange"). The final training instance generated from this sentence includes all the extractions appended to the sentence as input and the EndOfExtractions token as the output. Every sentence thus gives the seq2seq learner one training instance more than the number of tuples.
While forming these training instances, the tuples are considered in decreasing order of their confidence scores. If some OpenIE system does not provide confidence scores for extracted tuples, then the output order of the tuples may be used.
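The construction of training instances can be sketched as follows. The function name `build_instances` and the confidence values are illustrative assumptions; the input-output format follows the example given above.

```python
from typing import List, Tuple

SEP = "[SEP]"
END = "EndOfExtractions"

def build_instances(sentence: str,
                    scored: List[Tuple[str, float]]) -> List[Tuple[str, str]]:
    """Turn one sentence and its (tuple, confidence) pairs into
    (input, output) training instances."""
    # Highest-confidence tuples come first; sorted() is stable, so ties
    # keep the order in which the tuples were given.
    ordered = [tup for tup, _ in
               sorted(scored, key=lambda pair: pair[1], reverse=True)]
    instances = []
    model_input = sentence
    for tup in ordered:
        instances.append((model_input, tup))
        model_input = f"{model_input} {SEP} {tup}"
    instances.append((model_input, END))  # final instance teaches stopping
    return instances

instances = build_instances(
    "I ate an apple and an orange.",
    [("I; ate; an orange", 0.7), ("I; ate; an apple", 0.9)],  # made-up scores
)
for pair in instances:
    print(pair)
```

Two tuples yield three instances, matching the observation that each sentence contributes one instance more than its number of tuples.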

Dataset and Evaluation Metrics
We use the CaRB data and evaluation framework (Bhardwaj et al., 2019) to evaluate the systems at different confidence thresholds, yielding a precision-recall curve. We identify three important summary metrics from the P-R curve.
Optimal F1: We find the point in the P-R curve corresponding to the largest F1 value and report that. This is the operating point for getting extractions with the best precision-recall trade-off.
AUC: This is the area under the P-R curve. This metric is useful when the downstream application can use the confidence value of the extraction.
Last F1: This is the F1 score computed at the point of zero confidence. This is of importance when we cannot compute the optimal threshold, due to lack of any gold extractions for the domain. Many downstream applications of OpenIE, such as text comprehension (Stanovsky et al., 2015) and sentence similarity estimation (Christensen et al., 2014), use all the extractions output by the OpenIE system. Last F1 is an important measure for such applications.
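A minimal sketch of the three summary metrics is given below, assuming the curve is available as (threshold, precision, recall) points. The trapezoidal rule between the smallest and largest observed recall is an assumption of this sketch; the CaRB scorer may compute the area differently. The example points are made up.

```python
from typing import List, Tuple

def f1(precision: float, recall: float) -> float:
    total = precision + recall
    return 0.0 if total == 0 else 2 * precision * recall / total

def summarize(points: List[Tuple[float, float, float]]) -> Tuple[float, float, float]:
    """points: (confidence threshold, precision, recall).
    Returns (Optimal F1, AUC, Last F1)."""
    optimal_f1 = max(f1(p, r) for _, p, r in points)
    # Last F1: the operating point with the lowest threshold,
    # i.e. every extraction the system produced is kept.
    _, p_last, r_last = min(points, key=lambda pt: pt[0])
    last_f1 = f1(p_last, r_last)
    # AUC: trapezoidal area under precision as a function of recall.
    curve = sorted((r, p) for _, p, r in points)
    auc = sum((r2 - r1) * (p1 + p2) / 2
              for (r1, p1), (r2, p2) in zip(curve, curve[1:]))
    return optimal_f1, auc, last_f1

opt_f1, auc, last_f1 = summarize([(0.9, 0.8, 0.2),
                                  (0.5, 0.6, 0.4),
                                  (0.0, 0.4, 0.5)])
print(round(opt_f1, 3), round(auc, 3), round(last_f1, 3))
```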

Comparison Systems
We compare IMOJIE against several non-neural baselines, including Stanford-IE, OpenIE-4, OpenIE-5, ClausIE, PropS, MinIE, and OLLIE. We also compare against the sequence labeling baselines of RnnOIE and SenseOIE, and the span selection baseline of SpanOIE. The baseline most closely related to our work is the neural generation baseline of CopyAttention. To increase CopyAttention's diversity, we compare against an English version of Logician, which adds coverage attention to a single-decoder model that emits all extractions one after another. We also compare against CopyAttention augmented with diverse beam search (Vijayakumar et al., 2018), which adds a diversity term to the loss function so that new beams have smaller redundancy with respect to all previous beams. Finally, because our model is based on BERT, we reimplement CopyAttention with a BERT encoder; this forms a very strong baseline for our task.

Implementation
We implement IMOJIE in the AllenNLP framework (Gardner et al., 2018). The hyper-parameters include the learning rate for BERT, set to 2 × 10^-5, and the learning rate, hidden dimension, and word embedding dimension of the decoder LSTM, set to (10^-3, 256, 100), respectively. Since neither the model nor the code of CopyAttention (Cui et al., 2018) was available, we implemented it ourselves. Our implementation closely matches their reported scores, achieving (F1, AUC) of (56.4, 47.7) on the OIE2016 benchmark.

Results and Analysis

Performance of Existing Systems
How well do the neural systems perform as compared to the rule-based systems?
Using CaRB evaluation, we find that, contrary to previous papers, neural OpenIE systems are not necessarily better than prior non-neural systems (Table 3). Among the systems under consideration, the best non-neural system reached a Last F1 of 51.5, whereas the best existing neural model could only reach 49.2. Deeper analysis reveals that CopyAttention produces redundant extractions conveying nearly the same information, which CaRB effectively penalizes. RnnOIE performs much better, but suffers from its inability to generate auxiliary verbs and implied prepositions. For example, it can only generate (Trump; President; US) instead of (Trump; is President of; US) from the sentence "US President Trump...". Moreover, it is trained only on a limited number of pseudo-gold extractions and does not take advantage of bootstrapping techniques.

Performance of IMOJIE
How does IMOJIE perform compared to the previous neural and rule-based systems?
In comparison with existing neural and non-neural systems, IMOJIE trained on aggregated bootstrapped data performs the best. It outperforms OpenIE-4, the best existing OpenIE system, by 1.9 F1 pts, 3.8 pts of AUC, and 1.8 pts of Last F1. Qualitatively, we find that it makes fewer mistakes than OpenIE-4, probably because OpenIE-4 accumulates errors from upstream parsing modules (see Table 2).
IMOJIE outperforms CopyAttention by large margins: about 18 Optimal F1 pts and 13 AUC pts. Qualitatively, it outputs non-redundant extractions through the use of its iterative memory (see Table 1), and a variable number of extractions owing to the EndOfExtractions token. It also outperforms CopyAttention with BERT, which is a very strong baseline, by 1.9 Opt. F1 pts, 0.5 AUC pts and 3.7 Last F1 pts. IMOJIE consistently outperforms CopyAttention with BERT over different bootstrapping datasets (see Table 8).
Figure 3 shows that the precision-recall curve of IMOJIE is consistently above that of existing OpenIE systems, emphasizing that IMOJIE is better than them across the different confidence thresholds. We do find that CopyAttention+BERT outputs slightly higher recall at a significant loss of precision (due to its beam search with constant size), which gives it some benefit in the overall AUC.
CaRB evaluation of SpanOIE results in (precision, recall, F1) of (58.9, 40.3, 47.9). SpanOIE sources its training data only from OpenIE-4. To be fair, we compare it against IMOJIE trained only on data from OpenIE-4, which evaluates to (60.4, 46.3, 52.4). Hence, IMOJIE outperforms SpanOIE in both precision and recall.
Attention is typically used to make the model focus on words which are considered important for the task. But the IMOJIE model successfully uses attention to forget certain words, namely those which are already covered. Consider the sentence "He served as the first prime minister of Australia and became a founding justice of the High Court of Australia". Given the previous extraction (He; served; as the first prime minister of Australia), BERT's attention layers figure out that the words 'prime' and 'minister' have already been covered, and thus push the decoder to prioritize 'founding' and 'justice'. Appendix D analyzes the attention patterns of the model when generating the intermediate extraction in the above example and shows that IMOJIE gives less attention to already covered words.

Redundancy
What is the extent of redundancy in IMOJIE when compared to earlier OpenIE systems?
We also investigate other approaches to reduce redundancy in CopyAttention, such as Logician's coverage attention (with both an LSTM and a BERT encoder) as well as diverse beam search. Table 4 reports that both these approaches indeed make significant improvements on top of CopyAttention scores. In particular, qualitative analysis of diverse beam search output reveals that the model gives out different words in different tuples in an effort to be diverse, without considering their correctness. Moreover, since this model uses beam search, it still outputs a fixed number of tuples.
This analysis naturally suggested the IMOJIE (w/o BERT) model, an IMOJIE variation that uses an LSTM encoder instead of BERT. Unfortunately, IMOJIE (w/o BERT) is behind the CopyAttention baseline by 12.1 pts in AUC and 4.4 pts in Last F1. We hypothesize that this is because the LSTM encoder is unable to learn how to capture inter-fact dependencies adequately: the input sequences are too long for effectively training LSTMs. This explains our use of Transformers (BERT) instead of the LSTM encoder to obtain the final form of IMOJIE. With a better encoder, IMOJIE is able to perform up to its potential, giving an improvement of (17.8, 12.7, 19.6) pts in (Optimal F1, AUC, Last F1) over existing seq2seq OpenIE systems.
We further measure two quantifiable metrics of redundancy.
Mean Number of Occurrences (MNO): The average number of tuples in which each output word appears.
Intersection Over Union (IOU): Cardinality of intersection over cardinality of union of words in two tuples, averaged over all pairs of tuples.
These measures were calculated after removing stop words from tuples. Higher values of these measures suggest higher redundancy among the extractions. IMOJIE is significantly better than CopyAttention+BERT, the strongest baseline, on both these measures (Table 7). Interestingly, IMOJIE has a lower redundancy than even the gold triples; this is due to imperfect recall.
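The two redundancy measures can be sketched as below for the extractions of a single sentence. This is one reading of the definitions: MNO counts, for each distinct content word, the number of tuples containing it. The stop-word list is a small illustrative set, not the one used in the experiments.

```python
from itertools import combinations
from typing import List, Set

STOP_WORDS = {"a", "an", "the", "of", "and", "as", "in", "to"}  # illustrative

def content_words(tup: str) -> Set[str]:
    """Lower-cased words of a serialized tuple, minus stop words."""
    words = tup.lower().replace(";", " ").split()
    return {w for w in words if w not in STOP_WORDS}

def mno(tuples: List[str]) -> float:
    """Mean number of tuples in which each distinct word appears."""
    sets = [content_words(t) for t in tuples]
    vocab = set().union(*sets)
    if not vocab:
        return 0.0
    return sum(sum(w in s for s in sets) for w in vocab) / len(vocab)

def iou(tuples: List[str]) -> float:
    """Word-set intersection over union, averaged over all tuple pairs."""
    sets = [content_words(t) for t in tuples]
    pairs = [(a, b) for a, b in combinations(sets, 2) if a | b]
    if not pairs:
        return 0.0
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

first = "he; served; as the first prime minister of australia"
redundant = [first, "he; served; as prime minister"]
diverse = [first, "he; became; a founding justice of the high court of australia"]
print(mno(redundant), iou(redundant))
print(mno(diverse), iou(diverse))
```

The redundant pair scores higher on both measures than the diverse pair, which is the behaviour the metrics are meant to capture.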

The Value of Iterative Memory
To what extent does the IMOJIE style of generating tuples improve performance, over and above the use of BERT?
We add BERT to the CopyAttention model to create another baseline for a fair comparison against the IMOJIE model. When trained only on OpenIE-4, IMOJIE continues to outperform the CopyAttention+BERT baseline by (1.6, 0.3, 2.8) pts in (Optimal F1, AUC, Last F1), which provides strong evidence that the improvements are not solely by virtue of using a better encoder. We repeat this experiment over different (single) bootstrapping datasets. Table 8 shows that IMOJIE consistently outperforms the CopyAttention+BERT model.
We also note that the order in which the extractions are presented to the model (during training) is indeed important. On training IMOJIE using a randomized order of extractions, we find a decrease of 1.6 pts in AUC (averaged over 3 runs).

The Value of Score-and-Filter
To what extent does the scoring and filtering approach lead to improvement in performance?
IMOJIE aggregates extractions from multiple systems through the scoring and filtering approach. It uses extractions from OpenIE-4, ClausIE (202K) and RnnOIE (230K) to generate a set of 215K tuples. Table 6 reports that IMOJIE does not perform well when this aggregation mechanism is turned off. We also try two supervised approaches to aggregation, by utilizing the gold extractions from CaRB's dev set.
• Extraction Filtering: For every sentence-tuple pair, we use a binary classifier that decides whether or not to consider that extraction. The input features of the classifier are the [CLS] embeddings generated from BERT after processing the concatenated sentence and extraction. The classifier is trained over tuples from CaRB's dev set.
• Sentence Filtering: We use an IMOJIE model trained on OpenIE-4 to score all the tuples. Then, a Multilayer Perceptron (MLP) predicts a confidence threshold to perform the filtering. Only extractions with scores greater than this threshold will be considered. The input features of the MLP include the length of the sentence, IMOJIE (OpenIE-4) scores, and GPT (Radford et al., 2018) scores of each extraction. This MLP is trained over sentences from CaRB's dev set and the gold optimal confidence threshold calculated by CaRB.
We observe that Extraction Filtering and Sentence Filtering are better than no filtering by 7.5 and 11.2 pts in Last F1, respectively, but worse at Opt. F1 and AUC. We hypothesise that this is because the training data for the MLP (640 sentences in CaRB's dev set) is not sufficient and the features given to it are not sufficiently discriminative. Thereby, we see the value of our unsupervised Score-and-Filter that improves the performance of IMOJIE by (3.8, 15.9)
Computational Cost: The training times for CopyAttention+BERT, IMOJIE (OpenIE-4) and IMOJIE (including the time taken for Score-and-Filter) are 5 hrs, 13 hrs and 30 hrs respectively. This shows that the performance improvements come with an increased computational cost, and we leave it to future work to improve the computational efficiency of these models.

Error Analysis
We randomly selected 50 sentences from the CaRB validation set, considering only sentences where at least one extraction shows an error. We identified four major phenomena contributing to errors in the IMOJIE model:
(1) Missing information: 66% of the sentences have at least one of the relations or arguments or both missing in predicted extractions, which are present in gold extractions. This leads to incomplete information.
(2) Incorrect demarcation: Extractions in 60% of the sentences have the separator between relation and argument identified at the wrong place.
(4) Grammatically incorrect extractions: 38% of the sentences have a grammatically incorrect extraction (when serialized into a sentence).
Additionally, we observe 12% of the sentences still suffering from redundant extractions and 4% with miscellaneous errors.

Conclusions and Discussion
We propose IMOJIE for the task of OpenIE. IMOJIE significantly improves upon the existing OpenIE systems in all three metrics, Optimal F1, AUC, and Last F1, establishing a new state-of-the-art system. Unlike existing neural OpenIE systems, IMOJIE produces non-redundant as well as a variable number of OpenIE tuples depending on the sentence, by iteratively generating them conditioned on the previous tuples. Additionally, we contribute a novel technique to combine multiple OpenIE datasets to create a high-quality dataset in a completely unsupervised manner. We release the training data, code, and the pretrained models.
IMOJIE presents a novel way of using attention for text generation. Bahdanau et al. (2015) showed that attending over the input words is important for text generation. See et al. (2017) showed that using a coverage loss to track the attention over the decoded words improves the quality of the generated output. We add to this narrative by showing that deep inter-attention between the input and the partially-decoded words (achieved by adding previous output in the input) creates a better representation for iterative generation of triples. This general observation may be of independent interest beyond OpenIE, such as in text summarization.

IMOJIE: Iterative Memory-Based Joint Open Information Extraction (Supplementary Material)

A Performance with varying sentence lengths
In this experiment, we measure the performance of the baseline and our model by testing on sentences of varying lengths. We partition the original CaRB test data into 6 datasets with sentences of lengths (9-16 words), (17-24 words), (25-32 words), (33-40 words), (41-48 words) and (49-62 words) respectively. Note that the minimum and maximum sentence lengths are 9 and 62 respectively. We measure the Optimal F1 score of both CopyAttention + BERT and IMOJIE (bootstrapped on OpenIE-4) on these partitions, as depicted in Figure 4. We observe that performance deteriorates with increasing sentence length, as expected. Also, for each of the partitions, IMOJIE performs marginally better than CopyAttention + BERT.

B Measuring Performance on Varying Beam Size
We perform inference with the CopyAttention with BERT model on the CaRB test set with beam sizes of 1, 3, 5, 7, and 11. We observe in Figure 5 that AUC increases with increasing beam size. A system can inflate its AUC by adding several low-confidence tuples to its predicted set of tuples. This adds low-precision, high-recall points to the precision-recall curve of the system, leading to a higher AUC. On the other hand, Last F1 drops at very high beam sizes, thereby capturing the decline in performance. Optimal F1 saturates at high beam sizes since its calculation ignores the extractions below the optimal confidence threshold. This analysis also shows the importance of using Last F1 as a metric for measuring the performance of OpenIE systems.

Figure 5: Measuring performance of CopyAttention with BERT model upon changing the beam size.
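The effect described above can be reproduced on a toy curve. The (recall, precision) points below are made up, and the trapezoidal area is an assumed stand-in for the benchmark's AUC computation; the sketch only illustrates the direction in which each metric moves when a low-precision, high-recall point is appended.

```python
from typing import List, Tuple

Curve = List[Tuple[float, float]]  # (recall, precision) points

def f1(precision: float, recall: float) -> float:
    total = precision + recall
    return 0.0 if total == 0 else 2 * precision * recall / total

def area(curve: Curve) -> float:
    pts = sorted(curve)
    return sum((r2 - r1) * (p1 + p2) / 2
               for (r1, p1), (r2, p2) in zip(pts, pts[1:]))

def optimal_f1(curve: Curve) -> float:
    return max(f1(p, r) for r, p in curve)

def last_f1(curve: Curve) -> float:
    r, p = max(curve)  # highest recall = all extractions kept
    return f1(p, r)

small_beam = [(0.2, 0.8), (0.4, 0.6)]
large_beam = small_beam + [(0.6, 0.2)]  # extra low-confidence tuples

for name, curve in (("small", small_beam), ("large", large_beam)):
    print(name, round(area(curve), 2), round(optimal_f1(curve), 2),
          round(last_f1(curve), 2))
```

On these points the area grows, Optimal F1 is unchanged, and Last F1 falls, mirroring the three trends reported for increasing beam size.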

C Evaluation on other datasets
We use sentences from other benchmarks with the CaRB evaluation policy and we find similar improvements, as shown in Table 9. IMOJIE consistently outperforms our strongest baseline, CopyAttention with BERT, over different test sets. This suggests that IMOJIE's gains are not tied to a single domain.

D Visualizing Attention
Attention has been used in a wide variety of settings to help the model learn to focus on important things (Bahdanau et al., 2015; Xu et al., 2015; Lu et al., 2019). However, the IMOJIE model is able to use attention to understand which words have already been generated, to focus on remaining words. In order to understand how the model achieves this, we visualize the learnt attention weights. There are two attention weights of importance: the learnt attention inside the BERT encoder and the attention between the decoder and encoder. We use BertViz (Vig, 2019) to visualize the attention inside BERT. We consider the following sentence as the running example: "he served as the first prime minister of australia and became a founding justice of the high court of australia". We visualize the attention after producing the first extraction, "he; served; as the first prime minister of australia". Intuitively, we understand that the model must focus on the words "founding" and "justice" in order to generate the next extraction, "he; became; a founding justice of the high court of australia". In Figure 8 and Figure 9 (where the left-hand column contains the words which are used to attend while the right-hand column contains the words which are attended over), we see that the words "prime" and "minister" of the original sentence have high attention over the same words in the first extraction. But the attention for "founding" and "justice" is limited to the original sentence.
Based on these patterns, the decoder is able to give high attention to the words "founding" and "justice" (as shown in Figure 10), in order to successfully generate the second extraction "he; became; a founding justice of the high court of australia".

Table 9: Evaluation on other datasets.
Model                  Wire57                Penn                 Web
CopyAttention + BERT   45.60, 27.70, 39.70   18.20, 7.9, 12.40    30.10, 18.00, 14.60
IMOJIE                 46.20, 26.60, 46.20   20.20, 8.70, 15.50   30.40, 15.50, 26.40