Improving Open Information Extraction via Iterative Rank-Aware Learning

Open information extraction (IE) is the task of extracting open-domain assertions from natural language sentences. A key step in open IE is confidence modeling, ranking the extractions based on their estimated quality to adjust precision and recall of extracted assertions. We found that the extraction likelihood, a confidence measure used by current supervised open IE systems, is not well calibrated when comparing the quality of assertions extracted from different sentences. We propose an additional binary classification loss to calibrate the likelihood to make it more globally comparable, and an iterative learning process, where extractions generated by the open IE model are incrementally included as training samples to help the model learn from trial and error. Experiments on OIE2016 demonstrate the effectiveness of our method. Code and data are available at https://github.com/jzbjyb/oie_rank.

A key step in open IE is confidence modeling, which ranks a list of candidate extractions based on their estimated quality. This is important for downstream tasks, which rely on tradeoffs between the precision and recall of extracted assertions. For instance, an open IE-powered medical question answering (QA) system may require higher precision from its assertions (and consequently accept lower recall) than QA systems for other domains. For supervised open IE systems, the confidence score of an assertion is typically computed from its extraction likelihood given by the model (Stanovsky et al., 2018; Sun et al., 2018). However, we observe that this often yields sub-optimal ranking results, with incorrect extractions of one sentence having higher likelihood than correct extractions of another sentence. We hypothesize that this is due to a disconnect between training and test-time objectives. Specifically, the system is trained solely to raise the likelihood of gold-standard extractions, and during training the model is not aware of its test-time behavior of ranking a set of system-generated assertions across sentences that potentially include incorrect extractions.
To calibrate open IE confidences and make them more globally comparable across different sentences, we propose an iterative rank-aware learning approach, as outlined in Fig. 1. Given extractions generated by the model as training samples, we use a binary classification loss to explicitly increase the confidences of correct extractions and decrease those of incorrect ones. Without adding model components, this training paradigm naturally leads to a better open IE model, whose extractions can in turn be included as training samples. We further propose an iterative learning procedure that gradually improves the model by incrementally adding extractions to the training data. Experiments on the OIE2016 dataset indicate that our method significantly outperforms both neural and non-neural models.

Neural Models for Open IE
We briefly revisit the formulation of open IE and the neural network model used in our paper.

Problem Formulation
Given a sentence $s = (w_1, w_2, \ldots, w_n)$, the goal of open IE is to extract assertions in the form of tuples $r = (p, a_1, a_2, \ldots, a_m)$, composed of a single predicate and $m$ arguments. Generally, the components of $r$ need not be contiguous, but to simplify the problem we assume they are contiguous spans of words from $s$ with no overlap between them.
Methods to solve this problem have recently been formulated as sequence-to-sequence generation (Cui et al., 2018; Sun et al., 2018; Duh et al., 2017) or sequence labeling (Stanovsky et al., 2018; Jia et al., 2018). We adopt the second formulation because it is simple and can take advantage of the fact that assertions only consist of words from the sentence. Within this framework, an assertion $r$ can be mapped to a unique BIO (Stanovsky et al., 2018) label sequence $y$ by assigning $O$ to the words not contained in $r$, $B_p$/$I_p$ to the words in $p$, and $B_{a_i}$/$I_{a_i}$ to the words in $a_i$, depending on whether the word is at the beginning or inside of the span.
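The mapping from an assertion to a BIO label sequence can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `assertion_to_bio`, the label strings (`B-P`, `I-A1`, ...), the half-open token spans, and the example sentence are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive (assumed convention)


def assertion_to_bio(n_tokens: int, predicate: Span, arguments: List[Span]) -> List[str]:
    """Map one assertion (predicate span + argument spans) to BIO labels."""
    labels = ["O"] * n_tokens  # words outside the assertion keep the O label
    spans: Dict[str, Span] = {"P": predicate}
    for i, arg in enumerate(arguments):
        spans[f"A{i + 1}"] = arg
    for name, (start, end) in spans.items():
        for t in range(start, end):
            if labels[t] != "O":
                raise ValueError("spans must not overlap")
            # B- marks the first word of a span, I- every following word
            labels[t] = ("B-" if t == start else "I-") + name
    return labels


# Hypothetical example sentence, used only for illustration.
tokens = "Barack Obama was born in Hawaii".split()
labels = assertion_to_bio(len(tokens), predicate=(2, 4), arguments=[(0, 2), (4, 6)])
print(list(zip(tokens, labels)))
```

Because the spans are contiguous and non-overlapping, the label sequence determines the assertion uniquely, which is what lets the extraction task be treated as sequence labeling.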
The label prediction $\hat{y}$ is made by the model given a sentence associated with a predicate of interest $(s, v)$. At test time, we first identify verbs in the sentence as candidate predicates. Each sentence/predicate pair is fed to the model and extractions are generated from the label sequence.

Model Architecture and Decoding
Our training method in § 3 could potentially be used with any probabilistic open IE model, since we make no assumptions about the model and only the likelihood of the extraction is required for iterative rank-aware learning. As a concrete instantiation in our experiments, we use RnnOIE (Stanovsky et al., 2018; He et al., 2017), a stacked BiLSTM with highway connections (Zhang et al., 2016; Srivastava et al., 2015) and recurrent dropout (Gal and Ghahramani, 2016). The input of the model is the concatenation of a word embedding and another embedding indicating whether the word is the predicate:

$$x_t = [W_{\text{emb}}(w_t),\ W_{\text{mask}}(w_t = v)].$$

The probability of the label at each position is calculated independently using a softmax function:

$$P_\theta(y_t \mid s, v) \propto \exp(W_{\text{label}} h_t + b_{\text{label}}),$$

where $h_t$ is the hidden state of the last layer and $W_{\text{label}}$, $b_{\text{label}}$ are the parameters of the output layer. At decoding time, we use the Viterbi algorithm to reject invalid label transitions (He et al., 2017), such as $B_{a_2}$ followed by $I_{a_1}$. We use the average log probability of the label sequence (Sun et al., 2018) as its confidence:

$$c_\theta(s, v, \hat{y}) = \frac{1}{|s|} \sum_{t=1}^{|s|} \log P_\theta(\hat{y}_t \mid s, v). \tag{1}$$

The probability is trained with maximum likelihood estimation (MLE) of the gold extractions. This formulation lacks an explicit concept of cross-sentence comparison, and thus incorrect extractions of one sentence could have higher confidence than correct extractions of another sentence.
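To make the decoding step concrete, the sketch below runs Viterbi over per-position label distributions while forbidding invalid BIO transitions. It is an illustration under assumptions rather than the RnnOIE implementation: the label strings, the `allowed` rule (an `I-X` label may only follow `B-X` or `I-X`), and the toy probabilities are all hypothetical.

```python
import math
from typing import Dict, List

NEG_INF = float("-inf")


def allowed(prev: str, cur: str) -> bool:
    """BIO constraint: I-X may only follow B-X or I-X ("" marks the sentence start)."""
    if cur.startswith("I-"):
        return prev in ("B-" + cur[2:], "I-" + cur[2:])
    return True


def constrained_viterbi(log_probs: List[Dict[str, float]]) -> List[str]:
    """Highest-scoring label sequence that contains no invalid transition."""
    labels = list(log_probs[0])
    # best[t][y]: score of the best valid prefix that ends with label y at position t
    best = [{y: (log_probs[0][y] if allowed("", y) else NEG_INF) for y in labels}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(log_probs)):
        scores, pointers = {}, {}
        for cur in labels:
            score, prev = max((best[-1][p], p) for p in labels if allowed(p, cur))
            scores[cur] = score + log_probs[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    y = max(labels, key=lambda label: best[-1][label])
    path = [y]
    for pointers in reversed(back):
        y = pointers[y]
        path.append(y)
    return path[::-1]


def to_log(dist: Dict[str, float]) -> Dict[str, float]:
    return {y: math.log(p) for y, p in dist.items()}


# Toy two-word example where independent argmax yields B-A2 followed by I-A1.
log_probs = [
    to_log({"O": 0.05, "B-A1": 0.30, "I-A1": 0.025, "B-A2": 0.60, "I-A2": 0.025}),
    to_log({"O": 0.10, "B-A1": 0.05, "I-A1": 0.60, "B-A2": 0.05, "I-A2": 0.20}),
]
greedy = [max(d, key=d.get) for d in log_probs]
decoded = constrained_viterbi(log_probs)
print(greedy, decoded)
```

In the toy input, picking the most probable label at each position independently produces an invalid sequence, while the constrained search returns the best sequence among the valid ones.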

Iterative Rank-Aware Learning
In this section, we describe our proposed binary classification loss and iterative learning procedure.

Binary Classification Loss
To alleviate the problem of incomparable confidences across sentences, we propose a simple binary classification loss to calibrate confidences to be globally comparable. Given a model $\theta$ trained with MLE, beam search is performed to generate the assertions with the highest probabilities for each predicate. Assertions are annotated as either positive or negative with respect to the gold standard, and are used as training samples to minimize the hinge loss:

$$\hat{L}(\theta) = \mathbb{E}_{s \in D,\ (v, \hat{y}) \in g_\theta(s)} \max\bigl(0,\ 1 - t \cdot c_\theta(s, v, \hat{y})\bigr), \tag{2}$$

where $D$ is the training sentence collection, $g_\theta$ represents the candidate generation process, and $t \in \{1, -1\}$ is the binary annotation. $c_\theta(s, v, \hat{y})$ is the confidence score calculated as the average log probability of the label sequence. The binary classification loss distinguishes positive extractions from negative ones generated across different sentences, potentially leading to a more reliable confidence measure and better ranking performance.
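The interaction between the confidence score and the hinge loss can be illustrated in a few lines. This is a sketch under assumptions, not the released implementation: the function names, the margin of 1, and the toy per-label probabilities are hypothetical, and a real system would compute the loss on differentiable model outputs.

```python
import math
from typing import List, Tuple

Sample = Tuple[List[float], int]  # (probabilities of the predicted labels, t in {1, -1})


def confidence(label_probs: List[float]) -> float:
    """Average log probability of the predicted label sequence."""
    return sum(math.log(p) for p in label_probs) / len(label_probs)


def hinge_loss(samples: List[Sample], margin: float = 1.0) -> float:
    """Mean of max(0, margin - t * confidence) over annotated extractions."""
    losses = [max(0.0, margin - t * confidence(probs)) for probs, t in samples]
    return sum(losses) / len(losses)


# A correct extraction with low likelihood from one sentence and an incorrect
# extraction with high likelihood from another: the miscalibrated case.
before = hinge_loss([([0.5, 0.5], 1), ([0.9, 0.9], -1)])
# After calibration the correct extraction is scored higher, the incorrect one lower.
after = hinge_loss([([0.9, 0.9], 1), ([0.3, 0.3], -1)])
print(round(before, 4), round(after, 4))
```

Since the samples in one batch come from different sentences, lowering this loss pushes confidences toward a scale on which correct extractions from any sentence score above incorrect ones from any other.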

Iterative Learning
Compared to using external models for confidence modeling, an advantage of the proposed method is that the base model does not change: the binary classification loss just provides additional supervision. Ideally, the model resulting from one round of training becomes better not only at confidence modeling but also at assertion generation, suggesting that its higher-quality extractions can be added as training samples to continue the training process iteratively. The resulting iterative learning procedure (Alg. 1) incrementally includes extractions generated by the current model as training samples, optimizes the binary classification loss on them to obtain a better model, and repeats until convergence.
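The iterative procedure described above can be sketched as a short loop. Alg. 1 itself is not reproduced in this text, so the following is an assumption-laden outline: the callables `generate`, `is_correct`, and `fine_tune` are hypothetical placeholders for beam-search generation, gold-standard annotation, and hinge-loss training, and a fixed iteration count stands in for a convergence check. The toy model at the bottom exists only to make the sketch runnable.

```python
from typing import Callable, Dict, List, Tuple, TypeVar

Model = TypeVar("Model")
Sample = Tuple[str, str, int]  # (sentence, extraction, t) with t in {1, -1}


def iterative_rank_aware_learning(
    model: Model,
    sentences: List[str],
    generate: Callable[[Model, str], List[str]],
    is_correct: Callable[[str, str], bool],
    fine_tune: Callable[[Model, List[Sample]], Model],
    n_iterations: int,
) -> Model:
    samples: List[Sample] = []
    for _ in range(n_iterations):
        # 1) Let the current model generate extractions and annotate them.
        for s in sentences:
            for extraction in generate(model, s):
                sample = (s, extraction, 1 if is_correct(s, extraction) else -1)
                if sample not in samples:  # accumulate across iterations
                    samples.append(sample)
        # 2) Optimize the binary classification loss on all samples so far.
        model = fine_tune(model, samples)
    return model


# Toy stand-ins: the "model" is a score table over a fixed candidate pool.
POOL = {"s1": ["a", "b", "c"]}
GOLD = {"s1": {"c"}}


def toy_generate(model: Dict[str, float], s: str) -> List[str]:
    return sorted(POOL[s], key=lambda e: -model[e])[:2]  # beam of size 2


def toy_fine_tune(model: Dict[str, float], samples: List[Sample]) -> Dict[str, float]:
    updated = dict(model)
    for _, extraction, t in samples:
        updated[extraction] += 1.0 * t  # raise positives, lower negatives
    return updated


initial = {"a": 3.0, "b": 2.0, "c": 1.5}
final = iterative_rank_aware_learning(
    initial, ["s1"], toy_generate, lambda s, e: e in GOLD[s], toy_fine_tune, n_iterations=3
)
print(toy_generate(initial, "s1"), toy_generate(final, "s1"))
```

In the toy run the correct extraction is outside the initial beam and only enters the training samples once earlier errors have been pushed down, which mirrors the trial-and-error behavior the procedure relies on.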

Experimental Settings
Dataset
We use the OIE2016 dataset, which only contains verbal predicates, to evaluate our method. OIE2016 is automatically generated from the QA-SRL dataset (He et al., 2015). To reduce noise, we remove extractions without predicates, with fewer than two arguments, or with multiple instances of an argument. The statistics of the resulting dataset are summarized in Tab. 1.

Evaluation Metrics
We follow previously described evaluation metrics: area under the precision-recall curve (AUC) and F1 score. An extraction is judged as correct if the predicate and arguments include the syntactic heads of their gold-standard counterparts (footnote 4).

Baselines
We compare our method with competitive neural and non-neural models, including RnnOIE (Stanovsky et al., 2018), OpenIE4 (footnote 5), ClausIE (Corro and Gemulla, 2013), and PropS.

Implementation Details
Our implementation is based on AllenNLP, adding the binary classification loss function to its implementation of RnnOIE (footnote 6). The network consists of 4 BiLSTM layers (2 forward and 2 backward) with 64-dimensional hidden units. ELMo is used to map words into contextualized embeddings, which are concatenated with a 100-dimensional predicate indicator embedding. The recurrent dropout probability is set to 0.1. Adadelta (Zeiler, 2012) with $\epsilon = 10^{-6}$ and $\rho = 0.95$ and mini-batches of size 80 are used to optimize the parameters. The beam size is 5.
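The two evaluation metrics named above, AUC of the precision-recall curve and F1, can be computed from a confidence-ranked list of extractions roughly as follows. This is a simplified sketch, not the benchmark's scoring script: the step-wise integration, the choice of reporting the best F1 over all thresholds, and the toy scores are assumptions.

```python
from typing import List, Tuple


def pr_curve(scored: List[Tuple[float, bool]], n_gold: int) -> List[Tuple[float, float]]:
    """(recall, precision) after each extraction, in descending confidence order."""
    points, n_correct = [], 0
    ranked = sorted(scored, key=lambda x: -x[0])
    for k, (_, correct) in enumerate(ranked, start=1):
        n_correct += correct
        points.append((n_correct / n_gold, n_correct / k))
    return points


def auc_and_best_f1(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    auc, prev_recall = 0.0, 0.0
    for recall, precision in points:
        auc += (recall - prev_recall) * precision  # step-wise (rectangle) integration
        prev_recall = recall
    best_f1 = max((2 * p * r / (p + r) for r, p in points if p + r > 0), default=0.0)
    return auc, best_f1


# Four extractions (confidence, is_correct) against four gold assertions.
base = [(-0.1, True), (-0.2, False), (-0.5, True), (-0.9, False)]
# The same extractions after reranking: both correct ones now score highest.
reranked = [(-0.1, True), (-0.6, False), (-0.3, True), (-0.9, False)]
auc_base, f1_base = auc_and_best_f1(pr_curve(base, n_gold=4))
auc_reranked, f1_reranked = auc_and_best_f1(pr_curve(reranked, n_gold=4))
print(auc_base, auc_reranked)
```

The example keeps the set of extractions fixed and changes only their confidences, so the increase in AUC comes purely from better ranking, which is the effect the reranking experiments measure.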

Evaluation Results
Tab. 4 lists the evaluation results. Our base model (RnnOIE, § 2) performs better than non-neural systems, confirming the advantage of supervised training under the sequence labeling setting. To test whether the binary classification loss (Eq. 2, § 3) yields better-calibrated confidence, we perform one round of fine-tuning of the base model with the hinge loss (+Binary loss in Tab. 4). We show both the results of using the confidence (Eq. 1) of the fine-tuned model to rerank the extractions of the base model (Rerank Only), and the end-to-end performance of the fine-tuned model in assertion generation (Generate). We found both settings lead to improved performance compared to the base model, which demonstrates that calibrating confidence using the binary classification loss can improve the performance of both reranking and assertion generation. Finally, our proposed iterative learning approach (Alg. 1, § 3) significantly outperforms non-iterative settings.

Footnote 4: The absolute performance reported in our paper is much lower than in the original paper because the authors use a more lenient lexical overlap metric in their released code: https://github.com/gabrielStanovsky/oie-benchmark.
Footnote 5: https://github.com/dair-iitd/OpenIE-standalone
Footnote 6: https://allennlp.org/models#open-information-extraction

Tab. 2 (columns: sentence, old rank, new rank, label)
old rank 3, new rank 1: A CEN forms an important but small part of a Local Strategic Partnership.
old rank 2, new rank 2: An animal that cares for its young but shows no other sociality traits is said to be "subsocial".
old rank 1, new rank 3: A casting director at the time told Scott that he had wished that he'd met him a week before; he was casting for the "G.I. Joe" cartoon.
A motorcycle speedway long-track meeting, one of the few held in the UK, was staged at Ammanford.
We also investigate the performance of our iterative learning algorithm with respect to the number of iterations in Fig. 2. The model obtained at each iteration is used both to rerank the extractions generated by the previous model and to generate new extractions. We also report results of using only positive samples for optimization. We observe that the AUC and F1 of both reranking and generation increase together for the first 6 iterations and converge after that, which demonstrates the effectiveness of iterative training. The best performing iteration achieves an AUC of 0.125 and an F1 of 0.315, outperforming all the baselines by a large margin. Meanwhile, using both positive and negative samples consistently outperforms using only positive samples, which indicates the necessity of exposure to the errors made by the system.

System AUC F1
Non-neural Systems
PropS .006 .065
ClausIE (Corro and Gemulla, 2013) .

Case Study
Tab. 2 compares extractions from RnnOIE before and after reranking. After reranking, the order is consistent with the annotation, showing the additional loss function's efficacy in calibrating the confidences; this is particularly common in extractions with long arguments. Tab. 3 shows a positive extraction discovered after iterative training (first example) and a wrong extraction that disappears (second example), which shows that the model also becomes better at assertion generation.
Error Analysis
Why is the performance still relatively low? To answer this question, we randomly sample 50 extractions generated at the best performing iteration and conduct an error analysis. To count as correct, an extraction must have exactly the same number and order of arguments as the ground truth and must include the syntactic heads, which is challenging considering that the OIE2016 dataset has complex syntactic structures and multiple arguments per predicate. We classify the errors into three categories and summarize their proportions in Tab. 5. "Overgenerated predicate" covers predicates that are extracted but not included in the ground truth, which happens because all verbs are used as candidate predicates. An effective mechanism should be designed to reject useless candidates. "Wrong argument" is where extracted arguments do not coincide with the ground truth, which is mainly caused by merging multiple ground-truth arguments into one. "Missing argument" is where the model fails to recognize arguments. These two errors usually happen when the structure of the sentence is complicated and coreference is involved. More linguistic information should be introduced to solve these problems.

Conclusion
We propose a binary classification loss function to calibrate confidences in open IE. Iteratively optimizing the loss function enables the model to incrementally learn from trial and error, yielding substantial improvement. An error analysis is performed to shed light on possible future directions.