Syntactic and Semantic-driven Learning for Open Information Extraction

One of the biggest bottlenecks in building accurate, high-coverage neural open IE systems is the need for large labelled corpora. The diversity of open domain corpora and the variety of natural language expressions further exacerbate this problem. In this paper, we propose a syntactic and semantic-driven learning approach, which can learn neural open IE models without any human-labelled data by leveraging syntactic and semantic knowledge as noisier, higher-level supervision. Specifically, we first employ syntactic patterns as data labelling functions and pretrain a base model using the generated labels. Then we propose a syntactic and semantic-driven reinforcement learning algorithm, which can effectively generalize the base model to open situations with high accuracy. Experimental results show that our approach significantly outperforms its supervised counterparts, and can even achieve performance competitive with the supervised state-of-the-art (SoA) model.


Introduction
Open information extraction (Open IE) aims to extract open-domain textual tuples consisting of a predicate and a set of arguments from massive and heterogeneous corpora (Sekine, 2006; Banko et al., 2007). For example, a system will extract the tuple (Parragon; operates; more than 35 markets) from the sentence "Parragon operates more than 35 markets and has 10 offices.". In contrast to traditional IE, open IE is completely domain-independent and does not require predetermined relation types.
Neural open IE systems, unfortunately, rely on large labelled corpora to achieve good performance, which are often expensive and labour-intensive to obtain. Furthermore, open IE needs to extract relations of unlimited types from open domain corpora, which further exacerbates the need for labelled data. The labelled data bottleneck has therefore become a major obstacle for neural open IE.

To resolve the labelled data bottleneck, this paper proposes a syntactic and semantic-driven learning approach, which can learn neural open IE models without any human-labelled data by leveraging syntactic and semantic knowledge as noisier, higher-level supervision. The motivation of our method is that, although tuple extraction is a hard task, its inverse problem, tuple assessment, is easier to resolve by exploiting the syntactic regularities of relation expressions and the semantic consistency between a tuple and its original sentence. For example, Figure 2 shows that the ARG1 "Parragon" and the ARG2 "more than 35 markets" follow the nsubj and dobj dependency structures, respectively. Meanwhile, the extracted tuple (Parragon; operates; more than 35 markets) has a high semantic similarity with its original sentence "Parragon operates more than 35 markets and has 10 offices.". We found that the syntactic regularities can be effectively captured using syntactic rules, and the semantic consistency can be effectively modelled using recent powerful pre-trained models such as BERT (Devlin et al., 2019).
Based on the above observations, we propose two learning strategies to exploit syntactic and semantic knowledge for model learning. Figure 1 illustrates the framework of our method. Firstly, syntactic open IE patterns are used as data labelling functions, and a base model is pretrained using the noisy training corpus generated by these labelling functions. Secondly, because the pattern-based labels are often noisy and of limited coverage, we further propose a reinforcement learning algorithm with syntactic and semantic-driven reward functions, which can effectively generalize the base model to open situations with high accuracy. These two strategies together ensure the effective learning of open IE models: the data labelling functions can pretrain a reasonable initial model so that the RL algorithm can optimize the model more effectively; although the pattern-based labels are often noisy and of limited coverage, the RL algorithm can generalize the model to open situations with high accuracy.
We conducted experiments on three open IE benchmarks: OIE2016, WEB, and NYT (Mesquita et al., 2013). Experimental results show that the proposed framework significantly outperforms its supervised counterparts, and can even achieve performance competitive with the supervised SoA approach.

The main contributions of this paper are:

• We propose a syntactic and semantic-driven learning algorithm which can leverage syntactic and semantic knowledge as noisier, higher-level supervision and learn neural open IE models without any human-labelled data.
• We design two effective learning strategies for exploiting syntactic and semantic knowledge as supervision: one uses it in data labelling functions and the other in reward functions for RL. Experiments show that the two strategies are effective and complement each other.
• Because the labelled data bottleneck is common in NLP tasks, we believe our syntactic and semantic-driven learning algorithm can also motivate the learning of other NLP models, such as event extraction.

Syntactic and Semantic-driven Learning for Open IE
In this section, we describe how to learn neural open IE models without any human-labelled data. Two strategies are proposed to exploit syntactic and semantic knowledge as noisier, higher-level supervision. Firstly, syntactic patterns are used as data labelling functions for heuristically labelling a training corpus. Secondly, the syntactic and semantic coherence scores between the extracted tuples and their original sentences are used as reward functions for reinforcement learning. These two strategies together ensure the effective learning of open IE systems: 1) although the labels generated by syntactic patterns are noisy and of limited coverage, they can pretrain a reasonable initial model; 2) starting from the pretrained model, the syntactic and semantic-based reward functions provide an effective way to generalize our model to open situations. (Our source code and experimental datasets are openly available at https://github.com/TangJiaLong/SSD-OpenIE.)
In the following, we first introduce the neural networks used for open IE. Then we describe how to pretrain a base open IE model using syntactic patterns as data labelling functions. Finally, we generalize the base model using reinforcement learning with syntactic and semantic-driven rewards.

Neural Open IE Model
This paper uses the RnnOIE neural network, which has shown both simplicity and effectiveness for open IE (Stanovsky et al., 2018). Note, however, that our framework is not specialized to RnnOIE and can be used to train any neural open IE model.
RnnOIE formulates open IE as a sequence labelling task.
Given a sentence S = (w_1, w_2, ..., w_m), RnnOIE will first identify all verbs in S as predicates, such as "operates" and "has" for "Parragon operates more than 35 markets and has 10 offices.". For each predicate p, RnnOIE will: 1) first embed each word w_i as x_i = [e_i; I(w_i = p)], where e_i is w_i's word embedding obtained from the SoA pre-trained model BERT (Devlin et al., 2019), and I(w_i = p) is an indicator vector which indicates whether w_i is p; 2) then obtain contextual word representations using a stacked BiLSTM with highway connections (Srivastava et al., 2015; Zhang et al., 2016): H = (h_1, h_2, ..., h_m) = BiLSTM(x_1, x_2, ..., x_m); 3) predict the probability of assigning label y_i to a word w_i using a fully connected feedforward classifier: P(ŷ_i | S, p, w_i) = softmax(W h_i + b); 4) finally decode the full label sequence Ŷ using a beam-search algorithm, e.g., RnnOIE will decode the label sequence [B-ARG1, B-P, B-ARG2, I-ARG2, ...] for the predicate "operates".

In open IE, all extracted tuples are ranked according to their confidence scores, which is important for downstream tasks such as QA (Fader et al., 2011) and KBP. RnnOIE uses the average log probability as the confidence of an extracted tuple:

c(S, p, Ŷ) = (1/m) Σ_{i=1}^{m} log P(ŷ_i | S, p, w_i)    (1)

Given a training corpus, RnnOIE can be learned in a supervised way via maximum log-likelihood estimation (MLE):

L_MLE = Σ_{i=1}^{m} log P(y_i | S, p, w_i)    (2)

where Y = (y_1, y_2, ..., y_m) are the gold labels. As discussed above, Y is expensive and labour-intensive to obtain and has become the biggest bottleneck for neural open IE systems. Therefore, it is critical to design a learning approach that gets rid of this constraint.

Figure 3: An overview of syntactic patterns as data labelling functions. Two training instances are automatically generated using dependency patterns for the predicates "operates" and "has".
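To make the confidence score concrete, the average-log-probability computation can be sketched in plain Python (a toy illustration with made-up per-token probabilities, not the actual RnnOIE implementation):

```python
import math

def confidence(token_label_probs):
    """Average log probability of a decoded label sequence:
    each entry is the probability the model assigns to the
    chosen label of one token, i.e. P(y_i | S, p, w_i)."""
    m = len(token_label_probs)
    return sum(math.log(p) for p in token_label_probs) / m

# Toy decoded sequence for the predicate "operates": higher
# per-token probabilities give a higher (less negative) confidence.
probs_good = [0.9, 0.95, 0.9, 0.85]
probs_bad = [0.5, 0.6, 0.4, 0.5]
assert confidence(probs_good) > confidence(probs_bad)
```

Averaging rather than summing keeps confidences comparable across tuples of different lengths.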

Model Pretraining using Syntactic Pattern-based Data Labelling Functions
The first strategy is to use syntactic extraction patterns as data labelling functions; the heuristically labelled training corpus is then used to pretrain a neural open IE model. It has long been observed that most relation tuples follow syntactic regularities, and many syntactic patterns have been designed for extracting tuples, such as TEXTRUNNER (Banko et al., 2007) and ReVerb (Fader et al., 2011). However, it is difficult to design high-coverage syntactic patterns, although many extensions have been proposed, such as WOE (Wu and Weld, 2010), OLLIE (Mausam et al., 2012), ClausIE (Corro and Gemulla, 2013), Stanford Open IE, PropS, and OpenIE4 (Mausam, 2016).
This paper leverages the power of patterns differently. Inspired by the ideas of data programming (Ratner et al., 2016) and distant supervision (Mintz et al., 2009), we use syntactic patterns as data labelling functions, rather than using them to directly extract tuples.
Concretely, this paper uses dependency patterns from Stanford Open IE to design hand-crafted patterns as data labelling functions. As shown in Figure 3, given a sentence and its dependency parse, two training instances are generated: 1) We first identify all predicates using part-of-speech (POS) tags. For example, "operates" and "has" are identified. 2) For each predicate, we identify its arguments' headwords using predefined dependency patterns. For example, "Parragon" and "markets" are extracted as the headwords. 3) For each headword, we extract the whole phrase headed by it as the subject/object. For example, the phrase "more than 35 markets", headed by "markets", will be extracted as the object of "operates".
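The three labelling steps can be sketched on a toy dependency parse, represented as (token, head index, relation) triples. The parse and the relation names below are illustrative stand-ins, not the exact Stanford patterns:

```python
# Toy dependency parse for "Parragon operates more than 35 markets":
# (token, head_index, deprel); head_index -1 marks the root.
parse = [
    ("Parragon", 1, "nsubj"),
    ("operates", -1, "root"),
    ("more", 4, "advmod"),
    ("than", 4, "mwe"),      # hypothetical relation labels
    ("35", 5, "nummod"),
    ("markets", 1, "dobj"),
]

def subtree(parse, idx):
    """Indices of the whole phrase headed by token idx (step 3)."""
    out = {idx}
    changed = True
    while changed:
        changed = False
        for i, (_, head, _) in enumerate(parse):
            if head in out and i not in out:
                out.add(i)
                changed = True
    return sorted(out)

def label_instance(parse, pred_idx):
    """Steps 2-3: find nsubj/dobj headwords of the predicate,
    then expand each headword to its whole phrase."""
    tuple_ = {"P": parse[pred_idx][0]}
    for i, (_, head, rel) in enumerate(parse):
        if head == pred_idx and rel in ("nsubj", "dobj"):
            phrase = " ".join(parse[j][0] for j in subtree(parse, i))
            tuple_["ARG1" if rel == "nsubj" else "ARG2"] = phrase
    return tuple_

print(label_instance(parse, 1))
# -> {'P': 'operates', 'ARG1': 'Parragon', 'ARG2': 'more than 35 markets'}
```

Sentences the patterns cannot match are simply skipped, which is one source of the limited coverage of the generated labels.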
Finally, the generated labels are used to pretrain an open IE model by optimizing objective function (2), which provides a reasonable initialization for the RL algorithm described in the next section.

Model Generalization via Syntactic and Semantic-driven Reinforcement Learning
One main drawback of the automatically generated labels is that they are often noisy and have limited coverage: many open relation tuples are not covered by the predefined patterns, and the dependency parse may contain errors, which in turn lead to noisy training instances. For example, in Figure 3 the training instance of the predicate "has" misses its subject "Parragon". Therefore, it is critical to generalize and refine the base model to open situations for good performance.
To this end, this section proposes the second learning strategy: syntactic and semantic-driven reinforcement learning. Specifically, we first measure the goodness of extracted tuples based on syntactic constraints, captured by syntactic rules, and on semantic consistency, modelled with pre-trained models such as BERT (Devlin et al., 2019). We then generalize our model using the goodness of extractions as rewards in RL.
By modelling the extraction task as a Markov Decision Process (MDP) < S, A, T, R >, we have the following definitions:

• S = {s} are states, which represent the current extraction status (the sentence, the predicate, and the labels predicted so far).

• A = {a} are actions used to indicate the target labels, which are decided based on the current states S and the beam search strategy.
• T is the state transition function, which is related to the state update.
• R(Ŷ, S) is the reward function, which models the goodness of the extracted tuples. The syntactic and semantic-driven reward function is described in detail below.
Formally, the open IE model is trained to maximize the expected reward of the generated label sequence Ŷ using the REINFORCE algorithm with the likelihood ratio trick (Glynn, 1990; Williams, 1992):

∇J(θ) = E[R(Ŷ, S) ∇ log P(Ŷ | S, p)]    (3)

where log P(Ŷ | S, p) denotes the log probability of the generated label sequence.
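As a toy illustration of the REINFORCE update, the surrogate loss for one sampled label sequence can be sketched in plain Python (gradients, baselines, and batching omitted; all probabilities and rewards are made up):

```python
import math

def reinforce_loss(sampled_token_probs, reward):
    """Surrogate loss -R * log P(Y|S, p) for one sampled sequence;
    minimizing it raises the probability of well-rewarded sequences
    and lowers the probability of badly-rewarded ones."""
    log_p = sum(math.log(p) for p in sampled_token_probs)
    return -reward * log_p

# With a positive reward, more probable sequences yield lower loss...
assert reinforce_loss([0.9, 0.9, 0.9], 1.0) < reinforce_loss([0.5, 0.5, 0.5], 1.0)
# ...while a negative reward flips the preference.
assert reinforce_loss([0.9, 0.9, 0.9], -1.0) > reinforce_loss([0.5, 0.5, 0.5], -1.0)
```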
Reward Function. The reward function, i.e., the goodness of extracted tuples, is critical in our RL algorithm. This paper estimates the reward R(Ŷ, S) by considering both the syntactic constraint and the semantic consistency:

R(Ŷ, S) = Syn(Ŷ) · Sem(Ŷ, S)    (4)

where Syn(Ŷ) is the syntactic constraint score and Sem(Ŷ, S) is the semantic consistency score.
For the syntactic constraint, Syn(Ŷ) is computed as:

Syn(Ŷ) = 1 if Ŷ satisfies the syntactic rules, and -1 otherwise    (5)

where 1 means the predicted label sequence Ŷ is correct and -1 means it is incorrect. For semantic consistency, given an extracted tuple and its original sentence, Sem(Ŷ, S) is computed as:

Sem(Ŷ, S) = P(positive | Ŷ, S)    (6)

where P(positive | Ŷ, S) is the semantic similarity between the predicted label sequence Ŷ and its original sentence S. This paper estimates this semantic similarity using a BERT-based classifier, which assigns a similarity score to each sentence-tuple pair. Because multiple tuples can be extracted from a single sentence (see Figure 3 for an example), we train the classifier on the Stanford Natural Language Inference (SNLI) Corpus (Bowman et al., 2015), so that a high similarity score is assigned if the original sentence entails the extracted tuple. This semantic consistency provides useful supervision signals for open IE models. For example, because (Parragon; has; 10 offices) has a higher semantic similarity than (has; 10 offices) to the sentence "Parragon operates more than 35 markets and has 10 offices.", the model will be guided toward more complete extractions.
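A minimal sketch of how the two signals could combine into a single reward. The sign-times-magnitude combination is one plausible reading; `syn_ok` stands in for the output of the syntactic rule checker and `sem_score` for the BERT-based classifier probability:

```python
def reward(syn_ok, sem_score):
    """R(Y, S): the syntactic constraint contributes the sign
    (+1 valid, -1 invalid); semantic consistency the magnitude."""
    syn = 1.0 if syn_ok else -1.0
    return syn * sem_score

# The complete tuple (Parragon; has; 10 offices) earns a positive
# reward, while the truncated (has; 10 offices) is penalized.
assert reward(True, 0.92) > 0 > reward(False, 0.41)
```

Under this reading, dropping the syntactic term would make every explored extraction look positive, which matches the ablation behaviour reported later.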
Semantic-Based Confidence Estimation. In RnnOIE, the confidence score c(S, p, Ŷ) is estimated only from extraction probabilities. This paper further considers the semantic consistency score for better confidence estimation:

c'(S, p, Ŷ) = c(S, p, Ŷ) + log(Sem(Ŷ, S))    (7)

where the log is applied to the semantic consistency score because c(S, p, Ŷ) also uses log probabilities.
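The adjusted confidence score is a one-line combination of the two signals; a minimal plain-Python sketch with made-up numbers:

```python
import math

def semantic_confidence(avg_log_prob, sem_score):
    """c'(S, p, Y) = c(S, p, Y) + log Sem(Y, S)."""
    return avg_log_prob + math.log(sem_score)

# Two tuples with the same model confidence are re-ranked by how
# strongly the original sentence entails them.
assert semantic_confidence(-0.3, 0.9) > semantic_confidence(-0.3, 0.4)
```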

Experimental Settings
Datasets. We conduct experiments on three open IE benchmarks: OIE2016, WEB, and NYT (Mesquita et al., 2013). Table 1 shows their statistics. Because only OIE2016 provides training instances and it is the largest dataset, we use OIE2016 as the primary dataset. The WEB and NYT datasets are small and provide no training instances, so we use them for out-of-domain evaluation. For OIE2016, we follow the settings in Jiang et al. (2019). For WEB and NYT, we follow the settings in Stanovsky et al. (2018).
Baselines. We compare our method with the following systems:

• Supervised neural open IE systems, including RnnOIE-Supervised (Stanovsky et al., 2018) and RankAware (Jiang et al., 2019). RnnOIE is described in Section 2.1. RankAware is the state-of-the-art model on the OIE2016 dataset, which uses iterative rank-aware learning for better confidence estimation.

Overall Results
Table 2 and Figure 4 show the overall results. For our method, we use three settings: the first is the full model using the proposed syntactic and semantic-driven learning, RnnOIE-Full; the second is the base model which is not generalized using our reinforcement learning strategy, RnnOIE-Base; the third is our method with the base model trained using a gold-labelled corpus, RnnOIE-SupervisedRL. From Table 2 and Figure 4, we can see that:

1) The syntactic and semantic-driven learning approach can effectively resolve the training data bottleneck of neural open IE systems. On all three datasets, RnnOIE-Full significantly outperforms its supervised counterpart, RnnOIE-Supervised (BERT). On OIE2016, RnnOIE-Full can even achieve competitive performance with the supervised SoA model, RankAware. We believe this verifies the motivation of our method: the quality of extractions can be accurately evaluated using syntactic and semantic knowledge, and this knowledge can be effectively leveraged for learning open IE systems.

2) Syntactic pattern-based data labelling is an effective learning strategy. Trained on the generated corpus, RnnOIE-Base achieves competitive performance on OIE2016 compared with its supervised counterpart, RnnOIE-Supervised (BERT). This verifies that the heuristically labelled dataset, although noisy, can provide a good starting point for building open IE systems. On the other hand, we found that the noisy training corpus by itself is not enough for a high-performance open IE system: on OIE2016 there is a 134% AUC gap (5.9 to 13.8) from RnnOIE-Base to RnnOIE-Full. This also verifies the need for further generalization techniques.

3) Syntactic and Semantic-driven RL is effective for generalizing and refining open IE models.
Compared with RnnOIE-Base, RnnOIE-Full gets a 134% AUC improvement, from 5.9 to 13.8. By further generalizing the supervised RnnOIE-Supervised (BERT) baseline using RL, RnnOIE-SupervisedRL obtains a further 121% AUC improvement, from 7.2 to 15.9. These results verify the effectiveness of our RL algorithm. This may be because a) RL is based on the explore-and-exploit strategy, and the explore stage can consider many unseen cases; b) syntactic and semantic knowledge provides good supervision signals for open IE systems, and the syntactic and semantic-aware rewards can effectively exploit these signals.

4) The RL-based generalization strategy is critical for scaling open IE systems to open situations. On OIE2016 we can see that, although supervised systems outperform pattern-based systems, their performance decreases significantly on the out-of-domain WEB and NYT datasets. RnnOIE-Supervised (BERT) even performs worse than ClausIE and OpenIE4 on WEB and NYT. On the contrary, RnnOIE-Full still achieves robust performance. This verifies the effectiveness of the proposed RL-based algorithm for generalizing to open situations. It is worth noting that RnnOIE-Full even outperforms RnnOIE-SupervisedRL on the out-of-domain datasets. The reasons may be: a) the gold-labelled corpus is useful in in-domain situations (OIE2016), but the supervised base model may overfit it, which in turn affects the generalization process in RL; b) RnnOIE-Full learns shallow linguistic features which are more general, and therefore performs better in out-of-domain situations (WEB and NYT).

Detailed Analysis
To analyze our method in detail, this section further investigates the effects of syntactic and semantic knowledge, semantic-based confidence estimation, and the RL exploration beam size. Additionally, we compare RnnOIE-Full with two open IE systems to find out how far data labelling functions can get us.

Effects of Syntactic and Semantic Knowledge.
Our reward function R(Ŷ, S) consists of both the syntactic constraint Syn(Ŷ) and the semantic consistency Sem(Ŷ, S). To analyze the effects of syntactic and semantic knowledge, we conduct ablation experiments by removing the semantic part (w/o semantic) and removing the syntactic part (w/o syntactic) from the reward function. Table 3 shows their performance on OIE2016.
We can see that: 1) both syntactic and semantic knowledge are useful: removing either results in a performance decrease; 2) the syntactic constraint is crucial for our model: removing it results in a significant AUC decrease (from 13.8 to 3.0). This is because if we drop the syntactic constraint Syn(Ŷ), all explored relations are treated as true, and therefore our RL algorithm cannot rectify wrong extractions.
Effect of Confidence Estimation. Table 4 shows the performance of different confidence estimation algorithms, including: Avg Log (average log probabilities, computed by Equation 1), Semantic Consistency (computed by Equation 6), and Avg Log + Semantic Consistency. We can see that: 1) the semantic evidence and the model prediction evidence are complementary: Avg Log + Semantic Consistency obtains the best performance, with 15% and 28% AUC improvements over Avg Log and Semantic Consistency, respectively; 2) Semantic Consistency provides useful information for confidence estimation: by itself it achieves performance comparable to Avg Log.
Effect of Beam Size. The beam size is an important hyper-parameter which controls the exploration breadth of our RL algorithm. Figure 5 shows the performance with different beam sizes.
We can see that: 1) an appropriate beam size is needed for generalizing the open IE model: if the beam size is too small, RnnOIE-Full cannot explore new unseen cases because its exploration strategy is too greedy; 2) the proposed RL algorithm is robust and achieves good performance with reasonable beam sizes (≥ 3). Because a larger beam size increases the computational complexity, we set the beam size to 3 in all other experiments.
More Complex Data Labelling Functions. In Section 2.2, we directly use dependency patterns from Stanford Open IE to design hand-crafted patterns as data labelling functions. This raises a question: if we use more complex patterns as data labelling functions to obtain more diverse and accurate labelled data, is RL-based generalization still needed? We therefore compare RnnOIE-Full with two systems trained on richer automatically labelled data, Cui et al. (2018) and SenseOIE (Roy et al., 2019), to find out how far data labelling functions can get us. From Table 5, we can see that Cui et al. (2018) formulates open IE as a sentence generation task and uses extractions from OpenIE4 (Mausam, 2016) as its training data.

Error Analysis.
We further conduct error analysis for RnnOIE-Full. We found that there are mainly three types of error cases: Missing Argument, Overgenerated Predicate, and Incorrect Annotation. Table 6 shows examples of each.
Missing Argument is the case where the extractions miss some arguments, especially optional arguments such as Time and Place. For instance, the first case in Table 6 shows that the extraction for the predicate "award" misses the optional time argument "in 1892", although it correctly contains the two main arguments "DePauw University" and "the degree "Doctor of Divinity"". This may be because optional arguments usually play a less important role in semantic consistency, so our syntactic and semantic-driven RL algorithm pays less attention to them.
Overgenerated Predicate is the case where the predicates of extractions are not included in the ground truth. The second case in Table 6 shows a bad case where "Win" is wrongly extracted as a predicate. This is a common error in all neural-based approaches because they generally treat all verbs in a sentence as predicates and do not have a mechanism to reject incorrect ones. One strategy to handle this error is to jointly detect predicates and arguments, which we leave as future work.
Incorrect Annotation is the case where the ground truth labels are incorrect. Because expressions in open IE are highly diversified, we found the gold annotations may be incorrect or inconsistent. The third case in Table 6 shows an incorrect ground truth annotation "given", which is wrongly labelled as a predicate. This further verifies the bottleneck of high-quality, large scale labelled corpus for open IE.

Related Work
Open IE. Open IE approaches can be mainly categorized into two categories: pattern-based and neural-based. Pattern-based open IE approaches extract relational tuples using syntactic patterns (Banko et al., 2007; Fader et al., 2011; Wu and Weld, 2010; Mausam et al., 2012; Mausam, 2016; Corro and Gemulla, 2013). In recent years, neural-based approaches have achieved significant progress; they formulate open IE as either a sequence labelling task (Stanovsky et al., 2018) or a sequence generation task (Cui et al., 2018).

Syntactic and semantic knowledge has also been leveraged to enhance open IE systems. Moro and Navigli (2013) design additional syntactic and semantic features to enhance their kernel-based open IE system. Roy et al. (2019) incorporate the outputs of multiple pattern-based open IE systems as additional features of supervised neural open IE systems to overcome the problem of insufficient training data. Compared with these studies, which exploit syntactic and semantic knowledge as additional features of a supervised system, this paper exploits syntactic and semantic knowledge as supervision signals, so that neural open IE models can be effectively learned without any labelled data.
Data Augmentation for NLP. The labelled data bottleneck is a common problem in NLP, and therefore many data augmentation techniques have been proposed, such as data programming (Ratner et al., 2016) and distant supervision (Mintz et al., 2009). The data programming paradigm (Ratner et al., 2016) creates training datasets by explicitly representing users' expressions or domain heuristics as a generative model. The distant supervision paradigm (Mintz et al., 2009) heuristically generates labelled datasets by aligning facts in a KB with sentences in the corpus. The proposed data labelling functions are also motivated by the ideas of data programming and distant supervision.
Reinforcement Learning for IE. Reinforcement learning (RL) (Sutton and Barto, 1998) follows the explore-and-exploit paradigm and is apt for optimizing non-differentiable learning objectives in NLP (Wu et al., 2018). Recently, RL has gained much attention in information extraction (Qin et al., 2018b,a; Takanobu et al., 2019). In open IE, Narasimhan et al. (2016) first used a traditional Q-learning method to extract textual tuples. However, their reward function is chosen to maximize the final extraction accuracy, which still relies on human-labelled datasets and cannot capture syntactic and semantic supervision explicitly.

Conclusions
This paper proposes an open IE learning approach, which can learn neural models without any human-labelled data by leveraging syntactic and semantic knowledge as noisier, higher-level supervision. Specifically, two effective learning strategies are proposed: pattern-based data labelling functions and a syntactic and semantic-driven RL algorithm. Experimental results show that our method significantly outperforms its supervised counterparts, and can even achieve performance competitive with the supervised SoA model. Furthermore, because labelled data is a common bottleneck in NLP, we believe our syntactic and semantic-driven learning approach can also be used for other NLP tasks, such as event extraction.