Multi^2OIE: Multilingual Open Information Extraction based on Multi-Head Attention with BERT

In this paper, we propose Multi^2OIE, which performs open information extraction (open IE) by combining BERT with multi-head attention. Our model is a sequence-labeling system with an efficient and effective argument extraction method. We use a query, key, and value setting inspired by the Multimodal Transformer to replace the previously used bidirectional long short-term memory architecture with multi-head attention. Multi^2OIE outperforms existing sequence-labeling systems with high computational efficiency on two benchmark evaluation datasets, Re-OIE2016 and CaRB. Additionally, we apply the proposed method to multilingual open IE using multilingual BERT. Experimental results on new benchmark datasets introduced for two languages (Spanish and Portuguese) demonstrate that our model outperforms other multilingual systems without training data for the target languages.


Introduction
Open information extraction (Open IE) (Banko et al., 2007) aims to extract a set of arguments and their corresponding relationship phrases from natural language text. For example, an open IE system could derive the relational tuple (was elected; The Republican candidate; President) from the given sentence "The Republican candidate was elected President." Because the extractions generated by open IE are considered useful intermediate representations of the source text (Mausam, 2016), this method has been applied to various downstream tasks (Christensen et al., 2013; Ding et al., 2016; Khot et al., 2017; Wu et al., 2018).
Although early open IE systems were largely based on handcrafted features or fine-grained rules (Fader et al., 2011; Mausam et al., 2012; Del Corro and Gemulla, 2013), most recent open IE research has focused on deep-neural-network-based supervised learning models. Such systems are typically based on bidirectional long short-term memory (BiLSTM) and are formulated for two categories: sequence labeling (Stanovsky et al., 2018; Sarhan and Spruit, 2019; Jia and Xiang, 2019) and sequence generation (Cui et al., 2018; Sun et al., 2018; Bhutani et al., 2019). The latter enables flexible extraction; however, it is more computationally expensive than the former. Additionally, generation methods are not suitable for non-English text owing to a lack of training data because they are heavily dependent on in-language supervision (Ponti et al., 2019). Therefore, we adopted the sequence labeling method to maximize scalability by using (multilingual) BERT (Devlin et al., 2019) and multi-head attention (Vaswani et al., 2017).

† Corresponding author

Figure 1: Comparison between existing extractors and the proposed method. We use BERT for feature embedding layers and as a predicate extractor. Predicate information is reflected through multi-head attention instead of simple concatenation.

The main advantages of our approach can be summarized as follows:

• Our model can consider rich semantic and contextual relationships between a predicate and other individual tokens in the same text during sequence labeling by adopting a multi-head attention structure. Specifically, we apply multi-head attention with the final hidden states from BERT as a query and the hidden states of predicate positions as key-value pairs. This method repeatedly reinforces sentence features by learning attention weights across the predicate and each token (Tsai et al., 2019). Figure 1 presents the difference between the existing sequence labeling methods and the proposed method.
• Multi^2OIE can operate on multilingual text without non-English training datasets by using BERT's multilingual version. By contrast, for sequence generation systems, performing zero-shot multilingual extraction is much more difficult (Rönnqvist et al., 2019).
• Our model is more computationally efficient than sequence generation systems. This is because the autoregressive properties of sequence generation create a bottleneck for real-world systems. This is an important issue for downstream tasks that require processing of large corpora.
Experimental results on two English benchmark datasets, Re-OIE2016 (Zhan and Zhao, 2020) and CaRB (Bhardwaj et al., 2019), show that our model yields the best performance among the available sequence-labeling systems. Additionally, it is demonstrated that the computational efficiency of Multi^2OIE is far greater than that of sequence generation systems. For a multilingual experiment, we introduce multilingual open IE benchmarks (Spanish and Portuguese) constructed by translating and re-annotating the Re-OIE2016 dataset. Experimental results demonstrate that the proposed Multi^2OIE outperforms other multilingual systems without additional training data for non-English languages. To the best of our knowledge, ours is the first approach using BERT for multilingual open IE. The code and related resources can be found at https://github.com/youngbin-ro/Multi2OIE.

Multi-Head Attention for Open IE
In sequence labeling open IE systems, when extracting arguments for a specific predicate, predicate-related features are used as input variables (Stanovsky et al., 2018; Zhan and Zhao, 2020; Jia and Xiang, 2019). We analyzed this extraction process from the perspective of multimodal learning (Mangai et al., 2010; Ngiam et al., 2011; Baltrusaitis et al., 2019), which defines an entire sequence and the corresponding predicate information as a modality. The most frequently used method for open IE is simple concatenation (Figure 1, left), which can be interpreted as an early fusion approach. Simple concatenation has low computational complexity, but requires intensive feature engineering. It is also highly reliant on the choice of a classifier (Ergun et al., 2016; Liu et al., 2018).
Instead, we propose the use of a multi-modality mechanism (Tsai et al., 2019) to capture the complicated relationships between predicates and other tokens. In our method, multi-head attention is computed by using the target modality as a query with source modalities as key-value pairs to adapt the latent information from sources to targets. This allows our model to assign greater weights to meaningful interactions between modalities. Accordingly, Multi^2OIE uses multi-head attention to reflect predicate information (source modality) throughout a sequence (target modality). We expect this module to transform a general sentence embedding into a suitable feature for extracting the arguments associated with a specific predicate.

Multilingual Open IE
Despite the increasing amount of available web text in languages other than English, most open IE approaches have focused on the English language. For non-English languages, most systems are heavily reliant on handcrafted features and rules, resulting in limited performance (Zhila and Gelbukh, 2014; de Oliveira and Claro, 2019; Guarasci et al., 2020). Although some studies have demonstrated the potential of multilingual open IE (Faruqui and Kumar, 2015; Gamallo and Garcia, 2015; White et al., 2016), most approaches are based on shallow patterns, resulting in low precision.
Therefore, we introduce a multilingual-BERT-based open IE system. BERT provides language-agnostic embedding through its multilingual version and provides excellent zero-shot performance on many classification and labeling tasks (Pires et al., 2019; Wu and Dredze, 2019; Karthikeyan et al., 2020). In Section 5, we demonstrate that our multilingual system yields acceptable performance when it is trained using only an English dataset.

Figure 2: Overall architecture of Multi^2OIE. The BERT hidden sequence, the predicate average, and a position embedding (predicate or not) feed into the multi-head attention blocks and the argument classifier.

Proposed Method
Multi^2OIE extracts relational tuples from a given sentence in two steps. The first step is to find all predicates in the sentence. The second step is to extract the arguments associated with each identified predicate. The architecture of the proposed model is presented in Figure 2.

Task Formulation
Let S = (w_1, w_2, ..., w_l) be an input sentence, where w_i is the i-th token and l is the sequence length. The objective of the proposed model f is to find a set of tags T = (t_1, t_2, ..., t_l), where each element of T indicates one of the "beginning, inside, outside" (BIO) tags (Ramshaw and Marcus, 1995). However, unlike the method proposed in Stanovsky et al. (2018), which uses a predicate head as an input and predicts all tags simultaneously, we first predict a predicate tagset T̂_pred = (t^p_1, t^p_2, ..., t^p_l) using a predicate model f_pred. An argument tagset T̂_arg = (t^a_1, t^a_2, ..., t^a_l) is then predicted using f_arg based on S and T̂_pred. Therefore, our model maximizes the following log-likelihood formulation:

max_{θ_pred, θ_arg} [ log p(T_pred | S; θ_pred) + log p(T_arg | S, T̂_pred; θ_pred, θ_arg) ],

where θ_pred and θ_arg are the trainable parameters of f_pred and f_arg, respectively. In this formulation, f_pred contributes to extracting not only the predicates but also the arguments: the loss and gradients derived from argument extraction are propagated not only to θ_arg but also to θ_pred. Additionally, we treat open IE as an n-ary extraction task and consider BIO tags for arguments up to ARG3. We refer readers to Stanovsky et al. (2018) for a more detailed explanation of the BIO sequence labeling policy.
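As a toy illustration of this two-stage formulation, the sketch below shows plausible BIO tagsets for the introduction's example sentence and recovers the relational tuple from them. The tag names (B-P, B-ARG0, etc.) are illustrative conventions, not necessarily the exact label set used in the implementation.

```python
# Stage 1 predicts the predicate tagset; stage 2 predicts the argument
# tagset conditioned on the sentence and the identified predicate span.
tokens = ["The", "Republican", "candidate", "was", "elected", "President"]

# Stage 1: predicate tagset (B-P/I-P mark the predicate span).
t_pred = ["O", "O", "O", "B-P", "I-P", "O"]

# Stage 2: argument tagset for that predicate (up to ARG3 in the paper).
t_arg = ["B-ARG0", "I-ARG0", "I-ARG0", "O", "O", "B-ARG1"]

def span_text(tokens, tags, label):
    """Join the tokens whose tags carry the given argument label."""
    return " ".join(tok for tok, tag in zip(tokens, tags) if tag.endswith(label))

predicate = " ".join(tok for tok, tag in zip(tokens, t_pred) if tag != "O")
arg0 = span_text(tokens, t_arg, "ARG0")
arg1 = span_text(tokens, t_arg, "ARG1")
print((predicate, arg0, arg1))
# → ('was elected', 'The Republican candidate', 'President')
```

This reproduces the tuple (was elected; The Republican candidate; President) from the introduction.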

Predicate Extraction
We assume that a given sentence S is tokenized by SentencePiece (Kudo and Richardson, 2018). BERT embeds and encodes S through multiple layers. The final hidden states are defined as H ∈ R^{l×d}, where d is the hidden state size of BERT. H is then fed into a feed-forward network and a softmax layer to calculate the probability that each token is classified into each predicate tag. The predicted tagset T̂_pred is obtained by applying the argmax operation to the softmax outputs. Finally, the loss for predicate extraction, denoted L_pred, is calculated as the per-token cross-entropy loss.
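A minimal NumPy sketch of this tagging head is shown below. The hidden states and weights are random stand-in values, not trained parameters, and a three-tag label set {B-P, I-P, O} is assumed for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
l, d, n_tags = 6, 768, 3           # sequence length, BERT hidden size, tag count

H = rng.normal(size=(l, d))        # stand-in for BERT's final hidden states
W = rng.normal(size=(d, n_tags)) * 0.02
b = np.zeros(n_tags)

logits = H @ W + b                                # feed-forward network
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)         # per-token softmax
pred_tags = probs.argmax(axis=1)                  # predicted tagset

gold = np.array([2, 2, 2, 0, 1, 2])               # gold tags (B-P=0, I-P=1, O=2)
loss = -np.log(probs[np.arange(l), gold]).mean()  # per-token cross-entropy L_pred
```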

Argument Extraction
A sentence contains one or more predicates. The argument extraction method described in this section targets only one predicate. The process is simply repeated for multiple predicates.
Input representation The inputs for argument extraction are concatenations of the following three features: H, H̄_pred, and E_pos. The first feature is the same as the final hidden states of BERT, as discussed in Section 3.2. The second feature is the arithmetic mean vector of the hidden states at the predicate positions; we duplicate this vector to match the sequence length l and define it as H̄_pred ∈ R^{l×d}. We refer to the gold tagset T_pred to find the indices of the predicates instead of using the predicted tagset T̂_pred, which yields more stable training (Williams and Zipser, 1989). The final feature E_pos is a position embedding of binary values that indicates whether each token is included in the predicate span. We then concatenate these three features to obtain the input X ∈ R^{l×d_mh}, where d_mh = 2·d + d_pos is the dimension of the multi-head attention and d_pos is the dimension of the position embedding E_pos.
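The input construction can be sketched as follows (NumPy, with random stand-in hidden states). Note that the paper uses a learned position-embedding lookup; raw 0/1 indicator rows are used here only for brevity.

```python
import numpy as np

l, d, d_pos = 6, 768, 64
H = np.random.default_rng(1).normal(size=(l, d))   # stand-in BERT hidden states

pred_mask = np.array([0, 0, 0, 1, 1, 0], bool)     # gold predicate span: tokens 3-4
h_pred = H[pred_mask].mean(axis=0)                 # arithmetic mean predicate vector
H_pred = np.tile(h_pred, (l, 1))                   # duplicated to sequence length l

E_pos = np.zeros((l, d_pos))                       # binary position "embedding"
E_pos[pred_mask] = 1.0                             # (a learned lookup in practice)

X = np.concatenate([H, H_pred, E_pos], axis=1)     # d_mh = 2*d + d_pos
assert X.shape == (l, 2 * d + d_pos)
```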
Following concatenation, X is divided into a query and key-value pairs. We use X itself as a query, denoted as X_q (target sequence). The key-value pairs, denoted as X_k and X_v (source sequence), are the subsets of X derived from the predicate positions.
Multi-head attention block The argument extractor consists of N multi-head attention blocks, each of which has a multi-head attention layer followed by a position-wise feed-forward layer, as shown in Figure 3.
The attention layer is the same as the encoder-decoder attention layer in the original transformer (Vaswani et al., 2017). It first transforms X_q, X_k, and X_v into Q = X_q W_q, K = X_k W_k, and V = X_v W_v, respectively, where W_q, W_k, and W_v are weight matrices with dimensions of (d_mh × d_mh). Following transformation, attention is computed for each head as follows:

head_h = softmax(Q_h K_h^⊤ / √d_h) V_h,

where each head is indexed by h and has dimension d_h = d_mh / n_h, with n_h denoting the number of heads. The attention outputs for all heads are then concatenated and linearly transformed. In addition, we apply residual connections (He et al., 2016) and layer normalization (Ba et al., 2016) based on the results of prior works on transformers.
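A simplified NumPy sketch of this attention pattern follows: the query covers the whole sequence while keys and values are restricted to the predicate positions. The final output projection, residual connection, and layer normalization are omitted here, and the weights are random stand-ins.

```python
import numpy as np

def multihead_predicate_attention(X, pred_idx, n_heads, rng):
    """Query = whole sequence; keys/values = predicate positions only."""
    l, d_mh = X.shape
    d_h = d_mh // n_heads
    Wq, Wk, Wv = (rng.normal(size=(d_mh, d_mh)) * 0.02 for _ in range(3))
    Q, K, V = X @ Wq, X[pred_idx] @ Wk, X[pred_idx] @ Wv
    heads = []
    for h in range(n_heads):
        s = slice(h * d_h, (h + 1) * d_h)
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_h)        # (l, n_predicate_tokens)
        w = np.exp(scores - scores.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)                  # softmax over predicate tokens
        heads.append(w @ V[:, s])                          # (l, d_h)
    return np.concatenate(heads, axis=1)                   # (l, d_mh)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 64))                               # toy input (l=6, d_mh=64)
out = multihead_predicate_attention(X, [3, 4], n_heads=8, rng=rng)
assert out.shape == X.shape
```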
The position-wise feed-forward layer consists of two linear transformations surrounding a ReLU activation function. Residual connections and layer normalization are also applied in this layer. Finally, the output of the final multi-head attention block is fed into the argument classifier. The process for obtaining a predicted argument tagsetT arg and corresponding argument loss L arg is the same as that described in Section 3.2. The final loss for parameter updating is the summation of L pred and L arg .
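The position-wise layer can be sketched as follows (NumPy, random stand-in weights; the inner dimension d_ff is an assumption, as the paper does not state it here).

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's vector to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def position_wise_ffn(x, W1, b1, W2, b2):
    """Two linear transformations around a ReLU, applied identically at every
    position, with a residual connection and layer normalization."""
    h = np.maximum(x @ W1 + b1, 0.0)         # first linear + ReLU
    return layer_norm(x + (h @ W2 + b2))     # second linear + residual + LN

rng = np.random.default_rng(0)
l, d_mh, d_ff = 6, 64, 256
x = rng.normal(size=(l, d_mh))
W1, b1 = rng.normal(size=(d_mh, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_mh)) * 0.02, np.zeros(d_mh)
out = position_wise_ffn(x, W1, b1, W2, b2)
assert out.shape == (l, d_mh)
```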

Confidence Score
In open IE, confidence scores can help control the precision-recall tradeoff of a system. Multi^2OIE provides a confidence score for every extraction by adding the predicate score and all argument scores, as suggested in Zhan and Zhao (2020). The score of the predicate and of each argument is obtained from the probability value of its Beginning tag:

Score = p(B_pred) + Σ_k p(B_arg_k),
where the probability values are given by the softmax layer in each extraction step.

Experimental Setup

Evaluation metrics We evaluated each system using the area under the curve (AUC) and F1-score (F1). AUC is calculated from a plot of the precision and recall values over all potential confidence cutoffs.
The F1-score is the maximum value among the precision-recall pairs. We used the evaluation code provided with each test dataset, which contains the following matching functions: lexical match for Re-OIE2016 and tuple match for CaRB. Although the former only considers the existence of words within extractions, the latter is stricter in that it penalizes long extractions (Bhardwaj et al., 2019).
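Given a list of precision-recall pairs swept over confidence cutoffs, the two metrics can be computed as follows. The values here are purely illustrative, not results from any system.

```python
import numpy as np

# precision/recall pairs over decreasing confidence cutoffs (illustrative values)
precision = np.array([0.95, 0.90, 0.84, 0.78, 0.70])
recall    = np.array([0.20, 0.38, 0.52, 0.63, 0.71])

# AUC: trapezoidal area under the precision-recall curve
auc = float(np.sum(np.diff(recall) * (precision[1:] + precision[:-1]) / 2))

# F1: the best F1 achieved at any cutoff
f1 = float((2 * precision * recall / (precision + recall)).max())
```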
Hyperparameters Model hyperparameters were tuned by performing a grid search. We first trained the model for one epoch with an initial learning rate of 3e-5. The model contains four multi-head attention blocks with eight attention heads and a 64-dimensional position-embedding layer. The batch size was set to 128. The dropout rates for the argument classifier and the attention blocks were both set to 0.2. AdamW (Loshchilov and Hutter, 2019) was used as the optimizer, in combination with training heuristics such as learning-rate warmup (Goyal et al., 2017) and gradient clipping (Pascanu et al., 2013).

Baselines
As baseline models, we selected RnnOIE (Stanovsky et al., 2018), SpanOIE (Zhan and Zhao, 2020), and a few custom systems to evaluate the validity of the multi-head attention blocks (MH). Although these are all sequence-labeling systems, note that SpanOIE uses a span selection method rather than BIO tagging. Table 2 presents a summary of the main baselines used in this study. We also report the results of the following systems developed prior to the use of neural networks: Stanford (Angeli et al., 2015), OLLIE (Mausam et al., 2012), PROPS, ClausIE (Del Corro and Gemulla, 2013), and OpenIE4. For these systems, the results were taken from previous studies (Zhan and Zhao, 2020; Bhardwaj et al., 2019).

Results
The performance results for each system on the Re-OIE2016 and CaRB test data are presented in Table 3. The precision-recall curves are presented in Figure 4. We also present extraction examples from Multi^2OIE and SpanOIE in Table 4.
Overall performance Our model outperforms the other systems on all datasets and metrics. Our model yields average improvements of approximately 6.9%p and 2.9%p in terms of F1 for the Re-OIE2016 and CaRB datasets, respectively, compared to the state-of-the-art system (SpanOIE). Similar to previous studies (Stanovsky et al., 2018; Zhan and Zhao, 2020), the excellent performance of Multi^2OIE is attributed to improved recall. As shown in Table 3, our method achieves the highest recall rate on both datasets. The examples in Table 4 also demonstrate that our model can extract more tuples from the same sentence. An additional tuple (debut; the newly solvent airline; its new image) is found by Multi^2OIE, but not by SpanOIE. Additionally, Multi^2OIE extracts the place information "At a ... hangar" for the first tuple, which is omitted by SpanOIE.
Effects of multi-head attention We compared three pairs of methods to determine the validity of the multi-head attention blocks: (BIO and BIO+MH), (SpanOIE and SpanOIE+MH), and (BERT+BiLSTM and Multi^2OIE). Except for BIO+MH yielding a lower AUC than BIO, the models with multi-head attention achieve higher performance than the BiLSTM-based models. This performance improvement is consistent regardless of the choice of classification method (BIO tagging or span selection). These results suggest that multi-head attention is superior to simple concatenation in terms of utilizing predicate information.
Additionally, the performance improvement from using MH is greater with BERT than with BiLSTM. The average performance improvements from BIO to BIO+MH are -0.5%p (AUC) and 1.1%p (F1), whereas the improvements from BERT+BiLSTM to Multi^2OIE are 2.3%p (AUC) and 2.2%p (F1). This indicates that Multi^2OIE has a model architecture that can create synergies between the predicate and argument extractors.
Computational cost We measured the training and inference times of each system to evaluate computational efficiency. As an additional baseline, we considered a recently published sequence generation system called IMoJIE (Kolluru et al., 2020), which achieved state-of-the-art performance on the CaRB dataset using sequential decoding of tuples conditioned on previous extractions. For calculating inference times, we selected 641 sentences from the CaRB test set and executed the models on a single TITAN RTX GPU.

Table 4: Extraction examples from SpanOIE and Multi^2OIE.
Sentence: At a presentation in the Toronto Pearson International Airport hangar, Celine Dion helped the newly solvent airline debut its new image.
SpanOIE: (helped; Celine Dion; the newly solvent airline debut its new image)
Multi^2OIE: (helped; Celine Dion; the newly solvent airline debut its new image; At a presentation in the Toronto Pearson International Airport hangar), (debut; the newly solvent airline; its new image)

Table 5 reveals that Multi^2OIE is far more efficient than IMoJIE. Our model requires only 15.5 s to process the 641 sentences, whereas IMoJIE requires more than 3 min, roughly 14 times longer. This bottleneck of IMoJIE could be a drawback for downstream tasks, such as knowledge base construction, which must work with large amounts of text. Considering that the performance difference between the two models is only approximately 1%p, it may be reasonable to use Multi^2OIE for processing large-scale corpora. Multi^2OIE also exhibits competitive computational costs compared to the other sequence-labeling systems. Our model has training times similar to those of BERT+BiLSTM, but is faster at inference. This demonstrates that MH has a positive effect on both efficiency and performance. In the case of SpanOIE, its span selection method creates bottlenecks for both training and inference.

Multilingual Performance
As mentioned in Section 2.2, we trained a multilingual version of Multi^2OIE using multilingual BERT and the same training dataset as the English version. We assumed that data for non-English languages were not available and tested the model's zero-shot performance. Evaluations were conducted using a dataset generated based on the Re-OIE2016 dataset.

Experimental setup
Datasets Considering the availability of baseline systems, we selected Spanish and Portuguese as the evaluation dataset languages. First, all sentences, predicates, and arguments from the Re-OIE2016 dataset were translated into the target languages using Google Translate. To prevent adverse effects from translation errors, we modified the translated sentences to ensure that the back-translated sentences had the same meaning as the original sentences. After the translation and modification, we manually re-annotated all tuples in the target languages based on the English annotation of Re-OIE2016.
Evaluation metrics Because the baseline systems are binary extractors and do not provide confidence scores, we report binary extraction performance without AUC values. Additionally, although the introduced dataset was generated based on the Re-OIE2016, each system was tested using CaRB's evaluation code for more rigorous evaluation.
Baselines Our baseline models were two rule-based multilingual systems: ArgOE (Gamallo and Garcia, 2015) and PredPatt (White et al., 2016). The former takes dependency parses in the CoNLL-X format as inputs. Similarly, the latter uses language-agnostic patterns over Universal Dependencies (UD) structures (https://universaldependencies.org/).

Results
Comparison to the English model Prior to comparing the multilingual systems, we evaluated whether the multilingual version of Multi^2OIE performs satisfactorily on English compared to the English-only version. Table 6 lists the performance metrics for the English and multilingual versions of our model on the CaRB dataset; the results for the English version are copied from Table 3. Although the multilingual version yields lower scores on both metrics compared to the English version, its F1 score is comparable and its recall is higher. Furthermore, the multilingual version still outperforms the other sequence-labeling systems, indicating that multilingual BERT can successfully support a Multi^2OIE model with favorable performance. Table 8 lists the performance metrics of each system on the multilingual dataset, and Table 7 contains an example of Multi^2OIE's extraction results for each language.

Multilingual performance
One can see that Multi^2OIE outperforms the other systems in all languages. Similar to the results in Section 4.3, the superiority of our multilingual model is attributed to its high recall: Multi^2OIE yields the highest recall for all languages, by margins of approximately 20%p. In contrast, ArgOE has relatively high precision, but its low recall negatively impacts its F1 score. PredPatt provides the best balance of precision and recall, but its overall performance is lower than that of our model. The performance differences between languages are similar for all models: every model performs best on English, followed by Spanish and Portuguese. Multi^2OIE also exhibits performance degradation for non-English languages. However, considering that our model was never trained to perform open IE on Spanish or Portuguese, its performance is remarkable. For some non-English sentences, our model extracts the same tuples as in the English results, as shown in Table 7. This agrees with previous studies (Pires et al., 2019; Wu and Dredze, 2019; Karthikeyan et al., 2020) that have demonstrated the excellent cross-lingual abilities of multilingual BERT. Based on these results, we expect that Multi^2OIE will also work well on languages other than those considered in this study.

Conclusion
In this paper, we propose Multi^2OIE, which exploits BERT and multi-head attention for the open IE task. Multi-head attention has the advantage of fusing sentence and predicate features, adequately reflecting predicate information throughout a sentence. Our model achieved the best performance among sequence-labeling models. Multi^2OIE also exhibited superior computational efficiency with competitive performance compared to state-of-the-art sequence generation systems. Additionally, a Multi^2OIE model trained using multilingual BERT outperformed the baseline models without training on any non-English languages.
However, some types of extractions, such as nominal relations, conjunctions in arguments, and contextual information, are not handled by Multi^2OIE. Future work could investigate how to apply Multi^2OIE to these cases. For multilingual open IE, further study could evaluate performance on non-alphabetic languages that were not considered in this study.