From Dataset Recycling to Multi-Property Extraction and Beyond

This paper investigates various Transformer architectures on the WikiReading Information Extraction and Machine Reading Comprehension dataset. The proposed dual-source model outperforms the current state-of-the-art by a large margin. Next, we introduce WikiReading Recycled, a newly developed public dataset, together with the task of multi-property extraction. It uses the same data as WikiReading but does not inherit its predecessor’s identified disadvantages. In addition, we provide a human-annotated test set with diagnostic subsets for a detailed analysis of model performance.


Introduction
The emergence of attention-based models has revolutionized Natural Language Processing (Young et al., 2018). Pretraining these models on large corpora like BookCorpus (Zhu et al., 2015) has been shown to yield a reliable and robust base for downstream tasks. These include Natural Language Inference (Bowman et al., 2015), Question Answering (Rajpurkar et al., 2016), Named Entity Recognition (Yadav and Bethard, 2018; Goyal et al., 2018), and Property Extraction (Hewlett et al., 2016).
The creation of large supervised datasets often comes with trade-offs, such as one between the quality and quantity of data. For instance, the WikiReading dataset (Hewlett et al., 2016) was created by treating WikiData annotations as the expected answers for related Wikipedia articles. However, these two sources were created separately, and their information content overlaps only partially. Hence, the resulting dataset may contain noise.
The best models can achieve results better than the human baseline across many NLP datasets such as MSCQAs (Wang et al., 2018), STS-B, QNLI (Raffel et al., 2020), CoLA or MRPC. However, as a consequence of different kinds of noise in the data, they rarely maximize the score metric (Stanislawek et al., 2019). While current work in NLP is focused on preparing new datasets, we regard recycling existing ones as equally important. Thus, after outperforming the previous state-of-the-art on WikiReading, we investigated the dataset's weaknesses and created an entirely new, more challenging Multi-Property Extraction task with improved data splits and a reliable, human-annotated test set.
Contribution. The specific contributions of this work are the following. We analyzed the WikiReading dataset and pointed out its weaknesses. We introduced a Multi-Property Extraction task by creating a new dataset: WikiReading Recycled. Our dataset contains a human-annotated test set, with multiple subsets aimed to benchmark qualities such as generalization on unseen properties. We introduced a Mean-Multi-Property-F1 score suited for the new Multi-Property Extraction task. We evaluated previously used architectures on both datasets. Furthermore, we showed that pretrained transformer models (Dual-Source RoBERTa and T5) beat all other baselines. The new dataset and all the models mentioned in the present paper were made publicly available on GitHub.

Related Work
Early work in relation extraction revolves around problems crafted using distant supervision methods, which are semi-supervised methods that automatically label pools of unlabeled data (Craven and Kumlien, 1999). In contrast, many QA datasets were created through crowd-sourcing, where annotators were asked to formulate questions with answers that require knowledge retrieval and information synthesis. One of the most popular QA datasets is the Wikipedia-based SQuAD, where an instance consists of a human-formulated question and an encyclopedic reading passage on which the answer is based (Rajpurkar et al., 2018). Another crowd-sourced dataset that profoundly influenced Natural Language Inference research is SNLI (Bowman et al., 2015), a three-way semantics-based classification of a relation between two different sentences. Both SQuAD and SNLI are large-scale Machine Reading Comprehension (MRC) tasks, but they cannot be treated as Property Extraction as defined in Section 3; hence they are not considered in this paper. Similarly, some MRC problems framed in TREC tracks, such as Conversational Assistance or Question Answering, are beyond the scope of this paper (Dalton et al., 2020; Dang et al., 2007).
Hewlett et al. (2016) proposed the WikiReading dataset, in which each instance consists of a Wikipedia article and a related WikiData statement. No additional annotation work was performed, yet the resulting dataset was presumed to be highly reliable. Nevertheless, we consider additional human annotation to be desirable (Section 4.3). Alongside the dataset, a property extraction task was introduced. The idea behind it is to read an article given a property name and to infer the associated value from the article. The property extraction paradigm is described in detail in Section 3, whereas a brief comparison to related datasets is presented in Table 1.
Initially, the best-performing model used placeholders to allow rewriting out-of-vocabulary words to the output. Next, Choi et al. (2017) presented a reinforcement learning approach that improved results on a challenging subset of the 10% longest articles. This framework was extended by Wang and Jin (2019) with a self-correcting action that removes the inaccurate answer from the answer generation module and continues to read. The authors of SWEAR hold the state-of-the-art on WikiReading; their model attends over a sliding window's representations to reduce documents to one vector from which another GRU network generates the answer (Chung et al., 2014). Additionally, they evaluated a strong semi-supervised solution on a randomly sampled 1% subset of WikiReading.
To the best of our knowledge, no prior work has evaluated Transformer-based models or pretrained encoders on WikiReading.

Property Extraction
Let a property denote any query for which a system is expected to return an answer from given text. Examples include country of citizenship for a biography provided as an input text, or architect name for an article regarding the opening of a new building. Contrary to QA problems, a query is not formulated as a question in natural language but rather as a phrase or keyword. We use the term value when referring to a valid answer for the stated query. Some properties have multiple valid answers; thus, multiple values are expected. Examine the case of Johann Sebastian Bach's biography for which property sister has eight values. We will refer to any task consisting of a tuple (properties, text) for which values are to be provided as a property extraction task.
The biggest publicly available dataset for property extraction is WikiReading (Hewlett et al., 2016).
The dataset combines articles from Wikipedia with Wikidata information. The dataset is of great value; however, several flaws can be identified. First, more than 95% of articles in the test set appeared in the train set (Table 2). Second, the unjustifiably large size of the test set is a substantial obstacle for running experiments. For instance, it takes 50 hours to process the test set using a Transformer model such as T5-Small on a single NVIDIA V100 GPU. Finally, WikiReading assumes that every value in the test set can be determined on the basis of a given article. As shown later, this is not the case for 28% of values.

Towards Multi-Property Extraction
In the Multi-Property Extraction (MPE) scenario we propose, the system is expected to return values for multiple properties at once. Hence, MPE can be considered a generalization of the single-property extraction task, since any single-property instance can be formulated as an MPE instance with one property. Thus, MPE is backward-compatible with single-property extraction, and it is still possible to evaluate models trained in the single-property setting.
Many arguments can be considered in favor of framing the problem as MPE. In a typical business scenario, multiple properties are expected to be extracted from a given document. The bulk inference requires a lower computational budget by a factor proportional to the mean number of properties per article, which makes MPE preferable. Moreover, one can expect that systems trained in such a way will manifest emergent properties resulting from the interaction between properties themselves. Consider the set of property-value pairs: date of birth: 1915-01-12, date of death: 1979-05-02, place of birth: Saint Petersburg already predicted by an autoregressive model. It is in principle possible to answer: country of citizenship: Russian Empire, country of citizenship: Soviet Union using the earlier predicted pairs only. This phenomenon emerges if the model (or person) learned the relationships between years, administrative boundaries of the city, and the transformation of the Russian Empire into a communist state that occurred in the meantime. Although no such reasoning is required and the problem can be solved by memorizing related co-occurrence patterns, we intend to achieve the mentioned emergent properties.

WikiReading Recycled: Novel Dataset for Multi-Property Extraction
The comparison to existing datasets and shared tasks is briefly presented in Table 1, whereas Table 3 focuses on selected differences between WikiReading Recycled and WikiReading.

Desiderata
Our set of desiderata is based on the following intentions. We wished to introduce the problem of Multi-Property Extraction to evaluate systems that extract any number of given properties at once from the same source text. Our second objective was to ensure that an article may appear in precisely one data split. The third core intention was to introduce an article-centered data objective instead of a property-centric one. Note that an instance of data should be an article with multiple properties. The fourth objective was to ensure that all properties in the test set can be extracted or inferred.
The fifth was to keep the validation and test sets within a reasonable size. Moreover, we aim to provide a test set of the highest quality, lacking noise that could arise from automatic processing. Finally, we intended to benchmark the model generalization abilities: the test set contains properties not seen during training, posing a challenge for current state-of-the-art systems.

Data Collection and Split
WikiReading Recycled and WikiReading are based on the same data, yet differ in how they are arranged. Instances from the original WikiReading dataset were merged to produce over 4M samples in the MPE paradigm. Instead of performing a random split, we carefully divide the data assuming that 20% of properties should appear solely in the test set (more precisely, not seen before in train and validation sets). Around one thousand articles containing properties not seen in the remaining subsets were drafted to achieve the mentioned objective. Similarly, properties unique to the validation set were introduced to enable approximation of the test set performance without disclosing particular labels. Additionally, test and validation sets share 10% of the properties that do not appear in the train set, increasing the size of these subsets by 2,000 articles each. Another 2,000 articles containing the same properties as the train set were added to each of the validation and test sets. All the remaining articles were used to produce the training set.
To sum up, we achieved a design where as much as 50% of the properties cannot be seen in the training split, while the remaining 50% of the properties can appear in any split. We chose these properties carefully so that the size of the test and validation sets does not exceed 5,000 articles.
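The merging step described above can be illustrated with a minimal sketch. It assumes single-property instances arrive as (article_id, property, value) triples; the function name `merge_to_mpe` and the identifiers `Q1` and `Q2` are hypothetical and are not taken from the released code.

```python
from collections import defaultdict


def merge_to_mpe(instances):
    """Group (article_id, property, value) triples into one record per article.

    Hypothetical helper: the triple layout is an assumption made for this sketch.
    """
    merged = defaultdict(lambda: defaultdict(list))
    for article_id, prop, value in instances:
        # A property may have several valid values, so values are kept in a list.
        merged[article_id][prop].append(value)
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {article_id: dict(props) for article_id, props in merged.items()}


single_property = [
    ("Q1", "date of birth", "1915-01-12"),
    ("Q1", "country of citizenship", "Russian Empire"),
    ("Q1", "country of citizenship", "Soviet Union"),
    ("Q2", "continent", "Europe"),
]
mpe = merge_to_mpe(single_property)
```

After merging, each article is a single instance carrying all of its properties, which is what makes an article-level split (one article in exactly one data split) straightforward to enforce.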

Human Annotation
The quality of test sets plays a pivotal role in reasoning about a system's performance. Therefore, a group of annotators went through the instances of the test set and assessed whether the value either appeared in the article or can be inferred from it. To make further analysis possible, we provide both datasets, before (test-A) and after (test-B) annotation.
The annotation process was non-trivial due to vagueness of the inferability definition, and the scientific character of the considered text. It was required to understand advanced encyclopedic articles e.g., about chemistry, biology, or astronomy, to answer domain-specific properties (scientific classifications or biological taxonomy), which are only possible with deep knowledge about the world and with the ability to learn during the process. Moreover, linguistic skills were required to transliterate and transcribe first and last names. Note that we consider the value which appears in a different writing script as inferable. Due to the stated issues, we decided to rely on highly trained linguists as annotators.
The process was supported by several heuristics. In particular, approximate string matching was used to highlight fragments of presumably high importance. Nevertheless, it took seven linguists more than 100 hours in total to complete. On average, two minutes and thirty seconds were required to verify data assigned to one Wikipedia article.
The relevance of the annotation described above is demonstrated by the fact that 28% of the property-value pairs were marked as unanswerable and removed. As shown later, the Mean-Multi-Property-F1 on the pre-verification test-A was approximately 15 points lower (62.6 versus 77.5 for the Vanilla Dual-Source Transformer), and 8% of articles were removed entirely from test-B during the annotation process.

Diagnostic Subsets
We determined auxiliary validation subsets with specific qualities, not only to help improve data analysis but also to provide additional information at different stages of development of a system. The qualities we measure and their definitions are provided below.
Rare, unseen. Rare and unseen properties were distinguished depending on their frequency. A rare property occurs fewer than 4,000 times in the train set, whereas an unseen property does not occur in the train set at all.
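The frequency-based assignment can be sketched as follows. This is an illustration under stated assumptions, not the released implementation: the function name `frequency_bucket` and the label "frequent" for the remaining properties are hypothetical, while the threshold of 4,000 comes from the definition above.

```python
RARE_THRESHOLD = 4000  # threshold stated in the text


def frequency_bucket(prop, train_counts, rare_threshold=RARE_THRESHOLD):
    """Assign a property to 'unseen', 'rare', or 'frequent' by its train-set count.

    'frequent' is a hypothetical label for properties outside both subsets.
    """
    count = train_counts.get(prop, 0)
    if count == 0:
        return "unseen"
    if count < rare_threshold:
        return "rare"
    return "frequent"


# Toy counts; only the continent count (20060) is taken from the text.
train_counts = {"continent": 20060, "architect": 1200}
buckets = {
    name: frequency_bucket(name, train_counts)
    for name in ("continent", "architect", "sister")
}
```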
Categorical, relational. We denote a property as categorical if its value set contains a limited number of values; otherwise, it is relational. We apply normalized entropy with a threshold of 0.7 to obtain properties that belong to the categorical subset. For instance, the continent property occurs 20060 times, but with 13 possible values, its normalized entropy equals 0.43; hence it is marked as categorical. This splitting method is not ideal, but we wanted to use the same method as in (Hewlett et al., 2016). For example, if the distribution of continents was uniform, the property would have been classified as relational. However, in practice, it almost never happens.
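The categorical/relational distinction can be illustrated with a short sketch. It assumes that normalized entropy is the Shannon entropy of a property's value distribution divided by the logarithm of the number of distinct values, so that a uniform distribution scores 1.0; this normalization is an assumption consistent with the uniform-distribution remark above, and the function names are hypothetical.

```python
import math
from collections import Counter


def normalized_entropy(values):
    """Shannon entropy of the value distribution, scaled to [0, 1].

    Assumption: normalization divides by log(number of distinct values).
    """
    counts = Counter(values)
    if len(counts) < 2:
        return 0.0  # a single distinct value carries no uncertainty
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))


def property_type(values, threshold=0.7):
    """Label a property as categorical when its normalized entropy is below the threshold."""
    return "categorical" if normalized_entropy(values) < threshold else "relational"


skewed = ["Europe"] * 97 + ["Asia", "Africa", "Oceania"]
uniform = ["Europe", "Asia", "Africa", "Oceania"] * 25
```

Under this assumption, `property_type(skewed)` yields the categorical label and `property_type(uniform)` the relational one, mirroring the observation that a uniformly distributed continent property would be classified as relational.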
Exact match. The exact match category applies to cases where the expected value is mentioned directly in the source text.
Long articles. Instances with articles longer than 695 words (threshold qualifying to the top 15% longest articles in the train set) constitute the long articles diagnostic set.
Characteristics of different systems can be compared qualitatively by evaluating on these subsets. For instance, the long articles subset is challenging for systems that consume truncated inputs. Unseen is precisely constructed to assess systems' ability to extract previously not seen properties. On the other hand, rare can be viewed as an approximation of the system's performance on a lower-resource downstream extraction task. The categorical subset is useful in assessing approaches featuring a classifier, whereas it is suboptimal to use such systems for relational due to richer output space. Similarly, the exact match can be approached with sequence tagging solutions. The share of each diagnostic subset is presented in Table 4.

Model Architectures
We evaluate different model architectures on the WikiReading Recycled dataset.
We reimplemented the previously best-performing WikiReading model, finetuned pretrained Transformer models, and applied a dual-source model. Their competitiveness can be demonstrated by the fact that we were able to outperform the previous state-of-the-art on WikiReading by a wide margin.
Basic seq2seq. A straightforward approach to single-property extraction is to use an LSTM sequence-to-sequence model where the input consists of a property name concatenated with the considered input text. To compare with the previous results, we reproduced the basic sequence-to-sequence model proposed by Hewlett et al. (2016).
Vanilla Transformer. A more up-to-date solution is to use the Transformer architecture (Vaswani et al., 2017) instead of an RNN, and a subword tokenization method, such as unigram LM tokenization (Kudo, 2018). We use the term vanilla to denote a model trained from scratch.
Vanilla Dual-Source Transformer. The Transformer architecture was extended to support two inputs and successfully applied in Automatic Post-Editing. We propose to reuse this Dual-Source Transformer architecture in the property extraction tasks. The architecture consists of two encoders that share parameters and a single decoder. Moreover, the encoders and decoder share embeddings and vocabulary. In our approach, the first encoder is fed with the text of an article, and the second one takes the names of properties (Figure 1). The model is trained to generate a sequence of pairs: (property, value) separated with a special symbol.
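The target format of the dual-source model can be sketched as a pair of helpers that linearize (property, value) pairs into one string and parse a generated string back. The separator tokens `<sep>` and `<val>` are hypothetical placeholders; the paper states only that pairs are separated with a special symbol.

```python
PAIR_SEP = "<sep>"  # hypothetical symbol separating consecutive pairs
KEY_SEP = "<val>"   # hypothetical symbol separating a property from its value


def linearize(pairs):
    """Turn a list of (property, value) pairs into a single decoder target string."""
    return f" {PAIR_SEP} ".join(f"{prop} {KEY_SEP} {value}" for prop, value in pairs)


def parse(sequence):
    """Recover (property, value) pairs from a generated string, skipping malformed chunks."""
    pairs = []
    for chunk in sequence.split(PAIR_SEP):
        if KEY_SEP not in chunk:
            continue  # the model may emit an incomplete pair; ignore it
        prop, value = chunk.split(KEY_SEP, 1)
        pairs.append((prop.strip(), value.strip()))
    return pairs


target_pairs = [
    ("date of birth", "1915-01-12"),
    ("country of citizenship", "Russian Empire"),
    ("country of citizenship", "Soviet Union"),
]
target = linearize(target_pairs)
```

Because earlier pairs are part of the decoder context when later ones are generated, a format of this kind is what allows an autoregressive model to condition one property's value on the values already predicted for others.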
Dual-Source RoBERTa. Recent research shows that pretrained language models can improve performance on downstream tasks (Radford et al., 2018). Therefore, we experimented with the pretrained RoBERTa language model as an encoder. RoBERTa models were developed as a hyperoptimized version of BERT with a byte-level BPE and a considerably larger dictionary (Liu et al., 2019;Devlin et al., 2019). All the model parameters, including the RoBERTa weights, were further optimized on the WikiReading Recycled task.
T5. The recently proposed T5 model (Raffel et al., 2020) is a Transformer model pretrained on a cleaned version of CommonCrawl. T5 is known for achieving excellent performance on the SuperGLUE benchmark. To create a model input, we concatenate a property name and an article. In the case of MPE, we reduce the dataset to the single-property setting, as used by the T5 model's authors.

Evaluation
In this section, we describe the evaluation of previously proposed architectures on both WikiReading and WikiReading Recycled datasets. We would like to highlight that the results are not comparable between the two datasets, as they are based on different train/validation/test splits.

Metrics
The performance of systems is evaluated using the F1 metric, adapted for the WikiReading Recycled format. For WikiReading, Mean-F1 follows the originally proposed micro-averaged metric and assesses F1 scores for each property instance, averaged over the whole test set. Let $E$ denote a set of expected property-value pairs and $O$ a set of model-generated property-value pairs. Assuming $|\cdot|$ stands for set cardinality, precision and recall can be formulated as follows:

$$P(E, O) = \frac{|E \cap O|}{|O|}, \qquad R(E, O) = \frac{|E \cap O|}{|E|}.$$

Then F1 is computed as a harmonic mean:

$$F_1(E, O) = \frac{2 \cdot P(E, O) \cdot R(E, O)}{P(E, O) + R(E, O)}.$$

Given a sequence $\mathcal{E} = (E_1, E_2, \ldots, E_n)$ of expected answers for $n$ test instances, and the associated sequence of predictions $\mathcal{O} = (O_1, O_2, \ldots, O_n)$, we calculate Mean-F1 as:

$$\text{Mean-}F_1(\mathcal{E}, \mathcal{O}) = \frac{1}{n} \sum_{i=1}^{n} F_1(E_i, O_i).$$

In WikiReading Recycled, we adjust the metric to handle many properties in a single test instance. To do that, the sets $E_i$ and $O_i$ contain values from many properties at once, and $n$ is equal to the number of articles. Note that in the case of Mean-F1, properties are considered as instances. We call our article-centric metric Mean-Multi-Property-F1, or MMP-F1 for short.
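A minimal sketch of the article-centric metric, following the set-based definitions of precision, recall, and F1 given in this section. The function names are hypothetical, and returning 0.0 when there is no overlap (including when both sets are empty) is a convention assumed here rather than one stated in the paper.

```python
def f1(expected, predicted):
    """F1 between a set of expected and a set of predicted (property, value) pairs."""
    expected, predicted = set(expected), set(predicted)
    overlap = len(expected & predicted)
    if overlap == 0:
        return 0.0  # assumed convention; also avoids division by zero
    precision = overlap / len(predicted)
    recall = overlap / len(expected)
    return 2 * precision * recall / (precision + recall)


def mean_multi_property_f1(expected_sets, predicted_sets):
    """Average per-article F1; each element holds all pairs of one article."""
    if len(expected_sets) != len(predicted_sets):
        raise ValueError("expected and predicted must have the same length")
    if not expected_sets:
        return 0.0
    scores = [f1(e, o) for e, o in zip(expected_sets, predicted_sets)]
    return sum(scores) / len(scores)


expected = [
    {("date of birth", "1915-01-12"),
     ("country of citizenship", "Russian Empire"),
     ("country of citizenship", "Soviet Union")},
    {("continent", "Europe")},
]
predicted = [
    {("date of birth", "1915-01-12"),
     ("country of citizenship", "Soviet Union"),
     ("place of birth", "Moscow")},
    {("continent", "Europe")},
]
score = mean_multi_property_f1(expected, predicted)
```

In this toy example the first article has two of three pairs correct in both directions (F1 of 2/3) and the second is fully correct, so the mean over the two articles is 5/6. Passing one set per property instance instead of one set per article gives the property-centric Mean-F1.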

Training Details
Since the description of the basic seq2seq model omitted some essential details, we had to make assumptions before training. For example, we assumed that the model consisted of unidirectional LSTMs and that truecasing was applied to the output. The rest of the parameters followed the description provided by the authors. An extensive hyperparameter search was conducted for both Dual-Source Transformers on the WikiReading Recycled task. In the case of the Dual-Source Transformer evaluated on WikiReading, we restricted ourselves to hyperparameters following the default values specified in the Marian NMT Toolkit. The only difference was the reduction of encoder and decoder depths to 4.
For the Vanilla Dual-Source Transformer evaluation, both WikiReading and WikiReading Recycled datasets were processed with a SentencePiece model (Kudo, 2018) trained on a concatenated corpus of inputs and outputs with a vocabulary size of 32,000. The Dual-Source RoBERTa model is initialized with RoBERTa-Base (consisting of 12 encoder layers and a dictionary of 50,000 subword units).
In the case of the T5 model, we keep hyperparameters as close as possible to those used during pretraining. The training continues with restored AdaFactor parameters. We finetuned the small version of the model in a supervised-only manner.
We truncate the input to the first 512 tokens for all our models.

Model                                  Mean-F1
Basic s2s (Hewlett et al., 2016)       70.8
Placeholder s2s (Choi et al., 2017)    75.6
SWEAR                                  76.8

Hyperparameter optimization was performed using the Tree-structured Parzen Estimator algorithm (Bergstra et al., 2011) with additional heuristics and Gaussian priors resulting from the default settings proposed for this sampler in the Optuna framework (Akiba et al., 2019). An evaluation was performed every 8,000 steps, and validation-based early stopping was applied when no progress was achieved in 3 consecutive validations. A total of 250 trials were performed for each architecture. Intermediate results of each trial were monitored and used to ensure only the top 10% of trials were allowed to continue. Details of the hyperparameter optimization are presented in Appendix A.
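The validation-based early stopping rule can be illustrated with a short sketch. It assumes that "no progress" means no strict improvement over the best earlier validation score; the function name `should_stop` and the example scores are hypothetical.

```python
def should_stop(validation_scores, patience=3):
    """Return True when the last `patience` validations brought no improvement.

    `validation_scores` holds one score per validation run (here, one run
    every 8,000 training steps), in chronological order.
    """
    if len(validation_scores) <= patience:
        return False  # not enough history to compare against
    best_before = max(validation_scores[:-patience])
    best_recent = max(validation_scores[-patience:])
    # Assumption: a tie with the earlier best does not count as progress.
    return best_recent <= best_before


history = [70.0, 72.0, 71.0, 72.0, 71.5]
stop_now = should_stop(history)
```

With this history, the best score of 72.0 was reached at the second validation and not exceeded in the following three, so training would stop.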

Results on WikiReading
Although the main focus of our evaluation is the WikiReading Recycled dataset, we additionally evaluate whether the Vanilla Dual-Source Transformer can improve the state-of-the-art on WikiReading.
We reproduced the Basic seq2seq model. It achieved a Mean-F1 score of 74.8, which is 4 points higher than reported by Hewlett et al. (2016). The difference may be caused by poor optimization in the original work. Our dual-source solution achieves 82.4 and outperforms the previous state-of-the-art model by 5.6 Mean-F1 points. To measure the impact of using two encoders instead of one, we evaluated the Vanilla Single-Source Transformer, which takes a concatenated pair of article and property as its input. Our dual-source model outperformed its single-source counterpart by 3.1 points. Table 6 presents the final results.

Results on WikiReading Recycled
The results on WikiReading show that the Dual-Source Transformer is beneficial to the Property Extraction task. On WikiReading Recycled, we supplement the evaluation with pretrained models: Dual-Source RoBERTa and T5. Table 7 presents Mean-Multi-Property-F1 scores on the annotated test set (test-B). All the transformer-based models outperform the Basic seq2seq. The Dual-Source Transformer achieved 77.5 Mean-Multi-Property-F1. Its pretrained version, Dual-Source RoBERTa, improves the result by 1.4 points. As the T5 model beats the Vanilla Dual-Source Transformer, we may conclude that even though the WikiReading Recycled dataset is very large, the pretraining is crucial for this MPE task. It is worth remembering that the results on WikiReading and WikiReading Recycled are not comparable due to the dissimilarities in metrics and datasets. We will elaborate on that in Section 7.

Discussion and Analysis
The final scores of transformer-based models differ slightly on WikiReading Recycled. In order to get more insight, we analyze the models on diagnostic sets described in Section 4.4.
Impact of Property Frequency. We provide two diagnostic sets related to property frequency: unseen and rare. Both dual-source models failed on the unseen subset. These models ignored the unseen properties from the input and did not generate any answer. The best result was achieved by the T5 model (10.9 points), albeit it still does not meet expectations.
The results on the rare subset show that the pretraining makes a difference if properties are infrequent in the train set (Figure 2).
Impact of Property Type. The extraction of some properties may be treated as a classification task since the set of their valid values is limited. In this case, all models perform similarly and achieve approximately 85 Mean-Multi-Property-F1. The difficulty of the task increases proportionally to the normalized entropy value, which may lead to the divergence of model performances. This phenomenon is visible in the case of our Basic seq2seq, where the weakness is evident above the 0.5 threshold. The details are presented in Figure 3.
Exact Match and Long Articles. The results from the exact match and long articles subsets are correlated with the scores attained on the test-B set; however, the absolute values achieved differ substantially. This is because the long article subset is more challenging, as the chance of an answer appearing in the constant-length prefix decreases with the length of the article. The use of recently introduced models like LongFormer (Beltagy et al., 2020) and BigBird (Zaheer et al., 2020) might decrease the gap in scores between long and average-length articles. On the other hand, system performance should increase when the answer is provided directly in the text, as can be found in the exact match subset. This considerable decrease in score shows that the WikiReading Recycled test-A set is more difficult than WikiReading. The reason behind this is that we removed leakage of articles between splits, and we also added more infrequent properties that are harder to answer.
Impact of Human Annotation. The Vanilla Dual-Source Transformer was evaluated on both WikiReading Recycled test sets. It obtained a Mean-Multi-Property-F1 of 62.6 on the non-annotated test-A set, while achieving 77.5 on the annotated test-B. This discrepancy suggests that the linguists indeed succeeded in removing non-inferable properties. We anticipate that cleaning the train set in a similar fashion could improve the stability of the training and the overall results.

Summary
We introduced WikiReading Recycled, the first Multi-Property Extraction dataset with a human-annotated test set. We provided strong baselines that improved the current state-of-the-art on WikiReading by a large margin. The best-performing architecture was successfully adapted from Automatic Post-Editing systems. We show that using pretrained language models increases the performance on the WikiReading Recycled dataset significantly, despite its large size. Additionally, we created diagnostic subsets to qualitatively assess model performance. The results on a challenging subset of unseen properties reveal that despite high overall scores, the evaluated systems fail to provide satisfactory performance.
Low scores indicate an opportunity to improve, as these properties were verified by annotators and are expected to be answerable. We look forward to seeing models closing this gap and leading to remarkable progress in Machine Reading Comprehension.
The dataset and models, as well as their detailed configurations required for reproducibility, are publicly available.