REflex: Flexible Framework for Relation Extraction in Multiple Domains

Systematic comparison of methods for relation extraction (RE) is difficult because many experiments in the field are not described precisely enough to be completely reproducible and many papers fail to report ablation studies that would highlight the relative contributions of their various combined techniques. In this work, we build a unifying framework for RE and apply it to three widely used datasets (from the general, biomedical and clinical domains), with the ability to extend it to new datasets. By performing a systematic exploration of modeling, pre-processing and training methodologies, we find that pre-processing choices are a large contributor to performance and that omission of such information can further hinder fair comparison. Other insights from our exploration allow us to provide recommendations for future research in this area.


Introduction
Relation Extraction (RE) has gained a lot of interest from the community since the introduction of the Semeval tasks in 2007 (Girju et al., 2007) and 2010 (Hendrickx et al., 2009). The task is a subset of information extraction (IE) with the goal of finding semantic relationships between concepts in a given sentence, and is an important component of Natural Language Understanding (NLU). Applications include automatic knowledge base creation, question answering, and analysis of unstructured text data. Since the introduction of RE tasks in the general and medical domains, many researchers have explored the performance of different neural network architectures on the datasets (Socher et al., 2012; Zeng et al., 2014; Liu et al., 2016b; Sahu et al., 2016).
However, progress in RE is hampered by reproducibility issues as well as the difficulty in assessing which techniques in the literature will generalize to novel tasks, datasets and contexts. To assess the extent of these problems, we performed a manual review of 53 relevant neural RE papers citing the three datasets (Hendrickx et al., 2009; Segura-Bedmar et al., 2013; Uzuner et al., 2011). The procedure for finding these papers is described in Chauhan (2019).
Reproducibility Reproducibility is important for validating previous work and building upon it (Fokkens et al., 2013). Lack of reproducibility can be attributed to many factors, such as unavailability of source code (Ince et al., 2012) and omission of sources of variability such as hyperparameter details (Claesen and De Moor, 2015). We found that only 16 of the 53 relevant papers had released their source code. 14 of the 53 papers were evaluated on multiple datasets, but the source code was publicly available for only five of those. Even where code was released, much of it lacked the modularity needed to extend it easily to new datasets. In many cases, the process of reproducing the paper results was unclear, and lack of documentation made this more difficult. Even though most papers mentioned some hyperparameter details, important details were missing, such as the number of epochs, batch size, random initialization seed (if any), and details about early stopping if that technique was applied.
Ablation Studies Lack of generalizability is caused by a dearth of appropriate empirical evaluation to identify the source of modeling gains. Ablation studies are important for identifying sources of improvements in results. Among the 53 papers that we looked at, 20 of the 24 papers in the general domain performed ablation studies. However, only 10 out of 29 papers in the medical domain performed one. Among these ablation studies, key details related to pre-processing were missing, which we found critical in our experiments.
In the absence of such information about causes of large variability of results, fair comparison of models becomes difficult. In this paper, we present an open-source unifying framework enabling the comparison of various training methodologies, pre-processing, modeling techniques, and evaluation metrics. The code is available at https://github.com/geetickachauhan/relation-extraction.
The experimental goals of this framework are to identify sources of variability in results for the three datasets and to provide the field with a strong baseline model to compare against for future improvements. The design goals of this framework are to identify best practices for relation extraction and to serve as a guide for approaching new datasets.
By performing systematic comparison on three datasets, we find that 1) pre-processing choices can cause the largest variations in performance, 2) reporting scores on one test set split is problematic due to split bias. We perform other analyses in section 5 and also include recommendations for future research in this field in section 7.
Upon testing various combinations of our approaches, we achieve results near state-of-the-art ranges for the three datasets: 85.89% macro F1 for the Semeval 2010 task 8 dataset (Hendrickx et al., 2009), i.e. semeval; 71.97% macro F1 for DDI Extraction 2013 (Segura-Bedmar et al., 2013), i.e. ddi; and 71.01% micro F1 for the i2b2/VA 2010 relation classification dataset (Uzuner et al., 2011), i.e. i2b2. We refer to ddi and i2b2 as medical datasets, as they belong to the biomedical and clinical domains, respectively.

Table 1: Dataset information, with columns Rel = number of relations, Eval = evaluation metric (all F1 scores), Agreement = Inter-annotator agreement, Det = whether the detection task from section 3.4 was evaluated. Rel column only includes relations used in the official evaluation metric. ddi was built from two separately annotated sources and therefore contains two inter-annotator agreements.

Datasets
We summarize important information about these datasets in table 1. We introduce detection and classification tasks in section 3.4, but also indicate the tasks evaluated for each dataset in table 1.
Semeval 2010 semeval consists of 8,000 training sentences and 2,717 test sentences for the multi-way classification of semantic relations between pairs of nominals. Not included in the official evaluation is an Other class, which is considered noisy because annotators chose it when none of the other classes fit. It is important to note that this is a synthetically generated dataset, and detection scores were not calculated due to the noisy nature of the Other class.

Methodology
Our framework breaks up processing into different stages, allowing for future modular addition of components. First, a formatter converts the raw dataset into a common comma separated value (CSV) input format accepted by the pre-processor, and this information is then fed to the model, which performs the training, after which evaluation is performed on the test set. With our framework, we test the following variations in the main components:

Pre-Processing
We test various pre-processing methods after performing simple tokenization and lower-casing of the words: entity blinding as used by Liu et al. (2016b); stop-word removal, punctuation removal and digit normalization, commonly applied for ddi (Zhao et al., 2016); and named entity recognition (NER) based replacement, which we call NER blinding. We used the spaCy framework for tokenization and to identify punctuation and digits. Entity blinding and NER blinding are similar concept blinding techniques: the first is performed based on gold standard annotations, while the second is performed by running NER on the original sentence. We replace the words in the sentence matching the entity or named entity span with the target label and use the resulting sentences for training and testing.
Entity labels for semeval were not annotated with type information, whereas ddi identified drugs and i2b2 identified medical problems, tests and treatments. Therefore, entity labels for semeval were ENTITY, for ddi were DRUG and for i2b2 were PROBLEM, TREATMENT and TEST. In this paper, we use fine-grained concept type to refer to the presence of more than one concept type, as in the case of i2b2.
NER labels for semeval consisted of those provided by the large English model of spaCy, which provides standard types such as PERSON and ORGANIZATION, whereas those for the medical datasets were provided by the ScispaCy medium size model, which does not provide types (Neumann et al., 2019). In the latter case, blinding consisted of replacing the words in the sentence with Entity.
We chose the spaCy model for NER to complement the extendable design goals of REflex. Other options such as cTAKES (Savova et al., 2010) for clinical data and MetaMAP for biomedical data are highly specific to the dataset type and require running additional scripts outside of the REflex pipeline.
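The blinding step described in this section can be illustrated with a minimal sketch. The function name `blind_entities`, the (start, end, label) span format and the example sentence are assumptions made for illustration and are not taken from the REflex code; the sketch also assumes that spans do not overlap.

```python
from typing import List, Tuple

# Hypothetical span format: (first_token_index, last_token_index_inclusive, label).
Span = Tuple[int, int, str]


def blind_entities(tokens: List[str], spans: List[Span]) -> List[str]:
    """Replace each (non-overlapping) entity span with a single label token."""
    blinded = list(tokens)
    # Process spans from right to left so that earlier indices stay valid.
    for start, end, label in sorted(spans, reverse=True):
        blinded[start:end + 1] = [label]
    return blinded


# Made-up, lower-cased clinical-style sentence with two annotated concepts.
tokens = "the chest x-ray revealed a small pleural effusion".split()
spans = [(1, 2, "TEST"), (5, 7, "PROBLEM")]
print(" ".join(blind_entities(tokens, spans)))
```

The same function would cover NER blinding if the spans came from an NER model rather than from gold standard annotations, with every label set to Entity when the model provides no types.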

Modeling
We employ a baseline model based upon Zeng et al. (2014), Santos et al. (2015) and Jin et al. (2018), which is a convolutional neural network (CNN) with position embeddings and a ranking loss (referred to as CRCNN in this paper). We initialize the model with pre-trained word embeddings: the senna embeddings by Collobert et al. (2011) for the general domain dataset and the PubMed-PMC-wikipedia embeddings released by Pyysalo et al. (2013) for the medical domain. We test several perturbations on top of the CRCNN model, such as piecewise max-pooling, as suggested by Zeng et al. (2015), and the more recent ELMo embeddings by Peters et al. (2018). To compare different featurizations of contextualized embeddings, we also employ the embeddings generated by the BERT model (rather than the standard fine-tuning approach). For ELMo, we use the Original (5.5B) model weights for semeval and the PubMed contributed model weights for the medical datasets, released by Peters et al. (2018). For BERT, we use the BERT-large uncased model (without whole word masking) released by Devlin et al. (2018) for semeval, BioBERT by Lee et al. (2019) for ddi and Clinical BERT by Alsentzer et al. (2019) for i2b2.
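The ranking loss is not spelled out in this section. A common formulation is the pairwise ranking loss of Santos et al. (2015), and the sketch below assumes that formulation: the score of the correct class is pushed above a margin, the highest score among incorrect classes is pushed below a second margin, and the first term is dropped for examples of the artificial Other class. The function name and the default values for `gamma`, `m_pos` and `m_neg` are placeholders for illustration, not the settings used in REflex.

```python
import math
from typing import List, Optional


def ranking_loss(scores: List[float], gold: Optional[int],
                 gamma: float = 2.0, m_pos: float = 2.5,
                 m_neg: float = 0.5) -> float:
    """Pairwise ranking loss for a single example.

    scores: one score per relation class (the Other class has no score).
    gold:   index of the correct class, or None if the example is Other.
    """
    # Highest-scoring incorrect class (all classes when gold is None).
    negatives = [s for i, s in enumerate(scores) if i != gold]
    loss = math.log1p(math.exp(gamma * (m_neg + max(negatives))))
    if gold is not None:
        # Penalize a correct-class score that falls below the margin m_pos.
        loss += math.log1p(math.exp(gamma * (m_pos - scores[gold])))
    return loss


# The loss is small when the gold class clearly wins and large otherwise.
print(ranking_loss([3.0, -1.0, -2.0], gold=0))
print(ranking_loss([-1.0, 3.0, -2.0], gold=0))
```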
The fine-tuning approach, which tends to be computationally expensive, has been thoroughly explored for multiple tasks, including medical relation extraction by Lee et al. (2019), but the approach of featurizing them with an existing model has not been explored in the literature as much. We tested different ways of featurizing the BERT contextualized embeddings for researchers who want to utilize a less computationally intensive technique, while still aiming for performance gains for their task.
Because ELMo provides token level embeddings, we chose to concatenate them with the word and position embeddings from CRCNN before the convolution phase. However, BERT provides word-piece level as well as sentence level embeddings. The first was concatenated similar to ELMo (which we call BERT-tokens), while the second was concatenated with the fixed size sentence representation outputted after convolution of word and position embeddings (BERT-CLS).
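The two featurizations differ only in where the concatenation happens, which a toy example with plain lists can make concrete. The helper names and all dimensions below are made up for illustration, and the sketch assumes one contextual vector per token; how word pieces are aligned to tokens is not covered here.

```python
from typing import List

Vector = List[float]


def concat_token_features(word: List[Vector], position: List[Vector],
                          contextual: List[Vector]) -> List[Vector]:
    """Token-level concatenation before convolution (ELMo / BERT-tokens style)."""
    assert len(word) == len(position) == len(contextual)
    return [w + p + c for w, p, c in zip(word, position, contextual)]


def concat_sentence_feature(pooled: Vector, sentence_vec: Vector) -> Vector:
    """Sentence-level concatenation after convolution and pooling (BERT-CLS style)."""
    return pooled + sentence_vec


# Toy dimensions: 3 tokens, word dim 4, position dim 2, contextual dim 5.
word = [[0.0] * 4 for _ in range(3)]
position = [[0.0] * 2 for _ in range(3)]
contextual = [[0.0] * 5 for _ in range(3)]
conv_input = concat_token_features(word, position, contextual)
print(len(conv_input), len(conv_input[0]))  # 3 11
```

In the token-level variant the convolution filters see the contextual vector of every token in order, whereas in the sentence-level variant the contextual information only enters after pooling has removed word order.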

Training
We explore two ways of doing hyperparameter tuning: manual tuning and random search (Bergstra and Bengio, 2012).
Evaluating on three datasets meant that we needed to identify a default list of hyperparameters by tuning on one of the datasets before we could identify the hyperparameter list for the other two. We chose semeval for initial tuning due to its larger literature and because the CRCNN model was originally evaluated on this dataset. We started with reference hyperparameters listed in Zeng et al. (2014) and Santos et al. (2015) and identified default hyperparameters after tuning on a dev set randomly sampled from the training data of the semeval dataset. These default hyperparameters 4 were used as starting points for manual tuning on the medical datasets as well as random search for all datasets.
We perform manual tuning on a subset of the hyperparameters, mentioned in table 2. In order to avoid overfitting in cross validation pointed out by Cawley and Talbot (2010), we perform a nested cross validation procedure, keeping a dev fold for hyperparameter tuning and a held out fold for score reporting.
On these dev folds, we perform paired t-tests for each of the perturbations to the parameters listed in table 2. Our first pass involves changing one hyperparameter per experiment and noting the ones that cause a statistically significant improvement, which helps us identify a narrower list of hyperparameters to tune on. We further refine the hyperparameter values in our second pass by testing on values similar to those that were leading to statistically significant improvements in the first pass. For example, if we noticed that lower epoch values were helpful in the first pass, we tested them in combination with the other optimal hyperparameter values (from first pass) in the second pass.
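The paired t-test used on the dev folds can be sketched with the standard library alone. The per-fold scores below are made up for illustration, and the function name is an assumption; only the t statistic is computed, and it is compared against the tabulated two-sided critical value for 4 degrees of freedom.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def paired_t_statistic(baseline: Sequence[float],
                       variant: Sequence[float]) -> float:
    """t statistic for paired per-fold scores (degrees of freedom: n - 1)."""
    diffs = [v - b for b, v in zip(baseline, variant)]
    n = len(diffs)
    # Mean difference divided by its standard error.
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))


# Made-up dev-fold F1 scores for a baseline and one hyperparameter change.
baseline = [80.1, 79.4, 81.0, 80.3, 79.9]
variant = [81.0, 80.2, 81.5, 81.4, 80.6]
t = paired_t_statistic(baseline, variant)
# Two-sided critical value for alpha = 0.05 with 4 degrees of freedom: about 2.776.
print(f"t = {t:.2f}, significant at 0.05: {abs(t) > 2.776}")
```

Where SciPy is available, `scipy.stats.ttest_rel` returns the same statistic together with a p-value.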
For each of the datasets, we tuned based on their official challenge evaluation metrics listed in section 2. ddi and i2b2 had 5-fold nested cross validation performed on them, whereas semeval had 10-fold cross validation performed.
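One way to organize folds so that each rotation has a dev fold for tuning and a separate held-out fold for reporting is sketched below. The exact fold assignment used in REflex is not described in this section, so the choice of the next fold as the dev fold, the function name and the lack of shuffling are all assumptions made for illustration.

```python
from typing import Iterator, List, Tuple

Split = Tuple[List[int], List[int], List[int]]


def nested_folds(n_examples: int, k: int) -> Iterator[Split]:
    """Yield (train, dev, held_out) index lists for each of k rotations."""
    folds = [list(range(i, n_examples, k)) for i in range(k)]
    for i in range(k):
        dev_index = (i + 1) % k  # assumption: the next fold serves as dev
        train = [idx for j, fold in enumerate(folds)
                 if j not in (i, dev_index) for idx in fold]
        yield train, folds[dev_index], folds[i]


# Hyperparameters would be compared on dev; scores reported on held_out.
for train, dev, held_out in nested_folds(n_examples=10, k=5):
    print(len(train), len(dev), len(held_out))
```

In practice the example indices would be shuffled once with a fixed seed before being assigned to folds.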
Random search was performed based on the official evaluation metrics for each dataset, on a fixed dev set randomly sampled from the training data. Final distributions are listed in table 3.
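Random search over a fixed dev set amounts to sampling configurations from chosen distributions and keeping the best-scoring one. In the sketch below, the hyperparameter names, the sampling ranges and the toy scoring function are placeholders for illustration; they are not the distributions of table 3, and `toy_score` stands in for training a model and evaluating it on the dev set.

```python
import random
from typing import Callable, Dict, Tuple

Config = Dict[str, float]


def sample_config(rng: random.Random) -> Config:
    """Draw one configuration; all ranges here are illustrative placeholders."""
    return {
        "learning_rate": 10 ** rng.uniform(-4, -2),  # log-uniform draw
        "dropout": rng.uniform(0.3, 0.7),
        "num_epochs": rng.randint(50, 250),
        "batch_size": rng.choice([32, 64, 128]),
    }


def random_search(score_fn: Callable[[Config], float], n_trials: int,
                  seed: int = 0) -> Tuple[Config, float]:
    """Return the sampled configuration with the highest dev score."""
    rng = random.Random(seed)  # fixed seed keeps the search repeatable
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        score = score_fn(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_score(config: Config) -> float:
    """Stand-in for 'train, then evaluate on the dev set'."""
    return -abs(config["dropout"] - 0.5)


best_config, best_score = random_search(toy_score, n_trials=20)
print(best_config, best_score)
```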

Evaluation
The official challenge problems for all datasets compared models based on multi-class classification, but for the medical datasets, we were also interested in looking at the changes in model performance if we treated the task as a binary classification problem. This was based on the rationale that in the drug literature, for example, pharmacologists would not want to sacrifice the ability to identify a potentially life threatening drug interaction pair, even if the type of the drug pair is not known. Therefore, we report results for both multi-class and binary classification scenarios. For clarity, we refer to them in the rest of the paper as classification and detection respectively.

Footnote 4: listed in source code.
Detection results were obtained using our evaluation scripts by treating existing relations as one class, ignoring the types outputted by the model. The other class in this task was the None or Other class, representing non-existing relations. Note that we did not re-train our model for this.
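Deriving detection results from classification outputs without re-training can be sketched as a label-collapsing step followed by a binary score. The helper names and the example labels are assumptions for illustration, and F1 on the positive class is used here for simplicity; the actual evaluation scripts may aggregate differently.

```python
from typing import List

NEGATIVE_LABELS = {"None", "Other"}  # classes that mean "no relation"


def to_detection(labels: List[str]) -> List[int]:
    """Collapse every relation type into one positive class."""
    return [0 if label in NEGATIVE_LABELS else 1 for label in labels]


def binary_f1(gold: List[int], pred: List[int]) -> float:
    """F1 of the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up gold and predicted labels in the style of a drug interaction corpus.
gold = ["mechanism", "None", "effect", "advice", "None"]
pred = ["effect", "None", "effect", "None", "int"]
detection_f1 = binary_f1(to_detection(gold), to_detection(pred))
print(round(detection_f1, 3))
```

Note that the first example counts as correct for detection even though the predicted type is wrong, which is the behavior described above.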
In addition to evaluating on two tasks for the medical and one task for the general dataset, we comment on the implications of different evaluation metrics in section 5.5.

Results
For experiments on the medical datasets i.e. i2b2 and ddi, we used hyperparameters found from manual search individually performed on them. semeval had the default hyperparameters used for its experiments. These sets of hyperparameters were used in all experiments other than those reported in table 6, where we compare hyperparameter tuning methodologies.
Once we had a fixed set of hyperparameters for each dataset, we tested the perturbations for preprocessing as well as modeling in tables 4 and 5. Perturbations on the hyperparameter search are listed in table 6 and compare performance with different hyperparameter values found using different tuning strategies.
We generate the standard classification and the additional detection scores by the procedure described in section 3.4, and report these results under the Class and Detect columns.
We also report additional experiments in tables 7 and 8 based on the improvements found in tables 4 and 5. For all results tables, we report official test set results at the top, with accompanying cross validated results (averaged over all folds with their standard deviation) in smaller font below them.

Discussion
Recently, CNNs have achieved strong performance for text classification and are typically more efficient than recurrent architectures (Bai et al., 2018; Kalchbrenner et al., 2014; Wang et al., 2015; Zhang et al., 2015b). The speed of our baseline CRCNN model allows us to explore multiple alternatives for every stage of our pipeline. We discuss these results pertaining to the classification task for all datasets and the detection task for the medical datasets.

Pre-processing
Often, papers fail to mention the importance of pre-processing in performance improvements. Experiments in table 4 reveal that pre-processing choices can cause larger variations in performance than modeling choices. We applied pre-processing changes with the CRCNN model with default hyperparameters for semeval and manual hyperparameters for the medical datasets.
All comparisons are performed against the original pre-processing technique, which involved using the original dataset sentences in training and test.
Punctuation and digits hold more importance for the ddi dataset, which is a biomedical dataset, compared to the other two datasets. We looked at examples where this technique led to an incorrect prediction, but original pre-processing led to a correct one to investigate the source of performance further. The examples indicate that removal of punctuation is driving worse performance compared to the normalization of digits. A detailed analysis for these is present in (Chauhan, 2019).
Stop word removal is a common technique in Natural Language Processing (NLP) that simplifies a sentence by cutting out commonly used words such as "the" and "is". We found that stop words seem to be important for relation extraction for all three datasets that we looked at, to a smaller degree for i2b2 compared to the other two datasets. Looking at examples misclassified by this technique revealed important stop words for different relations, which indicates that the removal of stop words is not beneficial in the relation extraction setting. Example types are shown in Chauhan (2019).
The availability of fine-grained concept types is likely to boost performance in relation extraction settings. The i2b2 dataset provided fine-grained concept types in the form of medical problems, tests and treatments. There, entity blinding causes almost 9% improvement in classification performance and 1% improvement in detection performance. In contrast, ddi only provided gold standard annotations for drug types in the sentence, and while entity blinding does not cause statistically significant improvements for cross validation, it does improve test set classification performance by about 1.5% and detection performance by 1%. For these medical datasets, NER blinding consisted of replacing the detected named entities by Entity because named entity types were not available. Due to the coarse-grained nature of these entities, NER blinding hurts classification performance significantly, and detection performance a little.
While entity blinding hurts performance for semeval, possibly due to the coarse-grained nature of the replacement, NER blinding does not hurt performance. Looking at misclassified examples for entity blinding and NER blinding techniques supports this hypothesis (Chauhan, 2019).
To recall, entity blinding involved replacement of entity words by Entity, while NER blinding involved replacing named entities in the sentence with labels such as ORGANIZATION and PERSON. In settings where fine-grained entity blinding may not be helping, the entity types may still be helpful as added features into the model, as shown by Socher et al. (2012).
For the medical datasets, while classification performance varies highly with different pre-processing techniques, detection is relatively unaffected. In a setting where one cares more about detection of relationships rather than multi-class classification, one would be able to get away with using non-complicated pre-processing techniques to maintain reasonable performance.

Table 5: Modeling techniques with original pre-processing. Test set results at the top with cross validated results (average with standard deviation) below. All cross validated results are statistically significant compared to CRCNN model (p < 0.05) using a paired t-test except those marked with a •. In terms of statistical significance, comparing contextualized embeddings with each other reveals that BERT-tokens is equivalent to ELMo for i2b2, but for semeval BERT-tokens is better than ELMo and for ddi BERT-tokens is better than ELMo only for detection.

Split Bias
All three datasets evaluate models based on one score on the test set, which is common practice for NLP challenges. Reporting one score as opposed to a distribution of scores has been shown to be problematic by Reimers and Gurevych (2017) for sequence tagging. Recently, Crane (2018) discusses similar problems for question-answering. We show that even with the same random initialization seed (all our experiments have a fixed random initialization seed), train-test set split bias can be another source of variation in scores.
In our experiments, significance testing of some cross validated results reveals no significance even when the test set result improves in performance. This is particularly concerning for ddi, where entity blinding (called drug blinding in the literature) is used as a standard pre-processing technique without ablation studies demonstrating its effectiveness. Our results suggest the contrary: entity blinding seems to help test set performance for ddi in table 4, but shows no statistical significance. The same holds when the test set result worsens: BERT-CLS and Piecewise Pool in table 5 hurt test set performance on ddi, but the difference is not statistically significant when cross validation is performed. BERT-CLS improves the test set result for semeval but is not found to be statistically significant.

Modeling
We also tested the improvements offered by different featurizations of contextualized embeddings, which has not been explored much for relation extraction.
Modeling changes were applied with the original pre-processing technique for the CRCNN model with default hyperparameters for semeval and manual hyperparameters for the medical datasets. All comparisons are performed with the baseline performance of the CRCNN model.
While piecewise pooling helps i2b2 by 1%, it hurts test set performance on ddi and doesn't affect performance on semeval. While it may be intuitive to split pooling by entity location, this technique is not generalizable to other datasets.
We also found that while contextualized embeddings generally boost performance, they should be concatenated with the word embeddings before the convolution stage to cause a significant boost in performance. We found ELMo and BERT-tokens to boost performance significantly for all datasets, but that BERT-CLS hurt performance for the medical datasets. While BERT-CLS boosted test set performance for semeval, this was not found to be a statistically significant difference for cross validation. Note that we featurized ELMo similarly to BERT-tokens and the details are present in section 3.2.
This indicates that the technique of featurizing the contextualized embeddings is important for a CNN architecture. Concatenating the contextualized embeddings with the word embeddings keeps a tighter coupling, which is helpful for relation extraction where the word-level ordering might be essential in predicting the relation type.

Hyperparameter Tuning
Bergstra and Bengio (2012) show the superiority of random search over grid search in terms of faster convergence, but leave to future work automating the procedure of manual tuning, i.e. sequential optimization. Bayesian optimization strategies could help with this (Snoek et al., 2012) but often require expert knowledge for correct application. We tested how manual tuning, requiring less expert knowledge than Bayesian optimization, would compare to the random search strategy in table 6. For both i2b2 and ddi corpora, manual search outperformed random search.

Evaluation Metrics
Picking the right evaluation metric for a dataset is critical, and it is important to choose a metric that has the biggest delta between different model performances for example types we care about. Tables for different metric results for all datasets are provided in Appendix B.
When using micro and macro statistics (precision, recall and F1), class imbalance dictates the one to pick. Macro statistics are highly affected by imbalance, whereas micro statistics are able to recover well. Despite suffering due to class imbalance, though, macro statistics may be more appropriate than micro as they provide stronger discriminative capabilities by providing equal importance to classes of smaller sizes. However, micro statistics are as discriminative as macro statistics in settings when the classes are relatively balanced. The next two paragraphs discuss the classification task.
Compared to semeval, ddi and i2b2 suffer from stark class imbalances. Using micro statistics is reasonable for i2b2 because the highly imbalanced class is not included in the calculations. Therefore, this metric is able to be as discriminative as macro statistics. For example, test set micro F1 between baseline and entity blinding techniques is 59.75 and 68.76, while that for macro F1 is 36.44 and 43.76. In contrast, using micro statistics is a bad idea for ddi because the performance on the None class would drive most of the predictive results of the model. For example, micro-F1 between baseline and NER blinding is 88.69 and 86.18, whereas macro-F1 is 65.53 and 57.22. semeval does not have a stark contrast between micro and macro scores due to the Other class not being included in the calculation. Using either metric to evaluate models is reasonable for this dataset.
The detection task does not suffer from such variations due to the lower class imbalance. For example, ddi dataset micro-F1 between baseline and NER blinding model is 90.01 and 88.74, while macro-F1 is 81.74 and 79.03. This further suggests that modeling differences and pre-processing differences cause more variation in performance in settings when the class imbalance is higher.
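The difference between micro and macro F1 discussed in this section can be reproduced on a toy example. The sketch below assumes macro F1 is the unweighted mean of per-class F1 scores and micro F1 is computed from counts pooled over the evaluated classes; the function name and the labels are made up for illustration. Passing only a subset of labels as `classes` mimics excluding a None or Other class from the metric.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def micro_macro_f1(gold: Sequence[str], pred: Sequence[str],
                   classes: List[str]) -> Tuple[float, float]:
    """Micro and macro F1 restricted to `classes`."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            if g in classes:
                tp[g] += 1
        else:
            if p in classes:
                fp[p] += 1  # predicted class p, but it was wrong
            if g in classes:
                fn[g] += 1  # missed an instance of class g

    def f1(t: int, f_pos: int, f_neg: int) -> float:
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro


# Imbalanced toy data: class A dominates, class B is rare and half missed.
gold = ["A"] * 8 + ["B"] * 2
pred = ["A"] * 8 + ["A", "B"]
micro, macro = micro_macro_f1(gold, pred, classes=["A", "B"])
print(f"micro = {micro:.3f}, macro = {macro:.3f}")
```

On this toy data the error on the rare class lowers macro F1 more than micro F1, which is the sense in which macro statistics are more sensitive to small classes.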

Comparison with SOTA
The best classification test set results found are listed in table 9. Note that we do not compare the detection task for datasets other than ddi because the official challenges only compared classification results. Even though the official challenge did not rank models based on the detection task, recent papers in the ddi literature mention these results.
Wang et al. (2016) report a result of 88% on semeval and do not provide any public source code for replication purposes. Despite being below the state of the art range, REflex provides the best performing publicly available model for this dataset. Zheng et al. (2017) report the best result on ddi (77.3%) but perform negative instance filtering, which is a highly specific pre-processing technique that does not fit with the flexible nature of REflex. This technique cuts specific examples from the dataset, but the paper is unclear about whether the test data is shortened along with the training data. If the test data is being shortened, the performance comparison becomes unfair due to evaluation on different test samples. Unfortunately, source code was not publicly available to answer these questions. Note that Zhao et al. (2016) show that negative instance filtering causes a 4.1% improvement in test set performance. If REflex were to use this pre-processing technique, it would reach close to the state-of-the-art (SOTA) number on the classification task. On the other hand, our results on the detection task outperform this model by 2.53%. Sahu et al. (2016) (code unavailable) report a state of the art result of 71.16% on i2b2, which the results in table 9 are able to match. Note that Rink et al. (2011) report a result of 73.7% with a support vector machine, but they used a larger version of the dataset. Comparison against different subsets of the dataset would not be fair.
Comparison against these numbers demonstrates that REflex is the only open-source framework providing performance near SOTA ranges for the three datasets. Therefore, REflex can be used as a strong baseline model in future relation extraction studies.

Conclusion
Our findings reveal variations offered by pre-processing and training methodologies, which often go unreported. They indicate that comparing models without having these techniques standardized can make it difficult to assess the true source of performance gains. Our key findings are:
1. Pre-processing can have a strong effect on performance, sometimes more than modeling techniques, as is the case for i2b2. Concept types seem to offer useful information, perhaps revealing more general semantic information in the sentence that can help with predictions. Fine-grained gold standard annotated concept types are most beneficial, but those from automatically extracted packages may also be useful as long as they consist of multiple types. Punctuation and digits may hold more importance in biomedical settings, but stop words hold significance in all settings.
2. Reporting on one test set score can be problematic due to split bias, and a cross validation approach with significance tests may help ease some of this bias. Drug blinding for ddi is commonly used in the literature but does not seem to offer any statistically significant improvements. Therefore, it appears to be unnecessary for this dataset.
3. Contextualized embeddings are generally helpful but the featurizing technique is important: for CNN models, concatenating them with the word embeddings before convolution is most beneficial.
4. Picking the right hyperparameters for a dataset is important to performance. We suggest an initial manual hyperparameter search based on cross validation significance tests because that may be sufficient in most cases. If one is not pressed for time, random search is a reasonable automated option for hyperparameter tuning, but requires more experience for picking the right search space and the right distributions for the hyperparameters.
5. Picking the right evaluation metrics for a new dataset should be driven by class imbalance issues for the classes chosen to be evaluated on.