Scientific Keyphrase Identification and Classification by Pre-Trained Language Models Intermediate Task Transfer Learning

Scientific keyphrase identification and classification is the task of detecting keyphrases in scholarly text and classifying them with types from a set of predefined classes. This task has a wide range of benefits, but its performance is still limited by the lack of the large amounts of labeled data required for training deep neural models. To overcome this challenge, we explore the pre-trained language models BERT and SciBERT with intermediate task transfer learning, using 42 data-rich intermediate-target task combinations. We reveal that intermediate task transfer learning on SciBERT induces a better starting point for target task fine-tuning compared with BERT and achieves competitive performance in scientific keyphrase identification and classification compared to both previous works and strong baselines. Interestingly, we observe that BERT with intermediate task transfer learning fails to improve performance on scientific keyphrase identification and classification, potentially due to significant catastrophic forgetting. This result highlights that the scientific knowledge acquired during the pre-training of language models on large scientific collections plays an important role in the target tasks. We also observe that sequence tagging intermediate tasks, especially syntactic structure learning tasks such as POS Tagging, tend to work best for scientific keyphrase identification and classification.


Introduction
Scientific Keyphrase Identification and Classification (SKIC) is the task of identifying and classifying scientific terms in research papers. An effective keyphrase identification and classification system can benefit a wide range of natural language processing and information retrieval tasks, including question answering (Quarteroni and Manandhar, 2006), question generation (Subramanian et al., 2018), and expert finding (Chen et al., 2015). Several works formulate SKIC as a classification task using neural methods and word embeddings (Luan et al., 2018; Liu et al., 2017) and show promising results compared to earlier approaches based on hand-crafted features and sequence labeling models such as Conditional Random Fields (Lee et al., 2017).
Scientific keyphrases are very diverse, and this diversity makes the creation of large dataset collections burdensome: domain experts are required to obtain reliable annotations for keyphrase identification and classification, which results in data scarcity problems for deep neural networks. To overcome the small dataset size problem of SKIC, Augenstein and Søgaard (2017) propose deep multi-task learning and reveal that several related tasks such as noun and verb phrase chunking, super-sense tagging, and multi-word expression identification are helpful to SKIC. However, the results of this approach are still low, which implies that there is room for improvement.
Recent works in natural language processing show that employing unsupervised pre-trained language models such as BERT (Devlin et al., 2019) and SciBERT (Beltagy et al., 2019), a variant of BERT that is pre-trained on scholarly texts, can bring substantial improvements in the performance of various Natural Language Understanding (NLU) tasks (Jiang and de Marneffe, 2019; Klein and Nabi, 2019). Furthermore, Phang et al. (2018) and Pruksachatkun et al. (2020) propose to further improve these language models by using intermediate task transfer learning, a simple strategy of fine-tuning on a data-rich intermediate task before fine-tuning on the downstream target tasks. To this end, we investigate the performance of intermediate task transfer learning for SKIC. Specifically, we run experiments with seven intermediate tasks on BERT and SciBERT and six target tasks (three for keyphrase boundary identification and three for keyphrase type classification) using three datasets of scientific papers that cover different domains, e.g., Materials Science, Physics, Computer Science, Natural Language Processing, and Artificial Intelligence.
Our contributions are summarized as follows: First, we build a transfer learning framework employing a diverse range of intermediate tasks covering sequence tagging with semantic and syntactic aspects, and natural language inference. We achieve competitive performance over both strong baselines and previous works. While transfer learning using SciBERT successfully achieves improvement in performance, we observe that intermediate transfer learning using BERT causes catastrophic forgetting (Chronopoulou et al., 2019) with significant performance deterioration. Second, we empirically observe that specific task types are more useful as intermediate tasks for SKIC. Specifically, sequence tagging tasks such as POS tagging are preferable to natural language inference tasks such as entailment recognition. Third, we provide a qualitative analysis of extracted keyphrases and show that our transfer learning successfully returns keyphrases. This qualitative analysis demonstrates the reliability of our proposed methods.

Related Work
Scientific keyphrase identification and classification (SKIC) was proposed at SemEval 2017 Task 10 as Task A: Keyphrase Identification, and Task B: Keyphrase Classification. The authors of this shared task proposed to employ the BIO schema, whose tags refer to the beginning (B), inside (I), or outside (O) of keyphrases, respectively. Most of the earlier approaches to SKIC rely on hand-crafted linguistic features. For example, Lee et al. (2017) and Marsi et al. (2017) employ syntactic and semantic features, such as Part-of-Speech tags and word lemmas, as input to Conditional Random Fields (CRFs). Liu et al. (2017) formulate SKIC as a supervised multi-class classification problem. They exploit pre-trained word embeddings and linguistically inspired features, e.g., noun phrase features and orthographic features, which represent character and symbolic features of given tokens, as input to Support Vector Machines. With the success of neural models, recent works try to address SKIC using neural architectures while exploiting the BIO schema. Although both tasks, keyphrase identification and keyphrase classification according to their types, are very important, many works have focused only on keyphrase extraction/generation or identification/segmentation (Meng et al., 2017; Xiong et al., 2019; Patel and Caragea, 2019; Alzaidy et al., 2019; Chen et al., 2020). The classification task is less explored, possibly due to the lack of large gold-labeled keyphrase classification datasets. Precisely, there are only a few publicly available datasets for keyphrase classification (QasemiZadeh and Schumann, 2016; Luan et al., 2018), and these datasets are small in size. To overcome the small dataset size problem, Augenstein and Søgaard (2017) proposed multitask learning for SKIC.
They reveal that sequence tagging auxiliary tasks such as Chunking, Super-sense Tagging, Multi-word Expression Identification, and FrameNet Target Identification are beneficial to SKIC within a multitask learning framework (with one auxiliary task at a time).
Recent unsupervised pre-trained language models such as BERT (Devlin et al., 2019) achieve state-of-the-art performance in many downstream NLP tasks, such as named entity recognition, e.g., 95.5% F1 on the CoNLL-2003 dataset (Tjong Kim Sang and De Meulder, 2003). Han and Eisenstein (2019) apply fine-tuning of BERT to reduce the vocabulary gap between canonical domains and historical domains within an unsupervised approach. Moreover, domain-specific BERT models such as SciBERT (Beltagy et al., 2019) produce better results compared to the general-domain BERT on tasks that require scientific knowledge, such as scientific term classification, e.g., 64.57% F1 on the SciIE dataset (Luan et al., 2018). To exploit these pre-trained language models, Phang et al. (2018) propose data-rich intermediate task transfer learning and show that this method provides a better starting point for target tasks. Building on this, in our work, we propose intermediate task transfer learning to overcome the scarcity of large gold-labeled SKIC datasets.

Methodology
Given a document d = {w_1, w_2, . . . , w_n}, where n denotes the length of the document and w_k is a word in d (k = 1, . . . , n), the objective of SKIC is to identify and classify keyphrases of d by producing an output sequence {y_1, y_2, . . . , y_n}, where y_k follows the BIO scheme for pre-defined classes, e.g., task, material, process. In the BIO scheme, 'B' denotes the beginning word of a keyphrase, 'I' refers to an inside word of a keyphrase, and 'O' indicates all other words that are not part of a keyphrase. An example of input-output pairs of keyphrase identification and classification is shown in Figure 1.
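To make the BIO scheme concrete, the following sketch (our own illustration, not part of any system described in this paper; tokens and class names are hypothetical) recovers typed keyphrases from a BIO-tagged token sequence:

```python
def extract_keyphrases(tokens, tags):
    """Collect (phrase, type) pairs from a BIO-tagged token sequence.

    Tags look like 'B-Material', 'I-Material', or 'O'; a keyphrase starts
    at a B- tag and extends over consecutive I- tags of the same type.
    """
    phrases, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close a phrase that is still open
                phrases.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(token)
        else:  # 'O' or an inconsistent I- tag closes the open phrase
            if current:
                phrases.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        phrases.append((" ".join(current), current_type))
    return phrases
```

A model that predicts per-token BIO tags can be evaluated at the phrase level by comparing the output of such a decoding step against the gold spans.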

Experimental Pipeline
Our transfer learning consists of two phases, as shown in Figure 1: (1) the pre-trained language models BERT and SciBERT are each fine-tuned on a single intermediate task; (2) the fine-tuned parameters of the pre-trained language model are transferred to the target task and fine-tuned further on each target task. In target task fine-tuning, we apply different learning rates to retain knowledge acquired from the intermediate tasks. Specifically, as shown in Figure 1, the target task's fully connected classifier uses a higher learning rate, while the pre-trained language model's parameters are assigned a lower learning rate to retain knowledge.
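The two-rate scheme in phase (2) can be sketched as grouping parameters before building the optimizer. The parameter-name prefix and the concrete learning rates below are illustrative assumptions, not the exact values used in our experiments:

```python
def build_param_groups(named_params, encoder_lr=2e-5, head_lr=1e-3):
    """Split parameters into two optimizer groups: the pre-trained encoder
    keeps a small learning rate to retain intermediate-task knowledge,
    while the target-task classifier head uses a larger one.

    The 'classifier' prefix and the rate values are hypothetical choices
    for illustration; any name test separating head from encoder works.
    """
    encoder_params, head_params = [], []
    for name, param in named_params:
        if name.startswith("classifier"):
            head_params.append(param)
        else:
            encoder_params.append(param)
    return [
        {"params": encoder_params, "lr": encoder_lr},
        {"params": head_params, "lr": head_lr},
    ]
```

The returned list follows the per-parameter-group convention of common deep learning optimizers (e.g., PyTorch's AdamW accepts such a list directly).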

Intermediate Tasks
To explore which intermediate tasks are effective for keyphrase identification and classification, we choose a variety of tasks. Intermediate task statistics are shown in Table 1. Note that for intermediate tasks related to sequence tagging, we report phrase-level instance numbers.
We choose seven intermediate tasks based on the following considerations: 1) Augenstein and Søgaard (2017) propose several auxiliary tasks that improve the performance of SKIC within a multitask learning framework; 2) Pruksachatkun et al. (2020) prove the usefulness of Natural Language Inference (NLI) in intermediate task transfer learning for general Natural Language Understanding. In our setting, document understanding is essential to identifying the precise keyphrases (i.e., the important parts) of a document. In our experiments, we use two NLI tasks with different vocabulary sets, i.e., the general domain (MNLI) and the scientific domain (SciTail); 3) we include a traditional NLP sequence tagging task, i.e., Part-of-Speech Tagging, to understand whether a model that acquires a better understanding of grammatical structures improves the performance of SKIC. Detailed information on each intermediate task follows.
Supersense Supersense tagging (Johannsen et al., 2014) is the task of mapping semantic units such as compounds, phrases, and other linguistic units to WordNet's lexicographer classes, using the SemCor 3.0 corpus. In this work, we focus on the noun supersense classes Person, Location, and Group.
For example, in the sentence 'Boston Red Sox outfielder Jackie Jensen said he played baseball on Monday night', the supersense of Boston Red Sox is Group, and that of Jackie Jensen is Person.
MWEs Multiword Expression Identification and Classification (Schneider and Smith, 2015) is the task of detecting idiosyncratic noun and verb combinations and classifying them into high-level ontological semantic classes, using the Streusle corpus. For example, in the sentence 'We have been blessed to find elite flyers online and would not use anyone else to handle our postcards posters etc.', the classes are verb.cognition for blessed, verb.cognition for find, verb.social for use, and noun.artifact for postcards posters.
FrameNet FrameNet Target Identification (Das et al., 2014) is the task of detecting semantic frame-evoking words or word sequences in the FrameNet 1.7 corpus, which contains lexical and predicate-argument annotations. Frames are words or phrases that describe events, relations, objects, and the participants in them. For example, a target such as moist in a sentence evokes the frame Being as state.
SciTail SciTail (Khot et al., 2018) is the task of recognizing whether a hypothesis, constructed from a science question and its corresponding answer, is entailed by the premise. The dataset is collected via crowd-sourcing from multiple-choice science questions from 4th-grade and 8th-grade exams. For example, the relation between the two sentences 'Neurons receive information from dendrites which are then passed to the soma cell body.' and 'Dendrites from the cell body receives impulses from other neurons.' is labeled as Entail.
MNLI Multi-Genre Natural Language Inference (Williams et al., 2018) is the task of determining textual entailment in sentence pairs across a variety of genres of written and spoken English, ranging from fiction to face-to-face conversations. For example, the relation between the two sentences 'This approach provides perhaps a better technique for isolating the actual costs of the emissions caps.' and 'There is no way to estimate the actual cost of emissions caps.' is labeled as Contradiction.
Chunking Text Chunking is the task of detecting chunks, i.e., syntactically related, non-overlapping groups of words such as noun phrases and verb phrases, in the CoNLL-2000 shared task dataset (Tjong Kim Sang and Buchholz, 2000).

Experiments
Baseline We compare the intermediate task transfer learning with the following baselines.
• BiLSTM (Augenstein and Søgaard, 2017): A 3-layer BiLSTM with SENNA embeddings for each target task.

We train for a maximum of 10 epochs with the negative log-likelihood loss. Aside from these details, we follow the SciBERT paper for all other training hyper-parameters. All experiments are run on p3.2xlarge instances of Amazon Web Services. All model parameters are selected on the validation set of each task. We evaluate the performance of each model using phrase-level micro-averaged F1 with the exact match metric (Kim et al., 2010).

Table 3 presents our identification and classification results. We make the following observations. First, SciBERT generally shows higher performance across all of our target tasks in comparison to BERT. Interestingly, BERT and SciBERT have different vocabulary sets, with only a 42% overlap (Beltagy et al., 2019). We posit that the scientific knowledge available in SciBERT boosts the performance of SKIC. However, fine-tuning SciBERT on keyphrase identification for both the SemEval 2017 and ACL datasets remains challenging, since the performance is lower than the best performance of previous work (Augenstein and Søgaard, 2017). For example, the best baseline F1-score on SemEval 2017 keyphrase identification is 72.42%, whereas SciBERT fine-tuned on SemEval 2017 keyphrase identification achieves only 66.70%.
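The evaluation metric described above can be sketched as follows; this is our own minimal illustration of phrase-level micro-averaged F1 under exact match, with keyphrases represented as hypothetical (start, end, type) triples, not the exact scorer used in the experiments:

```python
def phrase_f1(gold_spans, pred_spans):
    """Micro-averaged phrase-level F1 with exact match: a predicted
    (start, end, type) span counts as a true positive only if an
    identical span exists in the gold annotation."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Under exact match, a span with a correct type but an off-by-one boundary scores zero, which makes this metric stricter than token-level accuracy.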

Discussion
Second, our SciBERT transfer learning achieves the best results in all identification and classification target tasks. This implies that intermediate task fine-tuning leads to better starting points for target task fine-tuning by injecting knowledge related to SKIC. Specifically, on the SemEval 2017 dataset, the model gains significant effectiveness when FrameNet is employed as an intermediate task for keyphrase identification (+7.19 F1) and when POS Tagging is employed for keyphrase classification (+8.69 F1). For the ACL dataset, we observe that exploiting POS Tagging as an intermediate task achieves the best performance in both identification (+6.16 F1) and classification (+4.79 F1). This implies that a better syntactic understanding ability of the model induces performance improvement. The SciIE dataset performs best with transfer from Supersense in keyphrase identification (+1.63 F1) and from MNLI in keyphrase classification (+7.79 F1). More interestingly, the contribution of intermediate tasks to target tasks differs by task type. When natural language inference tasks are employed as intermediate tasks, we generally observe performance degradation, while most of the sequence tagging tasks are consistently helpful for SKIC. Syntactic sequence tagging tasks such as Chunking and POS Tagging are also generally helpful to SKIC. Another important observation is that the domain difference between the natural language inference tasks (general vs. scientific domain) results in different outcomes. For example, on all three datasets, transferring from MNLI in SciBERT shows better performance than transferring from SciTail in BERT. MNLI and SciTail share a similar inference recognition setting but cover different domains, the general and scientific domains, respectively.
Third, we observe that BERT potentially suffers from catastrophic forgetting, as opposed to SciBERT. In BERT transfer learning, most of the intermediate tasks fail to provide useful knowledge for the target tasks (i.e., they appear to make the fine-tuned models forget the good information learned during the pre-training of BERT), and hence result in a severe deterioration of performance. For example, on the ACL RD-TEC 2.0 keyphrase identification task, when SciTail is employed as an intermediate task in BERT transfer learning, the performance decreases by as much as -15.73 F1 compared to single BERT fine-tuning. As we can see from Table 3, all cases of BERT intermediate task transfer learning on the ACL and SciIE datasets degrade performance in comparison to BERT, while most cases of SciBERT transfer learning improve performance compared to that of SciBERT.
Article from SemEval 2017 Task 10 PV cells are one of the most promising technologies for conversion of incident solar radiation into electric power. However, this technology is still far from being able to compete with fossil fuel-based energy conversion technologies because of its relatively low efficiency and energy density. Theoretically, there are three unavoidable losses that limit the solar conversion efficiency of a device with a single absorption threshold or band gap Eg: (1) incomplete absorption, where photons with energies below Eg are not absorbed; (2) thermalization or carrier cooling, where solar photons with sufficient energy generate electron-hole pairs and then immediately lose almost all energy in excess of Eg in the form of heat; and (3) radiative recombination, where a small fraction of the excited states radiatively recombine with the ground state at the maximum power output (Hanna & Nozik, 2006; Henry, 1980). Taking an air mass of 1.5 as an example, for different band gaps Eg these three losses can be calculated, and the results are indicated by areas S1, S2, and S3 in Fig. 1. Note that the area under the outer curve is the solar power per unit area, and that only S4 can be delivered to the load.

Error Analysis
We visualize confusion matrices of the best performing SciBERT transfer learning keyphrase classification results in Figure 2. Specifically, for the SemEval 2017 Task 10 dataset and the ACL dataset, we plot the results of SciBERT transfer learning from POS Tagging. For the SciIE dataset, we plot the result of SciBERT transfer learning from MNLI. The numbers in Figure 2 represent the proportion of keyphrases of each true class that are assigned to each predicted class. For example, in the SemEval 2017 Task 10 confusion matrix, the cell in row Process and column Task is the ratio of keyphrases predicted as Task, but whose true class is Process, to the total number of keyphrases whose true class is Process. Consequently, each row in every confusion matrix sums to 1.
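The row normalization described above can be sketched as follows (our own minimal illustration with hypothetical class labels, not the plotting code behind Figure 2):

```python
def row_normalized_confusion(pairs, classes):
    """Build a confusion matrix from (true, predicted) class pairs and
    normalize each row by its true-class total, so every row sums to 1
    whenever the true class occurs at least once."""
    counts = {t: {p: 0 for p in classes} for t in classes}
    for true, pred in pairs:
        counts[true][pred] += 1
    matrix = {}
    for true in classes:
        total = sum(counts[true].values())
        matrix[true] = {p: (counts[true][p] / total if total else 0.0)
                        for p in classes}
    return matrix
```

Normalizing by the true-class total makes the per-class error profile visible even for rare classes, which otherwise would be dwarfed by the majority class in raw counts.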
We observe that the SemEval 2017 data suffers from mis-classification of the class Task, possibly due to the imbalanced data distribution: the distribution of the categories Process, Material, and Task in SemEval 2017 Task 10 is 50%/35%/15%, respectively. Moreover, our SciBERT transfer learning makes erroneous predictions between the Process and Task categories due to the subjectivity of the two classes, as shown in Table 4. Table 4 presents an example article from SemEval 2017 and compares its keyphrase classification between gold labels and SciBERT transfer learning from POS Tagging. We observe that the keyphrase conversion of incident solar radiation into electric power is annotated as Process, while it is reasonable to think of it as Task, which is the prediction of our transfer learning model. This type of error is not necessarily a shortcoming of SciBERT, but rather of the data annotation and its subjectivity. This also confirms the difficulties of SKIC data collection.
For the ACL dataset, we observe that the category Language Resource (LR) is mis-classified as the Other class. For example, treebank is annotated as LR, but our model predicts it as Other. Further, data imbalance also causes performance degradation on the ACL dataset: the proportion of LR, one of the classes in the ACL dataset, is very small (5.7%), while the percentage of Other is significantly higher (42.8%). Interestingly, in contrast to the above two target datasets, on the SciIE dataset our SciBERT transfer learning generally performs very well.

Conclusion, Discussion, and Future Work
We investigated the performance of data-rich intermediate task transfer learning for scientific keyphrase identification and classification (SKIC) using the pre-trained language models BERT and SciBERT. We performed experiments on SciBERT and BERT with a total of 42 pairs of intermediate and target tasks, where intermediate tasks are drawn from sequence tagging and natural language inference. We found that employing sequence tagging tasks as intermediate tasks on SciBERT performs best on three publicly available keyphrase identification and classification datasets. Further, intermediate task transfer learning using SciBERT outperforms both the previous work and strong baselines by a large margin. Specifically, for the SemEval 2017 Task 10 dataset, using FrameNet as an intermediate task in SciBERT transfer learning yields a +7.19 F1 improvement in keyphrase identification, and using POS Tagging yields a +8.69 F1 improvement in keyphrase classification. For the ACL RD-TEC 2.0 dataset, using POS Tagging leads to +6.16 F1 in keyphrase identification and +4.79 F1 in keyphrase classification. For the SciIE dataset, using Supersense brings +1.63 F1 in keyphrase identification and +7.79 F1 in keyphrase classification. According to these results, intermediate tasks related to syntactic structure learning, such as POS Tagging, are preferable for SKIC tasks. Further, we explored the impact of the domain difference between our natural language inference intermediate tasks on the pre-trained language models. In particular, we empirically showed that using MNLI as an intermediate task in SciBERT transfer learning yields higher performance than employing SciTail as an intermediate task in BERT transfer learning. Future work in this area will benefit from improvements to the available intermediate tasks, and other related intermediate tasks remain to be explored.
Moreover, a better understanding of when and why these intermediate tasks are working is one of the interesting future directions.
Interestingly, we observe that BERT suffers from a serious drop in performance, possibly due to catastrophic forgetting. In particular, almost all of the intermediate tasks fail to provide better starting points for fine-tuning pre-trained language models on SKIC. While the SemEval 2017 Task 10 dataset achieves its best results when transferring from Chunking on BERT, this result is still lower than single SciBERT fine-tuning. On ACL RD-TEC 2.0 and SciIE, no intermediate task produces higher performance with BERT intermediate task transfer learning. One potential reason could be a large vocabulary gap between our domain and the collections used to pre-train BERT. In the future, we plan to analyze the differences between BERT and SciBERT to better understand the effects of transfer learning for SKIC.