SciBERT: A Pretrained Language Model for Scientific Text

Obtaining large-scale annotated data for NLP tasks in the scientific domain is challenging and expensive. We release SciBERT, a pretrained language model based on BERT (Devlin et al., 2019), to address the lack of high-quality, large-scale labeled scientific data. SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. We evaluate on a suite of tasks including sequence tagging, sentence classification, and dependency parsing, with datasets from a variety of scientific domains. We demonstrate statistically significant improvements over BERT and achieve new state-of-the-art results on several of these tasks. The code and pretrained models are available at https://github.com/allenai/scibert/.


Introduction
The exponential increase in the volume of scientific publications over the past decades has made NLP an essential tool for large-scale knowledge extraction and machine reading of these documents. Recent progress in NLP has been driven by the adoption of deep neural models, but training such models often requires large amounts of labeled data. In general domains, large-scale training data can often be obtained through crowdsourcing, but in scientific domains annotated data is difficult and expensive to collect because of the expertise required for quality annotation.
As shown by ELMo (Peters et al., 2018), GPT (Radford et al., 2018), and BERT (Devlin et al., 2019), unsupervised pretraining of language models on large corpora significantly improves performance on many NLP tasks. These models return contextualized embeddings for each token which can be passed into minimal task-specific neural architectures. Leveraging the success of unsupervised pretraining becomes especially important when task-specific annotations are difficult to obtain, as in scientific NLP. Yet while both BERT and ELMo have released pretrained models, they are trained on general-domain corpora such as news articles and Wikipedia.
In this work, we make the following contributions: (i) We release SCIBERT, a new resource demonstrated to improve performance on a range of NLP tasks in the scientific domain. SCIBERT is a pretrained language model based on BERT but trained on a large corpus of scientific text.
(ii) We perform extensive experimentation to investigate the performance of finetuning versus task-specific architectures atop frozen embeddings, and the effect of having an in-domain vocabulary.
(iii) We evaluate SCIBERT on a suite of tasks in the scientific domain, and achieve new state-of-the-art (SOTA) results on many of these tasks.

Methods
Background The BERT model architecture (Devlin et al., 2019) is based on a multilayer bidirectional Transformer (Vaswani et al., 2017). Instead of the traditional left-to-right language modeling objective, BERT is trained on two tasks: predicting randomly masked tokens and predicting whether two sentences follow each other. SCIBERT follows the same architecture as BERT but is instead pretrained on scientific text.
Vocabulary BERT uses WordPiece (Wu et al., 2016) for unsupervised tokenization of the input text. The vocabulary is built such that it contains the most frequently used words or subword units. We refer to the original vocabulary released with BERT as BASEVOCAB.
We construct SCIVOCAB, a new WordPiece vocabulary, on our scientific corpus using the SentencePiece library. We produce both cased and uncased vocabularies and set the vocabulary size to 30K to match the size of BASEVOCAB. The resulting token overlap between BASEVOCAB and SCIVOCAB is 42%, illustrating a substantial difference in frequently used words between scientific and general domain texts.
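The following is a minimal sketch of this step, assuming the corpus is available as a plain-text file with one sentence per line; the file names and training options are illustrative, not the paper's exact settings. Note that SentencePiece and WordPiece mark subwords differently (a ▁ prefix versus a ## continuation marker), so the two vocabularies would need light normalization before a direct overlap comparison.

```python
# Sketch: train a 30K subword vocabulary with SentencePiece and measure its
# overlap with an existing BERT vocabulary. Paths are hypothetical.
import sentencepiece as spm

# Train a subword model on the raw scientific text (one sentence per line).
spm.SentencePieceTrainer.train(
    input="scientific_corpus.txt",   # hypothetical path to the pretraining corpus
    model_prefix="scivocab",
    vocab_size=30000,
)

# Load both vocabularies as token sets and compute the fraction of shared tokens.
with open("basevocab.txt") as f:     # BERT's original vocab, one token per line
    base_vocab = {line.strip() for line in f}
with open("scivocab.vocab") as f:    # SentencePiece writes <model_prefix>.vocab
    sci_vocab = {line.split("\t")[0] for line in f}

overlap = len(base_vocab & sci_vocab) / len(sci_vocab)
print(f"Token overlap: {overlap:.0%}")
```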
Corpus We train SCIBERT on a random sample of 1.14M papers from Semantic Scholar (Ammar et al., 2018). This corpus consists of 18% papers from the computer science domain and 82% from the broad biomedical domain. We use the full text of the papers, not just the abstracts. The average paper length is 154 sentences (2,769 tokens), resulting in a corpus size of 3.17B tokens, similar to the 3.3B tokens on which BERT was trained. We split sentences using ScispaCy (Neumann et al., 2019), which is optimized for scientific text.
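As a small illustration of the sentence-splitting step, the sketch below runs ScispaCy's pipeline over an example passage (our own text); it assumes the en_core_sci_sm model is installed alongside spaCy.

```python
# Sketch: sentence segmentation with ScispaCy, as used to split full-text papers.
import spacy

nlp = spacy.load("en_core_sci_sm")   # ScispaCy model tuned for scientific text
text = ("We pretrain on the full text of papers rather than abstracts. "
        "The average paper contains 154 sentences (2,769 tokens).")
for sent in nlp(text).sents:
    print(sent.text)
```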
Experimental Setup

Tasks
We experiment on the following core NLP tasks:
1. Named Entity Recognition (NER)
2. PICO Extraction (PICO)
3. Text Classification (CLS)
4. Relation Classification (REL)
5. Dependency Parsing (DEP)
PICO, like NER, is a sequence labeling task where the model extracts spans describing the Participants, Interventions, Comparisons, and Outcomes in a clinical trial paper (Kim et al., 2011). REL is a special case of text classification where the model predicts the type of relation expressed between two entities, which are encapsulated in the sentence by inserted special tokens.
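To make the REL input format concrete, here is a small sketch of wrapping two entity spans with marker tokens before feeding the sentence to the classifier; the marker strings below are our own illustrative choice, not necessarily the exact tokens used in the released code.

```python
# Sketch: encapsulating two entities with special marker tokens for relation
# classification. Marker strings "<<", ">>", "[[", "]]" are assumptions.
def mark_entities(tokens, e1_span, e2_span):
    """Wrap two entity spans (inclusive start/end indices) with markers."""
    (s1, t1), (s2, t2) = e1_span, e2_span
    out = []
    for i, tok in enumerate(tokens):
        if i == s1:
            out.append("<<")
        if i == s2:
            out.append("[[")
        out.append(tok)
        if i == t1:
            out.append(">>")
        if i == t2:
            out.append("]]")
    return out

tokens = "We apply LSTM networks to dependency parsing .".split()
print(" ".join(mark_entities(tokens, e1_span=(2, 3), e2_span=(5, 6))))
# -> We apply << LSTM networks >> to [[ dependency parsing ]] .
```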

Datasets
For brevity, we only describe the newer datasets here, and refer the reader to the references in Table 1 for the older datasets. EBM-NLP (Nye et al., 2018) annotates PICO spans in clinical trial abstracts. SciERC (Luan et al., 2018) annotates entities and relations from computer science abstracts. ACL-ARC (Jurgens et al., 2018) and SciCite (Cohan et al., 2019) assign intent labels (e.g., Comparison, Extension) to sentences from scientific papers that cite other papers. The Paper Field dataset is built from the Microsoft Academic Graph (Sinha et al., 2015) and maps paper titles to one of 7 fields of study. Each field of study (i.e., geography, politics, economics, business, sociology, medicine, and psychology) has approximately 12K training examples.

Pretrained BERT Variants
BERT-Base We use the pretrained weights for BERT-Base (Devlin et al., 2019) released with the original BERT code. The vocabulary is BASEVOCAB. We evaluate both cased and uncased versions of this model.
SCIBERT We use the original BERT code to train SCIBERT on our corpus with the same configuration and size as BERT-Base. We train 4 different versions of SCIBERT: (i) cased or uncased and (ii) BASEVOCAB or SCIVOCAB. The two models that use BASEVOCAB are finetuned from the corresponding BERT-Base models. The other two models that use the new SCIVOCAB are trained from scratch.
Pretraining BERT for long sentences can be slow. Following the original BERT code, we set a maximum sentence length of 128 tokens and train the model until the training loss stops decreasing. We then continue training the model, allowing sentence lengths up to 512 tokens.
We use a single TPU v3 with 8 cores. Training the SCIVOCAB models from scratch on our corpus takes 1 week (5 days with max length 128, then 2 days with max length 512). The BASEVOCAB models take 2 fewer days of training because they are not trained from scratch.
All pretrained BERT models are converted to be compatible with PyTorch using the pytorch-transformers library. All our models, for both the finetuning and frozen-embedding setups described below, are implemented in PyTorch using AllenNLP (Gardner et al., 2017).
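As a rough sketch of how the converted weights can be loaded, the snippet below uses the current transformers API (the successor to pytorch-transformers) and assumes the released weights are published under the allenai/scibert_scivocab_uncased identifier on the Hugging Face hub.

```python
# Sketch: loading the converted SciBERT weights and embedding a sentence.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

inputs = tokenizer("The corpus callosum was segmented automatically.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)   # (1, num_wordpieces, 768)
```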
Casing We follow Devlin et al. (2019) in using the cased models for NER and the uncased models for all other tasks; we also use the cased models for parsing. Some light experimentation showed that the uncased models perform slightly better than the cased models, even sometimes on NER.

Finetuning BERT
We mostly follow the same architecture, optimization, and hyperparameter choices used in Devlin et al. (2019). For text classification (i.e., CLS and REL), we feed the final BERT vector for the [CLS] token into a linear classification layer. For sequence labeling (i.e., NER and PICO), we feed the final BERT vector for each token into a linear classification layer with softmax output. We differ slightly in using an additional conditional random field, which made evaluation easier by guaranteeing well-formed entities. For DEP, we use the model from Dozat and Manning (2017) with dependency tag and arc embeddings of size 100 and biaffine matrix attention over BERT vectors instead of stacked BiLSTMs.
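The sketch below is our own plain-PyTorch paraphrase of the two simplest heads (a [CLS] classifier and a per-token tagger), not the released AllenNLP code; the CRF layer and the biaffine parser are omitted for brevity.

```python
# Sketch: sentence-level and token-level classification heads on top of BERT.
import torch.nn as nn
from transformers import AutoModel

class BertSentenceClassifier(nn.Module):
    """Linear layer over the final [CLS] vector (CLS and REL tasks)."""
    def __init__(self, model_name, num_labels):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        cls_vec = hidden[:, 0]                        # final vector of the [CLS] token
        return self.classifier(self.dropout(cls_vec))

class BertTagger(nn.Module):
    """Linear layer over every token's final vector (NER and PICO tasks)."""
    def __init__(self, model_name, num_tags):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_tags)

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.classifier(self.dropout(hidden))  # per-token logits
```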
In all settings, we apply a dropout of 0.1 and optimize cross entropy loss using Adam (Kingma and Ba, 2015). We finetune for 2 to 5 epochs using a batch size of 32 and a learning rate of 5e-6, 1e-5, 2e-5, or 5e-5 with a slanted triangular schedule (Howard and Ruder, 2018), which is equivalent to the linear warmup followed by linear decay of Devlin et al. (2019). For each dataset and BERT variant, we pick the best learning rate and number of epochs on the development set and report the corresponding test results.
We found that the setting that works best across most datasets and models is 2 or 4 epochs with a learning rate of 2e-5. While optimal hyperparameters are task-dependent, they are often the same across BERT variants for a given task.
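A minimal sketch of this optimization setup is given below, using the linear warmup/decay schedule from the transformers library as a stand-in for the slanted triangular schedule; the 10% warmup fraction is our assumption, as the paper does not state one.

```python
# Sketch: Adam with a linear warmup / linear decay learning-rate schedule.
import torch
from transformers import get_linear_schedule_with_warmup

def make_optimizer_and_schedule(model, lr=2e-5, num_training_steps=1000):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.1 * num_training_steps),   # assumed 10% warmup
        num_training_steps=num_training_steps,
    )
    return optimizer, scheduler   # call scheduler.step() after each optimizer step
```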

Frozen BERT Embeddings
We also explore the usage of BERT as pretrained contextualized word embeddings, like ELMo (Peters et al., 2018), by training simple task-specific models atop frozen BERT embeddings.
For text classification, we feed each sentence of BERT vectors into a 2-layer BiLSTM of size 200 and apply a multilayer perceptron (with hidden size 200) on the concatenated first and last BiLSTM vectors. For sequence labeling, we use the same BiLSTM layers and use a conditional random field to guarantee well-formed predictions.
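Below is a rough sketch of the frozen-embedding text classifier in plain PyTorch (the paper's implementation uses AllenNLP); the CRF variant for sequence labeling is omitted.

```python
# Sketch: 2-layer BiLSTM + MLP trained on top of frozen BERT embeddings.
import torch
import torch.nn as nn
from transformers import AutoModel

class FrozenBertBiLSTMClassifier(nn.Module):
    def __init__(self, model_name, num_labels, hidden=200):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        for p in self.bert.parameters():
            p.requires_grad = False                      # BERT stays frozen
        self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(4 * hidden, hidden),               # first + last BiLSTM states
            nn.ReLU(),
            nn.Linear(hidden, num_labels),
        )

    def forward(self, input_ids, attention_mask):
        with torch.no_grad():
            emb = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        states, _ = self.lstm(emb)
        first, last = states[:, 0], states[:, -1]        # endpoints of the sequence
        return self.mlp(torch.cat([first, last], dim=-1))
```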
For DEP, we use the full model from Dozat and Manning (2017) with dependency tag and arc embeddings of size 100 and the same BiLSTM setup as the other tasks. We did not find changing the depth or size of the BiLSTMs to significantly impact results (Reimers and Gurevych, 2017).
We optimize cross entropy loss using Adam, keeping the BERT weights frozen and applying a dropout of 0.5. We train with early stopping on the development set (patience of 10) using a batch size of 32 and a learning rate of 0.001.
We did not perform extensive hyperparameter search; although optimal hyperparameters are task-dependent, some light experimentation showed that these settings work fairly well across most tasks and BERT variants.
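For completeness, here is a minimal sketch of the early-stopping loop described above; the data loaders and the evaluate function are placeholders for task-specific code.

```python
# Sketch: training loop with early stopping on development-set performance.
import torch

def train_with_early_stopping(model, train_loader, dev_loader, evaluate,
                              lr=1e-3, patience=10, max_epochs=100):
    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    best_score, bad_epochs = float("-inf"), 0
    for _ in range(max_epochs):
        model.train()
        for batch in train_loader:
            optimizer.zero_grad()
            logits = model(batch["input_ids"], batch["attention_mask"])
            loss_fn(logits, batch["labels"]).backward()
            optimizer.step()
        score = evaluate(model, dev_loader)        # e.g. development-set F1
        if score > best_score:
            best_score, bad_epochs = score, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:             # stop after 10 epochs w/o improvement
                break
    return best_score
```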

Results
Table 1 summarizes the experimental results. We observe that SCIBERT outperforms BERT-Base on scientific tasks (+2.11 F1 with finetuning and +2.43 F1 without). We also achieve new SOTA results on many of these tasks using SCIBERT.

Biomedical Domain
SCIBERT performs slightly worse than SOTA on 3 of the biomedical datasets. The SOTA model for JNLPBA is a BiLSTM-CRF ensemble trained on multiple NER datasets, not just JNLPBA (Yoon et al., 2018). The SOTA model for NCBI-disease is BIOBERT (Lee et al., 2019), which is BERT-Base finetuned on 18B tokens from biomedical papers. The SOTA result for GENIA is from Nguyen and Verspoor (2019), which uses the model from Dozat and Manning (2017) with part-of-speech (POS) features, which we do not use.
In Table 2, we compare SCIBERT results with reported BIOBERT results on the subset of datasets included in Lee et al. (2019). SCIBERT outperforms the reported BIOBERT results on BC5CDR and ChemProt, and performs similarly on JNLPBA, despite being trained on a substantially smaller biomedical corpus.

Computer Science Domain
We observe that SCIBERT outperforms BERT-Base on computer science tasks (+3.55 F1 with finetuning and +1.13 F1 without). In addition, SCIBERT achieves new SOTA results on ACL-ARC (Cohan et al., 2019) and on the NER part of SciERC (Luan et al., 2018). For relations in SciERC, our results are not comparable with those in Luan et al. (2018) because we perform relation classification given gold entities, while they perform joint entity and relation extraction.

Multiple Domains
We observe that SCIBERT outperforms BERT-Base on the multidomain tasks (+0.49 F1 with finetuning and +0.93 F1 without). In addition, SCIBERT outperforms the SOTA on SciCite (Cohan et al., 2019). No prior published SOTA results exist for the Paper Field dataset.

Effect of Finetuning
We observe improved results via BERT finetuning rather than task-specific architectures atop frozen embeddings (+3.25 F1 with SCIBERT and +3.58 with BERT-Base, on average). For each scientific domain, we observe the largest effects of finetuning on the computer science (+5.59 F1 with SCIBERT and +3.17 F1 with BERT-Base) and biomedical tasks (+2.94 F1 with SCIBERT and +4.61 F1 with BERT-Base), and the smallest effect on multidomain tasks (+0.7 F1 with SCIBERT and +1.14 F1 with BERT-Base). On every dataset except BC5CDR and SciCite, BERT-Base with finetuning outperforms (or performs similarly to) a model using frozen SCIBERT embeddings.

Effect of SCIVOCAB
We assess the importance of an in-domain scientific vocabulary by repeating the finetuning experiments for SCIBERT with BASEVOCAB. We find the optimal hyperparameters for SCIBERT-BASEVOCAB often coincide with those of SCIBERT-SCIVOCAB.
Given the substantial vocabulary differences (see Methods) and the magnitude of improvement over BERT-Base (see Results), we suspect that while an in-domain vocabulary is helpful, SCIBERT benefits most from pretraining on the scientific corpus.

Related Work
Recent work on domain adaptation of BERT includes BIOBERT (Lee et al., 2019) and CLINICALBERT (Alsentzer et al., 2019; Huang et al., 2019). BIOBERT is trained on PubMed abstracts and PMC full-text articles, and CLINICALBERT is trained on clinical text from the MIMIC-III database (Johnson et al., 2016). In contrast, SCIBERT is trained on the full text of 1.14M biomedical and computer science papers from the Semantic Scholar corpus (Ammar et al., 2018). Furthermore, SCIBERT uses an in-domain vocabulary (SCIVOCAB), while the other above-mentioned models use the original BERT vocabulary (BASEVOCAB).

Conclusion and Future Work
We released SCIBERT, a pretrained language model for scientific text based on BERT. We evaluated SCIBERT on a suite of tasks and datasets from scientific domains. SCIBERT significantly outperforms BERT-Base and achieves new SOTA results on several of these tasks, even compared to some reported BIOBERT (Lee et al., 2019) results on biomedical tasks.
For future work, we will release a version of SCIBERT analogous to BERT-Large, as well as experiment with different proportions of papers from each domain. Because these language models are costly to train, we aim to build a single resource that is useful across multiple domains.

Table 1: Test performances of all BERT variants on all tasks and datasets. Bold indicates the SOTA result (multiple results are bolded if the difference is within a 95% bootstrap confidence interval). Keeping with past work, we report macro F1 for NER (span-level), macro F1 for REL and CLS (sentence-level), macro F1 for PICO (token-level), and micro F1 for ChemProt specifically. For DEP, we report labeled (LAS) and unlabeled (UAS) attachment scores (excluding punctuation) for the same model with hyperparameters tuned for LAS. All results are the average of multiple runs with different random seeds.