SciWING – A Software Toolkit for Scientific Document Processing

We introduce SciWING, an open-source software toolkit which provides access to state-of-the-art pre-trained models for scientific document processing (SDP) tasks, such as citation string parsing, logical structure recovery and citation intent classification. Compared to other toolkits, SciWING follows a full neural pipeline and provides a Python interface for SDP. When needed, SciWING provides fine-grained control for rapid experimentation with different models by swapping and stacking different modules. Transfer learning from general-purpose and scientific-document-specific pre-trained transformers (e.g., BERT and SciBERT) can be performed. SciWING incorporates ready-to-use web and terminal-based applications and demonstrations to aid adoption and development. The toolkit is available from http://sciwing.io and the demos are available at http://rebrand.ly/sciwing-demo.


Introduction
Automated scientific document processing (SDP) deploys natural language processing on scholarly articles, which are long-form, complex documents with conventional structure and cross-references to external resources. Representative SDP tasks include: parsing embedded reference strings; identifying the importance, sentiment and provenance of citations; identifying logical sections and markup; parsing of equations, figures and tables; and summarization. SDP tasks, in turn, help downstream systems and assist scholars in finding relevant documents and managing their knowledge discovery and utilization workflows. Next-generation introspective digital libraries such as Semantic Scholar (Ammar et al., 2018) have begun to incorporate such services.
While natural language processing (NLP) in general has seen tremendous progress with the introduction of neural network architectures and general toolkits and datasets to leverage them, their deployment for SDP is still limited. Over the past few years, many open-source software packages have accelerated the development of state-of-the-art NLP models. Hugging Face's transformer models (Wolf et al., 2019) and AllenNLP (Gardner et al., 2017) are general-purpose frameworks that have produced state-of-the-art models for natural language understanding tasks. However, these and many other tools do not provide comprehensive access to pre-trained scientific document processing models.
A key barrier to entry is accessibility: a nontrivial amount of expertise in NLP and machine learning methodologies is a prerequisite, which many scholars who wish to deploy SDP lack and have no interest in obtaining. Thus there is a clear need for a toolkit that packages easy access to pre-trained, state-of-the-art models, while also allowing researchers to experiment with models rapidly to create deployable applications.
We introduce SciWING to close this gap. Built on top of PyTorch and under active development, it provides easy access to modern neural network models trained for a growing number of SDP tasks, which practitioners can easily deploy on their documents. For researchers, these models serve as baselines for experimentation and as the basis for constructing more complex architectures in a modular manner. SciWING supports swapping different neural network modules, allowing researchers to declare models in a configuration file without having to write program code.

System Overview
Our view is that SDP-specific considerations are best embodied as an abstract layer over existing NLP frameworks. SciWING incorporates the generic NLP pipeline AllenNLP (Gardner et al., 2017), developing models on top of it, while using the transformers package to enable transfer learning via its pre-trained general-purpose transformers such as BERT (Devlin et al., 2019) and SDP-specific ones such as SciBERT (?).
SciWING builds with Python 3.7 and is provisioned as a package available on the Python Package Index (PyPI), supporting installation via pip install sciwing. This downloads its PyTorch and AllenNLP library dependencies. For users aiming to develop the library further, SciWING comes with installation tools that set up the system, alongside extensive in-lined code documentation and explanatory tutorials.
SciWING separates Dataset, Model and Engine components (Fig. 1), facilitating their flexible reconfiguration:
• Datasets: There are many challenges for the researcher-practitioner in experimenting with different SDP tasks. First, researchers often face the challenge of handling various dataset formats: for reference string parsing, the CoNLL format is most common; for text classification, the CSV format is most common. SciWING enables reading of dataset files in different formats and also facilitates the easy download of open datasets using command-line interfaces. For example, sciwing download scienceie downloads the official openly available dataset for the ScienceIE task. Additionally, pre-processing is cumbersome and error-prone. It becomes complex when different models require different tokenisation and numericalisation methods. SciWING unifies these various input formats through a pipeline of pre-processing, tokenisation and numericalisation, providing automatic means to pre-process via Tokenisers and Numericalisers. SciWING also handles batching and padding of examples.
• Models: The following two subcomponents are combined to build an instance of a neural network model. The models are PyTorch-based classes.
• Embedders: State-of-the-art models are easily built by concatenating multiple representations, via SciWING's ConcatEmbedders module. For example, word and character embeddings are combined in NER models (Lample et al., 2016), and multiple contextual word embeddings are combined in various clinical and Bio-NLP tasks (Zhai et al., 2019).
• Neural Network Encoders: SciWING consists of commonly used neural network components that can be composed to form neural architectures for different tasks. For example, in text classification, encoding an input sentence as a vector using an LSTM is a common operation (SciWING's seq2vecencoder). Another common operation is obtaining a sequence of hidden states for a set of tokens, often used in sequence labelling tasks (SciWING's Lstm2seq).
SciWING has generic linear classification and sequence labelling with CRF heads that can be attached to the encoders to build the final model. It provides pre-trained state-of-the-art models for particular SDP tasks that work out-of-the-box for prediction or which can be further fine-tuned. With these components given, SciWING's Inference middleware provides clear abstractions to perform inference once models are trained. The layer runs predictions on the test dataset, user inputs and files. Such abstractions also act as an interface for the development of upstream REST APIs and command-line applications.
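The dataset pipeline described above (reading a format, numericalising tokens, padding a batch) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not SciWING's implementation: the function names read_conll, build_vocab and numericalise, the special tokens, and the two-column sample data are all hypothetical.

```python
from typing import Dict, List, Tuple

PAD, UNK = "<pad>", "<unk>"  # assumed special tokens, not taken from SciWING


def read_conll(text: str) -> List[Tuple[List[str], List[str]]]:
    """Parse 'token label' lines; a blank line ends one example."""
    examples, tokens, labels = [], [], []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            if tokens:
                examples.append((tokens, labels))
                tokens, labels = [], []
            continue
        tokens.append(parts[0])   # first column: the token
        labels.append(parts[-1])  # last column: the label
    if tokens:
        examples.append((tokens, labels))
    return examples


def build_vocab(sequences: List[List[str]]) -> Dict[str, int]:
    """Assign an integer id to every distinct token, in order of appearance."""
    vocab = {PAD: 0, UNK: 1}
    for sequence in sequences:
        for item in sequence:
            vocab.setdefault(item, len(vocab))
    return vocab


def numericalise(tokens: List[str], vocab: Dict[str, int], max_len: int) -> List[int]:
    """Map tokens to ids, truncate to max_len, then right-pad with the PAD id."""
    ids = [vocab.get(token, vocab[UNK]) for token in tokens[:max_len]]
    return ids + [vocab[PAD]] * (max_len - len(ids))


sample = """Lample author
2016 year

Neural title
architectures title
NAACL booktitle
"""

examples = read_conll(sample)
vocab = build_vocab([tokens for tokens, _ in examples])
max_len = max(len(tokens) for tokens, _ in examples)
batch = [numericalise(tokens, vocab, max_len) for tokens, _ in examples]
print(batch)
```

A CSV reader for text classification would return the same kind of token and label lists, so the vocabulary and padding steps stay unchanged across formats.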

Configuration using TOML files
A defining feature of SciWING is its use of a declarative TOML configuration file 3 .This enables users to declare dataset, model architec-

Command Line Interface
Qualitatively analysing the results of the model by drilling down to certain training and development instances can be telling and help to diagnose performance issues. SciWING provides an interactive inspection of the model for this reason through a command-line interface (CLI). Consider the task of reference string parsing: the confusion matrix for the different classes can be displayed through the provided CLI utility, which also allows finer-grained introspection of (Precision, Recall, F-measure) metrics and the viewing of error instances where one class is confused for another.
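The quantities such an inspection exposes can be illustrated with a short sketch. It is token-level for simplicity and is not SciWING's own evaluation code; the function names and the toy labels are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def confusion_counts(gold: List[str], pred: List[str]) -> Counter:
    """Count how often each (gold label, predicted label) pair occurs."""
    return Counter(zip(gold, pred))


def per_class_prf(gold: List[str], pred: List[str]) -> Dict[str, Tuple[float, float, float]]:
    """Token-level precision, recall and F1 for every label."""
    counts = confusion_counts(gold, pred)
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        tp = counts[(label, label)]
        fp = sum(n for (g, p), n in counts.items() if p == label and g != label)
        fn = sum(n for (g, p), n in counts.items() if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def confused_instances(tokens, gold, pred, gold_label, pred_label):
    """Tokens whose gold label is gold_label but were predicted as pred_label."""
    return [t for t, g, p in zip(tokens, gold, pred) if g == gold_label and p == pred_label]


tokens = ["Lample", "G.", "2016", "Neural", "NAACL"]
gold = ["author", "author", "year", "title", "booktitle"]
pred = ["author", "title", "year", "title", "booktitle"]

scores = per_class_prf(gold, pred)
errors = confused_instances(tokens, gold, pred, "author", "title")
print(scores["author"], errors)
```

Listing the confused tokens for one cell of the confusion matrix is the drill-down step: it shows which inputs cause author to be mistaken for title.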
SciWING provides commands to run experiments from the configuration file, aiding replication. For example, if the experiments are declared in a file named experiment.toml, then the experiments can be run with the command sciwing run experiment.toml. This runs the experiment, saving the best model. Inference is then trivially invoked via sciwing test experiment.toml, which deploys the best model against the test dataset and displays the resultant metrics.

End User Interfaces
An API service enables easy development of various graphical user interfaces. SciWING currently exposes APIs for reference string parsing and citation intent classification using fastapi 4. The API enables the following application families downstream:
• Web demonstrations provide quick access to predictions from state-of-the-art models, fulfilling one key aim of SciWING. Prespecified data can be chosen or user data can be entered and quickly processed using the distributed models (as in Figure 3).
• Programmatic Interfaces in SciWING provision more advanced usage. Users can make predictions for data stored in a file or fine-tune a model on their data. For example, if the user wants to parse citations stored in a text file, SciWING provides a NeuralParscit class that has easy methods to parse all the strings in the file and store the results in a new file.

• Reference String Parsing identifies the components of a reference string that corresponds to an in-document citation: author, journal and year of publication, among 13 classes. Neural network methods for reference string parsing show state-of-the-art performance (Prasad et al., 2018) as a sequence labelling task, combining a bidirectional LSTM with CRF. SciWING's distributed model implements the same model architecture and additionally uses ELMo embeddings.
• ScienceIE identifies typed keyphrases, originally from chemical documents: Task keyphrases denote the end task or goal; Material keyphrases indicate any chemical or dataset used by the scientific work; and Process keyphrases include any scientific model or algorithm. The state-of-the-art system from 2017 includes a bidirectional LSTM with CRF and uses language model embeddings (Ammar et al., 2017). SciWING includes a reference implementation substituting modern ELMo embeddings for the original language model representation.
• Logical Structure Recovery identifies the logical sections of a document: introduction, related work, methodology, and experiments. This directs the relevant, targeted text to downstream tasks such as summarization and citation intent classification. Currently, there are no neural network methods for this task, so SciWING's models can serve as baselines for future research.
Other tasks and datasets are being actively developed. We envision that the development of additional models will be swift and easy.

Use Case: Reference String Parsing
We illustrate how to construct SciWING models, building up to the state-of-the-art model by simple modifications. This also facilitates ablation studies, a common part of empirical studies.
Bi-LSTM tagger: Our base model is a bidirectional LSTM model. It uses a GloVe embedder. Every input token is classified into one of 13 different classes.
Bi-LSTM tagger with character and ELMo Embeddings: We modify the code to include a bidirectional LSTM character embedder. We use the ConcatEmbedders module to create the final word embeddings (Line 16), which concatenates the character embeddings with those from the previous word embedding and a pretrained ELMo contextual word embedding. This final model is the provisioned model for the reference string parsing task provided in SciWING.
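The concatenation step can be illustrated without any deep learning library. The sketch below is an assumption-laden toy, not SciWING's ConcatEmbedders: the helper concat_embedders and the two hand-written embedders are hypothetical stand-ins for trained GloVe, character and ELMo embedders.

```python
from typing import Callable, List, Sequence

# An embedder maps a token sequence to one vector per token.
Embedder = Callable[[Sequence[str]], List[List[float]]]


def word_embedder(tokens: Sequence[str]) -> List[List[float]]:
    """Toy 2-dimensional 'word' vector: token length and capitalisation flag."""
    return [[float(len(t)), float(t[:1].isupper())] for t in tokens]


def char_embedder(tokens: Sequence[str]) -> List[List[float]]:
    """Toy 3-dimensional 'character' vector: digit, letter and other counts."""
    return [
        [
            float(sum(c.isdigit() for c in t)),
            float(sum(c.isalpha() for c in t)),
            float(sum(not c.isalnum() for c in t)),
        ]
        for t in tokens
    ]


def concat_embedders(embedders: List[Embedder]) -> Embedder:
    """Build an embedder that joins, per token, the vectors of all embedders."""
    def embed(tokens: Sequence[str]) -> List[List[float]]:
        per_embedder = [embedder(tokens) for embedder in embedders]
        # zip(*...) groups the vectors that belong to the same token
        return [
            [value for vector in token_vectors for value in vector]
            for token_vectors in zip(*per_embedder)
        ]
    return embed


embedder = concat_embedders([word_embedder, char_embedder])
vectors = embedder(["Lample", "2016", "."])
print(vectors)
```

The output dimension is the sum of the input dimensions (here 2 + 3 = 5), which is why the encoder placed on top of a concatenated embedder has to be sized accordingly.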

Related Work
Grobid (GRO, 2008-2020) is the closest to a general workbench for scientific document processing. Like SciWING, Grobid performs document structure classification and reference string parsing, among other tasks. However, Grobid is not a deep learning framework for scientific document processing. SciSpacy (Neumann et al., 2019) focuses on biomedical tasks such as POS-tagging, syntactic parsing and biomedical span extraction. However, SciSpacy is primarily for deployment, as it does not allow development and testing of new models and architectures.
In contrast, task- and domain-agnostic frameworks also exist. NCRF++ (Yang and Zhang, 2018) is a tool for performing sequence tagging using neural networks and Conditional Random Fields. ... is a general NLP framework. SciWING interfaces with the general-purpose AllenNLP (Gardner et al., 2017) NLP framework and allows easy access to pre-trained neural network models for scientific document processing. The Transformers framework (Wolf et al., 2019) enables simple access to pre-trained transformer architectures such as BERT, XLNet, ALBERT and RoBERTa. SciWING also builds on top of the transformers package to give researchers easy access to general-purpose and SDP-specific contextual word embeddings such as SciBERT for its tasks.

Conclusion
We introduce SciWING, an open-source scientific document processing (SDP) toolkit, targeted at practitioners and researchers interested in rapid experimentation. It provisions pre-trained models for key SDP tasks, such as citation string parsing and citation intent classification, that achieve state-of-the-art performance.
SciWING's modular design greatly facilitates model architecture development and speedy train/test cycles for architecture search, and supports transfer learning for use cases with limited annotated data. Configuration driven, SciWING allows the user to declare models, datasets and experiment parameters in a single configuration file.
SciWING is actively being developed. Our current development target is to incorporate models for sequence-to-sequence (seq2seq) generation and multi-task learning (to ameliorate challenges with sparse data), alongside the implementation of additional SDP tasks and models. We hope that SciWING fosters collaboration among the SDP community, and encourage assistance with these goals through contributions on our GitHub repository.

Figure 1: SciWING Components: Text classification and sequence labelling Datasets, Models composed from low-level Modules, and Engines to train and record experiment parameters. Infer middleware performs inference and provides APIs.

2 Based on the official conlleval script from CoNLL.
3 Commonly used for Python applications.

tures and experiment hyper-parameters in a single place. SciWING parses the TOML file and creates appropriate instances of the dataset, model and other experiment hyper-parameters. A simple configuration file for reference string parsing is shown in Figure 2 along with its equivalent model declaration in Python. The class declaration in Python and the configuration file have a one-to-one correspondence. As deep learning models are made of multiple modules, SciWING adopts a strategy to automatically instantiate these submodules as needed. SciWING constructs a Directed Acyclic Graph (DAG) from the model definition to achieve this. The DAG's topological ordering is used to instantiate the different submodules to form the final model, as described next.
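Instantiating submodules in topological order can be sketched as follows. This is an illustration only, assuming Python 3.9 or later for the standard-library graphlib module; the dictionary layout standing in for a parsed configuration, the registry, and the factory functions are hypothetical and do not reflect SciWING's actual TOML schema or classes.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical stand-in for a parsed model section: each entry names a class,
# its plain arguments, and the submodules it expects as constructor arguments.
config: Dict[str, Dict[str, Any]] = {
    "model": {
        "class": "Tagger",
        "args": {"encoding_dim": 200},
        "submodules": {"rnn2seq_encoder": "encoder"},
    },
    "encoder": {
        "class": "Lstm2Seq",
        "args": {"hidden_dim": 100},
        "submodules": {"embedder": "word_embedder"},
    },
    "word_embedder": {"class": "WordEmbedder", "args": {}, "submodules": {}},
}


def make_factory(kind: str) -> Callable[..., Dict[str, Any]]:
    """Toy constructor: records its class name and the arguments it received."""
    def factory(**kwargs: Any) -> Dict[str, Any]:
        return {"kind": kind, **kwargs}
    return factory


REGISTRY = {name: make_factory(name) for name in ("Tagger", "Lstm2Seq", "WordEmbedder")}


def instantiate(config, registry) -> Tuple[Dict[str, Any], List[str]]:
    """Build every module after the submodules it depends on."""
    # graphlib expects a mapping from each node to its predecessors
    graph = {name: set(spec["submodules"].values()) for name, spec in config.items()}
    order = list(TopologicalSorter(graph).static_order())
    built: Dict[str, Any] = {}
    for name in order:
        spec = config[name]
        children = {arg: built[dep] for arg, dep in spec["submodules"].items()}
        built[name] = registry[spec["class"]](**spec["args"], **children)
    return built, order


built, order = instantiate(config, REGISTRY)
print(order)
```

Because dependencies are always built first, the declaration order in the configuration does not matter, and a cyclic definition is rejected by TopologicalSorter with a CycleError instead of recursing forever.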

Figure 2: (l) The model section of the TOML file, and (r) its corresponding class declaration in Python.

Figure 3: Web demo for SciWING's implementation of the Neural ParsCit reference string parsing model, where different parts of the reference string are labelled. This demo utilises the displaCy visualisation toolkit, part of the spaCy library.
4 https://fastapi.tiangolo.com/

Example Tasks
SciWING includes examples for various tasks which find widespread use in scientific document processing. The examples demonstrate how to use the framework effectively. The models have verified performance levels that closely match the originally reported results. They can be used as baselines for further research.
    ...
       # initialize a tagger without CRF
    12 model = SimpleTagger(
    13     rnn2seq_encoder=lstm2seq_encoder,
    14     encoding_dim=200)

Bi-LSTM Tagger adding a CRF layer: We then modify the above code, swapping the simple tagger with one that uses a CRF. The rest of the code is identical.

    1  ...
       # ... tagger with CRF on top
    4  model = RnnSeqCrfTagger(
    5      rnn2seq_encoder=lstm2seq_encoder,
    6      encoding_dim=200
    7  )

    ...
       word_embedder = WordEmbedder(
    3      embedding_type="glove_6B_1...
    ...
       # concatenate the embeddings
    16 embedder = ConcatEmbedders([word_embedder, char_embedder, elmo_embedder])
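To show what a CRF layer adds on top of per-token scores, the following sketch decodes the best tag sequence with the Viterbi algorithm. It is a generic illustration, not the decoding code of RnnSeqCrfTagger; the scores are invented, and the function assumes a non-empty input with a score for every tag at every position.

```python
from typing import Dict, List, Tuple


def viterbi(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    tags: List[str],
) -> List[str]:
    """Return the tag sequence that maximises emission plus transition scores."""
    # best[i][tag]: score of the best path that ends in `tag` at position i
    best = [{tag: emissions[0][tag] for tag in tags}]
    back: List[Dict[str, str]] = []
    for scores in emissions[1:]:
        current, pointers = {}, {}
        for tag in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions.get((p, tag), 0.0))
            current[tag] = best[-1][prev] + transitions.get((prev, tag), 0.0) + scores[tag]
            pointers[tag] = prev
        best.append(current)
        back.append(pointers)
    # follow the back-pointers from the best final tag
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


tags = ["author", "year"]
emissions = [
    {"author": 2.0, "year": 0.0},
    {"author": 0.4, "year": 0.6},  # slightly prefers "year" when scored alone
    {"author": 1.5, "year": 0.0},
]
transitions = {
    ("author", "author"): 1.0,
    ("author", "year"): -1.0,
    ("year", "author"): -1.0,
}

greedy = [max(tags, key=lambda t: e[t]) for e in emissions]
decoded = viterbi(emissions, transitions, tags)
print(greedy, decoded)
```

The per-token choice flips the middle token to year, while joint decoding keeps a contiguous author span because the transition scores penalise switching labels. This is the usual motivation for a CRF head in reference string parsing, where fields occupy contiguous spans.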