AllenNLP: A Deep Semantic Natural Language Processing Platform

Modern natural language processing (NLP) research requires writing code. Ideally this code would provide a precise definition of the approach, easy repeatability of results, and a basis for extending the research. However, many research codebases bury high-level parameters under implementation details, are challenging to run and debug, and are difficult enough to extend that they are more likely to be rewritten. This paper describes AllenNLP, a library for applying deep learning methods to NLP research that addresses these issues with easy-to-use command-line tools, declarative configuration-driven experiments, and modular NLP abstractions. AllenNLP has already increased the rate of research experimentation and the sharing of NLP components at the Allen Institute for Artificial Intelligence, and we are working to have the same impact across the field.


Introduction
Neural network models are now the state of the art for a wide range of tasks such as text classification (Howard and Ruder, 2018), machine translation (Vaswani et al., 2017), semantic role labeling (Zhou and Xu, 2015; He et al., 2017), coreference resolution (Lee et al., 2017a), and semantic parsing (Krishnamurthy et al., 2017). However, it can be surprisingly difficult to tune new models or replicate existing results. State-of-the-art deep learning models often take over a week to train on modern GPUs and are sensitive to initialization and hyperparameter settings. Furthermore, reference implementations often re-implement NLP components from scratch and make it difficult to reproduce results, creating a barrier to entry for research on many problems.
AllenNLP, a platform for research on deep learning methods in natural language processing, is designed to address these problems and to significantly lower barriers to high-quality NLP research by
• implementing useful NLP abstractions that make it easy to write higher-level model code for a broad range of NLP tasks, swap out components, and re-use implementations,
• handling common NLP deep learning problems, such as masking and padding, and keeping these low-level details separate from the high-level model and experiment definitions,
• defining experiments using declarative configuration files, which provide a high-level summary of a model and its training, and make it easy to change the deep learning architecture and tune hyper-parameters, and
• sharing models through live demos, making complex NLP accessible and debuggable.
The AllenNLP website provides tutorials, API documentation, pretrained models, and source code. The AllenNLP platform has a permissive Apache 2.0 license and is easy to download and install via pip, a Docker image, or cloning the GitHub repository. It includes reference implementations for recent state-of-the-art models (see Section 3) that can be easily run (to make predictions on arbitrary new inputs) and retrained with different parameters or on new data. These pretrained models have interactive online demos with visualizations to help interpret model decisions and make predictions accessible to others. The reference implementations provide examples of the framework functionality (Section 2) and also serve as baselines for future research.
AllenNLP is an ongoing open-source effort maintained by several full-time engineers and researchers at the Allen Institute for Artificial Intelligence, as well as interns from top PhD programs and contributors from the broader NLP community. It is widely used internally for research on common sense, logical reasoning, and state-of-the-art NLP components such as constituency parsers, semantic parsers, and word representations. AllenNLP is gaining traction externally, and we want to invest in making it the standard for advancing NLP research using PyTorch.

Library Design
AllenNLP is a platform designed specifically for deep learning and NLP research. AllenNLP is built on PyTorch (Paszke et al., 2017), which provides many attractive features for NLP research. PyTorch supports dynamic networks, has a clean "Pythonic" syntax, and is easy to use.
The AllenNLP library provides (1) a flexible data API that handles intelligent batching and padding, (2) high-level abstractions for common operations in working with text, and (3) a modular and extensible experiment framework that makes doing good science easy.
AllenNLP maintains test coverage of over 90% to ensure its components and models are working as intended. Library features are built with testability in mind so new components can maintain similar test coverage.

Text Data Processing
AllenNLP's data processing API is built around the notion of Fields. Each Field represents a single input array to a model. Fields are grouped together in Instances that represent the examples for training or prediction.
The Field API is flexible and easy to extend, allowing for a unified data API for tasks as diverse as tagging, semantic role labeling, question answering, and textual entailment. To represent the SQuAD dataset (Rajpurkar et al., 2016), for example, which has a question and a passage as inputs and a span from the passage as output, each training Instance comprises a TextField for the question, a TextField for the passage, and a SpanField representing the start and end positions of the answer in the passage.
The user need only read data into a set of Instance objects with the desired fields, and the library can automatically sort them into batches with similar sequence lengths, pad all sequences in each batch to the same length, and randomly shuffle the batches for input to a model.
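The batching behavior described above can be illustrated with a minimal sketch in plain Python. It does not use the AllenNLP API: ToyInstance, pad, and make_batches are hypothetical names introduced only for this illustration, and sorting by passage length with a pad id of 0 is an assumption rather than the library's actual strategy.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ToyInstance:
    """Hypothetical stand-in for an Instance holding one SQuAD-style example."""
    question: List[int]           # token ids for the question (plays the role of a TextField)
    passage: List[int]            # token ids for the passage (plays the role of a TextField)
    answer_span: Tuple[int, int]  # inclusive (start, end) in the passage (plays the role of a SpanField)


def pad(sequence: List[int], length: int, pad_id: int = 0) -> List[int]:
    """Right-pad a sequence of ids to the requested length."""
    return sequence + [pad_id] * (length - len(sequence))


def make_batches(instances: List[ToyInstance], batch_size: int, seed: int = 0) -> List[Dict]:
    # Sort by passage length so each batch holds sequences of similar length.
    ordered = sorted(instances, key=lambda inst: len(inst.passage))
    groups = [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
    batches = []
    for group in groups:
        q_len = max(len(inst.question) for inst in group)
        p_len = max(len(inst.passage) for inst in group)
        batches.append({
            "question": [pad(inst.question, q_len) for inst in group],
            "passage": [pad(inst.passage, p_len) for inst in group],
            # The mask marks real tokens (1) versus padding (0).
            "passage_mask": [pad([1] * len(inst.passage), p_len) for inst in group],
            "answer_span": [inst.answer_span for inst in group],
        })
    # Shuffle the order of the batches, not the instances inside each batch.
    random.Random(seed).shuffle(batches)
    return batches
```

Because padding is computed per batch, a batch of short passages is padded only to its own longest passage rather than to the longest passage in the dataset.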

NLP-Focused Abstractions
AllenNLP provides a high-level API for building models, with abstractions designed specifically for NLP research. By design, the code for a model actually specifies a class of related models. The researcher can then experiment with various architectures within this class by simply changing a configuration file, without having to change any code.
The library has many abstractions that encapsulate common decision points in NLP models. Key examples are: (1) how text is represented as vectors, (2) how vector sequences are modified to produce new vector sequences, (3) how vector sequences are merged into a single vector.
TokenEmbedder: This abstraction takes input arrays generated by, e.g., a TextField and returns a sequence of vector embeddings. Through the use of polymorphism and AllenNLP's experiment framework (see Section 2.3), researchers can easily switch between a wide variety of possible word representations. Simply by changing a configuration file, an experimenter can choose between pre-trained word embeddings, word embeddings concatenated with a character-level CNN encoding, or even token-in-context embeddings from a pre-trained model (Peters et al., 2017), which allows for easy controlled experimentation.
Seq2SeqEncoder: A common operation in deep NLP models is to take a sequence of word vectors and pass them through a recurrent network to encode contextual information, producing a new sequence of vectors as output. There is a large number of ways to do this, including LSTMs (Hochreiter and Schmidhuber, 1997), GRUs (Cho et al., 2014), intra-sentence attention (Cheng et al., 2016), recurrent additive networks (Lee et al., 2017b), and many more. AllenNLP's Seq2SeqEncoder abstracts away the decision of which particular encoder to use, allowing the user to build an encoder-agnostic model and specify the encoder via configuration. In this way, a researcher can easily explore new recurrent architectures; for example, they can replace the LSTMs in any model that uses this abstraction with any other encoder, measuring the impact across a wide range of models and tasks.
Seq2VecEncoder: Another common operation in NLP models is to merge a sequence of vectors into a single vector, using either a recurrent network with some kind of averaging or pooling, or using a convolutional network. This operation is encapsulated in AllenNLP by a Seq2VecEncoder. This abstraction again allows the model code to only describe a class of similar models, with particular instantiations of that model class being determined by a configuration file.
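The idea that model code describes a class of models, with the concrete encoder chosen by configuration, can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not AllenNLP's implementation: ToySeq2VecEncoder, MeanPoolEncoder, MaxPoolEncoder, ToyClassifier, and build_model are hypothetical names, and simple pooling stands in for the recurrent or convolutional encoders mentioned above.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Type

Vector = List[float]


class ToySeq2VecEncoder(ABC):
    """Hypothetical interface: merge a sequence of vectors into a single vector."""

    @abstractmethod
    def encode(self, vectors: List[Vector]) -> Vector:
        ...


class MeanPoolEncoder(ToySeq2VecEncoder):
    def encode(self, vectors: List[Vector]) -> Vector:
        # Average each dimension across the sequence.
        return [sum(dim) / len(vectors) for dim in zip(*vectors)]


class MaxPoolEncoder(ToySeq2VecEncoder):
    def encode(self, vectors: List[Vector]) -> Vector:
        # Keep the largest value of each dimension across the sequence.
        return [max(dim) for dim in zip(*vectors)]


# Maps the name used in a configuration to a concrete encoder class.
ENCODERS: Dict[str, Type[ToySeq2VecEncoder]] = {
    "mean": MeanPoolEncoder,
    "max": MaxPoolEncoder,
}


class ToyClassifier:
    """Model code that depends only on the interface, not on a concrete encoder."""

    def __init__(self, encoder: ToySeq2VecEncoder, weights: Vector) -> None:
        self.encoder = encoder
        self.weights = weights

    def score(self, vectors: List[Vector]) -> float:
        pooled = self.encoder.encode(vectors)
        return sum(w * x for w, x in zip(self.weights, pooled))


def build_model(config: dict) -> ToyClassifier:
    """Instantiate the model variant named in the configuration."""
    encoder = ENCODERS[config["encoder"]]()
    return ToyClassifier(encoder, config["weights"])
```

Switching from build_model({"encoder": "mean", ...}) to build_model({"encoder": "max", ...}) changes the architecture without touching ToyClassifier, which is the property the abstraction is meant to provide.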
SpanExtractor: A recent trend in NLP is to build models that operate on spans of text, instead of on tokens. State-of-the-art models for coreference resolution (Lee et al., 2017a), constituency parsing (Stern et al., 2017), and semantic role labeling (He et al., 2017) all operate in this way. Support for building this kind of model is built into AllenNLP, including a SpanExtractor abstraction that determines how span vectors get computed from sequences of token vectors.
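As a rough illustration of computing span vectors from token vectors, the sketch below represents each span by concatenating the vectors of its first and last tokens. Endpoint concatenation is only one possible choice and is assumed here for simplicity; list_spans and endpoint_span_vectors are hypothetical helpers, not AllenNLP functions.

```python
from typing import List, Tuple

Vector = List[float]
Span = Tuple[int, int]  # inclusive (start, end) token indices


def list_spans(num_tokens: int, max_width: int) -> List[Span]:
    """List every span covering at most max_width tokens (hypothetical helper)."""
    return [
        (start, end)
        for start in range(num_tokens)
        for end in range(start, min(start + max_width, num_tokens))
    ]


def endpoint_span_vectors(token_vectors: List[Vector], spans: List[Span]) -> List[Vector]:
    """Represent each span by concatenating its first and last token vectors."""
    return [token_vectors[start] + token_vectors[end] for start, end in spans]
```

A different extractor, for example one that averages or attends over the tokens inside the span, could be substituted behind the same function signature.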

Experimental Framework
The primary design goal of AllenNLP is to make it easy to do good science with controlled experiments. Because of the abstractions described in Section 2.2, large parts of the model architecture and training-related hyper-parameters can be configured outside of model code. This makes it easy to clearly specify the important decisions that define a new model in configuration, and frees the researcher from needing to code all of the implementation details from scratch.
This architecture design is accomplished in AllenNLP using a HOCON configuration file that specifies, e.g., which text representations and encoders to use in an experiment. The mapping from strings in the configuration file to instantiated objects in code is done through the use of a registry, which allows users of the library to add new implementations of any of the provided abstractions, or even to create their own new abstractions.
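A registry of this kind can be sketched as a dictionary from names to classes, filled in by a decorator. The code below is a simplified illustration and not AllenNLP's actual registry: ToyEncoder, ToyLstm, ToyGru, the "type" key, and from_config are all assumptions made for the example, and a plain Python dict stands in for a parsed configuration file.

```python
from typing import Callable, Dict, Type


class ToyEncoder:
    """Hypothetical abstraction whose implementations are looked up by name."""

    _registry: Dict[str, Type["ToyEncoder"]] = {}

    @classmethod
    def register(cls, name: str) -> Callable[[Type["ToyEncoder"]], Type["ToyEncoder"]]:
        def decorator(subclass: Type["ToyEncoder"]) -> Type["ToyEncoder"]:
            if name in cls._registry:
                raise ValueError(f"'{name}' is already registered")
            cls._registry[name] = subclass
            return subclass
        return decorator

    @classmethod
    def from_config(cls, config: dict) -> "ToyEncoder":
        params = dict(config)           # copy so the caller's config is untouched
        type_name = params.pop("type")  # the string that selects the implementation
        return cls._registry[type_name](**params)


@ToyEncoder.register("lstm")
class ToyLstm(ToyEncoder):
    def __init__(self, hidden_size: int, num_layers: int = 1) -> None:
        self.hidden_size = hidden_size
        self.num_layers = num_layers


@ToyEncoder.register("gru")
class ToyGru(ToyEncoder):
    def __init__(self, hidden_size: int, num_layers: int = 1) -> None:
        self.hidden_size = hidden_size
        self.num_layers = num_layers


# Changing "gru" to "lstm" here swaps the implementation without code changes.
encoder = ToyEncoder.from_config({"type": "gru", "hidden_size": 200})
```

A user-defined encoder becomes available to configuration files simply by decorating the new class with another name.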
While some entries in the configuration file are optional, many are required, and if a required entry is left unspecified, AllenNLP raises a ConfigurationError when reading the configuration. Additionally, when a configuration file is loaded, AllenNLP logs the configuration values, providing a record of both specified and default parameters for the model.
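The required-versus-optional behavior and the logging of resolved values can be sketched as follows. This is a minimal illustration, not AllenNLP's code: ToyConfigurationError, pop_param, and read_trainer_config are hypothetical names, and the keys num_epochs and patience are chosen only as examples.

```python
import logging
from typing import Any, Dict

logger = logging.getLogger("toy_config")


class ToyConfigurationError(Exception):
    """Hypothetical error raised when a required configuration key is missing."""


_REQUIRED = object()  # sentinel meaning "no default was supplied"


def pop_param(config: Dict[str, Any], key: str, default: Any = _REQUIRED) -> Any:
    """Read one key, falling back to a default or failing if the key is required."""
    if key in config:
        value = config.pop(key)
    elif default is _REQUIRED:
        raise ToyConfigurationError(f'required key "{key}" is missing')
    else:
        value = default
    # Log every resolved value so specified and default settings are both recorded.
    logger.info("%s = %r", key, value)
    return value


def read_trainer_config(config: Dict[str, Any]) -> Dict[str, Any]:
    config = dict(config)  # copy so the caller's dictionary is not modified
    return {
        "num_epochs": pop_param(config, "num_epochs"),          # required
        "patience": pop_param(config, "patience", default=None),  # optional
    }
```

Calling read_trainer_config({"patience": 2}) fails immediately with ToyConfigurationError rather than silently training with an unintended setting.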

Reference Models
AllenNLP includes reference implementations of widely used language understanding models. These models demonstrate how to use the framework functionality presented in Section 2. They also have verified performance levels that closely match the original results, and can serve as comparison baselines for future research.
AllenNLP includes reference implementations for several tasks, including:
• Semantic Role Labeling (SRL) models recover the latent predicate argument structure of a sentence (Palmer et al., 2005). SRL builds representations that answer basic questions about sentence meaning; for example, "who" did "what" to "whom." The AllenNLP SRL model is a re-implementation of a deep BiLSTM model (He et al., 2017). The implemented model closely matches the published model, which was state of the art in 2017, achieving an F1 of 78.9% on the English OntoNotes 5.0 dataset using the CoNLL 2011/12 shared task format.
• Machine Comprehension (MC) systems take an evidence text and a question as input, and predict a span within the evidence that answers the question. AllenNLP includes a reference implementation of the BiDAF MC model (Seo et al., 2017), which was state of the art for the SQuAD benchmark (Rajpurkar et al., 2016) in early 2017.
• Textual Entailment (TE) models take a pair of sentences and predict whether the facts in the first necessarily imply the facts in the second. The AllenNLP TE model is a re-implementation of the decomposable attention model (Parikh et al., 2016), a widely used TE baseline that was state of the art on the SNLI dataset (Bowman et al., 2015) in late 2016. The AllenNLP TE model achieves an accuracy of 86.4% on the SNLI 1.0 test dataset, a 2% improvement over most publicly available implementations and a score similar to that of the original paper. Rather than pre-trained GloVe vectors, this model uses ELMo embeddings (Peters et al., 2018), which are completely character-based and account for the 2% improvement.
• A Constituency Parser breaks a text into sub-phrases, or constituents. Non-terminals in the tree are types of phrases and the terminals are the words in the sentence. The AllenNLP constituency parser is an implementation of a minimal neural model for constituency parsing based on an independent scoring of labels and spans (Stern et al., 2017). This model uses ELMo embeddings (Peters et al., 2018), which are completely character-based, and improves single-model performance from 92.6 F1 to 94.11 F1 on the Penn Treebank, a 20% relative error reduction.
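The 20% relative error reduction quoted for the constituency parser follows from treating 100 minus F1 as the error. The short check below assumes that definition; relative_error_reduction is a hypothetical helper written only for this calculation.

```python
def relative_error_reduction(old_f1: float, new_f1: float) -> float:
    """Fraction of the remaining error (100 - F1) removed by the new model."""
    old_error = 100.0 - old_f1
    new_error = 100.0 - new_f1
    return (old_error - new_error) / old_error


# Error drops from 7.4 to 5.89 points, i.e. 1.51 / 7.4 of the original error.
reduction = relative_error_reduction(92.6, 94.11)
print(f"{reduction:.1%}")  # prints 20.4%
```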
AllenNLP also includes a token embedder that uses pre-trained ELMo (Peters et al., 2018) representations. ELMo is a deep contextualized word representation that models both complex characteristics of word use (e.g., syntax and semantics) and how these uses vary across linguistic contexts (in order to model polysemy). ELMo embeddings significantly improve the state of the art across a broad range of challenging NLP problems, including question answering, textual entailment, and sentiment analysis.
Additional models are currently under development and are regularly released, including semantic parsing (Krishnamurthy et al., 2017) and multi-paragraph reading comprehension (Clark and Gardner, 2017). We expect the number of tasks and reference implementations to grow steadily over time. The most up-to-date list of reference models is maintained at http://allennlp.org/models.

Related Work
Many existing NLP pipelines, such as Stanford CoreNLP (Manning et al., 2014) and spaCy (https://spacy.io/), focus on predicting linguistic structures rather than modeling NLP architectures. While AllenNLP supports making predictions using pretrained models, its core focus is on enabling novel research.
AllenNLP's emphasis on configuring parameters, training, and evaluating is similar to that of Weka (Witten and Frank, 1999) or scikit-learn (Pedregosa et al., 2011), but AllenNLP focuses on cutting-edge research in deep learning and is designed around declarative configuration of model architectures in addition to model parameters.
Most existing deep-learning toolkits are designed for general machine learning (Bergstra et al., 2010; Yu et al., 2014; Chen et al., 2015; Abadi et al., 2016; Neubig et al., 2017), and can require significant effort to develop research infrastructure for particular model classes. Some, such as Keras (Chollet et al., 2015), do aim to make it easy to build deep learning models. Similar to how AllenNLP is an abstraction layer on top of PyTorch, Keras provides high-level abstractions on top of static graph frameworks such as TensorFlow. While Keras' abstractions and functionality are useful for general machine learning, they are somewhat lacking for NLP, where input data types can be very complex and dynamic graph frameworks are more often necessary.
Finally, AllenNLP is related to toolkits for deep learning research in dialog (Miller et al., 2017) and machine translation. Those toolkits support learning general functions that map strings (e.g., foreign language text or user utterances) to strings (e.g., English text or system responses). AllenNLP, in contrast, is a more general library for building models for any kind of NLP task, including text classification, constituency parsing, textual entailment, question answering, and more.

Conclusion
The design of AllenNLP allows researchers to focus on the high-level summary of their models rather than the details, and to do careful, reproducible research. Internally at the Allen Institute for Artificial Intelligence, the library is widely adopted and has improved the quality of our research code, spread knowledge about deep learning, and made it easier to share discoveries between teams. AllenNLP is gaining traction externally and is growing an open-source community of contributors. The AllenNLP team is committed to continuing work on this library in order to enable better research practices throughout the NLP community and to build a community of researchers who maintain a collection of the best models in natural language processing.