NLP Lean Programming Framework: Developing NLP Applications More Effectively

This paper presents the NLP Lean Programming framework (NLPf), a new framework for creating custom Natural Language Processing (NLP) models and pipelines by utilizing common software development build systems. This approach allows developers to train and integrate domain-specific NLP pipelines into their applications seamlessly. Additionally, NLPf provides an annotation tool that improves the annotation process significantly by providing a well-designed GUI and a sophisticated way of using input devices. Due to NLPf’s properties, developers and domain experts are able to build domain-specific NLP applications more effectively.
Project page: https://gitlab.com/schrieveslaach/NLPf
Video Tutorial: https://www.youtube.com/watch?v=44UJspVebTA (Demonstration starts at 11:40 min)
This paper is related to:
- Interfaces and resources to support linguistic annotation
- Software architectures and reusable components
- Software tools for evaluation or error analysis


Introduction
Nowadays more and more business models rely on the processing of natural language data, e. g. companies extract relevant eCommerce data from domain-specific documents. The required eCommerce data could be related to various domains, e. g. life-science, public utilities, or social media, depending on the companies' business models.
Furthermore, the World Wide Web (WWW) offers a huge amount of natural language data that conveys a wide variety of knowledge to human readers. This amount of knowledge is unmanageable for humans, and applications try to make it more accessible, e. g. Treude and Robillard (2016) make natural language text about software programming more accessible through a natural language processing (NLP) application.
All these approaches have in common that they require domain-specific NLP models that have been trained on a domain-specific and annotated corpus. These models are trained with different NLP frameworks and have to be evaluated for every annotation layer. For example, named entity recognition (NER) of Stanford CoreNLP (Manning et al., 2014) might work better than NER of OpenNLP (Reese, 2015, Chapter 1); the chosen segmentation tool, e. g. UDPipe (Straka and Straková, 2017), might work better than Stanford CoreNLP's segmentation tool, and so on. Existing studies show that domain-specific training and evaluation is a common approach in the NLP community to determine the best-performing NLP pipeline (Buyko et al., 2006; Giesbrecht and Evert, 2009; Neunerdt et al., 2013; Omran and Treude, 2017).
Developers of NLP applications are forced to create domain-specific corpora to determine the best-performing NLP pipeline among many NLP frameworks. During this process they face various obstacles:
• The training and evaluation of different NLP frameworks requires a lot of scripting or programming effort because of incompatible APIs.
• Domain experts who annotate domain-specific documents with a GUI tool struggle with an insufficient user experience.
• There are too many ways in which developers can combine these NLP tools into NLP pipelines.
• The generated NLP models, as build artifacts, have to be integrated manually into the application code.
The NLP Lean Programming framework (NLPf) addresses these issues. NLPf provides a standardized project structure for domain-specific corpora (see Section 2), an improved user experience for annotators (see Section 3), a common build process to train and evaluate NLP models in conjunction with the determination of the best-performing NLP pipeline (see Section 4), and a convenient API to integrate the best-performing NLP pipeline into the application code (see Section 5).

Annotated Corpus Project Structure
Maven as a build management tool has standardized the development process of Java applications by standardizing the build life-cycle, the project layout, and the dependency management. These standards have evolved by utilizing convention over configuration (CoC) as much as possible, so developers have to make fewer decisions while developing software.
Such conventions are missing for the development of domain-specific NLP applications, so developers have to make many decisions and write many scripts to build their applications. NLPf provides conventions by utilizing Maven and its project object model (POM). Listing 1 shows the basic project configuration to train and evaluate domain-specific NLP models with NLPf. Unlike standard Java projects, this project uses the custom packaging method nlp-models, which configures Maven to use NLPf's plugin (see nlp-maven-plugin) that trains and evaluates the domain-specific models. By convention, each document stored in src/main/corpus will be used as an input document for the training process and each document stored in src/test/corpus will be used to evaluate the derived NLP models.
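The directory convention described above can be illustrated with a minimal Java sketch. It is not part of NLPf: the class CorpusLayout and its method names are hypothetical and only show how a tool could resolve training and test documents from the two conventional directories without any further configuration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper (not NLPf API): resolves corpus documents by convention.
final class CorpusLayout {
    // Conventional locations named in the text above.
    static final String TRAINING_DIR = "src/main/corpus";
    static final String TEST_DIR = "src/test/corpus";

    // Lists all regular files below the given conventional directory, sorted by path.
    // A missing directory simply yields an empty list.
    static List<Path> documentsIn(Path projectRoot, String conventionalDir) throws IOException {
        Path dir = projectRoot.resolve(conventionalDir);
        if (!Files.isDirectory(dir)) {
            return Collections.emptyList();
        }
        try (Stream<Path> files = Files.walk(dir)) {
            return files.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
    }

    static List<Path> trainingDocuments(Path projectRoot) throws IOException {
        return documentsIn(projectRoot, TRAINING_DIR);
    }

    static List<Path> testDocuments(Path projectRoot) throws IOException {
        return documentsIn(projectRoot, TEST_DIR);
    }

    public static void main(String[] args) throws IOException {
        // Uses the current directory as project root unless a path is given.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println("training documents: " + trainingDocuments(root));
        System.out.println("test documents:     " + testDocuments(root));
    }
}
```

Because both locations are fixed by convention, the only input such a helper needs is the project root, which mirrors the CoC idea that Maven applies to src/main/java and src/test/java.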
NLPf supports multiple document formats, which need to be configured as Maven dependencies (see io-odt in Listing 2). Most formats supported by DKPro Core (de Castilho and Gurevych, 2014) are supported by NLPf, but we recommend using ODT documents because developers can just paste natural language text into the ODT documents and then annotate them without preparing specific document formats.

Quick Pad Tagger: Annotate Documents
NLPf provides the annotation tool Quick Pad Tagger (QPT), which offers a well-designed GUI that draws attention to the essential GUI elements of the annotation task. Figure 1 provides a screenshot of the QPT, showing how the user annotates named entities (NEs) in a document. The bottom of the GUI displays the current part of the document, and the top of the screen shows a stream of tokens in which the user can select multiple tokens (see blue boxes) to assign an NE type. Through the spinner on top of the stream of tokens the user chooses a type for each of the NEs. This design has been implemented consistently for each annotation layer and draws attention to the actual annotation task, e. g. assigning NE types or part-of-speech (POS) tags to tokens. The user can use an Xbox 360 controller to annotate the structure of natural language. This type of input device provides a more comfortable and playful user experience, and in conjunction with the GUI design the annotation process is less painful and less exhausting. Additionally, the QPT provides a semi-automatic annotation process (Schreiber et al., 2015), which speeds up the annotation process further. In summary, the QPT reduces the required annotation time by half.

Install Best-performing NLP Pipeline Artifact
When the documents of the corpus project have been annotated, developers can use a single command to train all available NLP tools, determine the best-performing NLP pipeline, and create an artifact which will be used in an NLP application (see Section 5). These steps are performed by mvn install, and the custom Maven plugin (see nlp-maven-plugin in Listing 1) runs through the following customized life-cycle:
• At first, the Maven plugin validates the annotated documents, for example, it ensures that either every token or no token of a document has been annotated with a corresponding POS tag.
• After that, the Maven plugin looks up all NLP trainer classes which are available on the classpath (cf. de.tudarmstadt.ukp.dkpro.core.opennlp-asl in Listing 2). Each discovered trainer class is used to create a domain-specific NLP model if the required annotations are available, and the configuration is stored in the target directory. The configurations are stored in a format compatible with the Unstructured Information Management Architecture (UIMA) framework (Ferrucci and Lally, 2004).
• If NLP tools do not provide any training, e. g. the segmentation tool of Stanford CoreNLP, developers can provide engine factories (see Listing 3) which create configurations for these tools; these configurations are also stored in the target directory.
• All available configurations are used to create all possible domain-specific NLP pipeline configurations, and each NLP pipeline is evaluated with the F1 score by running the pipelines on the test documents and comparing the results with the provided test annotations. The configuration of the best-performing NLP pipeline is stored in the target directory.
• Based on the previous steps the Maven plugin creates a Java archive (JAR) which contains the NLP models and configuration of the best-performing NLP pipeline.
• Finally, the created JAR artifact can be installed or deployed into any Maven repository.

The provided API integrates seamlessly into the API of the UIMA framework, which provides an interface to run NLP components on unstructured data such as natural language text, cf. method runPipeline in Listing 5. However, the best-performing NLP pipeline has to be configured manually. NLPf's plumbing JAR artifact provides the method createBestPerformingPipelineEngineDescription(), which reads the configuration of the JAR that contains the configuration and models of the best-performing NLP pipeline.

Listing 5: Example Application Java Code

The example code provided in Listing 5 performs the following steps, executed by runPipeline:
• It reads an ODT file with the name plain.odt, cf. readerDescription.
• Then, it runs the best-performing NLP pipeline which annotates the whole document with the natural language structure.
• Finally, it stores the annotations in an ODT file in the current directory, cf. writerDescription.
Developers can integrate custom analyses as they require them (see // integrate custom... in Listing 5). To do so, they need to implement UIMA annotators which use the type system of DKPro Core. The combination of UIMA, DKPro Core, and NLPf allows developers to implement NLP applications effectively.
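The artifact integrated here results from the exhaustive pipeline search of Section 4, which can be sketched in plain Java as follows. This is an illustration under stated assumptions, not NLPf's implementation: the class PipelineSelection, its methods, the component names, and the placeholder scores in main are all hypothetical. The sketch enumerates one candidate per annotation layer (a Cartesian product), scores every resulting pipeline, and keeps the one with the highest score, where F1 is the usual harmonic mean of precision and recall.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Hypothetical sketch (not NLPf API): exhaustive search for the best-scoring pipeline.
final class PipelineSelection {

    // F1 = 2 * P * R / (P + R), computed from true/false positive and false negative counts.
    static double f1(int truePositives, int falsePositives, int falseNegatives) {
        if (truePositives == 0) {
            return 0.0; // avoids division by zero when nothing was found correctly
        }
        double precision = truePositives / (double) (truePositives + falsePositives);
        double recall = truePositives / (double) (truePositives + falseNegatives);
        return 2 * precision * recall / (precision + recall);
    }

    // Builds every pipeline that picks exactly one candidate per annotation layer.
    static List<List<String>> allPipelines(List<List<String>> candidatesPerLayer) {
        List<List<String>> pipelines = new ArrayList<>();
        pipelines.add(new ArrayList<>());
        for (List<String> candidates : candidatesPerLayer) {
            List<List<String>> extended = new ArrayList<>();
            for (List<String> prefix : pipelines) {
                for (String candidate : candidates) {
                    List<String> pipeline = new ArrayList<>(prefix);
                    pipeline.add(candidate);
                    extended.add(pipeline);
                }
            }
            pipelines = extended;
        }
        return pipelines;
    }

    // Returns the pipeline with the highest score according to the given scorer.
    static List<String> best(List<List<String>> pipelines, ToDoubleFunction<List<String>> scorer) {
        return pipelines.stream()
                .max(Comparator.comparingDouble(scorer))
                .orElseThrow(() -> new IllegalArgumentException("no pipeline to select from"));
    }

    public static void main(String[] args) {
        // Placeholder component names; a real build would discover them on the classpath.
        List<List<String>> layers = Arrays.asList(
                Arrays.asList("segmenter-A", "segmenter-B"),
                Arrays.asList("ner-A", "ner-B"));
        // Placeholder scorer with made-up counts, for illustration only; a real scorer
        // would run the pipeline on the test documents and compare with the annotations.
        ToDoubleFunction<List<String>> scorer =
                pipeline -> pipeline.contains("ner-B") ? f1(9, 1, 1) : f1(7, 3, 3);
        System.out.println("candidates: " + allPipelines(layers));
        System.out.println("selected:   " + best(allPipelines(layers), scorer));
    }
}
```

The number of candidate pipelines grows with the product of the candidates per layer, which is why automating this search in the build is more practical than scripting each combination by hand.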

Summary
This paper provides a demonstration of the NLP Lean Programming framework (NLPf), which enables developers to create domain-specific NLP pipelines more effectively by making fewer decisions through CoC. NLPf provides a standardized environment and the well-designed annotation tool Quick Pad Tagger (QPT) with an improved input mechanism to improve the annotation process. Additionally, the best-performing NLP pipeline is determined through the build tool Maven, and the resulting artifact can be integrated conveniently as a Maven dependency into any application.
NLPf is open-source software, released under the LGPL version 3, and available at https://gitlab.com/schrieveslaach/NLPf. All artifacts are available on Maven Central and they can also be used with Jython in Python programs.