TARGER: Neural Argument Mining at Your Fingertips

We present TARGER, an open source neural argument mining framework for tagging arguments in free input texts and for keyword-based retrieval of arguments from an argument-tagged web-scale corpus. The currently available models are pre-trained on three recent argument mining datasets and enable the use of neural argument mining without any reproducibility effort on the user’s side. The open source code ensures portability to other domains and use cases.


Introduction
Argumentation is a multi-disciplinary field that extends from philosophy and psychology to linguistics and artificial intelligence. Recent developments in argument mining apply natural language processing (NLP) methods to argumentation (Palau and Moens, 2011; Lippi and Torroni, 2016a) and mostly focus on training classifiers on annotated text fragments to identify argumentative text units, such as claims and premises (Biran and Rambow, 2011; Habernal et al., 2014; Rinott et al., 2015). More specifically, current approaches mainly address three tasks: (1) detecting sentences that contain argumentative units, (2) detecting the boundaries of argumentative units inside sentences, and (3) identifying relations between argumentative units.
Despite active research in argument mining, there is a lack of freely available tools that enable users, especially non-experts, to make use of the field's recent advances. In this paper, we close this gap by introducing TARGER: a system with a user-friendly web interface (ltdemos.informatik.uni-hamburg.de/targer) that can extract argumentative units from free input texts in real time, using models trained on recent argument mining corpora with a highly configurable and efficient neural sequence tagger. TARGER's web interface and API also allow for very fast keyword-based argument retrieval from a pre-tagged version of the Common Crawl-based DepCC.
The native PyTorch implementation underlying TARGER has no external dependencies and is available as open source software; it can easily be incorporated into any existing NLP pipeline.

Related Work
There are three publicly available systems offering some functionality similar to TARGER. ArgumenText (Stab et al., 2018) is an argument search engine that retrieves argumentative sentences from the Common Crawl and labels them as pro or con given a keyword-based user query. Similarly, args.me (Wachsmuth et al., 2017) retrieves pro and con arguments from 300,000 arguments crawled from debating portals. Finally, MARGOT (Lippi and Torroni, 2016b) provides argument tagging for free-text inputs. However, MARGOT's response times are rather slow for single input sentences (>5 seconds), and its F1 scores of 17.5 for claim detection and 16.7 for evidence detection are slightly worse than those of our approach (see Section 4.1).
TARGER offers a real-time retrieval functionality similar to ArgumenText and fast real-time free-text argument tagging with the option of switching between different pre-trained state-of-the-art models (MARGOT offers only a single one).

Architecture of TARGER
The independent components of the modular and flexible TARGER framework are shown in Figure 1. A BiLSTM-CNN-CRF sequence tagger is trained on different datasets, yielding a variety of argument mining models (details in Section 3.1). As part of the preprocessing, the trained models are run on the 14 billion sentences of the DepCC corpus to tag and store argument unit information as additional fields in an Elasticsearch BM25F index of the DepCC (details in Section 3.2). Online usage is handled via a Flask-based web app whose API accepts AJAX requests from the Web UI component as well as direct API calls (details in Sections 3.3 and 3.4). The web interface is based on the named entity visualiser displaCy ENT (github.com/explosion/displacy-ent). The API either routes free text inputs to the selected model to be tagged with argument information, or it routes keyword-based queries to the index to retrieve sentences in which the query terms match argument units.

Neural Sequence Tagger
We implement a BiLSTM-CNN-CRF neural tagger (Ma and Hovy, 2016) for identifying argumentative units and for classifying them as claims or premises. The BiLSTM-CNN-CRF method is a popular sequence tagging approach and achieves (near) state-of-the-art performance for tasks like named entity recognition and part-of-speech tagging (Ma and Hovy, 2016; Lample et al., 2016); it has also been used for argument mining before (Eger et al., 2017). The general method relies on pre-computed word embeddings, a single bidirectional LSTM/GRU recurrent layer, convolutional character-level embeddings to capture out-of-vocabulary words, and a first-order Conditional Random Field (Lafferty et al., 2001) to capture dependencies between adjacent tags. Besides the existing BiLSTM-CNN-CRF implementation of Reimers and Gurevych (2017), we also use our own Python 3.6 / PyTorch 1.0 implementation that does not contain any third-party dependencies, has native vectorized code for efficient training and evaluation, and supports several input data formats as well as evaluation functions.
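To illustrate how a first-order CRF layer captures dependencies between adjacent tags, the following is a minimal pure-Python sketch of Viterbi decoding. It is not TARGER's implementation (which is vectorized PyTorch code); the function name, the dictionary-based score representation, and the example tag names are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[Tuple[str, str], float],
            tags: List[str]) -> List[str]:
    """Return the highest-scoring tag sequence under a first-order CRF.

    emissions[t][tag] is the per-token score (e.g., produced by a BiLSTM);
    transitions[(prev, cur)] is the score for moving from prev to cur.
    Transitions that are not listed default to 0.0 (an assumption).
    """
    if not emissions:
        return []
    # best[t][tag]: best score of any path that ends in `tag` at position t
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []  # back[t-1][tag]: best predecessor of `tag` at position t
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0))
            scores[cur] = best[-1][prev] + transitions.get((prev, cur), 0.0) + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

With a strongly negative score for the transition from "O" to "I-Claim", the decoder prefers a well-formed "B-Claim", "I-Claim" sequence even when the per-token scores alone would favor "O" for the first token.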

Retrieval Functionality
Our background collection for the retrieval of argumentative sentences is formed by the DepCC corpus, a linguistically pre-processed subset of the Common Crawl containing 14.3 billion unique English sentences from 365 million web documents. The trained WebD-GloVe model was run on all the sentences in the DepCC corpus since it performed best on the web data in a pilot experiment. The respective argumentative unit spans and labels were added as additional fields to an Elasticsearch BM25F index of the DepCC.
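As a rough sketch of how a keyword query could be restricted to argument units stored as additional index fields, the snippet below builds an Elasticsearch query body as a plain Python dictionary. The field names (`claims_text`, `premises_text`) and the helper `build_query` are hypothetical; the actual field layout of the TARGER index is not specified here.

```python
from typing import Dict, List

# Hypothetical field names; the real index schema may differ.
UNIT_FIELDS = {"claim": "claims_text", "premise": "premises_text"}


def build_query(keywords: str, units: List[str], size: int = 10) -> Dict:
    """Build a query body that matches the keywords only in the chosen unit fields."""
    fields = [UNIT_FIELDS[unit] for unit in units]
    return {
        "size": size,
        # multi_match runs the same full-text query against several fields.
        "query": {"multi_match": {"query": keywords, "fields": fields}},
    }


body = build_query("nuclear energy", ["claim", "premise"])
```

Such a body could then be sent to the search endpoint of the index; ranking would follow whatever similarity function the index is configured with.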

TARGER API
To keep the TARGER framework modular and scalable while still allowing access to the models from external clients, online interaction is handled via a RESTful API. Each trained model is associated with a separate API endpoint accepting raw text as input. The output is provided as a list of word-level tokens with IOB-formatted labels for argument units (premises and claims) and the tagger's confidence scores for each label.
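A client would typically want to merge the token-level IOB output into contiguous argument units. The sketch below shows one way to do this, assuming each token is a dictionary with the keys `token`, `label`, and `prob` and labels of the form `B-Claim`, `I-Claim`, or `O`. These key names and the label spelling are assumptions for illustration and may not match the actual response format of the API.

```python
from typing import Dict, List


def iob_to_spans(tagged: List[Dict]) -> List[Dict]:
    """Group IOB-labelled tokens into argument units with a confidence value."""
    spans: List[Dict] = []
    current = None
    for item in tagged:
        prefix, _, unit = item["label"].partition("-")
        # A unit starts at "B-*", or at an "I-*" that does not continue the open unit.
        starts_new = prefix == "B" or (
            prefix == "I" and (current is None or current["type"] != unit)
        )
        if starts_new:
            if current is not None:
                spans.append(current)
            current = {"type": unit, "tokens": [item["token"]], "scores": [item["prob"]]}
        elif prefix == "I":
            current["tokens"].append(item["token"])
            current["scores"].append(item["prob"])
        else:  # "O": token outside any argumentative unit
            if current is not None:
                spans.append(current)
            current = None
    if current is not None:
        spans.append(current)
    # Report the weakest token confidence as a conservative score for the unit.
    return [
        {"type": s["type"], "text": " ".join(s["tokens"]), "min_prob": min(s["scores"])}
        for s in spans
    ]
```

Using the minimum token confidence per unit is a design choice of this sketch; averaging the scores would be an equally plausible alternative.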

TARGER Web UI
The web interface of TARGER offers two functionalities: Analyze Text and Search Arguments. On the analysis tab (cf. Figure 2), the user can choose one of the deployed models to identify arguments in a user-provided free text. The result is shown with colored labels for different types of argumentative units (premises and claims) as well as detected named entities (nested tags for entities in argumentative units are supported). Once a result is shown, it is possible to customize the display by enabling/disabling different labels without performing additional tagging runs. On the retrieval tab (cf. Figure 3), the user can enter a keyword query and choose whether it should be matched in claims, premises, etc. Every retrieved result is rendered as a text fragment colorized with argument and entity information just as on the analysis tab. To enable provenance, the URL of the source document is also provided.

Evaluation
To demonstrate that our neural tagger is able to reproduce the originally published argument mining performances, we compare the best performing of our pre-trained models (parameter settings at the end of Section 3.1) to the best performances from the original dataset publications. We also report on a pilot study using TARGER as a subroutine in runs for the TREC 2018 Common Core track. Table 3 compares TARGER's best-performing models on the Persuasive Essays and the Web Discourse datasets to the best performance in the original publications. We apply the experimental settings of the original papers: a fixed 70/20/10 train/dev/test split on the Essays data, and 10-fold cross-validation for Web Discourse (in our case allocating 7 folds for training and 2 for development in each iteration, so that the remaining fold serves as the test set).
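The 10-fold protocol with 7 training folds and 2 development folds per iteration can be sketched as follows. Which two folds serve as development data in each iteration is not stated above; taking the two folds that follow the test fold is an assumption of this sketch, as is the function name `cv_splits`.

```python
from typing import Iterator, List, Tuple


def cv_splits(n_folds: int = 10, n_dev: int = 2) -> Iterator[Tuple[List[int], List[int], List[int]]]:
    """Yield (train, dev, test) fold indices for each cross-validation iteration."""
    for i in range(n_folds):
        test = [i]
        # Assumption: the dev folds are the ones following the test fold (cyclically).
        dev = [(i + k) % n_folds for k in range(1, n_dev + 1)]
        train = [f for f in range(n_folds) if f not in test and f not in dev]
        yield train, dev, test


for train, dev, test in cv_splits():
    print(f"train={train} dev={dev} test={test}")
```

Each fold is used for testing exactly once, and the three sets are disjoint in every iteration.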

Experimental Results
On the Persuasive Essays dataset (paragraph level), the best TARGER model achieves a span-based micro-F1 of 64.54 for extracted argument components, which is on par with the best performance of 64.74±1.97 reported by Eger et al. (2017) for their STag BLCC approach (a BiLSTM-CRF-CNN (BLCC) approach similar to ours). On the Web Discourse dataset, the token-based macro-F1 of 24.20 achieved by TARGER's best model slightly improves upon the originally reported best macro-F1 of 22.90 (Habernal and Gurevych, 2017), which was achieved by a structural support vector machine model, SVMhmm, for sequence labeling (Joachims et al., 2009). The SVMhmm model uses lexical, structural, and other handcrafted feature types. In contrast, TARGER uses only word embeddings because, especially in cross-domain scenarios, handcrafted features show a strong tendency to overfit to the topics of the training texts (Habernal and Gurevych, 2017). Thus, we chose "word embeddings only" as a more robust feature type for our domain-agnostic general-purpose argument mining system (free input text and web data).
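For readers unfamiliar with the token-based macro-F1 used on the Web Discourse data, the following sketch computes it from two aligned label sequences: F1 is computed per label and then averaged without weighting by label frequency. This is a generic textbook formulation; the evaluation scripts of the original publications may differ in details such as which labels are included.

```python
from typing import List, Optional


def macro_f1(gold: List[str], pred: List[str], labels: Optional[List[str]] = None) -> float:
    """Unweighted mean of per-label F1 scores over aligned token labels."""
    labels = labels or sorted(set(gold) | set(pred))
    f1_scores = []
    for lab in labels:
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)


gold = ["O", "Claim", "Claim", "Premise"]
pred = ["O", "Claim", "Premise", "Premise"]
score = macro_f1(gold, pred)
```

Because every label contributes equally, rare labels influence the macro average as much as frequent ones, which is one reason macro-F1 values on imbalanced data tend to be low.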
We cannot compare TARGER's performance on the IBM dataset to originally published performances since the tasks are different. Instead of TARGER's identification of claims and premises, Levy et al. (2018) focus on the identification of relevant premises for a given claim (called "topic" in the original publication). Still, a large number of potential general domain premises for the overall 150 topics (i.e., claims) are contained in the dataset, such that we transformed the original entries to a token-level claim and premise annotation. This way, only some 2500 distinct tokens were labeled as not argumentative (e.g., punctuation) while the vast majority are tokens in claims and premises (although the only 150 distinct claims are heavily duplicated). Not surprisingly, given the class imbalance and duplication, the resulting trained TARGER models "optimistically" identify some argumentative units in almost every input text. We still provide the models as a starting point, with the intention to de-duplicate the data and to add more non-argumentative text passages for a more balanced and realistic training scenario.

TARGER @ TREC Common Core Track
As a proof of concept, we used TARGER's model pre-trained on essays with dependency-based embeddings in a TREC 2018 Common Core track submission (Bondarenko et al., 2018). The TARGER API served as a subroutine in a pipeline axiomatically re-ranking (Hagen et al., 2016) BM25F retrieval results with respect to their argumentativeness (presence/absence of arguments). For the Washington Post corpus used in the track, the dependency-based essays model best tagged argumentative units in a small pilot study. Out of 25 topics manually labeled as argumentative from the 50 Common Core track topics, the TARGER-based argumentativeness re-ranking improved the retrieval quality by more than 0.05 nDCG@10 for 4 topics (see Table 4). Argumentativeness-based re-ranking might thus be a viable way to integrate neural argument mining into the retrieval process, for instance, using TARGER.
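The nDCG@10 measure used to quantify the effect of re-ranking can be sketched as below. This version uses linear gains with a log2 rank discount and derives the ideal ranking from the judged list itself; both are simplifying assumptions, and official TREC evaluation tooling may compute the measure differently (for example, by normalizing against all relevant documents of a topic).

```python
import math
from typing import List


def dcg(relevances: List[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k results (linear gain, log2 discount)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg(relevances: List[float], k: int = 10) -> float:
    """DCG normalized by the DCG of the same judgments in ideal order."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance judgments in ranked order, before and after a hypothetical re-ranking.
before = [0, 1, 0, 1]
after = [1, 1, 0, 0]
improvement = ndcg(after) - ndcg(before)
```

A positive `improvement` means that the re-ranking moved relevant documents towards the top of the result list.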

Conclusion
We have presented TARGER: an open source system for tagging arguments in free text and for retrieving arguments from a web-scale corpus. With the available RESTful API and the web interface, we make recent argument mining technologies more accessible and usable to researchers and developers as well as the general public. The different argument mining models can easily be used to perform manual text analyses or can seamlessly be integrated into automatic NLP pipelines. New taggers can be deployed to TARGER at any time, so that users can have the state of the art in argument mining at their fingertips. For future work, we plan to integrate contextualized embeddings with ELMo- and BERT-based models (Peters et al., 2018; Devlin et al., 2018).
Finally, by looking at our experimental results as well as tagging examples for free input texts or the DepCC web data, we noticed that despite the recent advances in argument mining, there is still considerable headroom to improve in-domain, but especially out-of-domain argument tagging performance. Free input texts of different styles or genres taken from the web are tagged very inconsistently by the different models. More research on domain adaptation and transfer learning (Ruder, 2019) in the scenario of argument mining needs to address this issue.