Argument Mining with Structured SVMs and RNNs

We propose a novel factor graph model for argument mining, designed for settings in which the argumentative relations in a document do not necessarily form a tree structure. (This is the case in over 20% of the web comments dataset we release.) Our model jointly learns elementary unit type classification and argumentative relation prediction. Moreover, our model supports SVM and RNN parametrizations, can enforce structure constraints (e.g., transitivity), and can express dependencies between adjacent relations and propositions. Our approaches outperform unstructured baselines in both web comments and argumentative essay datasets.


Introduction
Argument mining consists of the automatic identification of argumentative structures in documents, a valuable task with applications in policy making, summarization, and education, among others. The argument mining task includes the tightly-knit subproblems of classifying propositions into elementary unit types and detecting argumentative relations between the elementary units. The desired output is a document argumentation graph structure, such as the one in Figure 1, where propositions are denoted by letter subscripts, and the associated argumentation graph shows their types and support relations between them.
Most annotation and prediction efforts in argument mining have focused on tree or forest structures (Peldszus and Stede, 2015; Stab and Gurevych, 2016), constraining argument structures to form one or more trees. This makes the problem computationally easier by enabling the use of maximum spanning tree-style parsing approaches. However, argumentation in the wild can be less well-formed. The argument put forth in Figure 1, for instance, consists of two components: a simple tree structure and a more complex graph structure (c jointly supports b and d).
In this work, we design a flexible and highly expressive structured prediction model for argument mining, jointly learning to classify elementary units (henceforth propositions) and to identify the argumentative relations between them (henceforth links). By formulating argument mining as inference in a factor graph (Kschischang et al., 2001), our model (described in Section 4) can account for correlations between the two tasks, can consider second order link structures (e.g., in Figure 1, c → b → a), and can impose arbitrary constraints (e.g., transitivity). To parametrize our models, we evaluate two alternative directions: linear structured SVMs (Tsochantaridis et al., 2005), and recurrent neural networks with structured loss, extending Kiperwasser and Goldberg (2016). Interestingly, RNNs perform poorly when trained with classification losses, but become competitive with the feature-engineered structured SVMs when trained within our proposed structured learning model.
We evaluate our approach on two argument mining datasets. The first is our new Cornell eRulemaking Corpus (CDCP), consisting of argument annotations on comments from an eRulemaking discussion forum, where links don't always form trees (Figure 1 shows an abridged example comment, and Section 3 describes the dataset in more detail). The second is the UKP argumentative essays v2 (henceforth UKP), where argument graphs are annotated strictly as multiple trees (Stab and Gurevych, 2016). In both cases, the results presented in Section 5 confirm that our models outperform unstructured baselines. On UKP, we improve link prediction over the best result reported by Stab and Gurevych (2016), which is based on integer linear programming postprocessing. For insight into the strengths and weaknesses of the proposed models, as well as into the differences between SVM and RNN parameterizations, we perform an error analysis in Section 5.1. To support argument mining research, we also release our Python implementation, Marseille.

Related work
Our factor graph formulation draws from ideas previously used independently in parsing and argument mining. In particular, maximum spanning tree (MST) methods for arc-factored dependency parsing have been successfully used by McDonald et al. (2005) and applied to argument mining with mixed results by Peldszus and Stede (2015). As they are not designed for the task, MST parsers cannot directly handle proposition classification or model the correlation between proposition and link prediction, a limitation our model addresses. Using RNN features in an MST parser with a structured loss was proposed by Kiperwasser and Goldberg (2016); their model can be seen as a particular case of our factor graph approach, limited to link prediction with a tree structure constraint. Our models support multi-task learning for proposition classification, parametrizing adjacent links with higher-order structures (e.g., c → b → a), and enforcing arbitrary constraints on the link structure, not limited to trees. Such higher-order structures and logic constraints have been successfully used for dependency and semantic parsing by Martins et al. (2013) and Martins and Almeida (2014); to our knowledge we are the first to apply them to argument mining, as well as the first to parametrize them with neural networks. Stab and Gurevych (2016) used an integer linear program to combine the output of independent proposition and link classifiers using a hand-crafted scoring formula, an approach similar to our baseline. Our factor graph method can combine the two tasks in a more principled way, as it fully learns the correlation between the two tasks without relying on hand-crafted scoring, and therefore can readily be applied to other argumentation datasets. Furthermore, our model can enforce the tree structure constraint, required on the UKP dataset, using the MST cycle constraints of Stab and Gurevych (2016), thanks to the AD³ inference algorithm (Martins et al., 2015).
Sequence tagging has been applied to the related structured tasks of proposition identification and classification (Stab and Gurevych, 2016; Habernal and Gurevych, 2016; Park et al., 2015b); integrating such models is an important next step. Meanwhile, a new direction in argument mining explores pointer networks (Potash et al., 2016), a promising method that currently lacks support for tree structures and domain-specific constraints.

Data
We release a new argument mining dataset consisting of user comments about rule proposals regarding Consumer Debt Collection Practices (CDCP) by the Consumer Financial Protection Bureau, collected from an eRulemaking website, http://regulationroom.org.
Argumentation structures found in web discussion forums, such as the eRulemaking one we use, can be more free-form than the ones encountered in controlled, elicited writing such as that of Peldszus and Stede (2015). For this reason, we adopt the model proposed by Park et al. (2015a), which does not constrain links to form tree structures, but allows unrestricted directed graphs. Indeed, over 20% of the comments in our dataset exhibit local structures that would not be allowable in a tree. Possible link types are reason and evidence, and proposition types are split into five fine-grained categories: POLICY and VALUE contain subjective judgements/interpretations, where only the former specifies a specific course of action to be taken. On the other hand, TESTIMONY and FACT do not contain subjective expressions, the former being about personal experience, or "anecdotal." Lastly, REFERENCE covers URLs and citations, which are used to point to objective evidence in an online setting.
In comparison, the UKP dataset (Stab and Gurevych, 2016) only makes the syntactic distinction between CLAIM, MAJOR CLAIM, and PREMISE types, but it also includes attack links. The permissible link structure is stricter in UKP, with links constrained in annotation to form one or more disjoint directed trees within each paragraph. Also, since web arguments are not necessarily fully developed, our dataset has many argumentative propositions that are not in any argumentation relations. In fact, it isn't unusual for comments to have no argumentative links at all: 28% of CDCP comments have no links, unlike UKP, where all essays have complete argument structures. Such comments with no links make the problem harder, emphasizing the importance of capturing the lack of argumentative support, not only its presence.

Annotation results
Each user comment was annotated by two annotators, who independently annotated the boundaries and types of propositions, as well as the links among them. To produce the final corpus, a third annotator manually resolved the conflicts, and two automatic preprocessing steps were applied: we take the link transitive closure, and we remove a small number of nested propositions. The resulting dataset contains 731 comments, consisting of about 3800 sentences (≈4700 propositions) and 88k words. Out of the 43k possible pairs of propositions, links are present between only 1300 (roughly 3%). In comparison, UKP has fewer documents (402), but they are longer, with a total of 7100 sentences (6100 propositions) and 147k words. Since UKP links only occur within the same paragraph and propositions not connected to the argument are removed in a preprocessing step, link prediction is less imbalanced in UKP, with 3800 pairs of propositions being linked out of a total of 22k (17%). We reserve a test set of 150 documents (973 propositions, 272 links) from CDCP, and use the provided 80-document test split from UKP (1266 propositions, 809 links).
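The transitive-closure preprocessing step can be illustrated with a short sketch. This is not the code used to build the corpus: the function name `transitive_closure` and the representation of links as (source, target) pairs are assumptions made for the example.

```python
def transitive_closure(links):
    """Return the transitive closure of a set of directed links.

    `links` is an iterable of (source, target) pairs; this representation
    is an assumption made for illustration.
    """
    closure = set(links)
    changed = True
    while changed:
        changed = False
        # Whenever a -> b and b -> c are present, add a -> c.
        # Self-links (a == c) are skipped, since a proposition
        # supporting itself is not a meaningful argumentative link.
        new_links = {
            (a, c)
            for (a, b) in closure
            for (b2, c) in closure
            if b == b2 and a != c
        }
        if not new_links <= closure:
            closure |= new_links
            changed = True
    return closure


# The chain c -> b -> a from Figure 1 gains the implied link c -> a.
links = {("c", "b"), ("b", "a")}
print(sorted(transitive_closure(links)))
```

The loop repeats until no new link is added, so chains longer than two links are also closed.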

Structured learning for argument mining

Preliminaries
Binary and multi-class classification have been applied with some success to proposition and link prediction separately, but we seek a way to jointly learn the argument mining problem at the document level, to better model contextual dependencies and constraints. We therefore turn to structured learning, a framework that provides the desired level of expressivity. In general, learning from a dataset of documents $x_i \in \mathcal{X}$ and their associated labels $y_i \in \mathcal{Y}$ involves seeking model parameters w that can "pick out" the best label under a scoring function f:

$$\hat{y} := \arg\max_{y \in \mathcal{Y}} f(x, y; w). \qquad (1)$$

Unlike classification or regression, where $\mathcal{X}$ is usually a feature space $\mathbb{R}^d$ and $\mathcal{Y} \subseteq \mathbb{R}$ (e.g., we predict an integer class index or a probability), in structured learning, more complex inputs and outputs are allowed. This makes the arg max in Equation 1 impossible to evaluate by enumeration, so it is desirable to find models that decompose over smaller units and dependencies between them; for instance, as factor graphs. In this section, we give a factor graph description of our proposed structured model for argument mining.
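To make the infeasibility of enumeration concrete, the sketch below counts the joint labelings of a single document, assuming one type variable per proposition and one binary variable per ordered pair of distinct propositions. The function name `num_joint_labelings` is hypothetical, and the count ignores any structural constraints.

```python
def num_joint_labelings(n_propositions, n_types):
    """Count the joint assignments of proposition types and binary link labels.

    Assumes one type variable per proposition and one on/off variable for
    each ordered pair of distinct propositions, with no constraints applied.
    """
    n_links = n_propositions * (n_propositions - 1)
    return n_types ** n_propositions * 2 ** n_links


# Growth with document length, for five proposition types.
for n in (3, 6, 10):
    print(n, num_joint_labelings(n, n_types=5))
```

Even a document with a handful of propositions has far too many candidate outputs to score one by one, which is why a decomposition into factors is needed.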

Model description
An input document is a string of words with proposition offsets delimited. We denote the propositions in a document by {a, b, c, ...} and the possible directed link between a and b as a → b.
The argument structure we seek to predict consists of the type of each proposition $y_a \in P$ and a binary label for each link $y_{a \to b} \in R = \{\text{on}, \text{off}\}$. The possible proposition types P differ for the two datasets; such differences are documented in Table 1. As we describe the variables and factors constituting a document's factor graph, we shall refer to Figure 2 for illustration.

Figure 2: Factor graphs for a document with three propositions (a, b, c) and the six possible edges between them, and some of the factors used, illustrating differences and similarities between our models for the two datasets. Unary factors are light gray; compatibility factors are black. Factors not part of the basic model have curved edges: higher-order factors are orange and on the right; link structure factors are hollow, as they don't have any parameters. Strict constraint factors are omitted for simplicity.
Unary potentials. Each proposition a and each link a → b has a corresponding random variable in the factor graph (the circles in Figure 2). To encode the model's belief in each possible value for these variables, we parametrize the unary factors (gray boxes in Figure 2) with unary potentials: $\phi(a) \in \mathbb{R}^{|P|}$ holds a score for each possible proposition type $y_a$. Similarly, link unary potentials $\phi(a \to b) \in \mathbb{R}^{|R|}$ are scores for $y_{a \to b}$ being on/off. Without any other factors, this would amount to independent classifiers for each task.
Compatibility factors. For every possible link a → b, the variables (a, b, a → b) are bound by a dense factor scoring their joint assignment (the black boxes in Figure 2). Such a factor could automatically learn to encourage links from compatible types (e.g., from TESTIMONY to POLICY) or discourage links between less compatible ones (e.g., from FACT to TESTIMONY). In the simplest form, this factor would be parametrized as a tensor $T \in \mathbb{R}^{|P| \times |P| \times |R|}$, with $t_{ijk}$ retaining the score of a source proposition of type i to be (k = on) or not to be (k = off) in a link with a proposition of type j. For more flexibility, we parametrize this factor with compatibility features depending only on simple structure: $t_{ijk}$ becomes a vector, and the score of configuration (i, j, k) is given by $v_{ab}^\top t_{ijk}$, where $v_{ab}$ consists of three binary features:
• bias: a constant value of 1, allowing T to learn a base score for a label configuration (i, j, k), as in the simple form above,
• adjacency: when there are no other propositions between the source and the target,
• order: when the source precedes the target.
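The compatibility score can be sketched in a few lines of NumPy. Everything named here is an assumption for illustration: propositions are identified by integer positions and integer type indices, the link states are indexed as (on, off), and random weights stand in for learned parameters.

```python
import numpy as np

N_TYPES = 5     # |P|, e.g. the five CDCP proposition types
ON, OFF = 0, 1  # assumed indexing of the two link states
FEATURES = ("bias", "adjacency", "order")

rng = np.random.default_rng(0)
# One weight vector t_ijk per (source type i, target type j, link state k).
T = rng.normal(size=(N_TYPES, N_TYPES, 2, len(FEATURES)))


def compat_features(src_pos, trg_pos):
    """Binary features v_ab computed from proposition positions."""
    bias = 1.0
    adjacent = float(abs(src_pos - trg_pos) == 1)  # nothing in between
    src_first = float(src_pos < trg_pos)           # source precedes target
    return np.array([bias, adjacent, src_first])


def compat_score(T, src_type, trg_type, state, src_pos, trg_pos):
    """Score v_ab^T t_ijk of the joint assignment (i, j, k)."""
    return compat_features(src_pos, trg_pos) @ T[src_type, trg_type, state]


# Score of "type 3 at position 0 links to type 0 at position 1".
print(compat_score(T, src_type=3, trg_type=0, state=ON, src_pos=0, trg_pos=1))
```

When only the bias feature fires (a non-adjacent source that follows its target), the score reduces to the single entry of the simple tensor form.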
Second order factors. Local argumentation graph structures such as a → b → c might be modeled better together rather than through separate link factors for a → b and b → c. As in higher-order structured models for semantic and dependency parsing (Martins et al., 2013; Martins and Almeida, 2014), we implement three types of second order factors: grandparent (a → b → c), sibling (a ← b → c), and co-parent (a → b ← c). Not all of these types of factors make sense on all datasets: as sibling structures cannot exist in directed trees, we don't use sibling factors on UKP. On CDCP, by transitivity, every grandparent structure implies a corresponding sibling, so it is sufficient to parametrize siblings. This difference between datasets is emphasized in Figure 2, where one example of each type of factor is pictured on the right side of the graphs (orange boxes with curved edges): on CDCP we illustrate a co-parent factor (top right) and a sibling factor (bottom right), while on UKP we show a co-parent factor (top right) and a grandparent factor (bottom right). We call these factors second order because they involve two link variables, scoring the joint assignment of both links being on.
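As a concrete illustration, the sketch below enumerates second order structures in a set of links. It takes grandparent to mean a → b → c, sibling to mean two links leaving the same proposition, and co-parent to mean two links entering the same proposition; this reading is an assumption consistent with the remarks above about directed trees and transitivity. The function name and the (source, target) link representation are hypothetical.

```python
from itertools import combinations


def second_order_structures(links):
    """Enumerate second order structures among directed (source, target) links."""
    links = set(links)
    # Grandparent: two chained links a -> b -> c.
    grandparents = {
        (a, b, c)
        for (a, b) in links
        for (b2, c) in links
        if b == b2 and a != c
    }
    siblings = set()
    coparents = set()
    for (s1, t1), (s2, t2) in combinations(links, 2):
        if s1 == s2:  # two links leaving the same proposition
            siblings.add((s1,) + tuple(sorted((t1, t2))))
        if t1 == t2:  # two links entering the same proposition
            coparents.add(tuple(sorted((s1, s2))) + (t1,))
    return {"grandparent": grandparents, "sibling": siblings, "coparent": coparents}


links = {("c", "b"), ("b", "a"), ("c", "d")}
print(second_order_structures(links))
```

Adding the transitive link c → a to this example creates both a new sibling pair (c → a, c → b) and a co-parent pair (b → a, c → a), which shows how grandparent structures are covered by the other two types once transitivity holds.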
Valid link structure. The global structure of argument links can be further constrained using domain knowledge. We implement this using constraint factors; these have no parameters and are denoted by empty boxes in Figure 2. In general, well-formed arguments should be cycle-free. In the UKP dataset, links form a directed forest and can never cross paragraphs. This particular constraint can be expressed as a series of tree factors, one for each paragraph (the factor connected to all link variables in Figure 2). In CDCP, links do not form a tree, but we use logic constraints to enforce transitivity (top left factor in Figure 2) and to prevent symmetry (bottom left); the logic formulas implemented by these factors are described in Table 1. Together, the two constraints have the desirable side effect of preventing cycles.
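The interaction between the two CDCP constraints can be checked on a toy example. The sketch below encodes the usual definitions of transitivity and asymmetry over a set of (source, target) links; the function names are hypothetical, and the code only tests a given configuration, whereas the model enforces the constraints inside inference.

```python
def violates_transitivity(links):
    """True if some a -> b and b -> c are on while a -> c is off."""
    links = set(links)
    return any(
        (a, c) not in links
        for (a, b) in links
        for (b2, c) in links
        if b == b2 and a != c
    )


def violates_asymmetry(links):
    """True if some a -> b and b -> a are both on."""
    links = set(links)
    return any((b, a) in links for (a, b) in links)


def is_valid(links):
    return not violates_transitivity(links) and not violates_asymmetry(links)


cycle = {("a", "b"), ("b", "c"), ("c", "a")}
print(is_valid(cycle))                                # a cycle is rejected
print(is_valid({("c", "b"), ("b", "a"), ("c", "a")}))  # a closed chain is accepted
```

A cycle cannot satisfy both checks: transitivity forces a direct link between any two propositions on the cycle in each direction, and asymmetry then forbids that pair.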
Strict constraints. We may include further domain-specific constraints into the model, to express certain disallowed configurations. For instance, proposition types that appear in CDCP data can be ordered by the level of objectivity (Park et al., 2015a), as shown in Table 1. In a well-formed argument, we would want to see links from more objective to equally or less objective propositions: it's fine to provide FACT as reason for VALUE, but not the other way around. While the training data sometimes violates this constraint, enforcing it might provide a useful inductive bias.
Inference. The arg max in Equation 1 is a MAP over a factor graph with cycles and many overlapping factors, including logic factors. While exact inference methods are generally unavailable, our setting is perfectly suited for the Alternating Directions Dual Decomposition (AD³) algorithm: approximate inference on expressive factor graphs with overlapping factors, logic constraints, and generic factors (e.g., directed tree factors) defined through maximization oracles (Martins et al., 2015). When AD³ returns an integral solution, it is globally optimal, but when solutions are fractional, several options are available. At test time, for analysis, we retrieve exact solutions using the branch-and-bound method. At training time, however, fractional solutions can be used as-is; this makes better use of each iteration and actually increases the ratio of integral solutions in future iterations, as well as at test time, as proven by Meshi et al. (2016). We also find that after around 15 training iterations with fractional solutions, over 99% of inference calls are integral.
Learning. We train the models by minimizing the structured hinge loss (Taskar et al., 2004):

$$\sum_i \max_{y' \in \mathcal{Y}} \big( f(x_i, y'; w) + \rho(y_i, y') \big) - f(x_i, y_i; w), \qquad (2)$$

where ρ is a configurable misclassification cost. The max in Equation 2 is not the same as the one used for prediction, in Equation 1. However, when the cost function ρ decomposes over the variables, cost-augmented inference amounts to regular inference after augmenting the potentials accordingly. We use a weighted Hamming cost:

$$\rho(y, \hat{y}) := \sum_v \rho(y_v) \, \mathbb{I}[y_v \neq \hat{y}_v],$$

where v is summed over all variables in a document $\{a\} \cup \{a \to b\}$, and $\rho(y_v)$ is a misclassification cost. We set $\rho(y_v)$ to 1 for all mistakes except false-negative links, where we use a higher cost proportional to the class imbalance in the training split, effectively giving more weight to positive links during training.
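The claim that cost-augmented inference reduces to regular inference on augmented potentials can be illustrated for the unary part of the model. The sketch below is a simplification under stated assumptions: every variable has the same number of states, link states are indexed as (on, off), and the false-negative cost of 5 is an arbitrary stand-in for the imbalance-based value.

```python
import numpy as np

ON, OFF = 0, 1  # assumed indexing of link states


def hamming_cost(y_true, y_pred, costs):
    """Weighted Hamming cost: sum of costs[v] over variables that differ."""
    y_true, y_pred, costs = map(np.asarray, (y_true, y_pred, costs))
    return float(np.sum(costs * (y_true != y_pred)))


def cost_augment(potentials, y_true, costs):
    """Add costs[v] to the potential of every wrong state of variable v."""
    potentials = np.asarray(potentials, dtype=float)
    augmented = potentials + np.asarray(costs, dtype=float)[:, None]
    rows = np.arange(len(y_true))
    augmented[rows, y_true] = potentials[rows, y_true]  # true state unchanged
    return augmented


# Three link variables; only the first is truly on, so missing it costs more.
y_true = np.array([ON, OFF, OFF])
costs = np.array([5.0, 1.0, 1.0])
potentials = np.array([[0.0, 0.5], [0.1, 0.2], [1.0, 0.0]])

y_pred = np.array([OFF, OFF, ON])
rows = np.arange(3)
score_plus_cost = potentials[rows, y_pred].sum() + hamming_cost(y_true, y_pred, costs)
augmented_score = cost_augment(potentials, y_true, costs)[rows, y_pred].sum()
print(score_plus_cost, augmented_score)
```

Because the two quantities agree for every candidate labeling, the max in Equation 2 can be computed by the same inference routine used for prediction.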

Argument structure SVM
One option for parameterizing the potentials of the unary and higher-order factors is with linear models, using proposition, link, and higher-order features. This gives rise to a linear structured SVM (Tsochantaridis et al., 2005), which, when using $\ell_2$ regularization, can be trained efficiently in the dual using the online block-coordinate Frank-Wolfe algorithm of Lacoste-Julien et al. (2013), as implemented in the pystruct library (Müller and Behnke, 2014). This algorithm is more convenient than subgradient methods, as it does not require tuning a learning rate parameter.
Table 1 (recovered fragment): on UKP, the link structure is a directed forest, implemented with a TREEFACTOR over each paragraph and zero-potential "root" links a → *. Strict constraints: the link source must be at least as objective as the target (CDCP); the link source must be a premise (UKP).

Proposition features are lexical (unigrams and dependency tuples), structural (token statistics and proposition location), indicators (from hand-crafted lexicons), contextual, syntactic (subclauses, depth, tense, modal, and POS), probability, discourse (Lin et al., 2014), and average GloVe embeddings (Pennington et al., 2014). Link features are lexical (unigrams), syntactic (POS and productions), structural (token statistics, proposition statistics and location features), hand-crafted indicators, discourse triples, PMI, and shared noun counts. Our proposed higher-order factors for grandparent, co-parent, and sibling structures require features extracted from a proposition triplet a, b, c. In dependency and semantic parsing, higher-order factors capture relationships between words, so sparse indicator features can be efficiently used. In our case, since propositions consist of many words, BOW features may be too noisy and too dense; so for simplicity we again take a cue from the link-specific features used by Stab and Gurevych (2016). Our higher-order factor features are: same sentence indicators (for all 3 and for each pair), proposition order (one for each of the 6 possible orderings), Jaccard similarity (between all 3 and between each pair), presence of any shared nouns (between all 3 and between each pair), and shared noun ratios: nouns shared by all 3 divided by total nouns in each proposition and each pair, and shared nouns between each pair with respect to each proposition. Up to vocabulary size difference, our total feature dimensionality is approximately 7000 for propositions and 2100 for links. The number of second order features is 35.
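One of the higher-order feature groups, Jaccard similarity over a proposition triplet, might be computed as in the sketch below. The exact definition used for three propositions is not given in the text; the sketch assumes the size of the joint intersection divided by the size of the joint union, and assumes each proposition is already reduced to a set of tokens. Both function names are hypothetical.

```python
from itertools import combinations


def jaccard(*token_sets):
    """Size of the joint intersection divided by the size of the joint union."""
    sets = [set(s) for s in token_sets]
    union = set().union(*sets)
    if not union:
        return 0.0
    return len(set.intersection(*sets)) / len(union)


def triplet_jaccard_features(a, b, c):
    """Jaccard similarity between all three propositions, then between each pair."""
    feats = [jaccard(a, b, c)]
    feats.extend(jaccard(x, y) for x, y in combinations((a, b, c), 2))
    return feats


a = {"debt", "collector", "call"}
b = {"debt", "call"}
c = {"debt", "letter"}
print(triplet_jaccard_features(a, b, c))
```

The result has four entries, ordered as (a, b, c), (a, b), (a, c), (b, c); the shared-noun features could follow the same pattern after filtering tokens by part of speech.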
Hyperparameters. We pick the SVM regularization parameter C ∈ {0.001, 0.003, 0.01, 0.03, 0.1, 0.3} by k-fold cross-validation at document level, optimizing for the average between link and proposition F1 scores.

Argument structure RNN
Neural network methods have proven effective for natural language problems even with minimal-to-no feature engineering. Inspired by the use of LSTMs (Hochreiter and Schmidhuber, 1997) for MST dependency parsing by Kiperwasser and Goldberg (2016), we parametrize the potentials in our factor graph with an LSTM-based neural network, replacing MST inference with the more general AD³ algorithm, and using relaxed solutions for training when inference is inexact.
We extract embeddings of all words with a corpus frequency > 1, initialized with GloVe word vectors. We use a deep bidirectional LSTM to encode contextual information, representing a proposition a as the average of the LSTM outputs of its words, henceforth denoted $\overleftrightarrow{a}$.
Proposition potentials. We apply a multi-layer perceptron (MLP) with rectified linear activations to each proposition, with all layer dimensions equal except the final output layer, which has size |P| and is not passed through any nonlinearities.
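Link potentials. To score a dependency a → b, Kiperwasser and Goldberg (2016) pass the concatenation of the two endpoint representations through an MLP. Instead, we use a bilinear model: we first pass each proposition representation through a slot-specific dense layer, obtaining a source representation a and a target representation b, and then compute

$$\phi_{\text{on}}(a \to b) := a^\top W b + w_{\text{src}}^\top a + w_{\text{trg}}^\top b + w_0^{(\text{on})}.$$

Since the bilinear expression returns a scalar, but the link potentials must have a value for both the on and off states, we set the full potential to $\phi(a \to b) := [\phi_{\text{on}}(a \to b), w_0^{(\text{off})}]$, where $w_0^{(\text{off})}$ is a learned scalar bias. We initialize W to the identity matrix.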
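A bilinear link scorer of this kind can be sketched in NumPy. For brevity the sketch keeps only the bilinear term and the off-state bias, omitting any linear or on-state bias terms; the (on, off) ordering of the output and the name `link_potentials` are assumptions, and the inputs stand for proposition representations that have already passed through their dense layers.

```python
import numpy as np


def link_potentials(a, b, W, w_off):
    """Return the (on, off) potentials for the link a -> b.

    The on-state score is the bilinear form a^T W b; the off-state score
    is a scalar bias, which would be learned in practice.
    """
    phi_on = a @ W @ b
    return np.array([phi_on, w_off])


dim = 4
rng = np.random.default_rng(0)
a, b = rng.normal(size=dim), rng.normal(size=dim)

# With W initialized to the identity, the on-state score is the dot product.
W = np.eye(dim)
print(link_potentials(a, b, W, w_off=0.0))
```

Starting from the identity means that, before any training, a link is favored when the two representations point in similar directions.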
Second order potentials. Grandparent potentials φ(a → b → c) score two adjacent directed edges, in other words three propositions. We again first pass each proposition representation through a slot-specific dense layer. We implement a multilinear scorer analogously to the link potentials:

$$\phi(a \to b \to c) := \sum_{i,j,k} w_{ijk} \, a_i b_j c_k,$$

where $W = (w)_{ijk}$ is a third-order cube tensor. To reduce the large number of parameters, we implicitly represent W as a rank r tensor:

$$w_{ijk} = \sum_{s=1}^{r} u^{(1)}_{is} u^{(2)}_{js} u^{(3)}_{ks}.$$

Notably, this model captures only third-order interactions between the representations of the three propositions. To capture first-order "bias" terms, we could include slot-specific linear terms, e.g., $w_a^\top a$; but to further capture quadratic backoff effects (for instance, if two propositions carry a strong signal of being siblings regardless of their parent), we would require quadratically many parameters. Instead of explicit lower-order terms, we propose augmenting a, b, and c with a constant feature of 1, which has approximately the same effect, while benefiting from the parameter sharing in the low-rank factorization. Sibling and co-parent factors are similarly parametrized with their own tensors.
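The benefit of a rank r representation is that the trilinear score can be computed without ever building the cube tensor. The sketch below checks this numerically; the factor matrices `U1`, `U2`, `U3` and the function names are hypothetical, and random values stand in for learned parameters.

```python
import numpy as np


def augment(x):
    """Append a constant feature of 1 to a proposition representation."""
    return np.append(x, 1.0)


def trilinear_low_rank(a, b, c, U1, U2, U3):
    """Compute sum_s (a . U1[:, s]) (b . U2[:, s]) (c . U3[:, s])."""
    return float(np.sum((a @ U1) * (b @ U2) * (c @ U3)))


def trilinear_full(a, b, c, U1, U2, U3):
    """Same score, computed by materializing the full tensor W."""
    W = np.einsum("is,js,ks->ijk", U1, U2, U3)
    return float(np.einsum("ijk,i,j,k->", W, a, b, c))


dim, rank = 4, 3
rng = np.random.default_rng(0)
a, b, c = (augment(rng.normal(size=dim)) for _ in range(3))
U1, U2, U3 = (rng.normal(size=(dim + 1, rank)) for _ in range(3))

print(trilinear_low_rank(a, b, c, U1, U2, U3))
print(trilinear_full(a, b, c, U1, U2, U3))
```

The low-rank version stores three matrices of size (d + 1) × r instead of (d + 1)³ tensor entries. Because each augmented vector ends in 1, expanding the product also yields terms involving only one or two of the propositions, which is how the lower-order effects arise.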
Hyperparameters. We perform grid search using k-fold document-level cross-validation, tuning the dropout probability in the dense MLP layers over {0.05, 0.1, 0.15, 0.2, 0.25} and the optimal number of passes over the training data over {10, 25, 50, 75, 100}. We use 2 layers for the LSTM and the proposition classifier, 128 hidden units in all layers, and a multilinear decomposition with rank r = 16, after preliminary CV runs.

Baseline models
We compare our proposed models to equivalent independent unary classifiers. The unary-only version of a structured SVM is an $\ell_2$-regularized linear SVM. For the RNN, we compute unary potentials in the same way as in the structured model, but apply independent hinge losses at each variable, instead of the global structured hinge loss. Since the RNN weights are shared, this is a form of multi-task learning. The baseline predictions can be interpreted as unary potentials, therefore we can simply round their output to the highest scoring labels, or we can, alternatively, perform test-time inference, imposing the desired structure.

Results
We evaluate our proposed models on both datasets. For model selection and development we used k-fold cross-validation at document level: on CDCP we set k = 3 to avoid small validation folds, while on UKP we follow Stab and Gurevych (2016), setting k = 5. We compare our proposed structured learning systems (the linear structured SVM and the structured RNN) to the corresponding baseline versions. We organize our experiments in three incremental variants of our factor graph: basic, full, and strict, each with the following components:

component            basic   full   strict
unaries (baseline)   ✓       ✓      ✓
compat. factors      ✓       ✓      ✓
compat. features     -       ✓      ✓
higher-order         -       ✓      ✓
link structure       -       ✓      ✓
strict constraints   -       -      ✓

Following Stab and Gurevych (2016), we compute F1 scores at proposition and link level, and also report their average as a summary of overall performance. The results of a single prediction run on the test set are displayed in Table 2. The overall trend is that training using a structured objective is better than the baseline models, even when structured inference is applied on the baseline predictions. On UKP, for link prediction, the linear baseline can reach good performance when using inference, similar to the approach of Stab and Gurevych (2016), but the improvement in proposition prediction leads to higher overall F1 for the structured models. Meanwhile, on the more difficult CDCP setting, performing inference on the baseline output is not competitive. While feature engineering still outperforms our RNN model, we find that RNNs shine on proposition classification, especially on UKP, and that structured training can make them competitive, reducing their observed lag on link prediction (Katiyar and Cardie, 2016), possibly through mitigating class imbalance.

Table 2: Test set F1 scores for link and proposition classification, as well as their average, on the two datasets. The number of test instances is shown in parentheses; best scores on overall tasks are in bold.
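For reference, link-level precision, recall, and F1 can be computed from sets of gold and predicted links as in the sketch below. It covers only the binary link case; how scores are aggregated over proposition types is not specified here, so that part is left out. The function name and the (source, target) link representation are assumptions.

```python
def precision_recall_f1(gold, predicted):
    """Precision, recall, and F1 of predicted links against gold links."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = {("c", "b"), ("b", "a"), ("c", "a")}
predicted = {("c", "b"), ("a", "b")}
print(precision_recall_f1(gold, predicted))
```

A model that predicts very few links can reach high precision with low recall, and one that overgenerates shows the opposite pattern, which is why F1 is a more informative summary than either number alone.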

Discussion and analysis
Contribution of compatibility features. The compatibility factor in our model can be visualized as conditional odds ratios given the source and target proposition types. Since there are only four possible configurations of the compatibility features, we can plot all cases in Figure 3, alongside the basic model. Not using compatibility features, the basic model can only learn whether certain configurations are more likely than others (e.g., a REFERENCE supporting another REFERENCE is unlikely, while a REFERENCE supporting a FACT is more likely), essentially a soft version of our domain-specific strict constraints. The full model with compatibility features is finer grained, capturing, for example, that links from REFERENCE to FACT are more likely when the reference comes after, or that links from VALUE to POLICY are extremely likely only when the two are adjacent.
Proposition errors. The confusion matrices in Figure 4 reveal that the most common confusion is misclassifying FACT as VALUE. The strongest difference between the various models tested is that the RNN-based models make this error less often. For instance, in the proposition "And the single most frequently used excuse of any debtor is 'I didn't receive the letter/invoice/statement'", the pronouns in the nested quote may be mistaken for subjectivity, leading to the structured SVMs' predictions of VALUE or TESTIMONY, while the basic structured RNN correctly classifies it as FACT.
Link errors. While structured inference certainly helps baselines by preventing invalid structures such as cycles, it still depends on local decisions, losing to fully structured training in cases where joint proposition and link decisions are needed. For instance, in the conclusion of one UKP essay (propositions a, b, and c), the annotators found no links. Indeed, no reasons are provided, but the baselines are misled by the connectives: the SVM baseline outputs that b and c are PREMISEs supporting the CLAIM a. The full structured SVM combines the two tasks and correctly recognizes the link structure. Linear SVMs are still a very good baseline, but they tend to overgenerate links due to class imbalance, even if we use class weights during training. Surprisingly, RNNs are at the opposite end, being extremely conservative, and getting the highest precision among the models. On CDCP, where the number of true links is 272, the linear baseline with strict inference predicts 796 links with a precision of only 16%, while the strict structured RNN only predicts 52 links, with 33% precision; the example in Figure 5 illustrates this. In terms of higher-order structures, we find that using higher-order factors increases precision, at a cost in recall. This is most beneficial for the 856 co-parent structures in the UKP test set: the full structured SVM has 53% F1, while the basic structured SVM and the basic baseline get 47% and 45% respectively. On CDCP, while higher-order factors help, performance on siblings and co-parents is below 10% F1 score. This is likely due to link sparsity and suggests plenty of room for further development.

Conclusions and future work
We introduce an argumentation parsing model based on AD³ relaxed inference in expressive factor graphs, experimenting with both linear structured SVMs and structured RNNs, parametrized with higher-order factors and link structure constraints. We demonstrate our model on a new argumentation mining dataset with more permissive argument structure annotation. Our model also achieves state-of-the-art link prediction performance on the UKP essays dataset.
Future work. Stab and Gurevych (2016) found polynomial kernels useful for modeling feature interactions, but kernel structured SVMs scale poorly, so we intend to investigate alternate ways to capture feature interactions. While we focus on monological argumentation, our model could be extended to dialogs, for which argumentation theory thoroughly motivates non-tree structures (Afantenos and Asher, 2014).