Here’s My Point: Joint Pointer Architecture for Argument Mining

In order to determine argument structure in text, one must understand how individual components of the overall argument are linked. This work presents the first neural network-based approach to link extraction in argument mining. Specifically, we propose a novel architecture that applies Pointer Network sequence-to-sequence attention modeling to structural prediction in discourse parsing tasks. We then develop a joint model that extends this architecture to simultaneously address the link extraction task and the classification of argument components. The proposed joint model achieves state-of-the-art results on two separate evaluation corpora, showing far superior performance than the previously proposed corpus-specific and heavily feature-engineered models. Furthermore, our results demonstrate that jointly optimizing for both tasks is crucial for high performance.


Introduction
An important goal in argument mining is to understand the structure in argumentative text (Persing and Ng, 2016;Peldszus and Stede, 2015;Stab and Gurevych, 2016;Nguyen and Litman, 2016). One fundamental assumption when working with argumentative text is the presence of Arguments Components (ACs). The types of ACs are generally characterized as a claim or a premise (Govier, 2013), with premises acting as support (or possibly attack) units for claims (though some corpora have further AC types, such as major claim Gurevych, 2016, 2014b)).
The task of processing argument structure encapsulates four distinct subtasks (our work fo-cuses on subtasks 2 and 3): (1) Given a sequence of tokens that represents an entire argumentative text, determine the token subsequences that constitute non-intersecting ACs; (2) Given an AC, determine the type of AC (claim, premise, etc.); (3) Given a set/list of ACs, determine which ACs have directed links that encapsulate overall argument structure; (4) Given two linked ACs, determine whether the link is a supporting or attacking relation. This can be labeled as a 'micro' approach to argument mining (Stab and Gurevych, 2016). In contrast, there have been a number of efforts to identify argument structure at a higher level (Boltuzic andŠnajder, 2014;Ghosh et al., 2014;Cabrio and Villata, 2012), as well as slightly re-ordering the pipeline with respect to AC types (Rinott et al., 2015)).
There are two key assumptions our work makes going forward. First, we assume subtask 1 has been completed, i.e. ACs have already been identified. Second, we follow previous work that assumes a tree structure for the linking of ACs (Palau and Moens, 2009;Cohen, 1987;Peldszus and Stede, 2015;Stab and Gurevych, 2016). Specifically, a given AC can only have a single outgoing link, but can have numerous incoming links. Furthermore, there is a 'head' component that has no outgoing link (the top of the tree). Depending on the corpus (see Section 4), an argument structure can be either a single tree or a forest, consisting of multiple trees. Figure 1 shows an example that we will use throughout the paper to concretely explain how our approach works. First, the left side of the figure presents the raw text of a paragraph in a persuasive essay (Stab and Gurevych, 2016), with the ACs contained in square brackets. Squiggly vs straight underlining differentiates between claims and premises, respectively. The ACs have been annotated as to how they are linked, and the right side of the figure reflects this structure. The argument First, [ ::::::: cloning ::::: will ::: be :::::::::: beneficial :::: for ::::: many ::::::: people :::: who :::: are ::: in ::::: need ::: of ::::: organ  Figure 1: An example of argument structure with four ACs. The left side shows raw text that has been annotated for the presence of ACs. Squiggly or straight underlining means an AC is a claim or premise, respectively. The ACs in the text have also been annotated for links to other ACs, which is shown in the right figure. ACs 3 and 4 are premises that link to another premise, AC2. Finally, AC2 links to a claim, AC1. AC1 therefore acts as the central argumentative component. structure with four ACs forms a tree, where AC2 has two incoming links, and AC1 acts as the head, with no outgoing links. We also specify the type of AC, with the head AC marked as a claim and the remaining ACs marked as premises. Lastly, we note that the order of argument components can be a strong indicator of how components should relate. Linking to the first argument component can provide a competitive baseline heuristic (Peldszus and Stede, 2015;Stab and Gurevych, 2016).
Given the above considerations, we propose that sequence-to-sequence attention modeling, in the spirit of a Pointer Network (PN) (Vinyals et al., 2015b), can be effective for predicting argument structure. To the best of our knowledge, a clean, elegant implementation of a PN-based model has yet to be introduced for discourse parsing tasks. A PN is a sequence-to-sequence model ) that outputs a distribution over the encoding indices at each decoding timestep. More generally, it is a recurrent model with attention (Bahdanau et al., 2014), and we claim that as such, it is promising for link extraction because it inherently possesses three important characteristics: (1) it is able to model the sequential nature of ACs, (2) it constrains ACs to have a single outgoing link, thus partly enforcing the tree structure, and (3) the hidden representations learned by the model can be used for jointly predicting multiple subtasks. Furthermore, we believe the sequence-tosequence aspect of the model provides two distinct benefits: (1) it allows for two separate representations of a single AC (one for the source and one for the target of the link), and (2) the decoder networkcould learn to predict correct sequences of linked indices, which is a second recurrence over ACs. Note that we also test the sequence-to-sequence architecture against a simplified model that only uses hidden states from an encoding network to make predictions (see Section 5).
The main technical contribution of our work is a joint model that simultaneously predicts links between ACs and determines their type. Our joint model uses the hidden representation of ACs produced during the encoding step (see Section 3.4). While PNs were originally proposed to allow a variable length decoding sequence, our model differs in that it decodes for the same number of timesteps as there are inputs. This is a key insight that allows for a sequence-to-sequence model to be used for structural prediction. Aside from the partial assumption of tree structure in the argumentative text, our models do not make any additional assumptions about the AC types or connectivity, unlike the work of Peldszus (2014). Lastly, in respect to the broad task of parsing, our model is flexible because it can easily handle nonprojective, multi-root dependencies. We evaluate our models on the corpora of Stab and Gurevych (2016) and Peldszus (2014), and compare our results with the results of the aformentioned authors. Our results show that (1) joint modeling is imperative for competitive performance on the link extraction task, (2) the presence of the second recurrence improves performance over a non-sequence-to-sequence model, and (3) the joint model can outperform models with heavy featureengineering and corpus-specific constraints.

Related Work
Palau and Moens (2009) is an early work in argument mining, using a hand-crafted Context-Free Grammar to determine the structure of ACs in a corpus of legal texts. Lawrence et al. (2014) leverage a topic modeling-based AC similarity to uncover tree-structured arguments in philosophical texts. Recent work offers data-driven approaches to the task of predicting links between ACs. Stab and Gurevych (2014b) approach the task as a binary classification problem. The authors train an SVM with various semantic and structural features. Peldszus and Stede have also used classification models for predicting the presence of links (2015). The first neural network-based model for argumentation mining was proposed by Laha and Raykar (2016), who use two recurrent networks in end-to-end fashion to classify AC types.
Various authors have also proposed to jointly model link extraction with other subtasks from the argument mining pipeline, using either an Integer Linear Programming (ILP) framework (Persing and Ng, 2016; Stab and Gurevych, 2016) or directly feeding previous subtask predictions into a tree-based parser. The former joint approaches are evaluated on an annotated corpus of persuasive essays Gurevych, 2014a, 2016), and the latter on a corpus of microtexts (Peldszus, 2014). The ILP framework is effective in enforcing a tree structure between ACs when predictions are made from otherwise naive base classifiers.
Recurrent neural networks have previously been proposed to model tree/graph structures in a linear manner. Vinyals et al. (2015c) use a sequence-tosequence model for the task of syntactic parsing. Bowman et al. (2015) experiment on an artificial entailment dataset that is specifically engineered to capture recursive logic (Bowman et al., 2014). Standard recurrent neural networks can take in complete sentence sequences and perform competitively with a recursive neural network. Multitask learning for sequence-to-sequence has also been proposed (Luong et al., 2015), though none of the models used a PN for prediction.
In the field of discourse parsing, the work of  is the only work, to our knowledge, that incorporates attention into the network architecture. However, the attention is only used in the process of creating representations of the text itself. Attention is not used to predict the overall discourse structure. In fact, the model still relies on a binary classifier to determine if textual components should have a link. Arguably the most similar approach to ours is in the field of dependency parsing (Cheng et al., 2016). The authors propose a model that performs 'queries' between word representations in order to determine a distribution over potential headwords.

Proposed Approach
In this section, we describe our approach to using a sequence-to-sequence model with attention for argument mining, specifically, identifying AC types and extracting the links between them. We begin by giving a brief overview of these models.

Pointer Network
A PN is a sequence-to-sequence model  with attention (Bahdanau et al., 2014) that was proposed to handle decoding sequences over the encoding inputs, and can be extended to arbitrary sets (Vinyals et al., 2015a). The original motivation for a pointer network was to allow networks to learn solutions to algorithmic problems, such as the traveling salesperson and convex hull problems, where the solution is a sequence over input points. The PN model is trained on input/output sequence pairs (E, D), where E is the source and D is the target (our choice of E,D is meant to represent the encoding, decoding steps of the sequence-to-sequence model). Given model parameters Θ, we apply the chain rule to determine the probability of a single training example: (1) where the function m signifies that the number of decoding timesteps is a function of each individual training example. We will discuss shortly why we need to modify the original definition of m for our application. By taking the log-likelihood of Equation 1, we arrive at the optimization objective: which is the sum over all training example pairs. The PN uses Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) for sequential modeling, which produces a hidden layer h at each encoding/decoding timestep. In practice, the PN has two separate LSTMs, one for Figure 2: Applying a Pointer Network to the example paragraph in Figure 1 with LSTMs unrolled over time. Note that D1 points to itself to denote that it has not outgoing link and is therefore the head of a tree.
encoding and one for decoding. Thus, we refer to encoding hidden layers as e, and decoding hidden layers as d.
The PN uses a form of content-based attention (Bahdanau et al., 2014) to allow the model to produce a distribution over input elements. This can also be thought of as a distribution over input indices, wherein a decoding step 'points' to the input. Formally, given encoding hidden states (e 1 , ..., e n ), the model calculates p(D i |D 1 , ..., D i−1 , E) as follows: where matrices W 1 , W 2 and vector v are parameters of the model (along with the LSTM parameters used for encoding and decoding). In Equation 3, prior to taking the dot product with v, the resulting transformation can be thought of as creating a joint hidden representation of inputs i and j. Vector u i in equation 4 is of length n, and index j corresponds to input element j. Therefore, by taking the softmax of u i , we are able to create a distribution over the input.

Link Extraction as Sequence Modeling
A given piece of text has a set of ACs, which occur in a specific order in the text: (C 1 , ..., C n ). Therefore, at encoding timestep i, the model is fed a representation of C i . Since the representation is large and sparse (see Section 3.3 for details on how we represent ACs), we add a fully-connected layer before the LSTM input. Given a representation R i for AC C i , the LSTM input A i is calculated as: where W rep , b rep in turn become model parameters, and σ is the sigmoid function 1 . Similarly, the decoding network applies a fully-connected layer with sigmoid activation to its inputs, see Figure 3. At encoding step i, the encoding LSTM produces hidden layer e i , which can be thought of as a hidden representation of AC C i . In order to make sequence-to-sequence modeling applicable to the problem of link extraction, we explicitly set the number of decoding timesteps to be equal to the number of input components. Using notation from Equation 1, the decoding sequence length for an encoding sequence E is simply m(E) = |{C 1 , ..., C n }|, which is trivially equal to n. By constructing the decoding sequence in this manner, we can associate decoding timestep i with AC C i .
From Equation 4, decoding timestep i will output a distribution over input indices. The result of this distribution will indicate to which AC component C i links. Recall there is a possibility that an AC has no outgoing link, such as if it's the root of the tree. In this case, we state that if AC C i does not have an outgoing link, decoding step D i will output index i. Conversely, if D i outputs index j, such that j is not equal to i, this implies that C i has an outgoing link to C j . For the argument structure in Figure 1, the corresponding decoding sequence is (1, 1, 2, 2). The topology of this decoding sequence is illustrated in Figure 2. Observe how C 1 points to itself since it has no outgoing link.
Finally, we note that we have a Bidirectional LSTM (Graves and Schmidhuber, 2005) as the encoder, unlike the model proposed by Vinyals et al. (2015b). Thus, e i is the concatenation of forward and backward hidden states − → e i and ← − e n−i+1 , produced by two separate LSTMs. The decoder remains a standard forward LSTM.

Representing Argument Components
At each timestep of the encoder, the network takes in a representation of an AC. Each AC is itself  Figure 3: Architecture of the joint model applied to the example in Figure 1. Note that D1 points to itself to denote that it has not outgoing link and is therefore the head of a tree. a sequence of tokens, similar to the Question-Answering dataset from Weston et al. (2015). We follow the work of Stab and Gurevych (2016) and focus on three different types of features to represent our ACs: (1) Bag-of-Words of the AC; (2) Embedding representation based on GloVe embeddings (Pennington et al., 2014), which uses average, max, and min pooling across the token embeddings; (3) Structural features: Whether or not the AC is the first AC in a paragraph, and whether the AC is in an opening, body, or closing paragraph. See Section 6 for an ablation study of the proposed features.

Joint Neural Model
Up to this point, we focused on the task of extracting links between ACs. However, recent work has shown that joint models that simultaneously try to complete multiple aspects of the subtask pipeline outperform models that focus on a single subtask (Persing and Ng, 2016;Stab and Gurevych, 2014b;Peldszus and Stede, 2015). Therefore, we will modify the single-task architecture so that it would allow us to perform AC classification (Kwon et al., 2007;Rooney et al., 2012) together with link prediction. Knowledge of an individual subtask's predictions can aid in other subtasks. For example, claims do not have an outgoing link, so knowing the type of AC can aid in the link prediction task. This can be seen as a way of regularizing the hidden representations from the encoding component (Che et al., 2015). At each timestep, predicting AC type is a straightforward classification task: given AC C i , we need to predict whether it is a claim, premise, or possibly major claim. More generally, this is another sequence modeling problem: given input sequence E, we want to predict a sequence of argument types T . For encoding timestep i, the model creates hidden representation e i . This can be thought of as a representation of AC C i . Therefore, our joint model will simply pass this representation through a fully-connected layer as follows: where W cls , b cls become elements of the model parameters, Θ. The dimensionality of W cls , b cls is determined by the number of classes. Lastly, we use softmax to form a distribution over the possible classes. Consequently, the probability of predicting the component type at timestep i is defined as: Finally, combining this new prediction task with Equation 2, we arrive at the new training objective: which simply sums the costs of the individual prediction tasks, and the second summation is the cost for the new task of predicting AC type. α ∈ [0, 1] is a hyperparameter that specifies how we weight the two prediction tasks in our cost function. The architecture of the joint model, applied to our ongoing example, is illustrated in Figure 3.

Experimental Design
As we have mentioned, our work assumes that ACs have already been identified. The order of ACs corresponds directly to the order in which the ACs appear in the text. We test the effectiveness of our proposed model on a dataset of persuasive essays (PEC) (Stab and Gurevych, 2016), as well as a dataset of microtexts (MTC) (Peldszus, 2014). The feature space for the PEC has roughly 3,000 dimensions, and the MTC feature space has between 2,500 and 3,000 dimensions, depending on the data split. The PEC contains a total of 402 essays, with a frozen set of 80 essays held out for testing. There are three AC types in this corpus: major claim, claim, and premise. In this corpus, individual structures can be either trees or forests. Also, in this corpus, each essay has multiple paragraphs, and argument structure is only uncovered within a given paragraph. The MTC contains 112 short texts. Unlike the PEC, each text in this corpus is itself a complete example, as well as a single tree. Since the dataset is small, the authors have created 10 sets of 5-fold cross-validation, reporting the the average across all splits for final model evaluation. This corpus contains only two types of ACs: claim and premise. Note that link prediction is directed, i.e., predicting a link between the pair C i , C j (i = j) is different than C j , C i . We implement our models in TensorFlow (Abadi et al., 2015). We use the following parameters: hidden input dimension size 512, hidden layer size 256 for the bidirectional LSTMs, hidden layer size 512 for the LSTM decoder, α equal to 0.5, and dropout (Srivastava et al., 2014) of 0.9. We believe the need for such high dropout is due to the small amounts of training data (Zarrella and Marsh, 2016), particularly in the MTC. All models are trained with Adam optimizer (Kingma and Ba, 2014) with a batch size of 16. For a given training set, we randomly select 10% to become the validation set. Training occurs for 4,000 epochs. Once training is completed, we select the model with the highest validation accuracy (on the link prediction task) and evaluate it on the held-out test set. At test time, we take a greedy approach and select the index of the probability distribution (whether link or type prediction) with the highest value.

Results
The results of our experiments are presented in Tables 1 and 2. For each corpus, we present f1 scores for the AC type classification experiment, with a macro-averaged score of the individual class f1 scores. We also present the f1 scores for predicting the presence/absence of links between ACs, as well as the associated macro-average between these two values.
We implement and compare four types of neural models: 1) The previously described joint model from Section 3.4 (called Joint Model in the tables); 2) The same as 1), but without the fullyconnected input layers (called Joint Model No FC Input in the table); 3) The same as 1), but the model only predicts the link task, and is therefore not optimized for type prediction (called Single-Task Model in the table); 4) A non-sequence-tosequence model that uses the hidden layers produced by the BLSTM encoder with the same type of attention as the joint model (called Joint Model No Seq2Seq in the table). That is, d i in Equation 3 is replaced by e i .
In both corpora we compare against the following previously proposed models: Base Classifier (Stab and Gurevych, 2016) is a feature-rich, taskspecific (AC type or link extraction) SVM classifier. Neither of these classifiers enforce structural or global constraints. Conversely, the ILP Joint Model (Stab and Gurevych, 2016) provides constraints by sharing prediction information between the base classifiers. For example, the model attempts to enforce a tree structure among ACs within a given paragraph, as well as using incoming link predictions to better predict the type class claim. For the MTC only, we also have the following comparative models: Simple (Peldszus and Stede, 2015) is a feature-rich logistic regression classifier. Best EG (Peldszus and Stede, 2015) creates an Evidence Graph (EG) from the predictions of a set of base classifiers. The EG models the potential argument structure, and offers a global optimization objective that the base classifiers attempt to optimize by adjusting their individual weights. Lastly, MP+p (Peldszus and Stede, 2015) combines predictions from base classifiers with a Minimum Spanning Tree Parser (MSTParser).

Discussion
First, we point out that the joint model achieves state-of-the-art on 10 of the 13 metrics in Tables 1 and 2, including the highest results in all metrics on the PEC, as well as link prediction on the MTC. The performance on the MTC is very en-  As for the other neural models, the joint model with no seq2seq performs competitively with the ILP joint model on the PEC, but trails the performance of the joint model. We believe this is because the joint model is able to create two different representations for each AC, one each in the encoding/decoding state, which benefits performance in the two tasks. We also believe that the joint model benefits from a second recurrence over the ACs, modeling the tree/forest structure in a linear manner. Conversely, the joint model with no seq2seq must encode information relating to type as well as link prediction in a single hidden representation. On one hand, the joint model no seq2seq outperforms the ILP model on link prediction, yet it is not able to match the ILP joint model's performance on type prediction, primarily due to the poor performance on predicting the major claim class. Another interesting outcome is the importance of the fully-connected layer before the LSTM input. This extra layer seems to be crucial for improving performance on this task. The results dictate that even a simple fully-connected layer with sigmoid activation can provide a useful dimensionality reduction step. Finally, and arguably most importantly, the single-task model, only optimized for link prediction, suffers a large drop in performance, conveying that the dual optimization of the joint model is vital for high performance in the link prediction task. We believe this is because the joint optimization creates more expressive representations of the ACs, which capture the natural relation between AC type and AC linking. Table 3 shows the results of an ablation study for AC feature representation. Regarding link prediction, BOW features are clearly the most important, as their absence results in the highest drop in performance. Conversely, the presence of structural features provides the smallest boost in performance, as the model is still able to record state-   Table 4: Results of binning test data by length of AC sequence. * indicates that this bin does not contain any major claim labels, and this average only applies to claim and premise classes. However, we do not disable the model from predicting this class: the model was able to avoid predicting this class on its own.
of-the-art results compared to the ILP Joint Model. This shows that the Joint Model is able to capture structural cues through sequence modeling and semantics. When considering type prediction, both BOW and structural features are important, and it is the embedding features that provide the least benefit. The ablation results also provide an interesting insight into the effectiveness of different pooling strategies for using individual token embeddings to create a multi-word embedding. The popular method of averaging embeddings (which is used by Stab and Gurevych (2016) in their system) is in fact the worst method, although its performance is still competitive with the previous state-of-the-art. Conversely, max pooling results are on par with the joint model results in Table 1. Table 4 shows results on the PEC test set with the test examples binned by sequence length. First, it is not surprising to see that the model performs best when the sequences are the shortest (for link prediction; type prediction actually sees the worst performance in the middle bin). As the sequence length increases, the accuracy on link prediction drops. This is possibly due to the fact that as the length increases, a given AC has more possibilities as to which other AC it can link to, making the task more difficult. Conversely, there is actually a rise in no link prediction accuracy from the second to third row. This is likely due to the fact that since the model predicts at most one outgoing link, it indirectly predicts no link for the remaining ACs in the sequence. Since the chance probability is low for having a link between a given AC in a long sequence, the no link performance is actually better in longer sequences. The results of the length-based binning could also potentially give insight into the poor performance on the type prediction task in the MTC. Since the arguments in the MTC average 5 ACs, they would be in the second bin (row 2) of Table 4. The claim and premise f1 scores for this bin are similar to those from the same system's performance on the MTC.

Conclusion
In this paper we have proposed how to use a joint sequence-to-sequence model with attention (Vinyals et al., 2015b) to both extract links between ACs and classify AC type. We evaluate our models on two corpora: a corpus of persuasive essays (Stab and Gurevych, 2016), and a corpus of microtexts (Peldszus, 2014). The Joint Model records state-of-the-art results on the persuasive essay corpus, as well as achieving state-of-the-art results for link prediction on the microtext corpus. The results show that jointly modeling the two prediction tasks is critical for high performance. Future work can attempt to learn the AC representations themselves, such as in Kumar et al. (2015).
Lastly, future work can integrate subtasks 1 and 4 into the model. The representations produced by Equation 3 could potentially be used to predict link type, i.e. supporting or attacking (the fourth subtask in the pipeline). In addition, a segmenting technique, such as the one proposed by Weston et al. (2014), can accomplish subtask 1.