Towards Better Non-Tree Argument Mining: Proposition-Level Biaffine Parsing with Task-Specific Parameterization

State-of-the-art argument mining studies have advanced the techniques for predicting argument structures. However, the technology for capturing non-tree-structured arguments is still in its infancy. In this paper, we focus on non-tree argument mining with a neural model. We jointly predict proposition types and edges between propositions. Our proposed model incorporates (i) task-specific parameterization (TSP) that effectively encodes a sequence of propositions and (ii) a proposition-level biaffine attention (PLBA) that can predict a non-tree argument consisting of edges. Experimental results show that both TSP and PLBA boost edge prediction performance compared to baselines.


Introduction
Argument mining, a research area that focuses on predicting argumentation structures in a text, has been receiving much attention. To date, efforts in argument mining have largely been devoted to predicting tree arguments, in which a claim proposition is represented as a root and premise propositions are represented as leaves. For example, Stab and Gurevych (2017) introduced Argument Annotated Essays (hereafter, Essay), and researchers have attempted to predict tree arguments in the corpus (Eger et al., 2017; Potash et al., 2017; Kuribayashi et al., 2019).
However, these techniques lack the capability of dealing with more flexible arguments, such as reason edges where a proposition can have several parents. To this end, Park and Cardie (2018) provided a less restrictive argument mining dataset known as the Cornell eRulemaking Corpus (CDCP), which contains flexible edges (see VALUES (a), (b), and TESTIMONY (e) in Figure 1). Figure 2 shows the distribution of the outgoing edges (i.e., Support/Attack or REASON/EVIDENCE relations) from a node (proposition) in the Essay and CDCP corpora. Propositions in CDCP have sparse connections, making the majority of propositions isolated from the others. Besides, a proposition in Essay has at most one outgoing edge, while one in CDCP can have a variable number of edges (there are about 200 propositions with two or more outgoing edges). It is therefore important to work on these less restrictive arguments. Yet, they have not been deeply studied, except for a few works (Niculae et al., 2017; Galassi et al., 2018).
In this paper, we present a novel model for non-tree argument mining. Different from the previous studies of Niculae et al. (2017) and Galassi et al. (2018), we focus on an effective encoding of the propositions and a graph-based non-tree argument parsing technique. Given sentence or clause spans in an argument, our model jointly predicts proposition types for the spans, edges between the propositions, and edge labels, by employing the following two architectures:
- Task-Specific Parameterization (TSP) is an effective encoding step for the proposition sequence. On top of a shared encoder, we prepare two distinct attention-to-encoder layers to maintain task-specific representations: one for the proposition type, and the other for the edges (and their labels). TSP implements our expectation that edge- and proposition type-specific representations should be obtained separately, because proposition types and edges are less tightly bound in CDCP than in the tree-structured Essay, where each premise proposition always has exactly one outgoing edge.
- Proposition-Level Biaffine Attention (PLBA) is used to predict non-tree edges after the encoding step. Biaffine attention has recently been used for syntactic or semantic token-to-token dependency parsing (Dozat and Manning, 2017, 2018; Zhang et al., 2019; Li et al., 2019a,b). We extend biaffine attention to predict proposition-to-proposition dependencies.
Experimental results on CDCP show that our proposed model improves performance. Analyses also show that task-specific information can be captured by TSP.

Dataset
We use CDCP (Park and Cardie, 2018; Niculae et al., 2017), which contains 731 arguments. The corpus provides five types of propositions (32 REFERENCE, 746 FACT, 1026 TESTIMONY, 2160 VALUE and 815 POLICY) and two types of argumentative edges (1307 REASON and 46 EVIDENCE). For example, a FACT poses a truth value that can be verified with objective evidence: That process usually takes as much as two years or more. CDCP also provides directed edges between propositions along with their edge labels. A proposition i is a REASON for a proposition j if i provides rationale for j, and is EVIDENCE if it proves whether j is true or not.

Task Formalization
Input: We assume a text consisting of N tokens and M proposition spans is given. We denote the i-th proposition span as (START(i), END(i)), where START(i) and END(i) are the starting and ending token indices, respectively. Thus, 1 ≤ START(i) ≤ END(i) ≤ N.

Output: For each given span i, we predict its proposition type, outgoing edges, and edge labels (i.e., REASON and EVIDENCE), where the resulting graph does not necessarily form a tree.
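As a concrete illustration, the input/output of the task can be sketched as follows (the token text, span indices, and variable names here are our own illustrative assumptions, not taken from the corpus):

```python
# Input: a token list plus proposition spans (1-based, inclusive indices).
tokens = ["That", "process", "usually", "takes", "two", "years", ".",
          "So", "the", "rule", "should", "change", "."]
spans = [(1, 7), (8, 13)]          # (START(i), END(i)) for each proposition i

# Output: a type per span, plus labeled directed edges between spans.
prop_types = ["FACT", "POLICY"]    # one of REFERENCE/FACT/TESTIMONY/VALUE/POLICY
edges = {(0, 1): "REASON"}         # span 0 provides a rationale for span 1

# Basic well-formedness checks from the formalization above:
N = len(tokens)
for start, end in spans:
    assert 1 <= start <= end <= N
```

Note that `edges` is a plain dictionary over ordered span pairs: nothing restricts a span to a single parent, which is precisely the non-tree setting.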

Approach
An overview of our proposed model is shown in Figure 3 (right). We encode propositions by TSP, and use PLBA to obtain non-tree arguments.
We use w_t to denote the concatenation of the t-th set of word features, each set consisting of a surface form, a part-of-speech tag, a GloVe vector (Pennington et al., 2014), and an optional ELMo vector (Peters et al., 2018). The input tokens are fed into a bidirectional LSTM:

h_1, ..., h_N = BiLSTM(w_1, ..., w_N).
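The feature concatenation for w_t can be sketched as follows (the dimensions are illustrative assumptions, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 13                                              # number of tokens in the text
d_surf, d_pos, d_glove, d_elmo = 50, 16, 300, 128   # illustrative sizes

# One feature matrix per feature type; row t corresponds to token t.
surface = rng.normal(size=(N, d_surf))
pos     = rng.normal(size=(N, d_pos))
glove   = rng.normal(size=(N, d_glove))
elmo    = rng.normal(size=(N, d_elmo))              # optional ELMo features

# w_t is the concatenation of the t-th row of each feature matrix.
w = np.concatenate([surface, pos, glove, elmo], axis=-1)
```

The resulting matrix `w` (one row per token) is what the BiLSTM consumes.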

TSP: Task-Specific Parameterization
We provide task-specific encoding layers, one for proposition types and the other for edges (and their labels), on top of the BiLSTM. We expect the lower layers to extract task-universal representations and the upper layers to extract more task-specific representations (Liu et al., 2019; Ethayarajh, 2019). First, to be aware of informative tokens such as discourse markers, we obtain task-aware span representations for each task τ ∈ {type, edge}:

s_τ,i,t = softmax_t( v_τ⊤ tanh(W_τ h_t + b_τ) ),    h^span-att_τ,i = Σ_{t=START(i)}^{END(i)} s_τ,i,t h_t,

where v_τ, W_τ and b_τ are task-specific parameters. We note that h^span-att_τ,i ∈ {h^span-att_type,i, h^span-att_edge,i}. Then, each type- and edge-specific proposition span is represented as:

x_τ,i = h_START(i) ⊕ h_END(i) ⊕ h^span-att_τ,i ⊕ φ(i),

where ⊕ is the concatenation operation and φ(i) is a span length feature. The span representations are then fed into new BiLSTMs to encode task-specific proposition sequences:

s_τ,1, ..., s_τ,M = BiLSTM_τ(x_τ,1, ..., x_τ,M).
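The task-aware span pooling can be sketched in NumPy as follows (a minimal mock-up; the function names, sizes, and random parameters are our illustrative assumptions, with `v`, `W`, `b` standing in for v_τ, W_τ, b_τ):

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def span_attention(H, start, end, v, W, b):
    """Task-aware pooling: attend over the tokens of one proposition span.

    H: (N, d) BiLSTM states; start/end: 1-based inclusive span indices;
    v, W, b: the attention parameters of one task.
    """
    Hs = H[start - 1:end]                    # tokens of span i
    scores = np.tanh(Hs @ W.T + b) @ v       # one score per token
    a = softmax(scores)                      # attention distribution s_{tau,i,t}
    return a @ Hs                            # weighted sum: h^span-att_{tau,i}

rng = np.random.default_rng(0)
d, d_att = 8, 6
H = rng.normal(size=(13, d))
# Two parameter sets -> two task-specific views of the same span.
params = {tau: (rng.normal(size=d_att),
                rng.normal(size=(d_att, d)),
                rng.normal(size=d_att)) for tau in ("type", "edge")}
reps = {tau: span_attention(H, 1, 7, *p) for tau, p in params.items()}
```

Because the parameters are duplicated per task, the same span yields two different pooled vectors, which is the core of TSP.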

PLBA: Proposition-Level Biaffine Attention
To predict non-tree edges between propositions, we use biaffine attention (Dozat and Manning, 2018), which computes scores for all proposition pairs by the following operation:

biaff(x_1, x_2) = x_1⊤ U x_2,

where U (a stack of matrices U_k when scoring k classes) is a parameter. We apply multi-layer perceptrons (MLPs) and a biaffine operation to a pair of edge-specific representations (s_edge,i, s_edge,j) to obtain the probability of a directed edge from the i-th span to the j-th span:

edge_i,j = σ( biaff( MLP^edge-head(s_edge,i), MLP^edge-dep(s_edge,j) ) ),

and the label for the edge (i, j) is calculated as:

label_i,j = softmax( biaff( MLP^label-head(s_edge,i), MLP^label-dep(s_edge,j) ) ).

We train edges and labels by summing their losses, backpropagating gradients for the labels only through gold edges. At inference, the predicted labels are masked by the predicted edges: êdge_i,j ⊗ label_i,j.
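The pairwise edge scoring can be sketched as follows (a hedged NumPy mock-up: single nonlinear layers stand in for the MLPs, and all sizes are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def biaffine(X_head, X_dep, U):
    """Score every ordered proposition pair (i, j): X_head[i]^T U X_dep[j]."""
    return X_head @ U @ X_dep.T              # (M, M) score matrix

rng = np.random.default_rng(0)
M, d = 5, 16                                 # 5 propositions, illustrative sizes
S_edge = rng.normal(size=(M, d))             # edge-specific span reps s_{edge,i}

# Separate projections for the head and dependent roles of each proposition.
W_head, W_dep = rng.normal(size=(d, d)), rng.normal(size=(d, d))
U = rng.normal(size=(d, d))

edge_prob = sigmoid(biaffine(np.tanh(S_edge @ W_head), np.tanh(S_edge @ W_dep), U))
edge_pred = edge_prob > 0.5                  # independent per-pair decisions
```

Because each pair is scored independently with a sigmoid (rather than a softmax over candidate heads), a proposition may receive any number of incoming or outgoing edges, which is exactly what permits non-tree structures.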

Joint Learning with Proposition Type
We classify the proposition type for span i with the type-specific representation:

type_i = softmax( MLP^type(s_type,i) ).

Finally, we minimize the joint objective of the edge, label, and type losses:

L = L_edge + λ_label L_label + λ_type L_type,

where the λ are hyperparameters to adjust the training.
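The joint training signal can be sketched as follows (the λ weights, the sum reduction, and the masking detail are our assumptions, consistent with the description above of backpropagating label losses only through gold edges):

```python
import numpy as np

def joint_loss(edge_nll, label_nll, type_nll, gold_edges,
               lam_label=1.0, lam_type=1.0):
    """edge_nll, label_nll: (M, M) per-pair losses; type_nll: (M,) per-span
    losses; gold_edges: (M, M) 0/1 mask of annotated edges."""
    L_edge = edge_nll.sum()
    L_label = (label_nll * gold_edges).sum()  # gradients only where a gold edge exists
    L_type = type_nll.sum()
    return L_edge + lam_label * L_label + lam_type * L_type

gold = np.zeros((3, 3))
gold[0, 1] = 1.0                              # one gold edge: 0 -> 1
loss = joint_loss(np.ones((3, 3)), np.full((3, 3), 2.0), np.ones(3), gold)
# L_edge = 9, L_label = 2 (one gold edge), L_type = 3  ->  total 14
```

The mask is what implements "backpropagating gradients for the labels only through gold edges": label losses on non-edges are zeroed before the sum.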

Experiments
Following Niculae et al. (2017), we evaluate on the test set of CDCP, which contains 973 propositions and 272 edges. F1 scores for the proposition type prediction and the edge prediction, along with their average, are used for evaluation. For the edge labels, we only report the classification of EVIDENCE rather than macro-averaged scores because the labels are highly imbalanced. We calculate label scores on gold edges.

Baselines
To the best of our knowledge, two existing studies are comparable in our task settings. The first set of baselines are factor-based models (SVM basic/full/strict and RNN basic/full/strict; Niculae et al., 2017). The other set of baselines are neural residual models (deep basic PG/LG and deep residual PG/LG; Galassi et al., 2018), which are the state-of-the-art models in terms of edge classification. We also provide a non-TSP model for comparison, in which a shared representation is used for both tasks so that s_type,i = s_edge,i. This change also requires us to modify the pre-biaffine MLPs and the proposition type classifier (see Appendix for more details).

Implementation
GloVe (Pennington et al., 2014) and ELMo (Peters et al., 2018) were used as input embeddings. The hyperparameters were tuned with Optuna (Akiba et al., 2019) without using ELMo and TSP for fair comparison (see Appendix for more details). Each model was trained for 100 epochs with Adam (Kingma and Ba, 2015), and we selected a model that exhibited the highest average development F1 scores amongst all the classifiers.

Results
We ran the experiment 30 times with different random seeds. Table 1 shows the average scores; our models outperform all the baselines. Figure 4 shows the ablation studies. The non-ELMo model already outperforms the state-of-the-art baseline on the edge prediction task, showing that PLBA is effective. Besides, ELMo boosted the type classification. Figure 4a shows that the edge scores of the non-multi-task model are significantly lower, while Figure 4b shows that its type scores are barely affected. This implies that the edge task utilizes type information in the lower layers, whereas the type task is less dependent on edges. Moreover, the edge scores of the non-TSP model are worse, indicating that TSP is effective for obtaining stable performance. The result implies that TSP acquires edge-specific representations independently of the types.

What Does TSP Learn?
To further analyze TSP, we investigated the task-specific token attention s_τ,i,t. Figure 5 shows the attention distributions by kernel density estimation for a number of selected tokens. The figure shows that not only discourse markers (i.e., because, but and so) but also rhetorical or subjective claims (i.e., why and disagree) received attention in edge prediction. We found in the corpus that propositions containing disagree and why are likely to be a top (claim) node. This suggests that such subjective statements can be used for predicting the top nodes.
For proposition types, a number of first-person pronouns such as I were useful. We attribute this result to the TESTIMONY propositions which express personal experiences, e.g., but I never received any notice from my original mortgage lender that my mortgage was sold.

Related Work
Researchers in argument mining have been utilizing Essay (Stab and Gurevych, 2014), a tree argument corpus. For example, Persing and Ng (2016) employed integer linear programming. Eger et al. (2017) investigated argument mining as a dependency parsing problem with neural models. Potash et al. (2017) developed a pointer network architecture to predict edges. However, we cannot simply apply these methods to non-tree arguments because they were built upon the assumption that an argument forms a tree structure.
Non-tree arguments have received relatively little attention. Niculae et al. (2017) attempted to resolve the problem with a factor-based model. Galassi et al. (2018) proposed a deep learning-based model that utilizes residual connections to predict proposition pair relations. Our study is primarily inspired by the semantic dependency parsing of Dozat and Manning (2018), and we predict the whole graph jointly.

Conclusion
This paper focused on non-tree argument mining. We provided an approach to effectively encode a proposition sequence and to predict non-tree edges. Experimental results showed that our proposed model outperforms baselines. This paper demonstrated that we could successfully analyze more flexible structures in arguments. For future work, we aim to develop a universal model to handle both tree and non-tree arguments.

A.1 Input Representation
The optional ELMo vector is a weighted sum of the ELMo layers:

ELMo_START(i):END(i) = Σ_{k=0}^{N_ELMo} s_k ELMo^k_START(i):END(i),

where ELMo^k_START(i):END(i) (0 < k ≤ N_ELMo) is the hidden state of the k-th ELMo layer obtained over the START(i) to END(i) tokens, ELMo^0_START(i):END(i) are the features from the character-level CNN in ELMo, and the s_k are trainable parameters. The ELMo parameters themselves are kept fixed by truncating backpropagation.
The surface form and the POS tag of a token are each embedded into a vector, and a multi-layer perceptron (MLP) is applied to each. All features are then concatenated to form the input token representation:

w_t = MLP^surface(e^surface_t) ⊕ MLP^pos(e^pos_t) ⊕ GloVe_t.

Optionally, we can concatenate ELMo:

w_t ← w_t ⊕ ELMo_t.
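The weighted layer mixing described in this appendix can be sketched as follows (a minimal NumPy mock-up; whether the s_k are normalized is not specified here, so using the raw weights is our assumption, and all sizes are illustrative):

```python
import numpy as np

def elmo_mix(elmo_layers, s):
    """Weighted sum of ELMo layer states (layer 0 = character-CNN features).

    s plays the role of the trainable s_k weights; the ELMo layers themselves
    are fixed, so gradients would flow into s only.
    """
    return np.tensordot(s, elmo_layers, axes=1)   # contract over the layer axis

rng = np.random.default_rng(0)
layers = rng.normal(size=(3, 7, 32))              # (N_ELMo + 1 layers, tokens, dim)
mixed = elmo_mix(layers, np.full(3, 1 / 3))       # uniform weights -> layer average
```

With uniform weights the mixture reduces to a plain average of the layers; training would instead learn which layers matter for this task.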

A.2 Non-TSP Model
For the non-TSP model in the experiments, we provide a shared representation for both types and edges, s_type&edge,i, which replaces s_type,i and s_edge,i. Accordingly, the pre-biaffine MLPs take the shared representations, e.g., MLP^label-dep(s_type&edge,j), and the proposition type classifier becomes:

type_i = softmax( MLP^type(s_type&edge,i) ).

A.3 Hyperparameter Tuning
We tuned the hyperparameters over a reduced search space chosen based on our preliminary experiments. See Table 2 for the hyperparameter search space and the list of hyperparameters chosen by the Optuna framework (Akiba et al., 2019). We tried 20 hyperparameter sets. As can be seen from the table, a high dropout rate is effective; we estimate this is because it prevents overfitting. We also found that stacking more BiLSTM layers in TSP can improve performance, implying that the semantics can be captured in the upper layers.