Lexicon Guided Attentive Neural Network Model for Argument Mining

Identification of argumentative components is an important stage of argument mining. Lexicon information is reported as one of the most frequently used features in argument mining research. In this paper, we propose a methodology to integrate lexicon information into a neural network model through an attention mechanism. We conduct experiments on the UKP dataset, which is collected from heterogeneous sources and contains several text types, e.g., microblog, Wikipedia, and news. We explore lexicons from various application scenarios such as sentiment analysis and emotion detection. We also compare the experimental results of leveraging different lexicons.


Introduction
Argument Mining (AM) is an emerging research area that has drawn increasing attention since around 2010. Recently, Project Debater from IBM has shown that an AI system supported by argument mining techniques can argue well. The task of AM can be divided into a few stages: (1) extracting argumentative components from large texts, i.e., boundary detection or segmentation; (2) classifying the extracted components into classes. In general, an argumentative component can be categorized as a "Claim", which usually contains conclusions and a stance toward the given topic, or a "Premise", which contains reasoning or evidence used to support or attack a claim; (3) predicting the relations between the identified argumentative components, i.e., supporting and attacking (Cabrio and Villata, 2018). Some works also consider more complex relations, such as relations that recursively support or attack other relations rather than only linking components (Peldszus and Stede, 2013).
Argument detection and classification can improve legal reasoning (Moens et al., 2007), policy formulation (Florou et al., 2013), and persuasive writing (Stab and Gurevych, 2014). In this paper, we focus on mining argumentative components from a large collection of documents and further classifying them into roles of support/opposition. Our model is based on the recurrent neural network (RNN), which has been widely used in natural language processing tasks (Cho et al., 2014). With the help of the attention mechanism (Bahdanau et al., 2015), an RNN can further attend to the key information.
We propose a novel attention mechanism that is guided by argumentative lexicon information. Lexicon information is reported as one of the most frequently used kinds of features in argument mining (Cabrio and Villata, 2018). Previous works on AM have tried to integrate lexical features into the learning models (Levy et al., 2017; Nguyen and Litman, 2015; Rinott et al., 2015). These lexicons are mostly compiled by hand or derived by hand-crafted rules, which makes them domain-specific. That is, they may fail when applied to other domains. In contrast to the scarcity of general lexicons for AM, lexical resources are abundant in other fields like sentiment analysis, opinion mining, and emotion detection (Hu and Liu, 2004; Mohammad and Turney, 2013; Kiritchenko and Mohammad, 2016). As a more general domain, AM may benefit not only from in-domain lexicons, but also from out-of-domain lexicons.
The contribution of this work is two-fold: (1) We propose an attention mechanism to leverage lexicon information.
(2) In the face of the scarcity of argument lexicon, we explore several different types of lexicons to verify whether outside resources are useful for AM tasks.
The rest of this paper is organized as follows. Section 2 summarizes related works on AM. The dataset and linguistic resources used for the experiments are described in Section 3. We introduce our model in Section 4 and show the experimental results in Section 5. We also look into the errors made by our best model in Section 6. Section 7 discusses the experimental results and concludes this work.

Related Works
Neural networks have been used in a variety of AM tasks. To improve the vanilla LSTM model, Stab et al. (2018a) use an attention mechanism to fuse topic and sentence information together. Laha and Raykar (2016) present several bi-sequence classification models on different datasets. However, rather than using a more sophisticated architecture such as attention, they consider only different concatenation or conditioning methods on the output of the LSTM. Eger et al. (2017) propose an end-to-end training model for mining argument structure and identifying argument components.
Besides syntactic and positional information, lexical information is also reported as one of the most used features in the argument mining task (Cabrio and Villata, 2018). In some similar research fields such as sentiment analysis and emotion mining, a number of works have been proposed to combine lexical information with NN models. Teng et al. (2016) use lexical scores as weights and compute a weighted sum over the outputs of the LSTM model in order to derive the sentence scores. Zou et al. (2018) determine attention weights using lexicon labels, which leads the model to focus on the lexicon words. Bar-Haim et al. (2017) propose expanding lexicons to improve the stance classification task.
However, in AM, few works directly combine lexicons with models. Using discourse features, Levy et al. (2018) generate weak labels and apply weak supervision. Shnarch et al. (2018) also present a methodology to blend such weakly labeled data with high-quality but scarce labeled data for AM. Al-Khatib et al. (2016) consider the distant supervision method. Most of these works use the end-to-end training paradigm and use the outside resources only for generating weak labels, which may not fully leverage the information in the lexicons.

Resources
In this section, we introduce the dataset used to evaluate the performance of our proposed model. We also briefly describe each lexicon and show how the data preprocessing is performed.

Data
We conduct the experiments on the dataset released by Stab et al. (2018b). The dataset includes 25,492 sentences over eight topics that are randomly selected from an online list of controversial topics. The selected topics, which are treated as queries, are used to retrieve documents from heterogeneous sources via the Google search engine. Among these sentences, 4,944 are supporting arguments, 6,195 are opposing arguments, and 14,353 are non-argument sentences. This dataset is commonly used for the sentential argument identification task. Levy et al. (2018) collect a dataset with around 1.5 million sentences over 150 topics from Wikipedia. However, only 2,500 of them are labeled, which may not be sufficient for training a model, especially a neural network model.
The definition of argumentative components differs from dataset to dataset. In the dataset used in this work, an argumentative component is a span of text with reasoning or evidence, which is able to either support or oppose a topic (Stab et al., 2018b).

Lexicon resource
To improve the baseline model, we consider several existing lexicons across different domains. We first explore the claim lexicon that is built for the argument mining task (Levy et al., 2017). We also include the lexicon resources often used in sentiment analysis (Hu and Liu, 2004) and emotion detection (Mohammad and Turney, 2013). We postulate that the resources for these application scenarios may have the potential for argument mining. We further develop a model based on the general-purpose lexicon, WordNet (Miller, 1995).
These resources are applied in different ways. We use the claim lexicon (Levy et al., 2017), the sentiment lexicon (Hu and Liu, 2004), and the emotion lexicon (Mohammad and Turney, 2013) to extract critical words from the i-th input sentence $C_i$, forming a sentence $A_i$. In contrast, we consult WordNet (Miller, 1995) to expand the short topic $T_i$ into the corresponding $A_i$.
Claim Lexicon. The claim lexicon is a lexicon containing words with argumentative characteristics. Levy et al. (2017) use the appearance of the term "that" as a weak signal of sentences containing argumentative components. After collecting nearly 1.86M sentences, they compute the prior probability $P(\text{that})$ of the term "that" occurring in a sentence, and the probabilities $P(\text{that} \mid w_i)$, where $P(\text{that} \mid w_i)$ denotes the probability that a sentence contains the term "that" with the word $w_i$ in the suffix after the main concept (i.e., the target entity in a controversial topic), given that the sentence contains $w_i$. Those words with $P(\text{that} \mid w_i) > P(\text{that})$ are included in their proposed claim lexicon, resulting in a lexicon with around 600 claim words. The lexicon was then used for designing sentence pattern rules called claim sentence queries (CSQ). They believe the claim lexicon can help detect sentences containing arguments.
Sentiment Lexicon. Hu and Liu (2004) built a sentiment lexicon that contains around 6,800 words. Each word is labeled as negative and/or positive. We construct an additional sentence A by extracting the words that are in both the sentiment lexicon and the input sentence C, regardless of whether they are positive or negative.
Emotion Lexicon. The emotion lexicon built by Mohammad and Turney (2013) contains around 14,200 words. Each word in the lexicon is given eight emotion labels, one for each of the eight emotions: anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. The labels are binary: the label of a word $w_i$ for an emotion $e_j$ is 1 if $w_i$ is associated with $e_j$ and 0 otherwise, where $w_i$ is a word and $e_j$ is one of the eight emotions. In the experiment, we select the words that have at least one emotion labeled as 1, resulting in a list of 6,468 words. We then use this list to create an additional sentence A from the input sentence C.
WordNet. To expand a topic T composed of words $w^T_1, w^T_2, \ldots, w^T_K$, we expand each of its words. For each word $w^T_i$, we use WordNet (Miller, 1995) to find its corresponding synonyms. We then put the found synonyms together, forming an additional sentence A, an expanded version of topic T.
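The selection rule for claim words, i.e., keeping a word when the probability of "that" given the word exceeds the prior probability of "that", can be sketched as follows. This is only a minimal illustration under simplifying assumptions, not the procedure of Levy et al. (2017): every token of a sentence is treated as a candidate word, the restriction to the suffix after the main concept is ignored, and the function name `build_claim_lexicon` and the toy corpus are hypothetical.

```python
from collections import Counter


def build_claim_lexicon(sentences):
    """Return the words w with P(that | w) > P(that).

    Simplifying assumption: every token is a candidate word; the
    restriction to the suffix after the main concept is ignored.
    """
    tokenized = [s.lower().split() for s in sentences]
    has_that = ["that" in tokens for tokens in tokenized]
    p_that = sum(has_that) / len(tokenized)  # prior P(that)

    contains = Counter()       # sentences containing w
    contains_that = Counter()  # sentences containing both w and "that"
    for tokens, flag in zip(tokenized, has_that):
        for w in set(tokens) - {"that"}:
            contains[w] += 1
            if flag:
                contains_that[w] += 1

    return {w for w in contains if contains_that[w] / contains[w] > p_that}


# Toy corpus (hypothetical), for illustration only.
corpus = [
    "critics argue that cloning is unsafe",
    "supporters argue that cloning saves lives",
    "cloning was first reported in 1996",
    "the report was published in 1996",
]
claim_lexicon = build_claim_lexicon(corpus)
```

On this toy corpus, "argue" is kept because it only occurs in sentences containing "that", whereas "was" is discarded because it never co-occurs with "that".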
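The construction of the additional sentence A from a lexicon can be sketched as a simple filter over the tokens of C. The sketch below assumes whitespace tokenization and lowercasing, neither of which is specified in the text; the function name `build_additional_sentence` and the four-word lexicon are hypothetical stand-ins, and the default length limit of 20 is the truncation length reported in the Experiments section.

```python
def build_additional_sentence(sentence, lexicon, max_len=20):
    """Keep the tokens of `sentence` that occur in `lexicon`, in order.

    Assumptions: whitespace tokenization and lowercasing; polarity
    labels are ignored, as only lexicon membership matters here.
    """
    tokens = sentence.lower().split()
    return [t for t in tokens if t in lexicon][:max_len]


# Toy stand-in for the sentiment lexicon (hypothetical entries).
sentiment_lexicon = {"cruel", "safe", "wrong", "benefit"}

C = "The death penalty is cruel and simply wrong"
A = build_additional_sentence(C, sentiment_lexicon)
```

The same filter applies unchanged to any lexicon that can be represented as a set of words.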
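Selecting the words that carry at least one emotion label can be sketched as below. The in-memory representation (a dictionary mapping each word to its binary emotion labels), the function name `emotion_word_list`, and the toy entries are assumptions made for illustration; they do not reflect the distribution format of the lexicon.

```python
EMOTIONS = ("anger", "anticipation", "disgust", "fear",
            "joy", "sadness", "surprise", "trust")


def emotion_word_list(lexicon):
    """Select the words that have at least one emotion labeled as 1.

    `lexicon` maps a word to a dict of emotion -> 0/1 labels
    (a hypothetical in-memory format); missing emotions count as 0.
    """
    return {
        word
        for word, labels in lexicon.items()
        if any(labels.get(emotion, 0) == 1 for emotion in EMOTIONS)
    }


# Toy entries (hypothetical values), for illustration only.
toy_lexicon = {
    "abandon": {"fear": 1, "sadness": 1},
    "cheer": {"joy": 1},
    "table": {},
}
selected = emotion_word_list(toy_lexicon)
```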
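The topic expansion can be sketched with a pluggable synonym lookup. To keep the example self-contained, a toy dictionary stands in for WordNet; with NLTK, a lookup could instead be built from `wordnet.synsets(word)` and the `lemma_names()` of each synset. Keeping the original topic word in the expanded sentence, removing duplicates, and the name `expand_topic` are assumptions of this sketch.

```python
def expand_topic(topic, synonyms_of):
    """Expand each topic word with its synonyms, in first-seen order.

    `synonyms_of` is any callable mapping a word to a list of synonyms.
    Assumption: the original word is kept and duplicates are dropped.
    """
    expanded = []
    for word in topic.lower().split():
        for candidate in [word, *synonyms_of(word)]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


# Toy synonym table standing in for WordNet (hypothetical entries).
toy_synonyms = {
    "death": ["decease", "expiry"],
    "penalty": ["punishment"],
}
A = expand_topic("death penalty", lambda w: toy_synonyms.get(w, []))
```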

Model
This section describes the development of the baseline model and the proposed model. To identify sentence-level argumentative components, the model is given a sentence C, which contains a sequence of words $w^c_1, w^c_2, \ldots, w^c_N$, and a topic T with words $w^T_1, w^T_2, \ldots, w^T_K$. The input word sequence is then encoded as a sequence of word embeddings via the GloVe word vectors. The pre-trained word vectors with a dimension of 100 released by Pennington et al. (2014) are used. Based on the given input, the model makes a prediction $\hat{y}$ for the given sentence, i.e., classifying it as a supporting argument, an opposing argument, or a non-argument. For comparison, we implement a baseline model with the vanilla bidirectional LSTM (BiLSTM).
In order to exploit linguistic knowledge, Lei et al. (2018) highlight the sentiment words of the input sentence and use them to compute the attention weight for each word. By integrating the sentiment lexicon into the neural network model, their work successfully improves the performance in sentiment analysis. This work proposes a model that integrates an outside lexicon resource into the attention mechanism (Vaswani et al., 2017). For each input sentence, we compose an additional sentence A, which contains words $w^a_1, w^a_2, \ldots, w^a_M$ based on the given lexicon. The additional sentence A is then forwarded to the embedding layer together with the input sentence C. The output of the embedding layer is the sequences $e^c_1, e^c_2, \ldots, e^c_N$ and $e^a_1, e^a_2, \ldots, e^a_M$, representing the embedded sentences C and A, respectively. Then, $e^c_i$ is fed into the BiLSTM and results in $h^c_i$ at the corresponding time step. As for A, we add an RNN to collect its information and take the output $h^a_M$ at the last time step as its representation. Though Lei et al. (2018) use an LSTM to encode the sentimental sentences, we do not follow their approach. In our work, the simple RNN outperforms the LSTM in the preliminary experiments.
The attention weight of the i-th word of the input sentence, $\alpha_i$, is determined by the concatenation $[h^c_i; h^a_M]$ of the i-th hidden state $h^c_i$ output by the BiLSTM and the output state $h^a_M$ of the RNN, which is given the additional sentence A as the input. A scoring function $\sigma(\cdot)$ with trainable parameters $W_c$ is applied to this concatenation to obtain $\alpha_i$. All the weighted hidden states are then summed up and connected to a fully connected layer for the final prediction. Figure 1 illustrates the architecture of our model.
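One possible instantiation of this attention step is sketched below in plain Python. The exact form of the scoring function is not recoverable from the text, so the sketch makes two assumptions that should not be read as the definitive design: the score is a linear map of the concatenated states, and the scores are normalized with a softmax. The final fully connected layer is omitted, and the function name and all numeric values are hypothetical.

```python
import math


def lexicon_guided_attention(h_c, h_a, w_c):
    """Attend over BiLSTM states `h_c`, guided by the lexicon vector `h_a`.

    Assumptions (not confirmed by the text): the score of word i is the
    dot product of `w_c` with the concatenation [h_c[i]; h_a], and the
    attention weights are a softmax over these scores.
    """
    # List concatenation h_i + h_a plays the role of [h_c[i]; h_a].
    scores = [sum(w * x for w, x in zip(w_c, h_i + h_a)) for h_i in h_c]

    # Numerically stable softmax over the word positions.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]

    # Weighted sum of the hidden states.
    dim = len(h_c[0])
    context = [sum(a * h_i[d] for a, h_i in zip(alphas, h_c))
               for d in range(dim)]
    return alphas, context


# Toy values (hypothetical): three words, 2-dimensional states.
h_c = [[0.1, 0.3], [0.9, -0.2], [0.4, 0.4]]
h_a = [0.5, -0.5]
w_c = [1.0, 0.5, 0.2, -0.1]
alphas, context = lexicon_guided_attention(h_c, h_a, w_c)
```

Because the same lexicon vector is concatenated to every hidden state, it shifts all scores through the parameters that multiply it; a non-linear scoring function would let it interact with the individual words more strongly than this linear sketch does.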

Experiments
Because most input sentences are shorter than 60 words and most additional sentences A are shorter than 20 words, we truncate them to lengths of 60 and 20, respectively. The dataset has 25,492 sentences in total. We conduct 5-fold cross validation for evaluating our model.
To evaluate our approaches, we report the average macro $F_1$ in the ternary setting, the precision and recall of predicting supporting arguments ($P_{arg+}$, $R_{arg+}$), and the precision and recall of predicting opposing arguments ($P_{arg-}$, $R_{arg-}$). We run a paired t-test for each proposed model in comparison with the baseline model, and mark the models with a statistically significant difference (i.e., p-value < 0.05) with an asterisk. As shown in Table 1, the proposed models benefit from the information in the adopted lexicons, improving the performance of argumentative component identification. The best model, which uses WordNet to expand topic T, outperforms the baseline model by 4.5 percentage points in $F_1$. The proposed model with the lowest $F_1$ score (i.e., ClaimLex) still outperforms the baseline by 3.4 percentage points. Furthermore, the best performance reported by Stab et al. (2018b) on the same dataset is 0.4285 in macro $F_1$, which is the result of incorporating only topic information into their models. This shows the impact of the lexicon information.
However, we can also observe that the result of integrating the claim lexicon (Levy et al., 2017) falls short of our expectation, even though it is a resource built for argument mining. We identify three possible reasons. Firstly, the lexicon is built on a strong assumption, i.e., that the presence of the term "that" indicates a high probability of the occurrence of argumentative components. Secondly, the lexicon has only 586 words, which covers a very small part of the whole vocabulary. Thirdly, the lexicon, built from sentences across 100 different topics, contains a number of domain-specific words such as "LGBTQ" and "militarily". Highlighting these domain-specific words may introduce noise when the topic is unrelated to them.
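The preprocessing and evaluation protocol can be sketched as follows. The text does not state how the folds are formed, so shuffling the sentence indices at random with a fixed seed is an assumption of this sketch, as are the function names.

```python
import random


def truncate(tokens, max_len):
    """Cut a token sequence to at most `max_len` tokens."""
    return tokens[:max_len]


def k_fold_splits(n_items, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross validation.

    Assumption: folds are formed by random shuffling; the text does
    not specify how the sentences are assigned to folds.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


# 25,492 sentences in total, as reported for the dataset.
splits = list(k_fold_splits(25492))
sentence = truncate(["word"] * 75, 60)  # input sentences: at most 60
extra = truncate(["word"] * 8, 20)      # additional sentences: at most 20
```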
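For reference, the macro $F_1$ in the ternary setting can be computed as sketched below. The sketch assumes the common definition of macro $F_1$ as the unweighted mean of the per-class $F_1$ scores; the label names and the toy predictions are hypothetical.

```python
def macro_f1(gold, pred, labels):
    """Unweighted mean of per-class F1 scores (assumed definition)."""
    scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


# Hypothetical label names for supporting, opposing, and non-argument.
LABELS = ("arg+", "arg-", "none")
gold = ["arg+", "arg-", "none", "none", "arg+", "arg-"]
pred = ["arg+", "none", "none", "none", "arg-", "arg-"]
score = macro_f1(gold, pred, LABELS)
```

In this toy example the per-class scores are 2/3 for supporting arguments, 0.5 for opposing arguments, and 0.8 for non-arguments, so the macro average is about 0.656.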

Error Analysis
To better understand what kinds of sentences mislead our model into making wrong predictions, we randomly sample erroneous predictions from our best model, i.e., the lexicon-guided attentive neural network model with WordNet. After looking into these errors, we find that the causes of a wrong prediction can be roughly categorized into the following cases. Some illustrations of the errors are listed in Table 2: (1) Sentences that contain ambiguous words or state an open question can easily lead our model to misclassify non-argumentative sentences as argumentative, or to confuse one stance, i.e., supporting or attacking, with the other. For example, both "imposing" and "abolishing" appear in S1, which may cause the model to fail to detect the stance correctly. S2 states an open question on the influence of the death penalty, but the model mistakes it for an argumentative sentence; (2) When arguing over an issue, people may use irony to attack the opposite stance. Such statements may mislead the model, as S3 shows; (3) We also find that our model may predict wrongly when double negation appears. The part of sentence S4, "a ban on human cloning should be opposed", conveys the supporting stance with a double negative statement. With a limited amount of training data, the model may not be able to comprehend relatively complicated syntax. On the other hand, some examples in the dataset may have been wrongly annotated. According to Stab et al. (2018b), an argument is defined as a span of text with reasoning or evidence that can be used to support or oppose a topic. S5 does explicitly declare its supporting stance, but nevertheless has no reasoning or evidence.

Conclusion
In this work, we propose a novel approach to leverage lexicons from both in-domain and out-of-domain sources for the task of argumentative component mining. We explore several sources from different application scenarios, from the claim lexicon (Levy et al., 2017) to resources from other domains such as sentiment analysis (Hu and Liu, 2004) and emotion detection (Mohammad and Turney, 2013), and the general-domain lexicon resource WordNet (Miller, 1995). Experimental results confirm the effectiveness of integrating lexicon information. The scarcity of resources in argument mining is also highlighted in the discussion.