DSNDM: Deep Siamese Neural Discourse Model with Attention for Text Pairs Categorization and Ranking

In this paper, the utility and advantages of the discourse analysis for text pairs categorization and ranking are investigated. We consider two tasks in which discourse structure seems useful and important: automatic verification of political statements, and ranking in question answering systems. We propose a neural network based approach to learn the match between pairs of discourse tree structures. To this end, the neural TreeLSTM model is modified to effectively encode discourse trees and DSNDM model based on it is suggested to analyze pairs of texts. In addition, the integration of the attention mechanism in the model is proposed. Moreover, different ranking approaches are investigated for the second task. In the paper, the comparison with state-of-the-art methods is given. Experiments illustrate that combination of neural networks and discourse structure in DSNDM is effective since it reaches top results in the assigned tasks. The evaluation also demonstrates that discourse analysis improves quality for the processing of longer texts.


Introduction
The growing popularity of social networks and the widespread use of social media have contributed to the emergence of many NLP tasks associated with the processing of statements. Such tasks include sentiment analysis, opinion and argumentation mining, text summarization and so forth.
Despite the success of transformer-based neural networks, such as BERT (Devlin et al., 2018) and its modifications, in various NLP tasks, they also have disadvantages, since they typically analyze only the plain text, which can be quite long and complex. At the same time, discourse structure contains important knowledge for solving these tasks, and several researchers have demonstrated its significance (Galitsky et al., 2015; Bhatia et al., 2015; Ji and Smith, 2017). However, the value of discourse has so far been investigated only for some single-text categorization tasks.
In this study, we demonstrate the utility and advantages of the matching of discourse tree structures of text pairs. Discourse analysis seems effective in textual entailment, text simplification and paraphrase detection tasks. However, it is necessary to analyze texts on the sentence level in most cases. We consider typical NLP tasks in which input texts are quite long and paragraphs are given initially.
One such task is automatic verification of factual texts. Politicians may use unreliable statements for their own purposes. Because there are so many such statements, they should be evaluated automatically for reliability and for the possibility of manipulating public opinion. In most cases, it is possible to extract some confirmation or refutation for a given factual text. We therefore investigate the utility of discourse analysis in the classification of pairs of texts: statements and their justifications. Discourse structure may contain crucial knowledge even for the classification of the statements alone, but it can be even more effective when the confirmations and refutations are analyzed as well.
Another suitable task is ranking in question answering systems. It was shown that the discourse structures of questions and correct answers should correlate (Galitsky et al., 2015). Companies are interested in developing QA systems in order to ease interaction with customers. All questions can be divided into two groups: factoid and non-factoid. Factoid questions must be answered to provide specific information, and non-factoid ones to maintain a dialogue. It is worth emphasizing that answering non-factoid questions is more challenging because there is no single correct answer for each question. We consider non-factoid questions asked on Internet forums, since discourse analysis seems to be more helpful in this case.
The main technical idea of this paper is to combine discourse analysis and recursive neural network TreeLSTM (Tai et al., 2015), which previously obtained the state-of-the-art results in some single text classification tasks.
Our contributions can be summarized as follows:
• We propose a neural network approach to learn the match between pairs of discourse tree structures. To this end, we modify the basic TreeLSTM model to effectively encode the discourse structure and propose DSNDM (Deep Siamese Neural Discourse Model) to analyze pairs of texts.
• We suggest a way to integrate the attention mechanism into the DSNDM model.
• We investigate the value of the proposed approach on two tasks and experimentally confirm the utility and importance of discourse analysis for processing text pairs.
Our paper is organized as follows. Firstly, we summarize related work and introduce some base concepts. We continue with the description of the base model and its modifications. Then, we discuss the obtained results, error analysis and propose directions for further research.

Related Work
There are several approaches to the fact-checking problem. The best models presented in the FEVER competition (Thorne et al., 2018) consist of a stage that extracts supporting or refuting information and a classification stage. In our case, justifications have already been extracted, so the first stage is not needed. The BERT model (Devlin et al., 2018) is frequently used as the main model of this approach (Nie et al., 2019; Alonso-Reina et al., 2019). It is worth emphasizing that BERT cannot process long texts (the sequence is limited to 512 tokens). Therefore, it is necessary to extract the key information from the given justification paragraph. Besides, BERT cannot efficiently store and process discourse features.
Another approach uses knowledge graphs. Clancy et al. (2019) proposed the use of relations between the entities of the graph in order to confirm some "distill" information extracted from the statement. Ciampaglia et al. (2015) suggested Knowledge Linker, the main idea of which is that if the path between entities in the knowledge graph is short, then the factual text containing them is reliable. It should be mentioned that this approach is generally applicable only to factoid statements since the entities must exist within knowledge graphs.
Finally, we distinguish the third approach which considers structural information extracted from texts (Wu et al., 2017;Galitsky and Ilvovsky, 2016). Galitsky et al. (2015) proposed to match discourse trees and solved the categorization task using Tree Kernel-based SVM. However, this approach does not utilize any modern neural networks. At the same time, recursive neural networks are gaining popularity (Ji and Smith, 2017;Bhatia et al., 2015;Tai et al., 2015). The main goal of them is to encode tree-like structures, such as syntax and discourse trees. These models achieved superior results in the single text categorization tasks, but researchers did not investigate the value of discourse analysis for processing pairs of texts. However, this approach is promising for the assigned task since frequently not only unreliable texts have a similar discourse structure, but the discourse structure of texts refuting them is also similar.
The main baselines for the question answering problem are models that use keywords for ranking: TF-IDF, BM25 and its modifications (Okapi BM25, BM25F). Their results are frequently poor and need to be re-ranked using more complex methods. Neural network models, such as BERT, achieve state-of-the-art results (Hashemi et al., 2020). It should be mentioned that different training techniques for ranking are often not investigated.
In addition, some fact-checking approaches can be applied in question answering systems. For instance, Cui et al. (2017) and Liu et al. (2019) considered the possibility of using knowledge graphs. Galitsky (2019) investigated the value of discourse analysis in QA systems, but did not utilize any neural network approaches.

Figure 1: A discourse tree for a text from an Internet forum. Text: "Wow, make up your mind. Either populations change over time (evolve) or they don't. Would your wingless beetles be able to produce wings again if they were somehow beneficial again? You are starting to sound like Darwin and his finches."

A discourse tree is constructed step by step from the leaves to the root.
Initially, the text is divided into several intervals, called elementary discourse units (EDUs). Each of them contains a single thought that cannot be split further. These intervals are then connected by discourse relations such as "Elaboration", "Joint" and "Condition". Merging elementary units produces larger intervals of the text, which can in turn be connected by the corresponding discourse relations. This process continues until only one node remains (the root of the tree).
RST identifies two types of vertices: "Nucleus" and "Satellite". Vertices of the first type contain the crucial parts of the text, whereas vertices of the second type provide additional information. Figure 1 demonstrates an example of a discourse tree for a text from an Internet forum.
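The bottom-up construction described above can be illustrated with a minimal Python sketch. The class `DiscourseNode`, the helper names and the relation labels attached to the example spans are hypothetical and chosen only for illustration; a real discourse parser decides the segmentation and the relations.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DiscourseNode:
    """One vertex of a binary discourse tree (hypothetical layout)."""
    nuclearity: str                      # "Nucleus", "Satellite" or "Root"
    relation: Optional[str] = None       # relation joining the two children
    text: str = ""                       # EDU text; empty for internal nodes
    left: Optional["DiscourseNode"] = None
    right: Optional["DiscourseNode"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def merge(relation, left, right, nuclearity="Root"):
    """Join two adjacent spans into a larger span (one bottom-up step)."""
    return DiscourseNode(nuclearity, relation=relation, left=left, right=right)


def count_nodes(node):
    """Total number of vertices in the subtree."""
    if node is None:
        return 0
    return 1 + count_nodes(node.left) + count_nodes(node.right)


def edus(node):
    """EDU texts in left-to-right order."""
    if node.is_leaf():
        return [node.text]
    return edus(node.left) + edus(node.right)


# Illustrative labels only, not the output of a parser.
e1 = DiscourseNode("Nucleus", text="Either populations change over time (evolve)")
e2 = DiscourseNode("Nucleus", text="or they don't.")
e3 = DiscourseNode("Satellite", text="You are starting to sound like Darwin and his finches.")
span = merge("Joint", e1, e2, nuclearity="Nucleus")
tree = merge("Elaboration", span, e3)
```

The number of nodes returned by `count_nodes` is the tree-size statistic used later in the error analysis.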

EDU Embeddings
The pretrained Deep Averaging Network was chosen to construct embeddings of elementary discourse units (text spans). This model is a variation of the Universal Sentence Encoder, proposed by Cer et al. (2018). DAN averages word embeddings and applies a stack of fully-connected layers to get the final vector representation of the text.
We also consider part-of-speech tags as additional information about the text. POS tags are embedded as vectors using one-hot encoding.
The final vector representation of an EDU is the concatenation of the semantic embedding from DAN and the syntactic embedding built from the POS tags.
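A minimal sketch of this concatenation is given below. It rests on assumptions not stated in the text: the tag set `POS_TAGS` is a toy one, the one-hot vectors of all tags in an EDU are pooled by averaging, and the semantic vector is a stand-in for the output of the pre-trained encoder.

```python
POS_TAGS = ["NOUN", "VERB", "ADJ", "ADV", "PRON", "OTHER"]  # toy tag set (assumption)


def one_hot(tag):
    """One-hot vector for a single POS tag; unknown tags map to OTHER."""
    index = POS_TAGS.index(tag) if tag in POS_TAGS else POS_TAGS.index("OTHER")
    return [1.0 if i == index else 0.0 for i in range(len(POS_TAGS))]


def pos_embedding(tags):
    """Average the one-hot vectors of all tags in the EDU (assumed pooling)."""
    if not tags:
        return [0.0] * len(POS_TAGS)
    vectors = [one_hot(tag) for tag in tags]
    return [sum(column) / len(tags) for column in zip(*vectors)]


def edu_embedding(semantic_vector, tags):
    """Concatenate the semantic part and the syntactic part."""
    return list(semantic_vector) + pos_embedding(tags)


semantic = [0.12, -0.40, 0.33]  # stand-in for a sentence-encoder vector
vector = edu_embedding(semantic, ["PRON", "VERB", "NOUN", "NOUN"])
```

The resulting vector has the dimension of the semantic embedding plus the size of the tag set.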

Recursive Neural Network
A recursive neural network encodes a tree as a vector of a fixed dimension. Similar to the tree construction in RST, the encoding occurs recursively along subtrees from leaves to root. The process of obtaining an embedding of a subtree with the root in the node i can be described as follows.
Let $x_i$ denote the text embedding corresponding to node $i$: the pre-trained embedding of the EDU if $i$ is a leaf, and the embedding of the empty text otherwise. The Text Encoder applies a fully-connected layer to this pre-trained vector (equation (1)).
Let nodes $j$ and $k$ be the children of node $i$, and let $r$ be the name of the discourse relation that links them. Dummy child vertices containing empty text are added for the leaves. The vector representation of the input associated with $i$ concatenates four vectors, as given in equation (2), where $I$ is the indicator function. The embedding of the tree rooted at node $i$ is computed from the embeddings of its left and right subtrees using the binary TreeLSTM model (Tai et al., 2015).
We use TreeLSTM with the dropout regularization for recurrent networks suggested by Semeniuta et al. (2016). Formally, the model is expressed with equations (4), (5) and (6). Here, $\sigma$ is the sigmoid function, $D$ is the dropout function, $\alpha$ is the dropout rate and $*$ is elementwise multiplication. The memory cell is denoted as $c$. There are two forget gate outputs since the trees are binary.
The embedding received at the root of the tree is the vector representation of the entire text.
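For reference, the binary TreeLSTM of Tai et al. (2015) with dropout applied to the candidate update, in the spirit of Semeniuta et al. (2016), can be written as below. This is a sketch under assumptions rather than a reproduction of equations (4)-(6): $z_i$ stands for the input vector of node $i$, $h_j$ and $h_k$ are the hidden states of its children, $g_i$ denotes the input gate (renamed to avoid a clash with the node index), and the parameter names $W$, $U$, $b$ are illustrative.

```latex
\begin{aligned}
g_i    &= \sigma\left(W^{(g)} z_i + U^{(g)}_{0} h_j + U^{(g)}_{1} h_k + b^{(g)}\right) \\
f_{i0} &= \sigma\left(W^{(f)} z_i + U^{(f)}_{00} h_j + U^{(f)}_{01} h_k + b^{(f)}\right) \\
f_{i1} &= \sigma\left(W^{(f)} z_i + U^{(f)}_{10} h_j + U^{(f)}_{11} h_k + b^{(f)}\right) \\
o_i    &= \sigma\left(W^{(o)} z_i + U^{(o)}_{0} h_j + U^{(o)}_{1} h_k + b^{(o)}\right) \\
u_i    &= \tanh\left(W^{(u)} z_i + U^{(u)}_{0} h_j + U^{(u)}_{1} h_k + b^{(u)}\right) \\
c_i    &= g_i * D(u_i, \alpha) + c_j * f_{i0} + c_k * f_{i1} \\
h_i    &= o_i * \tanh(c_i)
\end{aligned}
```

In this form the memory cell $c_i$ adds the two forget-gated child cells $c_j * f_{i0}$ and $c_k * f_{i1}$, one per subtree, and dropout affects only the candidate update $u_i$.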

DSNDM
We propose DSNDM, a siamese model based on the recursive neural network. The final model has two stages.
First, the embeddings of the discourse trees of the two input texts are computed. The trainable parameters are shared between both texts. At the next stage, the resulting embeddings are aggregated to solve the categorization task. Here, the model concatenates the computed tree embeddings and applies a sequence of two fully-connected layers to the result. The last layer uses the Softmax function to map the input features to the class probability space.
The main advantage of the proposed model is that it is capable of end-to-end learning. Figure 2 shows the architecture of the model for the fact-checking problem. The architecture is almost the same for question answering systems, except for the inputs: the first text is a question, and the second is an answer.
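The two stages can be sketched in plain Python as follows. This is an illustration under assumptions, not the authors' implementation: `toy_encoder` is a placeholder for the recursive tree encoder, the `tanh` activation of the first layer and the random initialization are arbitrary choices, and all names are hypothetical.

```python
import math
import random


def linear(weights, bias, x):
    """Fully-connected layer; weights is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_out, n_in):
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class SiameseHead:
    """Shared encoder for both texts, then two fully-connected layers."""

    def __init__(self, encoder, dim, hidden, n_classes, seed=0):
        rng = random.Random(seed)
        self.encoder = encoder  # the same parameters encode both inputs
        self.w1, self.b1 = init_layer(rng, hidden, 2 * dim)
        self.w2, self.b2 = init_layer(rng, n_classes, hidden)

    def predict_proba(self, first_tree, second_tree):
        # Stage 1: encode both trees; stage 2: concatenate and classify.
        pair = self.encoder(first_tree) + self.encoder(second_tree)
        hidden = [math.tanh(v) for v in linear(self.w1, self.b1, pair)]
        return softmax(linear(self.w2, self.b2, hidden))


def toy_encoder(tree):
    """Placeholder for the tree encoder; returns a fixed-size vector."""
    return [float(len(tree)), sum(len(t) for t in tree) / 100.0]


model = SiameseHead(toy_encoder, dim=2, hidden=4, n_classes=6)
probs = model.predict_proba(["statement edu"], ["justification edu one", "edu two"])
```

Because the encoder object is shared, any gradient computed from the classification loss would update the same parameters for both inputs, which is what makes the architecture siamese.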

Integration of the Attention Mechanism
We suggest a way to integrate the attention mechanism (Vaswani et al., 2017), which has gained popularity in many NLP tasks. The main idea is that the constructed embedding of a question/statement can be used to filter information while constructing the embedding of an answer/justification. Thus, at each step, the model decides which subtree carries more useful information. The attention module can be integrated into the equations of the TreeLSTM model as follows.
Let us consider the Attention module in which the key is the vector $k$ and the values are represented by the matrix $Q$. In our case, the key is the embedding of the first text. The matrix $Q$ is composed of the vectors $q_1 = c_j * f_{i0}$ and $q_2 = c_k * f_{i1}$ and has dimension $2 \times d$, where $d$ is the dimension of the memory vector. Then, instead of equation (5), the memory cell vector is recalculated using attention matrices, as given in equations (7) and (8). Here, SM is the Softmax layer, which is used for normalization. In (7), multiplication by 2 is necessary to maintain a balance with equation (5). In (8), $W_K$, $W_Q$ and $W_V$ are trainable parameter matrices of the Attention module. Equation (7) is used instead of (5) only to construct the embedding of the second text.
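The re-weighting of the two child vectors can be sketched in Python as below. This is not a reproduction of equations (7) and (8): the dot-product scoring, the scaling by the square root of $d$, and the function name `attend` are assumptions made for illustration only.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(k, q1, q2, w_k, w_q, w_v):
    """Weight q1 and q2 by their match with the key k (assumed scoring)."""
    d = len(q1)
    key = matvec(w_k, k)
    scores = [
        sum(a * b for a, b in zip(key, matvec(w_q, q))) / math.sqrt(d)
        for q in (q1, q2)
    ]
    weights = softmax(scores)
    v1, v2 = matvec(w_v, q1), matvec(w_v, q2)
    # The factor 2 keeps the scale of a plain sum of the two vectors.
    mixed = [2.0 * (weights[0] * a + weights[1] * b) for a, b in zip(v1, v2)]
    return mixed, weights


identity = [[1.0, 0.0], [0.0, 1.0]]
q1, q2 = [2.0, 0.0], [0.0, 2.0]
mixed, weights = attend([1.0, 0.0], q1, q2, identity, identity, identity)
balanced, _ = attend([0.0, 0.0], q1, q2, identity, identity, identity)
```

With an uninformative key the two weights are equal, and with an identity value matrix the doubled weighted average reduces to the plain sum `q1 + q2`, which illustrates why a factor of 2 keeps the attended cell on the same scale as the unweighted one.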

Training Techniques for Ranking
DSNDM can be used both in the text classification task and in the ranking task. In this paper, we investigate three ranking techniques.
1) Classification-based. All pairs in the dataset can be divided into two groups based on relevance. The suggested model can be applied to solve the binary classification of text pairs with these groups. The answers for each question are ranked using the class probabilities predicted by the model. In this case, the architecture of the model coincides completely with the base one, and the cross-entropy loss is used to train it.
2) Pointwise ranking. In this case, the main task is a regression problem. Let $\{(q_i, a_i)\}_{i=1..N}$ be the set of given pairs and $\{r_i\}$ the corresponding relevance scores. Let the proposed model be denoted as DSNDM$(q, a, w)$, where $w$ are the model parameters. Then, the model minimizes a regression loss between DSNDM$(q_i, a_i, w)$ and $r_i$.
3) Pairwise ranking. Here, the inputs are triplets $\{(q_i, a_i^+, a_i^-)\}_{i=1..M}$, where a relevant and an irrelevant answer are selected for each question. These triplets can be generated from pairs using relevance scores. The ranking model solves the regression problem and minimizes the loss from (11).

LIAR-PLUS Dataset
In addition to statements and their justifications, the LIAR-PLUS dataset contains metadata with information about the politician and the global context of the statement. The dataset can be used in four scenarios, depending on the restriction on the available data: S (only the statement is used), S + M (statement and metadata), SJ (pairs: statement and justification), and S + JM (all available data). The model proposed in this paper is applied to pairs in the SJ scenario. The model can also be used in the S scenario by utilizing only the recursive neural network.
The dataset contains 12,782 statements which were split into the train, validation and test samples in the ratio of 10:1:1. This dataset is balanced, and the accuracy metric can be used to compare results.

Implementation Details
First, text preprocessing was applied. We converted texts to lower case and removed extra characters and stop words. The open-source discourse parser ALT (Joty et al., 2012) was applied to the preprocessed texts to obtain discourse trees. Finally, the constructed trees were converted to the format described in section 3.1.
We used the DyNet Python library to implement our model. The size of the hidden layer in the LSTM cells was set to 100, the dropout rate $\alpha$ to 0.1, the learning rate to 0.004, and the number of units in the fully-connected layer of the Text Encoder to the dimension of $x_i$. We chose the Adagrad optimizer, which was less prone to overfitting on the assigned task. The optimal number of epochs is 4-9. The model was trained in mini-batches of 150 pairs of texts.

Experiments
The parser identified 18 unique discourse relations. The most frequent relations are "Elaboration" (chosen by default), "Attribution", "Joint" and "Same-Unit". Such trivial relations are generally frequent in texts, and the ALT parser tends to use them in uncertain cases.
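Relation statistics of this kind, and their comparison between two classes, can be computed with a short Python sketch. The data below are toy values and the function name `relation_frequencies` is hypothetical; nothing here is taken from LIAR-PLUS.

```python
from collections import Counter


def relation_frequencies(trees_relations):
    """Relative frequency of each relation over a list of trees.

    trees_relations holds one list of relation labels per tree.
    """
    counts = Counter(rel for relations in trees_relations for rel in relations)
    total = sum(counts.values())
    return {rel: n / total for rel, n in counts.items()}


# Toy data for two classes.
true_trees = [["Elaboration", "Attribution"], ["Attribution", "Joint"]]
pants_fire_trees = [["Joint", "Elaboration"], ["Joint", "Elaboration"]]

freq_true = relation_frequencies(true_trees)
freq_false = relation_frequencies(pants_fire_trees)

# Positive values: the relation is more frequent in the first class.
difference = {
    rel: freq_true.get(rel, 0.0) - freq_false.get(rel, 0.0)
    for rel in set(freq_true) | set(freq_false)
}
```

Relative rather than absolute frequencies are used so that classes with different numbers of trees remain comparable.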
We investigated the difference between the relation distributions for the instances in the "true" and "pantsfire" classes. The "Joint" relation is less common in truthful statements than in misleading ones (relative frequencies are 0.064 and 0.073). This suggests that politicians tend to construct longer, more complex sentences in deceptive statements. Besides, the "Attribution" relation is used more often in truthful statements (frequencies are 0.17 and 0.15). In most cases, it indicates a link to the source. Thus, the relations contain some important information by themselves.
We compared the model with the methods proposed in (Alhindi et al., 2018). In addition to well-known baselines (such as logistic regression and SVM), BiLSTM and P-BiLSTM are considered. The latter is a siamese model based on the BiLSTM architecture. Table 1 demonstrates the results for the 6-class and binary categorization tasks.
The table shows that the DSNDM model improves substantially on the baselines, especially in the case of multiclass classification.
The fully-connected layer in the Text Encoder is crucial, since it adds up to 0.02 to accuracy. The use of the POS tag embeddings also improves the overall quality by approximately 0.003-0.01.
The DSNDM model with the integrated attention module (denoted as DSNDM + Att.) reached the best results on the test set. The improvement is small, which we attribute to the binary structure of the trees (the attention module re-weights only two vectors at each node).

Error Analysis
It is worth emphasizing that in some cases the trees for statements contain only one node. In such cases, discourse analysis does not suffice to categorize them. For the deepest trees, which contain more than 45 nodes in the statement and justification in total (there are 89 such instances in the dataset), the F1-score is higher than 0.46.
The confusion matrix is shown in Figure 3. It demonstrates that DSNDM mainly confuses adjacent labels. However, it also confuses the classes "false" and "true" in some cases.
We distinguish several types of such instances, which are demonstrated in Table 3 (see Appendix A). First, there are cases when the refutation partly repeats the statement. Then, the model with attention focuses mainly on the repeated part and marks the misleading statement as "true". Second, the justification text can be extracted inaccurately and be insufficient to estimate the veracity of the statement. Apart from that, the justification can be complex and contain only one useful sentence, as in the third example. Finally, in the last pair, the justification indicates that the statement can be labeled as "false" in some general cases, but has the label "true" in the considered case. Therefore, this justification contains irrelevant thoughts and could be formulated more precisely.
Thus, the quality of the proposed model is limited by several factors: the size of the discourse trees, the quality of the discourse parser, and the quality of the provided justifications.

ANTIQUE Dataset
This dataset (Hashemi et al., 2020) contains non-factoid questions with a set of possible answers for each of them. The authors selected questions from the Yahoo! Webscope L6 (nfL6) database. The questions were pre-filtered: short questions, duplicates, and some complex cases were removed.
The corpus contains 2,426 questions in the training sample and 200 in the test sample. Answers for each question were selected both from the question forum thread and from other threads using the BM25 algorithm. In this way, 27,422 answers were allocated for training, and 6,589 instances for testing.
The resulting QA pairs were labeled on a 4-point scale depending on the relevance of the answers using a crowdsourcing procedure. The authors also proposed a binary classification task in which instances with labels 1 and 2 are considered irrelevant, and instances with labels 3 and 4 are considered relevant. Thus, the most common ranking metrics, such as MAP and MRR, can be used in the second task. At the same time, the multiclass metric nDCG can also be considered. The number of best answers differs between questions, but on average it is approximately equal to 8.
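The binarized ranking metrics mentioned above can be computed as in the following Python sketch. The rankings are toy data, the function names are hypothetical, and average precision is normalized by the number of relevant answers present in the ranked list.

```python
def binarize(label):
    """Labels 3 and 4 count as relevant, labels 1 and 2 as irrelevant."""
    return 1 if label >= 3 else 0


def reciprocal_rank(ranked_labels):
    """1 / position of the first relevant answer, or 0 if there is none."""
    for position, label in enumerate(ranked_labels, start=1):
        if binarize(label):
            return 1.0 / position
    return 0.0


def average_precision(ranked_labels):
    """Mean of precision values taken at each relevant position."""
    hits, total = 0, 0.0
    for position, label in enumerate(ranked_labels, start=1):
        if binarize(label):
            hits += 1
            total += hits / position
    return total / hits if hits else 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Each inner list holds the gold labels of one question's answers,
# already sorted by model score (best first). Toy data.
rankings = [[4, 1, 3], [2, 3, 1]]
mrr = mean(reciprocal_rank(r) for r in rankings)
map_score = mean(average_precision(r) for r in rankings)
```

For these toy rankings the first question has a relevant answer at the top and the second at position two, so MRR is the mean of 1 and 0.5.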
The dataset is not balanced: there are almost twice as many relevant answers as irrelevant ones. The authors used a negative sampling procedure to train baseline models, increasing the size of the dataset several times. However, it is important to emphasize that these additional QA pairs were not included in the publicly available dataset.
Questions are not very long and contain about 11 words on average. At the same time, the answers are much longer and contain more than 47 words on average. Therefore, it can be problematic to use the standard BERT model, but it is an advantage for the discourse analysis.

Implementation Details
The implementation details are almost the same as described in Sect. 4.1.2 except for some hyperparameters. It is better to choose the smaller dimension of the hidden vectors. The dimension of vectors in TreeLSTM was set to 100, and in the TextEncoder layer was set to 64. It takes 1-3 epochs to achieve optimal quality. A tenth of the training set was used as a validation sample during training.

Experiments
The discourse parser identified 18 different discourse relations like in the first task. However, in this case, the frequency statistics of relations are very similar for different classes. It is due to the fact that in this task the second text (answer) is not auxiliary.
We compared the suggested model with the baselines presented in (Hashemi et al., 2020). It should be highlighted that these baselines were trained on the extended dataset, since the authors additionally performed the negative sampling procedure. Therefore, it is not correct to compare the results obtained on the available base dataset with the results obtained on the extended dataset. Apart from that, we considered several models discussed in (MacAvaney et al., 2020). In that paper, several negative examples were also added for each question. However, they were most likely selected only from the training corpus, since the authors were unable to reproduce the BERT results from the original paper. MacAvaney et al. (2020) proposed various modifications of the training loss that add a weight for each pair. We do not compare with the results obtained with a modified curriculum, since we consider only the basic pointwise and pairwise losses.
The comparison results are presented in Table 2. They show that the DSNDM + Att. model trained with the pairwise loss achieves high MRR and P@1. Its results are superior to those of the best BERT model presented in (MacAvaney et al., 2020). We also trained BERT ourselves and obtained results close to theirs; we could not reproduce the results from the original paper either. In addition, pointwise ranking performed better than the classification-based method.
The attention mechanism improved quality in all cases, especially for the pointwise and pairwise techniques.
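The pointwise and pairwise objectives compared in these experiments can be sketched in Python as follows. The concrete forms are assumptions, since the loss equations are not reproduced here: the pointwise loss is taken to be the mean squared error, the pairwise loss a hinge loss with a hypothetical `margin` parameter, and `make_triplets` pairs every higher-rated answer with every lower-rated one.

```python
def pointwise_loss(scores, relevance):
    """Mean squared error between predicted scores and relevance labels."""
    return sum((s - r) ** 2 for s, r in zip(scores, relevance)) / len(scores)


def pairwise_loss(pos_scores, neg_scores, margin=1.0):
    """Hinge loss: a relevant answer should outscore an irrelevant one by margin."""
    losses = [max(0.0, margin - p + n) for p, n in zip(pos_scores, neg_scores)]
    return sum(losses) / len(losses)


def make_triplets(question, answers):
    """Build (question, better, worse) triplets from (answer, relevance) pairs."""
    return [
        (question, better, worse)
        for better, r_better in answers
        for worse, r_worse in answers
        if r_better > r_worse
    ]


point = pointwise_loss([3.5, 1.0], [4, 1])
pair = pairwise_loss([2.0, 0.5], [0.5, 1.0])
triplets = make_triplets("q", [("a", 4), ("b", 1), ("c", 3)])
```

The pairwise loss is zero for a triplet once the relevant answer outscores the irrelevant one by at least the margin, so training effort concentrates on the pairs that are still ordered incorrectly.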

Error Analysis
We investigated the mistakes of DSNDM trained for the classification problem. Figure 4 shows the number of nodes in the discourse trees of questions and answers for correctly and incorrectly classified pairs. Statistics are calculated only for pairs for which the model predicts a probability that exceeds the selected threshold. One can see that for both questions and answers, the number of nodes in the discourse trees of correctly classified pairs is greater than for incorrectly classified ones. Thus, DSNDM makes wrong predictions mostly for small trees. The plot for questions also demonstrates that the model's greater confidence in a wrong answer is frequently associated with a smaller question tree. Therefore, the quality of the proposed model is closely related to the size of the discourse trees for this task too.
In this case, we distinguish several typical mistakes, which are demonstrated in Table 4 (see Appendix A). In the first pair, the question contains only a few significant keywords, and the model focuses mainly on them. Although the answer is irrelevant and unrelated to the topic of the question, it uses the same keywords. Thus, similar EDU embeddings do not contribute to the correct classification. In the second example, the answer and the question have opposite meanings. That is, although the answer is correct, its text refutes the information in the question. If the question contains only one node, such an instance is one of the most difficult to analyze. Finally, the last example demonstrates that in some cases the correct answers may be formulated in a way not expected by the authors of the questions. Thus, the quality of the model is also limited by the variability of possible answers.

Conclusion and Future Work
In this paper, we investigated the utility and importance of the discourse analysis for text pairs categorization and ranking. We considered two typical tasks in which discourse analysis seems promising: automatic verification of political statements and ranking in question answering systems.
We modified TreeLSTM to effectively encode discourse trees and proposed DSNDM which is capable of processing pairs of texts. In addition, the integration of the attention mechanism in the proposed model was suggested to obtain more useful embeddings of subtrees. Moreover, we investigated three training techniques for the ranking task.
The experiments were performed on the LIAR-PLUS and ANTIQUE datasets. DSNDM efficiently learned the match between discourse tree structures and achieved high quality in both tasks. Besides, the attention module improved the metrics of the base model in all cases. The error analysis showed that the model processes deeper trees more successfully.
There are possible directions for future work: the use of trees not only of a binary structure, the modification of vector representations of EDUs, as well as the investigation of the performance of DSNDM in other various tasks where discourse analysis may be helpful, e.g. machine translation, chat-bots and other QA systems. Apart from that, we will experiment with other hierarchical structures (e.g. syntactic) for deeper analysis of the importance of the RST-based structure in the proposed model.

Table 3: Typical mistakes of DSNDM on the LIAR test set where the model confuses "true" and "false" instances.
Text: "Not what is going to happen this year. Our rating It's been a longer and colder winter than in recent years. But that doesn't erase a trend that's been well-established. The number of days that the lakes have ice on them, making them safe for ice fishing, has declined."
Label: true

Table 4: Typical mistakes of DSNDM on the ANTIQUE test set where the model confuses "Correct answer" and "Out of context" instances.
Question: how does disneyland make it snow?
Answer: Well if you are using snow, just lay on you back in it and move your arms from your sides to the top of you head and open and close you legs a few times... to make snow angels!!!!
Label: Correct answer