Automatic Stance Detection Using End-to-End Memory Networks

We present an effective end-to-end memory network model that jointly (i) predicts whether a given document can be considered as relevant evidence for a given claim, and (ii) extracts snippets of evidence that can be used to reason about the factuality of the target claim. Our model combines the advantages of convolutional and recurrent neural networks as part of a memory network. We further introduce a similarity matrix at the inference level of the memory network in order to extract snippets of evidence for input claims more accurately. Our experiments on a public benchmark dataset, FakeNewsChallenge, demonstrate the effectiveness of our approach.


Introduction
Recently, an unprecedented amount of false information has been flooding the Internet with aims ranging from affecting individual people's beliefs and decisions (Mihaylov et al., 2015a,b; Mihaylov and Nakov, 2016) to influencing major events such as political elections (Vosoughi et al., 2018). Consequently, manual fact checking has emerged with the promise to support accurate and unbiased analysis of public statements.
As manual fact checking is a very tedious task, automatic fact checking has been proposed as an alternative. This is often broken into intermediate steps in order to alleviate the task complexity. One such step is stance detection, which is also useful for human experts as a stand-alone task. The aim is to identify the relative perspective of a piece of text with respect to a claim, typically modeled using labels such as agree, disagree, discuss, and unrelated. Figure 1 shows some examples.
Here, we address the problem using a novel model based on end-to-end memory networks (Sukhbaatar et al., 2015), which incorporates convolutional and recurrent neural networks, as well as a similarity matrix. Our model jointly addresses the problems of predicting the stance of a text document with respect to a given claim, and of extracting relevant text snippets as support for the prediction of the model. We further introduce a similarity matrix, which we use at inference time in order to improve the extraction of relevant snippets.
The experimental results on the Fake News Challenge benchmark dataset show that our model, which is very feature-light, performs similarly to the state of the art, which is achieved by more complex systems. Our contributions can be summarized as follows: (i) We apply a novel memory network model enhanced with CNN and LSTM networks for stance detection. (ii) We further propose a novel extension of the general architecture based on a similarity-based matrix, which we use at inference time, and we show that this extension offers sizable performance gains. (iii) Finally, we show that our model is capable of extracting meaningful snippets from the input text document, which is useful not only for stance detection, but more importantly can be useful for human experts who need to decide on the factuality of a given claim.
Long-term memory is necessary in order to determine the stance of a long document with respect to a claim, as relevant parts of a document (paragraphs or text snippets) can indicate the perspective of a document with respect to a claim. Memory networks were designed to remember past information (Sukhbaatar et al., 2015) and they can be particularly well-suited for stance detection since they can use a variety of inference strategies alongside their memory component.
In this section, we present a novel memory network (MN) for stance detection. It contains a new inference component that incorporates a similarity matrix to extract, with better accuracy, textual snippets that are relevant to the input claims.

Overview of the network
A memory network is a 5-tuple {M, I, G, O, R}, where the memory M is a sequence of objects or representations, the input I is a component that maps the input to its representation, the generalization component G (Sukhbaatar et al., 2015) updates the memory with respect to new input, the output O generates an output for each new input and the current memory state, and finally, the response R converts the output into a desired response format, e.g., a textual response or an action. These components can potentially use many different machine learning models.
Our new memory network for stance detection is a 6-tuple {M, I, F, G, O, R}, where F represents the new inference component. It takes an input document d as evidence and a textual statement s as a claim and converts them into their corresponding representations in the input I. Then, it passes them to the memory M. Next, the relevant parts of the input are identified in F, and afterwards they are used by G to update the memory. Finally, O generates an output from the updated memory, and converts it to a desired response format with R. The network architecture is depicted in Figure 2. We describe the components below.

Input Representation Component
The input to the stance detection algorithm is a document $d$ and a textual statement $s$ as a claim: see lines 2 and 3 in Table 1. Each $d$ is segmented into paragraphs $x_j$ of varied lengths, where each $x_j$ is considered as a potential piece of evidence for stance detection. Indeed, a paragraph usually represents a coherent argument, unified under one or more inter-related topics.

Table 1, lines 1-10:
1 Inputs:
2 (1) A document (d) as a set of evidence (x_j)
3 (2) A textual statement containing a claim (s)
4 Outputs:
5 (1) Predicting the relative perspective (or stance) of a pair (d, s) to a claim as agree, disagree, discuss, or unrelated
6 Inference outputs:
7 (2) Top K evidence x_j with their similarity scores
8 (3) Top K snippets of x_j with their similarity scores
9 Memory Network Model:
10 1. Input memory representation (I):

The input component in our model converts each $d$ into a set of potential pieces of evidence in a three-dimensional (3D) tensor space as shown below (see line 11 in Table 1):

$$d \rightarrow (X, W, E) \qquad (1)$$

where $X = \{x_1, \ldots, x_n\}$ is a set of paragraphs considered as potential pieces of evidence, such that each $x_j$ is represented by a set of words $W = \{w_1, \ldots, w_v\}$ (a global vocabulary of size $v$) and a set of neural representations $E = \{e_1, \ldots, e_v\}$ for the words in $W$. This 3D space is illustrated as a cube in Figure 2.

Each $x_j$ is encoded from the 3D space into a semantic representation at the input component using a Long Short-Term Memory (LSTM) network. The lower left component in Figure 2 shows our LSTM network, which operates on our input as follows (see also line 12 in Table 1):

$$(X, W, E) \xrightarrow{\mathrm{TimeDistributed(LSTM)}} \{m_1, \ldots, m_n\} \qquad (2)$$

where $m_j$ is the LSTM representation of $x_j$, and TimeDistributed() indicates a wrapper that enables training the LSTM over all pieces of evidence by applying the same LSTM model to each time-step of a 3D input tensor, i.e., $(X, W, E)$.
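The idea of applying one shared encoder to every paragraph of the 3D input can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `mean_pool_encoder` is a toy stand-in for the LSTM (it only averages word vectors), and `time_distributed` is a hypothetical helper that mimics the wrapper's behaviour on nested lists, not the wrapper used in the paper.

```python
def mean_pool_encoder(paragraph):
    """Toy stand-in for the LSTM: average the word vectors of one paragraph."""
    dims = len(paragraph[0])
    return [sum(word[i] for word in paragraph) / len(paragraph) for i in range(dims)]


def time_distributed(encoder, document):
    """Apply the same encoder (shared parameters) to every paragraph."""
    return [encoder(paragraph) for paragraph in document]


# document[j][t][i]: paragraph j, word t, embedding dimension i.
document = [
    [[1.0, 0.0], [3.0, 2.0]],              # paragraph x_1 with two words
    [[0.0, 4.0], [0.0, 0.0], [3.0, 2.0]],  # paragraph x_2 with three words
]

# One representation m_j per paragraph x_j.
memory = time_distributed(mean_pool_encoder, document)
```

The point of the wrapper is that all paragraphs share one encoder, so the number of parameters does not grow with the number of paragraphs.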
While LSTM networks are designed to effectively capture and memorize their inputs (Tan et al., 2016), Convolutional Neural Networks (CNNs) emphasize the local interaction between the words in the input word sequence, which is important for obtaining an effective representation. We use a CNN to encode each $x_j$ into its representation $c_j$ as shown in Equation 3 (see line 13 in Table 1):

$$(X, W, E) \xrightarrow{\mathrm{CNN}} \{c_1, \ldots, c_n\} \qquad (3)$$

The top left of Figure 2 shows that this representation is passed as a new input to the component M of our memory network.
We keep track of the computed n-grams from the CNN, so that we can use them later in the inference and in the response components (see Sections 2.3 and 2.6). For this purpose, we use a Maxout layer (Goodfellow et al., 2013) to take the maximum across $k$ affine feature maps computed by the CNN, i.e., pooling across channels. Previous work has investigated combining convolutional and recurrent representations by feeding the output of one network to the other as input (Tan et al., 2016; Donahue et al., 2015; Zuo et al., 2015; Sainath et al., 2015). In contrast, we feed their individual outputs into our memory network separately, and let the network decide which representation helps the target task better. We show the effectiveness of this choice below.
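The Maxout pooling mentioned above (taking the maximum across $k$ feature maps at each position) can be sketched in a few lines. The function name `maxout` and the toy feature maps are illustrative assumptions; the real layer operates on learned affine feature maps inside the CNN.

```python
def maxout(feature_maps):
    """Element-wise maximum across k feature maps (pooling across channels).

    feature_maps: list of k lists, each holding one feature map evaluated
    at the same n-gram positions.
    """
    if not feature_maps:
        raise ValueError("need at least one feature map")
    return [max(values) for values in zip(*feature_maps)]


# Three hypothetical feature maps over four n-gram positions.
maps = [
    [0.1, 0.7, -0.2, 0.0],
    [0.4, 0.2, 0.3, -0.5],
    [-0.3, 0.9, 0.1, 0.2],
]
pooled = maxout(maps)
```

Because the pooling runs across channels rather than across positions, each n-gram position keeps its own score, which is what allows n-grams to be traced back later.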
Similarly, we convert each input claim $s$ to its representation using the corresponding LSTM and CNN networks, as follows:

$$s_{lstm} = \mathrm{LSTM}(s), \qquad s_{cnn} = \mathrm{CNN}(s)$$

where $s_{lstm}$ and $s_{cnn}$ are the representations of $s$ computed using LSTM and CNN networks, respectively. Note that these are separate networks with different parameters from those used to encode the pieces of evidence. Lines 10-14 of Table 1 describe the above steps in representing I in our memory network. We encode each input document $d$ into a set of pieces of evidence $\{x_j\}\ \forall j$; the input component then computes LSTM and CNN representations, $m_j$ and $c_j$, respectively, for each $x_j$, and LSTM and CNN representations, $s_{lstm}$ and $s_{cnn}$, for each claim $s$.

Inference Component
The resulting representations are used to compute semantic similarity between claims and pieces of evidence. We define the similarity $P^j_{lstm}$ between $s$ and $x_j$ as follows (see also line 17 in Table 1):

$$P^j_{lstm} = s_{lstm}^{\top} \times M \times m_j$$

where $s_{lstm} \in \mathbb{R}^{q}$ and $m_j \in \mathbb{R}^{d}$ are LSTM representations of $s$ and $x_j$, respectively, and $M \in \mathbb{R}^{q \times d}$ is a similarity matrix capturing their similarity. For this purpose, $M$ maps $s$ and $x_j$ into the same space as shown in Figure 3. $M$ is a set of $q \times d$ parameters of the network, which are optimized during training.
In a similar fashion, we compute the similarity $P^j_{cnn}$ between $x_j$ and $s$ using the CNN representations as follows (see line 19 of Table 1):

$$P^j_{cnn} = s_{cnn}^{\top} \times M' \times c_j$$

where $s_{cnn} \in \mathbb{R}^{q'}$ and $c_j \in \mathbb{R}^{d'}$ are the representations of $s$ and $x_j$ obtained with CNN, respectively. The similarity matrix $M' \in \mathbb{R}^{q' \times d'}$ is a set of $q' \times d'$ parameters of the network and is optimized during training. $P^j_{lstm}$ and $P^j_{cnn}$ indicate the claim-evidence similarity vectors computed based on the LSTM and on the CNN representations of $s$ and $x_j$, respectively.
The rationale behind using the similarity matrix is that in our memory network model, as Figure 3 shows, we look for a transformation of the input claim $s$ such that $s' = M \times s$ in order to obtain the closest facts to the claim.
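As a rough illustration of matching a claim and a piece of evidence through a similarity matrix, the sketch below assumes the score takes the bilinear form $s^{\top} M m_j$, with a claim vector of length $q$, an evidence vector of length $d$, and a $q \times d$ matrix. The function name `bilinear_similarity` and all numbers are hypothetical; in the model, the matrix is learned during training rather than fixed by hand.

```python
def bilinear_similarity(s, M, m):
    """Return s^T M m for s of length q, M of shape q x d, and m of length d."""
    q, d = len(s), len(m)
    if len(M) != q or any(len(row) != d for row in M):
        raise ValueError("M must have shape len(s) x len(m)")
    # Map the evidence vector into the claim space, then take a dot product.
    projected = [sum(M[i][k] * m[k] for k in range(d)) for i in range(q)]
    return sum(s[i] * projected[i] for i in range(q))


s_lstm = [1.0, 0.0]                        # claim representation, q = 2
M = [[0.5, 0.0, 1.0],
     [0.0, 2.0, 0.0]]                      # similarity matrix, q x d = 2 x 3
evidence = [[1.0, 1.0, 1.0],               # m_1
            [0.0, 3.0, 0.0]]               # m_2

# One similarity score per piece of evidence.
scores = [bilinear_similarity(s_lstm, M, m_j) for m_j in evidence]
```

A plain dot product would require the claim and the evidence to live in the same space; the matrix removes that restriction and lets training decide which dimensions should interact.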
In fact, the relevant parts of the input document with respect to the input claim can be captured at different levels, e.g., using $M'$ at the n-gram level or using the claim-evidence similarities $P^j_{lstm}$ or $P^j_{cnn}$, $\forall j$, at the paragraph level. We note that (i) $P^j_{lstm}$ uses LSTM to take the word order and long-range dependencies into account, and (ii) $P^j_{cnn}$ exploits CNN to take n-grams and local dependencies into account, as explained in Sections 2.2 and 2.3. Additionally, we compute another semantic similarity vector, $P^j_{tfidf}$, by applying cosine similarity between the TF.IDF (Spärck Jones, 2004) representations of $x_j$ and $s$. This is particularly useful for stance detection as it can help detect the unrelated pieces of evidence.
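The TF.IDF-based cosine similarity between a claim and each paragraph can be sketched as follows. The smoothed IDF weighting, the whitespace tokenization, and the helper names `tfidf_vectors` and `cosine` are assumptions for illustration; the text does not specify the exact TF.IDF variant.

```python
import math
from collections import Counter


def tfidf_vectors(token_lists):
    """Build sparse TF-IDF vectors (dicts) for a small collection of token lists."""
    n = len(token_lists)
    df = Counter(tok for toks in token_lists for tok in set(toks))
    # Smoothed IDF; this particular weighting is an assumption.
    idf = {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}
    return [{tok: tf * idf[tok] for tok, tf in Counter(toks).items()}
            for toks in token_lists]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


claim = "bear scared off by ringtone".split()
paragraphs = ["the bear retreated after the ringtone".split(),
              "stock markets closed higher today".split()]

vecs = tfidf_vectors([claim] + paragraphs)
p_tfidf = [cosine(vecs[0], v) for v in vecs[1:]]  # one score per paragraph
```

A paragraph that shares no vocabulary with the claim gets a score of zero, which is why this signal is a cheap way to flag unrelated evidence.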

Memory and Generalization Components
The information flow and updates in the memory are as follows: first, the representation vector $\{m_j\}\ \forall j$ is passed to the memory and updated using the claim-evidence similarity vector $\{P^j_{tfidf}\}$:

$$m_j = m_j \odot P^j_{tfidf}, \quad \forall j$$

The goal is to filter out most unrelated evidence. The updated $m_j$ in conjunction with $s_{lstm}$ are used by the inference component F to compute $\{P^j_{lstm}\}$ as explained in Section 2.3.
Then, $\{P^j_{lstm}\}$ is used to update the new input set $\{c_j\}\ \forall j$ to the memory:

$$c_j = c_j \odot P^j_{lstm}, \quad \forall j$$

Finally, the updated $c_j$ in conjunction with $s_{cnn}$ are used to compute $P^j_{cnn}$ as explained in Section 2.3.
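The two-stage update can be traced with a small sketch. It assumes that each update scales a memory vector by its scalar similarity score (a multiplicative gate), and it uses a plain dot product as a placeholder for the inference component F; both choices, along with the helper names `scale` and `dot`, are illustrative assumptions rather than the model's actual computation.

```python
def scale(vectors, weights):
    """Scale each memory vector by its scalar similarity weight."""
    return [[w * x for x in vec] for vec, w in zip(vectors, weights)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Step 1: gate the LSTM memories m_j with the TF.IDF similarities.
m = [[1.0, 2.0], [3.0, 4.0]]
p_tfidf = [0.5, 0.0]                  # the second paragraph looks unrelated
m_updated = scale(m, p_tfidf)

# Step 2: compute P_lstm from the gated memories.
# Placeholder for the inference component F (assumption: plain dot product).
s_lstm = [1.0, 1.0]
p_lstm = [dot(s_lstm, m_j) for m_j in m_updated]

# Step 3: gate the CNN memories c_j with P_lstm.
c = [[2.0, 0.0], [1.0, 1.0]]
c_updated = scale(c, p_lstm)
```

In this toy run, the paragraph with zero TF.IDF similarity is zeroed out in the first step and therefore contributes nothing to the later stages, which mirrors the stated goal of filtering out unrelated evidence early.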

Output Representation Component
In memory networks, the memory output depends on the final goal, which, in our case, is to detect the relative perspective of a document to a claim. For this purpose, we apply the following equation:

$$o = \big[\mathrm{mean}(\{c_j\});\ [\max(\{P^j_{cnn}\}); \mathrm{mean}(\{P^j_{cnn}\})];\ [\max(\{P^j_{lstm}\}); \mathrm{mean}(\{P^j_{lstm}\})];\ [\max(\{P^j_{tfidf}\}); \mathrm{mean}(\{P^j_{tfidf}\})]\big]$$

where $\mathrm{mean}(\{c_j\})$ is the average vector of the $c_j$ representations.
In addition, the output includes the maximum and the average similarity between each piece of evidence and the claim using $P^j_{tfidf}$, $P^j_{lstm}$ and $P^j_{cnn}$, which are computed for each piece of evidence and claim in the inference component F. The maximum similarity identifies the part $x_j$ of the document that is most similar to the claim, while the average similarity measures the overall similarity between the document and the claim.
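A minimal sketch of assembling such an output vector is shown below. It assumes the output is the concatenation of the mean CNN memory vector with the maximum and the average of each similarity vector; the ordering of the parts and the names `mean_vector` and `build_output` are assumptions for illustration.

```python
def mean_vector(vectors):
    """Average a list of equal-length vectors dimension by dimension."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def build_output(c, p_cnn, p_lstm, p_tfidf):
    """Concatenate mean({c_j}) with [max; mean] of each similarity vector."""
    output = mean_vector(c)
    for scores in (p_cnn, p_lstm, p_tfidf):
        output += [max(scores), sum(scores) / len(scores)]
    return output


c = [[1.0, 3.0], [3.0, 5.0]]   # CNN memories for two paragraphs
p_cnn = [0.2, 0.8]
p_lstm = [0.1, 0.3]
p_tfidf = [0.0, 1.0]

o = build_output(c, p_cnn, p_lstm, p_tfidf)
```

The maximum captures the single best-matching paragraph, while the mean summarizes how well the document as a whole matches the claim.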

Response and Output Generation
This component computes the final stance of a document with respect to a claim. For this purpose, the concatenation of the vectors $o$, $s_{lstm}$ and $s_{cnn}$ is fed into a Multi-Layer Perceptron (MLP), where a softmax predicts the stance of the document with respect to the claim, as shown below (see also lines 22-23 in Table 1):

$$[o;\ s_{lstm};\ s_{cnn}] \xrightarrow{\mathrm{MLP}} \delta$$

where $\delta$ is a softmax function. In addition to the resulting stance, we extract snippets from the input document that best indicate the perspective of the document with respect to the claim. For this purpose, we use $P^j_{lstm}$, $P^j_{cnn}$ and $M'$ as explained in Section 2.3 (see also lines 24-26 of Table 1).
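The final classification step can be sketched with a single linear layer followed by a softmax. The single layer is a simplifying assumption standing in for the MLP, and the weights, the feature vector, and the name `predict_stance` are made up for illustration.

```python
import math

LABELS = ["agree", "disagree", "discuss", "unrelated"]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_stance(features, weights, bias):
    """One linear layer plus softmax (a stand-in for the MLP)."""
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    return LABELS[probs.index(max(probs))], probs


# Hypothetical concatenated input [o; s_lstm; s_cnn], shortened to two values.
features = [1.0, 2.0]
weights = [[0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [-1.0, 0.0]]  # one row per label
bias = [0.0, 0.0, 0.0, 0.0]

label, probs = predict_stance(features, weights, bias)
```

Subtracting the largest logit before exponentiating leaves the probabilities unchanged but avoids overflow for large inputs.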
The overall model is shown in Figure 2 and a summary of the model is presented in Table 1. All model parameters, including those of (i) CNN and LSTM in I, (ii) the similarity matrices $M$ and $M'$ in F, and (iii) the MLP in R, are jointly learned during the training process.

3 Experiments and Evaluation

Data
We use the dataset provided by the Fake News Challenge,¹ where each example consists of a claim-document pair with one of the following possible relationships: agree (the document agrees with the claim), disagree (the document disagrees with the claim), discuss (the document discusses the same topic as the claim, but does not take a stance with respect to the claim), unrelated (the document discusses a different topic). The data includes a total of 75.4K claim-document pairs, which link 2.5K unique articles with 2.5K unique claims, i.e., each claim is associated with 29.8 articles on average.

Settings
We use 100-dimensional word embeddings from GloVe (Pennington et al., 2014), which were pretrained on two billion tweets. We use Adam as an optimizer and categorical cross entropy as a loss function. We further use 100-dimensional units for the LSTM embeddings, and 100 feature maps with a filter width of 5 for the CNN. We consider the first p=9 paragraphs for each document, where p is the median of the number of paragraphs.
We optimize the hyper-parameters of the models using the same validation dataset (20% of the training data). Finally, as the data is largely imbalanced towards the unrelated class, during training we randomly select an equal number of instances from each class for each epoch.
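The per-epoch class balancing can be sketched as follows. This is a minimal illustration, assuming that examples are `(example_id, label)` pairs and that the number drawn per class equals the size of the smallest class; the function name `balanced_epoch` and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def balanced_epoch(examples, rng):
    """Draw the same number of examples from every class for one epoch."""
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)
    # z is the size of the smallest class.
    z = min(len(items) for items in by_label.values())
    batch = []
    for items in by_label.values():
        batch.extend(rng.sample(items, z))  # sampling without replacement
    rng.shuffle(batch)
    return batch


examples = ([(i, "unrelated") for i in range(10)]
            + [(100 + i, "discuss") for i in range(4)]
            + [(200 + i, "agree") for i in range(3)]
            + [(300 + i, "disagree") for i in range(2)])

batch = balanced_epoch(examples, random.Random(0))
```

Because a fresh sample is drawn at every epoch, the larger classes are gradually covered over the course of training even though each epoch sees only a small part of them.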

Evaluation Measures
We use the following evaluation measures:
Accuracy: Number of correctly classified examples divided by the total number of examples. It is equivalent to micro-averaged $F_1$.
Macro-F1: We calculate $F_1$ for each class, and then we average across all classes.
Weighted Accuracy: This is a weighted, two-level scoring scheme, which is applied to each test example. First, if the example is from the unrelated class and the model correctly predicts it, the score is incremented by 0.25; otherwise, if the example is related and the model predicts agree, disagree, or discuss, the score is incremented by 0.25. Second, there is a further increment by 0.75 for each related example if the model predicts the correct label: agree, disagree, or discuss. Finally, the score is normalized by dividing it by the total number of test examples. The rationale behind this metric is that the binary related/unrelated classification task is expected to be much easier, while also being arguably less relevant to fake news detection, than the actual stance detection task, which aims to further classify the relevant instances as agree, disagree, or discuss. Therefore, the weighted accuracy metric gives less weight (0.25) to the former distinction and more weight (0.75) to the latter one.

¹ Available at www.fakenewschallenge.org
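The three evaluation measures can be sketched directly from their descriptions. This is an illustrative implementation, not the official scorer: the weighted score follows the normalization described in the text (division by the number of test examples), and the toy labels are made up.

```python
RELATED = {"agree", "disagree", "discuss"}


def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred, labels):
    """Average the per-class F1 over the given labels."""
    f1_scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def weighted_accuracy(gold, pred):
    """Two-level score: 0.25 for related/unrelated, 0.75 more for the exact stance."""
    score = 0.0
    for g, p in zip(gold, pred):
        if g == "unrelated" and p == "unrelated":
            score += 0.25
        elif g in RELATED and p in RELATED:
            score += 0.25
            if g == p:
                score += 0.75
    return score / len(gold)


gold = ["agree", "agree", "discuss", "unrelated"]
pred = ["agree", "discuss", "discuss", "unrelated"]

acc = accuracy(gold, pred)
# Only the three labels that occur in this toy example are averaged.
f1 = macro_f1(gold, pred, ["agree", "discuss", "unrelated"])
wacc = weighted_accuracy(gold, pred)
```

In the toy example, the second prediction gets the related/unrelated distinction right but the stance wrong, so it earns 0.25 under the weighted scheme while counting as a plain error for accuracy.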

Baselines
Given the imbalanced nature of our data, we use two baselines, in which we label all testing examples with the same label: (a) unrelated and (b) discuss. The former is the majority class baseline, which is a reasonable baseline for Accuracy and macro-$F_1$, while the latter is a potentially better baseline for Weighted Accuracy.
We further use CNN and LSTM models, as well as combinations thereof, as baselines since they form components of our model, and also because they yield state-of-the-art results for text, image, and video classification (Tan et al., 2016; Donahue et al., 2015; Zuo et al., 2015; Sainath et al., 2015).
Finally, we include the official baseline from the challenge, which is a Gradient Boosting classifier with word and n-gram overlap features, as well as indicators for refutation and polarity.

Our Models
sMemNN: This is our model presented in Figure 2. Note that unlike the CNN+LSTM and the LSTM+CNN baselines above, which feed the output of one network into the other one, the sMemNN model feeds the individual outputs of both the CNN and the LSTM networks into the memory network, and lets it decide how much to rely on each of them. This consideration also facilitates reasoning and explaining model predictions, as we will discuss in more detail below.
sMemNN (dotProduct): This is a version of sMemNN, where the similarity matrices are replaced by the dot product between the representation of the claims and of the evidence. For this purpose, we first project the claim representation to a dense layer that has the same size as the representation of each piece of evidence, and then we compute the dot product between the resulting representation and the representation of the evidence.
sMemNN (with TF): Since our LSTM and CNN networks only use a limited number of starting paragraphs for an input document, we enrich our model with the BOW representation of documents and claims as well as their TF.IDF-based cosine similarity. These vectors are concatenated with the memory outputs (Section 2.5) and passed to the R component (Section 2.6) of sMemNN. We expect these BOW vectors to provide useful additional information.

Results
Table 2 reports the performance of all models on the test dataset. The All-unrelated and the All-discuss baselines perform poorly across the evaluation measures, except that All-unrelated achieves high accuracy, because unrelated is by far the dominant class in the dataset.
Next, we can see that LSTM consistently outperforms CNN across all evaluation measures. Although the larger number of parameters of the LSTM can play a role, we believe that its superiority comes from it being able to remember previously-observed relevant pieces of text.
Next, we see systematic improvements for the combinations of CNN and LSTM: CNN+LSTM is better than CNN alone, and LSTM+CNN is better than LSTM alone. Better performance is achieved by LSTM+CNN, that is, when claims and evidence are first processed by an LSTM network, and then fed into a CNN.
The Gradient Boosting model achieves sizable improvement over the above baseline neural models. However, we should note that these neural models do not use the rich hand-crafted features that were used in the Gradient Boosting model.
Row 9 shows the results for our memory network model (sMemNN), which consistently outperforms all other baseline models across all evaluation metrics, achieving 10.62 and 3.77 points of absolute improvement in terms of Macro-F1 and Weighted Accuracy, respectively, over the best baseline (Gradient Boosting). We believe that this is due to the memory network's capturing good text snippets. As we will see below, these snippets are also useful for explaining the model's predictions. Comparing row 9 to row 8, we can see the importance of our proposed similarity matrix: replacing that matrix by a simple dot product hurts the performance of the model considerably across all evaluation measures, thus lowering it to the level of the Gradient Boosting model.
Finally, row 10 shows the results for our memory network model enriched by a BOW representation. As we expected, it outperforms sMemNN, probably due to being able to capture useful information from paragraphs beyond the starting few.
To put the results of sMemNN in perspective, we should mention that the best system at the Fake News Challenge achieved a macro-F1 of 57.79, which is not significantly different from the performance of our full model at the 0.05 significance level (p-value=0.53). Yet, that system is an ensemble combining the feature-rich Gradient Boosting system with neural networks.
Further analysis of the output of the different systems (e.g., the confusion matrices) reveals the following general trends: (i) the unrelated examples are easy to detect, and most models show high performance for this class, (ii) the agree and the disagree examples are often mislabeled as discuss by the baselines, and (iii) the disagree examples are the most difficult ones for all models, probably because they represent by far the smallest class.
Claim 1: man saved from bear attack - thanks to his justin bieber ringtone
Evidence Id | $P^j_{cnn}$ | Evidence Snippet
2069-3 | 0.89 | ... fishing in the yakutia republic , russia , igor vorozhbitsyn is lucky to be alive after his justin bieber ringtone , baby , scared off a bear that was attacking him [0.41] ...
2069-7 | 1.0 | ... but as the bear clawed vorozhbitsyn ' s face and back his mobile phone rang , the ringtone selected was justin bieber ' s hit song baby . rightly startled [1.00] , the bear retreated back into [0.39] the forest ...
true label: agree; predicted label: agree

Claim 2: 50ft crustacean , dubbed crabzilla , photographed lurking beneath the waters in whitstable
Evidence Id | $P^j_{cnn}$ | Evidence Snippet
24835-1 | 0.0046 | ... a marine biologist has killed off claims [-0.0008] that a giant crab is [0.0033] living on the kent coast - insisting the image is probably a well-doctored hoax [0.0012] ...
24835-7 | -0.0008 | ... i don ' t know what the currents are like around that harbour or what sort of they might produce in the sand , but i think it ' s more conceivable that someone is playing [0.0007] about with the photo ...
true label: disagree; predicted label: disagree

Table 3: Examples of highly ranked snippets of evidence for an input claim, which were automatically extracted by our inference component for claim-document pairs. The $P^j_{cnn}$ column and the bracketed values next to the snippets show the similarity between the claim and a piece of evidence, and between the claim and an evidence snippet, respectively.

Training Data Coverage
As discussed previously, we balance the data at each training iteration by randomly selecting $z$ instances from each of the four target classes, where $z$ is the size of the class with the minimum number of training instances. In this experiment, we investigate what proportion of the training data actually got used when following our sampling procedure. For this purpose, at each training iteration, we report the proportion of the training instances from each class that were used so far, either at the current or at any of the previous iterations.
As Figure 4 shows, our random data sampling procedure eventually used almost all training examples. Since the disagree class was the smallest, its examples remained fully covered throughout the process. Moreover, almost all other related examples, i.e., agree and discuss, were observed during training, as well as a large fraction of the dominating unrelated examples. Note that the model achieved its best (lowest) loss on the validation dataset at iteration 31, when almost all related instances had already been observed. This happened while the corresponding fraction for the unrelated pairs was around 50%, i.e., a considerable number of the unrelated instances were not really needed.
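The coverage statistic described above (the fraction of each class seen at the current or any earlier iteration) can be computed with a running set per class. The function name `coverage_by_class`, the `(example_id, label)` representation, and the toy batches are assumptions for illustration.

```python
def coverage_by_class(epoch_batches, class_sizes):
    """Return, per iteration, the fraction of each class observed so far."""
    seen = {label: set() for label in class_sizes}
    history = []
    for batch in epoch_batches:
        for example_id, label in batch:
            seen[label].add(example_id)
        history.append({label: len(seen[label]) / class_sizes[label]
                        for label in class_sizes})
    return history


class_sizes = {"disagree": 2, "unrelated": 4}
epoch_batches = [
    [(0, "disagree"), (1, "disagree"), (10, "unrelated"), (11, "unrelated")],
    [(0, "disagree"), (1, "disagree"), (11, "unrelated"), (12, "unrelated")],
]

history = coverage_by_class(epoch_batches, class_sizes)
```

In this toy run the smallest class is fully covered from the first iteration, while the coverage of the larger class grows only as new examples happen to be drawn.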

Explainability
A major advantage of our model, compared to the baselines and to most related work, is that it can explain its predictions: as we explained in Section 2.3, our inference component predicts the similarity between each piece of evidence $x_j$ and the claim $s$ at the n-gram level using the claim-evidence similarity vector $P^j_{cnn}$. Table 3 shows examples of two claims and the snippets extracted as evidence. Column $P^j_{cnn}$ shows the overall similarity between the evidence and the corresponding claim as computed by the inference component of our model. The highlighted texts are snippets with the highest similarity (the value is shown next to each snippet) to the claim as extracted by the inference component.
Note that the snippets are of fixed length, namely 5-grams, but in case of consecutive n-grams with similar scores, we combine them into a single snippet and we report the average value, e.g., see the snippet for evidence 2069-3. The lower half of Table 3 shows an example where the similarity values associated with the snippets are either too small or negative, e.g., see the value for biologist has killed off claims. In all cases, the model could accurately predict the stance of these pieces of evidence with respect to the corresponding claims.
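The merging of consecutive n-grams into a single snippet can be sketched as follows. The text does not say what counts as "similar" scores, so the tolerance `tol`, the rule that n-grams must touch or overlap, and the name `merge_ngrams` are all assumptions; the scores in the example are made up.

```python
def merge_ngrams(tokens, scored_starts, n=5, tol=0.05):
    """Merge touching or overlapping n-grams with close scores.

    scored_starts: list of (start_index, score) pairs for the selected n-grams.
    Returns a list of (snippet_text, average_score) pairs.
    """
    spans = []  # each entry: [start, end (exclusive), list of scores]
    for start, score in sorted(scored_starts):
        end = min(start + n, len(tokens))
        if spans:
            last = spans[-1]
            average = sum(last[2]) / len(last[2])
            # Merge when the n-grams touch or overlap and the scores are close.
            if start <= last[1] and abs(score - average) <= tol:
                last[1] = max(last[1], end)
                last[2].append(score)
                continue
        spans.append([start, end, [score]])
    return [(" ".join(tokens[a:b]), sum(s) / len(s)) for a, b, s in spans]


tokens = "his mobile phone rang , the ringtone selected was justin bieber".split()
scored_starts = [(0, 0.40), (1, 0.42), (6, 0.90)]

snippets = merge_ngrams(tokens, scored_starts)
```

Here the first two 5-grams overlap and have close scores, so they form one six-token snippet with the averaged score, while the third 5-gram stays separate because its score differs too much.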
Next, we conducted an experiment to quantify the performance of our memory network at explaining its predictions: we randomly sampled 100 agree/disagree claim-document examples from our gold data, and we manually evaluated the top five pieces of evidence that our model provided. In 76 cases, the model correctly classified the agree/disagree examples, and provided arguably adequate snippets.
Figure 5(a) shows the performance of our model at explaining its predictions when each supporting/opposing piece of evidence is an n-gram snippet of fixed length (n = 5) for the agree and the disagree classes, and their combinations at the top-k ranks, k = {1, . . ., 5}. It achieved precision of 0.28, 0.32, 0.35, 0.25, and 0.33 at ranks 1-5, respectively. Moreover, we found that it could accurately identify, as part of the identified n-grams, key phrases such as officials declared the video, according to previous reports, believed will come, president in his tweets as supporting pieces of evidence, and proved a hoax, shot down a cnn report, would be skeptical as opposing pieces of evidence.
Note that the above low precision is mainly due to the unsupervised nature of this task as no gold snippets supporting the document's stance are available for training in the FNC dataset.³ Furthermore, our evaluation setup was at the n-gram level in Figure 5(a). However, if we conduct a more coarse-grained evaluation where we combine consecutive n-grams with similar scores into a single snippet, the precision for these new snippets improves to 0.4, 0.38, 0.42, 0.38, and 0.42 at ranks 1-5, as Figure 5(b) shows. If we further extend the evaluation to the sentence level, the precision jumps to 0.6, 0.58, 0.55, 0.62, and 0.57 at ranks 1-5, as we can see in Figure 5(c).

³ Some other recent datasets, to be presented at this same HLT-NAACL'2018 conference, do have such gold evidence annotations (Baly et al., 2018; Thorne et al., 2018).
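A per-rank precision of the kind reported above can be computed from manual judgments as in the sketch below. It assumes that precision at rank k is the fraction of examples whose k-th extracted snippet was judged adequate; the text does not spell out the exact definition, and the judgments here are invented.

```python
def precision_at_rank(judgments, k):
    """Fraction of examples whose snippet at rank k (1-based) was judged adequate."""
    judged = [row[k - 1] for row in judgments if len(row) >= k]
    return sum(judged) / len(judged) if judged else 0.0


# judgments[i][r]: was the snippet at rank r + 1 adequate for example i?
judgments = [
    [True, False, True],
    [False, False, True],
    [True, True, False],
    [False, True, True],
]

precisions = [precision_at_rank(judgments, k) for k in (1, 2, 3)]
```

Under this definition the values at different ranks are independent of each other, which is consistent with the reported numbers not decreasing monotonically with the rank.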

Related Work
While stance detection is an interesting task in its own right, e.g., for media monitoring, it is also an important component for fact checking and veracity inference. Automatic fact checking was envisioned by Vlachos and Riedel (2014) as a multistep process that (i) identifies check-worthy statements (Hassan et al., 2015; Gencheva et al., 2017; Jaradat et al., 2018), (ii) generates questions to be asked about these statements (Karadzhov et al., 2017), (iii) retrieves relevant information to create a knowledge base (Shiralkar et al., 2017), and (iv) infers the veracity of these statements, e.g., using text analysis (Banerjee and Han, 2009; Castillo et al., 2011; Rashkin et al., 2017) or information from external sources (Karadzhov et al., 2017; Popat et al., 2017).
There have been some nuances in the way researchers have defined the stance detection task.
SemEval-2016 Task 6 (Mohammad et al., 2016) targets stances with respect to some target proposition, e.g., entities, concepts or events, as in-favor, against, or neither. The winning model in the task was based on transfer learning: a Recurrent Neural Network trained on a large Twitter corpus was used to predict task-relevant hashtags and to initialize a second recurrent neural network trained on the provided dataset for stance prediction (Zarrella and Marsh, 2016). Subsequently, Zubiaga et al. (2016) detected the stance of tweets toward rumors and hot topics using linear-chain conditional random fields (CRFs) and tree CRFs that analyze tweets based on their position in tree-like conversational threads.
Most commonly, stance detection is defined with respect to a claim, e.g., as in the 2017 Fake News Challenge. The best system was an ensemble of gradient-boosted decision trees with rich features and CNNs (Baird et al., 2017). The second system was a multi-layer neural network with similarity features, word n-grams, and latent semantic analysis (Hanselowski et al., 2017). The third one was a neural network with similarity features (Riedel et al., 2017).
Unlike the above work, we use a feature-light memory network that jointly infers the stance and highlights relevant snippets of evidence.

Conclusion
We studied the problem of stance detection, which aims to predict whether a document supports, challenges, or just discusses a given claim. The nature of the task clearly shows that, in order to go beyond simple matching between claim (short text) and evidence (longer text, e.g., an entire document), a machine learning model needs to focus on the relevant paragraphs of the evidence. Moreover, in order to understand whether a paragraph supports a claim, there is a need to refer to information available in other paragraphs. CNNs and LSTMs are not well-suited for this task as they cannot model complex dependencies such as semantic relationships with respect to entire previous paragraphs. In contrast, memory networks are designed precisely to remember previous information. However, given the large size of documents and paragraphs, basic memory networks do not handle irrelevant and noisy information well, which we confirmed in our experimental results. Thus, we proposed a novel extension of the basic memory networks, which is based on a similarity matrix and a stance filtering component, which we apply at inference time, and we have shown that this extension offers sizable performance gains, making memory networks competitive. Moreover, our model can extract meaningful snippets from documents that can explain the factuality of a given claim.
In future work, we plan to extend the inference component to select an optimal set of explanations for each prediction, and to explain the model as a whole, not only at the instance level.

Figure 1: Examples of snippets of text and their stances with respect to a given claim.

Figure 2: The architecture of our Memory Network model for stance detection.
Figure 3: Matching a claim $s$ and a piece of evidence $x_j$ using a similarity matrix $M$. Here, $s_{lstm}$ and $s_{cnn}$ are LSTM and CNN representations of $s$, whereas $m_j$ and $c_j$ are LSTM and CNN representations of $x_j$.

Figure 4: Effect of data coverage. The y-axis shows the fraction of data observed during training (coverage), while the x-axis shows the loss during training.

Figure 5: Prediction explainability. Sub-figures (a)-(c) show the precision of our model at explaining its predictions when the pieces of evidence are (a) fixed-length n-grams (n = 5), (b) combinations of several consecutive n-grams with similar scores, or (c) the entire sentence, if it includes at least one extracted n-gram snippet.
Claim: Robert Plant Ripped up $800M Led Zeppelin Reunion Contract.

Table 1: Summary of our Memory Network algorithm for stance detection.

Table 2: Evaluation results on the test data.