Chains-of-Reasoning at TextGraphs 2019 Shared Task: Reasoning over Chains of Facts for Explainable Multi-hop Inference

This paper describes our submission to the shared task on “Multi-hop Inference Explanation Regeneration” at the TextGraphs workshop at EMNLP 2019 (Jansen and Ustalov, 2019). Our system identifies chains of facts relevant to explaining an answer to an elementary science examination question. To counter the problem of ‘spurious chains’ leading to ‘semantic drift’, we train a ranker that uses contextualized representations of facts to score their relevance for explaining an answer to a question. Our system was ranked first w.r.t. the mean average precision (MAP) metric, outperforming the second-best system by 14.95 points.


Introduction
Machine reading comprehension (MRC), the ability of computing systems to read and understand text, has been a long-standing goal of natural language understanding. Question answering (QA) provides a natural way of testing a model's capability to comprehend text by probing it with queries about information present in the text. Research in MRC has seen rapid progress in recent years, with systems matching or outperforming human performance on many QA datasets (Chen et al., 2016; Devlin et al., 2018; Yang et al., 2019; Liu et al., 2019). However, a series of recent works has demonstrated the brittleness of these systems, showing that they perform shallow pattern matching (Jia and Liang, 2017; Kaushik and Lipton, 2018; Chen and Durrett, 2019).
QA models that can provide explanations for their answers can give us better insights into this brittleness. More importantly, it has also been shown that providing explanations for an answer increases how much a user trusts the model (Herlocker et al., 2000; Dzindolet et al., 2003; Ribeiro et al., 2016). Therefore, as more of these models are deployed into real-world applications, providing explanations should be a necessary component of every QA system.
¹ https://github.com/umanlp/tg2019task
² https://github.com/ameyagodbole/multihop_inference_explanation_regeneration
Figure 1: [Explanation subgraph for the question "A student wants to compare the masses and volumes of three marbles. Which two instruments should be used?" (answer: balance and graduated cylinder). Facts shown: "A balance is used for measuring mass; weight of an object; of a substance"; "A graduated cylinder is used to measure volume of a liquid; of an object"; "A graduated cylinder is a kind of instrument for measuring volume of liquids or objects"; "Comparing requires measuring"; "Marble is a kind of object; material".]
In this paper, we describe the system that we submitted to the shared task on "Multi-hop Inference Explanation Regeneration". Our model scores chains of facts (sentences) relevant to explain an answer to an elementary science exam question. Given the question and its answer, we use a standard information retrieval (IR) system to retrieve a set of starting facts from the given corpus. We adopt the "explanation graphs" representation proposed by Jansen et al. (2018), in which sentences can be viewed as nodes in a graph and two nodes are connected if they have a lexical overlap between them. Therefore, finding the explanation for a question can be viewed as finding the 'relevant subgraph' within the bigger graph of facts and the explanation corresponds to the nodes (facts) comprising the subgraph. For example, figure 1 shows a subgraph of facts explaining the answer to the given question.
We decompose the problem of finding the subgraph into finding the chains of relevant nodes that are important to answering the question and then combining the top-ranked chains to reconstruct the explanation subgraph. Starting with the initial set of nodes (facts) returned by the IR system, we look at all the facts that can be reached from them. However, as can be seen from Figure 2, most of these chains of facts are not relevant for answering the question and are hence 'spurious'. For example, combining the two facts (which are connected via lexical overlap) "a graduated cylinder is used to measure volume of a liquid" and "centimeter is a unit of measurement" does not provide a valid explanation for the question. Models that attempt to reason over such spurious chains of facts often suffer from 'semantic drift' (Fried et al., 2015). To counter such spurious chains, we use the annotations provided in the WorldTree corpus (Jansen et al., 2018) to train a re-ranker that scores whether a pair of facts is relevant to answering a question. Our re-ranker encodes a pair of facts together with the question and obtains a question-aware contextualized representation from a pre-trained BERT (Devlin et al., 2018) language model. We also find that an even simpler model that scores each fact independently (instead of a chain) is very competitive, and that a combination of both models performs best. Overall, our simple technique achieves a MAP score of 56.25%, outperforming the second-best entry by 14.95 points.

Figure 2: Overview of our approach. (a) Get top-K sentences via TF-IDF. (b) Get all outgoing edges from the retrieved sentences in the graph; each sentence is connected to all the facts in the corresponding boxes (via lexical overlap). Given a question and its answer, we find a set of initial relevant facts via a simple TF-IDF retriever. Next, from each of these relevant facts, we look at other outgoing facts (shown in the colored boxes). Note that there are a lot of outgoing facts and most of them are spurious w.r.t. the given question. Using the annotations provided, we train a supervised ranker to identify the relevant chains needed to explain an answer.

Task
Given a question and the correct answer, the task is to find a set of sentences that explain the answer to the question. The task can easily be transformed into a ranking setup where the goal is to rank the relevant facts above all other facts present in the corpus. The evaluation metric used for the task is the widely used mean average precision (MAP) metric.
Data: The data in the shared task comes from the WorldTree corpus, which contains elementary science questions from the ARC corpus (Clark et al., 2018). A key feature of the WorldTree corpus is that it contains detailed annotations stating whether a fact is a part of the explanation for each question. Following our earlier graphical representation, these sentences form an 'explanation subgraph'. Although we do not use them, the corpus also contains annotations about whether a fact is 'central', provides 'grounding', or serves as 'lexical glue' for explaining the answer to the question.
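As a concrete reference for the MAP metric mentioned above, the following minimal sketch computes average precision for one question and averages it over questions. The function names and the toy fact identifiers are illustrative assumptions; this is not the official shared-task evaluation script.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], gold: Set[str]) -> float:
    """Average precision of one ranked list of fact ids against the gold set."""
    if not gold:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, fact_id in enumerate(ranked, start=1):
        if fact_id in gold:
            hits += 1
            # Precision at this rank, counted only at positions of gold facts.
            precision_sum += hits / rank
    return precision_sum / len(gold)


def mean_average_precision(
    rankings: Dict[str, List[str]], gold: Dict[str, Set[str]]
) -> float:
    """Mean of the per-question average precision values."""
    scores = [average_precision(rankings[q], gold[q]) for q in rankings]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    rankings = {"q1": ["a", "x", "b"], "q2": ["y", "c"]}
    gold = {"q1": {"a", "b"}, "q2": {"c"}}
    print(mean_average_precision(rankings, gold))
```

Because a gold fact that never appears in the ranking contributes zero, a system benefits from ranking every fact in the corpus rather than only a shortlist.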

Model
Our model consists of two major components: (a) a simple IR system that retrieves the initial set of evidence facts from the corpus and serves as a starting point for further exploration, and (b) a BERT-based ranker that scores a pair of facts w.r.t. a question (and the correct answer). For our current system, the IR component is a simple tf-idf based retriever that takes in the concatenation of the question and the correct answer string and returns the top-k ranked facts in the corpus. Although we find that this simple retriever is effective, our design is agnostic to the choice of the retriever, which can be replaced with any more sophisticated IR system. The retrieved facts will often miss key facts that are required for explaining an answer. This could be because there is no lexical overlap between the fact and the question, or because the relevant facts are simply ranked low. For example, as shown in Figure 2, two important explanatory facts were missed by the initial IR system ("Comparing requires measuring" ('central') and "Marble is a kind of object; material" ('grounding')). However, we note that if we also consider the outgoing facts from all the retrieved sentences, then the recall of the system increases significantly. For example, on considering all the 1-hop neighbors of the top 25 initial facts retrieved by tf-idf, 94.48% of the ground truth facts for a question were covered on average.³ This motivated us to consider reasoning over a chain of facts together. Reasoning over chains of facts has also been shown to be an effective technique for reasoning over knowledge bases (Lao et al., 2011; Neelakantan et al., 2015; Das et al., 2017, 2018). In our graphical representation, edges result from lexical overlap between sentences. In making edges between facts, we ignore stop words and the "filler" words that were added in the annotation process.
Despite this, the resulting graph is very dense: each node has many neighbors, and not all of these facts are important for answering the question. Such spurious chains of facts lead to wrong inferences and often drift out of context w.r.t. the question, resulting in semantic drift (Fried et al., 2015). Analogous problems due to spurious facts can be seen in learning semantic parsers from denotations (Guu et al., 2017) and in reinforcement learning (Sutton, 1984; Agarwal et al., 2019). The density of the graph also makes it intractable to use chains longer than 1 hop, and a better (sparser) graph representation will be necessary to make full use of the path ranker.
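To make the graph construction concrete, here is a minimal sketch of connecting facts by lexical overlap and enumerating chains of length up to 2 from a set of seed facts. The stop-word list, the regular-expression tokenizer, and the function names are assumptions for illustration; the sketch applies no stemming or lemmatization and does not model the annotation "filler" words.

```python
import re
from typing import Dict, Iterable, List, Set, Tuple

# Tiny illustrative stop-word list (an assumption, not the list used in the paper).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "for", "kind", "used"}


def content_words(fact: str) -> Set[str]:
    """Lower-cased word tokens of a fact with stop words removed."""
    return {w for w in re.findall(r"[a-z]+", fact.lower()) if w not in STOP_WORDS}


def build_overlap_graph(facts: List[str]) -> Dict[int, Set[int]]:
    """Connect two facts whenever they share at least one content word."""
    words = [content_words(f) for f in facts]
    graph: Dict[int, Set[int]] = {i: set() for i in range(len(facts))}
    for i in range(len(facts)):
        for j in range(i + 1, len(facts)):
            if words[i] & words[j]:
                graph[i].add(j)
                graph[j].add(i)
    return graph


def one_hop_chains(graph: Dict[int, Set[int]], seeds: Iterable[int]) -> List[Tuple[int, ...]]:
    """Chains of length 1 (a seed alone) and length 2 (seed, neighbor)."""
    chains: List[Tuple[int, ...]] = []
    for seed in seeds:
        chains.append((seed,))
        for neighbor in sorted(graph[seed]):
            chains.append((seed, neighbor))
    return chains


if __name__ == "__main__":
    facts = [
        "a graduated cylinder is used to measure volume of a liquid",
        "a rain gauge is a kind of graduated cylinder",
        "a balance is used for measuring mass",
        "gram is a kind of unit for measuring mass",
    ]
    graph = build_overlap_graph(facts)
    print(one_hop_chains(graph, seeds=[0, 2]))
```

The pairwise comparison is quadratic in the number of facts, which is manageable for a corpus of a few thousand facts but illustrates why the number of candidate chains grows quickly on a dense graph.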
Figure 3: Architecture of the fact encoder. The question, answer and the chain of facts are concatenated together and fed to a BERT encoder to form a query-aware contextualized representation. The representation corresponding to the [CLS] token is then fed to another feed-forward network. [Diagram: input "[CLS] a student wants to compare the masses and volumes of three marbles. which two instruments should be used? balance and graduated cylinder [SEP] a graduated cylinder is used to measure volume of a liquid; of an object. [SEP] centimeter is a unit of measurement" → BERT Encoder → [CLS] representation → Linear (768×768) → Linear (768×2) → Softmax.]

To counter the problem of spurious chains, we use the annotations present in the WorldTree corpus. Specifically, a chain of facts is a positive training example if all the facts present in the chain are relevant to the question. For example, the chain ("A balance is used for measuring mass, weight of an object", "Marble is a kind of object") is a positive training example, since both individual facts are relevant for the question in Figure 1. However, the chain ("A balance is used for measuring mass, weight of an object", "A graduated cylinder is an object") is considered a spurious chain for the same question.
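The labeling rule above can be sketched in a few lines. The function name and the short fact identifiers are hypothetical; the rule itself (a chain is positive only if every fact in it is annotated as part of the explanation) follows the description in the text.

```python
from typing import Iterable, List, Set, Tuple


def label_chains(
    chains: Iterable[Tuple[str, ...]], gold_facts: Set[str]
) -> List[Tuple[Tuple[str, ...], int]]:
    """Label a chain 1 if every fact in it is a gold explanation fact, else 0."""
    return [(chain, int(all(f in gold_facts for f in chain))) for chain in chains]


if __name__ == "__main__":
    gold = {"balance_measures_mass", "marble_is_object"}
    chains = [
        ("balance_measures_mass", "marble_is_object"),    # both gold: positive
        ("balance_measures_mass", "cylinder_is_object"),  # one non-gold fact: spurious
    ]
    for chain, label in label_chains(chains, gold):
        print(label, chain)
```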
To encode the chain of facts, we use contextualized representations from a pre-trained BERT language model (Devlin et al., 2018). As shown in Figure 3, we form the query-dependent representation of a chain of facts by concatenating the question, the correct answer and the facts in the chain. After encoding with a BERT model, we use the representation corresponding to the [CLS] token as the intermediate representation, which is then fed to a 2-layer feed-forward network to output a score. We use a simple binary cross-entropy loss to train the network.
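As an illustration of the input layout in Figure 3, the sketch below assembles the concatenated sequence as a plain string and converts two output logits into a relevance probability. Both helper names are hypothetical. Lower-casing mirrors the uncased model and the figure, and treating index 1 as the "relevant" class is an assumption. In practice a BERT tokenizer would insert the special tokens itself; the string form is only meant to show the ordering.

```python
import math
from typing import Sequence


def build_ranker_input(question: str, answer: str, chain: Sequence[str]) -> str:
    """Return '[CLS] question answer [SEP] fact_1 [SEP] fact_2 ...' as a string."""
    query = f"{question.strip()} {answer.strip()}".lower()
    facts = " [SEP] ".join(fact.strip().lower() for fact in chain)
    return f"[CLS] {query} [SEP] {facts}"


def relevance_probability(logits: Sequence[float]) -> float:
    """Softmax over two logits; index 1 is assumed to be the 'relevant' class."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract max for numerical stability
    return exps[1] / sum(exps)


if __name__ == "__main__":
    text = build_ranker_input(
        "Which two instruments should be used?",
        "Balance and graduated cylinder",
        [
            "A graduated cylinder is used to measure volume of a liquid",
            "Centimeter is a unit of measurement",
        ],
    )
    print(text)
    print(relevance_probability([0.0, 2.0]))
```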
Our model scores chains of facts, and therefore we need a way of propagating the scores to individual facts to create a ranked list. We follow the simple strategy of assigning the same score to each individual fact in the chain. The same fact can also occur in multiple chains. In that case, a fact is assigned the maximum score it gets from any chain.
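The propagation step described above amounts to a max-aggregation over chains. The following minimal sketch assumes chain scores are already available as floats; the function name and the fact identifiers are illustrative.

```python
from typing import Dict, Iterable, List, Tuple


def propagate_chain_scores(
    scored_chains: Iterable[Tuple[Tuple[str, ...], float]]
) -> List[Tuple[str, float]]:
    """Give each fact the maximum score of any chain containing it, then rank."""
    best: Dict[str, float] = {}
    for chain, score in scored_chains:
        for fact in chain:
            if fact not in best or score > best[fact]:
                best[fact] = score
    # Sort by descending score; ties keep first-seen order (sorted is stable).
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    scored = [(("f1", "f2"), 0.9), (("f1", "f3"), 0.2), (("f4",), 0.6)]
    print(propagate_chain_scores(scored))
```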
We also tried a variant of our model that does not consider chains of facts but scores each fact independently. Such models have been found to be effective for information retrieval (Nogueira and Cho, 2019). In this model, we concatenate the query and an individual fact and compute the relevance score for each fact in the corpus. Although simple, a drawback of this model is that it will not scale to a large corpus of facts. However, since the WorldTree corpus contains only around 5000 facts, we were able to evaluate it exhaustively. As we will show in the next section, this model is extremely competitive.

Experiments
For all experiments, the BERT encoder is initialized with the pre-trained BERT-BASE-UNCASED model available in the Python package PYTORCH-TRANSFORMERS.⁴ For training, the learning rate was initialized to 3e-5, with a batch size of 60 and a maximum sequence length of 90 (for batching). During evaluation, we increased the sequence length to 140. The entire model was fine-tuned for 1 epoch. We used the Adam optimizer (Kingma and Ba, 2014) with linear learning rate decay. For generating the paths during training, the top 25 facts obtained by the tf-idf based ranker were used. We used the implementation of TfidfVectorizer present in SCIKIT-LEARN (Pedregosa et al., 2011). For all our experiments, we consider chains up to length 2.
For evaluation, we report results using the top 25 and top 50 tf-idf ranked facts. Using 50 starting facts improves the coverage of our method,⁵ but this results in more than 10,000 chains per question, making it difficult to scale. We concatenate the correct answer choice with the question. We did this because we observed that the score of the tf-idf based ranker is best when given only the correct choice. This makes intuitive sense, since the ranker is distracted by the wrong choices and in most examples the remaining choices are not necessary to answer the question. Our approach does not guarantee a ranking of all possible facts present in the corpus. To obtain a ranking of all facts, we appended the facts that were missed in the order in which they appear in the tf-idf ranking.
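The back-filling step in the last two sentences can be sketched as follows. The function name is hypothetical, and the sketch assumes both rankings are lists of fact identifiers with the tf-idf list covering the whole corpus.

```python
from typing import List


def complete_ranking(model_ranking: List[str], tfidf_ranking: List[str]) -> List[str]:
    """Append facts missing from the model ranking, in tf-idf order."""
    seen = set(model_ranking)
    tail = [fact for fact in tfidf_ranking if fact not in seen]
    return list(model_ranking) + tail


if __name__ == "__main__":
    print(complete_ranking(["f3", "f1"], ["f1", "f2", "f3", "f4"]))
```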

Baselines
We report the tf-idf based ranking scores as baselines. We experimented with three variants based on which answer choices were present in the query. We report the score of the 'no choices' variant because in a realistic setting, the correct answer will not be present at test time.
We consider a variant of our model that ranks individual facts directly rather than paths (BERT Re-ranker in Table 1). This is the model proposed by Nogueira and Cho (2019), but instead of re-ranking a subset of facts, we use it to rank all the facts. The Re-ranker was initialized and trained in the same way as our model. We found that a model fine-tuned for 3 epochs worked best. This is an extremely competitive model and outperforms our Path-ranker model. As noted before, this is not a scalable approach and cannot be applied to a large corpus of facts. We also perform a type of ensemble ranking. For every question, the ranking of the Path ranker is used until the probability output of the model drops below 0.5. At this point, we append the ranking of the BERT Re-ranker. The intuition behind this scheme is that we consider the ranking of the Path ranker only when it is confident (prob ≥ 0.5) and then fall back to the BERT Re-ranker. Empirically, as seen in Table 1, this heuristic is very effective. In addition, we perform a post-processing step to move duplicate facts to the end of the ranking.⁶ This post-processing also gave a slight improvement.⁷
⁴ https://github.com/huggingface/pytorch-transformers
⁵ For 88% (75.66%) of questions, all the ground truth facts were present in the set of 1-hop neighbors from the top 50 (top 25) tf-idf facts.
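The ensemble heuristic described above can be sketched as below. The function name is hypothetical, and the path-ranker input is assumed to be sorted by descending probability. Skipping facts that are already placed is a simplifying assumption of this sketch and is not the duplicate post-processing step mentioned in the text.

```python
from typing import List, Tuple


def ensemble_ranking(
    path_ranked: List[Tuple[str, float]],
    reranker_ranked: List[str],
    threshold: float = 0.5,
) -> List[str]:
    """Keep path-ranker facts while prob >= threshold, then append the re-ranker order."""
    merged: List[str] = []
    seen = set()
    for fact, prob in path_ranked:  # assumed sorted by descending probability
        if prob < threshold:
            break  # the path ranker is no longer confident
        if fact not in seen:
            merged.append(fact)
            seen.add(fact)
    for fact in reranker_ranked:  # fall back to the single-fact re-ranker
        if fact not in seen:
            merged.append(fact)
            seen.add(fact)
    return merged


if __name__ == "__main__":
    path = [("a", 0.9), ("b", 0.7), ("c", 0.4)]
    rerank = ["b", "c", "d", "a"]
    print(ensemble_ranking(path, rerank))
```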

Performance on the hidden test set
Our submission during the test phase of the shared task corresponds to a constrained version of the final model. Also, the initial set of facts are […] The test set performance is reported in Table 2.

Error Analysis
We perform extensive error analysis (in Appendix I) on the 23 questions in the development set where our model performed poorly (MAP ≤ 0.25). We find that 2 of the 23 errors resulted from the pre-processing step of removing the wrong choices. These questions required reasoning by eliminating answer candidates. In 5 of the 23 errors, our model (in our opinion) brings alternative sentences to the top that provide a sufficient explanation. In all other cases, when the model outputs a poor ranking, it usually outputs semantically similar facts that are not necessary or helpful for completing the reasoning.
We would like to emphasize question 23 in Appendix I, which illustrates the need for scoring chains rather than individual facts ("The snowshoe hare sheds its fur twice a year. In the summer, the fur of the hare is brown. In the winter, the fur is white. Which of these statements best explains the advantage of shedding fur? (A) Shedding fur keeps the hare clean. (B) Shedding fur helps the hare move quickly. (C) Shedding fur keeps the hares home warm. (D) Shedding fur helps the hare blend into its habitat"). In this question, understanding that brown fur is in fact an adaptation to hide from predators and not meant for warmth depends on making the link that the hare lives in forests, which have brown bark. Although our model performs better than tf-idf on this question, the reasoning chain is longer than one hop, so the performance is poor overall.
In Appendix II, we also note examples where hopping over multiple facts is necessary. The BERT re-ranker scores relevant facts low when they have low lexical overlap with the question. However, our model uses chains of connected facts to identify that they are relevant to the question.

Conclusion
We describe our entry to the shared task on 'Multi-hop Inference Explanation Regeneration'. We present a system that reasons over chains of facts to reconstruct a subgraph of facts that explains an answer to the question. Our system is the winning entry to the shared task, outperforming the second-best system by 14.95 points in MAP score.

Appendix I

Analysis:
• TF-IDF does better because the first relevant fact is ranked higher. However, the low TF-IDF fact is ranked much lower than in the learned models.

Analysis:
• The annotation for the question marks too many facts as relevant. It is possible to generate a smaller set of facts that is sufficient to answer the question:
  - Establish rock is non-living: pebble is a kind of small rock; rock means stone; rock is a kind of nonliving thing; nonliving, non-living, die is the opposite of living, alive, live
  - Establish consumer eat living things: consumers eat other organisms; an organism is a living thing; a plant is a kind of organism
  - Establish that the plant is trying to camouflage: looking like is similar to camouflaging as; camouflage is a kind of protection against predators; from predators; against consumers
• Final ranking gathers facts related to adaptation and food chains near the top.

3 mcas 2011 5 17673
Question: Jose has two bar magnets. He pushes the ends of the two magnets together and then he lets go. The magnets move quickly apart. Which of the following statements best explains why this happens? (A) The north poles of the two magnets are facing each other. (B) One magnet is a north pole and one magnet is a south pole. (C) The ends of magnets repel each other but the centers attract. (D) One magnet is storing energy and one magnet is releasing energy.
Analysis:
• The mixed results rank facts related to magnetism and separated objects as top choices.
They are similar in content but not necessary to answer the question.

Analysis:
• The top 2 facts ranked by the model are "iron nails are made of iron" and "an electromagnet is formed by attaching an iron nail wrapped in a copper wire to a circuit". These are sufficient to answer the question but not marked as ground truth.

6 mercury sc lbs10789
Question: All of the following can become fossils except (A) bones. (B)

Analysis:
• This question is a case where having all the choices is necessary to gather relevant facts. This is a known weakness in the preprocessing step which only keeps the correct choices. Because the rankers only see the query "All of the following can become fossils except rocks" they cannot gather facts related to bones, teeth and shells. This is reflected by very poor TF-IDF ranks.
• Top ranked facts in the final results relate to fossil formation and rock formation. Again, because of the same reason mentioned before, the ranker never sees the other choice terms.

7 mdsa 2009 5 16
Question: Students visited the Morris W. Offit telescope located at the Maryland Space Grant Observatory in Baltimore. They learned about the stars, planets, and moon. The students recorded the information below. Star patterns stay the same, but their locations in the sky seem to change. The sun, planets, and moon appear to move in the sky. Proxima Centauri is the nearest star to our solar system. Polaris is a star that is part of a pattern of stars called the Little Dipper. Which statement best explains why the sun appears to move across the sky each day? (A) The sun revolves around Earth. (B) Earth rotates around the sun. (C) The sun revolves on its axis. (D) Earth rotates on its axis.
Analysis:
• Just one fact ("the Earth rotating on its axis causes the sun to appear to move across the sky during the day") is sufficient to actually answer the question.
• Top results are "if a human is on a rotating planet then other celestial bodies will appear to move from that human's perspective", "a star is a kind of celestial object; celestial body", "stay the same means not changing", "a telescope is used for observing stars;planets;moons;distant objects; the sky; celestial objects", "the Earth rotates on its axis on its axis", "the Sun is the star that is closest to Earth".
This looks like a reasonable explanation set.

8 mdsa 2010 5 18
Question: Wind is a natural resource that benefits the southeastern shore of the Chesapeake Bay. How could these winds best benefit humans? (A) The winds could blow oil spills into the bay. (B) The winds could be converted to fossil fuel. (C) The winds could blow air pollution toward land. (D) The winds could be converted to electrical energy.
Analysis:
• Model spuriously ranks "a human is a kind of animal" at 2 and "person is synonymous with human" at 3. All other top facts relate to energy, wind and shore. Again the model has captured context but not found facts that answer the question.

9 mercury sc 400862
Question: A wire is wrapped around a metal nail and connected to a battery. If the battery is active, the nail will (A) vibrate. (B)
Analysis:
• Top ranked fact is "metal is sometimes magnetic". The model pushes useful sentences away by ranking unnecessary synonymy sentences higher.

10 vasol 2009 5 10
Question: A student is hiking through a forest taking pictures for science class. Which picture would most likely be used as an example of human impact on Earth? (A) A trail built by cutting down trees (B) A river eroding away the riverbank (C) A bird nest made of dead branches (D) A group of butterflies landing on flowers
Correct Answer: A
Ground truth facts:
Analysis:
• Model output shows that it has captured context but again does not mark facts required to explain answer higher.

Question: Hummingbirds can hover in the air and fly very quickly. This benefits the hummingbird in all of the following except (A) quickly escaping predators. (B)
Analysis:
• The model output ranks "nervous system is an electrical; electric conductor"; "the nervous system is a kind of body system"; "the nervous system contains nerves"; "the nervous system is a part of the body of an animal"; "the nervous system is the vehicle for controlling the body".
This is surprising since it ignores the near-perfect lexical overlap between question and ground truth.

13 mercury sc 401598
Question: Which action does a kitten learn from its mother? (A) how to grow (B)
Analysis:
• Top ranked facts are "an egg is a stage in the life cycle process of some animals"; "frogs lay eggs"; "a frog is a kind of animal"; "some adult animals lay eggs".
• Spurious sentences like "a female insect lays eggs during the adult stage of an insect's life cycle" appear in the ranking.

15 mercury sc 415366
Question: Trees need oxygen. Roots close to the surface of the ground take in the oxygen the tree needs. Which organisms help trees get oxygen? (A) woodpeckers making holes in the tree (B) earthworms making holes in the ground near the tree (C) mushrooms growing at the base of the tree (D) squirrels eating walnuts on the ground near the tree
Analysis:
• TF-IDF is distracted by the initial chunk of text in the question. This text does not add any value to the actual question.
• Top 5 outputs of our model are "runoff is when cropland water enters; runs off into bodies of water"; "runoff contains chemicals; fertilizer; pollutants; pesticides from cropland"; "runoff is a kind of water"; "harming something has a negative impact on; effect on that something"; "waste has a negative impact on the environment". This shows that our model has in fact ignored the distractors.

Analysis:
• TF-IDF is distracted by facts relating to the definition of weathering, and the relation between the Earth and the moon.
• Our model actually ranks the top 2 facts as "the Moon has less water; air than Earth"; "weathering is a kind of erosion" which are both correct and useful for answering the question. It also ranks "soil erosion is when wind; moving water; gravity move soil from fields; environments" at rank 9. These would be sufficient facts to answer the question.
This example shows an occurrence where the annotation/ranking process actually misses valid supporting facts.

18 mcas 1998 4 8
Question: Where would it be MOST dangerous to work with electric tools? (A) in a garage (B) beside a swimming pool (C) near a television or computer (D) in a cool basement
Analysis:
• The fact that the nervous system conducts electricity is needed to show that the animal/human body will conduct electricity. The facts ranked low by our model are the ones related to the nervous system. Most likely the model implicitly assumes this since it has observed the fact that "if electricity flows through; is transferred through the body of an animal then that animal is electrocuted".