Towards Inference-Oriented Reading Comprehension: ParallelQA

In this paper, we investigate the tendency of end-to-end neural Machine Reading Comprehension (MRC) models to match shallow patterns rather than perform inference-oriented reasoning on RC benchmarks. We aim to test the ability of these systems to answer questions which focus on referential inference. We propose ParallelQA, a strategy to formulate such questions using parallel passages. We also demonstrate that existing neural models fail to generalize well to this setting.


Introduction
Reading Comprehension (RC) is the task of reading a body of text and answering questions about it. It requires a deep understanding of the information presented in order to reason about entities, actions, events, and their interrelationships. This necessitates language understanding skills as well as the cognitive ability to draw inferences.
Recent efforts in creating large-scale datasets have triggered a renewed interest in the RC task, with subsequent development of complex end-to-end solutions featuring neural models. While these models do exceedingly well on the specific datasets they are developed for (some reaching or even surpassing human performance), they do not perform proportionally across datasets. Weissenborn et al. (2017) have shown that simple neural baseline architectures derived from a context or type matching heuristic can achieve comparable results. Our experiments also indicate that pattern matching can work well on these datasets.
Inference, an important RC skill (Spearritt, 1972; Strange, 1980), is the ability to understand the meaning of text without all the information being stated explicitly. Table 5, Section A describes the types of inference that we may encounter while comprehending a passage, along with the cues that help perform such reasoning. Although state-of-the-art deep learning models for machine reading are believed to have such reasoning capabilities, the limited ability of these models to generalize indicates certain shortcomings. We believe that it is important to develop benchmarks which give a realistic sense of a system's RC capabilities. Thus, our goal in this paper is two-fold:
Proof of Concept: We propose a method to create an RC dataset that assesses a model's ability to:
• move beyond lexical pattern matching between the question and passage,
• infer the correct answers to questions which contain referring expressions, and
• generalize to different language styles.
Analysis of Existing Models: We test three end-to-end neural MRC models, which perform well on SQuAD (Rajpurkar et al., 2016), on a few question-answer pairs generated using our methodology. We demonstrate that it is indeed difficult for these systems to answer such questions, also indicating their tendencies to resort to shallow pattern matching and overfit to training data.
* Equal Contribution

Existing Datasets
In this work, we focus on datasets with multiword spans as answers rather than cloze-style RC datasets like MCTest (Richardson et al., 2013), CNN / Daily Mail (Hermann et al., 2015) and Children's Book Test (Weston et al., 2015).
The Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016) was one of the first large-scale RC datasets (over 100k QA pairs), where the answer to each question is a span in the given passage. For its collection, different sets of crowd-workers were asked to formulate questions and answers using passages obtained from ∼500 Wikipedia articles. However, this resulted in the questions having similar word patterns to the sentences containing the answers. We empirically demonstrate this in Table 1, where we observed that the sentence in the passage with the highest lexical similarity to the question contained the answer ∼80% of the time. Final answers tend to be short, with an average span length of around 3 tokens, and are largely entities (40.88%). Subramanian et al. (2017) and  provide evidence for regular patterns in candidate answers that neural models can exploit. We show in subsequent sections that models which perform well on SQuAD rely on lexical pattern matching, and are also not robust to variance in language style.

[Table 1: sentence retrieval statistics using ... (Jaccard, 1912), TF-IDF overlap (Sparck Jones, 1972) and BM-25 overlap (Robertson et al., 1994) scoring metrics. Recovered column headers: Metric, SQuAD.]
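The sentence retrieval check described above can be sketched roughly as follows. This is a minimal illustration using Jaccard similarity only; the tokenizer, the naive sentence splitter, and the function names (`top_sentence`, `retrieval_hit`) are assumptions made for this example and are not taken from the original experiments.

```python
import re


def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard(a, b):
    """Jaccard similarity between two token collections."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def split_sentences(passage):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", passage.strip()) if s]


def top_sentence(question, passage):
    """Return the passage sentence most lexically similar to the question."""
    q_tokens = tokenize(question)
    return max(split_sentences(passage),
               key=lambda s: jaccard(q_tokens, tokenize(s)))


def retrieval_hit(question, passage, answer):
    """True if the most similar sentence contains the gold answer string."""
    return answer.lower() in top_sentence(question, passage).lower()


# Toy passage for illustration only.
passage = ("Banda declared Malawi a one-party state under the Malawi Congress Party. "
           "Banda ran for president in the democratic elections and was defeated. "
           "He died in South Africa in 1997.")

print(retrieval_hit("Which party did Banda declare Malawi a one-party state under?",
                    passage, "Malawi Congress Party"))
print(retrieval_hit("Where did Banda die?", passage, "South Africa"))
```

Averaging `retrieval_hit` over a set of question-answer pairs gives the fraction of questions whose answer lies in the most lexically similar sentence. The first toy question reuses the wording of the answer sentence and is a hit; the second does not and is a miss.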
To alleviate the lack of topic diversity in SQuAD, NewsQA (Trischler et al., 2016) was created from 12,744 news articles sampled from CNN/Daily Mail. To ensure lexical diversity, one set of crowd-workers generated questions using only an abstractive summary, while the answer spans were marked in the full article by another set of crowd-workers. However, news articles tend to encourage questions that point to entities, and the dataset does not specifically focus on inference.
Determining the exact answer span is harder, but this may be due to the use of only news highlights to generate questions; this may induce noise in the answer spans marked in the news articles since the question might not be exactly apt.
To prevent annotation bias, SearchQA (Dunn et al., 2017) starts with question-answer pairs from Jeopardy! and adds documents retrieved by a search engine for each question as its context. However, the questions are mostly factoid. Kočiskỳ et al. (2017) found that 80% of answers are bigrams or unigrams, and 99% contain 5 or fewer tokens, with many answers being named entities. TriviaQA (Joshi et al., 2017) similarly includes question-answer pairs authored by trivia enthusiasts along with independently-gathered evidence documents which provide distant supervision for answering the questions.
These datasets have facilitated the development of new QA models, but we believe there are several important aspects of RC that remain untested.

ParallelQA
In an RC task, there is a need to incorporate questions that require not just lexical and syntactic prowess, but reference resolution, multiple steps of reasoning, and use of world knowledge. These capabilities ultimately lead to global rather than sentence-level understanding of text. The construction of a large-scale dataset of this nature is a challenging task. We take a small step in this direction by focusing on referential inference. 1
1 WikiHop (Welbl et al., 2017) is an interesting multi-hop inference-focused dataset created using entity-relation pairs for queries spanning different Wikipedia passages. While the focus of our pilot study is similar to theirs, we believe that our method can easily be extended to other inference types. Also, identifying the correct span is more challenging than choosing an answer from a list.
We aim to incorporate multiple language styles, making it hard for the system to memorize linguistic patterns (Williams et al., 2017). We achieve this by using two parallel passages that talk about the same or related subject(s) but are obtained from different sources. This helps in formulating referential inference questions: because no single sentence in either passage matches a paraphrase of the question, inference (which goes beyond co-reference) must be performed across both passages. Evaluation is easy and objective because answers are still spans within the passages. Questions can be answered solely on the basis of the information provided in their accompanying passages.
For example, to answer Question 1 in Table 2, the system will have to infer from passage 1 that President Kamuzu Banda belongs to the MCP and was defeated in the elections. The equivalence of this event and the election in passage 2 must be established, while comprehending that the "favored challenger" Bakili Muluzi is the one Banda lost the elections to, and that he belonged to the UDF, making it the correct answer.
Given that the information is spatially scattered across the two passages, this method would ensure that the parallel passages have to be understood in combination to answer the question.
Table 2
Passage 1: Hastings Kamuzu Banda was the leader of Malawi from 1961 to 1994. In 1963 he was formally appointed prime minister of Nyasaland and, a year later, led the country to independence as Malawi. Two years later he proclaimed Malawi a republic with himself as president. He declared Malawi a one-party state under the Malawi Congress Party (MCP) and became President of MCP as well as President for Life of Malawi in 1971. A referendum ended his one-party state and a special assembly ended his life-term presidency, stripping him of most of his powers. Banda ran for president in the democratic elections which followed and was defeated. He died in South Africa in 1997.
Passage 2: Malawians Saturday wound up an historic election campaign bringing multiparty politics to a country ruled for the past three decades by President Hastings Kamuzu Banda. The ailing president inspected troops from an open truck as some 20,000 people turned up at a stadium here to celebrate his official birthday ahead of elections on May 17. Reading a prepared speech with some difficulty, Banda appealed to Malawians to conduct themselves "as ladies and gentlemen" during the elections, which should be "free and fair." Meanwhile, the bigger opposition rally was addressed by the presidential challenger favored to win the elections, Bakili Muluzi of the United Democratic Front (UDF).
Question 1: Who emerged victorious between the MCP and UDF?
Question 2: What did the MCP leader ask of the people of Malawi on polling day?
Question 3: What brought multiparty politics to Malawi after three decades?

Proof of Concept
For a fair evaluation of existing models, we sought to use data drawn from a similar domain, but written in a different style. We chose the CNN/Daily Mail corpus and Wikipedia because they both focus on factoid statements, yet differ in language style to a noticeable extent (e.g. in the use of idiomatic expressions). We picked 20 CNN/Daily Mail articles at random to form one of the passages in our pair. To find an associated parallel passage, we selected the most frequently mentioned entities in each article and obtained their corresponding Wikipedia pages. We fragmented these into passages with at most 500 words, and performed a k-Nearest Neighbor search using tf-idf and topic vectors (Blei et al., 2003) to form pairs. We tuned the number of entities per article used to retrieve Wikipedia pages, as well as the sections considered in each article. This process produced a total of 15 News-Wiki passage pairs. While no two pairs have the same news article, they may be paired with the same Wiki passage.
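As a rough illustration of the pairing step, the sketch below fragments a Wikipedia page into passages of at most 500 words and ranks them against a news article by cosine similarity of tf-idf vectors. It is a simplified sketch under stated assumptions: it omits the topic vectors used in our pipeline, uses a smoothed idf weighting chosen for this example, and all function names (`chunk_words`, `tfidf_vectors`, `nearest_passages`) are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def chunk_words(text, max_words=500):
    """Split a page into consecutive passages of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def tfidf_vectors(documents):
    """Return one sparse tf-idf vector (term -> weight) per document.

    The smoothed idf, log((1 + N) / (1 + df)) + 1, is an assumption of this sketch.
    """
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_passages(news_article, wiki_passages, k=1):
    """Return indices of the k Wikipedia passages most similar to the article."""
    vectors = tfidf_vectors([news_article] + wiki_passages)
    query, candidates = vectors[0], vectors[1:]
    ranked = sorted(range(len(candidates)),
                    key=lambda i: cosine(query, candidates[i]),
                    reverse=True)
    return ranked[:k]
```

For a single news article, `nearest_passages(article, chunk_words(wiki_page_text), k=1)` would return the index of the best-matching fragment of one page; in practice the candidates would be pooled over the pages of several entities.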
We focused on referential inference for this pilot, but the method can be extended to include questions based on other types of inference. Fifteen human annotators were given explicit instructions and real-world examples to form question-answer pairs using given parallel passages. We collected ∼50 valid question-answer pairs through this mechanism. The average length of the answers obtained was around 4 words. Basic sentence retrieval statistics (similar to the ones discussed in Section 2) are shown in Table 1, indicating that lexical similarity between the question and passage sentences is insufficient to obtain an answer.
Our small-scale experiment shows the feasibility of the approach, although collecting a larger dataset requires more effort in acquiring passages and generating questions from diverse sources.
We test Bi-Directional Attention Flow (BiDAF) 2 (Seo et al., 2016), Document Reader (DrQA) 3 (Chen et al., 2017), and Gated Self-Matching Networks (R-Net) 4 trained on SQuAD. We feed the concatenated parallel passages and the question as inputs. On a total of 51 QA pairs, we observed exact match (EM) scores of about 40% and token overlap F1 scores of about 45% for all models, versus their performance on the SQuAD dataset (EM of almost 70% and F1 of 80%). Detailed results are shown in Table 3.
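For reference, exact match and token overlap F1 between a predicted span and a gold span can be computed along the following lines. This is a sketch that assumes the answer normalization commonly used for SQuAD-style evaluation (lowercasing, removing punctuation and the articles a/an/the); we do not claim it reproduces the exact scoring script behind the numbers above.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize(text):
    """Lowercase, strip punctuation and articles, collapse whitespace (assumed scheme)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction, gold):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under this scheme a prediction with imprecise boundaries, such as "Bakili Muluzi of the UDF" against the gold span "Bakili Muluzi", scores 0 on EM but receives partial credit on F1 (precision 2/4, recall 2/2).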

Analysis of Existing Models
Although the models were trained and tested on different datasets, we expect them to perform reasonably well on the new task since the data sources and domain are similar. Also, the size of our collected data is much smaller than the SQuAD development set, but we believe that the samples are fairly representative of data that can be generated using our proposed mechanism. Thus, the low EM and F1 scores support our hypothesis that these datasets do not adequately assess the capabilities of these models, which overfit to lexical patterns rather than generalizing.
2 https://allenai.github.io/bi-att-flow/
3 https://github.com/hitvoice/DrQA
4 https://github.com/HKUST-KnowComp/R-Net

Table 4: Examples of error trends on ParallelQA: blue - gold answer, red - span predicted incorrectly by all models, orange - BiDAF and R-Net prediction overlap, olive - BiDAF, magenta - DrQA, cyan - R-Net
Passage: ...The UN is the largest, most familiar, most internationally represented and most powerful intergovernmental organisation in the world...UN envoy Yasushi Akashi called a meeting of all parties to talks on a four-month ceasefire for Saturday afternoon, he added...
Question: Who was sent to Bosnia as the envoy of most powerful intergovernmental organisation in the world?
Passage: ...On arrival, the president and his wife Hillary were taken to University College, one of 37 Oxford colleges, where he studied political science as a Rhodes Scholar between October 1968 and June 1970... Clinton was born and raised in Arkansas and ...
Question: From which state was this US President who was a Rhodes scholar be

We now discuss a few common errors observed upon manual inspection of the results. Examples for each are provided in Table 4. The distribution of predictions across these error categories can be found in Figure 1, Section A.
• High Lexical Overlap - Incorrect Sentence: The models tend to pick answer spans from sentences which have high lexical overlap with the question. We observe that this accounts for the largest chunk of errors across all models (example 2). Our observations are consistent with the findings of Jia and Liang (2017). The models often simply resolve the referential expression in the question to its corresponding entity. In example 1, the models resolve "organisation" in the question to "The UN" due to high lexical similarity.
• Incorrect Answer Boundaries: This is the second most frequently observed error, where the answers generated are almost correct, but models face issues in appropriately defining answer boundaries (example 3). R-Net and DrQA, on average, produce shorter answers. BiDAF tends to produce longer answers.
• Missing Logical Inference: Models are sometimes unable to draw certain logical conclusions, for instance that A's victory over B implies that B lost to A (example 4).
• Entity Type Confusion: Despite having a variety of entities as answers to questions in the training data, sometimes the model answers do not correspond to the correct entity type (example 5).

Discussion & Conclusion
While our approach is promising, we observed a few problems during the pilot study. Longer passages and constraints on the question formulation require more time and skill in the annotation process. This can lead to crowd-workers formulating a single referring expression and then using it in different contexts to form questions, reducing diversity. For some questions, although inference is needed, both passages may not be necessary to answer them. Since we used news articles and Wikipedia passages in our pilot study, 58.82% of answers were named entities. We plan to extend this mechanism to other inference types and conduct a larger pilot before scaling up the collection.
Our experiments demonstrate that the ParallelQA task can be more challenging than some prior QA tasks. Our analysis shows that many popular RC datasets seem to test the ability of models to pick up superficial cues. ParallelQA is our proposed step towards inference-oriented reading comprehension. We use parallel passages from different sources for generating reasoning questions which encourage systems to gain a deeper understanding of language, and become robust to variations in style and topic. We include examples from our initial pilot study in Table 6.

Table 6
But it was essentially a baseline slogging match which provided little to whet the appetite for Wimbledon. There were no breaks of serve in either set and only three break points in the entire match -two against Sampras in the second game and one against Martin in the next. Martin clinched the first tie-break courtesy of a double fault from Sampras to lead 4-2 and then a glorious cross-court forehand return on his second set point to take the shoot-out 7-4. He took the second tie-break by the same score, Sampras saving three match points before a fierce smash clinched Martin's third career title and his first victory over his compatriot in four meetings.
Petros "Pete" Sampras (born August 12, 1971) is a retired American tennis player widely regarded as one of the greatest in the history of the sport. He was a longtime world No. 1 with a precise serve that earned him the nickname "Pistol Pete".
Cambodian co-premiers Prince Norodom Ranariddh and Hun Sen said Wednesday they had agreed to holding peace talks with the Khmer Rouge in Pyongyang without preconditions, in response to an appeal by King Norodom Sihanouk. The co-premiers had sent an official letter to the king "saying that we are ready to go to Pyongyang without ceasefire, without preconditions," Prince Ranariddh told journalists. "Let talks begin," he added. Hun Sen said the talks, beginning on May 27, would be based on a peace plan put forward by King Sihanouk, but added that the government had yet to receive a reply from the Khmer Rouge regarding the proposal. King Sihanouk has proposed that certain "acceptable" members of the Khmer Rouge be given senior cabinet posts in the government in exchange for giving up their zones, ceasing all guerrilla activities and merging their fighters with the royal armed forces.
Hun Sen is the Prime Minister of Cambodia, President of the Cambodian People's Party (CPP), and Member of Parliament (MP) for Kandal. He has served as Prime Minister since 1985, making him the longest serving head of government of Cambodia, and one of the longest serving leaders in the world. From 1979 to 1986 and again from 1987 to 1990, Hun Sen served as Cambodia's foreign minister. His full honorary title is Samdech Akeak Moha Sena Padey Techo Hun Sen. Born Hun Bunal, he changed his name to Hun Sen in 1972 two years after joining the Khmer Rouge. Hun Sen rose to the premiership in January 1985 when the one-party National Assembly appointed him to succeed Chan Sy who had died in office in December 1984. He held the position until the 1993 UN-backed elections, which resulted in a hung parliament. After contentious negotiations with the FUNCINPEC, Hun Sen was accepted as Second Prime Minister, serving alongside Norodom Ranariddh until a 1997 coup which toppled the latter. Ung Huot was then selected to succeed Ranariddh.