Exploiting Linguistic Features for Sentence Completion

This paper presents a novel approach to automated sentence completion based on pointwise mutual information (PMI). Feature sets are created by fusing the various types of input provided to other classes of language models, ultimately allowing multiple sources of both local and distant information to be considered. Furthermore, it is shown that additional precision gains may be achieved by incorporating feature sets of higher-order n-grams. Experimental results demonstrate that the PMI model outperforms all prior models and establishes a new state-of-the-art result on the Microsoft Research Sentence Completion Challenge.


Introduction
Skilled reading is a complex cognitive process that requires constant interpretation and evaluation of written content. To develop a coherent picture, one must reason from the material encountered to construct a mental representation of meaning. As new information becomes available, this representation is continually refined to produce a globally consistent understanding. Sentence completion questions, such as those previously featured on the Scholastic Aptitude Test (SAT), were designed to assess this type of verbal reasoning ability. Specifically, given a sentence containing 1-2 blanks, the test taker was asked to select the correct answer choice(s) from the provided list of options (College Board, 2014). A sample sentence completion question is illustrated in Figure 1.
Figure 1: "Certain clear patterns in the metamorphosis of a butterfly indicate that the process is ---."
To date, relatively few publications have focused on automatic methods for solving sentence completion questions. This scarcity is likely attributable to the difficult nature of the task, which occasionally involves logical reasoning in addition to both general and semantic knowledge (Zweig et al., 2012b). Fundamentally, text completion is a challenging semantic modeling problem, and solutions require models that can evaluate the global coherence of sentences (Gubbins and Vlachos, 2013). Thus, in many ways, text completion epitomizes the goals of natural language understanding, as superficial encodings of meaning will be insufficient to determine which responses are accurate.
In this paper, a model based on pointwise mutual information (PMI) is proposed to measure the degree of association between answer options and other sentence tokens. The PMI model considers multiple sources of information present in a sentence prior to selecting the most likely alternative.
The remainder of this report is organized as follows. Section 2 describes the high-level characteristics of existing models designed to perform automated sentence completion. This prior work provides direct motivation for the PMI model, introduced in Section 3. In Section 4, the model's performance on the Microsoft Research (MSR) Sentence Completion Challenge is compared with its performance on a data set composed of SAT questions. Finally, Section 5 offers concluding remarks on this topic.

Background
Previous research expounds on various architectures and techniques applied to sentence completion. Below, models are roughly categorized on the basis of complexity and type of input analyzed.

N-gram Models
Advantages of n-gram models include their ability to estimate the likelihood of particular token sequences and automatically encode word ordering. While relatively simple and efficient to train on large, unlabeled text corpora, n-gram models are nonetheless limited by their dependence on local context. In fact, such models are likely to overvalue sentences that are locally coherent, yet improbable due to distant semantic dependencies.
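To make the dependence on local context concrete, the following is a minimal sketch of a bigram scorer with add-one smoothing. It is an illustration only, not one of the n-gram baselines cited in this paper; the toy corpus, the function names, and the choice of smoothing are all assumptions.

```python
import math
from collections import Counter

def train_bigrams(sentences):
    """Count history words and adjacent word pairs, with boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        padded = ["<s>"] + s + ["</s>"]
        unigrams.update(padded[:-1])            # every token that acts as a history
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def log_prob(sentence, unigrams, bigrams, vocab_size):
    """Add-one smoothed log-probability; each term sees only the previous word."""
    padded = ["<s>"] + sentence + ["</s>"]
    return sum(
        math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
        for prev, cur in zip(padded, padded[1:])
    )

# Hypothetical toy corpus (not the MSR training data).
corpus = [
    ["the", "sun", "was", "rising"],
    ["the", "moon", "was", "rising"],
    ["the", "sun", "was", "hidden"],
]
unigrams, bigrams = train_bigrams(corpus)
vocab_size = len({tok for s in corpus for tok in s} | {"</s>"})

# Score each candidate by substituting it into the blank of "the sun was ___".
candidates = ["rising", "hidden"]
scores = {c: log_prob(["the", "sun", "was", c], unigrams, bigrams, vocab_size)
          for c in candidates}
best = max(scores, key=scores.get)
```

Because each term conditions only on the immediately preceding word, a candidate is rewarded for fitting its neighbors regardless of what appears elsewhere in the sentence.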

Dependency Models
Dependency models circumvent the sequentiality limitation of n-gram models by representing each word as a node in a multi-child dependency tree. Unlabeled dependency language models assume that each word is (1) conditionally independent of the words outside its ancestor sequence, and (2) generated independently from the grammatical relations. To account for valuable information ignored by this model, e.g., two sentences that differ only in a reordering between a verb and its arguments, the labeled dependency language model instead treats each word as conditionally independent of the words and labels outside its ancestor path (Gubbins and Vlachos, 2013).
In addition to offering performance superior to n-gram models, advantages of this representation include relative ease of training and estimation, as well as the ability to leverage standard smoothing methods. However, the models' reliance on output from automatic dependency extraction methods and vulnerability to data sparsity detract from their real-world practicality.

Continuous Space Models
Neural networks mitigate issues with data sparsity by learning distributed representations of words, which have been shown to excel at preserving linear regularities among tokens. Despite drawbacks that include functional opacity, propensity toward overfitting, and elevated computational demands, neural language models are capable of outperforming n-gram and dependency models (Gubbins and Vlachos, 2013; Mikolov et al., 2013; Mnih and Kavukcuoglu, 2013). Log-linear model architectures have been proposed to address the computational cost associated with neural networks (Mikolov et al., 2013; Mnih and Kavukcuoglu, 2013). The continuous bag-of-words model attempts to predict the current word using n future and n historical words as context. In contrast, the continuous skip-gram model uses the current word as input to predict surrounding words. Utilizing an ensemble architecture comprised of the skip-gram model and recurrent neural networks, Mikolov et al. (2013) achieved prior state-of-the-art performance on the MSR Sentence Completion Challenge.
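The difference between the two log-linear architectures is easiest to see in the training examples each one consumes. The sketch below only enumerates those examples for a window of n words on each side; it does not train a model, and the function names and the toy sentence are illustrative assumptions.

```python
from typing import List, Tuple

def cbow_examples(tokens: List[str], n: int) -> List[Tuple[List[str], str]]:
    """Continuous bag-of-words: up to n history and n future words -> current word."""
    examples = []
    for t, current in enumerate(tokens):
        context = tokens[max(0, t - n):t] + tokens[t + 1:t + 1 + n]
        examples.append((context, current))
    return examples

def skipgram_examples(tokens: List[str], n: int) -> List[Tuple[str, str]]:
    """Skip-gram: current word -> each surrounding word within the window."""
    examples = []
    for t, current in enumerate(tokens):
        for neighbor in tokens[max(0, t - n):t] + tokens[t + 1:t + 1 + n]:
            examples.append((current, neighbor))
    return examples

# Hypothetical toy sentence.
tokens = ["the", "process", "is", "orderly"]
cbow = cbow_examples(tokens, 1)
skip = skipgram_examples(tokens, 1)
```

With n = 1, the bag-of-words example for "process" is the pair (["the", "is"], "process"), whereas skip-gram emits ("process", "the") and ("process", "is") as separate examples.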

PMI Model
This section describes an approach to sentence completion based on pointwise mutual information. The PMI model was designed to account for both local and distant sources of information when evaluating overall sentence coherence.
Pointwise mutual information is an information-theoretic measure used to discover collocations (Church and Hanks, 1990; Turney and Pantel, 2010). Informally, PMI represents the association between two words, i and j, by comparing the probability of observing them in the same context with the probabilities of observing each independently.
The first step toward applying PMI to the sentence completion task involved constructing a word-context frequency matrix from the training corpus. The context was specified to include all words appearing in a single sentence, which is consistent with the hypothesis that it is necessary to examine word co-occurrences at the sentence level to achieve appropriate granularity. During development/test set processing, all words were converted to lowercase and stop words were removed based on their part-of-speech tags (Toutanova et al., 2003). To determine whether a particular part-of-speech tag type did, in fact, signal the presence of uninformative words, tokens assigned a hypothetically irrelevant tag were removed if their omission positively affected performance on the development portion of the MSR data set. This non-traditional approach, selected to increase specificity and eliminate dependence on a non-universal stop word list, led to the removal of determiners, coordinating conjunctions, pronouns, and proper nouns.
Figure 2: The dependency parse tree for Question 17 in the MSR data set. Words that share a grammatical relationship with the missing word rising are underscored. Following stop word removal, the feature set for this question is [darkness, was, hidden].
Next, feature sets were defined to capture the various sources of information available in a sentence. While the number and type of feature sets are configurable, their composition varies, because the sets are generated dynamically for each sentence at run time. Enumerated below are the three feature sets utilized by the PMI model.
1. Reduced Context. This feature set consists of words that remain following the preprocessing steps described above.
2. Dependencies. Sentence words that share a semantic dependency with the candidate word(s) are included in this set (Chen and Manning, 2014). Absent from the set of dependencies are words removed during the pre-processing phase. Figure 2 depicts an example dependency parse tree along with features provided to the PMI model.
3. Keywords. Providing the model with a collection of salient tokens effectively increases the tokens' associated weights. An analogous approach to the one described for stop word identification was applied to discover that common nouns consistently hold greater significance than other words assigned hypothetically informative part-of-speech tags.
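The three feature sets can be sketched as follows, assuming the sentence has already been tagged and parsed. The input format (a list of word/tag pairs plus head/dependent index pairs), the function name, and the example sentence are hypothetical, and the mapping of the removed categories to Penn Treebank tags is an assumption rather than something stated in this paper.

```python
from typing import Dict, List, Tuple

# Assumed Penn Treebank tags for the removed categories: determiners,
# coordinating conjunctions, pronouns, and proper nouns.
STOP_TAGS = {"DT", "CC", "PRP", "PRP$", "NNP", "NNPS"}
# Assumed tags for common nouns, which serve as keywords.
KEYWORD_TAGS = {"NN", "NNS"}

def build_feature_sets(
    tagged: List[Tuple[str, str]],   # (word, part-of-speech tag) per token
    blank: int,                      # index of the missing word
    edges: List[Tuple[int, int]],    # (head index, dependent index) pairs
) -> Dict[str, List[str]]:
    # Tokens that survive stop word removal, excluding the blank itself.
    kept = {i for i, (_, tag) in enumerate(tagged)
            if i != blank and tag not in STOP_TAGS}
    # Tokens directly linked to the blank in the dependency parse.
    linked = {h if d == blank else d for h, d in edges if blank in (h, d)}

    def words(indices):
        return [tagged[i][0].lower() for i in sorted(indices)]

    return {
        "reduced_context": words(kept),
        "dependencies": words(kept & linked),
        "keywords": words(i for i in kept if tagged[i][1] in KEYWORD_TAGS),
    }

# Hypothetical sentence: "The old clock ___ loudly and she woke"
tagged = [("The", "DT"), ("old", "JJ"), ("clock", "NN"), ("___", "BLANK"),
          ("loudly", "RB"), ("and", "CC"), ("she", "PRP"), ("woke", "VBD")]
edges = [(3, 2), (3, 4), (2, 1), (2, 0), (3, 7), (7, 6), (7, 5)]
feature_sets = build_feature_sets(tagged, blank=3, edges=edges)
```

In this example "clock" appears in all three sets, so it would receive the largest weight when the sets are combined during scoring.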
Let X represent a word-context matrix with n rows and m columns. Row x_{i:} corresponds to word i and column x_{:j} refers to context j. The term x(i,j) indicates how many times word i occurs in context j. Applying PMI to X results in the n × m matrix Y, where term y(i,j) is defined by (1). To avoid overly penalizing words that are unrelated to the context, the positive variant of PMI is considered, in which negative scores are replaced with zero (4).
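The numbered equations referenced above are not reproduced in this text. As a sketch, the standard positive PMI formulation (Turney and Pantel, 2010) can be written in the notation of this section as follows. The symbols P(i,j), P(i,*), and P(*,j) are introduced here for illustration, and the grouping into four expressions is an assumption, not a restoration of the original numbering.

```latex
\begin{align*}
P(i,j) &= \frac{x(i,j)}{\sum_{k=1}^{n}\sum_{l=1}^{m} x(k,l)} \\
P(i,*) &= \frac{\sum_{l=1}^{m} x(i,l)}{\sum_{k=1}^{n}\sum_{l=1}^{m} x(k,l)} \\
P(*,j) &= \frac{\sum_{k=1}^{n} x(k,j)}{\sum_{k=1}^{n}\sum_{l=1}^{m} x(k,l)} \\
y(i,j) &= \max\!\left(0,\; \log\frac{P(i,j)}{P(i,*)\,P(*,j)}\right)
\end{align*}
```

The ratio inside the logarithm exceeds one only when word i and context j co-occur more often than independence would predict, so clipping at zero discards pairs that are unrelated or negatively associated.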
In addition, the discounting factor described by Pantel and Lin (2002) is applied to reduce bias toward infrequent words (7).
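A minimal sketch of the full computation, from sentence-level counts to discounted positive PMI, is given below. The discount is written in the form reported by Turney and Pantel (2010) for the Pantel and Lin (2002) factor: x(i,j)/(x(i,j)+1) multiplied by m/(m+1), where m is the smaller of the row total for i and the column total for j. Treating each context as a co-occurring word, counting a pair once per sentence, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import permutations

def cooccurrence_counts(sentences):
    """x(i, j): number of sentences in which words i and j both appear."""
    counts = Counter()
    for sentence in sentences:
        for i, j in permutations(sorted(set(sentence)), 2):
            counts[(i, j)] += 1
    return counts

def discounted_ppmi(counts):
    """Positive PMI multiplied by the discount for infrequent words."""
    total = sum(counts.values())
    row, col = Counter(), Counter()
    for (i, j), x in counts.items():
        row[i] += x
        col[j] += x
    scores = {}
    for (i, j), x in counts.items():
        # P(i,j) / (P(i,*) P(*,j)) simplifies to x * total / (row * col).
        pmi = math.log((x * total) / (row[i] * col[j]))
        ppmi = max(0.0, pmi)                      # negative scores become zero
        m = min(row[i], col[j])
        delta = (x / (x + 1)) * (m / (m + 1))     # discounting factor
        scores[(i, j)] = ppmi * delta
    return scores

# Hypothetical, already preprocessed toy corpus.
sentences = [["darkness", "rising"], ["darkness", "rising"], ["sun", "hidden"]]
counts = cooccurrence_counts(sentences)
scores = discounted_ppmi(counts)
```

Pairs that never co-occur are simply absent from the result and can be treated as zero at lookup time, which matches the positive variant.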
The PMI model evaluates each possible response to a sentence completion question by substituting each candidate answer, i, in place of the blank and scoring the option according to (8). This equation measures the semantic similarity between each candidate answer and all other words in the sentence, S. Prior to being summed, individual PMI values associated with a particular word i and context word j are multiplied by γ, which reflects the number of feature sets containing j. Ultimately, the candidate option with the highest similarity score is selected as the most likely answer.
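The scoring step can be sketched as follows, assuming the discounted PMI values are available as a lookup table and that S is the union of the feature sets built for the sentence. The function names and the numeric PMI values are hypothetical; the context words are taken from the Figure 2 example, but their assignment to the reduced context and keyword sets is illustrative.

```python
from typing import Dict, List, Tuple

def gamma(word: str, feature_sets: Dict[str, List[str]]) -> int:
    """Number of feature sets that contain the context word."""
    return sum(word in members for members in feature_sets.values())

def similarity(candidate: str,
               feature_sets: Dict[str, List[str]],
               pmi: Dict[Tuple[str, str], float]) -> float:
    """Sum of PMI(candidate, j) * gamma(j) over the context words j in S."""
    context = set().union(*feature_sets.values())
    return sum(pmi.get((candidate, j), 0.0) * gamma(j, feature_sets)
               for j in context)

def choose(candidates: List[str],
           feature_sets: Dict[str, List[str]],
           pmi: Dict[Tuple[str, str], float]) -> str:
    """Select the candidate with the highest similarity score."""
    return max(candidates, key=lambda c: similarity(c, feature_sets, pmi))

feature_sets = {
    "reduced_context": ["darkness", "was", "hidden"],
    "dependencies": ["darkness", "was", "hidden"],
    "keywords": ["darkness"],
}
# Hypothetical discounted PMI values; unseen pairs default to zero.
pmi = {
    ("rising", "darkness"): 0.5, ("rising", "hidden"): 0.2,
    ("falling", "darkness"): 0.1, ("falling", "was"): 0.3,
}
answer = choose(["rising", "falling"], feature_sets, pmi)
```

Here "darkness" belongs to three feature sets and therefore counts three times, giving "rising" a score of 0.5 × 3 + 0.2 × 2 = 1.9 against 0.1 × 3 + 0.3 × 2 = 0.9 for "falling".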
Using the procedure described above, additional feature sets of bigrams and trigrams were created and subsequently incorporated into the semantic similarity assessment. This extended model accounts for both word- and phrase-level information by considering windowed co-occurrence statistics.
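How the n-gram features are paired with candidate answers is not spelled out here, so the sketch below shows only the extraction step, under the assumption that bigrams and trigrams are contiguous windows over the preprocessed sentence with the candidate substituted for the blank. The function names and the token list are hypothetical.

```python
from typing import Dict, List, Tuple

def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous windows of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_feature_sets(tokens: List[str], max_n: int = 3) -> Dict[int, List[Tuple[str, ...]]]:
    """One feature set per n-gram order, from bigrams up to max_n."""
    return {n: ngrams(tokens, n) for n in range(2, max_n + 1)}

# Hypothetical preprocessed sentence with the candidate "rising" filled in.
tokens = ["darkness", "was", "rising", "hidden"]
features = ngram_feature_sets(tokens)
```

Each additional order roughly multiplies the number of distinct features that must be counted and stored, which is the storage cost discussed in the results.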

Data Sets
Since its introduction, the Microsoft Research Sentence Completion Challenge (Zweig and Burges, 2012a) has become a commonly used benchmark for evaluating semantic models. The data is comprised of material from nineteenth-century novels featured on Project Gutenberg. Each of the 1,040 test sentences contains a single blank that must be filled with one of five candidate words. Associated candidates consist of the correct word and decoys with similar distributional statistics.
To further validate the proposed method, 285 sentence completion problems were collected from SAT practice examinations given from 2000 onward (College Board, 2014). While the MSR data set includes a list of specified training texts, there is no comparable material for SAT questions. Therefore, the requisite word-context matrices were constructed by computing token co-occurrence frequencies from the New York Times portion of the English Gigaword corpus (Parker et al., 2009).

Results
The overall accuracy achieved on the MSR and SAT data sets reveals that the PMI model is able to outperform prior models applied to sentence completion. Table 1 provides a comparison of the accuracy values attained by various architectures, while Table 2 summarizes the PMI model's performance given feature sets of context words, dependencies, and keywords. Recall that the n-gram variant reflects how features are partitioned.
Table 1
Language Model                          MSR accuracy (%)
Random chance                           20.00
N-gram [Zweig (2012b)]                  39.00
Skip-gram [Mikolov (2013)]              48.00
LSA [Zweig (2012b)]                     49.00
Labeled Dependency [Gubbins (2013)]     50.00
Dependency RNN [Mirowski (2015)]        53.50
RNNs [Mikolov (2013)]                   55.40
Log-bilinear [Mnih (2013)]              55.50
Skip-gram + RNNs [Mikolov (2013)]       58.90
PMI                                     61.44

It appears that while introducing phrase-level information obtained from higher-order n-grams leads to gains in precision on the MSR data set, the same cannot be stated for the set of SAT questions. The most probable explanation for this is twofold. First, informative context words are much less likely to occur within 2-3 tokens of the target word. Second, missing words, which are selected to test knowledge of vocabulary, are rarely found in the training corpus. Bigrams and trigrams containing these infrequent terms are extremely uncommon. Regardless of sentence structure, the sparsity associated with higher-order n-grams guarantees diminishing returns for larger values of n. When deciding whether or not to incorporate this information, it is also important to consider the significant trade-off with respect to information storage requirements.

Conclusion
This paper described a novel approach to answering sentence completion questions based on pointwise mutual information. To capture unique information stemming from multiple sources, several feature sets were defined to encode both local and distant sentence tokens. It was shown that while precision gains can be achieved by augmenting these feature sets with higher-order n-grams, a significant cost is incurred as a result of the increased data storage requirements. Finally, the superiority of the PMI model was demonstrated by its performance on the Microsoft Research Sentence Completion Challenge, on which a new state-of-the-art result was established.