Let’s do it “again”: A First Computational Approach to Detecting Adverbial Presupposition Triggers

We introduce the novel task of predicting adverbial presupposition triggers, which is useful for natural language generation tasks such as summarization and dialogue systems. We introduce two new corpora, derived from the Penn Treebank and the Annotated English Gigaword dataset, and investigate the use of a novel attention mechanism tailored to this task. Our attention mechanism augments a baseline recurrent neural network without the need for additional trainable parameters, minimizing the added computational cost of our mechanism. We demonstrate that this model statistically significantly outperforms our baselines.


Introduction
In pragmatics, presuppositions are assumptions or beliefs in the common ground between discourse participants when an utterance is made (Frege, 1892; Strawson, 1950; Stalnaker, 1973, 1998), and are ubiquitous in naturally occurring discourses (Beaver and Geurts, 2014). Presuppositions underlie spoken statements and written sentences, and understanding them facilitates smooth communication. We refer to expressions that indicate the presence of presuppositions as presupposition triggers. These include definite descriptions, factive verbs and certain adverbs, among others. For example, consider the following statements:
(1) John is going to the restaurant again.
* Authors (listed in alphabetical order) contributed equally.
(2) John has been to the restaurant.
(1) is only appropriate in the context where (2) is held to be true because of the presence of the presupposition trigger again. One distinguishing characteristic of presupposition is that it is unaffected by negation of the presupposing context, unlike other semantic phenomena such as entailment and implicature. The negation of (1), John is not going to the restaurant again., also presupposes (2).
Our focus in this paper is on adverbial presupposition triggers such as again, also and still. Adverbial presupposition triggers indicate the recurrence, continuation, or termination of an event in the discourse context, or the presence of a similar event. In one study of presuppositional triggers in English journalistic texts (Khaleel, 2010), adverbial triggers were found to be the most commonly occurring presupposition triggers after existential triggers. 1 Despite their frequency, there has been little work on these triggers in the computational literature from a statistical, corpus-driven perspective.
As a first step towards language technology systems capable of understanding and using presuppositions, we propose to investigate the detection of contexts in which these triggers can be used. This task constitutes an interesting testing ground for pragmatic reasoning, because the cues that are indicative of contexts containing recurring or similar events are complex and often span more than one sentence, as illustrated in Sentences (1) and (2). Moreover, such a task has immediate practical consequences. For example, in language generation applications such as summarization and dialogue systems, adding presuppositional triggers in contextually appropriate locations can improve the readability and coherence of the generated output.
1 Presuppositions of existence are triggered by possessive constructions, names or definite noun phrases.
We create two datasets based on the Penn Treebank corpus (Marcus et al., 1993) and the English Gigaword corpus (Graff et al., 2007), extracting contexts that include presupposition triggers as well as other similar contexts that do not, in order to form a binary classification task. In creating our datasets, we consider a set of five target adverbs: too, again, also, still, and yet. We focus on these adverbs in our investigation because these triggers are well known in the linguistic literature and commonly trigger presuppositions. We control for a number of potential confounding factors, such as class balance and the syntactic governor of the triggering adverb, so that models cannot exploit these correlating factors without any actual understanding of the presuppositional properties of the context.
We test a number of standard baseline classifiers on these datasets, including a logistic regression model and deep learning methods based on recurrent neural networks (RNN) and convolutional neural networks (CNN).
In addition, we investigate the potential of attention-based deep learning models for detecting adverbial triggers. Attention is a promising approach to this task because it allows a model to weigh information from multiple points in the previous context and infer long-range dependencies in the data (Bahdanau et al., 2015). For example, the model could learn to detect multiple instances involving John and restaurants, which would be a good indication that again is appropriate in that context. Also, an attention-based RNN has achieved success in predicting article definiteness, which involves another class of presupposition triggers (Kabbara et al., 2016).
As another contribution, we introduce a new weighted pooling attention mechanism designed for predicting adverbial presupposition triggers. Our attention mechanism allows for a weighted averaging of our RNN hidden states where the weights are informed by the inputs, as opposed to a simple unweighted averaging. Our model uses a form of self-attention (Paulus et al., 2018;Vaswani et al., 2017), where the input sequence acts as both the attention mechanism's query and key/value. Unlike other attention models, instead of simply averaging the scores to be weighted, our approach aggregates (learned) attention scores by learning a reweighting scheme of those scores through another level (dimension) of attention. Additionally, our mechanism does not introduce any new parameters when compared to our LSTM baseline, reducing its computational impact. We compare our model using the novel attention mechanism against the baseline classifiers in terms of prediction accuracy. Our model outperforms these baselines for most of the triggers on the two datasets, achieving 82.42% accuracy on predicting the adverb "also" on the Gigaword dataset.
The contributions of this work are as follows: 1. We introduce the task of predicting adverbial presupposition triggers.
2. We present new datasets for the task of detecting adverbial presupposition triggers, with a data extraction method that can be applied to other similar pre-processing tasks.
3. We develop a new attention mechanism in an RNN architecture that is appropriate for the prediction of adverbial presupposition triggers, and show that its use results in better prediction performance over a number of baselines without introducing additional parameters.
Related Work

Presupposition and pragmatic reasoning
The discussion of presupposition can be traced back to Frege's work on the philosophy of language (Frege, 1892), which later led to the most commonly accepted view of presupposition, the Frege-Strawson theory (Kaplan, 1970; Strawson, 1950). In this view, presuppositions are preconditions for sentences/statements to be true or false. To the best of our knowledge, there is no previous computational work that directly investigates adverbial presupposition. However, in the fields of semantics and pragmatics, there exist linguistic studies on presupposition that involve adverbs such as "too" and "again" as pragmatic presupposition triggers (e.g., Blutner et al., 2003; Kang, 2012). Also relevant to our work is Kabbara et al. (2016), which proposes using an attention-based LSTM network to predict noun phrase definiteness in English. Their work demonstrates the ability of these attention-based models to pick up on contextual cues for pragmatic reasoning.
Many different classes of construction can trigger presupposition in an utterance, including but not limited to stressed constituents, factive verbs, and implicative verbs (Zare et al., 2012). In this work, we focus on the class of adverbial presupposition triggers.
Our task setup resembles the Cloze test used in psychology (Taylor, 1953; Coleman, 1968; Rankin, 1969) and machine comprehension (Riloff and Thelen, 2000), which tests text comprehension via a fill-in-the-blanks task. We similarly pre-process our samples such that they are roughly the same length, and contain equal numbers of positive and negative samples. However, we avoid replacing the deleted words with a blank, so that our model has no clue regarding the exact position of the possibly missing trigger. Another related work, on the Children's Book Test (Hill et al., 2015), notes that memories that encode sub-sentential chunks (windows) of informative text seem to be most useful to neural networks when interpreting and modelling language. Their finding inspires us to run initial experiments with different context windows and tune the size of chunks according to the logistic regression results on the development set.

Attention
In the context of encoder-decoder models, attention weights are usually based on an energy measure of the previous decoder hidden state and encoder hidden states. Many variations on attention computation exist. Sukhbaatar et al. (2015) propose an attention mechanism conditioned on a query and applied to a document. To generate summaries, Paulus et al. (2018) add an attention mechanism in the prediction layer, as opposed to the hidden states. Vaswani et al. (2017) suggest a model which learns an input representation by self-attending over inputs. While these methods are all tailored to their specific tasks, they all inspire our choice of a self-attending mechanism.

Corpora
We extract datasets from two corpora, namely the Penn Treebank (PTB) corpus (Marcus et al., 1993) and a subset (sections 000-760) of the third edition of the English Gigaword corpus (Graff et al., 2007). For the PTB dataset, we use sections 22 and 23 for testing. For the Gigaword corpus, we use sections 700-760 for testing. For the remaining data, we randomly choose 10% for development and the other 90% for training. For each dataset, we consider a set of five target adverbs: too, again, also, still, and yet. We choose these five because they are commonly used adverbs that trigger presupposition. Since we are concerned with investigating the capacity of attentional deep neural networks to predict presuppositional effects in general, we frame the learning problem as binary classification for predicting the presence of an adverbial presupposition (as opposed to the identity of the adverb).
On the Gigaword corpus, we consider each adverb separately, resulting in five binary classification tasks. This was not feasible for PTB because of its small size.
Finally, because of the commonalities between the adverbs in presupposing similar events, we create a dataset that unifies all instances of the five adverbs found in the Gigaword corpus, with a label "1" indicating the presence of any of these adverbs.

Data extraction process
We define a sample in our dataset as a 3-tuple, consisting of a label (representing the target adverb, or 'none' for a negative sample), a list of tokens we extract (before/after the adverb), and a list of corresponding POS tags (Klein and Manning, 2002). In each sample, we also add a special token "@@@@" right before the head word and the corresponding POS tag of the head word, both in positive and negative cases. We add such special tokens to identify the candidate context in the passage to the model. Figure 1 shows a single positive sample in our dataset.
We first extract positive contexts that contain a triggering adverb, then extract negative contexts that do not, controlling for a number of potential confounds. Our positive data consist of cases where the target adverb triggers presupposition by modifying a certain head word which, in most cases, is a verb. We define such a head word as the governor of the target adverb.
When extracting positive data, we scan through all the documents, searching for target adverbs. For each occurrence of a target adverb, we store the location and the governor of the adverb. Taking each occurrence of a governor as a pivot, we extract the 50 unlemmatized tokens preceding it, together with the tokens after it up to the end of the sentence containing the adverb, with the adverb itself removed. If there are fewer than 50 tokens before the adverb, we simply extract all of them. In preliminary testing using a logistic regression classifier, we found that limiting the size to 50 tokens yielded higher accuracy than 25 or 100 tokens. Because some head words are themselves stopwords, we do not remove any stopwords from the samples; otherwise, we would lose many important samples.
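The positive-context extraction described above can be sketched as follows. This is an illustrative sketch only: the `extract_positive` and `sentence_end` helpers, the dependency map that links an adverb to its governor, and the period-based sentence boundary are our assumptions, not the paper's actual implementation.

```python
WINDOW = 50  # tokens kept before the governor (the tuned value reported above)
TARGET_ADVERBS = {"too", "again", "also", "still", "yet"}

def sentence_end(tokens, i):
    """Index one past the end of the sentence containing position i
    (sentence boundaries approximated here by the '.' token)."""
    j = i
    while j < len(tokens) and tokens[j] != ".":
        j += 1
    return min(j + 1, len(tokens))

def extract_positive(tokens, pos_tags, adverb_to_governor):
    """Yield (label, tokens, pos_tags) samples for each target adverb.

    `adverb_to_governor` maps an adverb's index to its governor's index
    (e.g. obtained from a dependency parse)."""
    samples = []
    for i, tok in enumerate(tokens):
        if tok.lower() not in TARGET_ADVERBS or i not in adverb_to_governor:
            continue
        gov = adverb_to_governor[i]        # governor index (usually a verb)
        start = max(0, gov - WINDOW)       # up to 50 tokens before the governor
        end = sentence_end(tokens, gov)    # end of the governor's sentence
        ctx_toks, ctx_tags = [], []
        for j in range(start, end):
            if j == i:                     # remove the trigger itself
                continue
            if j == gov:                   # mark the candidate position
                ctx_toks.append("@@@@")
                ctx_tags.append("@@@@")
            ctx_toks.append(tokens[j])
            ctx_tags.append(pos_tags[j])
        samples.append((tok.lower(), ctx_toks, ctx_tags))
    return samples
```

Note that stopwords are kept and the "@@@@" marker is inserted into both the token and POS lists, matching the sample format described earlier.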
We filter out the governors of "too" that have POS tags "JJ" and "RB" (adjectives and adverbs), because such cases correspond to a different sense of "too", indicating excess quantity, which does not trigger presupposition (e.g., "rely too heavily on", "it's too far from").
After extracting the positive cases, we then use the governor information of positive cases to extract negative data. In particular, we extract sentences containing the same governors but not any of the target adverbs as negatives. In this way, models cannot rely on the identity of the governor alone to predict the class. This procedure also roughly balances the number of samples in the positive and negative classes.
For each governor in a positive sample, we locate a corresponding context in the corpus where the governor occurs without being modified by any of the target adverbs. We then extract the surrounding tokens in the same fashion as above. Moreover, we try to control for position-related confounding factors with two randomization approaches: 1) we randomize the order of documents to be scanned, and 2) within each document, we start scanning from a random location. Note that the number of negative cases may not exactly equal the number of positive cases in all datasets, because some governors appearing in positive cases are rare words, and we are unable to find any (or only a few) occurrences that match them for the negative cases.

Learning Model
In this section, we introduce our attention-based model. At a high level, our model extends a bidirectional LSTM model by computing correlations between the hidden states at each timestep, then applying an attention mechanism over these correlations. Our proposed weighted-pooling (WP) neural network architecture is shown in Figure 2.
The input sequence $u = \{u_1, u_2, \ldots, u_T\}$ consists of a sequence, of time length $T$, of one-hot encoded word tokens, where the original tokens are those such as in Listing 1. Each token $u_t$ is embedded with a pretrained embedding matrix $W_e \in \mathbb{R}^{|V| \times d}$, where $|V|$ is the number of tokens in vocabulary $V$, and $d$ is the size of the word embeddings. The embedded token vector $x_t \in \mathbb{R}^d$ is retrieved simply as $x_t = u_t W_e$. Optionally, $x_t$ may also include the token's POS tag. In such instances, the embedded token at time step $t$ is concatenated with the POS tag's one-hot encoding $p_t$: $x_t = u_t W_e \,\|\, p_t$, where $\|$ denotes the vector concatenation operator.
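A minimal sketch of this embedding step, with a small random matrix standing in for the pretrained embeddings (the sizes and the `embed` helper are illustrative assumptions):

```python
import numpy as np

V, d, P = 5, 4, 3                   # toy vocab size, embedding dim, number of POS tags
rng = np.random.default_rng(0)
W_e = rng.standard_normal((V, d))   # stand-in for the pretrained embedding matrix

def embed(token_id, pos_id):
    """x_t = u_t W_e || p_t: embedding lookup (equivalent to multiplying
    the one-hot row u_t by W_e) concatenated with the POS one-hot p_t."""
    p = np.zeros(P)
    p[pos_id] = 1.0
    return np.concatenate([W_e[token_id], p])
```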
At each input time step $t$, a bi-directional LSTM (Hochreiter and Schmidhuber, 1997) encodes $x_t$ into hidden states $\overrightarrow{h_t}, \overleftarrow{h_t} \in \mathbb{R}^s$:

$$h_t = [\overrightarrow{h_t} \,;\, \overleftarrow{h_t}] \quad (1)$$

where $\overrightarrow{h_t} = f(x_t, h_{t-1})$ is computed by the forward LSTM, and $\overleftarrow{h_t} = f(x_t, h_{t+1})$ is computed by the backward LSTM. The concatenated vector $h_t$ is of size $2s$, where $s$ is a hyperparameter determining the size of the LSTM hidden states. Let matrix $H \in \mathbb{R}^{2s \times T}$ correspond to the concatenation of all hidden state vectors:

$$H = [h_1, h_2, \ldots, h_T]. \quad (2)$$

Our model uses a form of self-attention (Paulus et al., 2018; Vaswani et al., 2017), where the input sequence acts as both the attention mechanism's query and key/value. Since the location of a presupposition trigger can vary greatly from one sample to another, and because dependencies can be long range or short range, we model all possible word-pair interactions within a sequence. We calculate the energy between all input tokens with a pair-wise matching matrix

$$M = H^\top H \quad (3)$$

where $M$ is a square matrix $\in \mathbb{R}^{T \times T}$. To get a single attention weight per time step, we adopt the attention-over-attention method (Cui et al., 2017).

Figure 2: Our weighted-pooling neural network architecture (WP). The tokenized input is embedded with pretrained word embeddings and possibly concatenated with one-hot encoded POS tags. The input is then encoded with a bi-directional LSTM, followed by our attention mechanism. The computed attention scores are then used as weights to average the encoded states, in turn connected to a fully connected layer to predict presupposition triggering.
Given matrix $M$, we first compute the row-wise attention score $M^r_{ij}$ over $M$:

$$M^r_{ij} = \frac{\exp(e_{ij})}{\sum_{j=1}^{T} \exp(e_{ij})} \quad (4)$$

where $e_{ij} = M_{ij}$. $M^r$ can be interpreted as a word-level attention distribution over all other words. Since we would like a single weight per word, we need an additional step to aggregate these attention scores. Instead of simply averaging the scores, we follow the approach of Cui et al. (2017), which learns the aggregation by an additional attention mechanism. We compute the column-wise softmax $M^c_{ij}$ over $M$:

$$M^c_{ij} = \frac{\exp(e_{ij})}{\sum_{i=1}^{T} \exp(e_{ij})}. \quad (5)$$

The columns of $M^r$ are then averaged, forming vector $\beta \in \mathbb{R}^T$. Finally, $\beta$ is multiplied with the column-wise softmax matrix $M^c$ to get attention vector $\alpha$:

$$\alpha = M^c \beta. \quad (6)$$

Note that Equations (2) to (6) describe how we derive an attention score over our input without the introduction of any new parameters, potentially minimizing the computational effect of our attention mechanism.
As the last layer of their neural network, Cui et al. (2017) sum over $\alpha$ to extract the most relevant input. However, we use $\alpha$ as weights to combine all of our hidden states $h_t$:

$$c = \sum_{t=1}^{T} \alpha_t h_t \quad (7)$$

where $c \in \mathbb{R}^{2s}$. We follow the pooling with a dense layer $z = \sigma(W_z c + b_z)$, where $\sigma$ is a non-linear function, and matrix $W_z \in \mathbb{R}^{64 \times 2s}$ and vector $b_z \in \mathbb{R}^{64}$ are learned parameters. The presupposition trigger probability is computed with an affine transform followed by a softmax:

$$\hat{y} = \mathrm{softmax}(W_o z + b_o) \quad (8)$$

where matrix $W_o \in \mathbb{R}^{2 \times 64}$ and vector $b_o \in \mathbb{R}^2$ are learned parameters. The training objective minimizes

$$J = \frac{1}{N} \sum_{i=1}^{N} E\big(\hat{y}^{(i)}, y^{(i)}\big) \quad (9)$$

where $E(\cdot\,,\cdot)$ is the standard cross-entropy.
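Because the attention computation uses no trainable parameters, the whole pooling step can be expressed in a few lines of NumPy. The sketch below is an illustration under the assumption that the matching matrix is the inner product of the hidden states; the function names are ours, not the paper's.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def weighted_pooling(H):
    """Parameter-free attention-over-attention pooling of BiLSTM states.

    H has shape (2s, T): one column of concatenated hidden states per
    time step. Returns the pooled vector c (2s,) and weights alpha (T,)."""
    M = H.T @ H                  # pair-wise matching matrix, (T, T)
    M_r = softmax(M, axis=1)     # row-wise softmax
    M_c = softmax(M, axis=0)     # column-wise softmax
    beta = M_r.mean(axis=0)      # average the columns of M_r
    alpha = M_c @ beta           # attention-over-attention weights
    c = H @ alpha                # weighted average of hidden states
    return c, alpha
```

Since each column of the column-wise softmax sums to one, and the averaged row-wise scores in beta also sum to one, alpha is itself a valid probability distribution over time steps.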

Experiments
We compare the performance of our WP model against several models which we describe in this section. We carry out the experiments on both datasets described in Section 3. We also investigate the impact of POS tags and attention mechanism on the models' prediction accuracy.

Baselines
We compare our learning model against the following systems. The first is the most-frequent-class baseline (MFC), which simply labels all samples with the most frequent class (label 1). The second is a logistic regression classifier (LogReg), in which the probabilities describing the possible outcomes of a single input x are modeled using a logistic function. We implement this baseline classifier with the scikit-learn package (Pedregosa et al., 2011), with a CountVectorizer including bi-gram features. All other hyperparameters are set to their default values.
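The LogReg baseline can be sketched as a scikit-learn pipeline. Beyond "CountVectorizer including bi-gram features", the exact settings are assumptions here: we take `ngram_range=(1, 2)` (unigrams plus bi-grams), and the toy texts and labels are for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# CountVectorizer with bi-gram features feeding a default LogisticRegression.
baseline = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    LogisticRegression(),
)

train_texts = [
    "john is @@@@ going to the restaurant",   # positive context (label 1)
    "mary will @@@@ place the order today",   # negative context (label 0)
]
train_labels = [1, 0]
baseline.fit(train_texts, train_labels)
preds = baseline.predict(train_texts)
```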
The third is a variant of the LSTM recurrent neural network introduced in Graves (2013). The input is encoded by a bidirectional LSTM, as in the WP model detailed in Section 4. Instead of a self-attention mechanism, we simply mean-pool matrix H, the concatenation of all LSTM hidden states, across all time steps. This is followed by a fully connected layer and a softmax function for the binary classification. Our WP model uses the same bidirectional LSTM as this baseline, and has the same number of parameters, allowing for a fair comparison of the two models. Such a standard LSTM model represents a state-of-the-art language model, as it outperforms more recent models on language modeling tasks when the number of model parameters is controlled for (Melis et al., 2017). For the last model, we use a slight variant of the CNN sentence classification model of Kim (2014), based on the Britz TensorFlow implementation 2 .

Hyperparameters & Additional Features
After tuning, we found the following hyperparameters to work best: 64 units in fully connected layers and 40 units for POS embeddings. We used dropout with probability 0.5 and mini-batch size of 64.
For all models, we initialize word embeddings with word2vec pretrained embeddings of size 300. Unknown words are randomly initialized to the same size as the word2vec embeddings. In early tests on the development datasets, we found that our neural networks consistently performed better when the word embeddings were kept fixed. All neural network results reported in this paper use fixed embeddings.
Fully connected layers in the LSTM, CNN and WP model are regularized with dropout (Srivastava et al., 2014). The model parameters for these neural networks are fine-tuned with the Adam algorithm (Kingma and Ba, 2015). To stabilize the RNN training gradients (Pascanu et al., 2013), we perform gradient clipping for gradients below threshold value -1, or above 1. To reduce overfitting, we stop training if the development set does not improve in accuracy for 10 epochs. All performance on the test set is reported using the best trained model as measured on the development set.
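The optimization settings above can be illustrated with a single training step. The paper does not name a framework; this PyTorch sketch, the `Linear` stand-in model, and the random mini-batch are assumptions for illustration, not the actual BiLSTM + WP training code.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(10, 2)          # stand-in for the real model
optimizer = torch.optim.Adam(model.parameters())
loss_fn = torch.nn.CrossEntropyLoss()

x = torch.randn(64, 10)                 # mini-batch of size 64
y = torch.randint(0, 2, (64,))

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
# clip each gradient value at the [-1, 1] threshold described above
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)
optimizer.step()
```

Early stopping would wrap this step in an epoch loop that tracks development-set accuracy and halts after 10 epochs without improvement.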
In addition, we use the CoreNLP Part-of-Speech (POS) tagger (Manning et al., 2014) to get the corresponding POS features for extracted tokens. In all of our models, we limit the maximum length of samples and POS tag sequences to 60 tokens. For the CNN, sequences shorter than 60 tokens are zero-padded.

Table 2: Performance of various models, including our weighted-pooled LSTM (WP). MFC refers to the most-frequent-class baseline, LogReg is the logistic regression baseline. LSTM and CNN correspond to strong neural network baselines. Note that we bold the performance numbers for the best performing model for each of the "+ POS" and "- POS" cases.

Table 4 shows the distribution of agreed and disagreed classifications.
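The 60-token limit and zero-padding can be sketched as a small helper (the function name and pad id are illustrative assumptions):

```python
def pad_or_truncate(ids, max_len=60, pad_id=0):
    """Clip a token-id sequence to max_len and right-pad with pad_id,
    as done for the CNN inputs described above."""
    ids = list(ids)[:max_len]
    return ids + [pad_id] * (max_len - len(ids))
```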

Analysis
Consider the following pair of samples, randomly chosen from the PTB dataset (shortened for readability):

1. ...Taped just as the market closed yesterday , it offers Ms. Farrell advising , " We view the market here as going through a relatively normal cycle ... . We continue to feel that the stock market is the @@@@ place to be for long-term appreciation

2. ...More people are remaining independent longer presumably because they are better off physically and financially . Careers count most for the well-to-do many affluent people @@@@ place personal success and money above family

In both cases, the head word is place. In Example 1, the word continue (emphasized in the above text) suggests that the adverb still could be used to modify the head word place (i.e., ... the stock market is still the place ...). Further, it is easy to see that place refers to stock market, which occurred in the previous context. Our model correctly predicts this sample as containing a presupposition, despite the complexity of the coreference across the text.
In Example 2, with the same head word place, our model falsely predicts the presence of a presupposition. However, even a human could read the sentence as "many people still place personal success and money above family". This underscores the subtlety and difficulty of the task at hand. The long-range dependencies and interactions within sentences seen in these examples are what motivate the use of the various deep non-linear models presented in this work, which are useful in detecting these coreferences, particularly in the case of attention mechanisms.

Conclusion
In this work, we have investigated the task of predicting adverbial presupposition triggers and introduced several datasets for the task. Additionally, we have presented a novel weighted-pooling attention mechanism, incorporated into a recurrent neural network model, for predicting the presence of an adverbial presupposition trigger. Our results show that the model outperforms the CNN and LSTM baselines while adding no parameters over the standard LSTM model. This suggests its promise for classification tasks that involve capturing and combining relevant information from multiple points in the previous context.
In future work, we would like to focus more on designing models that can deal with and be optimized for scenarios with severe data imbalance. We would like to also explore various applications of presupposition trigger prediction in language generation applications, as well as additional attention-based neural network architectures.