Consistent CCG Parsing over Multiple Sentences for Improved Logical Reasoning

In formal logic-based approaches to Recognizing Textual Entailment (RTE), a Combinatory Categorial Grammar (CCG) parser is used to parse input premises and hypotheses to obtain their logical formulas. Here, it is important that the parser processes the sentences consistently; failing to recognize similar syntactic structures results in inconsistent predicate-argument structures among them, in which case the succeeding theorem proving is doomed to failure. In this work, we present a simple method to extend an existing CCG parser to parse a set of sentences consistently, which is achieved by inter-sentence modeling with Markov Random Fields (MRF). When combined with existing logic-based systems, our method improves over the baselines in all of our RTE experiments on English and Japanese.


Introduction
While today's neural network-based syntactic parsers (Dyer et al., 2016; Dozat and Manning, 2017; Yoshikawa et al., 2017) have proven successful at sentence-level modeling, it is still challenging to accurately process texts that go beyond a single sentence (e.g. coreference resolution, discourse structure analysis). In this work we focus, among others, on the consistent analysis of multiple sentences in a document. This problem is as important in reasoning tasks as it is in other kinds of document analysis.
RTE is an elemental technology for semantic analysis of multiple sentences, where, given a text (T) and a hypothesis (H), a system determines if T entails H. Existing methods based on formal logic (Bos, 2008; Martínez-Gómez et al., 2017; Abzianidze, 2017) obtain logical formulas for T and H using an off-the-shelf CCG parser, and then feed them to a theorem prover. The standard approach to mapping CCG trees onto logical formulas is to assign λ-terms to the words in a sentence and combine them in a bottom-up fashion (Figure 1a). Here, when the parser fails to make consistent analyses for T and H, the succeeding inference component is also doomed to failure. In Figure 1b, when the parser wrongly analyzes "man exercising" in H as "man" modifying "exercising", the entailment relation cannot be established, due to the different argument structures of exercise in the resulting formulas. While it would be ideal to enhance the overall performance of the parser, such an improvement is not cheaply obtainable. Additionally, neural network-based parsers are susceptible to subtle changes in the input, and it is hard to inspect them or to modify their parameters in order to change a prediction. Due to this, we cannot expect that a particular pair of words across multiple sentences will always be analyzed in a consistent manner.
In this work, we solve the inconsistency problem above by adapting the inter-sentence model of Rush et al. (2012) to CCG parsing. Their motivation is to exploit the similarities among test sentences to overcome situations where the training data is scarce or its domain differs from that of the test data. Their method, based on dual decomposition, tries to find parse trees for a set of sentences that agree with an MRF, which encourages the assignment of a similar structure to similar contexts.
In our approach, we aim to eliminate wrong logical formulas such as the one in Figure 1 by rewarding consistent CCG parses across sentences. This, in turn, is achieved by rewarding the consistent assignment of categories to the terminals. This works for CCG parsing, as its derivation is mostly determined by the terminal categories. The key to our approach is that, by combining the A* parsing of Yoshikawa et al. (2017) with dual decomposition, we can keep the latency incurred by the iterative algorithm small.
We conducted experiments using two state-of-the-art logic-based systems (Martínez-Gómez et al., 2017; Abzianidze, 2017) and two RTE datasets, one for English and one for Japanese. Our method consistently improves over the baselines.

Method
We describe our approach to modeling the inter-consistencies among CCG trees Y = y_1, ..., y_N for sentences X = x_1, ..., x_N (§2.1),¹ the A* parsing method for each y_i (§2.2), and the joint decoding of the MRF and A* parsing using dual decomposition (§2.3).

Document Consistencies with MRF
To model inter-consistencies among CCG parses, we adapt the global MRF model of Rush et al. (2012). See Figure 2 for an example MRF. Our MRF encourages the assignment of similar categories to words appearing in similar contexts.
First, we construct a graphical representation of an MRF. For each context (unigram surface form in the case of Figure 2) c ∈ C, we have a set W_c of indices (s, t) that appear in c, where s is a sentence index and t a word index in sentence x_s.
¹ In this work, we focus on the inconsistency problem of premises and hypotheses in the RTE task, and thus X does not contain sentences from any "training data", as was done in Rush et al. (2012). Exploiting external resources in the same manner is also an interesting future direction.
We assign to each node in the graph a label from a set of CCG categories T, so as to maximize the global consistency score g. By combining g with local CCG parsing for each y, we aim to obtain globally consistent trees Y (§2.3). We define a label assignment z over the nodes V of the graph. In the following, z_w denotes the element in z at the index corresponding to w ∈ W (similarly z′_c for c ∈ C). Following Rush et al. (2012), we allow a NULL label for context nodes. This works as a switch to "turn off" the consistency constraints on the connected nodes. Then, in the set Z(X) of all possible z for X, we look for z* = argmax_{z ∈ Z(X)} g(z), where g(z) combines the scores f_w of the word nodes with the scores f_{w,c} of connected pairs of word and context nodes. To reward the consistent assignment of categories among connected nodes, f_{w,c} is defined in terms of scores δ_1 ≥ δ_2 ≥ δ_3 and a function simpl that removes feature values from a category (e.g. simpl(S_dcl\NP) = S\NP). For f_w, we use log P_tag obtained by the CCG parser (§2.2). We tune the δ_i based on the RTE performance on the development set.
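The definitions of z, g, and f_{w,c} can be written out explicitly. The following is a sketch of one form consistent with the description above and with the style of Rush et al. (2012), not a verbatim statement of the model. The product form of the label space, the double sum in g, and in particular the case analysis of f_{w,c} (which case each δ_i rewards, and the score 0 for mismatched labels) are assumptions.

```latex
\begin{align}
  z &\in \mathcal{T}^{|W|} \times \bigl(\mathcal{T} \cup \{\mathrm{NULL}\}\bigr)^{|C|} \\
  g(z) &= \sum_{w \in W} f_w(z_w)
        + \sum_{c \in C} \sum_{w \in W_c} f_{w,c}(z_w, z'_c) \\
  f_{w,c}(z_w, z'_c) &=
    \begin{cases}
      \delta_1 & \text{if } z_w = z'_c \\
      \delta_2 & \text{if } \mathrm{simpl}(z_w) = \mathrm{simpl}(z'_c) \\
      \delta_3 & \text{if } z'_c = \mathrm{NULL} \\
      0        & \text{otherwise}
    \end{cases}
  \qquad \delta_1 \ge \delta_2 \ge \delta_3
\end{align}
```

Under this reading, an exact category match earns the largest reward, a match up to feature values a smaller one, and the NULL label lets a context opt out of the constraint at a fixed score.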
Since the above MRF g(z) has a simple naïve Bayes structure, we can compute argmax using dynamic programming.
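The exact maximization can be illustrated for a single clique: once the label of the context node is fixed, every connected word node can be maximized independently. The sketch below is a minimal illustration, not the authors' implementation. The function names, the bracket notation for features (`S[ng]\NP`), the δ values, and the case order of `pair_score` are all assumptions made for the example.

```python
import re
from typing import Dict, List, Optional, Tuple

NULL = None  # context label that switches the consistency constraint off


def simpl(cat: str) -> str:
    # Remove feature values, assuming they are written in square brackets.
    return re.sub(r"\[[^\]]*\]", "", cat)


def pair_score(z_w: str, z_c: Optional[str],
               deltas: Tuple[float, float, float]) -> float:
    # Assumed case analysis: exact match, match up to features, NULL context.
    d1, d2, d3 = deltas
    if z_c is NULL:
        return d3
    if z_w == z_c:
        return d1
    if simpl(z_w) == simpl(z_c):
        return d2
    return 0.0


def decode_clique(unary: List[Dict[str, float]],
                  deltas: Tuple[float, float, float]):
    """Return (best score, context label, word labels) for one clique.

    unary[i] maps each candidate category of word i to its score f_w.
    """
    labels = sorted({t for scores in unary for t in scores})
    best = (float("-inf"), NULL, [])
    for z_c in [NULL] + labels:
        # With z_c fixed, each word node is maximized on its own.
        picks = [max(s, key=lambda t: s[t] + pair_score(t, z_c, deltas))
                 for s in unary]
        total = sum(s[t] + pair_score(t, z_c, deltas)
                    for s, t in zip(unary, picks))
        if total > best[0]:
            best = (total, z_c, picks)
    return best


# Hypothetical log-probabilities for "exercising" in T and in H.
example = [
    {r"S[ng]\NP": -0.1, "N": -3.0},  # T: confident present participle
    {r"S[ng]\NP": -0.9, "N": -0.6},  # H: the tagger slightly prefers N
]
score, context_label, word_labels = decode_clique(example, (1.0, 0.5, 0.1))
```

In this toy clique the confident word in T pulls the ambiguous word in H towards the same category, which is the effect the consistency score is meant to have.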

A* CCG Parsing
To parse a sentence, we use the state-of-the-art A* parsing method of Yoshikawa et al. (2017), which treats a CCG tree y as a tuple ⟨c, h⟩ of categories c = c_1, ..., c_M and a dependency structure h = h_1, ..., h_M, where each h_i is a head index. They model a tree with a locally factored model; the probability of a CCG tree is the product of the probabilities of the categories p_tag and the dependency heads p_dep of all words in x: P(y | x) = ∏_{i=1}^{M} p_tag(c_i | x) p_dep(h_i | x). Note that the most computationally heavy part of their method is the calculation of P_{tag|dep}, which needs to be done only once in our extension with dual decomposition. The additional computational cost of our method is rather small, as it depends only on the number of times the A* algorithm is run on the precomputed P_{tag|dep}, which is quite efficient. The probability P(Y | X) of parses Y for X under this model is simply the product over all y_i: P(Y | X) = ∏_{i=1}^{N} P(y_i | x_i), where Y ∈ Y(X) and Y(X) is the space of all possible parses for X.
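As a small illustration of the locally factored model, the log-probability of a tree is a sum of per-word category and head terms, and the log-probability of a set of parses is a sum over trees. The sketch below assumes the probabilities are already available as per-word dictionaries; the function names, the data layout, and the convention that head index 0 denotes the root are assumptions for the example and are not taken from depccg.

```python
import math
from typing import Dict, List, Sequence

TagProbs = List[Dict[str, float]]  # per word: category -> p_tag
DepProbs = List[Dict[int, float]]  # per word: head index -> p_dep


def tree_log_prob(p_tag: TagProbs, p_dep: DepProbs,
                  cats: Sequence[str], heads: Sequence[int]) -> float:
    """log P(y|x) = sum_i [log p_tag(c_i|x) + log p_dep(h_i|x)]."""
    return sum(math.log(p_tag[i][c]) + math.log(p_dep[i][h])
               for i, (c, h) in enumerate(zip(cats, heads)))


def parses_log_prob(trees) -> float:
    """log P(Y|X) as the sum of the per-sentence log-probabilities.

    trees is an iterable of (p_tag, p_dep, cats, heads) tuples.
    """
    return sum(tree_log_prob(*tree) for tree in trees)


# Hypothetical two-word sentence; head index 0 stands for the root.
p_tag = [{"NP": 0.5, "N": 0.5}, {r"S\NP": 0.25, "N": 0.75}]
p_dep = [{2: 0.8, 0: 0.2}, {0: 1.0}]
log_p = tree_log_prob(p_tag, p_dep, ["NP", r"S\NP"], [2, 0])
```

Because the score decomposes over words, the tables p_tag and p_dep can be computed once and reused across repeated decoding runs.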

Dual Decomposition
To obtain CCG parses Y for sentences X that are optimal in terms of both the global consistency model (§2.1) and the local parsing model (§2.2), we solve a joint optimization problem over Y and z using dual decomposition, in which c_{s,t} denotes the category assigned to the t-th word in y_s. The constraint of this problem states that the decoded Y* and z* must agree in the category assignment to word nodes in the MRF. Alg. 1 shows the pseudocode for dual decomposition applied to our method. Note that all the decoding subproblems can be kept intact even when the Lagrangian multipliers u of dual decomposition are added.
Algorithm 1: Joint CCG parsing and global MRF decoding. ⊲ J: a set of pairs of word nodes and categories in the MRF. ⊲ α: step size (0.0 < α ≤ 1.0).
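The iterative scheme can be sketched as a standard subgradient loop over the multipliers u. This is a toy illustration under stated assumptions, not a reproduction of Algorithm 1. The parsing subproblem is replaced by an independent per-word argmax over category scores (the paper runs A* parsing instead), the MRF scores follow an assumed case analysis with hypothetical δ values, and the step size is held constant. All function and variable names are hypothetical.

```python
import re
from typing import Dict, List, Optional, Tuple

Word = Tuple[int, int]                 # (sentence index s, word index t)
Scores = Dict[Word, Dict[str, float]]  # word -> category -> score


def simpl(cat: str) -> str:
    # Strip feature values, assuming square-bracket notation.
    return re.sub(r"\[[^\]]*\]", "", cat)


def pair_score(z_w: str, z_c: Optional[str], deltas) -> float:
    d1, d2, d3 = deltas  # assumed case analysis, see lead-in
    if z_c is None:      # NULL context label
        return d3
    if z_w == z_c:
        return d1
    return d2 if simpl(z_w) == simpl(z_c) else 0.0


def decode_mrf(unary: Scores, cliques: Dict[str, List[Word]], deltas):
    # Exact decoding per clique: fix the context label, maximize each word.
    z: Dict[Word, str] = {}
    for words in cliques.values():
        labels = sorted({t for w in words for t in unary[w]})
        best_total, best = float("-inf"), {}
        for z_c in [None] + labels:
            pick = {w: max(unary[w], key=lambda t, w=w:
                           unary[w][t] + pair_score(t, z_c, deltas))
                    for w in words}
            total = sum(unary[w][pick[w]] + pair_score(pick[w], z_c, deltas)
                        for w in words)
            if total > best_total:
                best_total, best = total, pick
        z.update(best)
    return z


def decode_parser(tag_logp: Scores, u: Scores) -> Dict[Word, str]:
    # Stand-in for A* parsing (assumption): independent per-word argmax.
    return {w: max(tag_logp[w], key=lambda t, w=w: tag_logp[w][t] + u[w][t])
            for w in tag_logp}


def dual_decompose(tag_logp: Scores, cliques, deltas,
                   alpha: float = 0.2, max_iter: int = 50):
    u = {w: {t: 0.0 for t in tag_logp[w]} for w in tag_logp}
    y: Dict[Word, str] = {}
    for it in range(1, max_iter + 1):
        y = decode_parser(tag_logp, u)
        unary = {w: {t: tag_logp[w][t] - u[w][t] for t in tag_logp[w]}
                 for w in tag_logp}
        z = decode_mrf(unary, cliques, deltas)
        if all(y[w] == z[w] for w in z):
            return y, it, True         # both subproblems agree
        for w in z:                    # subgradient step on disagreements
            if y[w] != z[w]:
                u[w][y[w]] -= alpha
                u[w][z[w]] += alpha
    return y, max_iter, False


# "exercising" in T "A man is exercising" and H "There is no man exercising".
tag_logp = {(0, 3): {r"S[ng]\NP": -0.1, "N": -3.0},
            (1, 4): {r"S[ng]\NP": -0.9, "N": -0.6}}
cliques = {"exercising": [(0, 3), (1, 4)]}
tags, iterations, converged = dual_decompose(tag_logp, cliques, (1.0, 0.5, 0.1))
```

Each subproblem is solved with its usual decoder; the multipliers only shift the per-category scores, which is why the decoders themselves do not need to change.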

Experimental Settings
English. In the English experiment, we test the performance of ccg2lambda (Martínez-Gómez et al., 2017) and LangPro (Abzianidze, 2017) on the SICK dataset (Marelli et al., 2014). As mentioned earlier, these systems try to prove whether T entails H by applying a theorem prover to the logical formulas converted from the CCG trees. We report results for ccg2lambda with the default settings (with SPSA abduction; Martínez-Gómez et al., 2017) and results for two versions of LangPro, one described in Abzianidze (2015) (henceforth LangPro15) and the other in Abzianidze (2017) (LangPro17). Briefly, the difference between the two versions is that LangPro17 is more robust to parse errors; see the papers for details. For the CCG parser in §2.2, we use depccg with the MRF of §2.1. We compare our results with depccg without the MRF and with the baselines reported in the above papers, which use EasyCCG (Lewis and Steedman, 2014).
In the MRF, a context node is constructed when two or more words from both T and H share the same surface form. As exceptions, some pairs of distinct categories are allowed to be aligned with score δ_1: a pair of noun modifier (N/N) and verb tense (S_ng\NP), which are categories for present participles, and a pair of nominal modifier (N/N) and noun (N). In the experiment using ccg2lambda, the pairs of categories of transitive and intransitive verbs, ((S_X\NP)/NP, S_X\NP) and ((S_X\NP)/PP, S_X\NP), for any feature X, are also allowed with δ_1.
Japanese. In the Japanese experiment, we evaluate ccg2lambda's performance on the JSeM dataset (Kawazoe et al., 2017). To construct an MRF graph, we processed RTE problems with kuromoji and made a context node for a noun or a verb followed by an adverb. We use a context based on POS-tag bigrams because graph construction based on the surface form resulted in poor RTE performance, by overgenerating MRF constraints. This may be due to the fact that Japanese sentences are usually tokenized into smaller units. We used depccg and the same hyperparameters as in the English experiment.
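The surface-form rule for context nodes can be sketched as follows. The sketch is only an illustration: whitespace tokenization, lowercasing, and the name `build_contexts` are assumptions, and the category-pair exceptions described above are not modeled.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Word = Tuple[int, int]  # (sentence index s, word index t)


def build_contexts(sentences: List[str]) -> Dict[str, List[Word]]:
    """Map each shared surface form to the word nodes that instantiate it.

    A context node is kept only if its surface form occurs in at least
    two different sentences (here: in both T and H).
    """
    occurrences: Dict[str, List[Word]] = defaultdict(list)
    for s, sentence in enumerate(sentences):
        # Assumption: whitespace tokenization and case-insensitive matching.
        for t, token in enumerate(sentence.lower().split()):
            occurrences[token].append((s, t))
    return {form: nodes for form, nodes in occurrences.items()
            if len({s for s, _ in nodes}) >= 2}


contexts = build_contexts(["A man is exercising",
                           "There is no man exercising"])
```

For this pair, context nodes are created for "man", "is", and "exercising", while words that occur in only one of the two sentences receive no constraint.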

Results and Error Analysis
We show the results on SICK in Table 1. Our MRF consistently contributes to the improvement of the accuracies for both ccg2lambda and LangPro. We observe the same tendency in the scores for all systems: with the MRF, both the accuracy and the recall of the systems improve moderately, and the systems using depccg have higher recall and lower precision than the ones with EasyCCG (with LangPro17, depccg achieves higher precision as well).
In SICK, there are many instances of the construction shown in Figure 1 ("There is no man exercising", "There is no dog barking", etc.), whose correct reading is that the last verb (e.g. exercising) is a present participle modifying a noun (e.g. man). EasyCCG and default depccg wrongly parse the last phrase (man exercising) as N/N N, where man modifies exercising. Our method correctly predicts N S_ng\NP, by utilizing the paired sentence (e.g. "A man is exercising"), in which the role of exercising is less ambiguous.
Given that the strength of LangPro17 is its robustness to parse errors such as PP-attachment, the larger gain in accuracy for LangPro15 (roughly 0.5 versus 0.1 points) indicates that our method is also robust in handling well-known difficult parsing problems. Example (a) in Table 3 is a case of a coordinate construction. Baseline depccg wrongly coordinates crocheting with the noun sofa, while our method successfully resolves the correct coordinate structure by assigning S_ng\NP to the word (hence attaching it to sitting). Example (b) is one of the cases of PP-attachment that our method successfully resolved. Our method relocates the two PPs in T to their correct places. As in the example in Figure 1, our method corrects cases like (a) and (b) by using the structure of the less ambiguous counterpart as a guide. In the case of (c), the existing parsers misclassify outdoors in T as a noun and turn the verb run into a transitive verb. With our method, the intransitive verb run in H works as a soft constraint on the verb in T and corrects its structure successfully. However, there are some cases where using only surface forms as a cue forces an assignment of categories that is consistent but not desirable. In example (d), eat is used as a transitive verb in T and as an intransitive verb in H; thus it should have different categories.
We show the results on JSeM in Table 2. The RTE performance for Japanese has improved consistently across all the scores when we add an MRF. However, all the scores with depccg (with or without the MRF) lag behind the scores reported in Mineshima et al. (2016), which uses a CCG parser implemented in Jigg (Noji and Miyao, 2016). We hypothesize that this is because the previous work created the semantic templates for this language by analyzing parse outputs from Jigg, which resulted in a kind of "overfitting" in the templates.
In the above experiments, our method worked well, mainly because the sentences in these datasets have comparatively simple structures. However, in other datasets, there are naturally more complex cases like the one in Table 3 (d), where we want different syntactic analyses for occurrences of words with the same surface form. We can counter these cases by simply extending the definition of "context" to N-grams or to POS tags, as we did in the Japanese experiment. Developing a machine learning-based method that selects which contexts to use and sets the δ_i automatically is also important future work.

Conclusion and Future Work
In this work, by modeling the inter-consistencies of multiple sentences in CCG parsing, we have successfully improved the performance of formal logic-based methods for RTE. Still, there can be pairs of words in more complex RTE problems that should not have the same category but that our method wrongly forces to share one. This is mainly because we construct context nodes with hand-tuned rules. In future work, we will extend the method so that it learns when to set an MRF constraint.

Figure 1: (a) An example semantic template for verbs V that associates a CCG category S\NP with a λ-term. (b) A logical formula of a sentence is obtained at the root of a tree by composing the λ-terms of all words following CCG combinatory rules. In this figure, hypothesis H is wrongly parsed (see the text for details).

Figure 2: An MRF graph is made up of cliques each consisting of one context node (∈ C; circles) and word nodes (∈ W ; rectangles) instantiating that context.As such, each clique expresses the interdependencies among words appearing across sentences.
(a) T: … girl is sitting on the couch and is [S_ng\NP crocheting]
    H: The girl is sitting on the sofa and crocheting
    crocheting: ✗ N → ✓ S_ng\NP
(b) T: A veteran is showing different things from a war to some people
    H: Different things [(NP\NP)/NP from] a war are being shown [((S\NP)\(S\NP))/NP to] some people by a veteran
    from: ✗ ((S\NP)\(S\NP))/NP → ✓ (NP\NP)/NP
    to: ✗ (NP\NP)/NP → ✓ ((S\NP)\(S\NP))/NP
(c) T: A few man in a competition are [S_ng\NP running] outside
    H: A few man in a competition are running outdoors
    running: ✗ (S_ng\NP)/NP → ✓ S_ng\NP
(d) T: A man is [(S_ng\NP)/NP eating] some food
    H: The person is eating
    eating: ✓ S_ng\NP → ✗ (S_ng\NP)/NP
Table 3: Example parse results in the SICK test set. (a), (b), (c) With the global MRF model, words in bold font previously assigned a wrong category (✗) have been assigned a correct one (✓). (d) is a case where the MRF is too strict and leads to the wrong assignment.

Table 1: RTE results on the test section of SICK.