A Hybrid Neural Network Model for Commonsense Reasoning

This paper proposes a hybrid neural network (HNN) model for commonsense reasoning. An HNN consists of two component models, a masked language model and a semantic similarity model, which share a BERT-based contextual encoder but use different model-specific input and output layers. HNN obtains new state-of-the-art results on three classic commonsense reasoning tasks, pushing the WNLI benchmark to 89%, the Winograd Schema Challenge (WSC) benchmark to 75.1%, and the PDP60 benchmark to 90.0%. An ablation study shows that language models and semantic similarity models are complementary approaches to commonsense reasoning, and HNN effectively combines the strengths of both. The code and pre-trained models will be publicly available at https://github.com/namisan/mt-dnn.


Introduction
Commonsense reasoning is fundamental to natural language understanding (NLU). As shown in the examples in Table 1, in order to infer what the pronoun "they" refers to in the first two statements, one has to leverage the commonsense knowledge that "demonstrators usually cause violence and city councilmen usually fear violence." Similarly, it is obvious to humans what the pronoun "it" refers to in the third and fourth statements due to the commonsense knowledge that "An object cannot fit in a container because either the object (trophy) is too big or the container (suitcase) is too small." Such problems are easily resolved by native English speakers (Levesque et al., 2011), and yet are challenging for machines. For example, the WNLI task, which is derived from WSC, is considered the most challenging NLU task in the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018). Most machine learning models, including BERT (Devlin et al., 2018a) and Distilled MT-DNN (Liu et al., 2019a), can hardly outperform the naive baseline of majority voting (scored at 65.1).
While traditional methods of commonsense reasoning rely heavily on human-crafted features and knowledge bases (Rahman and Ng, 2012a; Sharma et al., 2015; Schüller, 2014; Bailey et al., 2015; Liu et al., 2017), we explore in this study machine learning approaches using deep neural networks (DNN). Our method is inspired by two categories of DNN models proposed recently.
The first category is neural language models trained on large amounts of text data. Trinh and Le (2018) proposed to use a neural language model trained on raw text from books and news to calculate the probabilities of the natural language sentences that are constructed from a statement by replacing the to-be-resolved pronoun with each of its candidate references (antecedents), and then to pick the candidate with the highest probability as the answer. Kocijan et al. (2019) showed that a significant improvement can be achieved by fine-tuning a pre-trained masked language model (BERT in their case) on a small amount of WSC labeled data.
The second category of models are semantic similarity models. Wang et al. (2019) formulated WSC and PDP as a semantic matching problem, and proposed to use two variations of the Deep Structured Similarity Model (DSSM) (Huang et al., 2013) to compute the semantic similarity score between each candidate antecedent and the pronoun by (1) mapping the candidate and the pronoun and their context into two vectors, respectively, in a hidden space using deep neural networks, and (2) computing cosine similarity between the two vectors. The candidate with the highest score is selected as the result.
The two categories of models use different inductive biases when predicting outputs given inputs, and thus capture different views of the data. While language models measure the semantic coherence and wholeness of a statement where the pronoun to be resolved is replaced with its candidate antecedent, DSSMs measure the semantic relatedness of the pronoun and its candidate in their context.
Therefore, inspired by multi-task learning (Caruana, 1997; Liu et al., 2015, 2019b), we propose a hybrid neural network (HNN) model that combines the strengths of both a neural language model and a semantic similarity model. As shown in Figure 1, HNN consists of two component models, a masked language model and a deep semantic similarity model. The two component models share the same text encoder (BERT), but use different model-specific input and output layers. The final output score is the combination of the two model scores. The architecture of HNN bears a strong resemblance to that of the Multi-Task Deep Neural Network (MT-DNN) (Liu et al., 2019b), which consists of a BERT-based text encoder that is shared across all tasks (models) and a set of task (model) specific output layers. Following Liu et al. (2019b) and Kocijan et al. (2019), the training procedure of HNN consists of two steps: (1) pretraining the text encoder on raw text, and (2) multi-task learning of HNN on WSCR, which is the most popular WSC dataset, as suggested by Kocijan et al. (2019).
HNN obtains new state-of-the-art results with significant improvements on three classic commonsense reasoning tasks, pushing the WNLI benchmark in GLUE to 89%, the WSC benchmark (Levesque et al., 2011) to 75.1%, and the PDP-60 benchmark to 90.0%. We also conduct an ablation study which shows that language models and semantic similarity models provide complementary approaches to commonsense reasoning, and HNN effectively combines the strengths of both.

The Proposed HNN Model
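The architecture of the proposed hybrid model is shown in Figure 1. The input includes a sentence S, which contains the pronoun to be resolved, and a candidate antecedent C. The two component models, the masked language model (MLM) and the semantic similarity model (SSM), share the BERT-based contextual encoder but use different model-specific input and output layers.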

Masked Language Model (MLM)
This component model follows Kocijan et al. (2019). In the input layer, a masked sentence is constructed using S by replacing the to-be-resolved pronoun in S with a sequence of N [MASK] tokens, where N is the number of tokens in candidate C.
In the output layer, the likelihood of C being referred to by the pronoun in S is scored using the BERT-based masked language model $P_{mlm}(C|S)$. If $C = \{c_1 \ldots c_N\}$ consists of multiple tokens, $\log P_{mlm}(C|S)$ is computed as the average of the log-probabilities of its tokens:

$$\log P_{mlm}(C|S) = \frac{1}{N} \sum_{k=1}^{N} \log P_{mlm}(c_k|S) \tag{1}$$
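The input construction and scoring rule described above can be illustrated with a minimal sketch. It assumes whitespace tokens and takes the per-token log-probabilities as plain numbers instead of running a BERT forward pass; the helper names `mask_pronoun` and `mlm_log_score` and the example probabilities are hypothetical and are not taken from the released code.

```python
import math

MASK = "[MASK]"

def mask_pronoun(sentence_tokens, pronoun_index, candidate_tokens):
    """Replace the pronoun with N [MASK] tokens, where N = len(candidate_tokens)."""
    n = len(candidate_tokens)
    return (sentence_tokens[:pronoun_index]
            + [MASK] * n
            + sentence_tokens[pronoun_index + 1:])

def mlm_log_score(token_log_probs):
    """Average the log-probabilities of the candidate's tokens (Eqn. 1)."""
    if not token_log_probs:
        raise ValueError("candidate must contain at least one token")
    return sum(token_log_probs) / len(token_log_probs)

sentence = "the trophy does not fit in the suitcase because it is too big".split()
masked = mask_pronoun(sentence, sentence.index("it"), ["the", "trophy"])

# Hypothetical log-probabilities that a masked LM might assign to "the" and "trophy".
score = mlm_log_score([math.log(0.5), math.log(0.25)])
```

Averaging rather than summing keeps candidates of different lengths comparable, since a longer candidate would otherwise accumulate a lower log-probability simply by having more tokens.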

Semantic Similarity Model (SSM)
In the input layer, we treat sentence S and candidate C as a pair (S, C) that is packed together as a word sequence, where we add the [CLS] token as the first token and the [SEP] token between S and C.
After applying the shared embedding layers, we obtain the semantic representations of S and C, denoted as $s \in \mathbb{R}^d$ and $c \in \mathbb{R}^d$, respectively. We use the contextual embedding of [CLS] as $s$. Suppose C consists of N tokens, whose contextual embeddings are $h_1, \ldots, h_N$, respectively. The semantic representation of the candidate C, $c$, is computed via attention over $h_1, \ldots, h_N$ (Eqns. 2 and 3), where $W_1$ is a learnable parameter matrix and $\alpha$ is the attention score. We use the contextual embedding of the first token of the pronoun in S as the semantic representation of the pronoun, denoted as $p \in \mathbb{R}^d$. In the output layer, the semantic similarity between the pronoun and the context is computed using a bilinear model:

$$Sim(C, S) = p^\top W_2 c \tag{4}$$

where $W_2$ is a learnable parameter matrix. Then, SSM predicts whether C is a correct candidate (i.e., (C, S) is a positive pair, labeled as y = 1) using the logistic function:

$$P_{ssm}(y = 1|C, S) = \frac{1}{1 + \exp(-Sim(C, S))} \tag{5}$$

The final output score of pair (S, C), Score(S, C), is a linear combination of the MLM score of Eqn. 1 and the SSM score of Eqn. 5 (Eqn. 6).
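As a rough illustration of the SSM computation and the score combination, the following pure-Python sketch works on tiny hand-written vectors. Several details are assumptions rather than the authors' specification: the attention scores are taken to be an unscaled bilinear form $s^\top W_1 h_k$ normalized by softmax, the MLM score is assumed to be expressed as a probability, and the linear combination is assumed to use equal weights. All function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(m, v):
    return [dot(row, v) for row in m]

def softmax(xs):
    top = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def candidate_vector(s, hs, w1):
    """Attention-pool the candidate token embeddings hs into c (assumed form)."""
    alphas = softmax([dot(s, matvec(w1, h)) for h in hs])  # s^T W1 h_k
    dim = len(hs[0])
    return [sum(a * h[i] for a, h in zip(alphas, hs)) for i in range(dim)]

def ssm_probability(p, c, w2):
    """Bilinear similarity p^T W2 c passed through the logistic function."""
    sim = dot(p, matvec(w2, c))
    return 1.0 / (1.0 + math.exp(-sim))

def hnn_score(p_mlm, p_ssm, weight=0.5):
    """Linear combination of the two scores; weight=0.5 is an assumption."""
    return weight * p_mlm + (1.0 - weight) * p_ssm

identity = [[1.0, 0.0], [0.0, 1.0]]   # stand-in for the learned W1 and W2
s = [1.0, 0.0]                        # [CLS] embedding
hs = [[1.0, 0.0], [0.0, 1.0]]         # candidate token embeddings
p = [0.5, 0.5]                        # pronoun embedding

c = candidate_vector(s, hs, identity)
p_ssm = ssm_probability(p, c, identity)
final = hnn_score(0.7, p_ssm)         # 0.7 is a made-up MLM probability
```

In this toy setting the first candidate token receives the larger attention weight because it is aligned with `s`, so `c` leans toward `hs[0]`.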

The Training Procedure
We train our model of Figure 1 on the WSCR dataset, which consists of 1886 sentences, each being paired with a positive candidate antecedent and a negative candidate.
The shared BERT encoder is initialized using the published BERT uncased large model (Devlin et al., 2018a). We then fine-tune the model on the WSCR dataset by optimizing the combined objectives:

$$L = L_{mlm} + L_{ssm} + L_{rank} \tag{7}$$

where $L_{mlm}$ is the negative log-likelihood based on the masked language model of Eqn. 1, and $L_{ssm}$ is the cross-entropy loss based on the semantic similarity model of Eqn. 5. $L_{rank}$ is the pair-wise rank loss. Consider a sentence S which contains a pronoun to be resolved, and two candidates $C^+$ and $C^-$, where $C^+$ is correct and $C^-$ is not. We want to maximize $\Delta = Score(S, C^+) - Score(S, C^-)$, where $Score(\cdot)$ is defined by Eqn. 6. We achieve this by optimizing a smoothed rank loss over $\Delta$ (Eqn. 8), where $\gamma \in [1, 10]$ is the smoothing factor and $\beta \in [0, 1]$ is the margin hyperparameter. In our experiments, the default setting is $\gamma = 10$ and $\beta = 0.6$.
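To make the role of γ and β concrete, here is a small sketch of one common smoothed pair-wise rank loss, log(1 + exp(−γ(∆ − β))). This specific form, including the sign convention for the margin β, is an assumption chosen for illustration and may differ from the authors' Eqn. 8. The unweighted sum of the three losses is likewise assumed.

```python
import math

def smoothed_rank_loss(score_pos, score_neg, gamma=10.0, beta=0.6):
    """Assumed form: log(1 + exp(-gamma * (delta - beta))), delta = score_pos - score_neg."""
    delta = score_pos - score_neg
    x = -gamma * (delta - beta)
    # Numerically stable evaluation of log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def total_loss(l_mlm, l_ssm, l_rank):
    """Combined objective as an unweighted sum of the three losses (assumed)."""
    return l_mlm + l_ssm + l_rank

good = smoothed_rank_loss(0.9, 0.1)   # correct candidate scored well above the wrong one
bad = smoothed_rank_loss(0.1, 0.9)    # wrong candidate scored higher
```

With a larger γ the loss approaches a hinge on ∆ − β, so pairs whose score gap already exceeds the margin contribute almost nothing to the gradient.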

Experiments
We evaluate the proposed HNN on three commonsense benchmarks: WSC (Levesque et al., 2012), PDP60 and WNLI. WNLI is derived from WSC, and is considered the most challenging NLU task in the GLUE benchmark (Wang et al., 2018). Table 2 summarizes the datasets which are used in our experiments. Since the WSC and PDP60 datasets do not contain any training instances, following Kocijan et al. (2019), we adopt the WSCR dataset (Rahman and Ng, 2012b) for model training and selection. WSCR contains 1886 instances (1322 for training and the rest as dev set). Each instance is presented using the same structure as that in WSC.
For the WNLI instances, we convert them to the format of WSC as illustrated in Table 3. We first detect pronouns in the premise using spaCy. Then, given the detected pronoun, we take the part of the premise to its left and search the hypothesis for the longest common substring (LCS), ignoring character case. Similarly, we take the part of the premise to the right of the pronoun and search for its LCS with the hypothesis. By comparing the indexes of the two extracted LCSs in the hypothesis, we extract the candidate, which is the span lying between them. A detailed example of the conversion process is provided in Table 3.
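The conversion can be sketched as follows. This is a simplified illustration, not the authors' implementation: it works on lowercased word tokens instead of characters (which matches the word-level indexes shown in Table 3), it takes the pronoun as an argument instead of detecting it with spaCy, and the helper names are hypothetical. It relies on `difflib.SequenceMatcher.find_longest_match` from the Python standard library.

```python
import re
from difflib import SequenceMatcher

def tokenize(text):
    """Lowercase and split into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text.lower())

def longest_common_run(part, hypothesis):
    """Inclusive (start, end) indexes in `hypothesis` of the longest token run shared with `part`."""
    matcher = SequenceMatcher(None, part, hypothesis, autojunk=False)
    match = matcher.find_longest_match(0, len(part), 0, len(hypothesis))
    if match.size == 0:
        return None
    return (match.b, match.b + match.size - 1)

def extract_candidate(premise, hypothesis, pronoun):
    """Return (left LCS, right LCS, candidate text) for one premise/hypothesis pair."""
    prem, hyp = tokenize(premise), tokenize(hypothesis)
    idx = prem.index(pronoun.lower())  # first occurrence; assumed to be the target pronoun
    left = longest_common_run(prem[:idx], hyp)
    right = longest_common_run(prem[idx + 1:], hyp)
    if left is None or right is None or right[0] <= left[1] + 1:
        return left, right, None       # no span lies between the two matches
    return left, right, " ".join(hyp[left[1] + 1:right[0]])

premise = ("The cookstove was warming the kitchen, "
           "and the lamplight made it seem even warmer.")
result = extract_candidate(
    premise, "The lamplight made the kitchen seem even warmer.", "it")
```

For this pair the sketch reproduces the second entry of Table 3: left LCS at [0, 2], right LCS at [5, 7], and the candidate "the kitchen" at positions [3, 4].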

Implementation Detail
Our implementation of HNN is based on the PyTorch implementation of BERT. All the models are trained with the hyper-parameters described below unless stated otherwise. The shared layer is initialized by the BERT uncased large model. Adam (Kingma and Ba, 2014) is used as our optimizer with a learning rate of 1e-5 and a batch size of 32 or 16. The learning rate is linearly decayed during training with 100 warm-up steps. We select models based on the dev set by greedily searching epochs between 8 and 10. The trainable parameters, e.g., W_1 and W_2, are initialized by a truncated normal distribution with a mean of 0 and a standard deviation of 0.01. The margin hyperparameter, β in Eqn. 8, is set to 0.6 for MLM and 0.5 for SSM, and γ is set to 10 for all tasks. We also apply SWA (Izmailov et al., 2018) to improve the generalization of models. All the texts are tokenized using WordPieces, and are chopped to spans containing 512 tokens at most.
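The learning rate schedule can be sketched as a simple function of the training step. The text states only that the rate is linearly decayed with 100 warm-up steps; that the warm-up rises linearly from zero and that the decay ends at zero on the final step are assumptions, following the schedule commonly used for BERT fine-tuning. The function name and the `total_steps` value are hypothetical.

```python
def learning_rate(step, total_steps, base_lr=1e-5, warmup_steps=100):
    """Linear warm-up to base_lr, then linear decay to zero at total_steps (assumed shape)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / max(total_steps - warmup_steps, 1)

# Sample the schedule at a few points of a hypothetical 1000-step run.
schedule = [learning_rate(step, 1000) for step in (0, 50, 100, 550, 1000)]
```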

Results
We compare our HNN with a list of state-of-the-art models in the literature, including BERT (Devlin et al., 2018b), GPT-2 (Radford et al., 2019) and DSSM (Wang et al., 2019). Each baseline is briefly described below.
1. BERT LARGE-LM (Devlin et al., 2018b): This is the large BERT model, and we use MLM to predict a score for each candidate following Eqn. 1.
2. GPT-2 (Radford et al., 2019): During prediction, we first replace the pronoun in a given sentence with its candidates one by one. We then use the GPT-2 model to compute a score for each new sentence after the replacement, and select the candidate with the highest score as the final prediction.
3. BERT Wiki-WSCR and BERT WSCR (Kocijan et al., 2019): These two models use the same approach as BERT LARGE-LM, but are trained with different additional training data. BERT Wiki-WSCR is first fine-tuned on the constructed Wikipedia data and then on WSCR, whereas BERT WSCR is directly fine-tuned on WSCR.
4. DSSM (Wang et al., 2019): This is the unsupervised semantic matching model trained on a dataset generated with heuristic rules.
5. HNN: It is the proposed hybrid neural network model.
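The substitution-based prediction used by the GPT-2 baseline can be sketched as follows. The scoring function is passed in as a callable, and the toy scorer below is purely hypothetical: it stands in for a language model's sentence score and is not how GPT-2 computes probabilities.

```python
def substitute(tokens, pronoun_index, candidate):
    """Replace the pronoun at pronoun_index with the candidate's tokens."""
    return tokens[:pronoun_index] + candidate.split() + tokens[pronoun_index + 1:]

def pick_candidate(tokens, pronoun_index, candidates, score_fn):
    """Score each substituted sentence and return the highest-scoring candidate."""
    return max(candidates,
               key=lambda cand: score_fn(substitute(tokens, pronoun_index, cand)))

def toy_score(sentence_tokens):
    """Hypothetical stand-in for a language model score."""
    return 1.0 if "trophy is too big" in " ".join(sentence_tokens) else 0.0

tokens = "the trophy does not fit in the suitcase because it is too big".split()
choice = pick_candidate(tokens, tokens.index("it"),
                        ["the trophy", "the suitcase"], toy_score)
```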
The main results are reported in Table 4. Compared with all the baselines, HNN obtains much better performance across the three benchmarks. This clearly demonstrates the advantage of HNN over existing models. For example, HNN outperforms the previous state-of-the-art BERT Wiki-WSCR model with an 11.7% absolute improvement (83.6% vs 71.9%) on WNLI and a 2.9% absolute improvement (75.1% vs 72.2%) on WSC in terms of accuracy. Meanwhile, it achieves an 11.7% absolute improvement over the previous state-of-the-art BERT LARGE-LM model on PDP60 in accuracy. Note that both BERT Wiki-WSCR and BERT LARGE-LM use language model-based approaches to solve the pronoun resolution problem. On the other hand, we observe that DSSM without pre-training is comparable to BERT LARGE-LM, which is pre-trained on a large-scale text corpus (63.0% vs 62.0% on WSC and 75.0% vs 78.3% on PDP60). Our results show that HNN, combining the strengths of both DSSM and BERT WSCR, has consistently achieved new state-of-the-art results on all three tasks.

Model                                    WNLI   WSC    PDP60
DSSM (Wang et al., 2019)                 -      63.0   75.0
BERT LARGE-LM (Devlin et al., 2018a)     65.1   62.0   78.3
GPT-2 (Radford et al., 2019)             -      70.7   -
BERT Wiki-WSCR (Kocijan et al., 2019)    71.9   72.2   -
BERT WSCR (Kocijan et al., 2019)         70

To further boost the WNLI accuracy on the GLUE benchmark leaderboard, we record the model prediction at each epoch, and then produce the final prediction based on majority voting over the last six model predictions. We refer to the ensemble of six models as HNN ensemble in Table 4. HNN ensemble brings a 5.4% absolute improvement (89.0% vs 83.6%) on WNLI in terms of accuracy.
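The epoch-level ensembling amounts to a per-example majority vote, sketched below. The predictions shown are made up for illustration. Because six voters can tie 3 to 3 on a binary label, a tie-breaking rule is needed; here ties go to the label predicted by the earliest of the six checkpoints, which is an assumption since the text does not specify one.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-example labels from several checkpoints by majority vote."""
    n_examples = len(predictions_per_model[0])
    voted = []
    for i in range(n_examples):
        votes = Counter(preds[i] for preds in predictions_per_model)
        # most_common keeps first-seen order among equal counts (Python 3.7+),
        # so ties resolve to the earliest checkpoint's label.
        voted.append(votes.most_common(1)[0][0])
    return voted

# Hypothetical predictions of the last six epochs on three examples.
last_six = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 1],
    [1, 0, 1],
    [0, 0, 1],
]
ensemble = majority_vote(last_six)
```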

Ablation study
In this section, we study the importance of each component in HNN by answering the following questions:

How important are the two component models: MLM and SSM?
To answer this question, we first remove each component model, either SSM or MLM, and then report the performance impact of these component models. Table 5 summarizes the experimental results. As expected, the removal of either component model results in a significant performance drop. For example, with the removal of SSM, the performance of HNN is downgraded from 77.1% to 74.5% on WNLI. Similarly, with the removal of MLM, HNN only obtains 75.1%, which amounts to a 2% drop. All these observations clearly demonstrate that SSM and MLM are complementary to each other and that the HNN model benefits from the combination of both.
Figure 2 gives several examples showing how SSM and MLM complement each other on WNLI. We see that in the first pair of examples, MLM correctly predicts the label while SSM does not. This is due to the fact that "the roof repaired" appears more frequently than "the tree repaired" in the text corpora used for model pre-training. However, in the second pair, since both "the demonstrators" and "the city councilmen" could advocate violence and neither occurs significantly more often than the other, SSM is more effective in distinguishing the difference based on their context. The proposed HNN, which combines the strengths of these two models, can obtain the correct results in both cases.

Does the additional ranking loss help?
As in Eqn. 7, the training objective of the HNN model contains three losses. The first two are based on the two component models, respectively, and the third one, as defined in Eqn. 8, is a ranking loss based on the score function in Eqn. 6. At first glance, the ranking loss seems redundant. Thus, we compare two versions of HNN trained with and without the ranking loss. Experimental results are shown in Table 6. We see that without the ranking loss, the performance of HNN drops on three datasets: WNLI, WSCR and WSC. On the PDP60 dataset, without the ranking loss, the model performs slightly better. However, since the test set of PDP60 includes only 60 samples, the difference is not statistically significant. Thus, we decide to always include the ranking loss in the training objective of HNN.

Is the WNLI task a ranking or classification task?

The WNLI task can be formulated as either a ranking task or a classification task. To study the difference in problem formulation, we conduct experiments to compare the performance of a model used as a classifier or a ranker. For example, given a trained HNN, when it is used as a classifier we set a threshold to decide the label (0/1) for each input. When it is used as a ranker, we simply pick the top-ranked candidate as the correct answer. We run the comparison using all three models: HNN, MLM and SSM. As shown in Figure 3, the ranking formulation is consistently better than the classification formulation for this task. For example, the difference for the HNN model is about 2.5% absolute (74.6% vs 77.1%) in terms of accuracy.
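The difference between the two formulations can be shown with a short sketch. The scores and the threshold of 0.5 are hypothetical; the point is only that a classifier labels each candidate independently, whereas a ranker always selects exactly one candidate per sentence.

```python
def classify(scores, threshold=0.5):
    """Label each candidate independently by thresholding its score."""
    return [1 if score >= threshold else 0 for score in scores]

def rank(scores):
    """Label only the top-scoring candidate as correct."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return [1 if i == best else 0 for i in range(len(scores))]

# Hypothetical scores for the candidates of one sentence.
scores = [0.42, 0.31, 0.12]
as_classifier = classify(scores)   # no candidate clears the threshold
as_ranker = rank(scores)           # the first candidate is still selected
```

When all scores for a sentence fall on the same side of the threshold, the classifier either accepts or rejects every candidate, while the ranker can still exploit their relative order.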

Conclusion
We propose a hybrid neural network (HNN) model for commonsense reasoning. HNN consists of two component models, a masked language model and a deep semantic similarity model, which share a BERT-based contextual encoder but use different model-specific input and output layers.
HNN obtains new state-of-the-art results on three classic commonsense reasoning tasks, pushing the WNLI benchmark to 89%, the WSC benchmark to 75.1%, and the PDP60 benchmark to 90.0%. We also justify the design of HNN via a series of ablation experiments.
In future work, we plan to extend HNN to more sophisticated reasoning tasks, especially those where large-scale language models like BERT and GPT do not perform well, as discussed in Gao et al. (2019) and Niven and Kao (2019).

Figure 1: Architecture of the hybrid model for commonsense reasoning. The model consists of two component models, a masked language model (MLM) and a semantic similarity model (SSM). The input includes the sentence S, which contains a pronoun to be resolved, and a candidate antecedent C. The two component models share the BERT-based contextual encoder, but use different model-specific input and output layers. The final output score is the combination of the two component model scores.
1. Premise: The cookstove was warming the kitchen, and the lamplight made it seem even warmer.
   Hypothesis: The lamplight made the cookstove seem even warmer.
   Index of LCS in the hypothesis: left[0, 2], right[5, 7]
   Candidate: [3, 4] (the cookstove)
2. Premise: The cookstove was warming the kitchen, and the lamplight made it seem even warmer.
   Hypothesis: The lamplight made the kitchen seem even warmer.
   Index of LCS in the hypothesis: left[0, 2], right[5, 7]
   Candidate: [3, 4] (the kitchen)
3. Premise: The cookstove was warming the kitchen, and the lamplight made it seem even warmer.
   Hypothesis: The lamplight made the lamplight seem even warmer.
   Index of LCS in the hypothesis: left[0, 2], right[5, 7]
   Candidate: [3, 4] (the lamplight)
4. Converted: The cookstove was warming the kitchen, and the lamplight made it seem even warmer.
   A. the cookstove B. the kitchen C. the lamplight

Figure 3: Comparison of different task formulations on WNLI.

Table 2: Summary of the three benchmark datasets: WSC, PDP60 and WNLI, and the additional dataset WSCR. Note that we only use WSCR for training. For WNLI, we merge its official training set containing 634 instances and dev set containing 71 instances as its final dev set.


Table 3: Examples of transforming WNLI to WSC format. Note that the text highlighted in brown is the longest common substring from the left part of the pronoun "it", and the text highlighted in violet is the longest common substring from its right.

Table 4: Test results.

Figure 2: Comparison with SSM and MLM on WNLI examples.

Table 5: Ablation study of the two component models in HNN. Note that WNLI and WSCR are reported on dev sets while WSC and PDP60 are reported on test sets.

Table 6: Ablation study of the ranking loss. Note that WNLI and WSCR are reported on dev sets while WSC and PDP60 are reported on test sets.