Incorporating Contextual and Syntactic Structures Improves Semantic Similarity Modeling

Semantic similarity modeling is central to many NLP problems such as natural language inference and question answering. Syntactic structures interact closely with semantics in learning compositional representations and alleviating long-range dependency issues. However, such structure priors have not been well exploited in previous work for semantic modeling. To examine their effectiveness, we start with the Pairwise Word Interaction Model, one of the best models according to a recent reproducibility study, then introduce components for modeling context and structure using multi-layer BiLSTMs and TreeLSTMs. In addition, we introduce residual connections to the deep convolutional neural network component of the model. Extensive evaluations on eight benchmark datasets show that incorporating structural information contributes to consistent improvements over strong baselines.


Introduction
Modeling the semantic similarity between a pair of sentences is a fundamental task in natural language processing. It is the core problem of many tasks such as question answering (He et al., 2015; Rao et al., 2017; Wang et al., 2018) and query ranking (Mitra and Craswell, 2019). Recently, various neural networks have been proposed for textual similarity modeling. These models share three main components: (1) sequential sentence encoders, which incorporate word context and sentence order for better sentence representations, e.g., by using recurrent neural networks (RNNs; Mikolov et al., 2010; Seo et al., 2016), (2) interaction and attention mechanisms, which use the encoding outputs of sentences to calculate or reweight salient word pair interactions (He and Lin, 2016; Chen et al., 2017), and (3) structure modeling, which incorporates syntactic parsing information as an intuitive structure prior for sentence modeling (Chen and Manning, 2014; Zhao et al., 2016; Chen et al., 2017).
Our work is inspired by the recent reproducibility study by Lan and Xu (2018), which examines many neural network architectures for semantic similarity modeling through extensive evaluations on multiple benchmark datasets. Their results suggest that syntactic structure information captured by a TreeLSTM encoder either provides few benefits or even hurts performance. Structure information has often been overlooked in recent semantic modeling methods, such as InferSent (Conneau et al., 2017), DecAtt (Parikh et al., 2016), and BiMPM (Wang et al., 2017). It is not yet clear whether the syntactic structures implicitly captured by sequential modeling of texts from large annotated data or existing structure modeling techniques (Tai et al., 2015; Kipf and Welling, 2017) are effective in learning tree representations.
To further explore the effects of tree structures in sentence modeling, we start with the Pairwise Word Interaction Model (PWIM) of He and Lin (2016) as our base architecture, which has shown strong performance on various datasets from Lan and Xu (2018). In summary, PWIM uses a Bi-LSTM to learn word-level context vectors from both input sentences and builds a novel similarity focus layer with pairwise metrics to identify important word pairs. It then converts the similarity measurement problem to a pattern recognition problem for the final classification. We argue that PWIM approaches semantic modeling from a word-level matching perspective, and hence fails to capture syntactic and contextual semantics. To this end, we add multi-layer BiLSTMs with shortcut connections to capture long-range context, as well as TreeLSTM encoders to capture the syntactic structure of sentences.
We conduct thorough evaluations across eight datasets in four NLP tasks: paraphrase identification, semantic textual similarity, natural language inference, and answer sentence selection.
Pairwise word interactions are modeled through a multi-metric comparison unit coU, which computes cosine distance, $L_2$ distance, and dot-product distance over two hidden states. This comparison unit is applied not only to the forward and backward hidden states $h^{for}_t$ and $h^{back}_t$, but also to their concatenation. The output is a similarity tensor in $\mathbb{R}^{13 \times |sent_1| \times |sent_2|}$, with one extra dimension for the padding indicator. Instead of using attention weight vectors or weighted representations, He and Lin apply a focus layer on the similarity tensor to decrease the weights of unimportant word interactions by a factor of ten. They then consider the tensor as an "image" with 13 channels and use a 19-layer-deep convolutional neural network to predict the final classification.
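The multi-metric comparison described above can be illustrated with a minimal sketch. It is not the PWIM implementation: the function names `comparison_unit` and `similarity_grid`, the toy hidden states, and the zero-vector guard are assumptions made for illustration, and only the three metrics for a single pair of hidden-state sequences are computed (the full tensor has 13 channels).

```python
import math

def comparison_unit(h1, h2):
    """Return (cosine, L2 distance, dot product) for two hidden-state vectors."""
    dot = sum(a * b for a, b in zip(h1, h2))
    norm1 = math.sqrt(sum(a * a for a in h1))
    norm2 = math.sqrt(sum(b * b for b in h2))
    # Assumption: treat the cosine of a zero vector (e.g. padding) as 0.
    cos = dot / (norm1 * norm2) if norm1 > 0 and norm2 > 0 else 0.0
    l2 = math.sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)))
    return cos, l2, dot

def similarity_grid(states1, states2):
    """Apply the comparison unit to every word pair.

    Returns a nested list indexed as [metric][word in sent1][word in sent2].
    """
    grid = [[[0.0] * len(states2) for _ in states1] for _ in range(3)]
    for i, h1 in enumerate(states1):
        for j, h2 in enumerate(states2):
            for k, value in enumerate(comparison_unit(h1, h2)):
                grid[k][i][j] = value
    return grid

# Toy hidden states: 2 words in sentence 1, 3 words in sentence 2.
sent1 = [[1.0, 0.0], [0.0, 2.0]]
sent2 = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
grid = similarity_grid(sent1, sent2)
```

Each of the three channels is a |sent1| × |sent2| grid, which is what allows the stacked channels to be treated as an image by the downstream convolutional network.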

Residual Connections
Since He and Lin (2016) phrase the similarity measurement problem as a pattern recognition (image processing) problem and apply deep convolutional neural networks, we explore the addition of residual connections (He et al., 2016) to deal with the potential vanishing gradient problems in deep networks. A building block is defined as $y = f(x, W_i) + x$, where $x$ and $y$ are the input and output of the layer considered, and $f(x, W_i)$ is the learned residual mapping.
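As a minimal sketch of the building block $y = f(x, W_i) + x$, the example below uses a linear map followed by a ReLU as a stand-in for $f$. This choice of $f$, and the helper names `relu`, `linear`, and `residual_block`, are assumptions for illustration; in the model itself the residual mapping is formed by convolutional layers.

```python
def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, a) for a in v]

def linear(x, weights):
    """Multiply a square weight matrix (list of rows) by the vector x."""
    return [sum(w * a for w, a in zip(row, x)) for row in weights]

def residual_block(x, weights):
    """Compute y = f(x, W) + x, with f a hypothetical linear + ReLU mapping."""
    fx = relu(linear(x, weights))
    # The identity shortcut: the input is added back to the learned residual.
    return [f + a for f, a in zip(fx, x)]

x = [1.0, -2.0]
zero_weights = [[0.0, 0.0], [0.0, 0.0]]
y = residual_block(x, zero_weights)  # f(x) is zero, so the block is the identity
```

With zero weights the block reduces to the identity, which shows why the shortcut gives gradients a direct path through deep stacks of layers.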

Hybrid Inference Model with Parse Trees
While stacked BiLSTMs capture long-term dependency and contextual information over each sentence, we are also interested in investigating explicit hierarchical relationships among linguistic phrases and clauses. To incorporate this domain-specific information, we use the Dependency TreeLSTM (Tai et al., 2015), whose nodes condition their components on the sum of the hidden states of their children. Suppose $h_L$ and $h_R$ are the sentence representations in the pair over the parse tree of each sentence: to model similarity, we compute the element-wise product $h_L \odot h_R$ and absolute difference $|h_L - h_R|$. Then, we feed the two similarity vectors to a fully-connected layer with softmax whose output is the probability distribution over labels. To compute the final label for the sentence pair, we interpolate between the output probabilities of this model and those of PWIM. Chen et al. (2017) also incorporate tree structures produced by a constituency parser into the ESIM model, then average the predicted probabilities.
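The comparison and interpolation steps above can be sketched as follows. The sentence vectors, the weight matrix, the PWIM probabilities, and the interpolation weight of 0.5 are all hypothetical values chosen for illustration; the TreeLSTM encoder that would produce $h_L$ and $h_R$ is not shown.

```python
import math

def pair_features(h_left, h_right):
    """Concatenate the element-wise product and the absolute difference."""
    product = [a * b for a, b in zip(h_left, h_right)]
    abs_diff = [abs(a - b) for a, b in zip(h_left, h_right)]
    return product + abs_diff

def fully_connected(features, weights, bias):
    """One linear layer: a row of weights and a bias per output label."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def interpolate(p_tree, p_pwim, weight=0.5):
    """Mix two label distributions; weight=0.5 is an assumed setting."""
    return [weight * a + (1.0 - weight) * b for a, b in zip(p_tree, p_pwim)]

# Hypothetical TreeLSTM sentence representations for the two sentences.
features = pair_features([1.0, 2.0], [3.0, -1.0])
# Hypothetical layer parameters for a two-label task.
weights = [[0.1, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.1]]
p_tree = softmax(fully_connected(features, weights, [0.0, 0.0]))
p_pwim = [0.6, 0.4]  # hypothetical PWIM output probabilities
p_final = interpolate(p_tree, p_pwim)
label = max(range(len(p_final)), key=p_final.__getitem__)
```

Because both inputs to `interpolate` are probability distributions, the mixture is also a distribution, and the final label is simply its argmax.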

Experimental Setup
We conducted experiments on eight separate datasets (one natural language inference dataset, two paraphrase identification datasets, three SemEval competition datasets, and two QA datasets), which are as follows:
• SNLI (Bowman et al., 2015) is a collection of 570k manually-labeled sentence pairs for
• WikiQA (Yang et al., 2015) is an open-domain question-answering dataset. After applying the same pre-processing methods in He and Lin (2016), it contains 12k question-answer pairs with binary labels.
• TrecQA (Wang et al., 2007) is from the Text Retrieval Conferences and consists of 56k question-answer pairs.
The first seven datasets are the ones examined in Lan and Xu (2018), with the exception of MNLI (Williams et al., 2018), which we omit because SNLI is much larger than MNLI for the task of natural language inference. We also add the SICK dataset (Marelli et al., 2014), which is unexplored in Lan and Xu (2018). Across multiple tasks and domains, we systematically compare our proposed models with state-of-the-art neural models: InferSent
Table 1 shows the results of our models on different datasets. The first block of the table contains figures copied directly from Lan and Xu (2018); note that they do not use SICK. PWIM$_{our}$ refers to our own implementation. Note that there are at least three independent open-source implementations of the PWIM base model that we are aware of, which confirms the robustness and reproducibility of the model. Most results from these implementations are consistent; however, for PIT-2015 and STS-2014, we observe some differences, which we were unable to reconcile even after contacting the previous authors. Thus, for comparison purposes, we report results from our base PWIM$_{our}$ implementation.

Effects of the Multi-Layer BiLSTM
The entry mPWIM$_{seq}$ denotes PWIM using multi-layer BiLSTMs for modeling the context of the input sentences and also incorporating residual connections in the final classification. On all datasets listed in the table, adding multi-layer BiLSTMs leads to higher performance than that of the original model PWIM$_{our}$. SSE (Nie and Bansal, 2017) is a stacked Bi-LSTM model with shortcut connections and fine-tuning of word embeddings. Unlike our setting, where each word is represented by its own hidden state in the final output layer, SSE applies max-pooling over time to the output of the last BiLSTM layer to extract the final sentence feature vector. Based on Table 1, mPWIM$_{seq}$ clearly outperforms SSE on Twitter, PIT-2015, STS-2014, WikiQA, and TrecQA. However, for the SNLI and Quora datasets, SSE slightly exceeds mPWIM$_{seq}$ by 0.4% and 1.6%, respectively. SNLI and Quora have the largest training data among all the datasets, with 550k and 393k training sentence pairs, respectively, which suggests that SSE performs better on larger data beyond a certain threshold. We surmise that as the dataset increases in size, the simplicity of SSE will have more performance advantages.
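To make the contrast with SSE concrete, the following sketch shows max-pooling over time, which collapses the per-word hidden states of a sentence into one fixed-size vector. The function name `max_pool_over_time` and the toy hidden states are assumptions for illustration, not taken from either model's code.

```python
def max_pool_over_time(hidden_states):
    """Take the maximum of each hidden dimension across all time steps."""
    return [max(column) for column in zip(*hidden_states)]

# Toy output of a last BiLSTM layer: 3 words, 3 hidden dimensions.
states = [
    [0.1, 0.9, -0.3],
    [0.7, 0.2, 0.0],
    [0.4, 0.5, 0.8],
]
sentence_vector = max_pool_over_time(states)
```

The pooled vector has one entry per hidden dimension regardless of sentence length, whereas the per-word setting keeps all rows of `states` so that word pairs can still be compared individually.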

Effects of TreeLSTM
mPWIM$_{seq+tree}$ further enhances mPWIM$_{seq}$ by incorporating TreeLSTMs based on syntactic parse trees of each sentence. It averages the prediction probabilities of the PWIM using multi-layer BiLSTMs and TreeLSTMs separately to arrive at the final label. ESIM$_{seq+tree}$ also computes its final predictions by averaging prediction probabilities of two ESIM variants that use BiLSTMs and TreeLSTMs as sentence encoders, respectively.
From the table, we observe that adding TreeLSTMs to the ESIM model only marginally helps or has no effect for most datasets. On the other hand, TreeLSTM complements PWIM well: for WikiQA, it increases mean average precision (MAP) by 1.8% and mean reciprocal rank (MRR) by 2.3%. TreeLSTM also contributes to a 1.1% increase in the F1-measure for PIT-2015, 0.9% Pearson's r for SICK, and 0.7% MAP for TrecQA.
We hypothesize that these observed differences can be attributed to the model architectures. The inference model of ESIM is based on chain LSTMs, which might encode overlapping information with TreeLSTMs. For PWIM, the sentence context information is transformed into pairwise word interaction similarity units, and then a 19-layer-deep CNN exploits the spatially localized patterns. During this process, its focus is word-level similarities in sentences. The syntactic parsing structure introduced by TreeLSTM compensates for some of the information deficiencies. Notably, TreeLSTM does not help PWIM on the Twitter dataset; this makes sense, as Kong et al. (2014) note that many elements in tweets have no syntactic function, including hashtags and URLs. Furthermore, tweets often contain multiple fragments, each with its own syntactic span. Both of these issues may degrade the quality of the syntactic modeling of tweets.
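For readers less familiar with the ranking metrics reported above for WikiQA and TrecQA, the sketch below computes MAP and MRR from ranked relevance labels using their standard definitions. The toy rankings are invented for illustration, and scoring a question with no relevant candidate as 0 is an assumption; evaluation scripts often filter such questions instead.

```python
def average_precision(labels):
    """labels: relevance flags (1 or 0) of candidate answers in ranked order."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(labels, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision at this relevant position
    return total / hits if hits else 0.0

def reciprocal_rank(labels):
    """Inverse rank of the first relevant candidate, or 0 if there is none."""
    for rank, relevant in enumerate(labels, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0

def mean(values):
    return sum(values) / len(values)

# Two hypothetical questions, each with three ranked candidate answers.
rankings = [[0, 1, 1], [1, 0, 0]]
map_score = mean([average_precision(r) for r in rankings])
mrr_score = mean([reciprocal_rank(r) for r in rankings])
```

MRR only rewards the position of the first correct answer, while MAP also accounts for how highly the remaining correct answers are ranked.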

Sample Visualization and Analysis
To better understand why our models achieve improved effectiveness, we visualize the cosine values of the focusCube (the final output of the similarity layer) for pairwise word interactions in mPWIM$_{seq}$ and mPWIM$_{seq+tree}$, using the same method as Chen et al. (2017), where darker colors indicate stronger pairwise word interactions.
In Figure 1, we show visualizations from two pairs of sentences from SICK: 1a and 1b form a contrastive pair, as do 1c and 1d. We see that, in both cases, the TreeLSTM helps the model find syntactically important pairwise word interactions. For example, in Figure 1a, for the mPWIM$_{seq}$ model, the cluster of dark patches near the top shows obviously irrelevant correspondences, e.g., "on" with "comfortably", "dead" with "tree", and several of the articles are misaligned with respect to their positions in phrase structure. With the incorporation of syntactic information in Figure 1b, the correspondences are much more accurate.
We see that this is similarly the case when comparing Figures 1c and 1d, where the TreeLSTM yields more accurate correspondences. With the TreeLSTM, the model has learned the correct correspondence between "is being jumped" and "is jumping over", whereas without the syntactic structure, the correspondences are quite muddled. In both cases, we observe that mPWIM$_{seq+tree}$ is able to capture the passive construction for paraphrase detection.

Conclusion
We examine whether incorporating contextual and syntactic structures can improve semantic similarity modeling. We extend the strong PWIM model with additional components, TreeLSTMs and multi-layer BiLSTMs, to capture syntax and context information. Thorough experiments on eight datasets show that our improved models achieve consistent gains in effectiveness.