To Attend or not to Attend: A Case Study on Syntactic Structures for Semantic Relatedness

With the recent success of Recurrent Neural Networks (RNNs) in Machine Translation (MT), attention mechanisms have become increasingly popular. The purpose of this paper is two-fold: first, we propose a novel attention model on Tree Long Short-Term Memory Networks (Tree-LSTMs), a tree-structured generalization of the standard LSTM. Second, we study the interaction between attention and syntactic structures by experimenting with three LSTM variants: bidirectional LSTMs, Constituency Tree-LSTMs, and Dependency Tree-LSTMs. Our models are evaluated on two semantic relatedness tasks: semantic relatedness scoring for sentence pairs (SemEval 2012, Task 6 and SemEval 2014, Task 1) and paraphrase detection for question pairs (Quora, 2017).


Introduction
Recurrent Neural Networks (RNNs), in particular Long Short-Term Memory Networks (LSTMs) (Hochreiter and Schmidhuber, 1997), have demonstrated remarkable accomplishments in Natural Language Processing (NLP) in recent years. Several tasks, such as information extraction, question answering, and machine translation, have benefited from them. However, in their vanilla forms, these networks are constrained by the sequential order of tokens in a sentence. To mitigate this limitation, structural (dependency or constituency) information in a sentence has been exploited, with partial success, in various tasks (Goller and Kuchler, 1996; Yamada and Knight, 2001; Quirk et al., 2005; Socher et al., 2011; Tai et al., 2015).
On the other hand, alignment techniques (Brown et al., 1993) and attention mechanisms (Bahdanau et al., 2014) act as a catalyst to augment the performance of classical Statistical Machine Translation (SMT) and Neural Machine Translation (NMT) models, respectively. In short, both approaches focus on the sub-strings of the source sentence that are significant for predicting target words during translation. Currently, the combination of linear RNNs/LSTMs and attention mechanisms has become a de facto standard architecture for many NLP tasks.
At the intersection of sentence encoding and attention models, some interesting questions emerge: Can attention mechanisms be employed on tree structures, such as Tree-LSTMs (Tai et al., 2015)? If yes, what are the possible tree-based attention models? Do different tree structures (in particular constituency vs. dependency) have different behaviors in such models? With these questions in mind, we present our investigation and findings in the context of semantic relatedness tasks.

Long Short-Term Memory Networks (LSTMs)
Concisely, an LSTM network (Hochreiter and Schmidhuber, 1997) (Figure 1) includes a memory cell at each time step that controls how much information enters the cell, is forgotten, and is emitted by the cell. Various LSTM variants (Greff et al., 2017) have been explored to date; we focus on one representative form.
To be more precise, we consider an LSTM memory cell involving an input gate i_t, a forget gate f_t, and an output gate o_t at time step t.

Figure 1: A linear LSTM network. w_t is the word embedding, h_t is the hidden state vector, c_t is the memory cell vector, and y_t is the final processed output at time step t.

Apart from the hidden state h_{t−1} and the input embedding w_t of the current word, the recursive function in an LSTM also takes the previous time step's memory cell state, c_{t−1}, into account, which is not the case in a simple RNN. The following equations summarize an LSTM memory cell at time step t:

i_t = σ(W^(i) w_t + U^(i) h_{t−1} + b^(i))   (1)
f_t = σ(W^(f) w_t + U^(f) h_{t−1} + b^(f))   (2)
o_t = σ(W^(o) w_t + U^(o) h_{t−1} + b^(o))   (3)
u_t = tanh(W^(u) w_t + U^(u) h_{t−1} + b^(u))   (4)
c_t = i_t ⊙ u_t + f_t ⊙ c_{t−1}   (5)
h_t = o_t ⊙ tanh(c_t)   (6)

where d is the dimension of the hidden state vector and D is the dimension of the input word embedding, w_t.
• c t ∈ R d is the new memory cell vector at time step t.
As can be seen in Eq. 5, the input gate i_t limits the new information u_t through the element-wise multiplication operator ⊙. Moreover, the forget gate f_t regulates the amount of information retained from the previous state c_{t−1}. Therefore, the current memory state c_t partially includes both new information and information from the previous time step.

A natural extension of the LSTM network is the bidirectional LSTM (bi-LSTM), which passes the sequence through the architecture in both directions and aggregates the information at each time step. Again, it strictly preserves the sequential nature of LSTMs.
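Concretely, the cell update around Eq. 5 can be sketched in a few lines of NumPy. This is an illustrative re-implementation, not the code used in our experiments; the stacked weight layout and the dimensions d and D are our own choices:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(w_t, h_prev, c_prev, params):
    """One LSTM step: gates i_t, f_t, o_t, candidate u_t, then the memory update."""
    W, U, b = params          # W: (4d, D), U: (4d, d), b: (4d,)
    z = W @ w_t + U @ h_prev + b
    d = h_prev.shape[0]
    i = sigmoid(z[0:d])       # input gate: limits the new information u_t
    f = sigmoid(z[d:2*d])     # forget gate: regulates how much of c_{t-1} survives
    o = sigmoid(z[2*d:3*d])   # output gate
    u = np.tanh(z[3*d:4*d])   # candidate memory content
    c = i * u + f * c_prev    # Eq. 5: element-wise (Hadamard) gating
    h = o * np.tanh(c)        # hidden state emitted at time step t
    return h, c

# usage: one step over a random word embedding
rng = np.random.default_rng(0)
D, d = 5, 4
params = (rng.normal(size=(4 * d, D)) * 0.1,
          rng.normal(size=(4 * d, d)) * 0.1,
          np.zeros(4 * d))
h1, c1 = lstm_step(rng.normal(size=D), np.zeros(d), np.zeros(d), params)
```

A bi-LSTM simply runs this recurrence once left-to-right and once right-to-left and aggregates the two hidden states at each position.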

Linguistically Motivated Sentence Structures
Computational linguists have long had a natural inclination towards hierarchical structures of natural language, which follow guidelines collectively referred to as syntax. Typically, such structures manifest themselves in parse trees. We investigate two popular forms: constituency and dependency trees.

Constituency structure
Briefly, constituency trees (Figure 2:a) indicate a hierarchy of syntactic units and encapsulate phrase grammar rules. Moreover, these trees explicitly demonstrate groups of phrases (e.g., Noun Phrases) in a sentence. Additionally, they discriminate between terminal (lexical) and non-terminal (non-lexical) nodes.

Dependency structure
In short, dependency trees (Figure 2:b) describe the syntactic structure of a sentence in terms of the words (lemmas) and the associated grammatical relations among the words. Typically, these dependency relations are explicitly typed, which makes the trees valuable for practical applications such as information extraction, paraphrase detection, and semantic relatedness.

Tree Long Short-Term Memory Network (Tree-LSTM)
Child-Sum Tree-LSTM (Tai et al., 2015) is an epitome of structure-based neural networks, which explicitly capture the structural information in a sentence. Tai et al. (2015) show how information at a parent node can be consolidated selectively from each of its child nodes. Architecturally, each gated vector and memory state update of the head node depends on the hidden states of its children in the Tree-LSTM. Given a tree structure for a sentence, each node j of the structure incorporates the following equations:

h̃_j = Σ_{k ∈ C(j)} h_k   (7)
i_j = σ(W^(i) w_j + U^(i) h̃_j + b^(i))   (8)
f_jk = σ(W^(f) w_j + U^(f) h_k + b^(f))   (9)
o_j = σ(W^(o) w_j + U^(o) h̃_j + b^(o))   (10)
u_j = tanh(W^(u) w_j + U^(u) h̃_j + b^(u))   (11)
c_j = i_j ⊙ u_j + Σ_{k ∈ C(j)} f_jk ⊙ c_k   (12)
h_j = o_j ⊙ tanh(c_j)   (13)

where:
• w_j ∈ R^D represents the word embedding of every node in a dependency structure, and of terminal nodes only in a constituency structure (w_j is ignored for non-terminal nodes in a constituency structure by removing the W w_j terms in Equations 8-11).
• c_j ∈ R^d is the new memory state vector of node j.
• C(j) is the set of children of node j.
• f jk ∈ R d is the forget gate vector for child k of node j.

Referring to Equation 12, the new memory cell state c_j of node j receives the new information u_j partially. More importantly, it includes partial information from each of its direct children in the set C(j), through the corresponding forget gates f_jk.
When the Child-Sum Tree model is deployed on a dependency tree, it is referred to as Dependency Tree-LSTM, whereas a constituency-tree-based instantiation is referred to as Constituency Tree-LSTM.
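A single node update of the Child-Sum model (culminating in Equation 12) can be sketched as follows. This is an illustrative NumPy version under our own parameter layout, not the implementation used in our experiments:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def child_sum_node(w_j, child_h, child_c, params):
    """Child-Sum Tree-LSTM update at node j.
    child_h / child_c: lists of the children's hidden and memory vectors."""
    Wi, Ui, bi, Wf, Uf, bf, Wo, Uo, bo, Wu, Uu, bu = params
    # h_tilde: sum of the children's hidden states (zero vector at a leaf)
    h_tilde = np.sum(child_h, axis=0) if child_h else np.zeros_like(bi)
    i = sigmoid(Wi @ w_j + Ui @ h_tilde + bi)
    o = sigmoid(Wo @ w_j + Uo @ h_tilde + bo)
    u = np.tanh(Wu @ w_j + Uu @ h_tilde + bu)
    # one forget gate f_jk per child k, each gating that child's own memory c_k
    f = [sigmoid(Wf @ w_j + Uf @ h_k + bf) for h_k in child_h]
    c = i * u + sum(f_k * c_k for f_k, c_k in zip(f, child_c))  # Eq. 12
    h = o * np.tanh(c)
    return h, c

# usage: a head node with two children (random embeddings and weights)
rng = np.random.default_rng(1)
D, d = 5, 4
shapes = [(d, D), (d, d), (d,)] * 4
params = [rng.normal(size=s) * 0.1 for s in shapes]
h_j, c_j = child_sum_node(rng.normal(size=D),
                          [rng.normal(size=d) for _ in range(2)],
                          [rng.normal(size=d) for _ in range(2)], params)
```

Encoding a whole sentence amounts to applying this update bottom-up over the parse tree, which reduces to the linear LSTM when the tree degenerates to a chain.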

Attention Mechanisms
Alignment models were first introduced in statistical machine translation (SMT) (Brown et al., 1993), which connect sub-strings in the source sentence to sub-strings in the target sentence.
Recently, attention techniques (which are effectively soft alignment models) in neural machine translation (NMT) (Bahdanau et al., 2014) came into prominence, where attention scores are calculated by considering words of the source sentence while decoding words in the target language. Although effective attention mechanisms (Luong et al., 2015) such as the Global Attention Model (GAM) (Figure 4) and the Local Attention Model (LAM) have been developed, such techniques have not been explored over Tree-LSTMs.

Inter-Sentence Attention on Tree-LSTMs
We present two types of tree-based attention models in this section. With trivial adaptation, they can be deployed in the sequence setting (degenerated trees).

Modified Decomposable Attention (MDA)
Parikh et al. (2016)'s original decomposable inter-sentence attention model uses only word embeddings to construct the attention matrix, without any structural encoding of the sentences. Essentially, the model incorporates three components:

Attend: Input representations (without sequence or structural encoding) of both sentences, L and R, are soft-aligned.
Compare: A set of vectors is produced by separately comparing each sub-phrase of L to the sub-phrases in R. The vector representation of each sub-phrase in L is a non-linear combination of the representation of a word in sentence L and its aligned sub-phrase in sentence R. The same holds for the set of vectors for sentence R.
Aggregate: Both sets of sub-phrase vectors are summed up separately to form the final representations of sentence L and sentence R.
We augment the original decomposable inter-sentence attention model and generalize it to the tree (and sequence) setting. To be more specific, we consider two input sequences, L = (l_1, l_2, ..., l_{len_L}) and R = (r_1, r_2, ..., r_{len_R}), and their corresponding input representations, L̄ = (l̄_1, l̄_2, ..., l̄_{len_L}) and R̄ = (r̄_1, r̄_2, ..., r̄_{len_R}), where len_L and len_R represent the number of words in L and R, respectively.

MDA on dependency structure
Let's assume sequences L and R have dependency tree structures D_L and D_R. In this case, len_L and len_R represent the number of nodes in D_L and D_R, respectively. After using a Tree-LSTM to encode the trees, which results in D̄_L = (l̄_1, l̄_2, ..., l̄_{len_L}) and D̄_R = (r̄_1, r̄_2, ..., r̄_{len_R}), we gather unnormalized attention weights e_ij and normalize them as follows:

e_ij = l̄_i^T r̄_j   (14)
β_i = Σ_{j=1}^{len_R} [ exp(e_ij) / Σ_{k=1}^{len_R} exp(e_ik) ] r̄_j   (15)
α_j = Σ_{i=1}^{len_L} [ exp(e_ij) / Σ_{k=1}^{len_L} exp(e_kj) ] l̄_i   (16)

From the equations above, we can infer that the attention matrix has dimension len_L × len_R. In contrast to the original model, we compute the final representation of each sentence by concatenating the LSTM-encoded representation of the root with the attention-weighted representation of the root:

h_L = G([l̄_{root_L} ; β_{root_L}])   (17)
h_R = G([r̄_{root_R} ; α_{root_R}])   (18)

where G is a feed-forward neural network, and h_L and h_R are the final vector representations of input sequences L and R, respectively.
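The attention computation over the encoded trees can be sketched compactly. This is illustrative NumPy: the dot-product scoring for e_ij and the one-layer form of G are our assumptions for the sketch, not a specification of the trained model:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mda(H_L, H_R, root_L, root_R, G):
    """Modified decomposable attention over encoded trees.
    H_L: (len_L, d) and H_R: (len_R, d) hold the Tree-LSTM states of all nodes;
    root_L / root_R index the root nodes."""
    e = H_L @ H_R.T                     # unnormalized weights, shape (len_L, len_R)
    beta = softmax(e, axis=1) @ H_R     # attention-weighted reps for nodes of L
    alpha = softmax(e, axis=0).T @ H_L  # attention-weighted reps for nodes of R
    # concatenate each root's encoding with its attention-weighted counterpart
    h_L = G(np.concatenate([H_L[root_L], beta[root_L]]))
    h_R = G(np.concatenate([H_R[root_R], alpha[root_R]]))
    return h_L, h_R

# usage: random node encodings; G is a one-layer feed-forward network
rng = np.random.default_rng(2)
d = 4
H_L = rng.normal(size=(3, d))
H_R = rng.normal(size=(5, d))
W_g = rng.normal(size=(d, 2 * d)) * 0.1
G = lambda v: np.tanh(W_g @ v)
h_L, h_R = mda(H_L, H_R, root_L=0, root_R=0, G=G)
```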

MDA on constituency structure
Let's assume sequences L and R have constituency tree structures C_L and C_R. Moreover, assume C_L and C_R have N_L (> len_L) and N_R (> len_R) nodes in total, respectively. As in Section 3.1.1, the attention mechanism is employed after encoding the trees C_L and C_R. While encoding the trees, terminal and non-terminal nodes are handled in the same way as in the original Tree-LSTM model (see Section 2.3).
It should be noted that we collect the hidden states of all the nodes (N_L and N_R) individually in C_L and C_R during the encoding process. Hence, the hidden state matrix has dimension N_L × d for tree C_L, whereas for tree C_R it has dimension N_R × d, where d is the dimension of each hidden state. Therefore, the attention matrix has dimension N_L × N_R. Finally, we employ Equations 14-18 to compute the final representations of sequences L and R.

Progressive Attention (PA)
In this section, we propose a novel attention mechanism on Tree-LSTMs, inspired by Quirk et al. (2005) and Yamada and Knight (2001).

PA on dependency structure
Let's assume a dependency tree structure of sentence L = (l_1, l_2, ..., l_{len_L}) is available as D_L, where len_L represents the number of nodes in D_L. Similarly, tree D_R corresponds to the sentence R = (r_1, r_2, ..., r_{len_R}), where len_R represents the number of nodes in D_R.
In PA, the objective is to produce the final vector representation of tree D_R conditioned on the hidden state vectors of all nodes of D_L. Similar to the encoding process in NMT, we encode R by attending each node of D_R to all nodes in D_L. Let's call this process Phase 1. Next, in Phase 2, L is encoded in a similar way to obtain the final vector representation of D_L.
Referring to Figure 5 and assuming Phase 1 is being executed, a hidden state matrix H_L is obtained by concatenating the hidden state vectors of every node in tree D_L, where the number of nodes in D_L is 3. Next, tree D_R is processed by calculating the hidden state vector at every node. Assume that the current node being processed is n_R2 of D_R, with hidden state vector h_R2. Before further processing, normalized attention weights are calculated based on h_R2 and H_L. Formally,

H_pj = [h_pj; h_pj; ...; h_pj]  (h_pj stacked vertically x times)   (19)
con_pj = [H_q , H_pj]   (20)
a_pj = softmax(tanh(con_pj W_c + b) W_a)   (21)

where:
• p, q ∈ {L, R} and q ≠ p.
• H_q ∈ R^{x × d} is the matrix obtained by concatenating the hidden state vectors of the nodes in tree D_q; x is len_q of sentence q.
• H_pj ∈ R^{x × d} is the matrix obtained by stacking the hidden state h_pj vertically x times.
• con_pj ∈ R^{x × 2d} is the concatenated matrix.
• a_pj ∈ R^x contains the normalized attention weights at node j of tree D_p, where D_p is the dependency structure of sentence p.
• W_c ∈ R^{2d × d} and W_a ∈ R^d are learned weight matrices.
The normalized attention weights in the above equations align the sub-tree at the current node, n_R2, in D_R with the sub-trees available at all nodes in D_L. Next, a gated mechanism is employed to compute the final vector representation at node n_R2:

h̄_pj = Σ_{i=0}^{x−1} (a_pj ⊙ H_q)_i   (22)

where:
• h̄_pj ∈ R^d represents the final vector representation of node j in tree D_p.
• Σ_{i=0}^{x−1} represents a column-wise sum.

Assuming the final vector representation of tree D_R is h_R, the exact same steps are followed in Phase 2, with the exception that the entire process is now conditioned on tree D_R. As a result, the final vector representation of tree D_L, h_L, is computed.
Lastly, the vectors h_L and h_R are combined (Eqs. 23-24) with the corresponding representations computed without attention, h̄_L, h̄_R ∈ R^d, before calculating the angle and distance similarity (see Section 4).
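The per-node attention step of Equation 21 can be sketched as follows. This is illustrative NumPy under our reading of the dimensions (W_c ∈ R^{2d×d}, W_a ∈ R^d); the gated combination at the node is kept as a simple attention-weighted column sum:

```python
import numpy as np

def pa_attention(h_pj, H_q, W_c, b, W_a):
    """Progressive attention at node j of tree D_p against the x nodes of D_q.
    h_pj: (d,) current node's hidden state; H_q: (x, d) other tree's states.
    Returns the normalized weights a_pj and the weighted summary of H_q."""
    x = H_q.shape[0]
    H_pj = np.tile(h_pj, (x, 1))               # h_pj stacked vertically x times
    con = np.concatenate([H_q, H_pj], axis=1)  # con_pj, shape (x, 2d)
    scores = np.tanh(con @ W_c + b) @ W_a      # pre-softmax scores, shape (x,)
    a = np.exp(scores - scores.max())
    a /= a.sum()                               # softmax normalization (Eq. 21)
    summary = a @ H_q                          # column-wise weighted sum over D_q
    return a, summary

# usage (d, x, and the weight initialization are arbitrary)
rng = np.random.default_rng(3)
d, x = 4, 3
a_pj, summary = pa_attention(rng.normal(size=d), rng.normal(size=(x, d)),
                             rng.normal(size=(2 * d, d)) * 0.1,
                             np.zeros(d), rng.normal(size=d))
```

The key design point is visible here: the attention is computed while the current node is being encoded, not after both trees have been fully encoded.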

PA on constituency structure
Let C_L and C_R represent the constituency trees of L and R, respectively, where C_L and C_R have N_L (> len_L) and N_R (> len_R) nodes in total. Additionally, let's assume that trees C_L and C_R have the same configuration of nodes as in Section 3.1.2, and that the encoding of terminal and non-terminal nodes follows the same process as in Section 3.1.2. Assuming we have already encoded all N_L nodes of tree C_L using the Tree-LSTM, we obtain the hidden state matrix H_L with dimension N_L × d. Next, while encoding any node of C_R, we consider H_L, which results in an attention vector of shape N_L. Using Equations 19-22, we retrieve the final hidden state of the current node. Finally, we compute the representation of sentence R based on attention to sentence L. We perform Phase 2 with the same process, except that we now condition on sentence R. In summary, the progressive attention mechanism refers to all nodes in the other tree while encoding a node in the current tree, instead of waiting until the end of the structural encoding to establish cross-sentence attention, as was done in the decomposable attention model.

Semantic Relatedness for Sentence Pairs
In SemEval 2012, Task 6 and SemEval 2014, Task 1, every sentence pair has a real-valued score that depicts the extent to which the two sentences are semantically related to each other. A higher score implies higher semantic similarity between the two sentences. Vector representations h_L and h_R are produced using our Modified Decomp-Attn or Progressive-Attn models. Next, a similarity score ŷ between h_L and h_R is computed using the same neural network as the original Tree-LSTM (Tai et al., 2015), for the sake of a fair comparison:

h_× = h_L ⊙ h_R   (25)
h_+ = |h_L − h_R|   (26)
h_s = σ(W^(×) h_× + W^(+) h_+ + b^(h))   (27)
p̂_θ = softmax(W^(p) h_s + b^(p))   (28)
ŷ = r^T p̂_θ   (29)

where:
• r^T = [1, 2, ..., S].
• h_× ∈ R^d measures the sign similarity between h_L and h_R.
• h_+ ∈ R^d measures the absolute distance between h_L and h_R.

Following Tai et al. (2015), we convert the regression problem into a soft classification. We also use the same sparse distribution p defined in the original Tree-LSTM to transform the gold rating for a sentence pair, such that y = r^T p and ŷ = r^T p̂_θ ≈ y. The loss function is the KL-divergence between p and p̂:

J(θ) = (1/m) Σ_{k=1}^{m} KL( p^(k) ‖ p̂_θ^(k) )   (30)

where m is the number of sentence pairs in the dataset.
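The similarity layer (the h_× and h_+ features followed by a softmax over the scores 1..S) can be sketched as follows. This is an illustrative NumPy version of Tai et al. (2015)'s scorer; the weight shapes and random initialization are ours:

```python
import numpy as np

def relatedness_score(h_L, h_R, W_x, W_p, W_s, b_h, b_p, S=5):
    """Similarity scorer: sign similarity, absolute distance, a hidden sigmoid
    layer, then a softmax over the relatedness scores 1..S."""
    h_x = h_L * h_R                    # element-wise product (sign similarity)
    h_plus = np.abs(h_L - h_R)         # element-wise absolute distance
    pre = W_x @ h_x + W_p @ h_plus + b_h
    h_s = 1.0 / (1.0 + np.exp(-pre))   # hidden similarity layer
    z = W_s @ h_s + b_p
    p_hat = np.exp(z - z.max())
    p_hat /= p_hat.sum()               # predicted distribution over scores 1..S
    r = np.arange(1, S + 1)
    return float(r @ p_hat)            # expected score r^T p_hat, in [1, S]

# usage: random sentence representations and weights (d = 4, S = 5)
rng = np.random.default_rng(4)
d, S = 4, 5
score = relatedness_score(rng.normal(size=d), rng.normal(size=d),
                          rng.normal(size=(d, d)) * 0.1,
                          rng.normal(size=(d, d)) * 0.1,
                          rng.normal(size=(S, d)) * 0.1,
                          np.zeros(d), np.zeros(S), S=S)
```

The expected-score trick is what turns the regression target into a soft classification: training matches the predicted distribution to the sparse gold distribution p, and the real-valued prediction is recovered as r^T p̂_θ.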

Paraphrase Detection for Question Pairs
In this task, each question pair is labeled as either paraphrase or not; hence the task is binary classification. We use Eqs. 25-28 to compute the predicted distribution p̂_θ. The predicted label ŷ is:

ŷ = argmax_y p̂_θ(y)   (31)

The loss function is the negative log-likelihood:

J(θ) = −(1/m) Σ_{k=1}^{m} log p̂_θ( y^(k) )   (32)

Semantic Relatedness for Sentence Pairs
We utilized two different datasets: • The Sentences Involving Compositional Knowledge (SICK) dataset (Marelli et al., 2014), which contains a total of 9,927 sentence pairs. Specifically, the dataset has a 4500/500/4927 split among training, dev, and test. Each sentence pair has a score S ∈ [1,5], which represents an average of 10 different human judgments collected by crowd-sourcing techniques.
• The MSRpar dataset (Agirre et al., 2012), which consists of 1,500 sentence pairs. In this dataset, each pair is annotated with a score S ∈ [0,5] and has a split of 750/750 between training and test.
We used the Stanford parsers (Chen and Manning, 2014; Bauer) to produce dependency and constituency parses of the sentences. Moreover, we initialized the word embeddings with 300-dimensional GloVe vectors (Pennington et al., 2014); the word embeddings were held fixed during training. We experimented with different optimizers, among which AdaGrad performed the best. We used a learning rate of 0.025 and a regularization penalty of 10^−4, without dropout.

Paraphrase Detection for Question Pairs
For this task, we utilized the Quora dataset (Iyer; Kaggle, 2017). Given a pair of questions, the objective is to identify whether they are semantic duplicates. It is a binary classification problem where a duplicate question pair is labeled as 1 and otherwise as 0. The training set contains about 400,000 labeled question pairs, whereas the test set consists of 2.3 million unlabeled question pairs. Moreover, the training dataset has only 37% positive samples; the average length of a question is 10 words. Due to hardware and time constraints, we extracted 50,000 pairs from the original training set while maintaining the same positive/negative ratio. A stratified 80/20 split was performed on this subset to produce the training/test set. Finally, 5% of the training set was used as a validation set in our experiments.
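The stratified split can be reproduced with a few lines of NumPy. This is a sketch of the procedure; the actual tooling and seed used in our experiments may differ:

```python
import numpy as np

def stratified_split(labels, test_frac=0.2, seed=0):
    """Split indices so that each class keeps the same proportion in both parts."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        idx = np.where(labels == cls)[0]
        rng.shuffle(idx)
        cut = int(round(len(idx) * test_frac))
        test_idx.extend(idx[:cut])
        train_idx.extend(idx[cut:])
    return np.array(train_idx), np.array(test_idx)

# usage: 37% positive labels, mirroring the Quora training subset
labels = np.array([1] * 37 + [0] * 63)
train_idx, test_idx = stratified_split(labels, test_frac=0.2, seed=0)
```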
We used a training configuration identical to that of the semantic relatedness task, since the essence of both tasks is practically the same. We also pre-processed the data and then parsed the sentences using the Stanford parsers.

Table 1 summarizes our results. Following (Marelli et al., 2014), we compute three evaluation metrics: Pearson's r, Spearman's ρ, and Mean Squared Error (MSE). We compare our attention models against the original Tree-LSTM (Tai et al., 2015), instantiated on both constituency and dependency trees. We also compare earlier baselines with our models, with the best results in bold. Since the Tree-LSTM is a generalization of the linear LSTM, we also implemented our attention models on a linear bidirectional LSTM (bi-LSTM). All results are averages over 5 runs. We observe that the Progressive-Attn mechanism combined with the Constituency Tree-LSTM is overall the strongest contender, but PA failed to yield any performance gain on the Dependency Tree-LSTM in either dataset.

Table 2 summarizes our results on the paraphrase detection task, with the best results within each category highlighted in bold. It should be noted that Quora is a new dataset and we analyzed only 50,000 samples; therefore, to the best of our knowledge, there is no published baseline result yet. For this task, we considered four standard evaluation metrics: accuracy, F1-score, precision, and recall. The Progressive-Attn + Constituency Tree-LSTM model still exhibits the best performance by a small margin, but the Progressive-Attn mechanism works surprisingly well on the linear bi-LSTM.

Table 3 illustrates how various models operate on two sentence pairs from the SICK test dataset. As we can infer from the table, the first pair demonstrates an instance of the active-passive voice phenomenon. In this case, the linear LSTM and the vanilla Tree-LSTMs really struggle to perform.
However, when our progressive attention mechanism is integrated into syntactic structures (dependency or constituency), we witness a boost in the semantic relatedness score. Such desirable behavior is consistently observed in multiple active-passive voice pairs. The second pair points to a possible issue in data annotation. Despite the presence of strong negation, the gold-standard score is 4 out of 5 (indicating high relatedness). Interestingly, the Progressive-Attn + Dependency Tree-LSTM model favors the negation facet and outputs a low relatedness score.

Discussion
In this section, let's revisit our research questions in light of the experimental results. First, can attention mechanisms be built for Tree-LSTMs? Does it work? The answer is yes. Our novel progressive-attention Tree-LSTM model, when instantiated on constituency trees, significantly outperforms its counterpart without attention. The same model can also be deployed on sequences (degenerated trees) and achieve quite impressive results.
Second, the performance gap between the two attention models is quite striking, in the sense that the progressive model completely dominates its decomposable counterpart. The difference between the two models is the pacing of attention, i.e., when to refer to nodes in the other tree while encoding a node in the current tree. The progressive attention model garners its empirical superiority by attending while encoding, instead of waiting until the end of the structural encoding to establish cross-sentence attention. In retrospect, this may explain why the original decomposable attention model of (Parikh et al., 2016) achieved competitive results without any LSTM-type encoding: effectively, it implemented a naive version of our progressive attention model.

Third, do structures matter/help? The overall trend in our results is quite clear: the tree-based models exhibit convincing empirical strength; linguistically motivated structures are valuable. Admittedly though, on the relatively large Quora dataset, we observe some diminishing returns of incorporating structural information. It is not counter-intuitive that the sheer size of the data can allow structural patterns to emerge, hence lessening the need to explicitly model syntactic structures in neural architectures.
Last but not least, in trying to assess the impact of attention mechanisms (in particular the progressive attention model), we notice that the extra mileage gained on different structural encodings differs. Specifically, the performance lift on the linear bi-LSTM exceeds the performance lift on the Constituency Tree-LSTM, and PA struggles to achieve any lift on the Dependency Tree-LSTM. Interestingly enough, this observation is echoed by an earlier study (Gildea, 2004), which showed that tree-based alignment models work better on constituency trees than on dependency trees.
In summary, our results and findings lead to several intriguing questions and conjectures, which call for investigation beyond the scope of our study:
• Is it reasonable to conceptualize attention mechanisms as an implicit form of structure, which complements the representational power of explicit syntactic structures?
• If yes, does there exist some trade-off between the modeling efforts invested into syntactic and attention structures respectively, which seemingly reveals itself in our empirical results?
• The marginal impact of attention on dependency Tree-LSTMs suggests some form of saturation effect. Does that indicate a closer affinity between dependency structures (relative to constituency structures) and compositional semantics (Liang et al., 2013)?
• If yes, why is dependency structure a better stepping stone for compositional semantics? Is it due to the strongly lexicalized nature of the grammar? Or is it because the dependency relations (grammatical functions) embody more semantic information?

Conclusion
In conclusion, we proposed a novel progressive attention model on syntactic structures, and demonstrated its superior performance in semantic relatedness tasks. Our work also provides empirical ingredients for potentially profound questions and debates on syntactic structures in linguistics.