Improving Text Understanding via Deep Syntax-Semantics Communication

Recent studies show that integrating syntactic tree models with sequential semantic models can bring improved task performance, but these methods mostly employ shallow integration of syntax and semantics. In this paper, we propose a deep neural communication model between syntax and semantics to improve the performance of text understanding. Local communication is performed between the syntactic tree encoder and the sequential semantic encoder for mutual learning and information exchange. Global communication further ensures comprehensive information propagation. Results on multiple syntax-dependent tasks show that our model outperforms strong baselines by a large margin. In-depth analysis indicates that our method is highly effective in composing sentence semantics.

Recent studies show that integrating syntactic tree models with sequential semantic models can bring improved performance for syntax-dependent tasks (Shi et al., 2016; Havrylov et al., 2019), such as semantic role labeling (SRL) (Wang et al., 2019) and natural language inference (NLI) (Chen et al., 2017). Intuitively, sequential semantic models and syntactic tree models play different roles in text modeling: sequential semantic models learn representations via adjacent neighborhoods, while syntactic tree models encode texts through structural connections. Taking the two sentence pairs from the NLI task in Figure 1 as an example, sentences A and B in example (a) share the same dependency structure but have irrelevant semantics, and tree models are more suitable and effective than sequential models for capturing the semantic difference in this case. In example (b), the two sentences convey very similar semantics but have different syntactic structures. Therefore, the two types of models should interact closely in learning compositional representations for better understanding of the texts.
However, existing efforts integrate tree and sequential models in a straightforward way, such as representation concatenation (Chen et al., 2017; Vashishth et al., 2019) or multi-task learning (Shi et al., 2016; Swayamdipta et al., 2018), which limits the performance of end tasks. We believe that better integration can be achieved when adequate interaction between the sequential semantic encoder and the syntactic tree encoder takes place during learning, improving the performance of end tasks and also alleviating the long-range dependency problem.
To this end, we present a novel deep syntax-semantics communication model, as shown in Figure 2. In particular, sequential and dependency-tree based submodels are used to encode the input sentence separately. Local communication is performed between the submodels during learning for information exchange over consecutive words in a sentence. Meanwhile, the two submodels are treated as one entire unit, performing global propagation at the sentence level over recurrent steps. In addition, we employ a gating mechanism to control the information flow of each node at each time step during global communication.
Experimental results on a wide range of syntax-dependent NLP tasks show that our model outperforms strong baselines by a large margin, offering an alternative for better integration of sequential and tree models. Further analysis indicates that our method is highly effective in composing sentence semantics, verifying the importance of integrating syntax and semantics for text understanding.

Related Work
Neural sequential models have been widely used for encoding texts in the NLP community, due to their effectiveness in capturing semantics. Representative models such as LSTM, GRU, Transformer, ELMo and BERT have been employed for various NLP tasks, including language modeling (Sundermeyer et al., 2012), machine translation (Bahdanau et al., 2015), question answering, etc. On the other hand, some efforts are devoted to developing hierarchical tree models such as TreeLSTM and GCN, based on syntactic structures (e.g., dependency trees). Such tree encoders, equipped with external syntactic knowledge, can bring further improvements for some NLP tasks, especially syntax-dependent ones (Tai et al., 2015; Looks et al., 2017; Zhang and Zhang, 2019; Fei et al., 2020b,a), such as SRL (Swayamdipta et al., 2016; Wang et al., 2019; Fei et al., 2020c), NLI (Chen et al., 2017) and relation classification (Liu et al., 2015; Tran et al., 2019).

In recent years, exploring the correlation between syntax and semantics has become a hot research topic. Previous work has shown a strong correlation between syntax and semantics, and has proven that integrating syntactic tree models with sequential models can improve the performance of end tasks (Swayamdipta et al., 2016; Shi et al., 2016; Looks et al., 2017). For example, Shi et al. (2016) simultaneously conducted syntax parsing and semantic role labeling via a multi-task training strategy. Swayamdipta et al. (2018) incorporated syntactic features into semantic parsing tasks by multi-task learning. Vashishth et al. (2019) concatenated contextualized semantic representations with syntactic tree representations to improve word embeddings. More recently, a multi-layer BiLSTM with shortcut connections has been added to the Pairwise Word Interaction model for capturing the semantics and syntactic structure of sentences. However, these methods only use shallow integration of syntax and semantics, limiting the performance of end tasks.
Our model is inspired by Zhang et al. (2019), who introduced a novel method allowing sufficient communication between different tree models for sentiment analysis. Unlike their work, this paper is dedicated to realizing deep communication between a syntactic tree model and a sequential semantic model for improving text understanding. The idea of sentence-level propagation in our work is partially related to Zhang et al. (2018), who propose a novel LSTM architecture in which a set of global states is used for sentence-level propagation along recurrent steps, rather than the incremental reading of a sequence of words in a vanilla sequential LSTM. Compared with their model, our model is more effective in composing the semantic information of texts.

Model
In this paper, we propose a deep neural communication model between syntax and semantics to improve the performance of text understanding. Local communication is performed between the two encoders for information exchange at each node, as illustrated in Figure 3. Global communication is performed over the entire framework across recurrent steps for sufficient information propagation, as shown in Figure 2.

Baseline Encoders
For an input sentence $S = \{w_1, \cdots, w_n\}$ with sequential word representations $\{x^{seq}_1, \cdots, x^{seq}_n\}$ and dependency tree representations $\{x^{tree}_1, \cdots, x^{tree}_n\}$ from dependency parsing, the baseline sequential encoder and tree encoder generate contextualized representations separately, which are concatenated as the final node representation.

Sequential Encoder
We use a bidirectional LSTM (BiLSTM) as the sequential encoder for learning semantic information, which processes a sentence in the forward and backward directions based on the vanilla LSTM model. The forward node representation $\overrightarrow{h}_i$ is computed by a forward LSTM:

$$i_i = \sigma(W_i x^{seq}_i + U_i \overrightarrow{h}_{i-1} + b_i)$$
$$f_i = \sigma(W_f x^{seq}_i + U_f \overrightarrow{h}_{i-1} + b_f)$$
$$o_i = \sigma(W_o x^{seq}_i + U_o \overrightarrow{h}_{i-1} + b_o)$$
$$u_i = \tanh(W_u x^{seq}_i + U_u \overrightarrow{h}_{i-1} + b_u)$$
$$c_i = f_i \odot c_{i-1} + i_i \odot u_i$$
$$\overrightarrow{h}_i = o_i \odot \tanh(c_i) \quad (1)$$

where $i_i$, $f_i$, $o_i$ and $u_i$ are the gates controlling the LSTM cell $c_i$ and the state $\overrightarrow{h}_i$, $W$, $U$ and $b$ are the parameters, $\sigma$ is the sigmoid function, and $\odot$ is the element-wise multiplication. Similarly, a backward LSTM yields the backward node representation $\overleftarrow{h}_i$ over the same input $S$. The BiLSTM takes the concatenation of $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ as the final node representation for the word $w_i$:

$$h^{seq}_i = [\overrightarrow{h}_i ; \overleftarrow{h}_i]$$
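As a minimal sketch of the forward LSTM transition above, the step can be written in plain Python; scalar states and hypothetical per-gate scalar weights stand in for the real vector states and weight matrices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One forward-LSTM step. W, U, b are dicts keyed by gate name
    ('i', 'f', 'o', 'u'); x, h_prev, c_prev are scalars here purely
    for illustration (a real encoder uses vectors and matrices)."""
    i = sigmoid(W['i'] * x + U['i'] * h_prev + b['i'])    # input gate
    f = sigmoid(W['f'] * x + U['f'] * h_prev + b['f'])    # forget gate
    o = sigmoid(W['o'] * x + U['o'] * h_prev + b['o'])    # output gate
    u = math.tanh(W['u'] * x + U['u'] * h_prev + b['u'])  # candidate
    c = f * c_prev + i * u                                # new cell state
    h = o * math.tanh(c)                                  # new hidden state
    return h, c

# A backward LSTM runs the same step over the reversed sentence;
# the BiLSTM node representation concatenates the two directions.
W = {g: 0.5 for g in 'ifou'}
U = {g: 0.1 for g in 'ifou'}
b = {g: 0.0 for g in 'ifou'}
h, c = lstm_step(1.0, 0.0, 0.0, W, U, b)
```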

Tree Encoder
We employ the dependency tree as the underlying structure, where all nodes are input words connected by directed edges, as in the sentences shown in Figure 1. We use two typical tree models for encoding the structure, TreeLSTM and GCN, both under a bidirectional setting.

The standard TreeLSTM encodes each node $j$ with its corresponding head word representation as input $x_j$. For the bottom-up TreeLSTM:

$$\tilde{h}_j = \sum_{k \in C(j)} h_k$$
$$i_j = \sigma(W^{(i)} x_j + U^{(i)} \tilde{h}_j + b^{(i)})$$
$$f_{jk} = \sigma(W^{(f)} x_j + U^{(f)} h_k + b^{(f)})$$
$$o_j = \sigma(W^{(o)} x_j + U^{(o)} \tilde{h}_j + b^{(o)})$$
$$u_j = \tanh(W^{(u)} x_j + U^{(u)} \tilde{h}_j + b^{(u)})$$
$$c_j = i_j \odot u_j + \sum_{k \in C(j)} f_{jk} \odot c_k$$
$$h_j = o_j \odot \tanh(c_j)$$

where $W$, $U$ and $b$ are parameters, and $C(j)$ is the set of child nodes of $j$. $h_j$, $i_j$, $o_j$ and $c_j$ denote the hidden state, input gate, output gate and memory cell of the node $j$, respectively, and $f_{jk}$ is a forget gate for each child $k$ of $j$. The top-down TreeLSTM has the same transitions as its bottom-up counterpart, except for the direction and number of dependent nodes. We use the concatenated representations from the two directions for each node:

$$h^{tree}_j = [h^{\uparrow}_j ; h^{\downarrow}_j]$$

Compared with TreeLSTM, GCN is more computationally efficient, performing tree propagation for each node in parallel with O(1) complexity. Consider the constructed dependency graph $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of bidirectional edges between heads and dependents. GCN can be viewed as a hierarchical node encoder, which represents the node $j$ at the $k$-th layer as follows:

$$h^{(k)}_j = \mathrm{ReLU}\Big(\sum_{q \in N(j)} W^{(k)} h^{(k-1)}_q + b^{(k)}\Big)$$

where $N(j)$ denotes the neighbors of $j$ and ReLU is a non-linear activation function. We take the final layer's output as the final tree representation $h^{tree}_j$.

Figure 3: Local communication between sequential encoder and tree encoder.
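A minimal sketch of one GCN layer over a toy dependency graph, assuming scalar node features and a single hypothetical scalar weight W in place of the layer's weight matrix:

```python
def gcn_layer(h, edges, W, b):
    """One GCN layer over a dependency graph: each node aggregates
    itself and its neighbors along bidirectional edges, applies the
    weight W and bias b, then a ReLU activation."""
    nbrs = {j: {j} for j in range(len(h))}    # self-loop for each node
    for head, dep in edges:                   # treat edges as bidirectional
        nbrs[head].add(dep)
        nbrs[dep].add(head)
    out = []
    for j in range(len(h)):
        agg = sum(h[q] for q in nbrs[j])      # neighborhood aggregation
        out.append(max(0.0, W * agg + b))     # ReLU
    return out

# toy 4-word sentence; edges are (head, dependent) pairs
h1 = gcn_layer([1.0, -0.5, 0.25, 2.0], edges=[(1, 0), (1, 2), (2, 3)],
               W=0.5, b=0.0)
```

Because every node is updated from the previous layer's states, all nodes can be computed in parallel, which is the source of GCN's efficiency advantage over TreeLSTM noted above.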

Deep Communication Model
We treat the sequential encoder and the tree encoder as one entire unit, and use the concatenation of the representations of the corresponding nodes as the final node representation $h^{all}_i$:

$$h^{all}_i = [h^{seq}_i ; h^{tree}_i]$$

As shown in Figure 2, alongside the inner-sentence local interaction between semantics and syntax, we meanwhile unroll the whole unit over recurrent steps $T$ for sentence-level global propagation. We denote the unit at recurrent step $t$ as $U^t = \{H^{seq,t}, H^{tree,t}, H^{all,t}\}$.

Local Interaction
The motivation of local communication is to encourage the sequential encoder and the tree encoder to learn more from each other's pattern of information propagation. In particular, consider the sequential encoder $H^{seq,t}$ and the tree encoder $H^{tree,t}$ in $U^t$, and the current word $w_i$, as shown in Figure 3. The main idea is to let the nodes in unit $U^t$ take as input their neighbor nodes from both the sequential and tree encoders at the last time step $t-1$, including the adjacent nodes in the sequential model and the connected nodes in the tree model (e.g., the parent node, $par$), which are all packed into a set $H^{nbs,t-1}$ (nbs means neighbors).
First, each node in the sequential encoder at current step $t$ takes as an additional input the neighbor nodes from the last time step:

$$h^{seq,t}_i = \mathrm{BiLSTM}([x^{seq}_i ; \hat{x}^{seq,t}_i])$$

where $\hat{x}^{seq,t}_i$ is the neighbor node representation obtained via the attentive operation:

$$\alpha_q = \frac{\exp(h^{seq,t-1}_i \cdot h^{nbs,t-1}_q)}{\sum_{q'} \exp(h^{seq,t-1}_i \cdot h^{nbs,t-1}_{q'})}, \qquad \hat{x}^{seq,t}_i = \sum_q \alpha_q\, h^{nbs,t-1}_q$$

where $h^{nbs,t-1}_q \in H^{nbs,t-1}$, excluding $h^{seq,t-1}_i$ itself. Similarly, each node in the tree encoder takes the additional neighbor node representation as input:

$$h^{tree,t}_j = \mathrm{TreeEnc}([x^{tree}_j ; \hat{x}^{tree,t}_j])$$

where $\hat{x}^{tree,t}_j$ is formulated via the same attentive operation over $h^{nbs,t-1}_q \in H^{nbs,t-1}$, excluding $h^{tree,t-1}_j$ itself.
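The attentive pooling of last-step neighbor states can be sketched as follows; scalar states and a plain dot-product score are simplifying assumptions, not the paper's exact parameterization.

```python
import math

def attend(query, neighbors):
    """Attentive pooling over neighbor states: softmax of
    dot-product scores, then a weighted sum of the neighbors."""
    scores = [query * h for h in neighbors]     # dot-product scores
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]    # numerically stable softmax
    z = sum(exps)
    weights = [e / z for e in exps]
    pooled = sum(w * h for w, h in zip(weights, neighbors))
    return pooled, weights

# query = current node's last-step state; neighbors = packed set H^{nbs,t-1}
pooled, weights = attend(0.8, [0.2, -0.1, 0.9])
```

The pooled value is then fed back into the encoder as the additional neighbor input for the next update.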

Sentence-level Global Propagation
During sentence-level propagation across recurrent steps $T$, information exchange between syntax and semantics in a sentence can be extended sufficiently and broadly, and information flow between consecutive words can be enhanced by capturing long-range dependencies.
We reach this goal by employing a context gate over the final node representation $h^{all,t}_i$. Formally,

$$c^t_i = \sigma(W_c [h^{all,t-1}_i ; \tilde{h}^{all,t}_i] + b_c)$$
$$h^{all,t}_i = c^t_i \odot h^{all,t-1}_i + (1 - c^t_i) \odot \tilde{h}^{all,t}_i$$

where $\tilde{h}^{all,t}_i$ is the ungated value from the concatenation of $h^{seq,t}_i$ and $h^{tree,t}_i$. The context gate $c^t_i$ for the node $w_i$ controls the contribution proportions of the history representation and the current representation at each step $t$.
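A scalar sketch of such a context gate, where w and b are hypothetical gate parameters: the gate interpolates between the history state and the current ungated value.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_update(h_prev, h_cur, w, b):
    """Context-gated update (scalars for illustration): the gate c
    in (0, 1) mixes the previous-step representation h_prev with the
    current ungated representation h_cur."""
    c = sigmoid(w * (h_prev + h_cur) + b)   # context gate
    return c * h_prev + (1.0 - c) * h_cur   # convex interpolation

h = gated_update(h_prev=0.6, h_cur=-0.2, w=1.0, b=0.0)
```

Because the update is a convex combination, the gated state always lies between the history and current values, which stabilizes propagation over many recurrent steps.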

Decoding and Training
We use a softmax classifier as the decoding layer:

$$y = \mathrm{softmax}(W r + b)$$

Technically, for sentence-level classification, the final sentence representation $r$ is the attentive aggregation over $h^{all,T}_i$:

$$\beta_i = \mathrm{softmax}(w^{\top} h^{all,T}_i), \qquad r = \sum_i \beta_i\, h^{all,T}_i$$

For sentence pair tasks, such as NLI, we take the element-wise product, subtraction, addition and concatenation of the two separate sentence representations as a whole:

$$r = [r_a ; r_b ; r_a \odot r_b ; r_a - r_b ; r_a + r_b]$$

where $r_a$ and $r_b$ are the corresponding sentence representations.
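The sentence-pair feature combination described above can be sketched directly:

```python
def pair_features(ra, rb):
    """Combine two sentence representations for pair tasks such as
    NLI: concatenate the two vectors with their element-wise
    product, difference, and sum (a 5*d feature vector)."""
    prod = [a * b for a, b in zip(ra, rb)]
    diff = [a - b for a, b in zip(ra, rb)]
    add = [a + b for a, b in zip(ra, rb)]
    return ra + rb + prod + diff + add

feats = pair_features([1.0, 2.0], [3.0, -1.0])
```

The product and difference terms expose element-wise similarity and contrast between the two sentences, which the downstream softmax classifier can exploit.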
For sequence-level classification, we directly use the final node representations $\{h^{all,T}_1, \cdots, h^{all,T}_n\}$, followed by a softmax decoder:

$$y_i = \mathrm{softmax}(W h^{all,T}_i + b)$$

The main-task cross-entropy loss can be represented as:

$$\mathcal{L} = -\sum_i \hat{y}_i \log y_i + \frac{\lambda}{2} \|\theta\|^2$$

where $\frac{\lambda}{2}\|\theta\|^2$ is the $l_2$ regularization term and $\hat{y}_i$ is the ground-truth label.
To avoid cold-start training, we first pre-train the standalone sequential encoder and tree encoder separately, and then take their parameters as the initial states of the framework at step 0, i.e., $H^{seq,0}$ and $H^{tree,0}$. Thereafter, we train the entire framework for a total of $N$ iterations with an early stopping strategy.
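The training schedule can be sketched as a generic early-stopping loop; `train_step`, `evaluate`, and `patience` are hypothetical names supplied by the caller, not identifiers from the paper.

```python
def train_with_early_stopping(train_step, evaluate, max_iters, patience):
    """Run up to max_iters training iterations, tracking a dev-set
    score, and stop once the score fails to improve for `patience`
    consecutive iterations. Returns the best score and its iteration."""
    best, best_iter, waited = float('-inf'), -1, 0
    for it in range(max_iters):
        train_step(it)                 # one training iteration
        score = evaluate(it)           # dev-set evaluation
        if score > best:
            best, best_iter, waited = score, it, 0
        else:
            waited += 1
            if waited >= patience:     # early stopping triggered
                break
    return best, best_iter

# toy dev-accuracy curve that peaks at iteration 3
curve = [0.70, 0.74, 0.78, 0.80, 0.79, 0.79, 0.78]
best, best_iter = train_with_early_stopping(
    train_step=lambda it: None,
    evaluate=lambda it: curve[it],
    max_iters=len(curve),
    patience=2,
)
```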

Experimental Setups
Hyperparameters. For BiLSTM, TreeLSTM and GCN, we all use a 2-layer version. The dimension of word embeddings is set to 300, initialized with the pre-trained GloVe embeddings (Pennington et al., 2014). All hidden sizes in the neural networks are set to 350. We adopt the Adam optimizer with an initial learning rate in [1e-5, 2e-5, 1e-6] and an L2 weight decay of 0.01. We use mini-batch sizes in [16, 32, 64] depending on the task, and apply a dropout ratio of 0.5 to the word embeddings. λ is fine-tuned for each specific task.
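The settings above can be summarized as a configuration sketch; the key names are illustrative, and the learning rate and batch size are selected per task from the listed candidates.

```python
# Hyperparameter settings from the experimental setup.
CONFIG = {
    "num_layers": 2,               # BiLSTM / TreeLSTM / GCN depth
    "embedding_dim": 300,          # initialized with pre-trained GloVe
    "hidden_size": 350,            # all hidden sizes
    "learning_rates": [1e-5, 2e-5, 1e-6],  # Adam, tuned per task
    "weight_decay": 0.01,          # L2
    "batch_sizes": [16, 32, 64],   # tuned per task
    "dropout": 0.5,                # applied to word embeddings
}
```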
Task, Dataset and Evaluation. We conduct experiments on typical syntax-dependent tasks. 1) EFP, event factuality prediction on the UW dataset (Lee et al., 2015); EFP evaluates the performance of different methods with the Pearson correlation coefficient (r). 2) Rel, relation classification for drug-drug interaction (Segura Bedmar et al., 2013). 3) SRL, semantic role labeling on the CoNLL08 WSJ dataset (Surdeanu et al., 2008); Rel and SRL use the F1 score to measure the performance of different models. 4) NLI, natural language inference, which can also be modeled as sentence pair classification; we investigate NLI on three benchmarks: QNLI (Rajpurkar et al., 2016), SICK (Marelli et al., 2014) and RTE (Bentivogli et al., 2009). For NLI, we use accuracy to evaluate different models, following previous work.
Note that each dataset contains its own training, development and test sets. We test the performance of our method 30 times on the corresponding test set, and the results are presented after significance tests with p≤0.01. We use the state-of-the-art BiAffine parser (Dozat and Manning, 2017) to obtain the dependency annotations $x^{tree}$. Trained on the Penn Treebank (Marcus et al., 1993), the dependency parser achieves 93.4% LAS and 95.2% UAS on the WSJ test sets.
Baselines. To show the effectiveness of our model, we compare the proposed model with three types of baseline systems.
• Syntactic tree models, including standalone TreeLSTM or GCN encoder introduced in § 3.1.2.
For ensemble models, we concatenate the output representations of the tree encoder (TreeLSTM) and the sequential model (BiLSTM). For MTL, we use an underlying shared structure for parameter sharing between TreeLSTM and BiLSTM. For the NLI task, we additionally compare with syntax-semantics integration models, including ESIM (Chen et al., 2017), StructAlign and PWIM.

Experimental Results
Main tasks. Table 1 shows the results of different models on the EFP, Rel and SRL tasks. Several observations can be made. First of all, the attention-based sequential models (e.g., ATTBiLSTM and Transformer) are better than the vanilla BiLSTM model, while the S-LSTM model, which incorporates both word-level and sentence-level propagation, is more effective at encoding texts than the attention-based sequential models such as ATTBiLSTM and Transformer. Besides, tree models with syntactic structure achieve better performance than sequential semantic models, showing the effectiveness of utilizing external syntactic knowledge for syntax-dependent tasks. In particular, the GCN encoder slightly outperforms the TreeLSTM encoder. In addition, when integrating tree models with sequential networks via ensemble methods or multi-task learning, the improvements are quite incremental and limited; ensemble learning can even be worse than standalone tree encoders such as GCN. Finally, our proposed method (with either the TreeLSTM or GCN based tree encoder) gives better results (p≤0.01) than all the baselines, demonstrating the importance of an effective integration between syntax and semantics. The results also show that the TreeLSTM based tree encoder is more beneficial to our deep syntax-semantics communication model. The possible reason is that TreeLSTM encodes the syntactic tree structure in an incremental process, during which more detailed information passing can be leveraged, while in GCN the nodes of the syntactic graph are encoded in parallel, which, though more computationally efficient, offers collapsed information.
NLI tasks. We evaluate different methods on the NLI datasets. As shown in Table 2, observations similar to those for the previous tasks can be made. Among syntactic tree models, TreeLSTM is more effective than GCN for sentence pair encoding, showing the same trend as in Table 1. Despite the structural architecture of the TreeLSTM encoder, it learns the syntax consecutively, during which more contextual information can be maintained, while GCN encodes the sentence in one shot, which is not sufficient for matching a sentence pair. In addition, we find that the strong NLI baselines (ESIM, StructAlign and PWIM) give better results than the syntax-semantics ensemble model, as they provide a more sophisticated manner of incorporating syntactic knowledge into semantic composition. Nevertheless, our model outperforms all baseline systems, with an average accuracy of 82.0% with the TreeLSTM tree encoder and 81.2% with the GCN encoder. The above results prove that our framework is highly effective in integrating syntactic structure with sequential semantic models.

Ablation Study
We perform ablation tests to analyze the contributions of different components, as shown in Table 3. First, we replace the GloVe embeddings with embeddings initialized by the Xavier algorithm (Glorot and Bengio, 2010), and find that the performance drops significantly. When we use state-of-the-art language models, such as ELMo and BERT, instead of BiLSTM, we obtain prominent performance gains. This indicates the importance of semantics for the framework. Second, if we abandon the local communication mechanism, the accuracy decreases dramatically. Finally, the context gate $c^t_i$ and the sentence-level propagation architecture make similar contributions to global communication for task performance.

Efficiency
We investigate the efficiency of different models on the EFP task. As shown in Figure 4, our method maintains competitive performance as the sentence length grows to 30, while the performance of the other models drops significantly. This indicates that our method can partially relieve the long-distance dependency problem, thanks to the sentence-level global communication.
We also explore the impact of the number of recurrent steps in the sentence-level propagation architecture. As shown in Figure 5, TreeLSTM converges at step 7, while GCN is faster, converging at step 3. This partially coincides with the observation that GCN is more computationally efficient than TreeLSTM.

Semantic Composition
We explore to what extent the model captures the semantics of sentences. Based on the RTE task, we first measure the distance between the semantic representations of each sentence pair with the Euclidean distance (Ed), and then scale the continuous value $\hat{y}_i$ into [0,1]. The RTE gold test labels $y_i \in \{0, 1\}$ include Entailment and Contradict.
We define the semantic deviation as:

$$\mathrm{Dev}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$

where the $\hat{y}_i$ are the scaled distances Ed. If all the predicted distances coincide with the gold labels, Dev = 0, indicating maximum consistency of the semantic representations. We compute the deviation of each sentence pair for several baselines, as shown in Figure 6. We can see that our method gives the best semantic consistency with the gold labels, compared with the other methods. This indicates the effectiveness of our model in composing sentence semantics.
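Under the setup above, the deviation analysis can be sketched as follows. The squashing of the Euclidean distance into [0, 1] via a hypothetical `scale` cap, and the mapping of Entailment to small distance, are assumptions for illustration; the paper does not specify the exact scaling.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two representation vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def semantic_deviation(pairs, gold, scale=2.0):
    """Average absolute deviation between scaled pair distances and
    binary gold labels; 0 means maximum consistency."""
    preds = [min(euclidean(ra, rb) / scale, 1.0) for ra, rb in pairs]
    return sum(abs(y - p) for y, p in zip(gold, preds)) / len(gold)

pairs = [([1.0, 0.0], [1.0, 0.0]),   # identical reps -> distance 0
         ([1.0, 0.0], [0.0, 2.0])]   # far-apart reps -> distance capped at 1
gold = [0.0, 1.0]   # assuming Entailment -> 0 (close), Contradict -> 1 (far)
dev = semantic_deviation(pairs, gold)
```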
We further examine the effectiveness of semantic composition by comparing our method with the NLI model StructAlign. We scatter the predicted probability of each sentence pair. Technically, a model is expected to predict the Entailment label (y=1) with a large probability, and vice versa for the Contradict label. As shown in Figure 7, we find that our model tends to predict the samples with higher confidence, for either Entailment or Contradict, compared with StructAlign. The above analysis proves that our method is more effective than StructAlign in semantic composition.

Case Study
To better understand the introduced local and global communication mechanisms, we conduct a case study by visualizing the edge weights of both the syntactic dependency encoder and the sequential model, together with the word weights. The experiment is based on the RTE test set. We first calculate the attention weights for each node in the sequential model (Eq. 21) and the tree model (Eq. 25), respectively. We then recompute these edge weights for all nodes via global normalization at the sentence level, separately for the sequential model and the tree model. By calculating the co-occurrence matrix of edge weights, we obtain the word weights. We visualize the importance of words, dependency edges and consecutive edges for the Premise and Hypothesis sentences, respectively.

Figure 8: Case visualization of words and edges on the RTE task (with Contradict label), at steps T=0 and T=5, for the Premise "Coyote got shot after biting girl in Vanier Park" and the Hypothesis "Girl got shot in park". The arrows above the sentences are bidirectional syntactic dependents, and the ones below the sentences are sequential semantics.

As illustrated in Figure 8, before information exchange (T=0), the weights of dependency edges and consecutive edges are inaccurate and not directly useful for capturing semantics. Besides, the 'attentions' focused on edges in the sequential model and the tree model are quite different. When the framework is trained close to convergence, at time step 5, the connections between syntax and semantics tend to be mutually coincident. The possible reason is that sequential semantics can guide syntactic structure learning to the proper place. For example, the syntactic link for biting girl in the Premise and the one for got shot in the Hypothesis are enhanced by the corresponding sequential edges, respectively. Consequently, more informative words, e.g., biting girl in the Premise and Girl got shot in the Hypothesis, receive proper weights for building more accurate semantics. With such semantic composition, the model easily gives correct predictions. This shows that effective communication can improve the mutual learning of syntactic structure and sequential semantics.

Conclusion
We proposed a deep syntax-semantics communication model for improving text understanding. Local communication was performed between the syntactic tree encoder and the sequential semantic encoder for mutual learning and information exchange. Global communication was performed to ensure information propagation throughout the entire architecture over recurrent steps. Results on multiple tasks showed that our model outperformed strong baselines. In-depth analysis further indicated that our method is highly effective at composing sentence semantics.