Phrase-level Self-Attention Networks for Universal Sentence Encoding

Universal sentence encoding is a hot topic in recent NLP research. The attention mechanism has been an integral part of many sentence encoding models, allowing them to capture context dependencies regardless of the distance between elements in the sequence. Fully attention-based models have recently attracted enormous interest due to their highly parallelizable computation and significantly shorter training time. However, the memory consumption of these models grows quadratically with the sentence length, and syntactic information is neglected. To this end, we propose Phrase-level Self-Attention Networks (PSAN), which perform self-attention across words inside a phrase to capture context dependencies at the phrase level, and use a gated memory updating mechanism to refine each word's representation hierarchically with longer-term context dependencies captured in a larger phrase. As a result, memory consumption is reduced because self-attention is performed at the phrase level instead of the sentence level. At the same time, syntactic information can be easily integrated into the model. Experimental results show that PSAN achieves state-of-the-art performance across a plethora of NLP tasks, including binary and multi-class classification, natural language inference and sentence similarity.


Introduction
Following the success of word embeddings (Bengio et al., 2003; Mikolov et al., 2013), one of NLP's next challenges has become the hunt for universal sentence encoders. The goal is to learn a general-purpose sentence encoding model on a large corpus, which can be readily transferred to other tasks. The learned sentence representations are able to generalize to unseen combinations of words, which makes them highly desirable for downstream NLP tasks, especially those with relatively small datasets.
Previous models for sentence encoding typically rely on Recurrent Neural Networks (RNNs) (Hochreiter and Schmidhuber, 1997; Chung et al., 2014) or Convolutional Neural Networks (CNNs) (Kalchbrenner et al., 2014; dos Santos and Gatti, 2014; Kim, 2014; Mou et al., 2016) to produce context-aware representations. RNNs encode a sentence by reading words in sequential order; they are capable of learning long-term dependencies but are hard to parallelize and not time-efficient. CNNs focus on local or position-invariant dependencies but do not perform well on many tasks (Shen et al., 2017).
Fully attention-based neural networks have attracted wide interest recently, because they can model both long-term and local dependencies while being more parallelizable and requiring significantly less time to train. Vaswani et al. (2017) proposed multi-head attention, which projects a sentence into multiple semantic subspaces, applies self-attention in each subspace and concatenates the attention results. Shen et al. (2017) proposed directional self-attention, which applies forward and backward masks to the alignment score matrix to encode temporal order information, and computes attention at the feature level to select the features that best describe a word's meaning in a given context. Effective as these models are, the memory required to store the alignment scores of all token pairs grows quadratically with the sentence length. Furthermore, the syntactic structure that is intrinsic to natural language is not considered at all.
Language is inherently tree structured, and the meaning of a sentence comes largely from composing the meanings of subtrees (Chomsky, 1957). Previous syntactic tree-based sentence encoders (Socher et al., 2013; Tai et al., 2015) mainly rely on recursive networks. Although compositionality can be explicitly modeled, these models need expensive recursive computation and are hard to train with batched gradient descent methods.
In this paper, we propose the Phrase-level Self-Attention Networks (PSAN) for RNN/CNN-free sentence encoding. PSAN inherits all the advantages of fully attention-based models while requiring much less memory. In addition, syntactic information can be incorporated into the model more easily. In our model, every sentence is split into multiple phrases based on its parse tree, and self-attention is performed at the phrase level instead of the sentence level, so memory consumption drops rapidly as the number of phrases increases. Furthermore, a gated memory component is employed to refine word representations hierarchically by incorporating longer-term context dependencies. As a result, syntactic information can be integrated into the model without expensive recursive computation. Finally, multi-dimensional attention is applied to the refined word representations to obtain the final sentence representation.
Following Conneau et al. (2017), we train our sentence encoder on the SNLI (Bowman et al., 2015) dataset, and evaluate the quality of the obtained universal sentence representations on a wide range of transfer tasks. The SNLI dataset is well suited to training sentence encoders because it is the largest high-quality human-annotated dataset that involves reasoning about the semantic relationships within sentences.
The main contributions of our work can be summarized as follows:
• We propose the Phrase-level Self-Attention mechanism (PSA) for contextualization. The memory consumption can be reduced because self-attention is performed at the phrase level instead of the sentence level.
• A gated memory updating mechanism is proposed to refine each word representation hierarchically by incorporating different levels of contextual information along the parse tree.
• Our proposed PSAN model outperforms the state-of-the-art supervised sentence encoders on a wide range of transfer tasks with significantly less memory consumption.

Proposed Model
In this section, we introduce the Phrase-level Self-Attention Networks (PSAN) for sentence encoding. A phrase is a group of words that carry a specific idiomatic meaning and function as a constituent in the syntax of a sentence. Words in a phrase are syntactically and semantically related to each other. Therefore, it can be advantageous to use a self-attention mechanism to learn a context-aware representation inside a phrase while filtering out information from outside the phrase. To better utilize the tree structure that is intrinsic to language, we propose a gated memory updating mechanism to combine different levels of context information. Finally, an attention mechanism is used to summarize all the token representations into a fixed-length sentence vector.

Phrase Division
The phrase structure organizes words into nested constituents, which can be successively divided into their parts as we move down the constituency-based parse tree. One phrase division shows only one aspect of context dependency. In order to capture different levels of context dependencies, we can split a sentence at different granularities. The number of levels T is a hyper-parameter to be tuned. We can break down the nodes at T different layers in the parse tree to capture T levels of context dependencies, as illustrated in Figure 1.
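The division procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the parse tree is represented as nested lists whose leaves are word strings, the helper names `leaves` and `divide` are hypothetical, and the stopping rule (a phrase of three or fewer words is not divided further) is taken from the Figure 1 caption. How the T levels map onto tree layers in the actual system may differ.

```python
def leaves(tree):
    """Return the words under a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return [tree]
    return [w for child in tree for w in leaves(child)]


def divide(tree, depth, min_words=3):
    """Split a tree into phrases by descending `depth` layers from the root.

    A node is kept whole once the depth budget is used up, when it is a
    leaf, or when it spans `min_words` words or fewer.
    """
    words = leaves(tree)
    if depth == 0 or isinstance(tree, str) or len(words) <= min_words:
        return [words]
    phrases = []
    for child in tree:
        phrases.extend(divide(child, depth - 1, min_words))
    return phrases


# Hypothetical toy tree, not taken from SNLI.
tree = [["a", "young", "girl"],
        [["is", "riding"], ["her", "small", "red", "bike"]]]

# One phrase division per level; every division covers the whole sentence.
levels = [divide(tree, depth) for depth in (1, 2, 3)]
for depth, phrases in enumerate(levels, start=1):
    print(depth, phrases)
```

Each level yields non-overlapping phrases that together cover the sentence, so the same words can be attended within progressively smaller or larger contexts.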

Phrase-level Self-Attention
This is the core component of our model. It aims to learn a context-aware representation for each token inside a phrase. In order to filter out information that is semantically or syntactically distant, self-attention is performed at the phrase level instead of the sentence level.
Similar to the directional self-attention network (DiSAN) (Shen et al., 2017), Phrase-level Self-Attention uses multi-dimensional attention to compute the alignment score for each dimension of the token embedding. Therefore, it can select the features that best describe a word's specific meaning in any given context. Given a phrase P ∈ R^{l×d} represented as a sequence of word embeddings [p_1, ..., p_l], where l is the length of the phrase and d is the dimension of the word embedding, we first compute the alignment score for each token pair in the phrase:
where σ(·) is an activation function, W_a1, W_a2 ∈ R^{d×d} and b_a ∈ R^d are parameters to be learned, and M is a diagonal-disabled mask (Hu et al., 2017) that prevents a word from being aligned with itself.
Figure 1: An example of phrase division. The sentence and its parse tree are from the SNLI training data. The division starts from the root of the parse tree. In this example, a phrase is not divided further if it contains three or fewer words.
The output of the attention mechanism for each token in the phrase is a weighted sum of the embeddings of all tokens in the phrase:
where ⊙ denotes the point-wise product. Note that the alignment score for each token pair is a vector rather than a scalar in multi-dimensional attention.
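The two steps above (vector-valued alignment scores, then a per-feature weighted sum) can be illustrated as follows. This is a minimal sketch, not the authors' implementation: it assumes a DiSAN-style additive score a_ij = σ(W_a1 p_i + W_a2 p_j + b_a) with σ = ELU, implements the diagonal mask by skipping i = j, and normalizes with a softmax over the other tokens separately for each feature. The function names are hypothetical, and the phrase is assumed to contain at least two tokens.

```python
import math
import random


def elu(x):
    """ELU activation for one scalar."""
    return x if x > 0 else math.exp(x) - 1.0


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def phrase_attention(P, W1, W2, b):
    """Attention-weighted vector for every token of a phrase P (l x d, l >= 2).

    Assumed score: a_ij = ELU(W1 p_i + W2 p_j + b), a length-d vector.
    Pairs with i == j are skipped, which plays the role of the mask M.
    """
    l, d = len(P), len(P[0])
    left = [matvec(W1, p) for p in P]
    right = [matvec(W2, p) for p in P]
    out = []
    for i in range(l):
        others = [j for j in range(l) if j != i]
        scores = [[elu(left[i][k] + right[j][k] + b[k]) for k in range(d)]
                  for j in others]
        mixed = []
        for k in range(d):
            # Softmax over the other tokens, separately for feature k.
            top = max(s[k] for s in scores)
            exps = [math.exp(s[k] - top) for s in scores]
            total = sum(exps)
            mixed.append(sum(e / total * P[j][k] for e, j in zip(exps, others)))
        out.append(mixed)
    return out


# Toy example with random (untrained) parameters.
rng = random.Random(0)
d = 4
W1 = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(d)]
W2 = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(d)]
b = [0.0] * d
phrase = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(3)]
attended = phrase_attention(phrase, W1, W2, b)
```

Because the weights are computed per feature, two tokens can receive different relative importance in different embedding dimensions, which is what distinguishes this from scalar attention.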
The final output of Phrase-level Self-Attention is obtained by comparing each input representation with its attention-weighted counterpart. We use a comparison function based on absolute difference and element-wise multiplication, similar to that of Wang and Jiang (2016). This comparison function has the advantage of measuring the semantic similarity or relatedness of two sequences.
where W_c ∈ R^{d×2d} and b_a ∈ R^d are parameters to be learned, and c_i is the representation for the i-th word in the phrase that captures local dependencies within the phrase.
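A comparison function of this kind can be sketched as below. The exact form is an assumption inferred from the description and the stated parameter shapes: the absolute difference and the element-wise product are concatenated into a 2d-dimensional vector, projected back to d dimensions by a d×2d matrix, and passed through the activation (ELU here). The name `compare`, the argument names and the concatenation order are illustrative.

```python
import math


def elu(x):
    """ELU activation for one scalar."""
    return x if x > 0 else math.exp(x) - 1.0


def compare(p, p_att, W_c, bias):
    """Compare a token vector p with its attention-weighted counterpart.

    Assumed form: ELU(W_c [ |p - p_att| ; p * p_att ] + bias),
    where W_c has d rows and 2d columns.
    """
    features = ([abs(a - t) for a, t in zip(p, p_att)]
                + [a * t for a, t in zip(p, p_att)])
    return [elu(sum(w * f for w, f in zip(row, features)) + bi)
            for row, bi in zip(W_c, bias)]


# Toy example: W_c picks out the first difference and the last product.
p = [1.0, -2.0]
p_att = [0.5, 1.0]
W_c = [[1.0, 0.0, 0.0, 0.0],
       [0.0, 0.0, 0.0, 1.0]]
c = compare(p, p_att, W_c, [0.0, 0.0])
```

The difference term responds to how far a word is from its phrase context, while the product term responds to agreement between them, so the projection can weigh both signals.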
Finally, we put together the Phrase-level Self-Attention results for the non-overlapping phrases from the same phrase division of a sentence. For the t-th phrase division we obtain C^(t) = [c_1, ..., c_{l_s}], the phrase-level self-attention results for the sentence from the t-th layer split, where l_s is the sentence length.
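Since attention is computed independently inside each phrase, the number of alignment scores that must be stored depends on the phrase lengths rather than on the sentence length alone. The following back-of-the-envelope sketch counts only alignment-score entries (one d-dimensional vector per token pair) and ignores every other activation; the helper name and the example lengths are hypothetical.

```python
def alignment_entries(phrase_lengths, d):
    """Number of stored alignment-score values for one phrase division.

    Each phrase of length l needs l * l score vectors of dimension d.
    """
    return sum(l * l for l in phrase_lengths) * d


d = 300
sentence_level = alignment_entries([12], d)       # one block of 12 tokens
phrase_level = alignment_entries([3, 4, 5], d)    # same 12 tokens, 3 phrases

print(sentence_level, phrase_level)
```

For this toy split the phrase-level count is 15000 values against 43200 at the sentence level. The saving grows as a sentence is divided into more, shorter phrases.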

Gated Memory Updating
The previous subsection describes Phrase-level Self-Attention (PSA) for one split of the parse tree. The parse tree can be split at different granularities. We propose a novel gated memory updating mechanism to refine each word representation hierarchically with longer-term dependencies captured at a larger granularity. Inspired by the adaptive gate in highway networks (Srivastava et al., 2015), our memory mechanism adds a gate to the original memory networks (Weston et al., 2014; Sukhbaatar et al., 2015). This gate determines the relative importance of the new input and the original memory in the memory update.
where W_g, W_m ∈ R^{d×2d} and b_g, b_m ∈ R^d are parameters to be learned. Note that, in order to share representation power and to reduce the number of parameters, the parameters of gated memory updating are shared among different layers.
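One highway-style update that is consistent with the stated parameter shapes is sketched below. It is an assumption, not the paper's equation: the gate and the candidate are both computed from the concatenation of the previous memory and the new PSA output, and the new memory interpolates between the candidate and the old memory. The initial memory (the word embedding) and all function names are also illustrative.

```python
import math


def elu(x):
    """ELU activation for one scalar."""
    return x if x > 0 else math.exp(x) - 1.0


def affine(W, x, b):
    """Compute W x + b for a matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]


def gated_update(m, c, W_g, b_g, W_m, b_m):
    """Assumed update: g * candidate + (1 - g) * m, element-wise."""
    x = m + c  # concatenation [m ; c], length 2d
    gate = [1.0 / (1.0 + math.exp(-z)) for z in affine(W_g, x, b_g)]
    cand = [elu(z) for z in affine(W_m, x, b_m)]
    return [g * n + (1.0 - g) * old for g, n, old in zip(gate, cand, m)]


def refine(word_vec, level_outputs, params):
    """Refine one word over all levels, reusing the same parameters."""
    m = word_vec  # assumed initial memory: the word embedding itself
    for c in level_outputs:
        m = gated_update(m, c, *params)
    return m


# Toy example with d = 2 and a nearly closed gate (large negative bias).
zeros = [[0.0] * 4 for _ in range(2)]
closed = (zeros, [-50.0, -50.0], zeros, [0.0, 0.0])
kept = refine([1.0, -1.0], [[0.5, 0.5], [0.2, 0.1]], closed)
```

With the gate nearly closed the memory passes through unchanged, and with it open the memory is replaced by the candidate; learned gates fall between these extremes for each feature.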

Sentence Summarization
In this layer, a self-attention mechanism is employed to summarize the refined representation of a sentence into a fixed-length vector. The self-attention mechanism can explore the dependencies among tokens within the whole sentence. As a result, global dependencies can also be incorporated into the model.
where W_g, W_m ∈ R^{d×d} and b_g, b_m ∈ R^d are parameters to be learned. After this step, the refined context-aware sentence representation is compressed into a fixed-length vector.
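A summarization layer with two d×d matrices and two biases can be sketched in the style of DiSAN's multi-dimensional attention. The form below is an assumption: each token gets a d-dimensional score from a two-layer projection, scores are normalized over tokens separately for each feature, and the sentence vector is the resulting weighted sum. The names `summarize`, `W1`, `b1`, `W2` and `b2` are placeholders.

```python
import math


def elu(x):
    """ELU activation for one scalar."""
    return x if x > 0 else math.exp(x) - 1.0


def affine(W, x, b):
    """Compute W x + b for a matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]


def summarize(M, W1, b1, W2, b2):
    """Compress token vectors M (l_s x d) into one length-d vector.

    Assumed score for token i: W2 ELU(W1 m_i + b1) + b2.
    """
    scores = [affine(W2, [elu(z) for z in affine(W1, m, b1)], b2) for m in M]
    d = len(M[0])
    sentence = []
    for k in range(d):
        # Softmax over tokens, separately for feature k.
        top = max(s[k] for s in scores)
        exps = [math.exp(s[k] - top) for s in scores]
        total = sum(exps)
        sentence.append(sum(e / total * m[k] for e, m in zip(exps, M)))
    return sentence


# With all-zero parameters every token gets equal weight (mean pooling).
zero = [[0.0, 0.0], [0.0, 0.0]]
v = summarize([[1.0, 2.0], [3.0, 6.0]], zero, [0.0, 0.0], zero, [0.0, 0.0])
```

The output length depends only on d, so sentences of any length map to vectors of the same size.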

Experiments
In this section, we conduct a plethora of experiments to study the effectiveness of the PSAN model. Following Conneau et al. (2017), we train our sentence encoder on the SNLI dataset and evaluate it across a variety of NLP tasks, including sentence classification, natural language inference and sentence textual similarity.
300-dimensional GloVe (Pennington et al., 2014) word embeddings (Common Crawl, uncased) are used to represent words. Following Parikh et al. (2016), out-of-vocabulary words are hashed to one of 128 random embeddings initialized from a uniform distribution over (-0.05, 0.05). All the word embeddings remain fixed during training. The hidden dimension d is set to 300. All other parameters are initialized with Glorot normal initialization (Glorot and Bengio, 2010). The activation function σ(·) is ELU (Clevert et al., 2015) unless otherwise specified. The minibatch size is set to 16. The number of levels T is fixed to 3 in all of our experiments. The syntactic parse trees of SNLI are provided within the corpus. Parse trees for all test corpora are produced by the Stanford PCFG Parser 3.5.2 (Klein and Manning, 2003), the same parser that produced the parse trees for the SNLI dataset.
To train the model, the Adadelta optimizer (Zeiler, 2012) with a learning rate of 0.75 is used on the SNLI dataset. The dropout (Srivastava et al., 2014) rate and the L2 regularization weight decay factor γ are set to 0.5 and 5e-5, respectively. To test the model, the SentEval toolkit (Conneau and Kiela, 2018) is used as the evaluation pipeline for fairer comparison.
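The handling of out-of-vocabulary words described in the configuration can be illustrated as follows. The number of buckets, the embedding size and the initialization range come from the text; the choice of hash function (`zlib.crc32`, used here because it is stable across runs) and the names `embed` and `oov_table` are assumptions, since the text does not specify them.

```python
import random
import zlib

NUM_OOV_BUCKETS = 128   # from the text
DIM = 300               # from the text

# Fixed random table for unknown words, uniform over (-0.05, 0.05).
_rng = random.Random(0)
oov_table = [[_rng.uniform(-0.05, 0.05) for _ in range(DIM)]
             for _ in range(NUM_OOV_BUCKETS)]


def embed(word, vocab):
    """Return the pretrained vector, or a hashed random one if unknown."""
    if word in vocab:
        return vocab[word]
    # Assumed hash; any deterministic string hash would serve.
    bucket = zlib.crc32(word.encode("utf-8")) % NUM_OOV_BUCKETS
    return oov_table[bucket]


# Hypothetical one-word vocabulary standing in for GloVe.
vocab = {"girl": [0.1] * DIM}
known = embed("girl", vocab)
unknown = embed("flibbertigibbet", vocab)
```

Hashing gives each unknown word a consistent vector across occurrences, so repeated rare words remain distinguishable from one another even though the embeddings are never trained.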

Training Setting
Natural language inference (NLI) is a fundamental task in the field of natural language processing that involves reasoning about the semantic relationship between two sentences, which makes it a suitable task to train sentence encoding models.
We conduct experiments on the Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015).
The dataset has 570k human-annotated sentence pairs, each labeled with one of the following pre-defined relationships: Entailment (the premise entails the hypothesis), Contradiction (they contradict each other) and Neutral (they are irrelevant). Following previous work (Bowman et al., 2015; Mou et al., 2016), we remove the instances on which annotators cannot reach consensus. In this way we get 549367/9842/9824 sentence pairs for the train/validation/test sets.
Following the siamese architecture (Bromley et al., 1993), we apply PSAN to both the premise and the hypothesis with their parameters tied. v_p and v_h are the fixed-length vector representations of the premise and the hypothesis, respectively. The final sentence-pair representation is formed by concatenating the original vectors with the absolute difference and the element-wise multiplication between them:
v_inp = [v_p; v_h; |v_p − v_h|; v_p ⊙ v_h]
Finally, we feed the sentence-pair representation v_inp into a two-layer feed-forward network and use a softmax layer to make the classification. This is the de facto scheme for sentence encoders trained on SNLI (Mou et al., 2016; Liu et al., 2016; Shen et al., 2017).
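The sentence-pair feature construction is simple enough to show directly. The sketch below follows the description in the text (the two vectors, their absolute difference and their element-wise product, concatenated); the function name and the order of the four parts are assumptions.

```python
def pair_features(v_p, v_h):
    """Concatenate [v_p ; v_h ; |v_p - v_h| ; v_p * v_h]."""
    diff = [abs(a - b) for a, b in zip(v_p, v_h)]
    prod = [a * b for a, b in zip(v_p, v_h)]
    return v_p + v_h + diff + prod


# Toy 2-dimensional premise and hypothesis vectors.
v_inp = pair_features([1.0, -2.0], [3.0, 0.5])
print(v_inp)  # [1.0, -2.0, 3.0, 0.5, 2.0, 2.5, 3.0, -1.0]
```

With d-dimensional sentence vectors the classifier input has 4d dimensions, and because the encoder weights are tied the two sentences are embedded in the same space before being compared.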

Evaluation Setting
To show the modeling capacity and robustness of our proposed model, we evaluate it across a wide range of tasks that can be solved purely based on the encoded semantics. The set of tasks was selected based on what appears to be the community consensus regarding the appropriate evaluations for universal sentence representations. To facilitate comparison, we use the same sentence evaluation tool as Conneau et al. (2017) to automate evaluation on all the tasks mentioned in this paper.
Table 1: Statistics of the evaluation datasets. If the output is an integer, it represents the number of classes of the classification task. If the output is an interval, it represents the output range of the regression task. # phrases / sent. represents the average number of phrases per sentence for each layer of phrase division. # words / phrase represents the average number of words per phrase for each layer of phrase division.

Baselines
We compare our model with the following supervised sentence encoders:
• BiLSTM-Max (Conneau et al., 2017) is a simple but effective baseline that performs max-pooling over a bi-directional LSTM.
• AdaSent (Zhao et al., 2015) forms a hierarchy of representations from words to phrases and then to sentences through recursive gated local composition of adjacent segments.
• DiSAN (Shen et al., 2017) is composed of a directional self-attention block with temporal order encoded, and a multi-dimensional attention that compresses the sequence into a vector representation.

Overall Performance
Experimental results of our model and four baselines are shown in Table 2. Micro and macro accuracies are two composite indicators for evaluating transfer performance on tasks whose metric is classification accuracy. Micro accuracy is the proportion of true results in the population of instances from all tasks. Macro accuracy is the arithmetic mean of dev accuracies for each task. PSAN achieves the state-of-the-art performance.
In Table 3, we compare our model with baseline sentence encoders on each transfer task. PSAN outperforms the baselines on almost every task considered. On the SICK dataset, which can be seen as an out-of-domain version of SNLI, our model outperforms the baselines by a large margin, demonstrating that the semantic relationships learned on SNLI transfer well to other domains. On the STS14 dataset, where sentence vectors can be compared more directly by cosine distance, our model also achieves state-of-the-art performance, indicating that our learned sentence representations are of high quality.
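The two composite indicators differ only in how per-task results are combined: one pools all instances before dividing, the other averages the per-task accuracies. A small sketch makes the difference concrete. The function names are illustrative and the counts are made up for the example; they are not results from this paper.

```python
def pooled_accuracy(results):
    """Correct predictions over all instances, pooled across tasks."""
    correct = sum(c for c, _ in results)
    total = sum(n for _, n in results)
    return correct / total


def mean_task_accuracy(results):
    """Arithmetic mean of the per-task accuracies."""
    return sum(c / n for c, n in results) / len(results)


# Hypothetical (correct, total) counts for a small and a large task.
results = [(90, 100), (600, 1000)]
pooled = pooled_accuracy(results)
averaged = mean_task_accuracy(results)
print(round(pooled, 4), round(averaged, 4))
```

The pooled figure is dominated by large datasets, whereas the task average gives every task equal weight, which is why reporting both guards against a model that does well only on the largest tasks.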

Ablation Study
For thorough comparison, we implement seven extra baselines to analyze the improvements contributed by each part of our PSAN model:
• PSA on the first/second/third layer only uses Phrase-level Self-Attention solely at the first/second/third layer of phrase division.
• w/o PSA applies self-attention at the sentence level and uses the gated memory updating mechanism to refine each token representation hierarchically.
• w/o syntactic division divides each sentence equally into small blocks, and applies PSA within each block. The number of blocks equals the number of phrases in that layer.
• w/o gated memory updating concatenates the outputs of Phrase-level Self-Attention from three layers of phrase division and feeds the result to a feed-forward layer.
• w/o both applies self-attention at the sentence level, and uses sentence summarization to summarize the attention results into a fixed length vector.
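The syntax-free division used by the "w/o syntactic division" baseline can be sketched as follows. The text only states that each sentence is divided equally into blocks whose number matches the number of phrases in that layer; how leftover words are distributed when the length is not divisible is an assumption here, as is the function name.

```python
def equal_blocks(words, num_blocks):
    """Split a sentence into `num_blocks` contiguous, near-equal blocks."""
    n = len(words)
    # Integer boundaries spread any remainder across the blocks.
    bounds = [i * n // num_blocks for i in range(num_blocks + 1)]
    return [words[a:b] for a, b in zip(bounds, bounds[1:])]


sentence = "a young girl is riding her small red bike today".split()
blocks = equal_blocks(sentence, 4)
print([len(b) for b in blocks])
```

Because the block boundaries ignore the parse tree, comparing this baseline with the full model isolates the contribution of syntactic phrase boundaries from that of merely attending over shorter spans.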
The results are listed in Table 4. We can see that (2) performs best among (1), (2) and (3), suggesting that the second layer split is more expressive because the number of words per phrase in the second layer is the most suitable: it is neither too small to capture context dependencies, nor too large to filter out irrelevant noise. (8) outperforms (1), (2) and (3), showing that combining phrase-level information from different granularities can further improve performance.
We also experiment with models where the alignment matrix is calculated at the sentence level or at the syntax-independent block level. (5) performs quite well, showing that hierarchical refinement on smaller units can bring about reasonable improvement. (8) outperforms (4) and (5), demonstrating that syntactic information helps in sentence representation.
Comparing (6) with (8), we can tell that gated memory updating is a better method for refining token representations along the parse tree. We hypothesize that memory updating resembles the tree structure of language, in that a larger phrase is composed with knowledge of how the smaller phrases inside it are composed.
Comparing (7) with (1), (2) and (3), we find that performing self-attention at the phrase level is generally better than at the sentence level, indicating that restricting the attention context to the phrase level can effectively filter out words that are syntactically and semantically distant, thus focusing on the interaction with important words. Comparing (7) with (4), we conclude that memory updating is effective even when the inputs to each layer are the same.

Analysis of Sentence Length
Long-term dependencies are typically hard to capture for sequential models like RNNs (Bengio et al., 1994; Hochreiter and Schmidhuber, 1997). We conduct experiments to see how performance changes as the sentence length increases. In Figure 2, we show the relationship between classification accuracy and the average length of the sentence pair on the SNLI dataset. Sentence-level Self-Attention (the w/o PSA model described in subsection 4.2) is used as a baseline for our model. PSAN consistently outperforms the Sentence-level Self-Attention model for longer sentences of length 14 to 20. This suggests that incorporating syntactic information by performing self-attention at the phrase level and refining each word's representation hierarchically helps to capture long-term dependencies across words in a sentence.

Analysis of Memory Consumption
We conduct experiments to analyze the reduction in memory consumption resulting from Phrase-level Self-Attention. To this end, we re-implement two fully attention-based models (Vaswani et al., 2017; Shen et al., 2017) on the TREC dataset. To make a fair comparison, the dimension of the sentence vectors is set to 300, the same as in our model. Table 5 lists the results. Our PSAN model outperforms the other two fully attention-based models while being more memory efficient, reducing memory consumption by more than 20%.

Visualization and Case Study
In order to analyze how attention changes across layers and how important each word is to the sentence vector, we visualize the attention scores in the alignment matrix of each Phrase-level Self-Attention layer and of the sentence summarization layer. To facilitate the visualization of the multi-dimensional attention vector, we use the l2 norm of the attention vector as its representation. In Figure 3, we can see that the difference in attention weights between semantically important and unimportant words gets larger as the context becomes larger. This implies that token representations can be gradually refined by the gated memory updating mechanism. Furthermore, the alignment matrix of a phrase can be refined even if the phrase division does not change between layers. For instance, the word "girl" gets a larger attention weight in the second layer division than in the first layer. This demonstrates that the memory updating mechanism can gradually pick out important words for sentence representation. Finally, nouns and verbs dominate the attention weights, while stop words like "a" and "its" contribute little to the final sentence representation. This indicates that PSAN can effectively pick out semantically important words that are most representative of the meaning of the whole sentence.
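Reducing each multi-dimensional attention vector to a single number for plotting can be done as in the sketch below, which takes the l2 norm of every vector in an alignment matrix. The function name and the nested-list layout (`A[i][j]` is the d-dimensional attention vector from token i to token j) are assumptions for illustration.

```python
import math


def attention_heatmap(A):
    """Map an l x l grid of attention vectors to an l x l grid of l2 norms."""
    return [[math.sqrt(sum(x * x for x in vec)) for vec in row] for row in A]


# Toy 2 x 2 alignment matrix with 2-dimensional attention vectors.
A = [[[0.0, 0.0], [3.0, 4.0]],
     [[0.6, 0.8], [0.0, 0.0]]]
heat = attention_heatmap(A)
print(heat)  # [[0.0, 5.0], [1.0, 0.0]]
```

The resulting scalar grid can be passed to any heat-map plotting routine; the norm discards which features carry the weight, so it shows overall importance only.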

Related Work
Recently, the self-attention mechanism has been successfully applied to sentence encoding. It uses attention to relate elements at different positions within a single sentence. Due to its direct access to each token representation, both long-term and local dependencies can be modeled flexibly. Liu et al. (2016) leveraged the average-pooled word representation to attend to words appearing in the sentence itself. Cheng et al. (2016) proposed the LSTMN model for machine reading, in which an attention vector is produced for each hidden state during the recurrent iteration, thus giving the recurrent network stronger memorization capability and the ability to discover relations among tokens. Lin et al. (2017) obtained a fixed-size sentence embedding matrix by introducing self-attention. Different from the feature-level attention used in our model, their attention mechanism extracts different aspects of the sentence into multiple vector representations, and uses a penalization term to encourage diversity among the attention results.
Syntactic information can be useful for understanding a natural language sentence. Many previous studies used syntactic information to build sentence encoders by composing the meanings of subtrees. Tree-LSTM (Tai et al., 2015) composed its hidden state from an input vector and the hidden states of arbitrarily many child units. In Tree-based CNN (Mou et al., 2015, 2016), a set of subtree feature detectors slide over the parse tree of a sentence, and a max-pooling layer is used to aggregate information along different parts of the tree. Apart from the models that use parse information, several studies have aimed to learn the hierarchical latent structure of text by recursively composing words into a sentence representation. Among them, the neural tree indexer (Munkhdalai and Yu, 2017b) used an LSTM or attentive node composition function to construct a full n-ary tree for the input text. Gumbel Tree-LSTM (Choi et al., 2018) used the Straight-Through Gumbel-Softmax estimator to decide the parent node among candidates dynamically. A major drawback of these models is that the recursive computation can be expensive and hard to process in batches.

Conclusion
We propose the Phrase-level Self-Attention Networks (PSAN), a fully attention-based model that can utilize syntactic information for universal sentence encoding. By applying self-attention at the phrase level, we can filter out distant and unrelated words and focus on modeling the interaction between semantically and syntactically important words. A gated memory updating mechanism is used to incorporate different levels of contextual information along the parse tree. Empirical results on a wide range of transfer tasks demonstrate the effectiveness of our model.