Intra-Correlation Encoding for Chinese Sentence Intention Matching

Sentence intention matching is vital for natural language understanding. In the Chinese sentence intention matching task in particular, the ambiguity of Chinese words makes semantic loss or semantic confusion more likely to occur during encoding. Existing methods enrich text representations with pre-trained word embeddings to alleviate this problem; however, owing to the particularity of Chinese text, pre-trained embeddings of different granularities describe the semantics of a piece of text differently. In this paper, we propose an effective approach that combines character-granularity and word-granularity features to perform sentence intention matching, and we utilize soft alignment attention to enhance the local information of sentences at the corresponding levels. The proposed method can capture sentence feature information from multiple perspectives as well as the correlation information between different levels of a sentence. Evaluated on the BQ and LCQMC datasets, our model achieves remarkable results and demonstrates performance better than or comparable to BERT-based models.


Introduction
As a branch of sentence semantic matching (SSM), sentence intention matching (SIM) is critical to question answering systems in certain applications. In general, SSM judges whether two sentences express the same meaning. In a question answering system, however, SIM intends to determine whether two questions share the same intention and could be addressed by the same answer, which is more challenging than other SSM tasks. As the example in Table 1 shows, although Q1 and Q2 in fact share a similar intention, it is difficult to decide whether they are similar at the semantic intention level without considering the deep context.
With the development of deep learning, a series of SSM models have been proposed for semantic matching tasks (Wang et al., 2017; Gong et al., 2018; Huang et al., 2019; Liu et al., 2020). However, these models only consider the characteristics of the semantic level of the text and overlook deep intentional features. Researchers have attempted to extract deeper semantic features through attention mechanisms (Tan et al., 2018; Tay et al., 2018), memory networks (Cheng et al., 2016), and the addition of external syntactic structures and lexical resources such as WordNet. Although these methods obtain deep semantic features from different perspectives, they cannot completely overcome the loss of features during encoding. In particular, due to the diversity of Chinese semantic features, the existing methods struggle to capture complicated deep semantic features. In English SSM tasks, Wang et al. and Gong et al. employ multi-granularity fusion to extract richer semantic features, where fine-grained character embeddings are used together with traditional word embeddings (Wang et al., 2017; Gong et al., 2018). Although introducing character embeddings enriches English text representation, a single English character can hardly express a concrete meaning. Different from English, a single Chinese character can represent a solid meaning and thus convey more semantic features and information. There is therefore great interest and potential in exploring and combining multi-granularity embeddings for Chinese SSM tasks. Huang et al. (Huang et al., 2017) and Zhang et al. (Zhang et al., 2020) achieve better performance by combining character, word, and other granularities to obtain semantic encoding features from Chinese text. Such multi-granularity fusion extracts the semantic features of a text sequence and effectively alleviates the loss of semantic features during encoding.
In this paper, inspired by the existing work, we push this line of research forward by proposing a better multi-granularity fusion approach to capture semantic features from text sequences. In (Huang et al., 2017) and (Zhang et al., 2020), the encoding features of multiple granularities are integrated to generate the final text encodings. However, the correlation and distinction between text features at different granularities are not considered, so the corresponding semantic features are not further exploited in these works. Chen et al. implement sequential inference models based on chain LSTMs, enabling the capture of more features from different perspectives; they incorporate the syntactic parsing information of a tree LSTM into the classic BiLSTM model with the help of soft alignment attention. In our work, in order to better extract the correlations between different granularities in SIM, we likewise employ soft alignment attention to enhance local information representation between different granularities and capture more sentence correlation.
Our contributions are summarized as follows:

• We propose a novel sentence intention matching model, named the intra-correlation encoding model (ICE), to better extract sentence intention features. It can capture sentence feature information from multiple perspectives and the correlation information between sentences at character granularity and word granularity.
• We propose a novel deep neural architecture for the sentence intention matching task, which includes a multi-granularity embedding layer, an intra-correlation encoding layer, a global inference composition layer, and a prediction layer. Our source code is publicly available on GitHub. This work may provide a new reference for researchers in the NLP community.
The rest of the paper is structured as follows. We describe our novel sentence intention matching model in Section 2. Section 3 demonstrates the experimental results. Related work is introduced in Section 4, followed by conclusions in Section 5.

Figure 1: Architecture of our intra-correlation encoding model, with a zoom-in of the basic encoding component.

Model
We propose a multi-granularity (character-granularity and word-granularity) neural sentence model, whose architecture is shown in Figure 1. We adopt a siamese network structure for the SIM task. Our model consists of four parts: a multi-granularity embedding layer, an intra-correlation encoding layer, a global inference composition layer, and a prediction layer. In the following subsections, we first describe the initial representation, including word-granularity and character-granularity embeddings, and then introduce the intra-correlation encoding layer in detail. Next, we describe the global inference composition layer, which measures the feature information between the two sentence representations. Finally, we introduce the prediction layer, which predicts whether the corresponding sentences match each other in semantic intention.
We employ different segmentation methods to segment Q1 into characters and words, and obtain the multi-granularity sentence representations of character-based Q1 and word-based Q1; an example is shown in Table 2. Sentence Q2 is processed in the same way. The character-based and word-based sentences are padded to the same length N. The corresponding embeddings at the character and word levels are obtained by pre-training Word2Vec (Mikolov et al., 2013) on the target dataset, such as BQ or LCQMC in our experiments.
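As a concrete illustration, the sketch below prepares the two views of a sentence and pre-trains one embedding model per granularity. The paper does not name its segmentation tool or Word2Vec implementation, so jieba, gensim, the padding length N = 20, and the toy sentences are all assumptions.

```python
# Hypothetical preprocessing sketch: jieba and gensim are assumptions,
# since the paper does not name its tools.
import jieba
from gensim.models import Word2Vec

N = 20  # shared padding length for both granularities (assumed value)

def char_tokens(sentence):
    # Character granularity: every Chinese character is one token.
    return list(sentence.replace(" ", ""))

def word_tokens(sentence):
    # Word granularity: segment with jieba.
    return list(jieba.cut(sentence))

def pad(tokens, pad_token="<PAD>"):
    # Pad (or truncate) both views to the same length N.
    return (tokens + [pad_token] * N)[:N]

corpus = ["怎么更改花呗手机号码", "我的花呗是以前的手机号码"]  # toy examples

# Pre-train 300-dimensional embeddings on the target corpus, one model per
# granularity (gensim >= 4 uses `vector_size`; older versions use `size`).
char_w2v = Word2Vec([char_tokens(s) for s in corpus], vector_size=300, min_count=1)
word_w2v = Word2Vec([word_tokens(s) for s in corpus], vector_size=300, min_count=1)

char_q1 = pad(char_tokens(corpus[0]))  # model input at character granularity
word_q1 = pad(word_tokens(corpus[0]))  # model input at word granularity
```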

Intra-Correlation Encoding Layer
LSTM and BiLSTM layers are utilized to encode the input sentences Q1 and Q2 at both character granularity and word granularity, as shown in Equations 1 and 2.
where $q1^c_n$ and $q1^w_m$ denote the hidden (output) states generated by the basic encoding module for the n-th character and the m-th word, respectively; the same applies to $q2^c_n$ and $q2^w_m$. In this way, we generate multi-granularity representations of encoding features for the two sentences with the basic encoding components.
As shown in Equation 1 for character-based Q1, the LSTM layer is the first layer in the basic encoding component after the multi-granularity embedding layer. Next, the concatenated outputs of the multi-granularity embedding layer ($Q1^c$) and the LSTM layer ($\mathrm{LSTM}(Q1^c)$) flow into the BiLSTM layer. Finally, the outputs of the BiLSTM layer ($\mathrm{BiLSTM}([\mathrm{LSTM}(Q1^c); Q1^c])$) and the LSTM layer ($\mathrm{LSTM}(Q1^c)$) are combined as the final feature representation. Our model follows a siamese network structure, applying the same encoding method to word-based Q1, word-based Q2, and character-based Q2, as shown in Equations 1 and 2.
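A minimal PyTorch sketch of this basic encoding component, under our reading of Equations 1 and 2 (PyTorch itself and the 300-dimensional sizes from Section 3 are assumptions; the paper's actual implementation may differ):

```python
import torch
import torch.nn as nn

class BasicEncoder(nn.Module):
    """Basic encoding component: an LSTM, then a BiLSTM over the
    concatenation [LSTM output; raw embeddings], then concatenation of
    both outputs as the final per-token features (Equations 1-2)."""
    def __init__(self, dim=300):
        super().__init__()
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        # The BiLSTM consumes the LSTM output concatenated with the embeddings.
        self.bilstm = nn.LSTM(2 * dim, dim, batch_first=True, bidirectional=True)

    def forward(self, q):                    # q: (batch, N, dim) embeddings
        h, _ = self.lstm(q)                  # LSTM(Q)
        g, _ = self.bilstm(torch.cat([h, q], dim=-1))  # BiLSTM([LSTM(Q); Q])
        return torch.cat([g, h], dim=-1)     # final feature representation

# Following the siamese structure, the same encoder is applied to
# character-based and word-based views of both Q1 and Q2.
```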
In order to capture the correlation information between different granularities of the same sentence, we adapt the soft alignment attention that Chen et al. apply to the text semantic matching task. They utilize the attention mechanism to compute attention weights as the similarity of a hidden state tuple between a premise and a hypothesis.
From this inspiration, we compute the attention weight $e^{q1}_{nm}$ as the similarity of the hidden state tuple $\langle q1^c_n, q1^w_m \rangle$ of Q1 between character granularity and word granularity, as shown in Equation 3, where $q1^c_n$ and $q1^w_m$ are computed earlier in Equations 1 and 2. The same is applied to Q2, as shown in Equation 4. In this way, we can obtain the correlation of text features between different granularities and extract more abundant semantic features.
Through Equations 3 and 4, we obtain the correlation attention weights for sentence features at different granularities, i.e., $e^{q1}_{nm}$ and $e^{q2}_{nm}$. For the hidden state of the n-th character in character-based Q1, i.e., $q1^c_n$, its correlated semantics in word-based Q1 are identified based on $e^{q1}_{nm}$, as shown in Equation 5.
where $\tilde{q1}^c_n$ is a weighted summation of $\{q1^w_m\}_{m=1}^{N}$. Intuitively, the content in $\{q1^w_m\}_{m=1}^{N}$ that is relevant to $q1^c_n$ is selected and represented as $\tilde{q1}^c_n$. The same is performed for each word representation in word-based Q1 with Equation 6.
Using Equations 3-6, we obtain the correlation feature expressions $\tilde{q1}^c_n$ and $\tilde{q1}^w_m$ for sentence Q1; the correlation features $\tilde{q2}^c_n$ and $\tilde{q2}^w_m$ for sentence Q2 are obtained in the same way. With the above operations, we generate the basic sentence feature representations ($q1^c$, $q1^w$, $q2^c$, $q2^w$) for sentences Q1 and Q2, as well as the feature representations of the correlations ($\tilde{q1}^c$, $\tilde{q1}^w$, $\tilde{q2}^c$, $\tilde{q2}^w$) between the multi-granularity views of each sentence.
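The soft alignment step can be sketched as follows, assuming the standard dot-product similarity and softmax normalization of Chen et al. for Equations 3-6 (the equations themselves are not reproduced above, so this form is an assumption):

```python
import torch
import torch.nn.functional as F

def soft_align(q_c, q_w):
    """Soft alignment between character features q_c (batch, N, d) and
    word features q_w (batch, N, d) of the same sentence."""
    # Equations 3/4 (assumed dot-product form): e_nm = q_c[n] . q_w[m]
    e = torch.bmm(q_c, q_w.transpose(1, 2))            # (batch, N, N)
    # Equation 5: each character attends over all word states ...
    q_c_tilde = torch.bmm(F.softmax(e, dim=2), q_w)    # (batch, N, d)
    # Equation 6: ... and each word attends over all character states.
    q_w_tilde = torch.bmm(F.softmax(e, dim=1).transpose(1, 2), q_c)
    return q_c_tilde, q_w_tilde
```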

Global Inference Composition Layer
Thus far we have obtained a series of basic and correlation encoding feature representations through the intra-correlation encoding layer. We now apply average and max pooling operations on them to obtain the final feature representations for Q1 ($q1^c_{avg}$, $q1^c_{max}$, $q1^w_{avg}$, $q1^w_{max}$, $\tilde{q1}^c_{avg}$, $\tilde{q1}^c_{max}$, $\tilde{q1}^w_{avg}$, $\tilde{q1}^w_{max}$) and, likewise, for Q2, as shown in Equations 8-11. In other words, average and max pooling yield high-order feature representations from the basic and correlation feature representations of the different granularities of sentences Q1 and Q2. Conceptually, average pooling extracts a set of global features while max pooling extracts key features.
Using the outputs of the above pooling operations, we can now generate the final sentence-level representations. First, for sentence Q1, we generate its final feature representation by combining all of its feature representations, as shown in Equation 12. Similarly, the feature representation of sentence Q2 is generated with Equation 13. Next, we generate the multi-granularity correlation feature representations for sentences Q1 and Q2, respectively, as shown in Equation 14.
In addition, we utilize the final semantic representations ($f_1$ and $f_2$) of sentences Q1 and Q2 to obtain their interactions (Equation 15), where $\times$ denotes element-wise multiplication. Finally, we concatenate these interactions to generate the final multi-granularity correlation representation with Equation 16, which is passed to the prediction layer.
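The following sketch reflects our reading of the composition step (Equations 8-16). The element-wise product is stated in the text; the absolute-difference term is an assumption borrowed from common matching models, since Equation 15 itself is not shown:

```python
import torch

def pool(seq):
    """Average and max pooling over the time dimension (Equations 8-11)."""
    return torch.cat([seq.mean(dim=1), seq.max(dim=1).values], dim=-1)

def compose(feats_q1, feats_q2):
    """feats_q1 / feats_q2: lists of per-token feature tensors for one
    sentence (basic and correlation features, both granularities),
    each of shape (batch, N, d)."""
    f1 = torch.cat([pool(f) for f in feats_q1], dim=-1)  # Equation 12
    f2 = torch.cat([pool(f) for f in feats_q2], dim=-1)  # Equation 13
    # Interactions (Equation 15): the element-wise product is stated in
    # the text; the absolute difference is our assumption.
    return torch.cat([f1, f2, torch.abs(f1 - f2), f1 * f2], dim=-1)  # Eq. 16
```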

Prediction Layer
The prediction module is a multi-layer perceptron (MLP) classifier. It has three dense sub-layers: the first two are activated with the ReLU function (Nair and Hinton, 2010), and the last is followed by a sigmoid activation function in our experiments.
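A sketch of a matching prediction head, combining this description with the sizes reported in the parameter settings (two 600-unit hidden layers, dropout 0.5); the exact placement of dropout is an assumption:

```python
import torch.nn as nn

def prediction_head(in_dim, hidden=600, dropout=0.5):
    # Two ReLU-activated dense layers plus a sigmoid output node,
    # with the dropout rate reported in the parameter settings.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(dropout),
        nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(dropout),
        nn.Linear(hidden, 1), nn.Sigmoid(),
    )
```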

Loss Function
For training, we utilize the modified binary cross-entropy loss (Su, 2017). In our notation, $y_{true}$ is the actual label of a training sample and $y_{pred}$ is the corresponding predicted probability. For convenience of comparison, the traditional binary cross-entropy is given in Equation 17. To improve on it, the unit step function $\theta(x)$ (defined in Equation 18) and a threshold $m$ (set to 0.7) are introduced; the modified binary cross-entropy loss is then defined in Equation 19. With this loss function, the model is forced to focus on the indistinguishable training samples, which improves classification performance.
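Below is a sketch of one plausible reading of the modified loss. Equations 18 and 19 are not reproduced in the text, so the exact masking form, reconstructed from (Su, 2017), is an assumption: predictions already past the margin $m$ on the correct side are masked out by the unit step, so only hard samples contribute.

```python
import torch

def modified_bce(y_pred, y_true, m=0.7, eps=1e-7):
    """Reconstruction (an assumption) of the modified BCE (Su, 2017):
    the unit step theta(x) zeroes the loss for samples the model already
    classifies confidently, so training focuses on indistinguishable ones."""
    y_pred = y_pred.clamp(eps, 1 - eps)
    theta = lambda x: (x > 0).float()   # unit step function (Equation 18)
    # Positive samples contribute only while y_pred < m;
    # negative samples contribute only while y_pred > 1 - m.
    pos = y_true * theta(m - y_pred) * torch.log(y_pred)
    neg = (1 - y_true) * theta(y_pred - (1 - m)) * torch.log(1 - y_pred)
    return -(pos + neg).mean()
```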

Datasets
We conduct experiments on two Chinese sentence intention matching datasets, BQ and LCQMC. BQ is a Chinese bank question-pair dataset for sentence intention equivalence identification, a classic intention matching task. LCQMC is a generic corpus for intention matching, collected mainly from Baidu Knows. Both datasets consist of a large set of instances of the form (Q1, Q2, Label), where Q1 and Q2 are two Chinese sentences and Label indicates whether Q1 and Q2 share the same semantic intention. A summary of the two datasets is provided in Table 3.

Parameter Settings
In our experiments, the embedding dimension is 300 in the multi-granularity embedding layer, and the encoding dimension is set to 300 in the intra-correlation encoding layer. For the LSTM layer, dropout (Srivastava et al., 2014) rates of 0.5 and 0 are used for BQ and LCQMC, respectively; in the BiLSTM layer, the dropout rates are 0.52 and 0.25, respectively. A dropout rate of 0.5 is used in the prediction component, which consists of two densely connected hidden layers with 600 units each and one classification output node with a sigmoid activation function. Adam with default parameters is adopted as the optimizer (Kingma and Ba, 2015). All experiments are executed on a ThinkStation P910 workstation equipped with dual Xeon E5-2600 processors, 192 GB of memory, and one Nvidia 2080Ti GPU.
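For reference, the settings above can be collected into a single configuration (a hypothetical summary; the key names are ours):

```python
# Hyperparameters from this section, gathered into one (hypothetical) config.
CONFIG = {
    "embedding_dim": 300,
    "encoding_dim": 300,
    "lstm_dropout":   {"BQ": 0.5,  "LCQMC": 0.0},
    "bilstm_dropout": {"BQ": 0.52, "LCQMC": 0.25},
    "mlp_dropout": 0.5,
    "mlp_hidden_units": 600,
    "optimizer": "adam",  # default parameters (Kingma and Ba, 2015)
}
```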

Experimental Results
A comparison of our work with the baseline methods is shown in Table 4. We observe that our model (ICE) is superior to all compared methods in terms of most measures. There are three choices of pre-trained embedding: word embedding with Word2Vec (Mikolov et al., 2013), word embedding with GloVe (Pennington et al., 2014), and the Glyce embedding, which was first applied to Chinese tasks. The difference among them is that Word2Vec is a predictive model, GloVe is a count-based model, and Glyce generates glyph vectors for Chinese character representations. Predictive models learn their embeddings so as to improve their predictive ability; count-based models learn their embeddings essentially by reducing the dimensionality of the co-occurrence matrix; and Glyce, like word embeddings, provides a general way to model character semantics in logographic languages. Like the experiments with BiLSTM, BiMPM, DFF, and MGF, our model utilizes pre-trained Word2Vec embeddings. Our model dramatically outperforms BiLSTM, BiMPM, and DFF. This is probably because these three methods only consider one specific granularity, i.e., characters or words, which is inadequate to capture enough features; in contrast, our model encodes sentences with multi-granularity features, which provide more effective information. In comparison with MSEM and MGF, our model performs better in terms of F1-score and accuracy. Although MSEM and MGF concatenate word and character embeddings to generate the final text representation, they do not capture the correlation features between different granularities, which limits their performance improvement. Besides, MSEM utilizes GloVe embeddings, which does not improve performance. Finally, even though Glyce has achieved good results on other Chinese language tasks, our model also outperforms BiMPM+Glyce on the current task.

Comparison with BERT
Compared with BERT-based methods, our model performs comparably, as reported in Table 5. BERT utilizes the context of characters to extract features and dynamically adjusts character embeddings according to different contexts, which solves the polysemy problem that Word2Vec suffers from and thus achieves outstanding performance (Devlin et al., 2019). According to Table 5, our model surpasses BERT-based models on LCQMC and works comparably with them on BQ. This is probably because our intra-correlation encoding component enables us to capture sentence feature information from multiple perspectives and correlation information between sentences at different granularities. Moreover, in contrast to BERT-based approaches, our model is more concise and requires less computing power. We employ #FLOPs to further evaluate the models. #FLOPs (the number of floating-point operations) is a measure of a model's computational complexity, indicating the number of floating-point operations the model performs in a single forward pass. Generally speaking, the larger a model's #FLOPs, the longer its inference time; at the same accuracy, models with lower #FLOPs are more efficient. When GPU resources are insufficient, ICE, with much lower #FLOPs than BERT-based approaches, still achieves comparable results.
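As a rough illustration of what #FLOPs counts (the paper does not specify its measurement tool), a dense layer with n inputs and m outputs costs about 2nm floating-point operations per sample, one multiply and one add per weight:

```python
def dense_flops(n_in, n_out):
    # One multiply-add per weight: roughly 2 * n_in * n_out FLOPs per sample.
    return 2 * n_in * n_out

# e.g., a single 600-to-600 dense layer from the prediction component:
print(dense_flops(600, 600))  # 720000
```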

Effectiveness of Modified Loss Function
In this section, we verify the effectiveness of the modified binary cross-entropy (BCE) loss. As shown in Table 6, compared with the traditional binary cross-entropy defined in Equation 17, the modified loss function in Equation 19 achieves better performance. This corroborates that the modified BCE loss effectively forces the model to focus on the indistinguishable training samples, making the classification perform better.

Related Work
Sentence intention matching is critical for a series of downstream tasks, such as information retrieval, question answering, and machine translation.
With the development of deep neural networks, sophisticated models for the SSM task have been evolving rapidly (Lai et al., 2019; Huang et al., 2019; Liu et al., 2020). The combination of attention mechanisms with sequence models first achieved promising performance in machine translation (Bahdanau et al., 2015) and has since been applied to many other tasks in natural language processing. It has also yielded encouraging results in the SSM task (Wang et al., 2015; Wang et al., 2017; Tay et al., 2018; Duan et al., 2018; Kim et al., 2019). Wang et al. propose a multi-angle bidirectional attention mechanism for the SSM task with remarkable effect (Wang et al., 2017). In addition, Sha et al. put forward an attention mechanism that repeatedly reads the two sentences to improve the model's memory and obtain better textual semantic representations (Sha et al., 2016). Generally speaking, attention mechanisms capture key feature representations in the text and thereby benefit precise matching.
Although the above methods extract key feature representations with attention mechanisms or introduce external syntactic information into the text sequences, they achieve only limited improvement. A number of researchers have discovered that the granularity of text is also crucial for capturing its deep semantic features. In particular, Huang et al. propose a word representation layer, consisting of word embeddings and character representations, to capture multi-granularity feature representations (Huang et al., 2017). The acquisition and integration of text features at different granularities are considered in (Zhang et al., 2020) with interesting results. However, these works simply integrate multi-granularity features without taking into account the correlation of text features between different granularities.
The pre-trained language model BERT, which solves the polysemy problem of Word2Vec, has proven to be highly effective (Devlin et al., 2019). However, BERT is computationally expensive in many practical scenarios and therefore hard to deploy with limited resources. Consequently, a series of smaller, faster, cheaper, and lighter pre-trained BERT-based models have emerged, such as DistilBERT (Sanh et al., 2019) and FastBERT (Liu et al., 2020). These models are optimized for running speed and resource utilization, which inevitably reduces the effectiveness of the original BERT model. How to achieve performance comparable to BERT while requiring fewer computational resources is a critical and urgent question in model design for NLP applications.
In this paper, in order to better capture correlation features between different granularities, we propose an intra-correlation encoding framework for the SIM task, which considers the correlation between text features at character granularity and word granularity. Requiring fewer computational resources, our proposed model achieves performance better than or comparable to the state-of-the-art BERT.

Conclusions
For sentence intention matching tasks, we propose a novel method named the intra-correlation encoding model. It combines character-granularity and word-granularity features to model sentence intention, and utilizes soft alignment attention to enhance the local information of sentences at the different levels. It can capture sentence feature information from multiple perspectives and correlation information between different levels of sentences. Experiments on two datasets demonstrate that our model outperforms non-BERT-based models and achieves at least comparable accuracy with BERT-based models while running much more efficiently than BERT. In the future, we will attempt to combine multi-granularity embeddings with BERT to further improve performance. The generalization of our model to other languages will also be investigated.