Neural Graph Matching Networks for Chinese Short Text Matching

Chinese short text matching usually employs word sequences rather than character sequences to get better performance. However, Chinese word segmentation can be erroneous, ambiguous or inconsistent, which consequently hurts the final matching performance. To address this problem, we propose neural graph matching networks, a novel sentence matching framework capable of dealing with multi-granular input information. Instead of a character sequence or a single word sequence, paired word lattices formed from multiple word segmentation hypotheses are used as input and the model learns a graph representation according to an attentive graph matching mechanism. Experiments on two Chinese datasets show that our models outperform the state-of-the-art short text matching models.


Introduction
Short text matching (STM) is a fundamental task in natural language processing (NLP). It is usually formulated as a paraphrase identification task or a sentence semantic matching task: given a pair of sentences, a matching model predicts their semantic similarity. STM is widely used in question answering and dialogue systems (Gao et al., 2019; Yu et al., 2014).
Recent years have seen advances in deep learning methods for text matching (Mueller and Thyagarajan, 2016; Gong et al., 2017; Chen et al., 2017; Lan and Xu, 2018). However, almost all of these models were initially proposed for English text matching. When applying them to Chinese text matching, we have two choices. One is to take Chinese characters as the input of the models. The other is to first segment each sentence into words and then take these words as input tokens. Although character-based models can overcome the problem of data sparsity to some degree, their main drawback is that explicit word information, which can be potentially useful for semantic matching, is not fully exploited. Word-based models, in turn, often suffer from issues caused by word segmentation. As shown in Figure 1, the character sequence "南京市长江大桥(South Capital City Long River Big Bridge)" has two different meanings under different word segmentations. The first refers to a bridge (Segment-1, Segment-2), and the other refers to a person (Segment-3). The ambiguity may be eliminated with more context. Additionally, the segmentation granularity of different tools differs. For example, "长江大桥(Yangtze River Bridge)" in Segment-1 is divided into two words, "长江(Yangtze River)" and "大桥(Bridge)", in Segment-2. It has been shown that multi-granularity information is important for text matching (Lai et al., 2019).
Here we propose a neural graph matching network (GMN) for Chinese short text matching. Instead of segmenting each sentence into a single word sequence, we keep all possible segmentation paths to form a word lattice graph, as shown in Figure 1. GMN takes a pair of word lattice graphs as input and updates the representations of nodes according to a graph matching attention mechanism. Moreover, GMN can be combined with pre-trained language models, e.g. BERT (Devlin et al., 2019), and can be regarded as a method to integrate word information into these pre-trained language models during the fine-tuning phase. Experiments on two Chinese datasets show that our model outperforms not only previous state-of-the-art models but also the pre-trained model BERT as well as some variants of BERT.

Problem Statement
First, we define the Chinese short text matching task in a formal way. Given two Chinese sentences S_a = {c^a_1, c^a_2, ..., c^a_{t_a}} and S_b = {c^b_1, c^b_2, ..., c^b_{t_b}}, the goal of a text matching model f(S_a, S_b) is to predict whether the semantic meanings of S_a and S_b are equal. Here, c^a_i and c^b_j represent the i-th and j-th Chinese characters in the two sentences respectively, and t_a and t_b denote the numbers of characters in the sentences.
In this paper, we propose a graph-based matching model. Instead of segmenting each sentence into a word sequence, we keep all possible segmentation paths to form a word lattice graph G = (V, E). V is the set of nodes and includes all character subsequences that match words in a lexicon D. E is the set of edges. If a node v_i ∈ V is adjacent to another node v_j ∈ V in the original sentence, then there is an edge e_ij between them. N_fw(v_i) denotes the set of all reachable nodes of node v_i in its forward direction, while N_bw(v_i) denotes the set of all reachable nodes of node v_i in its backward direction.
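The lattice construction described above can be sketched as follows. This is a minimal illustration of ours, not code from the paper: the function name `build_lattice`, the (start, end) span representation of nodes, and the `max_word_len` bound are all our own illustrative choices.

```python
# Hypothetical sketch of word-lattice construction: every character span that
# matches a word in the lexicon D becomes a node; an edge links two nodes when
# the second starts exactly where the first ends (i.e. they are adjacent in
# the original sentence).
def build_lattice(chars, lexicon, max_word_len=4):
    # Nodes are (start, end) spans with `end` exclusive; single characters are
    # always kept so the lattice stays connected even with a sparse lexicon.
    nodes = []
    for i in range(len(chars)):
        for j in range(i + 1, min(i + max_word_len, len(chars)) + 1):
            word = "".join(chars[i:j])
            if j == i + 1 or word in lexicon:
                nodes.append((i, j))
    # Edge (u, v) iff node v begins where node u ends.
    edges = [(u, v) for u in nodes for v in nodes if u[1] == v[0]]
    return nodes, edges
```

For "南京市长江大桥" with a lexicon containing e.g. "南京", "市长", "长江" and "大桥", this yields the multiple overlapping segmentation paths shown in Figure 1.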
With two graphs G_a = (V_a, E_a) and G_b = (V_b, E_b), our graph matching model predicts their similarity, which indicates whether the original sentences S_a and S_b have the same meaning or not.

Proposed Framework
As shown in Figure 2, our model consists of three components: a contextual node embedding module (BERT), a graph matching module, and a relation classifier.

Contextual Node Embedding
For each node v_i in the graphs, its initial node embedding is the attentive pooling of contextual character representations. Concretely, we first concatenate the original character-level sentences to form a new sequence S = {[CLS], c^a_1, ..., c^a_{t_a}, [SEP], c^b_1, ..., c^b_{t_b}, [SEP]}, and then feed it to the BERT model to obtain a contextual representation for each character: c_CLS, c^a_1, ..., c^a_{t_a}, c_SEP, c^b_1, ..., c^b_{t_b}, c_SEP. Assuming that node v_i consists of n_i consecutive character tokens {c_{s_i}, c_{s_i+1}, ..., c_{s_i+n_i-1}}, a feature-wise score vector û_{s_i+k} is calculated for each character c_{s_i+k} with a feed-forward network (FFN) with two layers, i.e. û_{s_i+k} = FFN(c_{s_i+k}), and then normalised with a feature-wise multi-dimensional softmax to obtain u_{s_i+k}. The corresponding character embeddings are weighted with the normalised scores to obtain the initial node embedding v_i = Σ_{k=0}^{n_i-1} u_{s_i+k} ⊙ c_{s_i+k}, where ⊙ represents the element-wise product of two vectors.
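A rough numpy sketch of this attentive pooling follows. The two-layer scorer's hidden size, its tanh activation, and the weight names (W1, b1, W2, b2) are our assumptions; only the feature-wise softmax over characters and the weighted element-wise sum are taken from the description above.

```python
import numpy as np

def node_embedding(char_vecs, W1, b1, W2, b2):
    """Attentive pooling of a node's contextual character vectors.

    char_vecs: (n_i, d) BERT outputs for the node's characters.
    """
    h = np.tanh(char_vecs @ W1 + b1)      # (n_i, d_h) hidden layer of the FFN
    scores = h @ W2 + b2                  # (n_i, d) feature-wise scores
    # Feature-wise multi-dimensional softmax: each of the d dimensions is
    # normalised independently over the n_i characters.
    e = np.exp(scores - scores.max(axis=0, keepdims=True))
    u = e / e.sum(axis=0, keepdims=True)  # (n_i, d), each column sums to 1
    return (u * char_vecs).sum(axis=0)    # (d,) weighted element-wise sum
```

Note that for a single-character node (n_i = 1) the softmax weights are all one, so the node embedding reduces to that character's contextual vector.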

Neural Graph Matching Module
Our proposed neural graph matching module is based on graph neural networks (GNNs) (Scarselli et al., 2009). GNNs are widely applied in various NLP tasks, such as text classification (Yao et al., 2019), machine translation (Marcheggiani et al., 2018), Chinese word segmentation, Chinese named entity recognition (Zhang and Yang, 2018), dialogue policy optimization (Chen et al., 2018b,c), and dialogue state tracking. To the best of our knowledge, we are the first to introduce GNNs to Chinese short text matching. The neural graph matching module takes the contextual node embedding v_i as the initial representation h^0_i of the node v_i, and then updates its representation from one step (or layer) to the next with two sub-steps: message propagation and representation updating.
Without loss of generality, we use nodes in G_a to describe the update process of node representations; the process for nodes in G_b is similar.

Message Propagation: At the l-th step, each node v_i in G_a not only aggregates messages m^fw_i and m^bw_i from its reachable nodes in the two directions,

m^fw_i = Σ_{v_j ∈ N_fw(v_i)} α_ij W^fw h^{l-1}_j,  m^bw_i = Σ_{v_k ∈ N_bw(v_i)} α_ik W^bw h^{l-1}_k,  (1)

but also aggregates messages m^b1_i and m^b2_i from all nodes in graph G_b,

m^b1_i = Σ_{v_m ∈ V_b} α_im W^fw h^{l-1}_m,  m^b2_i = Σ_{v_q ∈ V_b} α_iq W^bw h^{l-1}_q.  (2)

Here α_ij, α_ik, α_im and α_iq are attention coefficients (Vaswani et al., 2017). The parameters W^fw and W^bw, as well as the parameters for the attention coefficients, are shared between Eq. (1) and Eq. (2). We define m^self_i = [m^fw_i, m^bw_i] and m^cross_i = [m^b1_i, m^b2_i]. With this sharing mechanism, the model has a nice property: when the two graphs are perfectly matched, we have m^self_i ≈ m^cross_i. They are not exactly equal because node v_i can only aggregate messages from its reachable nodes in graph G_a, while it aggregates messages from all nodes in G_b.

Representation Updating: After aggregating messages, each node v_i updates its representation from h^{l-1}_i to h^l_i. We first compare the two messages m^self_i and m^cross_i with the multi-perspective cosine distance (Wang et al., 2017),

d_k = cosine(w^cos_k ⊙ m^self_i, w^cos_k ⊙ m^cross_i),  (3)

where k ∈ {1, 2, ..., P} (P is the number of perspectives) and w^cos_k is a parameter vector that assigns different weights to different dimensions of the messages. With the P distances d_1, d_2, ..., d_P, we update the representation of v_i,

h^l_i = FFN([m^self_i, d_i]),  (4)

where d_i = [d_1, d_2, ..., d_P], [·, ·] denotes the concatenation of two vectors, and FFN is a feed-forward network with two layers. After updating node representations for L steps, we obtain the graph-aware representation h^L_i for each node v_i, which includes not only the information from its reachable nodes but also the information of pairwise comparison with all nodes in the other graph. The graph-level representations g_a and g_b for the two graphs G_a and G_b are computed by attentive pooling of the representations of all nodes in each graph.
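One matching step for a single node can be sketched in numpy as follows, under our own assumptions: a scaled dot-product form for the attention coefficients (the paper cites Vaswani et al., 2017, but the exact parameterisation is not shown here) and an identity-shaped interface for the update FFN. All function and variable names are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def aggregate(h_i, H, W):
    """Attentive message for node i from neighbour rows H with a shared W.

    The same W (and attention parameters) serve both the in-graph and the
    cross-graph messages, mirroring the sharing mechanism in the text.
    """
    alpha = softmax(H @ h_i / np.sqrt(h_i.shape[0]))  # attention coefficients
    return alpha @ (H @ W.T)                          # weighted, projected sum

def multi_perspective_cosine(m_self, m_cross, W_cos):
    """P cosine distances; row k of W_cos reweights the message dimensions."""
    a, b = W_cos * m_self, W_cos * m_cross            # (P, d) each
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8
    return num / den                                  # (P,)

def update_node(h_i, H_fw, H_bw, H_other, W_fw, W_bw, W_cos, ffn):
    """One propagation + update step for node i.

    H_fw / H_bw: reachable nodes in the node's own graph (two directions);
    H_other: all nodes of the other graph; ffn: the update network.
    """
    m_self = np.concatenate([aggregate(h_i, H_fw, W_fw),
                             aggregate(h_i, H_bw, W_bw)])
    m_cross = np.concatenate([aggregate(h_i, H_other, W_fw),
                              aggregate(h_i, H_other, W_bw)])
    d = multi_perspective_cosine(m_self, m_cross, W_cos)
    return ffn(np.concatenate([m_self, d]))
```

Because the projection matrices are shared, feeding identical neighbourhoods for both graphs makes m_self equal m_cross, so every perspective's cosine distance is 1, which is the "perfectly matched" property described above.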

Relation Classifier
With the two graph-level representations g_a and g_b, we predict the similarity of the two graphs or sentences,

p = FFN([g_a, g_b]),  (5)

where p ∈ [0, 1] and FFN is a feed-forward classifier with a sigmoid output. During the training phase, the training objective is to minimize the binary cross-entropy loss.
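A minimal sketch of this classifier and its loss follows. The combination of the two graph vectors with their element-wise product and absolute difference is a common matching heuristic that we assume here for illustration; the paper only specifies that p is predicted from g_a and g_b, and all names are our own.

```python
import numpy as np

def predict(g_a, g_b, w, b):
    """Map two graph-level vectors to a match probability p in (0, 1)."""
    # Assumed matching features: raw vectors, product, absolute difference.
    feats = np.concatenate([g_a, g_b, g_a * g_b, np.abs(g_a - g_b)])
    z = feats @ w + b                     # single linear layer for brevity
    return 1.0 / (1.0 + np.exp(-z))       # sigmoid output

def bce_loss(p, y):
    """Binary cross-entropy for one example with gold label y in {0, 1}."""
    eps = 1e-12                           # numerical guard against log(0)
    return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```

During training, the loss is averaged over a batch and minimized with RMSProp as described in the experimental setup.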

Experimental Setup
Dataset: We conduct experiments on two Chinese datasets for semantic textual similarity: LCQMC (Liu et al., 2018) and BQ (Chen et al., 2018a). LCQMC is a large-scale open-domain corpus for question matching, while BQ is a domain-specific corpus for bank question matching. Each sample in both datasets contains a pair of sentences and a binary label indicating whether the two sentences have the same meaning or share the same intention. The statistics of the two datasets are summarized in Table 1. For each dataset, accuracy (ACC) and F1 score are used as the evaluation metrics.

Hyper-parameters: The number of graph updating steps/layers L is 2 on both datasets. The dimension of node representations is 128. The dropout rate for all hidden layers is 0.2. The number of matching perspectives P is 20. Each model is trained with RMSProp with an initial learning rate of 0.0001 and a batch size of 32. We use the vocabulary provided by Song et al. (2018) to build the lattice.

Main Results
We compare our models with two types of baselines: basic neural models without pre-training and BERT-based models pre-trained on large-scale corpora. The basic neural approaches can be further divided into two groups: representation-based models and interaction-based models. Representation-based models calculate the two sentence representations independently and use their distance as the similarity score. Such models include Text-CNN (Kim, 2014), BiLSTM (Graves and Schmidhuber, 2005) and Lattice-CNN (Lai et al., 2019). Note that Lattice-CNN also takes word lattices as input. Interaction-based models consider the interaction between the two sentences when calculating sentence representations; they include BiMPM (Wang et al., 2017) and ESIM (Chen et al., 2017). ESIM has achieved state-of-the-art results on various matching tasks (Bowman et al., 2015; Williams et al., 2018). For pre-trained models, we consider BERT and several of its variants, such as BERT-wwm (Cui et al., 2019), BERT-wwm-ext (Cui et al., 2019) and ERNIE (Sun et al., 2019). One common feature of these BERT variants is that they all use word information during the pre-training phase. We use GMN-BERT to denote our proposed model. We also employ a character-level transformer encoder instead of BERT as the contextual node embedding module described in Section 3.1, which is denoted as GMN. The comparison results are reported in Table 2.
From the first part of the results, we find that our GMN performs better than the five baselines on both datasets. Also, the interaction-based models in general outperform the representation-based models. Although Lattice-CNN also utilizes word lattices, it has no node-level comparison due to the limits of its structure, which causes significant performance degradation. As for interaction-based models, although they also use the multi-perspective matching mechanism, GMN outperforms BiMPM and ESIM (both char and word variants), which indicates that the utilization of word lattices with our neural graph matching networks is powerful.
From the second part of Table 2, we find that the three variants of BERT (BERT-wwm, BERT-wwm-ext, ERNIE) all outperform the original BERT, which indicates that using word-level information during pre-training is important for Chinese matching tasks. Our model GMN-BERT performs better than all these BERT-based models, which shows that utilizing word information during the fine-tuning phase with GMN is an effective way to boost the performance of BERT for Chinese semantic matching.

Analysis
In this section, we investigate the effect of word segmentation on our model GMN. A word sequence can be regarded as a thin graph, so it can replace the word lattice as the input of GMN. As shown in Figure 3, we compare four models: Lattice is our GMN with the word lattice as input. JIEBA and PKU are similar to Lattice except that their input is the word sequence produced by one of two word segmentation tools, Jieba and pkuseg, while the input of JIEBA+PKU is a small lattice graph generated by merging the two word segmentation results. We find that the lattice-based models (Lattice and JIEBA+PKU) perform much better than the word-based models (JIEBA and PKU). We also find that the performance of JIEBA+PKU is very close to that of Lattice. The union of different word segmentation results can be regarded as a tiny lattice, which is usually a sub-graph of the overall lattice. Compared with the tiny graph, the overall lattice has more noisy nodes (i.e. invalid words in the corresponding sentence). Therefore, we think it is reasonable that the performance of the tiny lattice (JIEBA+PKU) is comparable to the performance of the overall lattice (Lattice).

Conclusion
In this paper, we propose a neural graph matching model for Chinese short text matching. It takes a pair of word lattices as input instead of word or character sequences. The utilization of word lattices provides multi-granularity information and avoids the error propagation issue of word segmentation. Additionally, our model is complementary to pre-trained language models: it can be regarded as a flexible method to introduce word information into BERT during the fine-tuning phase. The experimental results show that our model outperforms state-of-the-art text matching models as well as some BERT-based models.