SynET: Synonym Expansion using Transitivity

In this paper, we study a new task of synonym expansion using transitivity, and propose a novel approach named SynET, which considers the contexts of both given synonym pairs. It introduces an auxiliary task to reduce the impact of noisy sentences, and proposes a Multi-Perspective Mention Matching Network to match mentions from multiple perspectives. Extensive experiments on a real-world dataset show the effectiveness of our approach.


Introduction
Synonym discovery has become an important task that can benefit many downstream applications, such as web search, question answering (Zhou et al., 2013), knowledge graph construction (Boteanu et al., 2019), and clinical text analysis (Wang et al., 2019b).
One straightforward approach to obtaining synonyms is to extract them from public knowledge bases, such as WordNet (Fellbaum, 2000) and DBpedia. For example, WordNet groups terms into synsets, and DBpedia uses Redirects to URIs to indicate synonyms. However, these synonyms are curated manually, which makes the coverage rather limited.
Two types of approaches are widely exploited to discover synonyms automatically from text corpora: distributional approaches (Wang et al., 2019a,b; Fei et al., 2019) and pattern based approaches (Nguyen et al., 2017). Distributional approaches assume that if two terms appear in similar contexts, they are likely to be synonyms. For example, "USA" and "the United States" are often mentioned in similar contexts, so they both refer to the same country. Pattern based approaches lay emphasis on local contexts, such as "commonly known as". However, both have limitations: distributional approaches suffer from low precision, while pattern based approaches suffer from low recall. In order to address these limitations, DPE (Qu et al., 2017) integrated the two approaches for synonym discovery.
Intuitively, people believe that synonyms possess transitivity, that is, $(m_i, \text{synonym}, m_b) \wedge (m_b, \text{synonym}, m_j) \rightarrow (m_i, \text{synonym}, m_j)$, where $m_i$, $m_b$ and $m_j$ are three different mentions, and $m_b$ is the bridge mention of the two synonym pairs $(m_i, m_b)$ and $(m_b, m_j)$. This transitivity could be used to discover synonyms directly from existing ones. However, transitivity does not always hold, as shown in Figure 1(c). This is because 芙蓉 is polysemous, with two meanings: 金丝雀 (canary) and 木莲 (hibiscus). Using transitivity between synonym pairs directly would therefore produce wrong synonym pairs, so it is hazardous to infer $(m_i, \text{synonym}, m_j)$ directly when $(m_i, \text{synonym}, m_b)$ and $(m_b, \text{synonym}, m_j)$ are given. There are several challenges in addressing this problem.
Figure 1: Motivation of our task: synonyms are indicated by "Also known as" in Baidu Baike (a) and Wikidata (b). However, synonym transitivity does not always hold, as shown in (c), the transitivity graph built from (a). In (c), the edges with a red cross indicate that the corresponding two mentions are not synonymous.
Firstly, if we directly use distributional approaches to predict whether two mentions $m_i$ and $m_j$ are synonymous, without using the information of $(m_i, \text{synonym}, m_b)$ and $(m_b, \text{synonym}, m_j)$, the precision would be low, since the global contexts of $m_i$ ($m_j$) are diverse. Secondly, pattern based approaches cannot be applied effectively, since the sentences mentioning both $m_i$ and $m_j$ may be far fewer than the sentences mentioning both $m_i$ and $m_b$ (or $m_b$ and $m_j$). In this paper, sentences that mention both mentions of a pair are called support sentences. We analyze the distribution of support sentences in our dataset, which will be elaborated in Section 4.1, and the results are shown in Figure 2: about 60% of the $(m_i, m_b)$ or $(m_b, m_j)$ pairs have more than 5 support sentences, but less than 30% of the $(m_i, m_j)$ pairs have more than 5 support sentences, and 43% of the $(m_i, m_j)$ pairs have no support sentences at all. Thirdly, the support sentences are obtained in a distant-supervised way, which may bring in a lot of noise. Although the sentences mentioning the two mentions of a synonym pair tend to express the same meaning, which partly reduces the noise, we still have to reduce the impact of noisy sentences further.

In order to address these challenges, we propose a new synonym discovery task, Synonym Expansion using Transitivity (Figure 3): given two synonym pairs $(m_i, m_b)$ and $(m_b, m_j)$ with a bridge mention $m_b$ and their corresponding support sentences, which are obtained from a text corpus through distant supervision, we aim to predict whether $m_i$ and $m_j$ are synonyms or not.
For this task, we propose a novel framework named SYNET, which leverages the contexts of both given synonym pairs. It first introduces an auxiliary task to reduce the impact of noisy sentences, and then proposes a Multi-Perspective Mention Matching Network (MPMM) to match mentions from multiple perspectives, including M2M (Mention-to-Mention), M2B (Mention-to-mention-Bag) and B2B (mention-Bag-to-mention-Bag) matches.
Our contributions in this paper are as follows:
• We study a new task of synonym expansion using transitivity, and propose a novel approach named SYNET for this task. To the best of our knowledge, this is the first work to study the problem of synonym expansion using transitivity.
• We construct a dataset from encyclopedias through distant supervision, and the experiments on it show the effectiveness of our approach.

Task Definition
We first introduce basic concepts and their notations, and then present the task definition.

Synonym Pair. A synonym pair is a pair of strings (i.e., words or phrases) that refer to the same entity in the world. For example, ("United States", "USA") is a synonym pair, since "United States" and "USA" represent the same country. We can extract synonym pairs directly from the infoboxes of Baidu Baike or Wikidata, as shown in Figure 1.
Synonym Pair Candidate. A synonym pair candidate can be derived from existing synonym pairs according to the transitivity property. For example, ("the United States of America", "USA") and ("USA", "America") are two synonym pairs, so ("the United States of America", "America") can be considered a synonym pair candidate. Formally, if $(m_i, m_b)$ and $(m_b, m_j)$ are two synonym pairs, $(m_i, m_j)$ can be considered a synonym pair candidate. Since synonym transitivity does not always hold, $(m_i, m_j)$ cannot be treated as a synonym pair directly, as illustrated in Figure 1(c).

Support Sentence. In order to predict whether the two mentions $m_i$ and $m_j$ in a synonym pair candidate are synonymous, we need to collect support sentences. Since sentences containing both $m_i$ and $m_j$ are sparse or even nonexistent, we instead collect sentences that contain the mentions of the synonym pairs $(m_i, m_b)$ and $(m_b, m_j)$. We denote by $S_i$ the bag of support sentences for $(m_i, m_b)$; each sentence in $S_i$ contains both mentions $m_i$ and $m_b$. Taking the synonym pair ("the United States of America", "USA") as an example, the sentence "The United States of America, commonly known as the United States, America or USA." is one of its support sentences.
Task Definition. We formally define our task of synonym discovery using transitivity as follows: given two synonym pairs $(m_i, m_b)$ and $(m_b, m_j)$, where $m_b$ is the bridge mention, and two bags of corresponding support sentences $S_i$ and $S_j$, where each $s \in S_i$ ($S_j$) mentions both $m_i$ and $m_b$ ($m_j$ and $m_b$), the task is to predict whether the two mentions $m_i$ and $m_j$ in the synonym pair candidate are synonymous or not. Figure 3 illustrates the task with an example.
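To make candidate generation concrete, the following is a minimal sketch (an illustration of the definitions above, not code from the paper) that enumerates synonym pair candidates from a set of known pairs via shared bridge mentions:

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(synonym_pairs):
    """Generate synonym pair candidates via transitivity.

    Each candidate (m_i, m_j) shares a bridge mention m_b with two known
    pairs (m_i, m_b) and (m_b, m_j); it still has to be verified, since
    transitivity does not always hold.
    """
    neighbors = defaultdict(set)
    for a, b in synonym_pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)
    candidates = set()
    for m_b, linked in neighbors.items():
        for m_i, m_j in combinations(sorted(linked), 2):
            if m_j not in neighbors[m_i]:          # not already a known pair
                candidates.add((m_i, m_b, m_j))    # keep the bridge mention
    return candidates

pairs = [("the United States of America", "USA"), ("USA", "America")]
print(candidate_pairs(pairs))
# {('America', 'USA', 'the United States of America')}
```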

The SYNET Approach
In this section, we introduce our proposed approach SYNET for synonym discovery using transitivity. As shown in Figure 4, SYNET mainly consists of three components: a Sentence Encoder (Section 3.1), a Mention Encoder (Section 3.3) and the Multi-Perspective Mention Matching Network (MPMM) (Section 3.4), together with an auxiliary task for noise reduction (Section 3.2). In the following sections, we elaborate on each component in detail.

Sentence Encoder
We can employ a BiLSTM (Hochreiter and Schmidhuber, 1997) or BERT (Devlin et al., 2019) to encode each support sentence $s \in S_i$, where $s$ is a sequence of words $w_1, w_2, \ldots, w_n$, and the two mentions $m_i$ and $m_b$ of a synonym pair correspond to the subsequences $w_{i_s}, \ldots, w_{i_e}$ and $w_{b_s}, \ldots, w_{b_e}$ respectively. Each word $w_i$ is mapped to a pre-trained $d_w$-dimensional vector $\mathbf{w}_i$.

BiLSTM based Sentence Encoder
The BiLSTM based sentence encoder (Figure 5) first encodes the sentence $s$ into hidden states $(h_1, h_2, \ldots, h_n)$:

$$h_t = \mathrm{LSTM}_{fw}(v_t) \oplus \mathrm{LSTM}_{bw}(v_t),$$

where $\mathrm{LSTM}_{fw}$ and $\mathrm{LSTM}_{bw}$ are the forward and backward LSTMs respectively, $v_t = [\mathbf{w}_t \oplus p^1_t \oplus p^2_t]$, and $p^1_t, p^2_t \in \mathbb{R}^{d_p}$ are two position embeddings (Zeng et al., 2015).
Then, the sentence embedding is calculated by $v_s = \tanh(W_s h_s + b_s)$, where $h_s$ is the aggregation of the hidden states. In addition, the embedding $v_{m_i}$ of mention $m_i$ can be calculated analogously from the hidden states of its span, $h_{i_s}, \ldots, h_{i_e}$.
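As an illustration, a minimal PyTorch sketch of such a BiLSTM sentence encoder follows; the hidden sizes, the max pooling used for $h_s$, and the mean pooling over the mention span are our assumptions:

```python
import torch
import torch.nn as nn

class BiLSTMSentenceEncoder(nn.Module):
    """A minimal sketch of the BiLSTM sentence encoder with position
    embeddings; pooling choices are assumptions, not the paper's spec."""

    def __init__(self, vocab_size, d_w=100, d_p=5, d_h=128, max_len=200):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_w)
        # two position embeddings: offsets to m_i and to m_b,
        # shifted by max_len so the indices are non-negative
        self.pos1_emb = nn.Embedding(2 * max_len, d_p)
        self.pos2_emb = nn.Embedding(2 * max_len, d_p)
        self.lstm = nn.LSTM(d_w + 2 * d_p, d_h, batch_first=True,
                            bidirectional=True)
        self.proj = nn.Linear(2 * d_h, 2 * d_h)

    def forward(self, words, pos1, pos2, span_i):
        # v_t = [w_t ⊕ p1_t ⊕ p2_t]
        v = torch.cat([self.word_emb(words),
                       self.pos1_emb(pos1),
                       self.pos2_emb(pos2)], dim=-1)
        h, _ = self.lstm(v)                         # (batch, n, 2*d_h)
        # sentence embedding: v_s = tanh(W_s h_s + b_s), h_s max-pooled
        v_s = torch.tanh(self.proj(h.max(dim=1).values))
        # mention embedding from the hidden states of the mention span
        s, e = span_i
        v_m = h[:, s:e + 1].mean(dim=1)
        return v_s, v_m
```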

BERT Based Sentence Encoder
The BERT based sentence encoder is shown in Figure 6. The input $s$ is first organized as the token sequence $([CLS], T_1, \ldots, T_n, [SEP])$, where $T_i$ is the concatenation of the word embedding, segmentation embedding and position embedding. The mention $m_i$ is enclosed by a marker token $[E_i]$, which is trained using the reserved $[unused]$ tokens in BERT. Then, BERT encodes the input into hidden states, and, similar to the BiLSTM based sentence encoder, the final sentence embedding for $s$ is derived from the hidden state of the $[CLS]$ token. During training, we start from a pre-trained BERT model, and then fine-tune it using our training data.
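A minimal sketch of this input scheme with the HuggingFace transformers library is shown below; the checkpoint name, the reuse of [unused1]/[unused2] as markers, and pooling the [CLS] and marker hidden states are our assumptions for illustration:

```python
import torch
from transformers import BertTokenizer, BertModel

E_I, E_B = "[unused1]", "[unused2]"   # marker tokens for m_i and m_b
tokenizer = BertTokenizer.from_pretrained(
    "bert-base-chinese", additional_special_tokens=[E_I, E_B])
model = BertModel.from_pretrained("bert-base-chinese")

def encode(sentence):
    """`sentence` has each mention pre-wrapped in its markers, e.g.
    "[unused1] 美国 [unused1] 通常被称为 [unused2] USA [unused2]"."""
    inputs = tokenizer(sentence, return_tensors="pt")   # adds [CLS]/[SEP]
    hidden = model(**inputs).last_hidden_state          # (1, n, 768)
    v_s = hidden[:, 0]                                  # [CLS] as sentence embedding
    # take the hidden state at the opening marker as the mention embedding
    ei_id = tokenizer.convert_tokens_to_ids(E_I)
    pos = (inputs["input_ids"][0] == ei_id).nonzero()[0, 0]
    v_m = hidden[:, pos]
    return v_s, v_m
```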

Auxiliary Task for Noise Reduction
In order to reduce the impact of noise in $S_i = \{s^1_i, s^2_i, \ldots, s^{l_i}_i\}$, where $l_i$ is the number of support sentences in $S_i$, we introduce an auxiliary task, which takes $S_i$ as the input and predicts the importance of each sentence with an attention mechanism through synonym relation classification.
Formally, a set of sentence embeddings $\{v^1_s, v^2_s, \ldots, v^{l_i}_s\}$ is obtained by the sentence encoder. Then we randomly initialize a relation vector $v_r \in \mathbb{R}^{d_c}$ to calculate the attention weight for each sentence $s^j_i$:

$$\alpha^j_i = \frac{\exp(v_r^\top v^j_s)}{\sum_{k=1}^{l_i} \exp(v_r^\top v^k_s)}.$$
Finally, $S_i$ can be represented by $v_{S_i} = \sum_{j=1}^{l_i} \alpha^j_i v^j_s$. Therefore, the probability of synonym prediction is $p(m_i \sim m_b \mid S_i) = \mathrm{softmax}(W_a v_{S_i} + b_a)$, where $\sim$ denotes that two mentions are synonymous. The loss for the synonym triple $(m_i, m_b, m_j)$ in this auxiliary task is:

$$L_{aux} = -\log p(m_i \sim m_b \mid S_i) - \log p(m_b \sim m_j \mid S_j).$$
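The following sketch illustrates this selective attention over a sentence bag in PyTorch; the layer shapes and the linear classifier are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxiliaryAttention(nn.Module):
    """Sketch of the auxiliary task: attend over a bag of sentence
    embeddings with a learned relation vector v_r, then classify whether
    the pair mentioned in the bag is synonymous."""

    def __init__(self, d_c=128):
        super().__init__()
        self.v_r = nn.Parameter(torch.randn(d_c))   # relation vector
        self.clf = nn.Linear(d_c, 2)                # synonym / not synonym

    def forward(self, bag):                         # bag: (l_i, d_c)
        alpha = F.softmax(bag @ self.v_r, dim=0)    # one weight per sentence
        v_S = alpha @ bag                           # bag representation (d_c,)
        logits = self.clf(v_S)
        return alpha, logits

aux = AuxiliaryAttention()
alpha, logits = aux(torch.randn(5, 128))
loss_aux = F.cross_entropy(logits.unsqueeze(0), torch.tensor([1]))
```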

Mention Encoder
During the sentence encoding of each sentence $s$ in $S_i$, we can also obtain the mention embeddings $v_{m_i}$ and $v_{m_b}$ for $m_i$ and $m_b$, as in Section 3.1. Thus, two bags of mention embeddings can be obtained from $S_i$: $B_i = \{v^1_{m_i}, \ldots, v^{l_i}_{m_i}\}$ and $B^i_b = \{v^1_{m_b}, \ldots, v^{l_i}_{m_b}\}$. Since the sentences in a bag contain some noise, we reuse the attention weight $\alpha^j_i$ calculated for each sentence $s^j_i \in S_i$ in Section 3.2. The aggregated embeddings for the mentions $m_i$ and $m_b$ in $S_i$ can then be calculated as $V_{m_i} = \sum_{j=1}^{l_i} \alpha^j_i v^j_{m_i}$ and $V_{m_b} = \sum_{j=1}^{l_i} \alpha^j_i v^j_{m_b}$.
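A short sketch of this attention-weighted aggregation, reusing the weights $\alpha^j_i$ from the auxiliary task:

```python
import torch

def aggregate_mentions(alpha, mention_bag):
    """Weight each per-sentence mention embedding by the attention
    weight of its sentence (a sketch of the aggregation above)."""
    # alpha: (l_i,), mention bag B: (l_i, d_c) -> aggregated V_m: (d_c,)
    return alpha @ mention_bag

alpha = torch.softmax(torch.randn(5), dim=0)    # stand-in attention weights
V_m_i = aggregate_mentions(alpha, torch.randn(5, 128))
V_m_b = aggregate_mentions(alpha, torch.randn(5, 128))
```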

Multi-Perspective Mention Matching Network
In order to predict whether $m_i \in S_i$ and $m_j \in S_j$ are synonyms, an intuitive and direct idea is to measure the semantic similarity between $m_i$ and $m_j$. We can fuse $V_{m_i}$ and $V_{m_b}$ to represent the semantics of the mention $m_i$ with a gating mechanism:

$$V^i_m = \sigma(g) \odot V_{m_i} + (1 - \sigma(g)) \odot V_{m_b},$$

where $g \in \mathbb{R}^{d_c}$ is a learnable parameter, $\sigma$ is the sigmoid function, and $\odot$ is element-wise multiplication. Thus, we can use $\mathrm{softmax}(W[V^i_m \oplus V^j_m] + b)$ to predict the synonymity between $m_i$ and $m_j$, where $V^{i(j)}_m$ is the mention representation of $S_{i(j)}$, and $W$ and $b$ are two learnable parameters.
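A minimal sketch of this gating mechanism, assuming the gate blends the two embeddings through a sigmoid as reconstructed above:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Sketch of the gate fusing V_{m_i} and V_{m_b} into one mention
    representation; the exact gate form is an assumption."""

    def __init__(self, d_c=128):
        super().__init__()
        self.g = nn.Parameter(torch.zeros(d_c))     # learnable gate vector

    def forward(self, V_mi, V_mb):
        g = torch.sigmoid(self.g)
        return g * V_mi + (1.0 - g) * V_mb          # element-wise blend

fuse = GatedFusion()
V_i_m = fuse(torch.randn(128), torch.randn(128))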
Besides $V_{m_i}$, $V_{m_b}$ and $V_{m_j}$, the bags $B_i$, $B_b$ and $B_j$ are also used to represent the mentions $m_i$, $m_b$ and $m_j$. Thus, we propose a multi-perspective mention matching network (MPMM) to match mentions from multiple perspectives, including M2M (Mention-to-Mention), M2B (Mention-to-mention-Bag) and B2B (mention-Bag-to-mention-Bag) matches. In order to differentiate $m_b$ in $S_i$ and $S_j$, we use $B^{i(j)}_b = \{v^k_{m_b}\}_{k=1}^{l_{i(j)}}$, where $v^k_{m_b}$ is the bridge mention embedding of $s^k \in S_{i(j)}$. Figure 7 illustrates the MPMM in detail. In our experiments, we find that the semantic consistency of $m_b$ between $S_i$ and $S_j$ is more effective for predicting the synonymity between $m_i$ and $m_j$. Thus, we use $B^i_b$ and $B^j_b$ (i.e., $B_b$), rather than $B_i$ and $B_j$, in the bag-level matches.

LSTM has achieved some success in aggregating an unordered set (Hamilton et al., 2017; Zhang et al., 2020). Here, given $V^i_m$ and $B^j_b$, we also use an LSTM to aggregate them: the LSTM state is initialized with $V^i_m$ and consumes the embeddings in $B^j_b$ one by one, i.e., $[h_k, c_k] = \mathrm{LSTM}(v^k_{m_b}, [h_{k-1}, c_{k-1}])$, where $\mathrm{LSTM}(\cdot, [h, c])$ is an LSTM cell. The final output of the LSTM, $h_{l_j}$, is denoted as $V^i_{M2B}$. Similarly, we can obtain $V^j_{M2B} = h_{l_i}$ by feeding $V^j_m$ and $B^i_b$ into the LSTM. Finally, the probability of $m_i$ and $m_j$ being synonymous, $p(m_i \sim m_j \mid S_i, S_j)$, is calculated by a softmax over the concatenated matching representations of all perspectives. The following loss is used for the synonym triple $(m_i, m_b, m_j)$ with corresponding support sentences $S_i$ and $S_j$:

$$L_{mm} = -\log p(y_{ij} \mid S_i, S_j),$$

where $y_{ij}$ is the gold label of the candidate $(m_i, m_j)$.
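The M2B match could be sketched as follows; seeding the LSTM hidden state with the mention representation before feeding the bag is our assumption about how $V^i_m$ and $B^j_b$ are combined:

```python
import torch
import torch.nn as nn

class M2BMatcher(nn.Module):
    """Sketch of the Mention-to-mention-Bag match: initialize an LSTM
    with the mention representation, then feed the (unordered) bag of
    bridge mention embeddings; the last hidden state is V_{M2B}."""

    def __init__(self, d_c=128):
        super().__init__()
        self.cell = nn.LSTMCell(d_c, d_c)

    def forward(self, V_m, bag):                    # V_m: (d_c,), bag: (l, d_c)
        h = V_m.unsqueeze(0)                        # seed with the mention
        c = torch.zeros_like(h)
        for v in bag:                               # bag order is arbitrary
            h, c = self.cell(v.unsqueeze(0), (h, c))
        return h.squeeze(0)                         # V_{M2B}

m2b = M2BMatcher()
V_i_M2B = m2b(torch.randn(128), torch.randn(7, 128))
```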

Model Optimization and Inference
To train SYNET, we minimize the overall objective $L = \sum_{t=1}^{T} (L^{i_t,j_t}_{aux} + L^{i_t,j_t}_{mm})$, where $T$ is the number of synonym triples $\{(m_{i_t}, m_{b_t}, m_{j_t})\}_{t=1}^{T}$. During inference, we use $p(m_i \sim m_j \mid S_i, S_j)$ to predict whether $m_i$ and $m_j$ are synonyms or not.
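A sketch of one optimization step under this objective; the model interface and the batch fields (`aux_label_i`, `label`, etc.) are hypothetical names for illustration:

```python
import torch.nn.functional as F

def training_step(model, batch, optimizer):
    """One step over a synonym triple, assuming `model` returns the
    auxiliary logits for S_i and S_j plus the final matching logits,
    and `batch` carries the gold labels."""
    aux_i, aux_j, match_logits = model(batch)
    # L = L_aux + L_mm, both realized as standard cross-entropy losses
    loss = (F.cross_entropy(aux_i, batch["aux_label_i"])
            + F.cross_entropy(aux_j, batch["aux_label_j"])
            + F.cross_entropy(match_logits, batch["label"]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```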

Dataset Construction
We build a dataset, SYNETDATA, from Baidu Baike, the largest Chinese encyclopedia, in a distant supervision manner.
Each instance of the dataset is a six-tuple $(m_i, m_b, m_j, S_i, S_j, l)$, where $l$ is the label. For positive instances, we extract groups of synonyms from the infoboxes of Baidu Baike, as shown in Figure 1(a). Then, we randomly select 3 mentions from a group, which can be treated as a positive instance $(m_i, m_b, m_j)$ with label $l = 1$.
For negative instances, we first crawl disambiguation pages from Baidu Baike and extract all senses for each mention; such a mention can be considered a bridge mention. For example, Figure 8 shows several senses of 芙蓉. Then, we randomly select two senses, such as a plant and a bird, and extract synonyms for each sense from the infoboxes of the corresponding articles. For the plant sense we can extract 木莲, while for the bird sense we can extract 金丝雀, which yields a negative instance with label $l = 0$. The support sentences are retrieved from the articles of Baidu Baike, which are indexed with Lucene. Since a sentence with a longer distance between the two mentions tends to be noisier, we sort the sentences by the distance between the two mentions and select the top 16 sentences as $S_i$ ($S_j$), in order to fit the BERT model. All sentences in $S_i$ and $S_j$ are segmented by HanLP. The statistics of the dataset are presented in Table 1; the number of support sentences in each bag ranges from 2 to 16.
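The distance-based selection of support sentences could look like the following sketch; measuring the distance in characters rather than tokens is a simplification:

```python
def select_support_sentences(sentences, m_a, m_b, k=16):
    """Keep the k sentences where the two mentions are closest together,
    on the assumption that a larger gap means a noisier sentence."""
    def mention_distance(s):
        ia, ib = s.find(m_a), s.find(m_b)
        return abs(ia - ib) if ia >= 0 and ib >= 0 else float("inf")
    kept = [s for s in sentences if mention_distance(s) != float("inf")]
    return sorted(kept, key=mention_distance)[:k]
```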

Experimental Settings
We compare SYNET with the following baselines.
• Word2vec. We concatenate the word embeddings of $m_i$ and $m_j$, which are pre-trained using word2vec on all articles in Baidu Baike, and then feed the result into a multi-layer perceptron for synonym prediction.
• BiLSTM. We employ a BiLSTM to encode each support sentence $s$ and calculate the embedding $v_{m_i}$ of the mention $m_i$ as in Section 3.1.1. Then, we average the embeddings of the mention $m_i$ over all support sentences to obtain its final representation: $V_i = \frac{1}{l_i} \sum_{j=1}^{l_i} v^j_{m_i}$. Finally, we concatenate the embeddings $V_i$ and $V_j$ of the two mentions, and feed the result into a multi-layer perceptron for synonym prediction.
• BERT. We concatenate the embeddings $V_{m_i}$ and $V_{m_j}$ of the two mentions, which are obtained from the BERT based sentence encoder as in Section 3.1.2, and then feed the result into a multi-layer perceptron for synonym prediction.
• SynonymNet (Zhang et al., 2019). SynonymNet also uses a BiLSTM to encode the contexts of each mention, and then uses a bilateral matching schema to determine synonymity. In our experiments, we use $S_i$ and $S_j$ as the contexts of $m_i$ and $m_j$. In addition, we implement both architectures for training SynonymNet: a siamese architecture and a triplet architecture.
• SynSetMine (Shen et al., 2019). SynSetMine learns a set-instance classifier to determine whether a synonym set $S$ should include an instance $t$. In our experiments, we use SynSetMine to determine whether $m_i$ can be added to the set $\{m_j, m_b\}$ or $m_j$ can be added to the set $\{m_i, m_b\}$. We also implement its variants using different word embeddings (word2vec, BERT and BiLSTM) and different aggregation methods (mean pooling and sum pooling).
The accuracy, precision, recall and F1 are used to evaluate the approaches.
In our implementation, we set the dimension of word embeddings to $d_w = 100$, and set $d_c = 128$, $d_p = 5$ and $d_h = 768$ for the hidden states in the sentence encoder and mention encoder. We optimize our model using Adam (Kingma and Ba, 2015) and apply dropout with a rate of 0.1.

Main Results
We present our main results in Table 2. Our approach outperforms all other approaches and their variants. SynonymNet and SynSetMine perform better than Word2vec and BiLSTM. For SynonymNet, the siamese architecture works better on our dataset than the triplet architecture, while for SynSetMine, sum pooling achieves better performance than mean pooling.

Ablation Studies
We conduct an ablation study to evaluate the contribution of each model component, and show the results in Table 3.
From the table, we can see that (1) the auxiliary task boosts the performance of both SYNET(BERT) and SYNET(BiLSTM) by putting different weights on sentences, which reduces the impact of noisy sentences; the benefit of the auxiliary task is statistically significant with $p < 0.05$ under a t-test. (2) All perspectives of mention matching in MPMM are useful, and using only one perspective reduces the performance greatly. The effectiveness of the individual perspectives is M2B > B2B > M2M. The reason may be that the LSTM can capture "deep" feature interactions and strengthen the expressive capacity of the mention embeddings.
(3) When only M2M is used in MPMM, our approach degrades to a synonym prediction model using a BiLSTM with attention, where the BiLSTM encodes the mentions $m_i$ and $m_j$, while the auxiliary task calculates the attention weights of the support sentences in $S_i$ and $S_j$. This variant still performs better than the BiLSTM baseline in Table 2, which further verifies the effectiveness of the auxiliary task.
Besides, we also compare the two strategies in MPMM, using $B_b$ or $B_{i(j)}$, and the results are shown in Table 4. From the table, we can see that exploiting the semantic consistency of $m_b$ between $S_i$ and $S_j$ is more effective than directly using $B_{i(j)}$ in MPMM, for both SYNET(BiLSTM) and SYNET(BERT).

Related Work
Synonym discovery is a crucial task in NLP, and many efforts have been devoted to it. One straightforward approach to obtaining synonyms is to extract them from public knowledge bases. Recently, researchers have focused on mining synonyms from raw text corpora, which is more challenging. Two types of approaches are widely exploited: pattern based approaches (Nguyen et al., 2017) and distributional approaches (Wang et al., 2019a,b; Fei et al., 2019; Zhang et al., 2019). Pattern based approaches lay emphasis on local contexts, such as "commonly known as", while distributional approaches assume that if two terms appear in similar contexts, they are likely to be synonyms. For example, SynonymNet (Zhang et al., 2019) proposed a multi-context bilateral matching framework for synonym discovery from free-text corpora, and SurfCon (Wang et al., 2019b) discovered synonyms in privacy-aware clinical data by utilizing surface form information and global context information. However, these approaches suffer from either low precision or low recall. Thus, DPE (Qu et al., 2017) and SynMine integrated the two approaches for synonym discovery. Moreover, SynSetMine (Shen et al., 2019) learned a set-instance classifier to generate entity synonym sets from a given vocabulary, using example sets from external knowledge bases as distant supervision.
Our approach focuses on mining synonyms using transitivity, which is not the focus of previous works. Although transitivity has been utilized before, it was assumed to hold in almost all cases for attribute synonyms, and was used to discover cluster-based synonyms with a linear programming based algorithm. In contrast, our approach calls this property into question, and only uses it to generate synonym pair candidates.

Conclusion
In this paper, we study a new task of synonym expansion using transitivity, and propose a novel approach named SYNET. To the best of our knowledge, this is the first work to study this problem. SYNET considers the contexts of both given synonym pairs. It introduces an auxiliary task to reduce the impact of noisy sentences, and proposes a Multi-Perspective Mention Matching Network to match mentions from multiple perspectives. Extensive experiments on a real-world dataset show the effectiveness of our approach.