End to End Chinese Lexical Fusion Recognition with Sememe Knowledge

In this paper, we present Chinese lexical fusion recognition, a new task which could be regarded as one kind of coreference recognition. First, we introduce the task in detail, showing the relationship with coreference recognition and differences from the existing tasks. Second, we propose an end-to-end model for the task, handling mentions as well as coreference relationship jointly. The model exploits the state-of-the-art contextualized BERT representations as an encoder, and is further enhanced with the sememe knowledge from HowNet by graph attention networks. We manually annotate a benchmark dataset for the task and then conduct experiments on it. Results demonstrate that our final model is effective and competitive for the task. Detailed analysis is offered for comprehensively understanding the new task and our proposed model.


Introduction
Coreference is an important topic in linguistics (Gordon and Hendrick, 1998; Pinillos, 2011), and coreference recognition has been extensively researched in the natural language processing (NLP) community (Ng and Cardie, 2002; Lee et al., 2017; Qi et al., 2012; Fei et al., 2019). There are different kinds of coreference, such as pronoun anaphora and abbreviation (Mitkov, 1998; Mitkov, 1999; Muñoz and Montoyo, 2002; Choubey and Huang, 2018). Here we examine the phenomenon of Chinese lexical fusion, also called separable coreference (Chen and Ren, 2020), where two closely related words in a paragraph are united by a fusion form of the same meaning; the fusion word can be seen as a coreference of the two separation words. Since fusion words are usually out-of-vocabulary (OOV) words in downstream paragraph-level tasks such as reading comprehension, summarization, and machine translation, they can hinder overall understanding and lead to degraded performance; the new task can thus offer informative knowledge to these tasks.

卡纳瓦罗打破受访惯例，公开训练未接受采访。
Cannavaro broke his convention of accepting interviews, taking no interviews at the public training session.

首批医务人员昨日返杭，其余人员预计两周内回到杭州。
The first group of medical personnel returned to Hangzhou yesterday; the rest are expected to return to Hangzhou within two weeks.
Lexical fusion is used frequently in the Chinese language (Chen and Ren, 2020). Moreover, fusion words are usually rarer than the separation words they corefer with, making them more difficult for NLP models to handle (Zhang and Yang, 2018; Gui et al., 2019). Fortunately, the meaning of a fusion word can be derived from that of its separation words. Thus recognizing lexical fusion would benefit downstream paragraph- (or document-) level NLP applications such as machine translation, information extraction, and summarization (Li and Yarowsky, 2008; Ferreira et al., 2013; Kundu et al., 2018). For example, for deep semantic parsing or translation, the fusion word "受访" (UNK) can be substituted directly by the separation words "接受" (accept) and "访问" (interview), as the same fusion word rarely occurs in other paragraphs.
The recognition of lexical fusion can be accomplished by two subtasks. Given a paragraph, the fusion words as well as the separation words are detected in the first step, which is referred to as mention detection. Second, coreference recognition is performed over the detected mentions, linking each character of a fusion word to its respective coreference. Through the second step, the full lexical fusion coreferences are recognized concurrently. The two steps can be conducted jointly in a single end-to-end model (Lee et al., 2017), which helps avoid the error propagation problem and, meanwhile, enables full interaction between the two subtasks.
In this paper, we present a competitive end-to-end model for lexical fusion recognition. Contextualized BERT representations (Devlin et al., 2019) are adopted as encoder inputs, as they have achieved great success in a number of NLP tasks (Tian et al., 2019; Zhou et al., 2019). For mention detection, a CRF decoder (Huang et al., 2015) is exploited to detect all mention words, including both the fusion and the separation words. Further, we use a BiAffine decoder for coreference recognition (Zhang et al., 2017; Bekoulis et al., 2018), determining whether a given mention pair is a coreference or not.
Since our task is semantic-oriented, we use the sememe knowledge provided in HowNet (Dong and Dong, 2003) to help capture the semantic similarity between the characters and the separation words. HowNet has achieved success in many Chinese NLP tasks in recent years (Duan et al., 2007; Gu et al., 2018). Both Chinese characters and words are defined in HowNet by senses composed of sememe graphs, and we exploit graph attention networks (GAT) (Velickovic et al., 2018) to model the sememe graphs and enhance our encoder.
Finally, we manually construct a high-quality dataset to evaluate our models. The dataset consists of 7,271 cases of lexical fusion, all of which are used as test instances. To train the proposed models, we automatically construct a pseudo dataset from web resources. Experimental results show that the auto-constructed training dataset is highly effective for our task. The end-to-end models achieve better performance than the pipeline models, and the sememe knowledge brings significant improvements for both the pipeline and end-to-end models. Our final model obtains an F-measure of 79.64% for lexical fusion recognition. Further, we conduct in-depth analysis to comprehensively understand the proposed task and models.
In summary, we make the following contributions in this paper: (1) We introduce a new task of lexical fusion recognition, providing a gold-standard test dataset and an auto-constructed pseudo training dataset for the task, which can be used as a benchmark for future research.
(2) We propose a competitive end-to-end model for the lexical fusion recognition task, which integrates mention recognition with coreference identification based on BERT representations.
(3) We make use of the sememe knowledge from HowNet to help capture the semantic relevance between the characters in the fusion form and the separation words.
All the code and datasets will be open-sourced at https://github.com/liuyijiang1994/chinese lexical fusion under the Apache License 2.0.

Chinese Lexical Fusion
Lexical fusion refers to the phenomenon in which a composition word semantically corresponds to related words nearby in the context. The composition word can be seen as a compact form of the separation words in the paragraph, and is referred to as the fusion form of the separation words. The fusion form is often an OOV word, which brings difficulties in understanding a given paragraph. We can use a tuple to define the phenomenon in a paragraph: ⟨w_F, w_A1, w_A2, ..., w_An⟩ (n ≥ 2), where w_F denotes the fusion word and w_A1, w_A2, ..., w_An are the separation words. We regard w_F as a coreference of w_A1, w_A2, ..., w_An. More specifically, the number n equals the length of w_F, and the i-th character of w_F corresponds to the separation word w_Ai as one fine-grained coreference. The task is highly semantic-oriented, as all the fine-grained coreferences are semantically related. Note that Chinese lexical fusion has the following attributes, which distinguish it from other phenomena: (1) Each character of w_F must correspond to one and only one separation word; these one-to-one mappings (clusters) must be strictly respected. For example, ⟨二圣 (two saints), 李治 (Zhi Li), 武则天 (Zetian Wu)⟩ 1 is not a lexical fusion phenomenon, because neither of the two characters "二" (two) and "圣" (saint) can be associated with either of the separation words.
(2) For each fine-grained coreference, the i-th character of w_F is usually borrowed directly from its separation word w_Ai, but not always: a semantically equivalent character can be used instead, which greatly increases the difficulty of Chinese lexical fusion recognition. The examples in Table 1 demonstrate this rule; as shown, the coreferences "返↔回到 (return)", "降↔下调 (reduce)" and "息↔利率 (interest rate)" are all of this kind.
Given the above characteristics, we can easily differentiate Chinese lexical fusion from several other closely related linguistic phenomena such as abbreviation; the illustrated negative examples in items (1) and (3) are both abbreviations. For simplicity, we limit our focus to n = 2 in this work, because this situation covers over 99% of all lexical fusion cases according to our preliminary statistics. We thus refer to the tuples as triples throughout this paper.

Our model recognizes the mentions and reformats them into triples ⟨w_F, w_A1, w_A2⟩. The end-to-end model alleviates the error propagation problem of pipeline methods by modeling the subtasks jointly. The model consists of an encoder, a CRF decoder for mention recognition and a biaffine decoder for coreference recognition. Further, we enhance the encoder with sememe knowledge from HowNet.
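To make the tuple structure defined above concrete, here is a minimal sketch of a lexical fusion triple with the well-formedness check from attribute (1). The class and field names are illustrative assumptions, not the paper's actual data format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FusionTuple:
    fusion: str               # the fusion word w_F
    separations: List[str]    # the separation words w_A1 ... w_An

    def is_well_formed(self) -> bool:
        # Attribute (1): each character of w_F maps to exactly one
        # separation word, so n must equal the length of w_F.
        return len(self.fusion) == len(self.separations)

# "返杭" fuses "回到 (return)" and "杭州 (Hangzhou)": one character per word.
print(FusionTuple("返杭", ["回到", "杭州"]).is_well_formed())  # True
```

The negative example ⟨二圣, 李治, 武则天⟩ fails a stricter semantic check rather than this length check, which only enforces the one-to-one mapping count.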

Basic Model
Encoder We exploit BERT as the basic encoder because it has achieved state-of-the-art performance on a number of NLP tasks (Devlin et al., 2019). Given a paragraph c_1 ... c_n, the output of BERT is a sequence of contextual representations at the character level:

h_1 ... h_n = BERT(c_1 ... c_n),

where h_1 ... h_n is our desired representation.

Mention Detection
The CRF decoder is exploited to obtain the sequence labeling outputs. First, the encoder outputs h_1 ... h_n are transformed into tag scores at each position:

o_i = W^T h_i,

where W ∈ R^{d_h×|tags|} and |tags| is the number of tag classes. Then for each tag sequence y_1 ... y_n, its probability is computed by:

p(y_1 ... y_n) = exp( Σ_i ( o_i[y_i] + T[y_{i−1}, y_i] ) ) / Z,

where T is a model parameter giving the output tag transition scores and Z is a normalization factor. We use the standard Viterbi algorithm to obtain the highest-probability tag sequence.
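As a concrete illustration of the decoding step, below is a minimal pure-Python Viterbi sketch over per-position tag scores o_i and a transition matrix T. The toy scores are made up; a real CRF would use learned parameters.

```python
def viterbi(emissions, transition):
    """Return the highest-scoring tag sequence (as tag indices).

    emissions: per-position tag score lists (o_1 ... o_n).
    transition: transition[i][j] = score of moving from tag i to tag j (T).
    """
    n, k = len(emissions), len(emissions[0])
    score = list(emissions[0])   # best score ending in each tag so far
    back = []                    # backpointers per position
    for t in range(1, n):
        new_score, ptr = [], []
        for j in range(k):
            best_i = max(range(k), key=lambda i: score[i] + transition[i][j])
            new_score.append(score[best_i] + transition[best_i][j]
                             + emissions[t][j])
            ptr.append(best_i)
        back.append(ptr)
        score = new_score
    # Backtrack from the best final tag.
    best = max(range(k), key=lambda j: score[j])
    path = [best]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Two tags; the transition matrix rewards alternating tags here.
print(viterbi([[3, 0], [0, 3], [3, 0]], [[0, 1], [1, 0]]))  # [0, 1, 0]
```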
Coreference Recognition We take the character representations h_1 ... h_n from the encoder as well as the output tag sequence y_1 ... y_n of mention detection as inputs. For the tag sequence, we exploit a simple embedding matrix E to convert the tags into vectors e_1 ... e_n. We then concatenate the two kinds of representations, obtaining the encoder output z_1 ... z_n, where z_i = h_i ⊕ e_i. We treat coreference recognition as a binary classification problem: given the encoder output z_1 ... z_n and two positions i and j, we judge whether the relation between the two characters c_i and c_j is a coreference or not. For a fine-grained coreference between c_f and w_s, we regard all characters in w_s as connected to c_f. The score of the binary classification is computed by a simple BiAffine operation:

s_{i,j} = z_i^T U z_j + W (z_i ⊕ z_j) + b,

where s_{i,j} is a two-dimensional vector. We refer to Dozat and Manning (2017) for details of the BiAffine operation, which has shown strong capabilities in similar tasks.
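The BiAffine scoring can be sketched as a bilinear term plus a linear term over the concatenated pair, one score per class. The tiny dimensions and parameter values below are purely illustrative.

```python
def biaffine(z_i, z_j, U, W, b):
    """Score a character pair (c_i, c_j) for the binary coreference decision.

    Computes s_{i,j} = z_i^T U z_j + W (z_i (+) z_j) + b, returning one
    score per class (here 2: coreference / not-coreference).
    U: per-class bilinear tensors; W: per-class linear weights; b: biases.
    """
    k = len(b)                 # number of classes
    concat = z_i + z_j         # z_i (+) z_j as list concatenation
    scores = []
    for c in range(k):
        bilinear = sum(z_i[p] * U[c][p][q] * z_j[q]
                       for p in range(len(z_i)) for q in range(len(z_j)))
        linear = sum(W[c][m] * concat[m] for m in range(len(concat)))
        scores.append(bilinear + linear + b[c])
    return scores
```

In the full model these parameters are learned jointly with the rest of the network; the sketch only shows the shape of the computation.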
Training For mention detection, the supervised training objective is to minimize the cross-entropy of the gold-standard tag sequence:

L_mention = − log p(y^g_1 ... y^g_n),

where y^g_1 ... y^g_n is the supervised answer. For coreference recognition, we adopt the averaged cross-entropy loss over all input pairs as the training objective:

L_coref = − (1 / n^2) Σ_{i,j} log p(r_{i,j}),

where r_{i,j} denotes the ground-truth relation. The probability is computed by:

p(r_{i,j}) = exp( s_{i,j}[r_{i,j}] ) / Z_{i,j},

where Z_{i,j} is a normalization factor. We combine the losses of the two subtasks for joint training:

L = α L_mention + (1 − α) L_coref,

where α is a hyper-parameter balancing the losses of the two subtasks.
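A minimal sketch of the joint objective, assuming a softmax cross-entropy per character pair and a simple α / (1 − α) interpolation of the two losses; the exact interpolation form is an assumption, not the paper's stated equation.

```python
import math

def cross_entropy(logits, gold):
    """Softmax cross-entropy for one decision (e.g. one character pair)."""
    mx = max(logits)                                   # numerical stability
    z = sum(math.exp(l - mx) for l in logits)          # normalizer Z
    return -(logits[gold] - mx - math.log(z))

def joint_loss(mention_loss, pair_logits, pair_gold, alpha):
    """Combine the two sub-task losses (interpolation form assumed)."""
    coref = sum(cross_entropy(l, g)
                for l, g in zip(pair_logits, pair_gold)) / len(pair_gold)
    return alpha * mention_loss + (1 - alpha) * coref
```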

Sememe-Enhanced Encoder
Sememes and HowNet A sememe is regarded as the minimum semantic unit of the Chinese language, and sememes have been exploited for several semantic-oriented tasks, such as word sense disambiguation, event extraction and relation extraction (Gu et al., 2018). Our task is also semantic-oriented, because lexical fusion and coreference are both semantically related; thus sememes should be a useful source of information for our model. Following the majority of previous work, we extract sememes for characters from HowNet, a manually constructed sememe knowledge base. HowNet defines over 118,000 words (including characters) using about 2,000 sememes. 2 Figure 2 illustrates the annotations in HowNet. Each word is associated with several senses, and each sense is annotated with several sememes, where the sememes are organized as graphs.
Sememe to Character Representation For each character c_i, we make use of all possible HowNet words covering c_i, as shown in Figure 3, and extract all the senses included in these words. Each sense corresponds to one sememe graph, as shown in Figure 2. The sememe-to-character representation is obtained in two steps. First, we obtain each sense representation from its sememe graph and the position offset of its source word. Then, we aggregate all sense representations into a character-level representation, resulting in the sememe-enhanced encoder.
We use a standard GAT to represent the sememe graph. Let sm_1 ... sm_M denote all the sememes belonging to a given sense sn, with internal graph structure G. After applying the GAT module, we obtain:

h_sm_1 ... h_sm_M = GAT(e_sm_1 ... e_sm_M, G),

where e_* denotes the sememe embeddings, which are randomly initialized. We obtain the first part of the representation of sn by averaging over h_sm_1 ... h_sm_M. The second part is obtained straightforwardly as the embedding of the position offset of the sense's source word. The position offset is denoted by [s, e], where s and e indicate the relative positions of the start and end characters of the source word with respect to the current character, as illustrated in Figure 3; we embed the position offset as a single unit. Finally, we concatenate the two parts, resulting in the sense representation:

h_sn = ( (1/M) Σ_m h_sm_m ) ⊕ e_[s,e],

where ⊕ denotes vector concatenation.
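To make the GAT step concrete, here is a single-head attention layer over a small graph, sketched in pure Python. It follows the standard GAT update (LeakyReLU-scored attention, softmax over neighbours, weighted sum) but omits the learned linear projection and multi-head machinery for brevity.

```python
import math

def gat_layer(x, adj, a_w):
    """One single-head GAT layer over a sememe graph (toy sketch).

    x:   node embeddings, one vector per sememe node.
    adj: adjacency lists (each including the node itself for self-attention).
    a_w: attention parameter vector applied to concatenated node pairs.
    """
    def leaky_relu(v):
        return v if v > 0 else 0.2 * v

    out = []
    for i, xi in enumerate(x):
        nbrs = adj[i]
        # Unnormalized attention score for each neighbour j of node i.
        e = [leaky_relu(sum(w * c for w, c in zip(a_w, xi + x[j])))
             for j in nbrs]
        mx = max(e)
        a = [math.exp(v - mx) for v in e]
        s = sum(a)
        a = [v / s for v in a]                 # softmax over neighbours
        out.append([sum(a[k] * x[j][d] for k, j in enumerate(nbrs))
                    for d in range(len(xi))])  # weighted neighbour sum
    return out
```

With a zero attention vector the layer degenerates to neighbour averaging, which makes its behavior easy to check by hand.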
Finally, we aggregate all sense representations by global attention (Luong et al., 2015), guided by the BERT outputs, to obtain character-level representations. Let {sn_1, ..., sn_N} denote the set of sense representations for one character c_i. The sememe-enhanced representation for c_i is computed by:

α_j = softmax_j( v^T (h_i ⊕ sn_j) ),   h^sem_i = Σ_j α_j sn_j,

where v is a model parameter for the attention calculation, and h^sem_1 ... h^sem_n are the desired outputs, which are used in place of the BERT outputs h_1 ... h_n by the decoders.
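The sense-aggregation step can be sketched as follows. The additive scoring form v^T(h_i ⊕ sn_j) is an assumption about the exact attention function; the paper only specifies global attention guided by the BERT output.

```python
import math

def aggregate_senses(h_i, senses, v):
    """Aggregate sense vectors for one character by global attention.

    h_i:    BERT output vector for character c_i (the attention guide).
    senses: list of sense representation vectors sn_1 ... sn_N.
    v:      attention parameter applied to the concatenation h_i (+) sn_j.
    """
    scores = [sum(w * c for w, c in zip(v, h_i + sn)) for sn in senses]
    mx = max(scores)
    a = [math.exp(s - mx) for s in scores]
    z = sum(a)
    a = [x / z for x in a]                       # softmax over senses
    # Weighted sum of sense vectors -> sememe-enhanced representation.
    return [sum(a[j] * sn[d] for j, sn in enumerate(senses))
            for d in range(len(senses[0]))]
```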

Dataset
Test Data We build the lexical fusion dataset manually, with the raw corpus collected from SogouCA, a news corpus of 18 channels including domestic, international, sports, social and entertainment news. The BRAT Rapid Annotation Tool (Stenetorp et al., 2012) is used for annotation. We label the boundaries of mentions in each paragraph, determine their categories (fusion or separation words), link the characters in fusion words to the referred separation words, and finally format the annotation results as triples.
We annotate 17,000 paragraphs with five annotators who majored in Chinese linguistics. Every paragraph is annotated by two annotators, and paragraphs with divergent annotations are resolved by a vote of the other annotators. After removing paragraphs without lexical fusion, 7,271 lexical fusion cases are obtained, with 91% consensus.

Pseudo Training Dataset
We construct a pseudo lexical fusion corpus to train the models by making use of an online lexicon in which words are provided with explanatory notes. 3 For a two-character word, if there are two other words in the same paragraph that each contain one of its characters, or if those characters appear in the dictionary definitions, we obtain one context-independent triple. In this way we collect 1,608 well-formed lexical fusion triples and treat them as seeds for constructing pseudo training instances. Note that the fusion words of these triples are currently acceptable and widely used, such as "停车 (parking vehicles)↔停放 (parking)/车辆 (vehicle)". We then search for paragraphs containing all three words of a given triple, regarding them as valid cases of Chinese lexical fusion (Mintz et al., 2009). Finally, we obtain 11,034 paragraphs, which are divided into training and development sections for model parameter learning and hyper-parameter tuning, respectively.
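The distant-supervision matching used to build the pseudo corpus amounts to a simple containment check over seed triples; a sketch (the function name and tuple layout are illustrative):

```python
def find_pseudo_instances(triples, paragraphs):
    """Distant-supervision matching for the pseudo training set.

    A paragraph containing all three words of a seed triple
    (fusion, separation-1, separation-2) is taken as a noisy
    positive lexical-fusion instance.
    """
    instances = []
    for fusion, sep1, sep2 in triples:
        for p in paragraphs:
            if fusion in p and sep1 in p and sep2 in p:
                instances.append((fusion, sep1, sep2, p))
    return instances
```

As with any distant supervision, some matches will be coincidental co-occurrences rather than true lexical fusion, which is a known source of noise in the training data.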

Evaluation
A triple is regarded as correct only if all three elements are recognized at their exact positions. We calculate triple-level and fine-grained precision (P), recall (R) and F-measure (F) values, adopting the triple-level values as the main metrics of model performance. We also calculate mention-level P, R and F values to evaluate the performance of mention detection.
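Triple-level P/R/F can be computed as follows, counting a predicted triple as correct only on an exact match; a minimal sketch where triples are plain tuples (a real evaluation would also carry the character positions):

```python
def prf(gold, pred):
    """Triple-level precision, recall and F-measure.

    gold, pred: collections of hashable triples; a prediction is correct
    only when it matches a gold triple exactly.
    """
    correct = len(set(gold) & set(pred))
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```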

Settings
All the hyper-parameters are determined according to performance on the development dataset. We exploit the pretrained BERT-base representations as encoder inputs (Devlin et al., 2019). 4 The tag, sememe and position embedding sizes are set to 25, 200 and 12, respectively. The number of GAT heads is 3. We apply dropout with a rate of 0.2 on the encoder outputs to avoid overfitting (Srivastava et al., 2014), and optimize all model parameters (including the BERT part) by standard back-propagation using Adam (Kingma and Ba, 2015) with a learning rate of 0.001.

Results

Table 3 shows the final results on the test dataset. To investigate the influence of the sememe structure in detail, we construct a pseudo graph structure G for comparison, in which all sememes inside a sense are mutually connected. In addition, we compare against using only the sememes of characters (single-character words), in order to isolate the effect of word-level sememe information. Our final model, the end-to-end model with the word-level sememe-enhanced encoder (joint + GAT(word real)), achieves the best performance, reaching a triple-level F-measure of 79.64, significantly better than the basic model without sememe information. 5 In addition to the systems in Table 3, we also evaluated pipeline models as a contrast: the triple-level F-measures of the pipeline models with and without the word-level sememe graph are 78.71% and 74.74%, respectively.

According to the results, the proposed sememe-enhanced encoder is effective, bringing significant improvements over the corresponding basic model; the improvement in triple-level F-measure is 78.49 − 76.45 = 2.04. The real sememe graph also gives consistently better performance than the pseudo graph (an increase of 0.69 on average). The word-level sememe structure obtains an F-measure improvement of 79.64 − 78.49 = 1.15, indicating the usefulness of word-level information.

Analysis
Influence of Lexical Fusion Types We investigate model performance with respect to different lexical fusion types. We classify the fusion word types as IV/OOV according to the training corpus, and further categorize each fine-grained coreference by whether the fusion character is borrowed directly from its separation word (denoted by A) or not (denoted by B). Table 4 shows the results. Our models perform better on the IV categories than on the OOV ones, which confirms our intuition. In addition, we divide IV/OOV further into AA and AB categories. 6 We find that AB is much more difficult, with an F1 score of only 41.1. Examining the overall performance of fine-grained coreferences of types A and B, we see that type B is mainly responsible for the low performance: the gap between the two types is close to 50 F1 points.

Effect of Sememe Information To understand the sememe-enhanced encoder in depth, we examine the sense-level attention weights on an example. As shown in Figure 4, the fine-grained coreference "降↔下调 (reduce)" is used for illustration, where the three characters "下" (lower), "调" (adjust) and "降" (reduce) are studied. Each character includes a set of senses, listed as squares. 7 We can see that senses with shared sememes mutually enhance their respective attention weights, which is consistent with the goal of coreference recognition; such a connection would be difficult to establish without sememes.

Influence of the Mention Order Figure 5(a) shows the F-measure values according to the relative order of the mentions. Intuitively, recognizing forward references is more difficult than recognizing backward references. The results confirm this intuition: the F1 score of backward references is 4.2 points higher on average. In addition, our final model improves the performance on both types significantly.

Impact of the Separation Word Distance
The distance between the two separation words should be an important factor in coreference recognition: intuitively, as the distance increases, the difficulty should also increase greatly. Here we conduct an analysis to verify this intuition. Figure 5(b) shows the comparison results, which are consistent with our supposition. In addition, our final end-to-end model behaves much better, with relatively smaller decreases as the distance increases.

Related Work
Coreference recognition has been investigated extensively in NLP for decades (McCarthy and Lehnert, 1995; Cardie and Wagstaff, 1999; Ng and Cardie, 2002; Elango, 2005). Lexical fusion can be regarded as one kind of coreference; however, it has received little attention in the NLP community.
Our proposed models are inspired by work on neural coreference resolution (Fernández-Gallego et al., 2016; Clark and Manning, 2016; Xu et al., 2017; Lee et al., 2017). We adapt these models with task-specific features of Chinese lexical fusion, for example, enhancing the encoder with a GAT module for structural sememe information. Another closely related topic is abbreviation (Zhong, 1985). There have been several studies on abbreviation prediction, recovery and dictionary construction (Sun et al., 2008; Li and Yarowsky, 2008; Liu et al., 2009; Zhang and Sun, 2018). Lexical fusion differs from abbreviation on several points. For example, an abbreviation always refers to one inseparable mention, which is not necessarily the case for lexical fusion. Besides, lexical fusion must abide by certain word-construction rules, while abbreviation is not so constrained.
BERT and its variants have achieved leading performance on the GLUE benchmark datasets (Devlin et al., 2019; Liu et al., 2019a; Liu et al., 2019b). For close tasks such as coreference resolution and relation extraction, BERT representations have also shown competitive results (Joshi et al., 2019), which inspires our use of them as basic inputs for competitive performance. Sense and sememe information has been demonstrated effective for several semantic-oriented NLP tasks (Niu et al., 2017; Gu et al., 2018; Zeng et al., 2018). HowNet offers a large sememe-based knowledge base (Dong and Dong, 2003), which has been adopted for sememe extraction. We naturally encode the sememes in graph form to enhance our task encoder.

Conclusion
In this work, we introduced the task of lexical fusion recognition in Chinese and presented an end-to-end model for the new task. BERT representations were exploited as the basic model inputs, and the model was further enhanced with sememe knowledge from HowNet via graph attention networks. We manually annotated a benchmark dataset for the task, which was used to evaluate the models. Experimental results on the annotated dataset indicate the competitive performance of our final model and the effectiveness of joint modeling with the sememe-enhanced encoder. Further analysis is offered to understand the new task and the proposed model in depth.