Correcting the Misuse: A Method for the Chinese Idiom Cloze Test

The cloze test for Chinese idioms is a new challenge in machine reading comprehension: given a sentence with a blank, a model must choose the candidate Chinese idiom that best matches the context. The Chinese idiom is a type of Chinese idiomatic expression. The common misuse of Chinese idioms leads to errors in corpora and thus to errors in the learned semantic representations of Chinese idioms. In this paper, we introduce definitions written by Chinese experts to correct the misuse. We propose a model for the Chinese idiom cloze test that integrates various information effectively, together with an attention mechanism called Attribute Attention that balances the weights of different attributes among different descriptions of a Chinese idiom. Besides the given candidates for every blank, we also try to choose the answer from all Chinese idioms that appear in the dataset and use the result as an extra loss, exploiting the uniqueness and specificity of Chinese idioms. In experiments, our model outperforms the state-of-the-art model.


Introduction
The Chinese idiom cloze test requires the ability to understand Chinese idioms. A Chinese idiom, called "成语" (chengyu) in Chinese, consists of four characters. Chinese idioms are mostly derived from stories in ancient Chinese literature and often reflect the moral behind those stories. To measure the ability to understand Chinese idioms, the Chinese idiom cloze test dataset was proposed (Zheng et al., 2019): given a sentence with a blank, an examinee is required to choose the idiom that best matches the context surrounding the blank. Table 1 shows an example of the Chinese idiom cloze test.
The misuse of Chinese idioms is prevalent among Chinese native speakers who have not received a professional Chinese education. Because of the metaphorical meaning of Chinese idioms, even native speakers who do not major in Chinese may use a Chinese idiom with its literal meaning, which causes misuse. Table 2 shows some common misuses of Chinese idioms; the misused meaning is often related to the literal meaning.
The misuse of Chinese idioms appears in various social media and texts such as Weibo and Zhihu. Chinese word embeddings and Chinese language models pretrained on such corpora absorb these misuses and learn incorrect meanings of Chinese idioms. For example, in Table 3, we use Google Translate to translate Chinese idioms and find that some results are incorrect, and the incorrect meanings happen to be the common misuses of these idioms. In this paper, we introduce the definition of a Chinese idiom, written by Chinese experts, to correct the misuse. The complete definition describes the accurate interpretation and usage of the idiom. Besides, because the misuse often comes from the literal meaning of the idiom, we propose an attention mechanism called Attribute Attention that extracts the relationships between the character-level and word-level representations.
Moreover, using the definition to correct the misuse does not mean that the non-misused part is dropped. Take 七月流火 in Table 2 as an example. The common misuse of 七月流火 is not totally incorrect: 七月流火 referring to the weather is correct, but the weather turning hot is incorrect. Therefore, we propose Attribute Attention to make use of the other representations of 七月流火 even if they contain incorrect information.
In addition, Chinese idioms are derived from stories in ancient literature and contain abundant information, so they tend to be used in more specific contexts than common words. For example, 美 means "beautiful", 轮 means "wheel", and 奂 means "magnificent", yet the Chinese idiom 美轮美奂 means "a building is beautiful" and can be used only when describing a building, although none of the four characters is related to buildings. When the four characters are combined, the meaning becomes narrow, so it is more difficult to find two similar Chinese idioms than two similar common words. In this paper, besides choosing the answer from the given candidates, our model also tries to choose the answer from the whole vocabulary of candidate Chinese idioms that appear in the dataset and treats the resulting loss as part of the final loss. In this way, relationships among many more idioms can be captured at each step. This costs very few extra computing resources but provides a significant improvement.

[Table 1: an example of the Chinese idiom cloze test. Sentence with a blank: 他们希望能____再进一步 (They hope that they can ____ and achieve greater success.) A candidate idiom: 百尺竿头 (literal translation: at the top of a hundred-foot pole; free translation: make still further progress.) Definition: 比喻到了极高的境地，仍须继续努力，求更大的进步。 (When one has achieved great success, one should continue to work hard to make greater progress.)]
In experiments, our model outperforms the state-of-the-art model. Our main contributions are summarized as follows:
• We introduce the definition and propose Attribute Attention to balance the importance of different representations of the Chinese idiom.
• We add an extra loss obtained by choosing the answer from all Chinese idioms that appear in the dataset, which costs very few extra computing resources but provides significant improvement.

Related Work
The cloze test is a classic task of reading comprehension, and many methods have been proposed (Hermann et al., 2015; Chen et al., 2016; Wang et al., 2018). The Chinese idiom cloze test is more challenging because Chinese idioms convey metaphorical meanings and are sometimes misused. Most works related to idioms focus on English idiom identification (Gedigian et al., 2006; Katz and Giesbrecht, 2006; Fazly et al., 2009; Shutova et al., 2010; Salton et al., 2016; Do Dinh et al., 2018b; Flor and Beigman Klebanov, 2018; Do Dinh et al., 2018a; Liu and Hwa, 2018). Some works have tried to use definitions: Spasic et al. (2017) analyzed the sentiment of definitions; Fathima Shirin and Raseek (2018) used the similarity between different definitions. However, these methods introduced definitions but did not try to understand them. Liu et al. (2017) used a CharLSTM to encode the meaning of idioms, which follows a similar idea. Only a few works have dealt with Chinese idioms, such as building Chinese emotion lexicons (Xu et al., 2010) and improving Chinese word segmentation (Chan and Chong, 2008; Sun and Xu, 2011; Wang and Xu, 2017). Chengyu Reader (CR) was proposed for the Chinese idiom cloze test, which used the definition.

Approach
Formally, the Chinese idiom cloze test requires the model to choose the correct answer from a number of candidate idioms given a sentence with a blank. The sentence is defined as a sequence of characters with a blank, also called the context in the following. A candidate Chinese idiom is defined as a sequence of four characters, called an idiom in the following. The definition is defined as a sequence of characters interpreting the corresponding idiom. In this paper, the term "BERT" refers to BERT-like models (Devlin et al., 2019; Liu et al., 2019; Lan et al., 2019), because any of them, and even future BERT-like models, can be used in our model. Figure 1 is an overview of our model. The following sections introduce each part of our model in turn.

Integrating Context and Definition
The definition is not the next sentence of the context; the context and the definition do not belong to the same document. It is therefore inappropriate to feed BERT the context as the first sentence and the definition as the second sentence separated by [SEP]. In this section, as shown in Figure 2, we propose a way to integrate the context and definition with BERT that lets the model "know" that the definition is mainly related to the idiom. We input the context, the candidate idiom, and the definition together. For example, we input the context "他们希望能____再进一步 (they hope they can ____ and achieve greater success)", the candidate idiom "百尺竿头 (make still further progress)", and the definition "比喻高的成就 (an outstanding achievement)" together as one sequence, with the blank replaced by a [MASK] token. The Multi-Head Attention is applied to the context and definition in different ways. Formally, the Multi-Head Attention for the context is:

v_i^{(l)} = MultiHead(v_i^{(l-1)}, [v_1^{(l-1)}; ...; v_{|v|}^{(l-1)}; m^{(l-1)}]),

where v_i^{(l)} denotes the i-th character of the context at the l-th layer, m^{(l)} denotes the [MASK] token at the l-th layer, and |v| denotes the number of characters of the context. The context can only "see" itself and the [MASK].
The Multi-Head Attention for the definition is:

d_i^{(l)} = MultiHead(d_i^{(l-1)}, [d_1^{(l-1)}; ...; d_{|d|}^{(l-1)}; v_{[SEP]}^{(l-1)}]),

where d_i^{(l)} denotes the i-th character of the definition d at the l-th layer, v_{[SEP]}^{(l-1)} denotes the first [SEP] token at the (l−1)-th layer, and |d| denotes the number of characters of the definition. The definition is inaccessible to the context, which prevents BERT from regarding the definition as the next sentence of the context.
The Multi-Head Attention for the [MASK] is:

m^{(l)} = MultiHead(m^{(l-1)}, [v_1^{(l-1)}; ...; v_{|v|}^{(l-1)}; m^{(l-1)}; d_1^{(l-1)}; ...; d_{|d|}^{(l-1)}]),

so the [MASK] can "see" the context, itself, and the definition. The output of the [MASK] at the last layer is defined as h_m.

[Figure 1: Architecture of our model.]
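As a minimal sketch of the visibility constraints above, the per-position mask can be built as follows. The exact token layout and the behavior of [SEP] are our assumptions for illustration, not necessarily the paper's implementation:

```python
import numpy as np

def build_visibility_mask(n_ctx, n_def):
    """Boolean attention mask for a sequence laid out (by assumption) as
    [context..., MASK, SEP, definition...].
    mask[q, k] = True means query position q may attend to key position k."""
    total = n_ctx + 2 + n_def
    mask = np.zeros((total, total), dtype=bool)
    ctx = slice(0, n_ctx)
    m = n_ctx            # the [MASK] position
    sep = n_ctx + 1      # the first [SEP]
    dfn = slice(n_ctx + 2, total)
    mask[ctx, ctx] = True   # the context "sees" itself ...
    mask[ctx, m] = True     # ... and the [MASK]
    mask[dfn, dfn] = True   # the definition "sees" itself ...
    mask[dfn, sep] = True   # ... and the first [SEP], but not the context
    mask[sep, ctx] = True   # [SEP] attends to the context side (our assumption)
    mask[sep, sep] = True
    mask[m, :] = True       # the [MASK] gathers information from everything
    return mask
```

Passing such a mask to the Multi-Head Attention keeps the definition from being treated as the next sentence of the context while still letting the [MASK] integrate both.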

Attribute Attention
This section describes how Attribute Attention is computed and the preparations for it. First, we extract summaries of the context, idiom, and definition. Then we calculate the weight of Attribute Attention with h_m from Section 3.1. Finally, Attribute Attention is applied with these summaries and the weight.

Summarizing Context
Summarizing the context means predicting what kind of idiom would be the correct answer for the blank based on the contextual information. For example, in Figure 3a, the sentence is "他们希望能____再进一步 (they hope they can ____ and achieve greater success)". The input is "他们希望能[MASK]再进一步". The output of the [MASK] is defined as h_c, as shown in Figure 3a.

Summarizing Idiom
We use BERT to extract and summarize the character-level information of the Chinese idiom. The output is defined as h_o, as shown in Figure 3b.
The context and candidate idioms come from the same corpus and share a similar contextual representation. Besides, the [CLS] token is not used when summarizing the context. Therefore, we use one BERT to model both the context and the idiom and use the [CLS] to summarize idioms. In the example of Figure 3b, the candidate idiom is "百尺竿头 (make still further progress)". The input is "[CLS]百尺竿头".

Summarizing Definition
Introducing the definitions can correct the misuse of idioms. We use the [CLS] token to summarize the definition. In the example of Figure 3c, the definition is "比喻高的成就 (an outstanding achievement)". The input is "[CLS]比喻高的成就". The output of the [CLS] is defined as h_d.

Word Embedding of Idiom
We use word embeddings to extract word-level information. To utilize more information from various corpora, more than one word embedding can be introduced. Different attributes of different word embeddings are assigned different weights in Attribute Attention. The word embeddings of one idiom from different sources are defined as {e_i}_{i=1}^{|e|}, where |e| is the number of word embeddings.

Weight Generation
As shown in Figure 1, this section is about generating the weight with h_m and {e_i}_{i=1}^{|e|}. In the standard attention mechanism, the attention weight is a series of scalars, whereas in Attribute Attention the attention weight is a series of vectors:
a_m^{<i>} = W_m^{<i>} h_m,

where W_m^{<i>} ∈ R^{m×b} is a learnable parameter; m denotes the hidden size of the attention, and b denotes the hidden size of BERT, such as 768 or 1024.
h_m generates the weight based on the context, which is more accurate but also more likely to overfit: the weight {a_m^{<i>}}_{i=1}^{|e|+2} may "remember" every context-idiom pair in the training set, where |e| is the number of word embeddings. Therefore, we also introduce word embeddings here. A word embedding cannot provide context information but has stronger generalization ability, because it is hard to overfit the training set unless an idiom appears in only a few contexts:

a_e^{<i>} = Σ_{j=1}^{|e|} W_{e_j}^{<i>} e_j,

where W_{e_j}^{<i>} ∈ R^{m×d} is a learnable parameter and d denotes the size of the word embedding, such as 300.
a_m^{<i>} ∈ R^m gives a more accurate weight but may overfit, whereas a_e^{<i>} ∈ R^m is more generalized but lacks the context. We add them up:

a^{<i>} = a_m^{<i>} + a_e^{<i>},

where a^{<i>} ∈ R^m. In this way, we obtain both accuracy and generalization from the two weights.
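The two weight sources can be sketched numerically as follows. The sizes and the random toy values are ours for illustration; only the combination a^{<i>} = a_m^{<i>} + a_e^{<i>} follows the text:

```python
import numpy as np

rng = np.random.default_rng(0)
b, d, m_dim, n_emb = 8, 6, 5, 2  # toy: BERT size, embedding size, attention size, #embeddings

h_m = rng.normal(size=b)                           # context summary from the [MASK]
embs = [rng.normal(size=d) for _ in range(n_emb)]  # word embeddings of one idiom

n_attr = n_emb + 2  # idiom summary + definition summary + word embeddings
W_m = [rng.normal(size=(m_dim, b)) for _ in range(n_attr)]
W_e = [[rng.normal(size=(m_dim, d)) for _ in range(n_emb)] for _ in range(n_attr)]

# a^{<i>} adds the context-based weight (accurate, may overfit) to the
# embedding-based weight (no context, but better generalization)
a = [W_m[i] @ h_m + sum(W_e[i][j] @ embs[j] for j in range(n_emb))
     for i in range(n_attr)]
```

Each a^{<i>} is a vector in R^m, one per description of the idiom, rather than a scalar as in standard attention.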

Attention Calculation
We define a_j^{<i>} as the j-th element of a^{<i>}, in other words, the j-th element of the i-th weight vector. Before applying the attention, every description is projected into the attention space:

h_o ← W_ao h_o,  h_d ← W_ad h_d,  e_i ← W_{ae_i} e_i,

where W_ao ∈ R^{m×b}, W_ad ∈ R^{m×b}, and W_{ae_i} ∈ R^{m×d} are learnable parameters; m denotes the hidden size of the attention, b denotes the hidden size of BERT, and d denotes the size of the word embedding. As shown in Figure 1, the attention then proceeds as:

h_j = α_j^{<1>} h_{o,j} + α_j^{<2>} h_{d,j} + Σ_{i=1}^{|e|} α_j^{<i+2>} e_{i,j},  α_j^{<i>} = exp(a_j^{<i>}) / Σ_{k=1}^{|e|+2} exp(a_j^{<k>}),

where h_j is the j-th element of the output, which is defined as h ∈ R^m; h_{o,j} is the j-th element of h_o, h_{d,j} is the j-th element of h_d, and e_{i,j} is the j-th element of e_i. h contains an accurate and correct description of an idiom under a certain context, obtained by choosing information from the idiom, the definition, and the word embeddings: the correct and important part of every representation remains, and the incorrect and unimportant part is dropped.
The final output of Attribute Attention is:

u_a = h^⊤ W_ua h_m,

where W_ua ∈ R^{m×b} is a learnable parameter. u_a ∈ R^1 is the score describing whether a candidate idiom is the correct answer.
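A minimal sketch of the element-wise attention: for every dimension j, a softmax over the attribute scores decides which description contributes that dimension of h. The function and variable names are ours:

```python
import numpy as np

def attribute_attention(a_list, x_list):
    """a_list: weight vectors a^{<i>} in R^m, one per description
       x_list: projected descriptions (idiom, definition, word embeddings) in R^m
       Returns h in R^m, mixing the descriptions dimension by dimension."""
    A = np.stack(a_list)                                      # (n_attr, m)
    X = np.stack(x_list)                                      # (n_attr, m)
    alpha = np.exp(A) / np.exp(A).sum(axis=0, keepdims=True)  # softmax over attributes
    return (alpha * X).sum(axis=0)

# a large weight on one attribute lets that description dominate the dimension
h = attribute_attention(
    [np.array([10.0, 0.0]), np.array([0.0, 10.0])],
    [np.array([1.0, 1.0]), np.array([2.0, 2.0])],
)
```

In this toy run, dimension 0 of h is taken almost entirely from the first description and dimension 1 from the second, illustrating how the correct part of each representation can be kept and the rest dropped.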

Classification
This section introduces the classification part in Figure 1. One reason for Attribute Attention summarizing the context and definition is to make use of word embeddings. Using h_m for classification can provide more details about the relationship between the characters of the context and the characters of the definition. Formally, the classification for h_m is:

u_m = W_cm h_m + b_cm,

where W_cm ∈ R^{1×b} and b_cm ∈ R^1 are learnable parameters. u_m ∈ R^1 is the score describing whether a candidate idiom is the correct answer. u_a and u_m denote the scores of one candidate idiom. We further define {u_{a_i}}_{i=1}^n and {u_{m_i}}_{i=1}^n as the scores of all candidate idioms, where n denotes the number of candidate idioms. Then we add them up:

u_{s_i} = u_{a_i} + u_{m_i},

and pass u_{s_i} through a softmax function:

p_i = exp(u_{s_i}) / Σ_{k=1}^n exp(u_{s_k}),

where p_i is the probability that the i-th candidate idiom is the correct answer. This is the end of inference, but not of training.
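The scoring step can be sketched as follows. The toy scores are ours; only the addition u_{s_i} = u_{a_i} + u_{m_i} and the softmax follow the text:

```python
import numpy as np

def candidate_probabilities(u_a, u_m):
    """Add the Attribute Attention score and the h_m-based score of each
    candidate, then softmax over the n candidates."""
    u_s = np.asarray(u_a, dtype=float) + np.asarray(u_m, dtype=float)
    e = np.exp(u_s - u_s.max())   # max-shift for numerical stability
    return e / e.sum()

# n = 7 candidates, as in ChID; the toy scores favor candidate 0
p = candidate_probabilities([2.0, 0, 0, 0, 0, 0, 0], [1.0, 0, 0, 0, 0, 0, 0])
```

At inference time the candidate with the largest p_i is returned.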

Extra Loss
Because Chinese idioms are used in more unique and specific contexts than common words, we choose the answer from all Chinese idioms that appear in the whole cloze test dataset as an extra loss for training. Formally, we use h_c to predict the correct answer from the whole vocabulary of candidate Chinese idioms:

q = softmax(W_cv h_c + b_v),

where W_cv ∈ R^{v×b} and b_v ∈ R^v are learnable parameters; v denotes the number of all candidate Chinese idioms, which is much larger than n.
q ∈ R^v contains the probabilities of all candidate idioms being the correct answer. In this way, the model can learn relationships among many more idioms at each step. Due to the uniqueness and specificity of Chinese idioms, this introduces only limited noise while improving performance significantly. Without the Extra Loss, only the relationships among the given candidate idioms are considered at each step.
At inference time, the candidate with the maximum probability in {p_i}_{i=1}^n is the final result. For training, the cross-entropy loss of {p_i}_{i=1}^n is defined as l_p, and the cross-entropy loss of q is defined as l_q. The final loss is:

l = l_p + β l_q,

where β is a hyper-parameter determining the weight of the loss l_q. Empirically, we suggest setting β to 0.5. l is the final loss for training.
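The training objective can be sketched with toy probabilities. The distributions below are illustrative; only the combination l = l_p + β·l_q follows the text:

```python
import numpy as np

def cross_entropy(probs, gold):
    """Negative log-probability of the gold index."""
    return -np.log(probs[gold])

# toy distributions: 7 given candidates, a toy vocabulary of 10 idioms
p = np.full(7, 0.1); p[3] = 0.4        # probabilities over the given candidates
q = np.full(10, 0.05); q[6] = 0.55     # probabilities over the whole vocabulary
beta = 0.5                             # suggested weight of the Extra Loss

l_p = cross_entropy(p, 3)   # loss over the given candidates
l_q = cross_entropy(q, 6)   # Extra Loss over all idioms in the dataset
l = l_p + beta * l_q        # final training loss
```

Since q only adds one linear layer over h_c, the extra computation is negligible compared with the BERT encoder.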

Training Details
In this section, we will introduce the details and hyper-parameters for training our model.
Dataset The ChID dataset (Zheng et al., 2019) is used in the experiments. Table 1 shows a simple example of the dataset. Given a sentence with a blank and several candidate Chinese idioms, an examinee is required to choose the Chinese idiom that best matches the context surrounding the blank. The corpus of ChID contains news, novels, and essays. News and novels are treated as in-domain data, which comprises a training set, a development set Dev, and a test set Test. Essays are reserved for the out-of-domain test Out, which evaluates generalization ability. In this way, the model is trained on news and novels but also evaluated on essays.
Ran and Sim are two test sets with the same sentences as Test. In Ran, the candidate idioms are not similar to the golden answer. In Sim, the candidate idioms are similar to the golden answer.
Hyper-parameters n is 7 because there are seven candidate idioms for every blank in the ChID dataset (Zheng et al., 2019). v is 3848 because the ChID dataset contains 3848 candidate idioms in total. The hidden size of the attention m is 100, and β is 0.5.
Optimizer The optimizer is Adam (Kingma and Ba, 2014) for BERT, with a linear schedule and a warm-up ratio of 0.05. The learning rate for RoBERTa is 2e-5, and for the other parameters it is 1e-3.

Number of parameters The number of parameters of our model in the experiments is 322M. The learnable parameters are initialized following He et al. (2015).
Metrics The metric for evaluation is accuracy, implemented with Scikit-learn (Pedregosa et al., 2011).

Comparison
The descriptions of the other models are as follows:
AR Attentive Reader (AR) (Hermann et al., 2015). AR uses an attention mechanism to read the sentence.
SAR Stanford Attentive Reader (SAR) (Chen et al., 2016). SAR is an improvement on AR.

Extra Loss Studies
This section explores how β influences the accuracy of our model on Test. Figure 4 shows the results. When β = 0, the Extra Loss is not used, which shows the performance of our model without the Extra Loss. The accuracy increases very quickly when β < 0.3, reaches its highest point when β = 0.5, and starts decreasing slowly when β > 1. A larger β makes the extra loss l_q too important and overshadows the normal loss l_p, which makes the model deviate from its purpose. The Extra Loss gives a significant improvement and costs very few computing resources.

Conclusion
In this paper, we propose a model for the Chinese idiom cloze test. We introduce the definition and propose Attribute Attention to balance the importance of different representations of the Chinese idiom. We add the Extra Loss, calculated by choosing the answer from the whole vocabulary of Chinese idioms, to further improve performance at very little computational cost. In experiments, our model outperforms the state-of-the-art method.