Language Modeling with Sparse Product of Sememe Experts

Most language modeling methods rely on large-scale data to statistically learn the sequential patterns of words. In this paper, we argue that words are atomic language units but not necessarily atomic semantic units. Inspired by HowNet, we use sememes, the minimum semantic units in human languages, to represent the implicit semantics behind words for language modeling, and name the resulting model the Sememe-Driven Language Model (SDLM). More specifically, to predict the next word, SDLM first estimates the sememe distribution given the textual context. Afterwards, it regards each sememe as a distinct semantic expert, and these experts jointly identify the most probable senses and the corresponding word. In this way, SDLM enables language models to work beyond word-level manipulation at the level of fine-grained sememe semantics, and offers more powerful tools to fine-tune language models and improve their interpretability as well as their robustness. Experiments on language modeling and the downstream application of headline generation demonstrate the effectiveness of SDLM.


Introduction
Language Modeling (LM) aims to measure the probability of a word sequence, reflecting its fluency and likelihood as a feasible sentence in a human language. Language modeling is an essential component in a wide range of natural language processing (NLP) tasks, such as Machine Translation (Brown et al., 1990; Brants et al., 2007), Speech Recognition (Katz, 1987), Information Retrieval (Berger and Lafferty, 1999; Ponte and Croft, 1998; Miller et al., 1999; Hiemstra, 1998) and Document Summarization (Rush et al., 2015; Banko et al., 2000). A probabilistic language model calculates the conditional probability of the next word given its context words, and these probabilities are typically learned from large-scale text corpora. Taking the simplest language model as an example, an N-gram model estimates the conditional probabilities by maximum likelihood over text corpora (Jurafsky, 2000). Recent years have also witnessed the advance of Recurrent Neural Networks (RNNs) as the state-of-the-art approach to language modeling (Mikolov et al., 2010), in which the context is represented as a low-dimensional hidden state that is used to predict the next word.
Conventional language models, including neural models, typically treat words as atomic symbols and model sequential patterns at the word level. However, this assumption does not always hold. Consider the following example sentence, for which people want to predict the next word in the blank: "The U.S. trade deficit last year is initially estimated to be 40 billion ____." People may first realize that a unit should be filled in, and then realize that it should be a currency unit. Based on the country this sentence is talking about, the U.S., one may confirm that it should be an American currency unit and predict the word dollars. Here, unit, currency, and American can be regarded as basic semantic units of the word dollars. This process, however, has not been explicitly taken into consideration by conventional language models. That is, although in most cases words are atomic language units, words are not necessarily atomic semantic units for language modeling. We argue that explicitly modeling these atomic semantic units could improve both the performance and the interpretability of language models.
Linguists assume that there is a limited closed set of atomic semantic units composing the semantic meanings of an open set of concepts (i.e. word senses). These atomic semantic units are named sememes (Dong and Dong, 2006) (see footnote i). Since sememes are naturally implicit in human languages, linguists have devoted much effort to explicitly annotating lexical sememes for words and building linguistic common-sense knowledge bases. HowNet (Dong and Dong, 2006) is one of the representative sememe knowledge bases, which annotates each Chinese word sense with its sememes. The philosophy of HowNet holds that the parts and attributes of a concept can be well represented by sememes. HowNet has been widely utilized in many NLP tasks such as word similarity computation (Liu, 2002) and sentiment analysis (Fu et al., 2013). However, less effort has been devoted to exploring its effectiveness in language models, especially neural language models.
It is non-trivial for neural language models to incorporate discrete sememe knowledge, as it is not compatible with continuous representations in neural models. In this paper, we propose a Sememe-Driven Language Model (SDLM) to leverage lexical sememe knowledge. In order to predict the next word, we design a novel sememe-sense-word generation process: (1) We first estimate the distribution of sememes according to the context. (2) Regarding these sememes as experts, we propose a sparse product of experts method to select the most probable senses. (3) Finally, the distribution of words can be easily calculated by marginalizing out the distribution of senses.
We evaluate the performance of SDLM on the language modeling task using a Chinese newspaper corpus, People's Daily (Renmin Ribao), and also on the headline generation task using the Large Scale Chinese Short Text Summarization (LCSTS) dataset (Hu et al., 2015). Experimental results show that SDLM outperforms all the data-driven baseline models. We also conduct case studies to show that our model can effectively predict relevant sememes given the context, which can improve the interpretability and robustness of language models.
Footnote i: Note that although sememes are defined as the minimum semantic units, there still exist several sememes for capturing syntactic information. For example, the word 和 "with" corresponds to one specific sememe 功能词 "FunctWord".

Background
Language models aim to learn the joint probability of a sequence of words $P(w^1, w^2, \cdots, w^n)$, which is usually factorized as $P(w^1, w^2, \cdots, w^n) = \prod_{t=1}^{n} P(w^t | w^{<t})$. Bengio et al. (2003) propose the first Neural Language Model as a feed-forward neural network. Mikolov et al. (2010) use an RNN and a softmax layer to model the conditional probability. To be specific, the model can be divided into two parts in series. First, a context vector $g^t$ is derived from a deep recurrent neural network. Then, the probability $P(w^{t+1} | w^{\le t}) = P(w^{t+1}; g^t)$ is derived from a linear layer followed by a softmax layer based on $g^t$. Let $\mathrm{RNN}(\cdot, \cdot; \theta_{NN})$ denote the deep recurrent neural network, where $\theta_{NN}$ denotes the parameters. The first part can be formulated as

$$g^t = \mathrm{RNN}(x_{w^t}, \{h_l^{t-1}\}_{l=1}^{L}; \theta_{NN}). \quad (1)$$

Here we use subscripts to denote layers and superscripts to denote timesteps. Thus $h_l^t$ represents the hidden state of the $l$-th layer at timestep $t$. $x_{w^t} \in \mathbb{R}^{H_0}$ is the input embedding of word $w^t$, where $H_0$ is the input embedding size. We also have $g^t \in \mathbb{R}^{H_1}$, where $H_1$ is the dimension of the context vector.
Supposing that there are $N$ words in the language we want to model, the second part can be written as

$$P(w^{t+1} = w; g^t) = \frac{\exp({g^t}^\top \mathbf{w}_w)}{\sum_{w'} \exp({g^t}^\top \mathbf{w}_{w'})}, \quad (2)$$

where $\mathbf{w}_w$ is the output embedding of word $w$ and $\mathbf{w}_1, \mathbf{w}_2, \cdots, \mathbf{w}_N \in \mathbb{R}^{H_2}$. Here $H_2$ is the output embedding size. For a conventional neural language model, $H_2$ always equals $H_1$.
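As a minimal sketch of the second part, the snippet below scores each word by the inner product of a context vector with that word's output embedding and normalizes the scores with a softmax. The toy vocabulary, the two-dimensional vectors and the function name `softmax_decoder` are illustrative assumptions, not values from our experiments.

```python
import math

def softmax_decoder(g, output_embeddings):
    """Return P(w; g) for every word, given a context vector g (length H1)
    and a dict mapping each word to its output embedding (length H2 = H1)."""
    scores = {w: sum(gi * wi for gi, wi in zip(g, emb))
              for w, emb in output_embeddings.items()}
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - m) for w, s in scores.items()}
    z = sum(exps.values())
    return {w: e / z for w, e in exps.items()}

# Hypothetical toy setup: three words, H1 = H2 = 2.
toy_embeddings = {"dollars": [1.0, 0.5], "yen": [0.8, -0.2], "apples": [-1.0, 0.3]}
g = [2.0, 1.0]
probs = softmax_decoder(g, toy_embeddings)
```

Every word in the vocabulary receives a score here, which is the word-level behavior that the sememe-driven decoder replaces.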

Given the corpus $\{w^t\}_{t=1}^{n}$, the loss function is defined by the negative log-likelihood:

$$\mathcal{L}(\theta) = -\frac{1}{n} \sum_{t=1}^{n} \log P(w^t | w^{<t}; \theta), \quad (3)$$

where $\theta = \{\{x_i\}_{i=1}^{N}, \{\mathbf{w}_i\}_{i=1}^{N}, \theta_{NN}\}$ is the set of parameters to be trained.

Methodology
In this section, we present our SDLM which utilizes sememe information to predict the probability of the next word. SDLM is composed of three modules in series: Sememe Predictor, Sense Predictor and Word Predictor. The Sememe Predictor first takes the context vector as input and assigns a weight to each sememe. Then each sememe is regarded as an expert and makes predictions about the probability distribution over a set of senses in the Sense Predictor. Finally, the probability of each word is obtained in the Word Predictor.
Here we use the example shown in Figure 2 to illustrate our architecture. Given the context 我在果园摘 "In the orchard, I pick", the actual next word could be 苹果 "apples". From the context, especially the words 果园 "orchard" and 摘 "pick", we can infer that the next word probably represents a kind of fruit. So the Sememe Predictor assigns a higher weight to the sememe 水果 "fruit" (0.9) and lower weights to irrelevant sememes like 电脑 "computer" (0.1). Therefore, in the Sense Predictor, the sense 苹果 (水果) "apple (fruit)" is assigned a much higher probability than the sense 苹果 (电脑) "apple (computer)". Finally, the probability of the word 苹果 "apple" is calculated as the sum of the probabilities of its senses 苹果 (水果) "apple (fruit)" and 苹果 (电脑) "apple (computer)".
In the following subsections, we first introduce the word-sense-sememe hierarchy in HowNet, and then give details about our SDLM.

Word-Sense-Sememe Hierarchy
We also use the example of "apple" to illustrate the word-sense-sememe hierarchy. As shown in Figure 3, the word 苹果 "apple" has two senses: one is the Apple brand, the other is a kind of fruit. Each sense is annotated with several sememes organized in a hierarchical structure. More specifically, in HowNet, the sememes "PatternVal", "bring", "SpeBrand", "computer" and "able" are annotated with the word "apple" and organized in a tree structure. In this paper, we ignore the structural relationship between sememes. For each word, we group all its sememes as an unordered set.
We present the notations used in the following subsections. We define the overall sememe, sense, and word sets as $\mathcal{E}$, $\mathcal{S}$ and $\mathcal{W}$. We suppose the corpus contains $K = |\mathcal{E}|$ sememes, $M = |\mathcal{S}|$ senses and $N = |\mathcal{W}|$ words. For word $w \in \mathcal{W}$, we denote its corresponding sense set as $\mathcal{S}^{(w)}$. For sense $s \in \mathcal{S}^{(w)}$, we denote its corresponding sememes as an unordered set $\mathcal{E}^{(s)} \subset \mathcal{E}$.
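The hierarchy can be pictured as two lookup tables plus their inverse. The sketch below is a minimal illustration under stated assumptions: the sense identifiers, the second word "pear" and the assignment of the single sememe "fruit" to the fruit senses are hypothetical, and real HowNet entries are richer.

```python
from collections import defaultdict

# Hypothetical toy slice of the hierarchy: word -> senses -> sememe set.
# Sense identifiers and the exact sememe assignments are illustrative only.
word_to_senses = {
    "apple": ["apple(fruit)", "apple(computer)"],
    "pear": ["pear(fruit)"],
}
sense_to_sememes = {
    "apple(fruit)": {"fruit"},
    "apple(computer)": {"computer", "PatternVal", "bring", "SpeBrand", "able"},
    "pear(fruit)": {"fruit"},
}

def invert(sense_to_sememes):
    """Build, for each sememe, the set of senses connected to it."""
    sememe_to_senses = defaultdict(set)
    for sense, sememes in sense_to_sememes.items():
        for sememe in sememes:
            sememe_to_senses[sememe].add(sense)
    return dict(sememe_to_senses)

sememe_to_senses = invert(sense_to_sememes)
K = len(sememe_to_senses)   # number of sememes
M = len(sense_to_sememes)   # number of senses
N = len(word_to_senses)     # number of words
```

Because sememes are stored as unordered sets, the tree structure that HowNet places on them is discarded, matching the simplification described above.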

Sememe Predictor
The Sememe Predictor takes the context vector $g \in \mathbb{R}^{H_1}$ as input and assigns a weight to each sememe. We assume that given the context $w^1, w^2, \cdots, w^{t-1}$, the events that word $w^t$ contains sememe $e_k$ ($k \in \{1, 2, \cdots, K\}$) are independent, since the sememe is the minimum semantic unit and there is no semantic overlap between any two different sememes. For simplicity, we ignore the superscript $t$. We design the Sememe Predictor as a linear decoder with the sigmoid activation function. Therefore, $q_k$, the probability that the next word contains sememe $e_k$, is formulated as

$$q_k = P(e_k | g) = \sigma(g^\top v_k + b_k), \quad (4)$$

where $v_k \in \mathbb{R}^{H_1}$, $b_k \in \mathbb{R}$ are trainable parameters, and $\sigma(\cdot)$ denotes the sigmoid activation function.
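A minimal sketch of such a linear-plus-sigmoid decoder is given below. The parameter values and the function names are hypothetical; the point is only that each sememe receives an independent probability.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sememe_predictor(g, V, b):
    """Return q_k = sigmoid(g . v_k + b_k) for every sememe k.
    Each sememe is scored independently, so the q_k need not sum to 1."""
    return [sigmoid(sum(gi * vi for gi, vi in zip(g, v_k)) + b_k)
            for v_k, b_k in zip(V, b)]

# Hypothetical parameters for K = 2 sememes and H1 = 2.
g = [1.0, -1.0]
V = [[2.0, 0.0], [0.0, 2.0]]   # one row v_k per sememe
b = [0.5, 0.0]
q = sememe_predictor(g, V, b)
```

Using a sigmoid per sememe rather than one softmax over all sememes reflects the independence assumption: several sememes can be highly probable at the same time.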

Sense Predictor and Word Predictor
The architecture of the Sense Predictor is motivated by Product of Experts (PoE) (Hinton, 1999). We regard each sememe as an expert that only makes predictions on the senses connected with it. Let $D^{(e_k)}$ denote the set of senses that contain sememe $e_k$, the $k$-th expert. Different from conventional neural language models, which directly use the inner product of the context vector $g \in \mathbb{R}^{H_1}$ and the output embedding $\mathbf{w}_w \in \mathbb{R}^{H_2}$ for word $w$ to generate the score for each word, we use $\phi^{(k)}(g, \mathbf{w})$ to calculate the score given by expert $e_k$. We choose a bilinear function parameterized with a matrix $U_k \in \mathbb{R}^{H_1 \times H_2}$ as a straightforward implementation of $\phi^{(k)}(\cdot, \cdot)$:

$$\phi^{(k)}(g, \mathbf{w}) = g^\top U_k \mathbf{w}. \quad (5)$$

Let $\mathbf{w}_s$ denote the output embedding of sense $s$. The score of sense $s$ provided by sememe expert $e_k$ can be written as $\phi^{(k)}(g, \mathbf{w}_s)$. Therefore, $P^{(e_k)}(s|g)$, the probability of sense $s$ given by expert $e_k$, is formulated as

$$P^{(e_k)}(s|g) = \frac{\exp(q_k C_{k,s} \phi^{(k)}(g, \mathbf{w}_s))}{\sum_{s' \in D^{(e_k)}} \exp(q_k C_{k,s'} \phi^{(k)}(g, \mathbf{w}_{s'}))}, \quad (6)$$

where $C_{k,s}$ is a normalization constant, needed because sense $s$ is not connected to all experts (the connections are sparse, with approximately $\lambda N$ edges, $\lambda < 5$). Different normalization strategies can be chosen for $C_{k,s}$.
In the Sense Predictor, $q_k$ can be viewed as a gate that controls the magnitude of the term $C_{k,s} \phi^{(k)}(g, \mathbf{w}_s)$, and thus controls the flatness of the sense distribution provided by sememe expert $e_k$. Consider the extreme case $q_k \to 0$: the prediction converges to the discrete uniform distribution. Intuitively, this means that the sememe expert refuses to provide any useful information when it is not likely to be related to the next word.
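The gating behavior can be checked on a toy expert. The sketch below is a minimal illustration, assuming a bilinear score, a single constant `C` standing in for the normalization constant, and made-up embeddings; it shows that a gate value of zero yields a uniform distribution over the expert's senses.

```python
import math

def bilinear(g, U, w):
    """phi(g, w) = g^T U w for a matrix U given as a list of rows."""
    return sum(g[i] * U[i][j] * w[j]
               for i in range(len(g)) for j in range(len(w)))

def expert_distribution(q_k, g, U_k, sense_embeddings, C=1.0):
    """Distribution of one sememe expert over its connected senses.
    The gate q_k scales the scores before the softmax; C stands in for
    the normalization constant (assumed identical for all senses here)."""
    scores = {s: q_k * C * bilinear(g, U_k, w)
              for s, w in sense_embeddings.items()}
    m = max(scores.values())
    exps = {s: math.exp(v - m) for s, v in scores.items()}
    z = sum(exps.values())
    return {s: e / z for s, e in exps.items()}

# Hypothetical expert connected to two senses, with H1 = H2 = 2.
g = [1.0, 0.0]
U_k = [[1.0, 0.0], [0.0, 1.0]]
senses = {"apple(fruit)": [3.0, 0.0], "pear(fruit)": [1.0, 0.0]}
confident = expert_distribution(0.9, g, U_k, senses)
silent = expert_distribution(0.0, g, U_k, senses)
```

With the gate at 0.9 the expert clearly prefers one sense, while with the gate at 0 it assigns the same probability to both, so it contributes nothing to a product of experts.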
Finally, we summarize the predictions on sense $s$ by taking the product of the probabilities given by the relevant experts, that is, the experts $e_k \in \mathcal{E}^{(s)}$ for the sememes of sense $s$, and then normalizing the result; that is to say, $P(s|g)$, the probability of sense $s$, satisfies

$$P(s|g) \propto \prod_{e_k \in \mathcal{E}^{(s)}} P^{(e_k)}(s|g). \quad (7)$$

Using Equations 5 and 6, we can formulate $P(s|g)$ as

$$P(s|g) = \frac{\exp\left(\sum_{e_k \in \mathcal{E}^{(s)}} q_k C_{k,s} g^\top U_k \mathbf{w}_s\right)}{\sum_{s'} \exp\left(\sum_{e_k \in \mathcal{E}^{(s')}} q_k C_{k,s'} g^\top U_k \mathbf{w}_{s'}\right)}. \quad (8)$$

It should be emphasized that all the supervision information provided by HowNet is embodied in the connections between the sememe experts and the senses. If the model wants to assign a high probability to sense $s$, it must assign a high probability to some of its relevant sememes. If the model wants to assign a low probability to sense $s$, it can assign a low probability to its relevant sememes. Moreover, the prediction made by sememe expert $e_k$ has its own tendency because of its own $\phi^{(k)}(\cdot, \cdot)$. Besides, the sparsity of connections between experts and senses is also determined by HowNet itself. For our dataset, on average, a word is connected with 3.4 sememe experts and each sememe expert makes predictions about 22 senses.
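The combination step can be sketched as follows: for each sense, the gated scores of its connected sememe experts are summed, and the result is normalized over all senses. This is a minimal illustration, not our implementation; the function name, the precomputed `expert_scores` table and the score value 2.0 are assumptions, while the gate values 0.9 and 0.1 follow the "apple" example.

```python
import math

def sense_distribution(q, expert_scores, sense_to_sememes):
    """q: dict sememe -> gate value q_k.
    expert_scores: dict (sememe, sense) -> normalized bilinear score,
    defined only for connected pairs (this is the sparsity).
    Returns P(s | g) by summing gated scores over each sense's sememes
    and normalizing over all senses."""
    logits = {s: sum(q[e] * expert_scores[(e, s)] for e in sememes)
              for s, sememes in sense_to_sememes.items()}
    m = max(logits.values())
    exps = {s: math.exp(v - m) for s, v in logits.items()}
    z = sum(exps.values())
    return {s: e / z for s, e in exps.items()}

# Hypothetical toy case: each sense is connected to a single sememe expert.
sense_to_sememes = {"apple(fruit)": ["fruit"], "apple(computer)": ["computer"]}
q = {"fruit": 0.9, "computer": 0.1}
expert_scores = {("fruit", "apple(fruit)"): 2.0,
                 ("computer", "apple(computer)"): 2.0}
sense_probs = sense_distribution(q, expert_scores, sense_to_sememes)
```

Both experts give their sense the same raw score, so the difference in the final probabilities comes entirely from the gates produced by the Sememe Predictor.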
As illustrated in Figure 2, in the Word Predictor, we get $P(w|g)$, the probability of word $w$, by summing up the probabilities of its corresponding senses $s$ given by the Sense Predictor, that is

$$P(w|g) = \sum_{s \in \mathcal{S}^{(w)}} P(s|g). \quad (9)$$
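The marginalization over senses is a plain sum, as the following minimal sketch shows. The sense probabilities and the word "pear" are hypothetical placeholders for the output of a Sense Predictor.

```python
def word_distribution(sense_probs, word_to_senses):
    """P(w | g) is the sum of P(s | g) over the senses s of word w."""
    return {w: sum(sense_probs[s] for s in senses)
            for w, senses in word_to_senses.items()}

# Hypothetical sense probabilities, e.g. produced by a Sense Predictor.
sense_probs = {"apple(fruit)": 0.6, "apple(computer)": 0.1, "pear(fruit)": 0.3}
word_to_senses = {"apple": ["apple(fruit)", "apple(computer)"],
                  "pear": ["pear(fruit)"]}
word_probs = word_distribution(sense_probs, word_to_senses)
```

Since every sense belongs to exactly one word, the word probabilities sum to one whenever the sense probabilities do.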

Implementation Details
Basis Matrix. HowNet contains $K \approx 2000$ sememes. In practice, we cannot directly introduce $K \times H_1 \times H_2$ parameters, which might be computationally infeasible and lead to overfitting. To address this problem, we apply a weight-sharing trick called the basis matrix. We use $R$ basis matrices and their weighted sum to estimate $U_k$:

$$U_k = \sum_{r=1}^{R} \alpha_{k,r} Q_r, \quad (10)$$

where $Q_r \in \mathbb{R}^{H_1 \times H_2}$, $\alpha_{k,r} > 0$ are trainable parameters, and $\sum_{r=1}^{R} \alpha_{k,r} = 1$.
Weight Tying. To incorporate the weight tying strategy (Inan et al., 2017; Press and Wolf, 2017), we use the same output embedding for multiple senses of a word. To be specific, the sense output embedding $\mathbf{w}_s$ for each $s \in \mathcal{S}^{(w)}$ is the same as the word input embedding $x_w$.
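The saving from weight sharing can be made concrete with a small sketch. Passing unconstrained logits through a softmax is one assumed way to keep the mixing weights positive and summing to one (the text does not state how the constraint is enforced), and the toy matrices and function names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def expert_matrix(alpha_logits_k, basis):
    """U_k as a convex combination of the R basis matrices.
    The softmax parameterization of the weights is an assumption."""
    alpha_k = softmax(alpha_logits_k)
    rows, cols = len(basis[0]), len(basis[0][0])
    return [[sum(a * Q[i][j] for a, Q in zip(alpha_k, basis))
             for j in range(cols)] for i in range(rows)]

def parameter_counts(K, R, H1, H2):
    """Parameters for one matrix per sememe versus the shared-basis variant."""
    return K * H1 * H2, R * H1 * H2 + K * R

# Hypothetical R = 2 basis matrices with H1 = H2 = 2.
basis = [[[1.0, 0.0], [0.0, 1.0]],
         [[0.0, 2.0], [2.0, 0.0]]]
U_k = expert_matrix([0.0, 0.0], basis)   # equal weights of 0.5 each
full, shared = parameter_counts(K=2000, R=5, H1=650, H2=650)
```

With $K = 2000$, $R = 5$ and a hidden size of 650, the shared-basis variant needs roughly 2.1 million parameters instead of 845 million.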

Experiments
We evaluate our SDLM on a Chinese language modeling dataset, namely People's Daily, using perplexity as the metric (see footnote iii). Furthermore, to show that our SDLM structure can be a generic Chinese word-level decoder for sequence-to-sequence learning, we conduct a Chinese headline generation experiment on the LCSTS dataset. Finally, we explore the interpretability of our model with case studies, showing the effectiveness of utilizing sememe knowledge.
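For reference, perplexity is the exponential of the average negative log-likelihood that the model assigns to the observed tokens. The sketch below assumes the per-token probabilities are already available; the function names and the toy probabilities are illustrative.

```python
import math

def negative_log_likelihood(token_probs):
    """Average negative log-likelihood of a corpus, given the model
    probability assigned to each observed token."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def perplexity(token_probs):
    """Perplexity is the exponential of the average negative log-likelihood."""
    return math.exp(negative_log_likelihood(token_probs))

# Hypothetical probabilities a model assigned to four observed tokens.
toy_probs = [0.25, 0.25, 0.25, 0.25]
ppl = perplexity(toy_probs)
```

A model that always spreads its probability evenly over four candidates has a perplexity of 4, so lower values indicate sharper and more accurate predictions.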

Language Modeling Dataset
We choose the People's Daily Corpus, which is widely used for Chinese NLP tasks, as the resource. It contains one month's news text from People's Daily (Renmin Ribao). Taking Penn Treebank (PTB) (Marcus et al., 1993) as a reference, we build a dataset for Chinese language modeling based on the People's Daily Corpus with 734k, 10k and 19k words in the training, validation and test set. After the preprocessing similar to (Mikolov et al., 2010) (see Appendix A), we get our dataset and the final vocabulary size is 13,476.

Baseline
As for baselines, we consider three kinds of neural language modeling architectures with LSTM cells: simple LSTM, Tied LSTM and AWD-LSTM.

LSTM and Tied LSTM. Zaremba et al. (2014) use the dropout strategy to prevent overfitting in neural language models and apply it to two-layer LSTMs with different embedding and hidden sizes: 650 for medium LSTM, and 1500 for large LSTM. Employing the weight tying strategy, we get Tied LSTM with better performance. We set LSTM and Tied LSTM of medium and large size as our baseline models and use the code from the PyTorch examples (https://github.com/pytorch/examples/tree/master/word_language_model) as their implementations.

AWD-LSTM. Based on several strategies for regularizing and optimizing LSTM-based language models, Merity et al. (2018) propose AWD-LSTM as a three-layer neural network, which serves as a very strong baseline for word-level language modeling. We build it with the code released by the authors.

Footnote iii: Although we only conduct experiments on Chinese corpora, we argue that this model has the potential to be applied to other languages in light of work on constructing sememe knowledge bases for other languages, such as (Qi et al., 2018).

Experimental Settings
We apply our SDLM and other variants of softmax structures to the architectures mentioned above: LSTM (medium / large), Tied LSTM (medium / large) and AWD-LSTM. MoS and SDLM are only applied to the models that incorporate weight tying, while tHSM is only applied to the models without weight tying, since it is not compatible with this strategy. For a fair comparison, we train these models with the same experimental settings and conduct a hyper-parameter search for the baselines as well as our models (the search settings and the optimal hyper-parameters can be found in Appendix C.1). We keep using these hyper-parameters in our SDLM for all architectures. It should be emphasized that we use the SGD optimizer for all architectures, and we decrease the learning rate by a factor of 2 if no improvement is observed on the validation set. We uniformly initialize the word embeddings, the class embeddings for cHSM and the non-leaf embeddings for tHSM in [-0.1, 0.1]. In addition, we set R, the number of basis matrices, to 5 in the Tied LSTM architecture and to 10 in the AWD-LSTM architecture. We choose the left normalization strategy because it performs better.

Experimental Results
Table 1 shows the perplexity on the validation and test sets of our models and the baseline models. From Tables 1, 2 and 3, we can observe that:
1. Our models outperform the corresponding baseline models of all structures, which indicates the effectiveness of our SDLM. Moreover, our SDLM not only consistently outperforms the state-of-the-art MoS model, but also offers much better interpretability (as described in Sect. 4.3), which makes it possible to interpret the prediction process of the language model. Note that under a fair comparison, we do not see an improvement of MoS over AWD-LSTM, while our SDLM outperforms it by 1.20 with respect to perplexity on the test set.
2. To further locate the performance improvement of our SDLM, we study the perplexity of single-sense words and multi-sense words separately on Tied LSTM (medium) and Tied LSTM (medium) + SDLM. Improvements with respect to perplexity are presented in Table 2. The performance on both single-sense words and multi-sense words improves, while multi-sense words benefit more from the SDLM structure because they have richer sememe information.
3. In Table 3 we study the perplexity of words with different mean numbers of sememes. We can see that our model outperforms the baselines in all cases and is expected to benefit more as the mean number of sememes increases.
Note: We find that multi-layer AWD-LSTM has problems converging when adopting cHSM, so we skip that result.
We also test the robustness of our model by randomly removing 10% of the sememe-sense connections in HowNet. The test perplexity for Tied LSTM (medium) + SDLM goes up slightly to 97.67, compared to 97.32 with a complete HowNet, which shows that our model is robust to slight incompleteness of annotations. However, the performance of our model is still largely dependent upon the accuracy of sememe annotations. As HowNet is continuously updated, we expect our model to perform better with sememe knowledge of higher quality.
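The robustness test amounts to deleting a random subset of edges from the sememe-sense graph before training. The following is a minimal sketch of that perturbation, assuming the annotation is stored as a dict from senses to sememe sets; the function name, the seed and the toy annotation are hypothetical, and how senses left without any sememe are treated is not specified here.

```python
import random

def drop_connections(sense_to_sememes, fraction, seed=0):
    """Return a copy of the annotation with a random `fraction` of the
    sememe-sense edges removed."""
    edges = [(s, e) for s, sememes in sense_to_sememes.items()
             for e in sorted(sememes)]
    rng = random.Random(seed)
    removed = set(rng.sample(edges, int(len(edges) * fraction)))
    return {s: {e for e in sememes if (s, e) not in removed}
            for s, sememes in sense_to_sememes.items()}

# Hypothetical annotation: 10 senses with 2 sememes each, i.e. 20 edges.
toy_annotation = {f"sense{i}": {"a", "b"} for i in range(10)}
pruned = drop_connections(toy_annotation, 0.1)
remaining = sum(len(v) for v in pruned.values())
```

Fixing the seed makes the perturbed annotation reproducible, so that perplexity differences can be attributed to the missing edges rather than to a different random draw.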

Headline Generation Dataset
We use the LCSTS dataset to evaluate our SDLM structure as the decoder of the sequence-to-sequence model. As its authors suggest, we divide the dataset into the training set, the validation set and the test set, whose sizes are 2.4M, 8.7k and 725 respectively. Details can be found in Appendix B.

Models
For this task, we consider two models for comparison.

RNN-context
As described in (Bahdanau et al., 2015), RNN-context is a basic sequence-to-sequence model with a bi-LSTM encoder, an LSTM decoder and an attention mechanism. The context vector is concatenated with the word embedding at each timestep when decoding. It is widely used for sequence-to-sequence learning, so we set it as the baseline model.

RNN-context-SDLM
Based on RNN-context, we substitute the decoder with our proposed SDLM and name it RNN-context-SDLM.

Experimental Settings
We implement our models with PyTorch, on top of the OpenNMT libraries. For both models, we set the word embedding size to 250, the hidden unit size to 250, the vocabulary size to 40000, and the beam size of the decoder to 5. For RNN-context-SDLM, we set the number of basis matrices to 3. We conduct a hyper-parameter search for both models (see Appendix C.2 for settings and optimal hyper-parameters).

Experimental Results
Following previous works, we report the F1-score of ROUGE-1, ROUGE-2, and ROUGE-L on the test set. Table 4 shows that our model outperforms the baseline model on all metrics. We attribute the improvement to the use of SDLM structure.
Words in headlines do not always appear in the corresponding articles. However, intuitively, words sharing the same sememes are likely to appear in the articles. Therefore, a probable reason for the improvement is that our model can predict sememes highly relevant to the article, and thus generates more accurate headlines. This is corroborated by our case study.

Case Study
The above experiments demonstrate the effectiveness of our SDLM. Here we present some samples from the test set of the People's Daily Corpus in Table 5 as well as the LCSTS dataset in Table 6 and conduct further analysis. For each example of language modeling, given the context of previous words, we list the Top 5 words and Top 5 sememes predicted by our SDLM. The target words and the sememes annotated with them in HowNet are highlighted (marked with *). Note that if the target word is an out-of-vocabulary (OOV) word, helpful sememes that are related to the target meaning are highlighted.

Table 5. Highlighted entries are marked with *.
Example (1): 去年 美国 贸易 逆差 初步 估计 为 <N> 。 "The U.S. trade deficit last year is initially estimated to be <N> ."
Top 5 word prediction: *美元 "dollar"; ， ","; 。 "."; 日元 "yen"; 和 "and"
Top 5 sememe prediction: *商业 "commerce"; *金融 "finance"; *单位 "unit"; 多少 "amount"; 专 "proper name"
Example (2): 阿 总理 已 签署 了 一 项 命令 。 "Albanian Prime Minister has signed an order."
Top 5 word prediction: 内 "inside"; <unk>; 在 "at"; 塔 "tower"; 和 "and"
Top 5 sememe prediction: *政 "politics"; *人 "person"; 花草 "flowers"; *担任 "undertake"; 水域 "waters"
Sememes annotated with the corresponding sense of the target word 美元 "dollar" are 单位 "unit", 商业 "commerce", 金融 "finance", 货币 "money" and 美国 "US". In Example (1), the target word "dollar" is predicted correctly and most of its sememes are activated in the prediction process. This indicates that our SDLM has learned the word-sense-sememe hierarchy and used sememe knowledge to improve language modeling.
Example (2) shows that our SDLM can provide interpretable results on OOV word prediction through the sememe information associated with it. The target word here should be the name of the Albanian prime minister, which is out of vocabulary. But with our model, one can still conclude that this word is probably relevant to the sememes "politics", "person", "flowers", "undertake" and "waters", most of which characterize the meaning of this OOV word, namely the name of a politician. This feature can be helpful when the vocabulary size is limited or there are many terminologies and names in the corpus.
For the example of headline generation, given the article and previous words, when generating the word "student", all the Top 5 predicted sememes except 预料 "predict" have high relevance to either the predicted word or the context. To be specific, the sememe 学习 "study" is annotated with "student" in HowNet. 考试 "exam" indicates "college entrance exam". 特定牌子 "brand" indicates "BMW". And 高等 "higher" indicates "higher education", which is the next step after this exam. We can conclude that with sememe knowledge, our SDLM structure can explicitly extract critical information from both the given article and the generated words and produce a better summarization based on it.

Related Work
Neural Language Modeling. RNNs have achieved state-of-the-art performance in the language modeling task since Mikolov et al. (2010) first applied RNNs to language modeling. Much work has been done to improve RNN-based language modeling. For example, a variety of work (Zaremba et al., 2014; Gal and Ghahramani, 2016; Merity et al., 2017, 2018) introduces many regularization and optimization methods for RNNs. Based on the observation that a word appearing in the previous context is more likely to appear again, some work (Grave et al., 2017a,b) proposes to use a cache for improvements. In this paper, we mainly focus on the output decoder, the module between the context vector and the predicted probability distribution. Similar to our SDLM, Yang et al. (2018) propose a high-rank model which adopts a Mixture of Softmaxes structure for the output decoder. However, our model is sememe-driven, with each expert corresponding to an interpretable sememe.

Table 6. Highlighted entries are marked with *.
Article: "On the 8th in Fuxin, a male student drove a BMW to take the college entrance exam and was caught cheating. Because the teacher confiscated his mobile phone, he kicked the teacher from the last row to the podium and shouted: "Do you know who my dad is? How dare you catch me!" Currently, this student has been detained."
Gold: "In the college entrance exam, a male student caught cheating hit the teacher: Do you know who my dad is?"
RNN-context-SDLM: "In the college entrance exam, a student was caught cheating: Do you know who my dad is?"
Top 5 sememe prediction: *考试 "exam"; *学习 "study"; *特定牌子 "brand"; 预料 "predict"; *高等 "higher"
Hierarchical Decoder. Since softmax computation over a large vocabulary is time-consuming and therefore dominates the model's complexity, various hierarchical softmax models have been proposed to address this issue. These models can be categorized into class-based models and tree-based models according to their hierarchical structure. Goodman (2001) first proposes the class-based model, which divides the whole vocabulary into different classes and uses a hierarchical softmax decoder to model the probability as P(word) = P(word|class)P(class), which is similar to our model. For the tree-based models, all words are organized in a tree structure and the word probability is calculated as the probability of always choosing the correct child along the path from the root node to the word node. While Morin and Bengio (2005) utilize knowledge from WordNet to build the tree, Mnih and Hinton (2008) build it in a bootstrapping way and Mikolov et al. (2013) construct a Huffman tree based on word frequencies. Recently, Jiang et al. (2017) reform the tree-based structure to make it more efficient on GPUs. The major differences between our model and theirs are the purpose and the motivation. Our model aims to improve the performance and interpretability of language modeling using external knowledge in HowNet. Therefore, we adopt its philosophy of the word-sense-sememe hierarchy to design our hierarchical decoder. In contrast, the class-based and tree-based models are mainly designed to speed up the softmax computation in the training process.
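The class-based factorization can be illustrated with a minimal sketch in which every word belongs to exactly one class. The classes, words and probabilities below are hypothetical.

```python
def class_based_probability(word, word_to_class, p_class, p_word_given_class):
    """P(word) = P(word | class) * P(class), with each word in exactly one class."""
    c = word_to_class[word]
    return p_word_given_class[c][word] * p_class[c]

# Hypothetical two-class vocabulary.
word_to_class = {"dollars": "currency", "yen": "currency", "apples": "fruit"}
p_class = {"currency": 0.8, "fruit": 0.2}
p_word_given_class = {"currency": {"dollars": 0.75, "yen": 0.25},
                      "fruit": {"apples": 1.0}}
p_word = {w: class_based_probability(w, word_to_class, p_class, p_word_given_class)
          for w in word_to_class}
```

The structural difference from the sememe-driven decoder is that a word here has a single class, whereas a sense is connected to several sememe experts whose predictions are multiplied.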

Sememe. Recently, much work has concentrated on utilizing sememe knowledge in traditional natural language processing tasks. For example, Niu et al. (2017) use sememe knowledge to improve the quality of word embeddings and cope with the problem of word sense disambiguation. Matrix factorization has also been applied to predict sememes for words, and Jin et al. (2018) improve on that work by incorporating character-level information. Our work extends these previous works and combines the word-sense-sememe hierarchy with a sequential model. To be specific, this is the first work to improve the performance and interpretability of Neural Language Modeling with sememe knowledge.
Product of Experts. As Hinton (1999, 2002) proposes, the final probability can be calculated as the product of the probabilities given by experts. Gales and Airey (2006) apply PoE to speech recognition, where each expert is a Gaussian mixture model. Unlike their work, in our SDLM, each expert is mapped to a sememe, which gives better interpretability. Moreover, as the final distribution is a categorical distribution, each expert is only responsible for making predictions on a subset of the categories (usually fewer than 10), so we call it Sparse Product of Experts.
Headline Generation. Headline generation is a kind of text summarization task. In recent years, with the advances of RNNs, much work has been done in this domain. Encoder-decoder models (Cho et al., 2014) have achieved great success in sequence-to-sequence learning. Rush et al. (2015) propose a local attention-based model for abstractive sentence summarization. Gu et al. (2016) introduce the copying mechanism, which is close to the rote memorization of human beings. Ayana et al. (2016) employ the minimum risk training strategy to optimize model parameters. Different from these works, we focus on the decoder of the sequence-to-sequence model, and adopt SDLM to utilize sememe knowledge for sentence generation.

Conclusion and Further Work
In this paper, we propose an interpretable Sememe-Driven Language Model with a hierarchical sememe-sense-word decoder. Besides interpretability, our model also achieves state-of-the-art performance in the Chinese Language Modeling task and shows improvement in the Headline Generation task. These results indicate that SDLM can successfully take advantage of sememe knowledge.
As for future work, we plan the following research directions: (1) In language modeling, given a sequence of words, a sequence of corresponding sememes can also be obtained. We will utilize the context sememe information for better sememe and word prediction. (2) Structural information about sememes in HowNet is ignored in our work. We will extend our model with the hierarchical sememe tree for more accurate relations between words and their sememes. (3) It is conceivable that the performance of SDLM will be significantly influenced by the annotation quality of sememe knowledge. We will also work on further enriching the sememe knowledge for new words and phrases, and investigate its effect on SDLM.