Japanese Sentence Compression with a Large Training Dataset

In English, high-quality sentence compression models that operate by deleting words have been trained on large, automatically created training datasets. We take a similar approach to Japanese sentence compression. To create a large Japanese training dataset, we modify a method for creating an English training dataset based on the characteristics of the Japanese language. The created dataset is used to train Japanese sentence compression models based on recurrent neural networks.


Introduction
Sentence compression is the task of shortening a sentence while preserving its important information and grammaticality. Robust sentence compression systems are useful on their own and also as a module in extractive summarization systems (Berg-Kirkpatrick et al., 2011; Thadani and McKeown, 2013). In this paper, we work on Japanese sentence compression by word deletion. One advantage of compression by deletion, as opposed to abstractive compression, is the small search space. Another is that the compressed sentence is less likely to contain incorrect information not mentioned in the source sentence.
In recent years, high-quality English sentence compression models based on word deletion have been trained on a large training dataset (Filippova and Altun, 2013; Filippova et al., 2015). While it is impractical to create a large training dataset by hand, one can be created automatically from news articles (Filippova and Altun, 2013). The procedure is as follows (where S, H, and C respectively denote the first sentence of an article, the headline, and the compressed sentence created from S). First, to restrict the training data to grammatical and informative sentences, only news articles satisfying certain conditions are used. Then, nouns, verbs, adjectives, and adverbs (i.e., content words) shared by S and H are identified by matching word lemmas, and a rooted dependency subtree that contains all the shared content words is taken as C.
However, their method is designed for English and cannot be applied to Japanese as it is. In this study, we therefore modify their method based on the following three characteristics of the Japanese language: (a) abbreviation of nouns and nominalization of verbs occur frequently in Japanese; (b) words other than verbs can also be the root node, especially in headlines; (c) subjects and objects that can be easily estimated from the context are often omitted.
The created training dataset is used to train three models. The first is Filippova et al.'s original model, an encoder-decoder with a long short-term memory (LSTM). We extend it in this paper into two further models that can control the output length (Kikuchi et al., 2016), since controlling the output length makes a compressed sentence more informative under a desired length.
Creating a training dataset for Japanese

Filippova et al.'s method of creating training data consists of conditions imposed on news articles and the following three steps: (1) identification of shared content words, (2) transformation of a dependency tree, and (3) extraction of the minimum rooted subtree. We modified their method based on the characteristics of the Japanese language as follows. To explain our method, a dependency tree of S and the sequence of bunsetsu chunks of H in Japanese are shown in Figures 1 and 2. Note that the nodes of Japanese dependency trees are bunsetsu chunks, each consisting of content words followed by function words.

Identification of shared content words
In Filippova et al.'s method, content words shared by S and H are identified by matching lemmas and by pronominal anaphora resolution. Abbreviation of nouns and nominalization of verbs occur frequently in Japanese (characteristic (a) in the Introduction), and it is difficult to identify these transformations simply by matching lemmas. Thus, after identification by matching lemmas, two identification methods using character-level information (described below) are applied. Note that pronominal anaphora resolution is not used, because pronouns are often omitted in Japanese.

Abbreviation of nouns: There are two types of noun abbreviation in Japanese. One is the abbreviation of a proper noun, which shortens the form by deleting characters (e.g., the pair marked "+" in Figures 1 and 2). The other is the abbreviation of consecutive nouns, which deletes nouns that behave as adjectives (e.g., the pair marked "-"). To deal with such cases, if the character sequence of the noun sequence in a chunk of H is identical to a subsequence of the characters composing the noun sequence in a chunk of S, the two noun sequences are regarded as shared.
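The abbreviation check above can be sketched as a character-subsequence test. This is a minimal illustration with hypothetical function names; chunk segmentation and part-of-speech tags are assumed to come from an external morphological analyzer, and the example pair is only illustrative:

```python
def is_char_subsequence(abbrev: str, full: str) -> bool:
    """Return True if every character of `abbrev` appears in `full`
    in the same order (not necessarily contiguously)."""
    it = iter(full)
    return all(ch in it for ch in abbrev)

def nouns_shared(h_nouns: str, s_nouns: str) -> bool:
    """Treat the noun sequence of an H chunk and an S chunk as shared
    when the H characters form a subsequence of the S characters."""
    return is_char_subsequence(h_nouns, s_nouns)
```

Note that `ch in it` consumes the iterator as it searches, which is what enforces the left-to-right order of the matched characters.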

Nominalization of verbs:
Many verbs in Japanese have corresponding nouns with similar meanings (e.g., the pair marked "#" in Figures 1 and 2). Such pairs often share the same Chinese character (kanji). Kanji are ideograms and are more informative than the other types of Japanese letters. Thus, if a noun in H and a verb in S share at least one kanji, the noun and the verb are regarded as shared.
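The kanji-sharing test reduces to intersecting the sets of CJK ideographs in the two words. A minimal sketch with hypothetical function names; restricting to the basic CJK Unified Ideographs block (U+4E00–U+9FFF) is an assumption of this illustration:

```python
def kanji_of(word: str) -> set:
    """Collect the kanji in a word, taken here to be characters in the
    basic CJK Unified Ideographs block (extensions are ignored)."""
    return {ch for ch in word if '\u4e00' <= ch <= '\u9fff'}

def verb_noun_shared(noun_in_h: str, verb_in_s: str) -> bool:
    """Regard a noun in H and a verb in S as shared if they have
    at least one kanji in common."""
    return bool(kanji_of(noun_in_h) & kanji_of(verb_in_s))
```

For example, the verbal form 戦った and the nominalized form 戦い share the kanji 戦, while 勝利 shares none with 戦った.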

Transformation of a dependency tree
Some edges in a dependency tree cannot be cut without changing the meaning or losing the grammaticality of the sentence. In Filippova et al.'s method, the nodes linked by such an edge are merged into a single node before the subtree is extracted. We adapt this to Japanese as follows: if the function word of a chunk is one of a specific set of particles, which often mark obligatory cases, the chunk and its modifiee are merged into a single node. In Figure 1, the chunks at the start and end of a bold arrow are merged into a single node.
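The merging step can be sketched as collapsing each chunk whose particle marks an obligatory case into its modifiee. The particle set below is a hypothetical stand-in for the paper's actual list, and the parent-pointer encoding of the tree is an assumption of this sketch:

```python
# Hypothetical particle set; the paper specifies its own list.
OBLIGATORY_PARTICLES = {"が", "を"}

def merge_obligatory_chunks(heads, function_words):
    """heads[i] is the index of chunk i's modifiee (-1 for the root);
    function_words[i] is chunk i's trailing particle (or "").
    Returns group[i], the merged-node representative of chunk i, so
    that a chunk with an obligatory-case particle shares a node with
    its modifiee (chains of merges collapse to one representative)."""
    n = len(heads)
    parent = list(range(n))
    for i in range(n):
        h = heads[i]
        if h >= 0 and function_words[i] in OBLIGATORY_PARTICLES:
            parent[i] = h  # merge chunk i into its modifiee

    def find(i):
        # follow merge links up to the topmost merged ancestor
        while parent[i] != i:
            i = parent[i]
        return i

    return [find(i) for i in range(n)]
```

After merging, subtree extraction operates on merged-node ids, so an obligatory argument can never be deleted without its head.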

Extraction of the minimum rooted subtree
In Filippova et al.'s method, the minimum rooted subtree that contains all the shared content words is extracted from the transformed tree. We modify their method to take into account the characteristics (a) and (b).
Deleting the global root: In English, only verbs can be the root node of a subtree. In Japanese, however, words with other parts of speech can also be the root node, especially in headlines (characteristic (b)). Therefore, the global root, i.e., the root node of S, and the chunks containing a word that can be located at the end of a sentence are the candidates for the root node of a subtree. If the chosen root node is not the global root, the words following the word that can be located at the end are removed from the root node. In Figure 1, of the two words marked "*" that can be located at the end, the latter is extracted as the root, and the word following it is removed from the chunk.

Reflecting abbreviated forms: Abbreviation of nouns occurs frequently in Japanese (characteristic (a)). Thus, in C, original forms are replaced by their abbreviated forms obtained as explained in Section 2.1 (e.g., the pair marked "-" in Figures 1 and 2). However, we do not allow the head of a chunk to be deleted, in order to preserve grammaticality. We also restrict ourselves here to word-level deletion and do not allow character-level deletion, because our purpose is to construct a training dataset for compression by word deletion. In the example of Figure 1, the chunks in bold squares are extracted as C.
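Given the (merged) tree and a chosen root, the minimum rooted subtree containing all shared chunks is simply the union of the paths from each shared chunk up to that root. A minimal sketch with a hypothetical function name, assuming a parent-pointer encoding and that every shared chunk is a descendant of the chosen root:

```python
def minimum_rooted_subtree(heads, shared, root):
    """heads[i]: modifiee index of chunk i (-1 for the global root).
    Returns the set of chunks on the paths from each shared chunk up
    to `root`, i.e., the minimum subtree rooted at `root` containing
    all shared chunks."""
    keep = {root}
    for node in shared:
        # walk up the parent pointers until the chosen root is reached
        while node != root and node != -1:
            keep.add(node)
            node = heads[node]
    return keep
```

Choosing the root from among the global root and the end-capable chunks then amounts to running this extraction for each candidate and keeping the preferred result.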

Conditions imposed on news articles
Filippova et al.'s method imposes eight conditions on news articles to restrict the training data to grammatical and informative sentences. In our method, these conditions are modified to suit Japanese. First, the condition "S should include the content words of H in the same order" is removed, because word order in Japanese is relatively free. Second, the condition "S should include all content words of H" is relaxed to "the ratio of shared content words to the content words in H must be larger than a threshold θ", because in Japanese, subjects and objects that can be easily estimated from the context are often omitted (characteristic (c)). In addition, the two conditions "H should have a verb" and "H must not begin with a verb" are removed, leaving four conditions.
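The relaxed condition can be sketched as a simple ratio test (hypothetical function name; segmentation into content words and the identification of shared words are assumed to be done upstream):

```python
def passes_ratio_condition(h_content_words, shared_words, theta=0.5):
    """Keep an article only if the ratio of H's content words that are
    shared with S exceeds theta (theta is tuned on development data)."""
    if not h_content_words:
        return False
    ratio = len(shared_words & set(h_content_words)) / len(h_content_words)
    return ratio > theta
```

Setting theta = 0 would accept any headline with one shared word, while theta close to 1 recovers the original strict condition; the threshold trades dataset size against alignment quality.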

Sentence compression with LSTM
Three models are used for sentence compression. Each model predicts a label sequence in which the label for each word is either "retain" or "delete" (Filippova et al., 2015). The first is an encoder-decoder model with LSTM (lstm) (Sutskever et al., 2014). Given an input sequence x = (x_1, x_2, ..., x_n), each label y_t in the label sequence y = (y_1, y_2, ..., y_n) is predicted from the decoder's hidden state, the decoder input at time t being e_1(x_t) ⊕ e_l(y_{t-1}), where ⊕ denotes concatenation, s_t and m_t are respectively the hidden state and the memory cell at time t, and e_1(word) is the embedding of word. If label is "retain", e_l(label) is (1; 0); otherwise it is (0; 1). m_0 is a zero vector.

(The four conditions remaining after the modifications in Section 2.4 are: "H is not a question"; "both H and S have no fewer than four words"; "S is more than 1.5 times longer than H"; and "S is more than 1.5 times longer than C".)

Table 1: Training data created with each θ. ROUGE-2 is the score, against the target, of compressed sentences generated by the model trained on each dataset.
As the second and third models, we extend the first model to control the output length (Kikuchi et al., 2016). The second model, lstm+leninit, initializes the memory cell of the decoder as m_0 = tarlen · b_len, where tarlen is the desired output length and b_len is a trainable parameter. The third model, lstm+lenemb, uses an embedding of the potential desired length, e_2(length), as an additional input: e_1(x_t) ⊕ e_l(y_{t-1}) ⊕ e_2(l_t) is used as the decoder input, where l_t is the potential desired length at time t.
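The two mechanisms differ only in where the desired length enters the decoder. The following numpy sketch illustrates the quantities involved; dimension sizes, the random initialization, and variable names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, emb = 8, 4

# lstm+leninit: the desired length scales a trainable vector that
# initializes the decoder's memory cell; later steps get no length input.
b_len = rng.standard_normal(hidden)   # trainable parameter b_len
tarlen = 30.0                         # desired output length (bytes)
m0 = tarlen * b_len                   # initial memory cell m_0

# lstm+lenemb: an embedding of the potential desired length l_t is
# concatenated to the decoder input at every time step.
e1_xt   = rng.standard_normal(emb)    # word embedding e_1(x_t)
el_prev = np.array([1.0, 0.0])        # e_l("retain") = (1; 0)
e2_lt   = rng.standard_normal(emb)    # length embedding e_2(l_t)
decoder_input = np.concatenate([e1_xt, el_prev, e2_lt])
```

Because leninit injects the length only once, at t = 0, its influence can fade over long outputs, whereas lenemb re-injects it at every step; this difference matters for the human evaluation results discussed later.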

Experiments
The created training datasets were used to train the three models for sentence compression. To see the effect of the modified subtree extraction method (Section 2.3), two training datasets were tested: rooted and multi-root+. rooted includes only deletion of leaves in a dependency tree, whereas multi-root+ additionally includes deletion of the global root and the reflection of abbreviated forms.

Setting: Training datasets were created from seven million news articles, spanning 35 years, from the Mainichi, Nikkei, and Yomiuri newspapers, from which duplicate sentences and sentences in the test data were filtered out. The gold-standard data consist of the first sentences of 1,000 news articles from Mainichi, each with five compressed sentences created independently by five human annotators. 100 sentences of the gold-standard data were used as development data, and the rest as test data. The three models were trained on datasets created with each value of the threshold θ, the parameter used in the condition introduced in Section 2.4. The properties of the datasets in relation to θ are shown in Table 1. θ was tuned for ROUGE-2 score on the development data: Table 1 shows the ROUGE-2 scores of lstm+leninit trained on rooted created with each θ, from which we chose θ = 0.5. All models have three stacked LSTM layers. Words with frequency lower than five are treated as unknown words. Adam was used as the optimization method. The desired length was set to the length in bytes of a compressed sentence randomly chosen from the five human-generated sentences. At test time, beam search was used (beam size: 20) and candidates exceeding the desired length were truncated.

Table 2: Automatic evaluation. R-1, R-2, and R-L are ROUGE-1, ROUGE-2, and ROUGE-L scores; R, P, and F are recall, precision, and F-measure.
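For a single reference, the ROUGE-2 score used to tune θ and to evaluate the models reduces to bigram-overlap recall, precision, and F-measure. A simplified sketch (no stemming, clipping nuances, or multi-reference handling, unlike the official ROUGE toolkit):

```python
from collections import Counter

def bigrams(tokens):
    """Multiset of adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))

def rouge2_f(system, reference):
    """Bigram-overlap recall, precision, and F-measure between two
    token lists (single reference; a simplified sketch of ROUGE-2)."""
    sys_bi, ref_bi = bigrams(system), bigrams(reference)
    if not sys_bi or not ref_bi:
        return 0.0, 0.0, 0.0
    overlap = sum((sys_bi & ref_bi).values())  # clipped bigram matches
    recall = overlap / sum(ref_bi.values())
    precision = overlap / sum(sys_bi.values())
    f = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return recall, precision, f
```

F-measure is the natural criterion here because the compared systems produce outputs with different compression ratios, so recall or precision alone would favor longer or shorter outputs respectively.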
tree-base and prop-w-dpnd: Existing methods for Japanese sentence compression are not trained on a large dataset. We therefore compare the proposed method with two methods that do not rely on supervised learning: tree-base (Filippova and Strube, 2008) and prop-w-dpnd (Harashima and Kurohashi, 2012). tree-base is implemented as an integer linear programming problem that finds a subtree of a dependency tree.
prop-w-dpnd is also implemented as an integer linear programming problem, but is modified based on the characteristics of the Japanese language; it allows deletion inside chunks. The compression ratio (CR) of the test data is approximately 61.0%. Because the models have different CRs, we focus on the F-measure of ROUGE-2. The models trained on the large training dataset achieved higher F-measure than the unsupervised models. Moreover, the F-measure of lstm+leninit and lstm+lenemb, whether trained on multi-root+ or rooted, is significantly better than that of prop-w-dpnd (p < 0.001). lstm achieved lower R-1 and R-2 than the other models trained on the large training dataset, probably because it tends to generate sentences that are too short, as indicated by its low CR. It is also noteworthy that the CRs of lstm+leninit and lstm+lenemb are mostly closer to the average CR of the test data than that of lstm. Furthermore, lstm+leninit and lstm+lenemb worked better in terms of F-measure when trained on multi-root+ rather than rooted. We consider this is because the various types of deletion make the compression model more flexible, as indicated by the CR of the model trained on multi-root+ being closer to the average CR of the test data.

Human evaluation
We first investigated the difference between lstm+leninit and lstm+lenemb trained on multi-root+. With lstm+leninit, 2 out of 100 randomly chosen sentences ended with a word that cannot be located at the end of a sentence. In contrast, with lstm+lenemb, 24 sentences ended with such words and are therefore ungrammatical, although lenemb has been shown to be effective in abstractive sentence summarization (Kikuchi et al., 2016). This result suggests that lstm+lenemb is excessively affected by the desired length, because lenemb receives the potential desired length at each decoding step. In fact, 21 of the 24 sentences are exactly as long as the desired length.

Then, lstm+leninit trained on multi-root+ was evaluated by crowdsourcing in comparison with the gold standard and tree-base. Each crowdsourcing worker reads a source sentence and a set of its compressed sentences, presented in random order, and gives each compressed sentence a score from 1 to 5 (where 5 is best) for informativeness (info) and readability (read). As shown in Table 3, lstm+leninit achieved higher scores than prop-w-dpnd in terms of both read and info, and the difference is significant (p < 0.001).

Table 4: Human relative evaluation.
Next, lstm+leninit was compared with lstm (both trained on multi-root+), and multi-root+ was compared with rooted (with lstm+leninit as the model), by human relative evaluation. In this evaluation, each worker votes for one of lstm+leninit, lstm, or tie, and for one of multi-root+, rooted, or tie. 250 votes in total were received (50 sentences × 5 votes). The results are shown in Table 4. lstm+leninit is better than lstm in terms of read, which is not directly related to output length, as well as info. It is also clear that multi-root+ achieves much higher info with only a negligible reduction in read compared with rooted.

Conclusion
Filippova et al.'s method of creating training data for English sentence compression was modified to create a training dataset for Japanese sentence compression. The effectiveness of the created Japanese training dataset was verified by automatic and human evaluation. Our modifications refine the created dataset by allowing more flexible compression, achieving better informativeness with a negligible reduction in readability. Furthermore, controlling the output length was shown to improve the performance of the sentence compression models.