Ensuring Readability and Data-fidelity using Head-modifier Templates in Deep Type Description Generation

A type description is a succinct noun compound which helps humans and machines quickly grasp the informative and distinctive characteristics of an entity. Entities in most knowledge graphs (KGs) still lack such descriptions, calling for automatic methods to supplement this information. However, existing generative methods either overlook the grammatical structure or make factual mistakes in the generated texts. To solve these problems, we propose a head-modifier template based method to ensure the readability and data fidelity of generated type descriptions. We also propose a new dataset and two metrics for this task. Experiments show that our method improves substantially over baselines and achieves state-of-the-art performance on both datasets.


Introduction
Large-scale open domain KGs such as DBpedia (Auer et al., 2007), Wikidata (Vrandečić and Krötzsch, 2014) and CN-DBpedia (Xu et al., 2017) are increasingly drawing the attention from both academia and industries, and have been successfully used in many applications that require background knowledge to understand texts.
In KGs, a type description (Bhowmik and de Melo, 2018) is a kind of description which reflects the rich information of an entity with little cognitive effort. A type description must be informative, distinctive and succinct to help humans quickly grasp the essence of an unfamiliar entity. Compared to other kinds of data in a KG, types in the entity-typing task (Shimaoka et al., 2016; Ren et al., 2016) are too general and not informative enough (e.g., when asked "what is rue Cazotte?", street in Paris, France is obviously more informative and distinctive than the type location), and the fixed type set is too inflexible to expand; meanwhile, infoboxes and abstracts are too long and carry too much information, which increases cognitive burden.

Figure 1: An example of the two-stage generation of our head-modifier template-based method. $hed$ and $mod$ are the placeholders for head and modifier components in the template.
Type descriptions are useful for a wide range of applications, including question answering (e.g. what is rue Cazotte?), named entity disambiguation (e.g. Apple (fruit of the apple tree) vs Apple (American technology company)), taxonomy enrichment, etc. However, many entities in current open-domain KGs still lack such descriptions. For example, in DBpedia and CN-DBpedia respectively, there are only about 21% and 1.8% entities that are provided with such descriptions 1 .
Essentially, a type description is a noun compound, which follows a grammatical rule called head-modifier rule (Hippisley et al., 2005;Wang et al., 2014). It always contains a head component (also head words or heads), and usually contains a modifier component (also modifier words or modifiers). The head component representing the type information of the entity makes it distinctive from entities of other types; the modifier component limits the scope of that type, making it more finegrained and informative. For example, in street in Paris, France, the head word street indicates that it is a street, and the modifier words Paris and France indicate the street is located in Paris, France.
Due to the low recall and limited patterns of extractive methods (Hearst, 1992), generative methods are more suitable to acquire more type descriptions. Generally, there are several challenges in generating a type description from an infobox: 1) it must be grammatically correct to be readable, given that a trivial mistake could lead to a syntax error (e.g. street with Paris, France); 2) it must guarantee the data fidelity towards input infobox, e.g., the system shouldn't generate street in Germany for a French street; 3) its heads must be the correct types for the entity, and a mistake in heads is more severe than in modifiers, e.g., in this case, river in France is much worse than street in Germany.
We argue that the head-modifier rule is crucial to ensure readability and data fidelity in type description generation. However, existing methods pay little attention to it. Bhowmik and de Melo (2018) first propose a dynamic memory-based generative network to generate type descriptions from infoboxes in a neural manner. They utilize a memory component to help the model better remember the training data. However, it tends to lose the grammatical structure of the output, as it cannot distinguish heads from modifiers in the generation process. It also cannot handle the out-of-vocabulary (OOV) problem, and many modifier words may be rare or OOV. Other data-to-text (Wiseman et al., 2017; Sha et al., 2018) and text-to-text (Gu et al., 2016; Gulcehre et al., 2016; See et al., 2017) models equipped with a copy mechanism alleviate the OOV problem, but do not consider the difference between heads and modifiers, resulting in grammatical or factual mistakes.
To solve the problems above, we propose a head-modifier template-based method.
To the best of our knowledge, we are the first to integrate head-modifier rule into neural generative models. Our method is based on the observation that a head-modifier template exists in many type descriptions. For example, by replacing heads and modifiers with placeholders $hed$ and $mod$, the template for street in Paris, France is $hed$ in $mod$, $mod$, which is also the template for a series of similar type descriptions such as library in California, America, lake in Siberia, Russia, etc. Note that, the $hed$ and $mod$ can appear multiple times, and punctuation like a comma is also an important component of a template.
Identifying the head and modifier components is helpful for providing structural and contextual cues in content selection and surface realization in generation, which correspond to data fidelity and readability respectively. As shown in Fig.1, the model can easily select the corresponding properties and values and organize them by the guidance of the template. The head-modifier template is universal as the head-modifier rule exists in any noun compound in English, even in Chinese (Hippisley et al., 2005). Therefore, the templates are applicable for open domain KGs, with no need to design new templates for entities from other KGs.
There are no existing head-modifier templates to train from, therefore we use the dependency parsing technique (Manning et al., 2014) to acquire templates in training data. Then, as presented in Fig.1, our method consists of two stages: in Stage 1, we use an encoder-decoder framework with an attention mechanism to generate a template; in Stage 2, we use a new encoderdecoder framework to generate a type description, and reuse previously encoded infobox and apply a copy mechanism to preserve information from source to target. Meanwhile, we apply another attention mechanism upon generated templates to control the output's structure. We then apply a context gate mechanism to dynamically select contexts during decoding.
In brief, our contributions 2 in this paper include: 1) we propose a new head-modifier template-based method to improve the readability and data fidelity of generated type descriptions, which is also the first attempt at integrating the head-modifier rule into neural generative models; 2) we apply copy and context gate mechanisms to enhance the model's ability to choose contents under the guidance of templates; 3) we propose a new dataset with two new automatic metrics for this task, and experiments show that our method achieves state-of-the-art performance on both datasets.

Method
In this section, we describe our method in detail. As shown in Fig. 2, given an entity from Wikidata 3 and its corresponding infobox, we split the generation process into two stages. In Stage 1, the model takes an infobox as input and generates a head-modifier template. In Stage 2, the model takes the previously encoded infobox and the output template as input, and produces a type description. Note that our model is trained in an end-to-end manner.

Stage 1: Template Generation
In this stage, we use an encoder-decoder framework to generate a head-modifier template of the type description.

Infobox Encoder
Our model takes as input an infobox of an entity, which is a series of (property, value) pairs denoted as I. We then reconstruct them into a sequence of words to apply Seq2Seq learning. In order to embed structural information from the infobox into the word representation, following Lebret et al. (2016), we represent each input token as x_i = [v_{x_i}; f_{x_i}; p_{x_i}], where v_{x_i} is the word embedding, f_{x_i} is the corresponding property embedding, p_{x_i} is the positional information embedding, and [·; ·] stands for vector concatenation. For example, as shown in Fig. 3, we reconstruct (named after, Jacques Cazotte) into Jacques with (named after, 0) and Cazotte with (named after, 1), as Jacques is the first token in the value and Cazotte is the second. Next, we concatenate the embeddings of Jacques, named after and 0 as the reconstructed embedding for Jacques. Notice that we have three separate embedding matrices for properties, value words and positions; that is, even though the property country is the same string as the value country, they are not the same token.
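As a minimal illustration of this reconstruction step (the function name and tuple layout are our own simplification, not the paper's implementation), the infobox can be flattened into a position-annotated token sequence like so:

```python
def linearize_infobox(infobox):
    """Flatten (property, value) pairs into (word, property, position)
    triples; in the model, the word, property, and position each index a
    separate embedding matrix and the three embeddings are concatenated."""
    tokens = []
    for prop, value in infobox:
        for pos, word in enumerate(value.split()):
            tokens.append((word, prop, pos))
    return tokens

pairs = [("named after", "Jacques Cazotte"), ("country", "France")]
print(linearize_infobox(pairs))
# → [('Jacques', 'named after', 0), ('Cazotte', 'named after', 1),
#    ('France', 'country', 0)]
```

The resulting triples are what the encoder consumes in place of plain words.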
Then, we employ a standard GRU (Chung et al., 2014) to read the input X = {x_i}_{i=1}^{L_x} and produce a sequence of hidden states H_x, which are shared in both stages, where L_x is the length of the input sequence. In this task, the type descriptions are diversified yet follow the head-modifier rule. Stage 1 of our model learns the templates from training data, but there are no existing templates available for training template generation. Therefore, we acquire head-modifier templates by using the dependency parser provided by Stanford CoreNLP (Manning et al., 2014).
Specifically, a type description is formed by head words (or heads), modifier words (or modifiers) and conjunctions. In our work, we refer to words that are types as heads in a type description, so there could be multiple heads. For example, singer and producer in American singer, producer are both head words.
During dependency parsing, the root of a noun compound is always a head word of the type description. Therefore, we acquire heads by finding the root and its parallel terms. The remaining words except conjunctions and stopwords are considered to be modifiers. We then obtain the template by substituting heads with $hed$ and modifiers with $mod$, as shown in Fig.4.
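A sketch of this substitution step, assuming the dependency labels have already been produced by a parser such as Stanford CoreNLP (the label set, stopword list and function name here are simplified placeholders of ours):

```python
STOPWORDS = {"in", "of", "the", "and"}

def make_template(words, deps):
    """Turn a type description into a head-modifier template: the root
    and its parallel ('conj') terms become $hed$; conjunctions, stopwords
    and punctuation are kept verbatim; everything else becomes $mod$."""
    out = []
    for word, dep in zip(words, deps):
        if dep in ("ROOT", "conj"):
            out.append("$hed$")
        elif word.lower() in STOPWORDS or dep in ("case", "cc", "punct"):
            out.append(word)
        else:
            out.append("$mod$")
    return " ".join(out)

print(make_template(["street", "in", "Paris", ",", "France"],
                    ["ROOT", "case", "nmod", "punct", "appos"]))
# → $hed$ in $mod$ , $mod$
```

The same template also covers descriptions such as library in California, America, which is what makes the template vocabulary compact.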

Template Decoder
In template generation, the template decoder D_1 takes as input the previously encoded hidden states H_x and produces a series of hidden states {s^1_1, s^1_2, ..., s^1_{L_t}} and a template sequence T = {t_1, t_2, ..., t_{L_t}}, where L_t is the length of the generated template. As template generation is a relatively lighter and easier task, we apply a canonical attention decoder as D_1, with GRU as the RNN unit.
Formally, at each time step j, the decoder produces a context vector

c^1_j = Σ_{i=1}^{L_x} α_{ji} h_i, where α_{ji} = softmax_i(η(s^1_j, h_i)),

and η(s^1_j, h_i) is a relevance score between the encoder hidden state h_i and the decoder hidden state s^1_j. Among many ways to compute this score, in this work we apply the general product (Luong et al., 2015) to measure the similarity between the two:

η(s^1_j, h_i) = s^1_j W_1 h_i,

where W_1 is a learnable parameter. Then the decoder state is updated by s^1_j = GRU(s^1_{j-1}, [e(t_{j-1}); c^1_j]), where e(t) embeds the token t. Finally, the results are fed into a softmax layer, from which the system produces t_j.
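The general-product attention above can be sketched numerically (shapes, names and the random inputs are ours; the actual model uses learned parameters):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def general_attention(s_j, H, W):
    """Luong-style 'general' attention: each score is s_j^T W h_i, the
    softmax of the scores gives weights, and the context vector is the
    weighted sum of the encoder states."""
    scores = np.array([s_j @ W @ h_i for h_i in H])
    alpha = softmax(scores)          # attention weights over L_x positions
    return alpha @ H, alpha          # context vector c_j and weights

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))          # 4 encoder hidden states, dim 8
s_j = rng.normal(size=8)             # one decoder hidden state
W = rng.normal(size=(8, 8))          # the learnable parameter W_1
c_j, alpha = general_attention(s_j, H, W)
```

The context vector c_j lives in the same space as the encoder states and is fed into the GRU update alongside the previous token's embedding.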

Stage 2: Description Generation
After Stage 1 is finished, the generated template sequence T and the infobox encoder hidden states H x are fed into Stage 2 to produce the final type description.

Template Encoder
As the template is an ordered sequence, we use a bidirectional GRU (Schuster and Paliwal, 1997) to encode the template sequence into another series of hidden states H_t = {h^2_i}_{i=1}^{L_t}. Then we feed both H_t and H_x to the description decoder for further refinement.

Description Decoder
The description decoder D_2 is a GRU-based decoder which utilizes a dual attention mechanism: a canonical attention mechanism and a copy mechanism to attend over the template representation H_t and the infobox representation H_x respectively. This is because we need the model to preserve information from the source while maintaining the head-modifier structure learned from the templates.
In detail, let s^2_j be D_2's hidden state at time step j. The first canonical attention mechanism is similar to the one described in Section 2.1.3, except that the decoder hidden states are replaced and the related learnable parameters are changed. By applying it, we obtain a context vector c^t_j over H_t and a context vector c^x_j over H_x.
Then, we use context gates proposed by Tu et al. (2017) to dynamically balance the contexts from infobox, template, and target, and decide the ratio at which three contexts contribute to the generation of target words.
Formally, we calculate the context gates g^*_j (for * ∈ {x, t}) by

g^*_j = σ(W^*_g s^2_{j-1} + U^*_g e(y_{j-1}) + C^*_g c^*_j),

where W^*_g, U^*_g, C^*_g are all learnable parameters, σ is the sigmoid function, and e(y) embeds the word y. After that, we apply a linear interpolation to integrate these contexts and update the decoder state:

s^2_j = f(W e(y_{j-1}) + U s^2_{j-1} + g^x_j ⊙ C_1 c^x_j + g^t_j ⊙ C_2 c^t_j),

where W, U, C_1, C_2 are all learnable parameters.
To conduct a sort of slot-filling procedure and enhance the model's ability to directly copy words from the infobox, we further apply the conditional copy mechanism (Gulcehre et al., 2016) upon H_x. As the produced words may come from the vocabulary or directly from the infobox, we assume a new decoding vocabulary V' = V ∪ {x_i}_{i=1}^{L_x}, where V is the original vocabulary with vocabulary size N, and unk is the replacement for out-of-vocabulary words.
Following Wiseman et al. (2017), the probabilistic function of y_j is as follows:

p(y_j, z_j | y_<j, I, T) =
  p_copy(y_j | y_<j, I, T) · p(z_j | y_<j, I),  if z_j = 0
  p_gen(y_j | y_<j, I, T) · p(z_j | y_<j, I),   if z_j = 1

where z_j is a binary variable deciding whether y_j is copied from I or generated, and p(z_j | ·) is the switcher between copy and generate mode, implemented as a multi-layer perceptron (MLP). p_copy(y_j | ·) and p_gen(y_j | ·) are the probabilities of the copy mode and the generate mode respectively, calculated by applying softmax on the copy scores φ_copy and the generation scores φ_gen:

φ_copy(y_j = x_i) = h_i^T W_c s^2_j,  φ_gen(y_j) = W_g s^2_j,

where W_c, W_g are both learnable parameters. Therefore, a word is considered a copied word if it appears in the value portion of the source infobox.
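The copy/generate mixture can be sketched as follows (a simplified stand-in for the model's learned scores and switcher, with names of our own choosing):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def copy_or_generate(phi_copy, phi_gen, p_switch_gen):
    """Combine the copy-mode distribution over source infobox tokens and
    the generate-mode distribution over the vocabulary, weighted by the
    switcher probability p(z_j = 1 | .)."""
    p_copy = softmax(phi_copy) * (1.0 - p_switch_gen)  # z_j = 0 branch
    p_gen = softmax(phi_gen) * p_switch_gen            # z_j = 1 branch
    return p_copy, p_gen

p_copy, p_gen = copy_or_generate(np.array([1.0, 0.5]),
                                 np.array([2.0, 0.1, 0.3]), 0.7)
# the two branches together form one valid distribution over V ∪ source
print(round(p_copy.sum() + p_gen.sum(), 6))
# → 1.0
```

At inference time, the argmax is taken over the concatenation of the two branches, so an out-of-vocabulary source word can still be emitted.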

Learning
Our model can be optimized in an end-to-end manner and is trained to minimize the negative log-likelihood of the annotated templates T given the infobox I, and of the ground-truth type descriptions given T and I. Formally,

L = L_1 + L_2,
L_1 = − Σ_{j=1}^{L_t} log p(t_j | t_<j, I),
L_2 = − Σ_{i=1}^{L_y} log p(y_i | y_<i, I, T),

where L_1 is the loss in Stage 1, L_2 is the loss in Stage 2, and L_y is the length of the target.

Experiments
In this section, we conduct several experiments to demonstrate the effectiveness of our method.

Datasets
We conduct experiments on two English datasets sampled from Wikidata, referred to as Wiki10K and Wiki200K respectively. Wiki10K is the original dataset proposed by Bhowmik and de Melo (2018), which consists of 10K entities sampled from the official RDF exports of Wikidata dated 2016-08-01. However, this dataset is not only too small to reveal the subtlety of models, but also relatively imbalanced, with too many human entities according to the property instance of. Therefore, we propose a new and larger dataset, Wiki200K, which consists of 200K entities more evenly sampled from Wikidata dated 2018-10-01. Note that, in both Wiki10K and Wiki200K, we filter out all the properties whose data type is not wikibase-item, wikibase-property or time, according to the Wikidata database reports 4. KGs such as Wikidata are typically composed of semantic triples. A semantic triple is formed by a subject, a predicate and an object, corresponding to entity, property and value in Wikidata.
We make sure that every entity in both datasets has at least 5 property-value pairs (or statements in Wikidata parlance) and an English type description. The basic statistics of the two datasets are shown in Table 1. We then randomly divide the two datasets into train, validation and test sets at a ratio of 8:1:1.

Table 1: Statistics for both datasets, where "#" denotes a count and avg is short for average. "Copy(%)" denotes the copy ratio in the golden type descriptions excluding stopwords, which is similar to the metric ModCopy defined in Section 3.2.

Evaluation Metrics
Following common practice, we evaluate different aspects of generation quality with automatic metrics broadly applied in many natural language generation tasks, including BLEU (B-1, B-2) (Papineni et al., 2002), ROUGE (RG-L) (Lin, 2004), METEOR (Banerjee and Lavie, 2005) and CIDEr (Vedantam et al., 2015). BLEU measures the n-gram overlap between results and the ground truth, giving a broad view of fluency, while ROUGE emphasizes the precision and recall between the two. METEOR matches human perception better, and CIDEr captures human consensus. Nonetheless, these metrics depend highly on comparison with the ground truth rather than with the system's input. In this task, the output may still be correct with respect to the input infobox even if it differs from the ground truth. Therefore, we introduce two simple automatic metrics designed for this task to give a better perspective on the data fidelity of generated texts:

• Modifier Copy Ratio (ModCopy). We evaluate data fidelity with regard to preserving source facts by computing the ratio of modifier words (i.e., excluding stopwords and head words) in the type descriptions that are copied from the source. In detail, we roughly consider a word in a type description to be a copied word if it shares an L-character (4 in our experiments) prefix with any non-stopword word in the values of the source infobox. For example, the modifier Japanese could be a copied modifier word from the fact (country, Japan). To clarify, the copy ratio of a type description is calculated as the number of copied modifier words divided by the total number of modifier words.

• Head Accuracy (HedAcc). For a type description, it is crucial that the head word is the right type of the entity. Therefore, in order to give an approximate estimate of data fidelity with regard to head words, we also evaluate the head word's accuracy in the output. Note that aside from the ground truth, the infobox is also a reliable source of candidate types. Specifically, in Wikidata, the values of instance of (P31) and subclass of (P279) are usually suitable types for an entity, though not every entity has these properties, and these types can be too coarse-grained (e.g., human). Therefore, after dependency parsing, we compare the head words in the output against the heads from the corresponding ground truth and the values of the corresponding infobox properties, which gives an accuracy for the heads of the output. Head Accuracy measures the model's ability to predict the right type of the entity.
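A rough sketch of the ModCopy computation (this version skips head-word exclusion for brevity, so it slightly overcounts; the function name and stopword list are placeholders of ours):

```python
STOPWORDS = {"in", "of", "the", "a", "an", "and"}

def mod_copy_ratio(description, infobox_values, prefix_len=4):
    """Fraction of non-stopword words in the description that share a
    prefix_len-character prefix with a non-stopword word from the
    source infobox values (the paper additionally excludes head words)."""
    source = [w.lower() for v in infobox_values for w in v.split()
              if w.lower() not in STOPWORDS]
    words = [w.lower() for w in description.split()
             if w.lower() not in STOPWORDS]
    copied = sum(any(w[:prefix_len] == s[:prefix_len] for s in source)
                 for w in words)
    return copied / max(len(words), 1)

# "Japanese" matches "Japan" on the 4-character prefix "japa"
print(mod_copy_ratio("Japanese singer", ["Japan", "music"]))
# → 0.5
```

The prefix match is what lets inflected forms such as Japanese/Japan count as copies without a morphological analyzer.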

Baselines and Experimental Setup
We compare our method with several competitive generative models. All models except DGN are implemented with the help of OpenNMT-py (Klein et al., 2017). Note that we use the same infobox reconstruction method described in Section 2.1.1 to apply Seq2Seq learning for all models except DGN, since it has its own encoding method. The baselines include:

• AttnSeq2Seq (Luong et al., 2015). AttnS2S is a standard RNN-based Seq2Seq model with an attention mechanism.
• Pointer-Generator (See et al., 2017). Ptr-Gen is originally designed for text summarization, providing a strong baseline with a copy mechanism. Note that, in order to make  a fairer comparison with our model, we additionally equip Ptr-Gen with context gate mechanism so that it becomes a no-template version of our method.
• Transformer (Vaswani et al., 2017). Transformer recently outperforms traditional RNN architecture in many NLP tasks, which makes it also a competitive baseline, even if it's not specifically designed for this task.
• DGN (Bhowmik and de Melo, 2018). DGN uses a dynamic memory based network with a positional encoder and an RNN decoder. It achieved state-of-the-art performance in this task.
In experiments, we decapitalize all words and keep vocabularies at the size of 10,000 and 50,000 for Wiki10K and Wiki200K respectively, and use unk to represent other out-of-vocabulary words.
For the sake of fairness, the hidden size of RNN (GRU in our experiments) and Transformer in all models are set to 256. The word embedding size is set to 256, and the property and position embedding sizes are both set to 128. During training, we use Adam (Kingma and Ba, 2014) as the optimization algorithm.

Results and Analysis
The experimental results of the metrics described in Section 3.2 are listed in Table 2. In general, our method achieves state-of-the-art performance over the proposed baselines.
As shown in the table, our method improves substantially over standard encoder-decoder models (AttnS2S and Transformer) and the previous state-of-the-art method (DGN). Interestingly, DGN is outperformed by Ptr-Gen on Wiki10K and by most of the models on the larger dataset Wiki200K. We also notice that Transformer performs much better on Wiki200K, most likely because of its capacity to learn from massive training data. These results further prove the necessity of our new dataset. Among the baselines, Ptr-Gen achieves relatively better results due to the copy mechanism and the context gate mechanism. These mechanisms give the model the ability to cope with the OOV problem and to directly preserve information from the source, which is important in this task. Note that, as described in Section 3.3, we enhance the Pointer-Generator so that it becomes a no-template version of our model; therefore the effect of the head-modifier template can be measured by comparing the results of these two methods. The results demonstrate that our head-modifier template plays an important role in generating type descriptions.
In terms of the two proposed metrics, we find them roughly positively correlated with the traditional metrics, which to some extent justifies our metrics. They also provide interesting points of view on measuring generation quality. The performance on ModCopy indicates that the methods with a copy mechanism (Ptr-Gen, ours) improve data fidelity by copying facts from the source, and the template helps the model know where and how to copy. The performance on HedAcc demonstrates that our method is relatively better at predicting types for an entity, which suggests the templates help the generated text maintain the head-modifier structure so that the head word is successfully parsed by the dependency parser. However, we notice that on Wiki200K, models perform relatively worse on ModCopy and better on HedAcc than on Wiki10K. This is most likely because the set of entity types is finite, and more training data leads to higher accuracy in predicting types, while due to the size of the dataset and the limit of the vocabulary size, the factual information is harder to preserve in the output. This again proves the necessity of the new dataset.

Manual Evaluation
In this task, the readability of a generated type description is mostly related to its grammatical correctness, which benefits from the head-modifier templates. Therefore, in order to measure the influence of the templates on readability, as well as how ModCopy (M.C.) and HedAcc (H.A.) correlate with manual judgment, we manually evaluate the generations from two aspects: Grammar Accuracy (G.A.) and Overall Accuracy (O.A.). In detail, Grammar Accuracy is the grammatical correctness judged from the generated text alone; Overall Accuracy is the grammatical and factual correctness of the generated type description given an infobox and the ground truth. Note that Overall Accuracy is always lower than or equal to Grammar Accuracy.
In our experiment, we randomly select 200 pieces of data from the test set of Wiki200K, and provide the results of each method to volunteers (all undergraduates) for manual evaluation. We make sure each result is evaluated by two volunteers so as to eliminate the influence of subjective factors to some extent.

Table 3: Results of manual evaluation as well as the two proposed metrics.
The results, shown in Table 3, again prove the effectiveness of our method. Our method outperforms the other baselines in terms of Grammar Accuracy, which demonstrates that the model benefits from the head-modifier templates in terms of readability by knowing "how to say it". In particular, the templates improve Grammar Accuracy substantially compared with Ptr-Gen. Results on Overall Accuracy indicate that our method ensures readability as well as data fidelity, which suggests that the model benefits from the templates by knowing "what to say". As for the proposed metrics ModCopy and HedAcc, they are, in line with intuition, relatively positively correlated with human judgment in general. Also, notice that the statistics on both metrics are consistent with Table 2.

Effect of Templates
We aim to investigate whether the model is able to correct itself if the template generated in Stage 1 deviates from the correct one. We select cases from the Wiki10K test set to conduct experiments. During inference, we deliberately replace the template in Stage 2 to see whether the generated text still complies with the given template, or whether the model is able to generate the right type description regardless.

Figure 5: Examples of replacing templates. The Template 1's are the initial generated templates, while the remaining ones are produced by the authors. We use bold to denote the heads and italic red to denote mistaken words.
The experimental results, as presented in Fig. 5, show our method's resilience against mistaken templates. In the first case: 1) the replaced Template 2 is obviously inconsistent with the golden Template 1 (though it is also a possible template for other type descriptions), yet the model still manages to generate a type description, though paris is lost; 2) Template 3 does not have the conjunction in, which causes confusion, but the model still successfully predicts the right head.
In the second case, the model originally generates repetitive heads: 1) in Template 2, we delete the second $hed$ from Template 1, and as a result the model successfully generates a correct though incomplete output; 2) Template 3 is completely wrong judging by the head-modifier rule, and as a result Output 3 loses readability. Still, the model tries to maintain a similar structure and successfully keeps data fidelity by predicting teacher and preserving italy. Nevertheless, since the number of possible type descriptions is infinite yet the number of head-modifier templates is rather finite, the model can hardly generate a completely wrong template, so this scenario rarely happens in practice.

Related Work
There has been extensive work on mining entity-type pairs (i.e., isA relations) automatically. Hearst (1992) uses a pattern-based method to extract isA pairs directly from free text with Hearst Patterns (e.g., NP_1 is a NP_2; NP_0 such as {NP_1, NP_2, ..., (and|or) NP_n}), from which taxonomies can be induced (Poon and Domingos, 2010; Velardi et al., 2013; Bansal et al., 2014). But these methods are limited by their patterns, which often results in low recall and precision.
The most related line of work regarding predicting types for entities is entity typing (Collins and Singer, 1999; Jiang and Zhai, 2006; Ratinov and Roth, 2009), which aims to assign types such as people or location from a fixed set to entity mentions in a document; most approaches model it as a classification task. However, the types, even for those aiming at fine-grained entity typing (Shimaoka et al., 2016; Ren et al., 2016; Anand et al., 2017), are too coarse-grained to be informative about the entity. Also, the type set is too small and inflexible to meet the needs of an ever-expanding KG.
In this task, the structured infobox is a more suitable source than the textual data used in text summarization (Gu et al., 2016; See et al., 2017; Cao et al., 2018), because not every entity in a KG possesses a paragraph of description. For example, in CN-DBpedia (Xu et al., 2017), one of the biggest Chinese KGs, only a quarter of the entities have textual descriptions, yet almost every entity has an infobox.
Natural language generation (NLG) from structured data is a classic problem in which many efforts have been made. A common approach is to use hand-crafted templates (Kukich, 1983; McKeown, 1992), but acquiring such templates for a specific domain is too costly. Some work focuses on automatically creating templates by clustering sentences and then using hand-crafted rules to induce templates (Angeli et al., 2010; Konstas and Lapata, 2013). Recently, with the rise of neural networks, many methods generate text in an end-to-end manner (Wiseman et al., 2017; Bhowmik and de Melo, 2018). However, they pay little attention to the grammatical structure of the output, which may be tolerable when generating long sentences but is crucial when generating short noun compounds like type descriptions.

Conclusion and Future Work
In this paper, we propose a head-modifier template-based type description generation method, powered by a copy mechanism and a context gating mechanism. We also propose a larger dataset and two metrics designed for this task. Experimental results demonstrate that our method achieves state-of-the-art performance over baselines on both datasets while ensuring data fidelity and readability in the generated type descriptions. Further experiments regarding the effect of templates show that our model is not only controllable through templates, but also resilient against wrong templates and able to correct itself. Aside from such syntactic templates, in the future we aim to explore how semantic templates can contribute to type description generation.