Learning to Generate Wikipedia Summaries for Underserved Languages from Wikidata

While Wikipedia exists in 287 languages, its content is unevenly distributed among them. In this work, we investigate the generation of open-domain Wikipedia summaries in underserved languages using structured data from Wikidata. To this end, we propose a neural network architecture equipped with copy actions that learns to generate single-sentence, comprehensible textual summaries from Wikidata triples. We demonstrate the effectiveness of the proposed approach by evaluating it against a set of baselines on two languages of different natures: Arabic, a morphologically rich language with a larger vocabulary than English, and Esperanto, a constructed language known for its easy acquisition.


Introduction
Despite the fact that Wikipedia exists in 287 languages, the existing content is unevenly distributed. The content of the most under-resourced Wikipedias is maintained by a limited number of editors, who cannot curate the same volume of articles as the editors of large Wikipedia language-specific communities. It is therefore of the utmost social and cultural interest to address languages for which native speakers only have access to an impoverished Wikipedia. In this paper, we propose an automatic approach to generate textual summaries that can be used as a starting point for the editors of the involved Wikipedias. We propose an end-to-end trainable model that generates a textual summary given a set of KB triples as input. We apply our model to two languages that have a severe lack of both editors and articles on Wikipedia. Esperanto is an easily acquired, artificially created language, which makes it less data-hungry and a more suitable starting point for exploring the challenges of this task. Arabic is a morphologically rich language that is much more challenging to work with, mainly due to its significantly larger vocabulary. As shown in Table 1, both Arabic and Esperanto suffer from a severe lack of content and active editors compared to the English Wikipedia, which is currently the biggest one in terms of number of articles.
† The authors contributed equally to this work.
Our research is mostly related to previous work on adapting the general encoder-decoder framework for the generation of Wikipedia summaries (Lebret et al., 2016; Chisholm et al., 2017). Nonetheless, all these approaches focus on the task of biography generation, and only in English, the language with the most language resources and knowledge bases available. In contrast with these works, we explore the generation of sentences in an open-domain, multilingual context. The model of Lebret et al. (2016) takes the Wikipedia infobox as input, while Chisholm et al. (2017) use a sequence of slot-value pairs extracted from Wikidata. Both models are only able to generate single-subject relationships. In our model, the input triples go beyond the single-subject relationships of a Wikipedia infobox or a Wikidata page about a specific item (Section 2). Similarly to our approach, a related model accepts a set of triples as input; however, it leverages instance-type-related information from DBpedia in order to generate text that addresses rare or unseen entities. Our solution is much broader since it does not rely on the assumption that unseen triples will follow the same pattern of property and entity instance-type pairs as those used for training. To this end, we use copy actions over the labels of entities in the input triples. This relates to previous work in machine translation that deals with the rare or unseen word problem when translating names and numbers in text.
Luong et al. (2015) propose a model that generates positional placeholders pointing to words in the source sentence and copies them to the target sentence (copy actions). Gulcehre et al. (2016) introduce separate trainable modules for copy actions to adapt to highly variable input sequences for text summarisation. For text generation from tables, Lebret et al. (2016) extend positional copy actions to copy values from fields in the given table. For question generation, Serban et al. (2016) use a placeholder for the subject entity in the question to generalise to unseen entities.
We evaluate our approach by measuring how close our synthesised summaries are to actual summaries in Wikipedia, against two baselines of different natures: a language model and an information retrieval template-based solution. Our model substantially outperforms all the baselines in all evaluation metrics in both Esperanto and Arabic. In this work we present the following contributions: i) we investigate the task of generating textual summaries from Wikidata triples in underserved Wikipedia languages across multiple domains, and ii) we use an end-to-end model with copy actions adapted to this task. Our datasets, results, and experiments are available at: https://github.com/pvougiou/Wikidata2Wikipedia.

Model
Our approach is inspired by encoder-decoder architectures that have already been employed on similar text generation tasks (Serban et al., 2016).

Encoding the Triples
The encoder part of the model is a feed-forward architecture that encodes the set of input triples into a vector of fixed dimensionality, which is subsequently used to initialise the decoder. Given a set of unordered triples $F_E = \{f_1, f_2, \ldots, f_R : f_j = (s_j, p_j, o_j)\}$, where $s_j$, $p_j$ and $o_j$ are the one-hot vector representations of the respective subject, property and object of the j-th triple, we compute an embedding $h_{f_j}$ for the j-th triple by forward propagating as follows:
$$h_{f_j} = q(W_h [W_{in} s_j ; W_{in} p_j ; W_{in} o_j])$$
$$h_{F_E} = W_F [h_{f_1} ; \ldots ; h_{f_{R-1}} ; h_{f_R}]$$
where $h_{f_j}$ is the embedding vector of each triple $f_j$ and $h_{F_E}$ is a fixed-length vector representation for all the input triples $F_E$; $q$ is a non-linear activation function and $[\ldots ; \ldots]$ represents vector concatenation. $W_{in}$, $W_h$ and $W_F$ are trainable weight matrices. Unlike Chisholm et al. (2017), our encoder is agnostic with respect to the order of input triples. As a result, the position of a particular triple $f_j$ in the triple set does not change its significance in the computation of the vector representation of the whole set, $h_{F_E}$.
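The encoding step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the activation q is taken to be tanh, the number of triples R is fixed, all dimensions and the random initialisation are made up, and the helper names (`matvec`, `rand_matrix`, `encode_triples`) are hypothetical. Because the inputs are one-hot vectors, multiplying by the input matrix reduces to looking up one embedding per vocabulary index.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix; the scale is an arbitrary choice for this sketch."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def encode_triples(triples, E, W_h, W_F):
    """Encode (subject, property, object) index triples into one fixed-length vector.

    E[i] plays the role of W_in applied to the one-hot vector of item i.
    """
    triple_vecs = []
    for s, p, o in triples:
        concat = E[s] + E[p] + E[o]  # list concatenation: [W_in s; W_in p; W_in o]
        triple_vecs.append([math.tanh(v) for v in matvec(W_h, concat)])  # q = tanh (assumed)
    flat = [v for h in triple_vecs for v in h]  # [h_f1; ...; h_fR]
    return matvec(W_F, flat)

rng = random.Random(0)
vocab_size, emb_dim, hid_dim, R = 10, 4, 6, 3  # toy sizes, not the paper's settings
E = rand_matrix(vocab_size, emb_dim, rng)
W_h = rand_matrix(hid_dim, 3 * emb_dim, rng)
W_F = rand_matrix(hid_dim, R * hid_dim, rng)
h_FE = encode_triples([(0, 1, 2), (0, 3, 4), (5, 6, 0)], E, W_h, W_F)
print(len(h_FE))
```

The resulting vector has the decoder's hidden dimensionality, so it can be used directly as the initial hidden state.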

Decoding the Summary
The decoder part of the architecture is a multi-layer RNN (Cho et al., 2014) with Gated Recurrent Units (GRUs), which generates the textual summary one token at a time. The hidden unit of the GRU at the first layer is initialised with $h_{F_E}$. At each timestep t, the hidden state of the GRU at each layer is computed from its hidden state at the previous timestep and its current input. The conditional probability distribution over each token $y_t$ of the summary at each timestep t is computed as $\mathrm{softmax}(W_{out} h^L_t)$ over all the possible entries in the summaries dictionary, where $h^L_t$ is the hidden state of the last layer and $W_{out}$ is a biased trainable weight matrix. A summary consists of words and mentions of entities in the text. We adapt the concept of surface form tuples in order to be able to learn an arbitrary number of different lexicalisations of the same entity in the summary (e.g. "aktorino", "aktoro"). Figure 1 shows the architecture of our generative model when it is provided with the three triples of the idealised example of Table 2.
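For reference, one common formulation of the GRU update (following Cho et al., 2014, with bias terms omitted) is given below, together with the output distribution stated in the text. The symbols $x_t$, $W_z$, $U_z$, $W_r$, $U_r$, $W$ and $U$ are notation introduced here for illustration and are not taken from the paper; $x_t$ is assumed to be the embedding of the previous token at the first layer and the hidden state of the layer below otherwise. Some libraries swap the roles of $z_t$ and $1 - z_t$.

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1}) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1}) \\
\tilde{h}_t &= \tanh\bigl(W x_t + U (r_t \odot h_{t-1})\bigr) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t \\
p(y_t \mid y_{<t}, F_E) &= \mathrm{softmax}(W_{out} h^L_t)
\end{aligned}
```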

Copy Actions
Following Luong et al. (2015) and Lebret et al. (2016), we model all the copy actions on the data level through a set of special tokens added to the basic vocabulary. Rare entities that are identified in the text and exist in the input triples are replaced by the token of the property of the relationship to which they were matched. We refer to those tokens as property placeholders. In Table 2, [[P17]] in the vocabulary-extended summary is an example of a property placeholder: if it is generated by our model, it is replaced with the label of the object of the triple with which it shares the same property (i.e. Q490900 (Floridia) P17 (ŝtato) Q38 (Italio)). When all the tokens of the summary have been sampled, each generated property placeholder is mapped to the triple with which it shares the same property and is subsequently replaced with the textual label of the entity. If more than one triple in the input set has the same property, we choose one of the corresponding entities at random.
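The post-processing step described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the function name `realise_placeholders` is hypothetical, placeholders are assumed to have the form `[[Pxx]]`, the replacement entity is always taken from the object position, and the example sentence is illustrative rather than copied from Table 2.

```python
import random

def realise_placeholders(tokens, triples, labels, rng=None):
    """Replace each generated [[Pxx]] token with the label of a matching entity."""
    rng = rng or random.Random()
    output = []
    for tok in tokens:
        if tok.startswith("[[") and tok.endswith("]]"):
            prop = tok[2:-2]
            # Objects of all input triples that share the placeholder's property.
            candidates = [o for (_, p, o) in triples if p == prop and o in labels]
            if candidates:
                # Several triples may share a property; pick one at random.
                output.append(labels[rng.choice(candidates)])
                continue
        output.append(tok)  # ordinary word, or placeholder with no matching triple
    return output

triples = [("Q490900", "P17", "Q38")]  # the triple quoted in the text
labels = {"Q38": "Italio"}
generated = ["Floridia", "estas", "komunumo", "de", "[[P17]]", "."]  # illustrative
result = realise_placeholders(generated, triples, labels, random.Random(0))
print(" ".join(result))  # Floridia estas komunumo de Italio .
```

Keeping the copy mechanism at the data level means the network itself needs no architectural change: it only has to learn when to emit a placeholder token.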

Implementation and Training Details
We implemented our neural network models using the Torch package. We included the 15,000 and 25,000 most frequent tokens (i.e. either words or entities) of the summaries in Esperanto and Arabic respectively in the target vocabulary of the textual summaries. The larger target dictionary in Arabic is due to its greater linguistic variability: the Arabic vocabulary is 47% larger than the Esperanto vocabulary (cf. Table 1). We replaced any rare entities in the text that participate in relations in the aligned triple set with the corresponding property placeholder of the upheld relations. We include all property placeholders that occur at least 20 times in each training dataset. Subsequently, the dictionaries of the Esperanto and Arabic summaries are expanded by 80 and 113 property placeholders respectively. If a rare entity is not matched to any subject or object of the set of corresponding triples, it is replaced by the special <resource> token. Each summary is augmented with the respective start-of-summary <start> and end-of-summary <end> tokens.
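A minimal sketch of how such a target vocabulary could be assembled is shown below. It assumes summaries are already tokenised and annotated with `[[Pxx]]` placeholders; the function name `build_vocabulary` is hypothetical, and whether the special tokens count towards the 15,000 or 25,000 limit is not stated in the text, so here they are added on top of it.

```python
from collections import Counter

SPECIAL_TOKENS = ["<start>", "<end>", "<resource>"]

def build_vocabulary(summaries, max_tokens, min_placeholder_count=20):
    """Map the most frequent tokens and sufficiently frequent placeholders to indices."""
    token_counts, placeholder_counts = Counter(), Counter()
    for tokens in summaries:
        for tok in tokens:
            if tok in SPECIAL_TOKENS:
                continue  # always included, so not counted
            if tok.startswith("[[P") and tok.endswith("]]"):
                placeholder_counts[tok] += 1
            else:
                token_counts[tok] += 1
    vocab = list(SPECIAL_TOKENS)
    vocab += [tok for tok, _ in token_counts.most_common(max_tokens)]
    vocab += sorted(p for p, c in placeholder_counts.items() if c >= min_placeholder_count)
    return {tok: idx for idx, tok in enumerate(vocab)}

# Toy corpus with tiny thresholds so that the cut-offs are visible.
toy_summaries = [
    ["<start>", "a", "b", "[[P17]]", "<end>"],
    ["<start>", "a", "[[P17]]", "[[P19]]", "<end>"],
]
vocab = build_vocabulary(toy_summaries, max_tokens=1, min_placeholder_count=2)
print(sorted(vocab, key=vocab.get))
```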
For the decoder, we use 1 layer of GRUs. We set the dimensionality of the decoder's hidden state to 500 in Esperanto and 700 in Arabic. We initialise all parameters with a random uniform distribution between −0.001 and 0.001, and on the encoder side we use Batch Normalisation (Ioffe and Szegedy, 2015) before each non-linear activation function and after each fully-connected layer. During training, the model learns the parameters that minimise the sum of the negative log-likelihoods of a set of predicted summaries. The networks are trained using mini-batches of size 85. The weights are updated using Adam (Kingma and Ba, 2014), which was found to work better than Stochastic Gradient Descent, RMSProp and AdaGrad, with a learning rate of $10^{-5}$. An $l_2$ regularisation term of 0.1 over each network's parameters is also included in the cost function.
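Written out, the training objective described above could take the following form. This is a sketch under assumptions: $N$ is the number of summaries in a mini-batch, $T_n$ the length of the n-th summary, $\theta$ the set of all trainable parameters, and the regulariser is assumed to be the squared $l_2$ norm weighted by $\lambda = 0.1$; the text does not specify the exact form of the penalty.

```latex
J(\theta) = -\sum_{n=1}^{N} \sum_{t=1}^{T_n}
    \log p\bigl(y^{(n)}_t \mid y^{(n)}_{<t}, F^{(n)}_E\bigr)
  + \lambda \lVert \theta \rVert_2^2,
\qquad \lambda = 0.1
```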
The networks converge after the 9th epoch in the Esperanto case and after the 11th in the Arabic case. During evaluation and testing, we perform beam search with a beam size of 20, and we retain only the summary with the highest probability. We found that increasing the beam size resulted not only in minor improvements in performance but also in a greater number of fully completed generated summaries (i.e. summaries for which the special end-of-summary <end> token is generated).
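The decoding procedure can be illustrated with a generic beam search. The sketch below is not tied to the authors' Torch code: `step_fn` is a hypothetical stand-in for the decoder that returns next-token probabilities for a prefix, the toy probability table is invented, and no length normalisation is applied. A hypothesis only counts as complete once `<end>` is produced; if none completes within `max_len` steps the function returns `None`.

```python
import math

def beam_search(step_fn, beam_size, max_len, start="<start>", end="<end>"):
    """Return the completed (tokens, log_prob) hypothesis with the highest probability."""
    beams = [([start], 0.0)]
    completed = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, prob in step_fn(prefix).items():
                if prob > 0.0:
                    candidates.append((prefix + [tok], score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            # Finished hypotheses leave the beam; the rest are expanded further.
            (completed if seq[-1] == end else beams).append((seq, score))
        if not beams:
            break
    return max(completed, key=lambda c: c[1]) if completed else None

def toy_step(prefix):
    """Invented next-token distribution that depends only on the last token."""
    table = {
        "<start>": {"Floridia": 0.9, "estas": 0.1},
        "Floridia": {"estas": 0.8, "<end>": 0.2},
        "estas": {"komunumo": 0.7, "<end>": 0.3},
        "komunumo": {"<end>": 1.0},
    }
    return table[prefix[-1]]

best = beam_search(toy_step, beam_size=2, max_len=5)
print(best[0])
```

With a larger beam, more hypotheses survive long enough to reach `<end>`, which is consistent with the observation above that wider beams yield more fully completed summaries.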

Dataset
In order to train our models to generate summaries from Wikidata triples, we introduce a new dataset for text generation from KB triples in a multilingual setting, in which each summary is aligned with the triples of its corresponding Wikidata item. For each Wikipedia article, we extract and tokenise the first introductory sentence and align it with triples where its corresponding item appears as a subject or an object in the Wikidata truthy dump. In order to create the surface form tuples (cf. Section 2.3), we identify occurrences of entities in the text along with their verbalisations. We rely on keyword matching against labels from Wikidata, expanded by the global language fallback chain introduced by Wikimedia², to overcome the lack of non-English labels in Wikidata. For the property placeholders, we use the distant supervision assumption for relation extraction (Mintz et al., 2009). Entities that participate in relations with the main entity of the article are replaced with their corresponding property placeholder tag. Table 3 shows statistics on the two corpora that we used for the training of our systems.
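The alignment step can be sketched as below. Several simplifications are assumed and should not be read as the authors' pipeline: labels are matched as single tokens, every matched entity is replaced (the paper restricts this to rare entities), and the fallback chain `["eo", "en"]` is an illustrative stand-in for the actual Wikimedia chains. The function names are hypothetical.

```python
def label_with_fallback(labels_by_lang, chain):
    """Return the first available label following a language fallback chain."""
    for lang in chain:
        if lang in labels_by_lang:
            return labels_by_lang[lang]
    return None

def annotate_sentence(tokens, main_entity, triples, labels):
    """Replace mentions of entities related to main_entity with [[property]] tags."""
    label_to_property = {}
    for s, p, o in triples:
        # The main entity may appear as either the subject or the object.
        if s == main_entity:
            other = o
        elif o == main_entity:
            other = s
        else:
            continue
        label = labels.get(other)
        if label is not None:
            label_to_property[label] = p
    return ["[[%s]]" % label_to_property[t] if t in label_to_property else t
            for t in tokens]

wikidata_labels = {
    "Q38": {"eo": "Italio", "en": "Italy"},
    "Q490900": {"en": "Floridia"},  # no Esperanto label: falls back to English
}
labels = {e: label_with_fallback(l, ["eo", "en"]) for e, l in wikidata_labels.items()}
sentence = ["Floridia", "estas", "komunumo", "de", "Italio", "."]  # illustrative
annotated = annotate_sentence(sentence, "Q490900", [("Q490900", "P17", "Q38")], labels)
print(" ".join(annotated))  # Floridia estas komunumo de [[P17]] .
```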

Baselines
To demonstrate the effectiveness of our approach, we compare it to two competitive systems.
KN is a 5-gram Kneser-Ney (KN) (Heafield et al., 2013) language model. KN has been used before as a baseline for text generation from structured data (Lebret et al., 2016) and provided competitive results on a single domain in English. We also introduce a second KN model (KN_ext), which is trained on summaries with the special tokens for copy actions. During test time, we use beam search of size 10 to sample from the learned language model.
² https://meta.wikimedia.org/wiki/Wikidata/Notes/Language_fallback
IR is an Information Retrieval (IR) baseline similar to those that have been used in other text generation tasks (Rush et al., 2015; Du et al., 2017). First, the baseline encodes the list of input triples using TF-IDF followed by LSA (Halko et al., 2011). For each item in the test set, we perform a K-nearest-neighbours search to retrieve the vector from the training set that is closest to this item and output its corresponding summary. As with the KN baseline, we provide two versions of this baseline: IR and IR_ext.
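The retrieval idea behind this baseline can be sketched in plain Python. Note the deliberate simplification: the LSA projection is omitted, so nearest neighbours are found directly in TF-IDF space with cosine similarity. Treating each property and entity identifier as a term, the smoothed IDF formula, the function names and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter

def fit_tfidf(docs):
    """Return TF-IDF vectors (as dicts) for docs, plus the learned IDF weights."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_summary(query_doc, train_vectors, idf, train_summaries):
    """Output the summary of the training item closest to the query (1-nearest neighbour)."""
    query = {t: tf * idf[t] for t, tf in Counter(query_doc).items() if t in idf}
    best = max(range(len(train_vectors)), key=lambda i: cosine(query, train_vectors[i]))
    return train_summaries[best]

# Each "document" is the flattened list of identifiers of one triple set (toy data).
train_docs = [["P17", "Q38", "P31", "Q747074"], ["P106", "Q33999", "P27", "Q30"]]
train_summaries = ["summary A", "summary B"]
train_vectors, idf = fit_tfidf(train_docs)
retrieved = retrieve_summary(["P17", "Q38", "P131", "Q1460"], train_vectors, idf, train_summaries)
print(retrieved)  # summary A
```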

Results and Discussion
We evaluate the generated summaries from our model and each of the baselines against their original counterparts from Wikipedia. Triple sets whose generated summaries are incomplete³ (i.e. summaries for which the special end-of-summary <end> token is not generated) are excluded from the evaluation. We use a set of evaluation metrics for text generation: BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014) and ROUGE_L (Lin, 2004). As displayed in Table 4, our model shows a significant improvement over our baselines across the majority of the evaluation metrics in both languages. We achieve an improvement of at least 5.25 and 1.31 BLEU 4 points in Arabic and Esperanto respectively over IR_ext, the strongest baseline. The introduction of the copy actions to our encoder-decoder architecture improves our performance further by 0.61 to 1.10 BLEU (using BLEU 4). In general, our copy actions mechanism benefits the performance of all the competitive systems.
³ Around ≤ 1% and 2% of the input validation and test triple sets in Arabic and Esperanto respectively led to the generation of summaries without the <end> token. We believe that this difference is explained by the limited size of the Esperanto dataset, which makes it harder for the trained models (i.e. with or without copy actions) to generalise on unseen data.
Table 4: Automatic evaluation of our model against all other baselines using BLEU 1-4, ROUGE and METEOR for both the Arabic and Esperanto validation and test sets.
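To make the evaluation protocol concrete, the sketch below drops hypotheses that never produced `<end>` and computes the clipped (modified) n-gram precision that BLEU is built from. It is not a full BLEU implementation (no brevity penalty, no geometric mean over n-gram orders, a single reference only), and the function names and toy sentences are illustrative.

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(hypothesis, reference, n):
    """Clipped n-gram precision of a hypothesis against a single reference."""
    hyp_counts, ref_counts = ngrams(hypothesis, n), ngrams(reference, n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return clipped / total

def complete_only(pairs, end="<end>"):
    """Keep (hypothesis, reference) pairs whose hypothesis ends with <end>, then strip it."""
    return [(hyp[:-1], ref) for hyp, ref in pairs if hyp and hyp[-1] == end]

pairs = [
    (["a", "b", "c", "<end>"], ["a", "b", "d"]),  # complete: evaluated
    (["a", "b"], ["a", "b"]),                      # incomplete: excluded
]
kept = complete_only(pairs)
p1 = modified_precision(kept[0][0], kept[0][1], 1)  # 2 of 3 unigrams match
p2 = modified_precision(kept[0][0], kept[0][1], 2)  # 1 of 2 bigrams match
print(len(kept), p1, p2)
```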
Generalisation Across Domains. To investigate how well different models can generalise across multiple domains, we categorise each generated summary into one of 50 categories according to the instance type of its main entity (e.g. village, company, football player). We examine the distribution of BLEU-4 scores per category to measure how well the model generalises across domains (Figure 2). We show that i) the high performance of our system is not skewed towards some domains at the expense of others, and that ii) our model generalises across domains better than any other baseline. Despite the fact that the Kneser-Ney template-based baseline (KN_ext) has exhibited competitive performance in a single-domain context (Lebret et al., 2016), it fails to generalise in our multi-domain text generation scenario.
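The per-category analysis amounts to grouping sentence-level scores by the instance type of the main entity and comparing their distributions. A minimal sketch follows; the function name is hypothetical and the scores are made-up toy values, not results from the paper.

```python
from collections import defaultdict
from statistics import mean, median

def scores_by_category(records):
    """Summarise (category, score) pairs as count, mean and median per category."""
    grouped = defaultdict(list)
    for category, score in records:
        grouped[category].append(score)
    return {
        category: {"n": len(scores), "mean": mean(scores), "median": median(scores)}
        for category, scores in grouped.items()
    }

# Made-up scores purely to exercise the function.
toy_records = [("village", 0.30), ("village", 0.50), ("company", 0.20)]
per_category = scores_by_category(toy_records)
for category, stats in sorted(per_category.items()):
    print(category, stats)
```

A model that generalises well should show similar distributions across categories rather than a few strong categories masking many weak ones.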

Conclusions
In this paper, we show that by adapting the encoder-decoder neural network architecture to the generation of summaries, we are able to overcome the challenges introduced by working with underserved languages. This is achieved by leveraging data from a structured knowledge base and careful data preparation in a multilingual fashion, both of which are of the utmost practical interest for our under-resourced task, which would otherwise have required a substantial amount of additional data. Our model was able to perform and generalise across domains better than a set of strong baselines.