Lexicon for Natural Language Generation in Spanish Adapted to Alternative and Augmentative Communication

In this paper we present Elsa, the first lexicon for Spanish with morphological, syntactic and semantic information automatically generated from a well-known pictogram resource and especially tailored for Augmentative and Alternative Communication (AAC). This lexicon, which focuses on an icon set widely used within AAC applications, is motivated by the need to improve Natural Language Generation (NLG) systems that aid people diagnosed with communication disorders. In addition, we design an automatic lexicon extension procedure that completes the linguistic data by means of a training process. For this we used a dataset composed of novels and tales in Spanish with pictogram representations, since the lexicon is meant for AAC applications for children with disabilities. Moreover, we provide the algorithms used to build our lexicon and a use case of Elsa within an NLG system to illustrate the usability of our proposal.


Introduction
According to the 2014 report of the State Database of Persons with Disabilities 1 , 14,456 Spanish people had expression problems, 72,088 had mixed disabilities and 45,818 had communication disorders (Doval, 2013). According to unofficial sources 2 , over 1% of children in Spain and Mexico are autistic (over 800,000 people) and require language aids.
1 Press release available Oct. 2016 at http://www.dependencia.imserso.es/Inter-Present2/groups/imserso/documents/binario/bdepcd_2014.pdf.
Our goal is to automatically create a Spanish vocabulary to be used in a Natural Language Generation (NLG) system applied to an AAC communicator, by merging different linguistic resources. Pictograms (used as input) act as a bridge between the lexicon and the NLG system, and help target users express themselves easily and quickly. Some previous AAC tools such as Talk Together or LetMe Talk 3 include small vocabulary packages with hand-coded knowledge, but none of them considers morphological, syntactic and semantic information when generating messages in Spanish. The literature contains some work on the manual and automatic merging of language resources (Hughes, Souter, & Atwell, 1995; Crouch & King, 2005; Molinero, Sagot & Nicolas, 2009), but the merging of Spanish resources that carry morphological, syntactic and semantic data has not been addressed so far. Moreover, combining existing resources seemed a promising approach towards our goal, given the grammatical difficulties of Spanish 4 and the fact that fewer resources are available for it than for English (Janssen, 2005).
The rest of this work is organised as follows. In Section 2 we review the existing linguistic resources for Spanish and the process conducted to build Elsa. In Section 3 we present an automatic lexicon extension procedure. Then, in Section 4 we conduct an evaluation of the created lexicon. In Section 5 we provide a use case of Elsa within an NLG system. Section 6 concludes the paper.

Reusing existing resources to build Elsa
The construction of Elsa begins with the selection of an available pictographic set for AAC users. After evaluating some possibilities 5 we chose the free and highly comprehensive Arasaac 6 set. Nonetheless, this dataset had to be preprocessed by removing redundant pictograms, that is, those that share the same meaning and word description and differ only in colour. By doing so, we obtained our icon set with 9,411 pictograms, of which 6,970 have a single associated word (including proper names) and the rest are verbal phrases or compound proper names. Once this step is finished, it is necessary to add POS tags, syntactic and semantic information to each Arasaac word entry. For this purpose, we looked for available Spanish linguistic resources. The selected resources include the Lexicon of Spanish inflected forms 9 (LEFFE) (Molinero, Sagot, & Nicolas, 2009), together with MCR and Adesse, which are referred to in the following sections.
We start the process by extracting from each resource information on the forms of the preprocessed icon set. Next, our approach follows the two well-defined steps (Crouch & King, 2005) between which we include a verification step: (1) we extract and map the form entries to a common format, adapted from the Lexical Markup Framework (LMF) format (Francopoulo, et al., 2006); (2) we verify them at a lexical level in the DRAE 10 ; and finally (3) we combine the entries once they have been found to be equivalent using the graph unification model (Necsulescu, Bel, Padró, Marimon & Revilla, 2011). This operation is based on set unions of compatible feature values, allowing the validation of common information, the addition of differential information and the exclusion of inconsistencies.
5 Some of them were: Pictographic Communication System (http://www.mayer-johnson.com/category/symbols-and-photos) and Pictogram (http://www.pictogram.se), which are not free; or Widgit (https://widgit.com) with no support for Spanish.
6 Created by the CATEDU, the Alborada Special Education Public School and Sergio Palao in 2008 under Creative Commons license. It contains over 16,000 pictograms with their associated words or sequence of words for Spanish, as well as multiple other languages. Available at http://www.catedu.es/arasaac.
The steps of extraction and mapping, verification and merging are explained in Algorithms 1, 2 and 3, respectively. Algorithm 4 shows the composition of the steps as explained in this Section.
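The merging idea described above (union of compatible feature values, validation of shared information, exclusion of inconsistencies) can be illustrated with a minimal Python sketch. It is not a reproduction of Algorithm 3: the function name `unify`, the flat dictionary representation and all feature names and values are assumptions made for illustration only.

```python
def unify(entry_a, entry_b):
    """Merge two lexical entries given as {feature: value} dictionaries.

    Assumed conventions (hypothetical, not taken from the paper):
    - set-valued features are merged by set union;
    - atomic features must be equal to be kept (validation);
    - a feature present in only one entry is added (differential info);
    - atomic features with different values are excluded and reported.
    """
    merged, conflicts = {}, []
    for feature in sorted(entry_a.keys() | entry_b.keys()):
        a, b = entry_a.get(feature), entry_b.get(feature)
        if a is None or b is None:
            # Differential information: only one resource provides it.
            merged[feature] = a if b is None else b
        elif isinstance(a, (set, frozenset)) and isinstance(b, (set, frozenset)):
            # Compatible set-valued features: take the union.
            merged[feature] = set(a) | set(b)
        elif a == b:
            # Common information confirmed by both resources.
            merged[feature] = a
        else:
            # Inconsistent atomic values: leave the feature out.
            conflicts.append(feature)
    return merged, conflicts


# Toy entries with invented values, standing in for two mapped resources.
resource_a = {
    "lemma": "desagradar",
    "pos": "verb",
    "forms": {"desagrada", "desagradó"},
    "transitivity": "intransitive",
}
resource_b = {
    "lemma": "desagradar",
    "pos": "verb",
    "forms": {"desagradar"},
    "roles": {"estímulo", "experimentador"},
    "transitivity": "transitive",
}

merged, conflicts = unify(resource_a, resource_b)
print(merged)
print(conflicts)
```

In this sketch a conflicting feature is dropped from the merged entry and returned separately, so that it could be inspected later; other policies (for example, preferring one resource over another) are equally conceivable.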

Automatic lexicon extension
Since we intend to use this lexicon within an NLG system adapted to AAC users, and in order to spare those users from having to select pictograms for prepositions, we need to infer a priori which specific preposition follows a verb. The training process was performed using a dataset composed of novels and nearly five hundred tales in Spanish (Andersen, 2016; Anonymous, 2016; Grimm, 2016), previously POS-tagged with the Freeling tagger 11 . We selected this dataset because we plan to use the lexicon in AAC applications for children with disabilities and these are the only contents with pictogram representation. In this way, we are able to include more options beyond those present in the subcategorization frames for verbs taken from LEFFE and Adesse, such as those related to figurative language approaches 12 . Since this grammatical realization is not present in the selected lexica, we developed a language model from a training process, considering bigrams and trigrams around verbs and using syntactic and semantic knowledge.
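As an illustration of the bigram and trigram counting around verbs, the following Python sketch collects, for each verb lemma, the prepositions found directly after it (bigram) or one token later (trigram). It is a simplified approximation and not the training procedure itself: the function names, the threshold `min_count` and the toy sentences are hypothetical, and the assumption that verb tags start with "V" and preposition tags with "SP" reflects EAGLES-style tags rather than anything stated in this paper. The syntactic and semantic knowledge mentioned above is not modelled.

```python
from collections import Counter, defaultdict


def count_verb_prepositions(tagged_sentences):
    """Count prepositions that follow each verb lemma.

    Each sentence is a list of (form, lemma, tag) triples. A preposition
    is counted if it is the next token (bigram) or the one after that
    (trigram) relative to the verb.
    """
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for i, (_form, lemma, tag) in enumerate(sentence):
            if not tag.startswith("V"):  # assumed verb tag prefix
                continue
            for _f, next_lemma, next_tag in sentence[i + 1:i + 3]:
                if next_tag.startswith("SP"):  # assumed preposition prefix
                    counts[lemma][next_lemma] += 1
                    break
    return counts


def most_likely_preposition(counts, verb_lemma, min_count=2):
    """Return the most frequent preposition for a verb, or None."""
    if not counts[verb_lemma]:
        return None
    preposition, freq = counts[verb_lemma].most_common(1)[0]
    return preposition if freq >= min_count else None


# Toy corpus with invented sentences, already tokenised and tagged.
corpus = [
    [("desagradó", "desagradar", "VMIS3S0"), ("a", "a", "SPS00"),
     ("el", "el", "DA0MS0"), ("profesor", "profesor", "NCMS000")],
    [("desagrada", "desagradar", "VMIP3S0"), ("mucho", "mucho", "RG"),
     ("a", "a", "SPS00"), ("María", "maría", "NP00000")],
    [("sueña", "soñar", "VMIP3S0"), ("con", "con", "SPS00"),
     ("dragones", "dragón", "NCMP000")],
]

counts = count_verb_prepositions(corpus)
print(most_likely_preposition(counts, "desagradar"))
print(most_likely_preposition(counts, "soñar", min_count=1))
```

The frequency threshold is one possible way to avoid adding prepositions seen only by chance; its value here is arbitrary.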

Experimental results
In order to evaluate the quality of Elsa, we first measured the coverage achieved after adding the information extracted from all resources. Table 1 shows the number of lemmas that were included in each resource. Table 2 shows the coverage of Elsa over the icon set. Our lexicon covered almost the entire icon set and most word entries include syntactic and semantic data essential to conduct the NLG process correctly. Moreover, Table 3 shows the number of lemmas and forms classified by categories. Most lemmas (3,165) are tagged as nouns, representing 7,035 inflected forms added to Elsa, whereas most forms (45,341) are tagged as verbs, representing 811 lemmas.
11 A library that provides multiple language analysis services, including probabilistic prediction of categories in unknown words (Atserias et al., 2006; Padró & Stanilovsky, 2012).
Figure 1 shows an example in adapted LMF format of the word entry desagradar 'displease' with morphological, syntactic and semantic data. Sem-Synset contains the semantic information from MCR. In addition to the conjugation in WordForm, SCF contains the syntactic information on the verb after combining the data extracted from LEFFE and Adesse. There is only one possible realization in the active form, where the subject is an estímulo 'stimulus' and the verb is followed by an indirect object 13 that is an experimentador 'experimenter'.
In SCF_training, a new preposition, a 'to' (not present in any of the selected resources), was inferred in order to use it within the subcategorization frame of the verb (before the indirect object oind). In addition, the Elsa entry desagradar 'displease' is linked to a pictogram image file from the icon set.
System output using Elsa: El tiempo desagradó al profesor ayer 'The weather displeased the teacher yesterday'.
System output without using Elsa: El tiempo desagradar el profesor ayer 'The weather displease the teacher yesterday' (where displease is the infinitive of the verb).
13 In Spanish an intransitive verb has no direct object, but it may be followed by an indirect object or other complements.
In this example, the system can determine that desagradar 'displease' is a verb, whose subject is tiempo 'weather', followed by an indirect object profesor 'teacher'. In addition, this verb needs the preposition a 'to' because profesor 'teacher' is a person. Besides, the presence of the adverb ayer 'yesterday' indicates that the tense is past. Without the linguistic information provided by Elsa, our system could infer neither the additional elements needed nor the correct morphological inflections related to the syntactic and semantic features.
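The contrast between the two outputs can be reproduced with a small Python sketch. It is only a toy approximation of the use case, not the NLG system described here: the entry layout, the key names (`forms`, `oind_preposition`), the function `realise` and the rule that maps the adverb ayer to the past tense are all assumptions made for illustration, and only third person singular forms are covered.

```python
# Hypothetical, heavily simplified lexicon entry for one verb.
DESAGRADAR = {
    "lemma": "desagradar",
    "forms": {("past", "3s"): "desagradó", ("present", "3s"): "desagrada"},
    "oind_preposition": "a",  # preposition inferred in the training step
}

PAST_ADVERBS = {"ayer"}  # adverbs assumed to signal past tense
CONTRACTIONS = {("a", "el"): "al", ("de", "el"): "del"}


def realise(subject, verb_entry, indirect_object, adverbs=()):
    """Build a sentence from token lists using the lexicon entry."""
    is_past = any(adverb in PAST_ADVERBS for adverb in adverbs)
    verb = verb_entry["forms"][("past" if is_past else "present", "3s")]
    tokens = (list(subject) + [verb, verb_entry["oind_preposition"]]
              + list(indirect_object) + list(adverbs))
    output = []
    for token in tokens:
        # Apply the Spanish contractions a + el -> al, de + el -> del.
        if output and (output[-1], token) in CONTRACTIONS:
            output[-1] = CONTRACTIONS[(output[-1], token)]
        else:
            output.append(token)
    sentence = " ".join(output)
    return sentence[0].upper() + sentence[1:]


sentence = realise(["el", "tiempo"], DESAGRADAR, ["el", "profesor"], ["ayer"])
print(sentence)  # El tiempo desagradó al profesor ayer
```

Without the entry, the only information available would be the lemma attached to the pictogram, which corresponds to the uninflected output shown above.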

Conclusions
Elsa is an approach to lexicon generation specially tailored to the needs of AAC applications. Besides including several types of linguistic information (morphology, syntax and semantics), a training process was executed to complete the subcategorization frames for verbs, such as those present in figurative language. The resulting lexicon may be useful for assisting people with communication disabilities through NLG systems. In order to increase efficiency and precision, additional linguistic resources can be easily integrated because the building process is automatic. To complete the semantic information, we propose to establish synonymy relations between the word entries to reuse their semantic classification and fill in the missing information.