Personalized Questions, Answers and Grammars: Aiding the Search for Relevant Web Information

This work proposes an organization of knowledge to facilitate the generation of personalized questions, answers and grammars from web documents. To reduce the human effort needed to generate the linguistic resources for a new domain, the general aspects that can be reused across domains are separated from the more specific ones. The proposed approach is based on the representation of the main domain concepts as a set of attributes. These attributes are related to a syntactico-semantic taxonomy representing the general relationships between conceptual and linguistic knowledge. User models are incorporated by distinguishing different user groups and relating each group to the appropriate conceptual attributes. Then, the data is extracted from the web documents and represented as instances of the domain concepts. Questions, answers and grammars are generated from these instances.


Introduction
The large amount of data and services available on the web has increased the need for tools that assist different types of users when looking for information. The approach described in this paper guides the user's search by providing the most relevant data in a particular domain as a set of questions and their corresponding answers.
Presenting the main questions (and their answers) could be valuable in different types of scenarios, especially when the information to be searched is voluminous and/or when the user is looking for relevant data that has to be understood precisely. For example, including the most relevant questions and answers in the web description of academic courses could be useful for students, as described in the next sections.
The generation of personalized questions for a specific domain involves reasoning skills as well as domain and linguistic knowledge. To reduce the human effort needed in this process, this work proposes a general organization of the conceptual and linguistic knowledge involved, thus limiting the specific data that has to be incorporated for a new domain. In this proposal, the main domain concepts are described by a set of attributes, and those attributes are related to user models and to a syntactico-semantic taxonomy, which represents the general relationships between conceptual and linguistic knowledge. This taxonomy, described in a previous work (Gatius, 2013), was defined following (Bateman et al., 1994). It is used for generating questions, answers and grammars, and also for extracting data from the web.
This work focuses on the generation of questions and answers from (semi)structured web documents describing particular cases of general concepts (e.g., university courses and types of food). Information from these documents can be automatically extracted and represented as instances of the general concepts, which have been previously described by the expert. Questions, answers and grammars can be automatically generated from the resulting instances.
The next section gives an overview of the proposed approach together with its adaptation to several languages (English, Catalan and Spanish) for two different domains: university courses and cultural events. Section 3 describes the implementation and evaluation. Finally, related work and discussion are given in the last section.

Approach Overview
The approach proposed to generate personalized questions, answers and grammars from web documents in a particular domain is based on a separate representation of the different types of knowledge involved. It consists of the following five steps:
1. Representing the most relevant domain concepts as a set of attributes.
2. Relating the attributes to the taxonomy.
3. Relating the attributes to the user groups.
4. Extracting the web data.
5. Generating the personalized questions and answers (or grammars).
The first three steps, studied in a previous work (Gatius, 2015), have to be done by a human expert. First, the main concepts have to be defined by a set of attributes and these attributes by facets, which describe details such as the cardinality.
In a second step, the conceptual attributes have to be associated with the classes in the syntactico-semantic taxonomy defined in previous work. These classes are associated with the linguistic structures involved in the questions and answers about the conceptual attributes. They can be easily adapted to new languages.
Figure 1 shows a partial description of the concepts involved in the academic scenario: Course and Exam. The course description is represented by a set of attributes. Each of the course exams is represented by an attribute whose value is an instance of the concept Exam. As can be seen in the figure, the conceptual attributes have been associated with the corresponding syntactico-semantic classes. For example, the attributes code and content are related to the class of, which corresponds to general descriptions, and other attributes are related to its subclasses, which describe more precise information: of_quantity (for the credits, assessment and weight), of_time, of_date and of_place.
User models can also be incorporated by classifying users into groups and associating each conceptual attribute with the groups interested in it. In several scenarios, stereotypes could also be related to different values of the attribute. For example, in the nutrition domain, the values of the attributes describing the number of calories and the physical activity needed daily are different for each group (women, men and children).
In the academic scenario, two user groups can be distinguished: students and teachers. The attributes describing the course are considered relevant for both groups, except for the attribute code, which is of interest only to teachers.
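The first three steps can be pictured with a small data model. The following Python sketch is only illustrative and is not the representation used in this work: the class and field names (`Attribute`, `Concept`, `attributes_for`), the value types and the choice of attributes shown for each concept are assumptions, since Figure 1 is only partially described in the text.

```python
from dataclasses import dataclass

BOTH_GROUPS = frozenset({"student", "teacher"})


@dataclass(frozen=True)
class Attribute:
    name: str
    taxonomy_class: str        # class in the syntactico-semantic taxonomy
    value_type: str            # facet: type of the value (assumed labels)
    cardinality: int = 1       # facet: maximum number of values
    related_terms: tuple = ()  # facet: terms that signal the attribute in text
    user_groups: frozenset = BOTH_GROUPS  # groups interested in the attribute


@dataclass(frozen=True)
class Concept:
    name: str
    attributes: tuple

    def attributes_for(self, group):
        """Return the attributes that are relevant for one user group."""
        return [a for a in self.attributes if group in a.user_groups]


# Hypothetical partial description of the two concepts in the academic scenario.
COURSE = Concept("Course", (
    Attribute("code", "of", "string", user_groups=frozenset({"teacher"})),
    Attribute("content", "of", "string"),
    Attribute("credits", "of_quantity", "number"),
))
EXAM = Concept("Exam", (
    Attribute("date", "of_date", "date", related_terms=("date", "day")),
    Attribute("time", "of_time", "time", related_terms=("time", "hour")),
    Attribute("place", "of_place", "string", related_terms=("room",)),
))

student_view = [a.name for a in COURSE.attributes_for("student")]
teacher_view = [a.name for a in COURSE.attributes_for("teacher")]
```

Here `student_view` omits code while `teacher_view` keeps it, mirroring the distinction between the two user groups described above.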
The last two steps can be done automatically, although human supervision of the data obtained is needed. First, the appropriate data is extracted from the selected web documents and represented as instances of the domain concepts. The values of the conceptual attributes are obtained using general rules defined for that purpose. Finally, the questions and answers (or grammars) are generated from the conceptual instances.

Personalized Questions and Answers
Most university web sites include clear and detailed descriptions of their courses. However, students frequently ask teachers about this information, especially about the exams. For this reason, including relevant questions and their answers in the course description could help.
The extraction of the data in this domain requires limited human effort, because the descriptions of university courses usually include similar content and are presented in a (semi)structured form. Several web documents from different faculties in the same university have been analyzed.
The web descriptions of the courses analyzed are placed in separate documents with different formats: the particular data related to the exams (date, time and room) is presented in tables, while more general information is in textual form.
The data related to a particular course is extracted from the web documents and represented as instances of the concepts Course and Exam. For this purpose, domain independent rules that use the facets describing the attributes (type, related terms and cardinality) are used. The data extracted is represented as the values of the instance attributes.
The general rule for obtaining the value of an attribute from a textual document is: "If the attribute related terms (or synonyms) are found, then extract the context words that correspond to the type of the attribute".
In this rule, context is a variable that indicates the maximum number of words before or after the attribute terms that have to be considered. Its value is obtained by analyzing the domain documents.
A condition has been added to this general rule so that all possible values of an attribute are extracted, according to its cardinality.
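The general rule for textual documents, including the context window and the cardinality condition, could be sketched in Python as follows. This is a minimal illustration and not the C implementation described later: the function name `extract_values`, the regular expressions in `TYPE_PATTERNS`, the restriction to single-word related terms and the sample sentences are all assumptions.

```python
import re

# Assumed recognizers for attribute types; a real system would need richer ones.
TYPE_PATTERNS = {
    "date": re.compile(r"\d{1,2}/\d{1,2}/\d{4}"),
    "time": re.compile(r"\d{1,2}:\d{2}"),
    "number": re.compile(r"\d+(\.\d+)?"),
}


def extract_values(text, related_terms, value_type, context=5, cardinality=1):
    """Extract up to `cardinality` values of type `value_type` that occur
    within `context` words before or after one of the related terms."""
    tokens = [word.strip(".,;:()") for word in text.split()]
    terms = {term.lower() for term in related_terms}
    pattern = TYPE_PATTERNS[value_type]
    values = []
    for i, token in enumerate(tokens):
        if token.lower() not in terms:
            continue
        start, end = max(0, i - context), min(len(tokens), i + context + 1)
        for j in range(start, end):
            candidate = tokens[j]
            if j != i and pattern.fullmatch(candidate) and candidate not in values:
                values.append(candidate)
                if len(values) == cardinality:
                    return values
    return values


# Invented sample text, for illustration only.
sample = "Final exam date: 12/01/2024. Time: 10:30."
exam_date = extract_values(sample, ["date", "day"], "date", context=3)
exam_time = extract_values(sample, ["time", "hour"], "time", context=3)
both_dates = extract_values("Exam dates: 12/01/2024 and 20/06/2024.",
                            ["dates"], "date", context=3, cardinality=2)
```

With these inputs, `exam_date` is `["12/01/2024"]`, `exam_time` is `["10:30"]` and `both_dates` contains the two dates in order of appearance. Choosing `context` too large would let a term pick up values that belong to a neighbouring attribute of the same type, which is why its value has to be tuned on the domain documents.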
Using this rule, the data describing the final exam of a particular course is obtained from the document giving general data and represented as the instance shown in Figure 2 (which belongs to the concept Exam in Figure 1). For the data presented in tables, the rule is: "If one of the course identifiers is found in a row, then extract the next words in the row that correspond to the type of the attribute". Figure 3 shows examples of the questions and answers generated from the instance Final Exam in Figure 2.
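As a rough illustration of how questions and answers could be derived from such an instance, the Python sketch below selects a question/answer pattern according to the taxonomy class of each attribute and keeps only the attributes relevant to the given user group. It is not the Prolog generator used in this work: the templates, the function name `generate_qa` and the attribute values of the instance are invented for the example.

```python
# Hypothetical English patterns, one pair per syntactico-semantic class.
TEMPLATES = {
    "of_date": ("When is the {concept}?", "The {concept} is on {value}."),
    "of_time": ("At what time is the {concept}?", "The {concept} is at {value}."),
    "of_place": ("Where is the {concept}?", "The {concept} is in {value}."),
}


def generate_qa(concept, instance, classes, groups, user_group):
    """Return (question, answer) pairs for the attributes of `instance`
    that are relevant to `user_group`."""
    pairs = []
    for attribute, value in instance.items():
        if user_group not in groups.get(attribute, ()):
            continue  # attribute not associated with this user group
        question, answer = TEMPLATES[classes[attribute]]
        pairs.append((question.format(concept=concept),
                      answer.format(concept=concept, value=value)))
    return pairs


# Invented instance values, for illustration only.
FINAL_EXAM = {"date": "12/01/2024", "time": "10:30", "place": "room A5"}
CLASSES = {"date": "of_date", "time": "of_time", "place": "of_place"}
GROUPS = {name: {"student", "teacher"} for name in FINAL_EXAM}

qa_pairs = generate_qa("final exam", FINAL_EXAM, CLASSES, GROUPS, "student")
```

The first pair in `qa_pairs` is ("When is the final exam?", "The final exam is on 12/01/2024."). Supporting another language would amount to adding a second set of patterns keyed by the same taxonomy classes, which is the kind of reuse the taxonomy is meant to allow.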

Generating Personalized Grammars
Language interfaces have also been used to assist the user when accessing the web. They can incorporate domain-restricted grammars that guide the user on the contents and on the terms to be used to build the query, as can be seen in Figure 4. Those semantic grammars can also be generated following the proposed approach. The processing of the resulting query is simple, because the language to be considered is limited; for example, the user describes time by selecting one of the forms on the screen.
Figure 4 shows an example of how the user is guided to build a query to a web service giving information about the cultural events in a particular city. In this scenario, the grammar used has been generated from the concept Event, described by a set of attributes that correspond to the parameters of the service: title, type, place, time and audience. The same concept could be adapted to many of the web services about cultural activities. Two different user groups are distinguished according to the value of the attribute audience: those interested in activities for adults and those interested in activities for children.
The interface shown in Figure 4 has been generated with the Grammatical Framework (GF, www.grammaticalframework.org) from a grammar written in the GF formalism, although other formalisms and environments could also be used.
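To make the idea of deriving a domain-restricted grammar from a concept more concrete, the following Python sketch builds a toy grammar in a BNF-like notation from the attributes of a concept and their admissible values. It is deliberately not written in the GF formalism and does not reproduce the grammar behind Figure 4: the rule shapes, the function names and the attribute values ("concert", "exhibition") are hypothetical.

```python
def build_grammar(concept, attribute_values):
    """Build BNF-like rules for simple queries about `concept`.
    Each rule maps a nonterminal to a list of alternatives."""
    rules = {
        "Query": [["'show'", f"'{concept.lower()}s'", "Filter"]],
        "Filter": [[name.capitalize()] for name in attribute_values],
    }
    for name, values in attribute_values.items():
        rules[name.capitalize()] = [[f"'{value}'"] for value in values]
    return rules


def personalize(attribute_values, audience):
    """Restrict the audience attribute to the value of one user group."""
    restricted = dict(attribute_values)
    restricted["audience"] = [audience]
    return restricted


def render(rules):
    """Render the rules as one 'Lhs ::= alt | alt' string per nonterminal."""
    return [f"{lhs} ::= " + " | ".join(" ".join(alt) for alt in alternatives)
            for lhs, alternatives in rules.items()]


# Invented values for two of the attributes of the concept Event.
EVENT_VALUES = {"type": ["concert", "exhibition"],
                "audience": ["adults", "children"]}

children_grammar = render(build_grammar("Event", personalize(EVENT_VALUES, "children")))
```

For the children group, `children_grammar` contains the rules `Query ::= 'show' 'events' Filter`, `Filter ::= Type | Audience`, `Type ::= 'concert' | 'exhibition'` and `Audience ::= 'children'`. Because every terminal comes from the concept description, the set of queries the interface can offer is known in advance, which is what keeps the processing of the user's query simple.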

Implementation and Evaluation
The web documents are first analyzed and classified into two groups according to their structure. Domain- and language-independent rules to extract the relevant data from these two types of documents have been defined and implemented in C.
To automatically generate the personalized linguistic resources from the domain concepts, a Prolog program has also been developed. Prolog is an appropriate language because its unification mechanism facilitates the association of general conceptual categories with features indicating additional information: stereotype, language and syntactic details (such as gender, number and tense).
The questions and answers related to the exams of a particular course on introduction to programming were generated in three languages (English, Catalan and Spanish) and included in the course web page. In order to evaluate their usability, the students of two different degrees were asked to complete, anonymously, an online questionnaire included in the same web page. There were 26 students in Group 1, enrolled in the Bachelor's Degree in Aerospace Vehicle Engineering, and 27 in Group 2, enrolled in the Bachelor's Degree in Industrial Technology Engineering. Table 1 shows the questions and their results; ratings range from 0 (strongly disagree) to 10 (strongly agree). The results indicate that students consider the generated questions and answers useful: 8.4 out of 10 in Group 1 and 8.12 in Group 2. Similar results were obtained from students in the same degrees in an informal evaluation carried out the previous semester.

Related Work and Discussion
The generation of questions and answers has been the focus of much research in different areas, such as educational systems (Wyse and Piwek, 2009) and conversational systems (Varges et al., 2006; Okoye et al., 2011).
Different techniques can be used for generation, based on rules (Mazidi and Tarau, 2009) and/or statistical methods (Jin and Le, 2016). Those techniques can be adapted to textual documents and/or to structured data (Duma and Klein, 2013). In the first case, the generation process is usually done by applying rules to the trees obtained from the syntactic analysis (Nouri et al., 2011), although there are also works that use the resulting semantic structure (Kuyten et al., 2012) and others that use both (Heilman, 2011).
Generation from structured data has been studied for years in language interfaces, which usually obtain the system inquiries and responses from application specifications and domain-restricted bases. Domain knowledge representations have been incorporated into a considerable number of relevant dialogue systems (Guzzoni et al., 2006; Sonntag et al., 2007), because they facilitate the adaptation of knowledge to different domains, languages, user types and modes of communication. Additionally, they provide synonyms, hyponyms and hypernyms to improve the query.
There is an increasing interest in the combination of language and user modeling techniques to obtain personalized linguistic resources (Brusilovsky and Millán, 2007; Milosavljevic and Oberlander, 1998; Stock et al., 2007; Han et al., 2014).
This article describes an approach to guide the user through web contents, based on the generation of personalized questions, answers and grammars from web documents. These resources could be useful in different scenarios, as indicated by the students' opinions on the questions generated about course exams (shown in Table 1).
This work proposes an organization of the different types of knowledge involved (conceptual, linguistic and about the user) that minimizes the human effort needed for a new domain and/or a new language, by separating the general facts that can be reused across domains from the more specific ones. The representation is based on relating the set of attributes describing the main domain concepts to the user models and to a taxonomy representing general relationships between conceptual and linguistic knowledge. The linguistic information associated with the taxonomy classes is used both for generating questions, answers and grammars and for extracting data from the web.
Future work could include the study of the adaptation of the set of rules developed to extract the data from the web documents to new domains.