SemEval-2017 Task 11: End-User Development using Natural Language

This task proposes a challenge to support the interaction between users and applications, micro-services and software APIs using natural language. The task aims to support the evaluation and evolution of the discussions surrounding natural language processing approaches within the context of end-user natural language programming, under scenarios with a large semantic gap between user vocabulary and software interfaces.


Introduction
The specific syntax of traditional programming languages and the user effort associated with finding, understanding and integrating multiple interfaces within a software development task define the intrinsic complexity of programming. Despite the widespread demand for automating actions within a digital environment, even basic software development tasks require previous (usually extensive) software development expertise. Domain experts processing data, analysts automating recurrent tasks, or a businessman testing an idea on the web depend on the mediation of programmers to materialise their demands, regardless of the simplicity of the task to be addressed and of the availability of existing services and libraries.
Recent advances in natural language processing bring the opportunity of improving the interaction between users and software artefacts, supporting users in programming tasks using natural language-based communication. This ability to match users' action intents and information needs to formal actions within an application programming interface (API), using the semantics of natural language as the mediation layer between both, can drastically impact the accessibility of software development. Although some software development tasks with stricter requirements will always depend on the precise semantic definition of programming languages, there is a vast spectrum of applications with softer formalisation requirements. This subset of applications can be defined and built with the help of natural language descriptions.
This SemEval task aims to advance the state-of-the-art discussions and techniques concerning the semantic interpretation of natural language commands and user action intents, bridging the semantic gap between users and software artefacts. The practical relevance of the challenge lies in the fact that addressing this task improves the accessibility of programming (meaning a systematic specification of computational operations) for a large spectrum of users who need to increase automation within specific tasks. Moreover, with the growing availability of software artefacts, such as APIs and services, there is a higher demand to support the discoverability of these resources, i.e. devising principled semantic interpretation approaches to semantically match interface descriptions with the intent of users.
The proposed task also intersects with demands from the field of robotics, as part of the human-robot interaction area, which depends on a systematic ability to address user commands that lie beyond navigational tasks.
From the point of view of computational linguistics, this challenge aims to catalyse the discussions in the following dimensions:
• Semantic parsing of natural language commands;
• Semantic representation of software interfaces;
• Statistical and ontology-based semantic matching techniques;
• Compositional models for natural language command interpretation (NLCI);
• Machine learning models for NLCI;
• API/Service composition and associated planning techniques;
• Linguistic aspects of user action intents.

Commands & Programming in Natural Language
The use of natural language to instruct robots and computational systems in general has been an active research area since the 1970s and 1980s (Maas and Suppes, 1985; Guida and Tasso, 1982) (and references therein). Initiatives span a large spectrum of application domains, including operating system functions (Manaris and Dominick, 1993), web services choreography (Englmeier et al., 2006), mobile programming by voice (Azaria et al., 2016), domain-specific natural programming languages (Pane and Myers, 2006), industrial robots (Stenmark and Nugues, 2013) and home care assistants. The variability of domains translates into a large number of research communities with different foci, described by distinct terms such as natural language interfaces, end-user development, natural programming, programming by example and trigger-action development. Some of these terms embrace wide domains, also including non-verbal (visual) approaches.

Semantic Parsing & Matching
The interpretation of natural language commands is typically associated with the task of parsing the natural language input to an internal representation of the target system. This internal representation is usually associated with an n-ary predicate-argument structure which represents the interface for an action within the system. The identification of which action the command refers to and its potential parameters are at the centre of this task.
Take as an example the natural language command: "Please convert US$ 475 to the Japanese currency and send this value to John Smith by SMS."
We can conceptualise the challenges involved in the command interpretation process in three dimensions: command chunking, term type identification and semantic matching. The chunking dimension comprises the identification of terms and segments in the original sentence that can potentially map to the system actions and parameters. The example command embodies two actions: converting currency and sending SMS. For the first action, the command interpreter needs to identify the currencies involved in the transaction and the financial amount (term type identification).
Other semantic interpretation processes might be involved. In the case of the second action, besides identifying John Smith as the message's receiver, the interpreter also needs to resolve the co-reference of "this value" to the currency conversion result and instantiate it as a parameter in the content of the message. This first level of interpretation generates an intermediate output that pairs each identified action with the text chunks that are candidates for its parameters.
The matching process corresponds to the mapping between terms from the user vocabulary and the terms used in the internal representation of the system (the API). In the given example, the system should find an action that can convert currencies and another that can send SMS messages.
In the example, depending on the parametrisation of the command interface, the value [US$ 475] needs to be split into two parameters, and these parts need to be mapped to the internal vocabulary of the system (US$ needs to be interpreted as USD, while Japanese currency needs to be translated to JPY). For the second action, similarly, John Smith will be used to retrieve a phone number from the user's personal data source.
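The splitting and vocabulary mapping described above can be illustrated with a minimal sketch. The lookup tables, the `ConvertCurrency` structure and the function names are assumptions made for this illustration; they are not part of the Action KB or of any participating system.

```python
import re
from dataclasses import dataclass

# Hypothetical lookup tables; a real interpreter would derive this
# vocabulary from the API documentation or an external resource.
CURRENCY_SYMBOLS = {"US$": "USD", "€": "EUR", "¥": "JPY"}
CURRENCY_DESCRIPTIONS = {"japanese currency": "JPY", "euro": "EUR"}


@dataclass
class ConvertCurrency:
    """Hypothetical parameter frame for a currency conversion action."""
    from_currency: str
    to_currency: str
    amount: float


def split_money(chunk: str) -> tuple[str, float]:
    """Split a chunk such as 'US$ 475' into (currency code, amount)."""
    match = re.fullmatch(r"\s*(\D+?)\s*([\d.,]+)\s*", chunk)
    if match is None:
        raise ValueError(f"not a monetary chunk: {chunk!r}")
    symbol, number = match.group(1), match.group(2)
    return CURRENCY_SYMBOLS[symbol], float(number.replace(",", ""))


def normalise_target(chunk: str) -> str:
    """Map a description such as 'the Japanese currency' to a currency code."""
    text = chunk.lower().removeprefix("the ").strip()
    return CURRENCY_DESCRIPTIONS[text]


code, amount = split_money("US$ 475")
call = ConvertCurrency(code, normalise_target("the Japanese currency"), amount)
print(call)
```

A table lookup is used here only to keep the sketch short; in an open-domain setting this step is exactly where the vocabulary gap has to be bridged by a semantic matching model.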
The final execution command is the result of the matching process, as shown below:
from: "USD"
to: "JPY"
from amount: 475
The task can be addressed using different semantic interpretation abstractions: shallow parsing, lambda-calculus-based semantic parsing (Artzi et al., 2014), compositional-distributional models (Freitas and Curry, 2014; Freitas, 2015) and information retrieval approaches (Sales et al., 2016). Additionally, pre-processing techniques such as clausal disembedding (Niklaus et al., 2016) and co-reference resolution are central components within the task.
While approaches and test collections emphasising the shallow parsing aspect of the problem are more present in the literature (Section 3), others focusing on a semantic matching process involving a broader vocabulary gap (Furnas et al., 1987) are less prevalent. Part of this can be explained by the domain-specific nature of previous works (e.g. focus on spatial commands (Dukes, 2014)).
In contrast, this task emphasises the creation of a test collection targeting an open domain scenario, with a large-scale set of target actions, assessing the ability of command interpretation approaches to address a larger vocabulary gap. This scenario aims to instantiate a real use case for end-user natural language programming, since the action knowledge base used in the test collection maps to real-world APIs and so a semantic interpreter developed over this test collection can become a concrete end-user programming environment.

Similar Initiatives
Most of the applications related to the parsing of natural language commands are within the context of human-robot interaction. The Human Robot Interaction Corpus (HuRIC) describes a list of spoken commands between humans and robots. It is composed of three datasets which were developed in the context of three different events. They are annotated using Frame Semantics together with Holistic Spatial Semantics (Bastianelli et al., 2014). Artzi et al. (2014) and Tellex et al. (2014) focus more narrowly on the interpretation of spatial elements. In both cases, the vocabulary variability is more constrained. Similar vocabulary variability assumptions are present in Thomason et al. (2015) and Azaria et al. (2016).
In 2014, SemEval hosted a task related to the parsing of natural language spatial commands (Dukes, 2014), also targeting a robotics scenario. More specifically, the task proposed the parsing of commands to move a robot arm that moved objects within a spatial region.
The proposed task can be contrasted with these previous initiatives in the following dimensions: (i) more comprehensive knowledge base of actions, (ii) generic (open domain) user programming scenarios and (iii) exploration of the interaction between actions and user personal information (Section 4).
The work most similar to this test collection is the problem defined by Quirk et al. (2015) over the ifttt.com platform, which targets the creation of an if-then recipe from a natural language description provided by the user. The first difference between the two tasks is that, while the program structure is limited to if-then recipes in Quirk et al., other more complex structures are supported in this task. Secondly, in the case of Quirk et al., the task requires only the mapping of the actions that comprise the recipe, leaving aside the instantiation of the parameter values, while our proposed task emphasises both. Finally, the presence of these two characteristics introduces the challenge of resolving co-references and metonymy within the task.

Task Definition
The task comprises 210 scenarios which consist of a total of 438 natural language commands. Figures 1 and 2 depict an overview of the task. A scenario is a set of sentences that defines a program in natural language. The excerpt below shows an example of a scenario:
"When a message from Enrico Hernandez arrives, get the necklace price; Convert it from Chilean Pesos to Euro; If it costs less than 100 EUR, send to him a message asking him to buy it; If not, write saying I am not interested."
Associated with each scenario, there is a program which is composed of actions from the Action Knowledge Base (Action KB). In addition to the actions, the program also uses If and Foreach constructors, which have the same semantics commonly found in programming languages and define the execution flow.
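To make the relation between a scenario and its program more concrete, the sketch below models a program as a list of actions combined with If and Foreach constructors. The class layout, the action names and the condition syntax are hypothetical and chosen only for illustration; the test collection defines its own format.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class Action:
    name: str                       # hypothetical Action KB identifier
    params: dict = field(default_factory=dict)


@dataclass
class If:
    condition: str
    then: list["Node"]
    otherwise: list["Node"] = field(default_factory=list)


@dataclass
class Foreach:
    collection: str
    body: list["Node"]


Node = Union[Action, If, Foreach]


def action_names(program: list[Node]) -> list[str]:
    """List the actions of a program in textual order, descending into constructors."""
    names: list[str] = []
    for node in program:
        if isinstance(node, Action):
            names.append(node.name)
        elif isinstance(node, If):
            names += action_names(node.then) + action_names(node.otherwise)
        else:  # Foreach
            names += action_names(node.body)
    return names


# One possible (assumed) encoding of the necklace scenario quoted above.
program = [
    Action("on_message_received", {"from": "Enrico Hernandez"}),
    Action("get_price", {"item": "necklace"}),
    Action("convert_currency", {"from": "CLP", "to": "EUR", "amount": "<return2>"}),
    If(
        condition="<return3> < 100",
        then=[Action("send_message", {"to": "Enrico Hernandez", "text": "Please buy it"})],
        otherwise=[Action("send_message", {"to": "Enrico Hernandez", "text": "I am not interested"})],
    ),
]
print(action_names(program))
```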
Like a programming language function, an action can have input parameters and return values. Table 1 shows examples of natural language commands describing scenarios.
Table 1: Natural language scenario commands
• If a receive a deposit from John Sanders in my bank account, send this message to him: "Hello John, thanks for your gift, I receive your deposit of some money to me, thanks a lot, buddy."
• Send an email to Mark asking him for the picture we took in Munich. When I receive the answer, get the attached image and publish it on my Flickr account with the tags #munich, #germany, #my-love
• Find "Bachianas N.5 of Villa-Lobos" on Youtube. Get the link and send to my mum.
• List tweets containing #ChampionsLeague.
The values of the parameters map to constants (e.g. integer numbers, string values) or to tags, which represent data returned by previously executed actions. There are two types of tags.
• <returnX> The return tag represents the content returned by the action X, where X is a sequential identifier.
• <item> The item tag is used only in the context of Foreach constructors. It represents an iterated item.
Both types of tags have some additional naming assumptions in order to simplify the syntax of the generated program. Examples of valid tags are:
• <return1>: the data returned by the first action in the scenario.
• <item>.url: the attribute url of the item.
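The role of the two tag types can be sketched as a small resolution function that replaces a tag with the data it stands for. The regular expressions and the dictionary-based storage of return values are assumptions of this sketch; only the tag forms `<returnX>` and `<item>.attribute` come from the description above.

```python
import re
from typing import Any

RETURN_TAG = re.compile(r"<return(\d+)>")
ITEM_TAG = re.compile(r"<item>(?:\.(\w+))?")


def resolve(param: Any, returns: dict[int, Any], item: Any = None) -> Any:
    """Replace a tag by the data it refers to; constants pass through unchanged.

    returns maps the sequential identifier X of an executed action to its
    result, and item is the element currently iterated by a Foreach.
    """
    if not isinstance(param, str):
        return param
    match = RETURN_TAG.fullmatch(param)
    if match:
        return returns[int(match.group(1))]
    match = ITEM_TAG.fullmatch(param)
    if match:
        attribute = match.group(1)
        return item if attribute is None else item[attribute]
    return param  # an ordinary string constant


returns = {1: "42 EUR"}
print(resolve("<return1>", returns))
print(resolve("<item>.url", returns, item={"url": "https://example.org/photo"}))
print(resolve(100, returns))
```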
In addition to the scenarios, the test collection consists of:
• Action KB: The set of available API functions along with their respective documentation. The information describing the API functions does not follow a strict pattern. While some documentation has rich natural language descriptions or shows usage examples, other documentation is succinct and contains just the frame and parameter names. The same occurs concerning data format, data type and returned data. This test collection reflects the variability and heterogeneity that we find in real-world APIs.
• User KB: A personal user information dataset, which is necessary to make commands more natural by supporting co-reference resolution. It allows commands like "Call John", since the system can identify the proper phone number from the User KB.

Annotation
The scenarios containing the natural language commands were created using high-level task descriptions. These high-level task descriptions were sent to a crowdsourcing platform (CrowdFlower), in which workers were requested to express in natural language the commands which entail the scenario descriptions. Motivated by those scenario descriptions, the users proposed a set of commands which addresses the specification. The excerpt below shows an example of a scenario description:
"You are arranging a meeting with some people in Andre's office. Adamantios is coming for that meeting, but he does not know how to drive in Passau. Additionally, you do not know where the office is."
One possible output for that description is:
• Ask Andre for the address of his office;
• Make a map from the university to it;
• Send the map to Adamantios including driving directions.
For each scenario description, on average ten workers were invited to suggest the natural language commands. The crowdsourcing process was followed by a data curation process which discarded 70% of the commands due to low quality. The remaining commands were reviewed to correct misspellings and adjusted to comply with the task requirements while preserving the original syntactic structure and vocabulary.

Analysis of The Task Complexity
The task aims to explore vocabulary and syntactic structure variation within the natural language commands. It also targets the orchestration of different natural language processing techniques, including syntactic parsing, semantic role labelling, fine-grained semantic approximation and co-reference resolution.

Semantic approximation
Different actions and parameters can be expressed using distinct lexicalizations (synonymy) and abstraction levels. For example: "If someone reports a problem in GitHub, send the problem's headline by Skype to John." In this example, the action in the knowledge base is expressed as "any new issue", while the intended "headline" in the returned value is expressed as "Issue Title". Given the context, the system is expected to identify the equivalence between the pairs of terms (problem, issue) and (headline, title).
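A very simple way to picture this kind of semantic approximation is to expand the command vocabulary with related terms before comparing it with the action descriptions. The sketch below uses a hand-written synonym table containing only the two term pairs mentioned above and a Jaccard overlap score; both are assumptions made for illustration, and the action identifiers are hypothetical. A realistic system would replace the table with a distributional or ontology-based relatedness measure.

```python
# Hypothetical, hand-written synonym table (only the pairs named in the text).
SYNONYMS = {
    "problem": {"issue"},
    "headline": {"title"},
}


def expand(tokens: list[str]) -> set[str]:
    """Add known synonyms to the set of command tokens."""
    expanded = set(tokens)
    for token in tokens:
        expanded |= SYNONYMS.get(token, set())
    return expanded


def score(command: str, description: str) -> float:
    """Jaccard overlap between the expanded command and an action description."""
    command_terms = expand(command.lower().split())
    description_terms = set(description.lower().split())
    union = command_terms | description_terms
    return len(command_terms & description_terms) / len(union) if union else 0.0


def best_action(command: str, action_kb: dict[str, str]) -> str:
    """Return the identifier of the action whose description scores highest."""
    return max(action_kb, key=lambda name: score(command, action_kb[name]))


# Hypothetical fragment of an Action KB: identifier -> description.
action_kb = {
    "github_new_issue": "any new issue",
    "github_new_commit": "any new commit",
}
print(best_action("someone reports a problem in GitHub", action_kb))
```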

Syntactic variation
Additionally, interpreters are expected to cope with syntactic variation. For example, the following two commands express the same intent:
"If Manchester United wins, call me."
"Get ready to call me in the case of victory of Manchester United."

Co-reference and metonymy resolution
The first type of resolution needed is pronominal co-reference, where a pronoun refers to a constant which was previously mentioned within the context of the same scenario. Metonymy resolution consists of using the reference to an attribute or type to refer to a constant or to a different attribute of a constant. For example: "If an issue is created, send its content to the Tech Manager." This excerpt shows both cases. The pronoun "its" refers to an issue, while Tech Manager is a metonymy for the Tech Manager's email (sandra@andrade.com.br according to the User KB).
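The metonymy case can be sketched as a lookup in the User KB guided by the type of value that the target action expects. The layout of the User KB below is hypothetical; only the role "Tech Manager" and the e-mail address are taken from the example above.

```python
# Hypothetical User KB layout; the real dataset may be organised differently.
USER_KB = {
    "contacts": [
        {"role": "Tech Manager", "email": "sandra@andrade.com.br"},
        {"role": "Accountant", "email": "accounts@example.org"},  # invented entry
    ]
}


def resolve_metonymy(mention: str, expected_attribute: str, user_kb: dict) -> str:
    """Map a mention such as 'Tech Manager' to the attribute an action expects.

    The mention names a role, while the action (e.g. sending an e-mail)
    requires a different attribute of the same entity.
    """
    for contact in user_kb["contacts"]:
        if contact.get("role", "").lower() == mention.lower():
            return contact[expected_attribute]
    raise KeyError(f"no entry for {mention!r} in the User KB")


print(resolve_metonymy("Tech Manager", "email", USER_KB))
```

The expected attribute would normally be inferred from the parameter description of the selected action, which is why action selection and parameter instantiation are tightly coupled in this task.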

Evaluation
The final dataset contains commands and their associated mappings to the Action KB. Given a command in natural language, participating systems are expected to provide:
• The correct action;
• The correct mapping of text chunks in the natural language commands to parameters.
The participating systems were evaluated considering four criteria. Criteria 1 and 2 are quantified using precision and recall, while criteria 3 and 4 are quantified by the percentage of the total number of scenarios which were addressed.
Participating teams were allowed to use external linguistic resources and external tools such as taggers and parsers.
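As a reminder of how precision and recall behave in this setting, the sketch below compares a set of predicted actions with a gold-standard set. It assumes a simple set-based comparison of action identifiers, which may differ from the exact scoring procedure used in the task; the identifiers are hypothetical.

```python
def precision_recall(predicted: set[str], gold: set[str]) -> tuple[float, float]:
    """Set-based precision and recall of predicted items against gold items."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical action identifiers for the currency conversion command.
gold = {"convert_currency", "send_sms"}
predicted = {"convert_currency", "send_email"}
print(precision_recall(predicted, gold))
```

A system that returns many candidate actions per command can reach high recall at the cost of precision, which is why both measures are reported together.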

Participants and Results
Initially, nine teams expressed interest in the task, but only one participated in the challenge. Kubis et al. (2017) proposed the EUDAMU system, which implements an action ranking model based on TF-IDF and a type matching system.
The EUDAMU system is composed of a pipeline divided into six steps. It starts by pre-processing the dataset using three tools (NLTK, CoreNLP and SyntaxNet). In the pre-processing step, natural language commands are tokenized and each token is enriched with its lemma, part-of-speech and named entity labels. The step also adds the constituent and dependency structures associated with the commands. The final pre-processing step annotates the commands with types, which helps the system resolve co-references between the actions and references to the User KB. The same procedure (with the exception of the last step) is applied to the Action KB.
The pre-processing phase is followed by the Discourse Tagger, which is responsible for separating the individual commands from the paragraph description of the scenario. The team implemented this component using a rule-based approach. The next step is the Action Ranker, which applies a TF-IDF-based ranking model. It is followed by a Matcher component designed to identify which outputs of a given action act as the parameters of a subsequent action. The next step is the Parameter Matcher, which infers parameter and value types that can support the action matching process. Finally, based on the knowledge generated and stored in the previous steps, the rule-based Statement Mapper provides a list of up to 10 possible matching action instances. Additional details of the proposed method can be found in the original paper (Kubis et al., 2017). Table 4 shows its results. While the proposed solution has a high recall for the number of resolved actions, it fails mainly in providing the correct value for all the required parameters.
Two types of linguistic settings proved more challenging:
• Description of commands split into two sentences. For example: "Get the price of the book The Intelligent Investor. If it costs less than 25 Euros, buy it." where "25 Euros" is the parameter value of the action defined in the first sentence.
• Capturing actions with more specific/fine-grained semantics. For example: "Once I have bet my running distance target of the week, set my current weight as 100 Kg in Fitbit." where the system ignored the temporal expression "of the week" and suggested the "Daily step goal achieved" action instead of the "Weekly distance goal reached" action. A second example of the same case is expressed in the command: "Suspend the execution of my Samsung washer." where the term "Samsung" was ignored when selecting actions.
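To illustrate the general idea behind a TF-IDF action ranking step such as the one mentioned in this section, the sketch below ranks action descriptions by cosine similarity to a command. It is a generic, self-contained illustration with a smoothed IDF and whitespace tokenisation, not a reproduction of the EUDAMU implementation; the action identifiers and descriptions are hypothetical.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    return text.lower().split()


def rank_actions(command: str, action_docs: dict[str, str], top_k: int = 3) -> list[str]:
    """Rank action identifiers by TF-IDF cosine similarity to the command."""
    docs = {name: Counter(tokenize(text)) for name, text in action_docs.items()}
    n_docs = len(docs)
    # Document frequency: in how many action descriptions each term occurs.
    doc_freq = Counter(term for counts in docs.values() for term in counts)

    def idf(term: str) -> float:
        # Smoothed IDF, so that unseen terms do not cause a division by zero.
        return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

    def vector(counts: Counter) -> dict[str, float]:
        return {term: tf * idf(term) for term, tf in counts.items()}

    def cosine(a: dict[str, float], b: dict[str, float]) -> float:
        dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query = vector(Counter(tokenize(command)))
    scored = [(cosine(query, vector(counts)), name) for name, counts in docs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:top_k]]


# Hypothetical Action KB fragment: identifier -> documentation text.
action_docs = {
    "convert_currency": "convert an amount from one currency to another currency",
    "send_sms": "send an sms text message to a phone number",
    "weather_forecast": "get the weather forecast for a city",
}
ranking = rank_actions("send this value to john smith by sms", action_docs)
print(ranking)
```

Because such a ranker relies on exact term overlap, it illustrates why fine-grained distinctions (for instance "weekly" versus "daily" goals) and vocabulary gaps are hard to capture without additional semantic matching.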

Summary
In SemEval-2017 Task 11 we developed a test collection to support the creation of semantic interpretation methods for end-user programming environments. The test collection focuses on the following features in comparison with existing approaches: (i) open domain, (ii) large syntactic and vocabulary variability, (iii) dependence on co-reference and metonymy resolution. Moreover, as the test collection uses APIs available on the open web, it can be used to build real end-user programming environments. While there is room for improvement in the precision and recall of the identification of the command actions, the main challenge remains the matching of the parameters between natural language commands and the API.