Mapping natural language commands to web elements

The web provides a rich, open-domain environment with textual, structural, and spatial properties. We propose a new task for grounding language in this environment: given a natural language command (e.g., “click on the second article”), choose the correct element on the web page (e.g., a hyperlink or text box). We collected a dataset of over 50,000 commands that capture various phenomena such as functional references (e.g., “find who made this site”), relational reasoning (e.g., “article by john”), and visual reasoning (e.g., “top-most article”). We also implemented and analyzed three baseline models that capture different phenomena present in the dataset.


Introduction
Web pages are complex documents containing both structured properties (e.g., the internal tree representation) and unstructured properties (e.g., text and images). Due to their diversity in content and design, web pages provide a rich environment for natural language grounding tasks.
In particular, we consider the task of mapping natural language commands to web page elements (e.g., links, buttons, and form inputs), as illustrated in Figure 1. While some commands refer to an element's text directly, many others require more complex reasoning with the various aspects of web pages: the text, attributes, styles, structural data from the document object model (DOM), and spatial data from the rendered web page.
Our task is inspired by the semantic parsing literature, which aims to map natural language utterances into actions such as database queries and object manipulation (Zelle and Mooney, 1996; Chen and Mooney, 2011; Artzi and Zettlemoyer, 2013; Berant et al., 2013; Misra et al., 2015; Andreas and Klein, 2015). While these actions usually act on an environment with a fixed and known schema, web pages contain a larger variety of structures, making the task more open-ended. At the same time, our task can be viewed as a reference game (Golland et al., 2010; Smith et al., 2013; Andreas and Klein, 2016), where the system has to select an object given a natural language reference. The diversity of attributes in web page elements, along with the need to use context to interpret elements, makes web pages particularly interesting.
Identifying elements via natural language has several real-world applications. The main one is providing a voice interface for interacting with web pages, which is especially useful as an assistive technology for the visually impaired (Zajicek et al., 1998; Ashok et al., 2014). Another use case is browser automation: natural language commands are less brittle than CSS or XPath selectors (Hammoudi et al., 2016) and could generalize across different websites.
We collected a dataset of over 50,000 natural language commands. As seen in Figure 1, the commands contain many phenomena, such as relational, visual, and functional reasoning, which we analyze in greater depth in Section 2.2. We also implemented three models for the task based on retrieval, element embedding, and text alignment. Our experimental analysis shows that functional references, relational references, and visual reasoning are important for correctly identifying elements from natural language commands.

Task
Given a web page w with elements e_1, ..., e_k and a command c, the task is to select the element e ∈ {e_1, ..., e_k} described by the command c. The training and test data contain (w, c, e) triples.

Dataset
We collected a dataset of 51,663 commands on 1,835 web pages. To collect the data, we first archived home pages of the top 10,000 websites (https://majestic.com/reports/majestic-million) by rendering them in Google Chrome. After loading the dynamic content, we recorded the DOM trees and the geometry of each element, and stored the rendered web pages. We filtered for web pages in English that rendered correctly and did not contain inappropriate content. Then we asked crowdworkers to brainstorm different actions for each web page, requiring each action to reference exactly one element (of their choice) from the filtered list of interactive elements (which includes visible links, inputs, and buttons). We encouraged the workers to avoid using the exact words of the elements by granting a bonus for each command that did not contain the exact wording of the selected element. Finally, we split the data into 70% training, 10% development, and 20% test data. Web pages in the three sets do not overlap. The collected web pages have an average of 1,051 elements, while the commands are 4.1 tokens long on average.

Phenomena present in the commands
Apart from referring to the exact text of the element, commands can refer to elements in a variety of ways. We analyzed 200 examples from the training data and broke down the phenomena present in these commands (see Table 1).
Even when the command directly references the element's text, many other elements on the page also have word overlap with the command. On average, commands have word overlap with 5.9 leaf elements on the page (not counting stop words).

Retrieval-based model
Many commands refer to the elements by their text contents. As such, we first consider a simple retrieval-based model that uses the command as a search query to retrieve the most relevant element based on its TF-IDF score.
Specifically, each element is represented as a bag-of-tokens computed by (1) tokenizing and stemming its text content, and (2) tokenizing the attributes (id, class, placeholder, label, tooltip, aria-text, name, src, href) at punctuation marks and camel-case boundaries. When computing term frequencies, we downweight the attribute tokens from (2) by a factor of α. We use α = 3 tuned on the development set for our experiments.
The document frequencies are computed over the web pages in the training dataset. If multiple elements have the same score, we heuristically pick the most prominent element, i.e., the one that appears earliest in the pre-order traversal of the DOM hierarchy.
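The retrieval model above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the element representation (a dict of text and attribute strings) and helper names are hypothetical, and the IDF table is assumed to be precomputed from the training pages.

```python
import re
from collections import Counter

ALPHA = 3  # downweighting factor for attribute tokens (tuned on the dev set)

def tokenize(text):
    """Split at punctuation marks and camel-case boundaries, lowercase."""
    text = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]

def element_counts(element):
    """Bag-of-tokens for an element: text tokens at full weight,
    attribute tokens downweighted by 1/ALPHA."""
    counts = Counter()
    for tok in tokenize(element.get("text", "")):
        counts[tok] += 1.0
    for attr in ("id", "class", "placeholder", "label", "name", "src", "href"):
        for tok in tokenize(element.get(attr, "")):
            counts[tok] += 1.0 / ALPHA
    return counts

def tfidf_score(command, element, idf):
    """Sum of TF-IDF weights of the command's tokens in the element's bag."""
    counts = element_counts(element)
    return sum(counts[tok] * idf.get(tok, 0.0) for tok in tokenize(command))
```

The highest-scoring element is returned; ties would be broken by pre-order DOM position as described above.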

Embedding-based model
A common method for matching two pieces of text is to embed them separately and then compute a score from the two embeddings (Kiros et al., 2015; Tai et al., 2015). For a command c and elements e_1, ..., e_k, we define the following conditional distribution over the elements:

  p(e_i | c) = exp(s(f(c), g(e_i))) / Σ_{j=1}^{k} exp(s(f(c), g(e_j)))

where s is a scoring function, f(c) is the embedding of c, and g(e_i) is the embedding of e_i, described below. The model is trained to maximize the log-likelihood of the correct element in the training data.
Command embedding. To compute f(c), we embed each token of c into a fixed-dimensional vector and take the average over the token embeddings. (The token embeddings are initialized with GloVe vectors; we also tried applying an LSTM but found no improvement.)

Element embedding. To compute g(e), we embed the properties of e, concatenate the results, and then apply a linear layer to obtain a vector of the same length as f(c). Figure 2 shows an example element, the link <a class="dd-head" id="tip-link" href="submit_story/">Tip Us</a>, along with the properties that the model receives. The properties include:

• Text content. We apply the command embedder f on the text content of e. As the texts of most elements of interest (links, buttons, and inputs) are short, we find it sufficient to limit the text to the first 10 tokens to save memory.

• Text attributes. Several attributes (aria, title, tooltip, placeholder, label, name) usually contain natural language. We concatenate their values and then apply the command embedder f on the resulting string.

• String attributes. We tokenize other string attributes (tag, id, class) at punctuation marks and camel-case boundaries. Then we embed them with separate lookup tables and average the resulting vectors.

• Visual features. We form a vector consisting of the coordinates of the element's center (as fractions of the page width and height) and its visibility (as a boolean).
Scoring function. To compute the score s(f(c), g(e)), we first let f̂(c) and ĝ(e) be the results of normalizing the two embeddings to unit norm. Then we apply a linear layer on the concatenated vector [f̂(c); ĝ(e); f̂(c) ⊙ ĝ(e)] (where ⊙ denotes the element-wise product).
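The scoring function and the resulting distribution over elements can be sketched as below. This is an illustrative sketch with randomly initialized weights and a hypothetical embedding size, not trained parameters; in the model, f(c) and g(e_i) come from the command and element embedders described above.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # embedding size (illustrative choice)

# Hypothetical pre-computed embeddings: f(c) for the command,
# g(e_i) for each of 5 candidate elements.
f_c = rng.normal(size=DIM)
g_e = rng.normal(size=(5, DIM))

W = rng.normal(size=3 * DIM)  # final linear layer weights
b = 0.0

def score(f, g):
    """s(f(c), g(e)): normalize to unit norm, concatenate
    [f_hat; g_hat; f_hat * g_hat], then apply a linear layer."""
    f_hat = f / np.linalg.norm(f)
    g_hat = g / np.linalg.norm(g)
    feats = np.concatenate([f_hat, g_hat, f_hat * g_hat])
    return feats @ W + b

scores = np.array([score(f_c, g) for g in g_e])
p = np.exp(scores - scores.max())
p /= p.sum()  # conditional distribution p(e_i | c) over the candidates
```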
Incorporating spatial context. Context is critical in certain cases; for example, selecting a text box relies on knowing the neighboring label text, and selecting an article based on the author requires locating the author's name nearby. Identifying which related element should be considered based on the command is a challenging task. We experiment with adding spatial context to the model. For each direction d ∈ {top, bottom, left, right}, we use g to embed a neighboring element n_d(e) directly adjacent to e in that direction. (If there are multiple such elements, we sample one; if there is no such element, we use a zero vector.) After normalizing the results to get ĝ(n_d(e)), we concatenate ĝ(n_d(e)) and f̂(c) ⊙ ĝ(n_d(e)) to the linear layer input in the scoring function.
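One plausible heuristic for picking the neighbor n_d(e) from rendered geometry is sketched below. This is an assumption on our part, not the paper's exact adjacency rule: elements are represented by hypothetical center coordinates (cx, cy), and we pick the nearest element strictly in the given direction.

```python
def neighbor(element, others, direction):
    """Pick a directly adjacent element in the given direction based on
    rendered center coordinates; return None if there is none."""
    cx, cy = element["cx"], element["cy"]
    deltas = {
        "left":   lambda o: (cx - o["cx"], abs(o["cy"] - cy)),
        "right":  lambda o: (o["cx"] - cx, abs(o["cy"] - cy)),
        "top":    lambda o: (cy - o["cy"], abs(o["cx"] - cx)),
        "bottom": lambda o: (o["cy"] - cy, abs(o["cx"] - cx)),
    }[direction]
    # Keep only elements strictly in the requested direction (distance > 0).
    candidates = [(d, off, o) for o in others for (d, off) in [deltas(o)] if d > 0]
    if not candidates:
        return None
    # Prefer the closest element along the direction, then the least offset.
    return min(candidates, key=lambda t: (t[0], t[1]))[2]
```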

Alignment-based model
One downside of the embedding-based model is that the text tokens from c and e do not directly interact. Previous works on sentence matching usually employ either unidirectional or bidirectional attention to tackle this issue (Seo et al., 2016;Yin et al., 2016;Yu et al., 2018). We opt for a simple method based on a single alignment matrix similar to Hu et al. (2014) as described below.
Let t(e) be the concatenation of e's text content and text attributes, trimmed to 10 tokens. We construct a matrix A(c, e) where each entry A_ij(c, e) is the dot product between the embeddings of the i-th token of c and the j-th token of t(e). Then we apply two convolutional layers of size 3 × 3 on the matrix, apply a max-pooling layer of size 2 × 2, concatenate a tag embedding, and then apply a linear layer on the result to get a 10-dimensional vector h(c, e).
We apply a final linear layer on h(c, e) to compute a scalar score, and then train on the same objective function as the embedding-based model. To incorporate context, we simply concatenate the four vectors h(c, n_d(e)) of the neighbors n_d(e) to the final linear layer input.
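Constructing the alignment matrix A(c, e) can be sketched as follows. The token embeddings here are random placeholders (the model would learn them); the convolution, pooling, and linear layers that follow are omitted.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 8
vocab = {}  # token -> embedding, generated on demand (illustrative)

def embed(token):
    if token not in vocab:
        vocab[token] = rng.normal(size=DIM)
    return vocab[token]

def alignment_matrix(command_tokens, element_tokens, max_len=10):
    """A_ij = dot product between the embeddings of the i-th command token
    and the j-th element token, with t(e) trimmed to max_len tokens."""
    element_tokens = element_tokens[:max_len]
    A = np.zeros((len(command_tokens), len(element_tokens)))
    for i, ci in enumerate(command_tokens):
        for j, tj in enumerate(element_tokens):
            A[i, j] = embed(ci) @ embed(tj)
    return A
```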

Experiments
We evaluate the models on accuracy, the fraction of examples on which the model selects the correct element. We train the neural models using Adam (Kingma and Ba, 2014) with initial learning rate 10^-3, and apply early stopping based on the development set. The models can choose any element that is visible on the page at rendering time.
The experimental results are shown in Table 2. Both neural models significantly outperform the retrieval model.

Ablation analysis
To measure the importance of each type of information in web pages, we perform an ablation study in which the model does not observe one of the following aspects of the elements: text contents, attributes, and spatial context. Unsurprisingly, the results in Table 2 show that text contents are the most important input signal. However, attributes also play an important role in both the embedding and alignment models. Finally, while spatial context increases the alignment model's performance, the gain is very small, suggesting that incorporating the appropriate context into the model is a challenging task due to the variety in the types of context, as well as the sparsity of the signals.

Error analysis
To get a better picture of how the models handle different phenomena, we analyze the predictions of the embedding-based and alignment-based models on 100 development examples where at least one model made an error. The errors, summarized in Table 3, are elaborated below:

Fail to match strings. Many commands simply specify the text content of the element (e.g., "click customised garages" → the link with text "Customised Garages, Canopies & Carports"). The embedding model, which encodes the whole command as a single vector, occasionally fails to select the element with partially matching texts. In contrast, the alignment model explicitly models text matching, and is thus better at this type of command.
Incorrectly match strings. Due to its reliance on text matching, the alignment model struggles when many elements share substrings with the command (e.g., "shop for knitwear" when many elements contain the word "shop"), or when an element with a matching substring is not the correct target (e.g., "get the program" → the "Download" link, not the "Microsoft developer program" link).
Fail to understand descriptions. As seen in Table 1, many commands indirectly describe the elements using paraphrases, goal descriptions, or properties of the elements. The embedding model, which summarizes the various properties of the elements as an embedding vector, is better at handling these commands than the alignment model, but still makes a few errors on harder examples (e.g., "please close this notice for me" → the "X" button with hidden text "Hide").
Fail to perform reasoning. For the most part, the models fail to handle relational, ordinal, or spatial reasoning. The most frequent error mode occurs when the element is a text box and the command uses a nearby label as the reference. While a few text boxes have semantic annotations which the model can use (e.g., tooltip or aria attributes), many web pages do not provide such annotations.
To handle these cases, a model would have to identify the label of the text box based on logical or visual contexts.
Other errors. Apart from the annotation noise, occasionally multiple elements on the web page satisfy the given command (e.g., "log in" can match any "Sign In" button on the web page). In these cases, the annotation usually gives the most prominent element among the possible candidates.
To provide a natural interface for users, the model should arguably learn to predict such prominent elements instead of more obscure ones.

Related work and discussion
Mapping natural language to actions. Previous work on semantic parsing learns to perform actions described by natural language utterances in various environments. Examples of such actions include API calls (Young et al., 2013; Su et al., 2017; Bordes and Weston, 2017), database queries (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2007; Berant et al., 2013; Yih et al., 2015), navigation (Artzi and Zettlemoyer, 2013; Janner et al., 2018), and object manipulation (Tellex et al., 2011; Andreas and Klein, 2015; Guu et al., 2017; Fried et al., 2018). For web pages and graphical user interfaces, previous work has used natural language to perform computations on web tables (Pasupat and Liang, 2015) and submit web forms (Shi et al., 2017). Our task is similar to previous work on interpreting instructions on user interfaces (Branavan et al., 2009, 2010; Liu et al., 2018). While that work focuses on learning from distant supervision, we consider shallower interactions but on a much broader domain.
Previous work also explores the reverse problem of generating natural language descriptions of objects (Vinyals et al., 2014; Karpathy and Fei-Fei, 2015; Zarrieß and Schlangen, 2017). We hope that our dataset could also be useful for exploring the reverse task of describing actions on web pages.
Reference games. In a reference game, the system has to select the correct object referenced by the given utterance (Frank and Goodman, 2012). Previous work on reference games focuses on a small number of objects with similar properties, and applies pragmatics to handle ambiguous utterances (Golland et al., 2010; Smith et al., 2013; Çelikyilmaz et al., 2014; Andreas and Klein, 2016; Yu et al., 2017). Our task can be viewed as a reference game with several additional challenges: a larger number of objects, diverse object properties, and the need to interpret objects based on their contexts.
Interacting with web pages. Automated scripts are used to interact with web elements. While most scripts reference elements with logical selectors (e.g., CSS and XPath), there have been several alternatives such as images (Yeh et al., 2009) and simple natural language utterances (Soh, 2017). Some other interfaces for navigating web pages include keystrokes (Spalteholz et al., 2008), speech (Ashok et al., 2014), haptics (Yu et al., 2005), and eye gaze (Kumar et al., 2007).

Conclusion
We presented a new task of grounding natural language commands on open-ended and semi-structured web pages. With different methods of referencing elements, mixtures of textual and non-textual element attributes, and the need to properly incorporate context, our task offers a challenging environment for language understanding with great potential for real-world applications.
Our dataset and code are available at https: