Mapping Natural Language Instructions to Mobile UI Action Sequences

We present a new problem: grounding natural language instructions to mobile user interface actions, and create three new datasets for it. For full task evaluation, we create PixelHelp, a corpus that pairs English instructions with actions performed by people on a mobile UI emulator. To scale training, we decouple the language and action data by (a) annotating action phrase spans in How-To instructions and (b) synthesizing grounded descriptions of actions for mobile user interfaces. We use a Transformer to extract action phrase tuples from long-range natural language instructions. A grounding Transformer then contextually represents UI objects using both their content and screen position and connects them to object descriptions. Given a starting screen and instruction, our model achieves 70.59% accuracy on predicting complete ground-truth action sequences in PixelHelp.


Introduction
Language helps us work together to get things done. People instruct one another to coordinate joint efforts and accomplish tasks involving complex sequences of actions. This takes advantage of the abilities of different members of a speech community, e.g. a child asking a parent for a cup she cannot reach, or a visually impaired individual asking for assistance from a friend. Building computational agents able to help in such interactions is an important goal that requires true language grounding in environments where action matters.
An important area of language grounding involves tasks like completion of multi-step actions in a graphical user interface conditioned on language instructions (Branavan et al., 2009, 2010; Liu et al., 2018; Gur et al., 2019). These domains matter for accessibility, where language interfaces could help visually impaired individuals perform tasks with interfaces that are predicated on sight. This also matters for situational impairment (Sarsenbayeva, 2018) when one cannot access a device easily while encumbered by other factors, such as cooking.

Figure 1: Our model extracts the phrase tuples that describe each action, including its operation, object and additional arguments, and grounds these tuples as executable action sequences in the UI. (Example instruction: open the app drawer; navigate to Settings > Network & internet > Wi-Fi; click Add network, and then enter starbucks for SSID.)
We focus on a new domain of task automation in which natural language instructions must be interpreted as a sequence of actions on a mobile touchscreen UI. Existing web search is quite capable of retrieving multi-step natural language instructions for user queries, such as "How to turn on flight mode on Android." Crucially, the missing piece for fulfilling the task automatically is to map the returned instruction to a sequence of actions that can be automatically executed on the device with little user intervention; this is our goal in this paper. This task automation scenario does not require a user to maneuver through UI details, which is useful for average users and is especially valuable for visually or situationally impaired users. The ability to execute an instruction can also be useful for other scenarios, such as automatically examining the quality of an instruction.
Our approach (Figure 1) decomposes the problem into an action phrase-extraction step and a grounding step. The former extracts operation, object and argument descriptions from multi-step instructions; for this, we use Transformers (Vaswani et al., 2017) and test three span representations. The latter matches extracted operation and object descriptions with a UI object on a screen; for this, we use a Transformer that contextually represents UI objects and grounds object descriptions to them.
We construct three new datasets. To assess full task performance on naturally occurring instructions, we create a dataset of 187 multi-step English instructions for operating Pixel phones and produce their corresponding action-screen sequences using annotators. For action phrase extraction training and evaluation, we obtain English How-To instructions from the web and annotate action description spans. A Transformer with spans represented by sum pooling (Li et al., 2019) obtains 85.56% accuracy for predicting span sequences that completely match the ground truth. To train the grounding model, we synthetically generate 295K single-step commands to UI actions, covering 178K different UI objects across 25K mobile UI screens.
Our phrase extractor and grounding model together obtain 89.21% partial and 70.59% complete accuracy for matching ground-truth action sequences on this challenging task. We also evaluate alternative methods and representations of objects and spans, and present qualitative analyses to provide insights into the problem and models.

Problem Formulation
Given an instruction of a multi-step task, I = t_{1:n} = (t_1, t_2, ..., t_n), where t_i is the ith token in instruction I, we want to generate a sequence of automatically executable actions, a_{1:m}, over a sequence of user interface screens S, with initial screen s_1 and screen transition function s_j = τ(a_{j-1}, s_{j-1}). An action a_j = [r_j, o_j, u_j] consists of an operation r_j (e.g., Tap or Text), the UI object o_j that r_j is performed on (e.g., a button or an icon), and an additional argument u_j needed for o_j (e.g., the message entered in the chat box for Text, or null for operations such as Tap). Starting from s_1, executing a sequence of actions a_{<j} arrives at screen s_j, the screen at the jth step. Each screen s_j = [c_{j,1:|s_j|}, λ_j] contains a set of UI objects c_{j,1:|s_j|}, where |s_j| is the number of objects in s_j, from which o_j is chosen, and λ_j defines the structural relationship between the objects. This is often a tree structure, such as the View hierarchy for an Android interface (similar to a DOM tree for web pages).
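As a concrete sketch, the formulation above maps naturally onto simple data structures. The class and field names below are illustrative stand-ins, not the authors' implementation; the transition function is a placeholder rather than a real emulator dispatch.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class UIObject:
    """One object c_{j,k} on a screen."""
    name: str                          # e.g. "Send"
    obj_type: str                      # e.g. "Button"
    bounds: Tuple[int, int, int, int]  # (top, left, bottom, right) coordinates

@dataclass
class Action:
    """An action a_j = [r_j, o_j, u_j]."""
    operation: str           # r_j: "TAP", "TEXT", "SWIPE"
    target: UIObject         # o_j: object the operation acts on
    argument: Optional[str]  # u_j: text to enter, or None for TAP

@dataclass
class Screen:
    """A screen s_j = [c_{j,1:|s_j|}, lambda_j]."""
    objects: List[UIObject]  # the set of UI objects
    hierarchy: dict          # structural relations (view hierarchy tree)

def execute(screen: Screen, action: Action) -> Screen:
    """Stand-in for the transition function s_{j+1} = tau(a_j, s_j)."""
    # A real implementation would dispatch the touch event to an emulator;
    # here we return the screen unchanged as a placeholder.
    return screen
```

A task is then a fold of `execute` over the action sequence starting from the initial screen.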
An instruction I describes (possibly multiple) actions. Let ā_j denote the phrases in I that describe action a_j: ā_j = [r̄_j, ō_j, ū_j] is a tuple of descriptions, each corresponding to a span, i.e., a subsequence of tokens, in I. Accordingly, ā_{1:m} represents the description tuple sequence, which we refer to as ā for brevity. We also define Ā as the set of all possible description tuple sequences of I, thus ā ∈ Ā.
p(a_j | s_j, t_{1:n}) = Σ_{ā∈Ā} p(a_j | ā, s_j, t_{1:n}) p(ā | s_j, t_{1:n})   (3)

Because a_j is independent of the rest of the instruction given its current screen s_j and description ā_j, and ā depends only on the instruction t_{1:n}, we can simplify (3) as (4):

p(a_j | s_j, t_{1:n}) = Σ_{ā∈Ā} p(a_j | ā_j, s_j) p(ā | t_{1:n})   (4)

We define â as the most likely description of actions for t_{1:n}, which gives:

p(a_j | s_j, t_{1:n}) ≈ p(a_j | â_j, s_j) p(â_j | â_{<j}, t_{1:n})   (7)

p(â_j | â_{<j}, t_{1:n}) identifies the description tuples for each action. p(a_j | â_j, s_j) grounds each description to an executable action given the screen.

Data
The ideal dataset would have natural instructions that have been executed by people using the UI. Such data can be collected by having annotators perform tasks according to instructions on a mobile platform, but this is difficult to scale. It requires significant investment to instrument: different versions of apps have different presentation and behaviors, and apps must be installed and configured for each task. Due to this, we create a small dataset of this form, PIXELHELP, for full task evaluation. For model training at scale, we create two other datasets: ANDROIDHOWTO for action phrase extraction and RICOSCA for grounding. Our datasets are targeted for English. We hope that starting with a high-resource language will pave the way to creating similar capabilities for other languages.

PIXELHELP Dataset
Pixel Phone Help pages provide instructions for performing common tasks on Google Pixel phones, such as switching Wi-Fi settings (Fig. 2) or checking emails. Help pages can contain multiple tasks, with each task consisting of a sequence of steps. We pulled instructions from the help pages and kept ones that can be automatically executed. Instructions that require additional user input, such as Tap the app you want to uninstall, are discarded. Also, instructions that involve actions on a physical button, such as Press the Power button for a few seconds, are excluded because these events cannot be executed on mobile platform emulators.
We instrumented a logging mechanism on a Pixel phone emulator and had human annotators perform each task on the emulator by following the full instruction. The logger records every user action, including the type of touch events that are triggered, each object being manipulated, and screen information such as view hierarchies. Each item thus includes the instruction input, t_{1:n}, the screen for each step of the task, s_{1:m}, and the target action performed on each screen, a_{1:m}.
In total, PIXELHELP includes 187 multi-step instructions across 4 task categories: 88 general tasks, such as configuring accounts, 38 Gmail tasks, 31 Chrome tasks, and 30 Photos-related tasks. The number of steps ranges from two to eight, with a median of four. Because it has both natural instructions and grounded actions, we reserve PIXELHELP for evaluating full task performance.

ANDROIDHOWTO Dataset
No datasets exist that support learning the action phrase extraction model, p(â_j | â_{<j}, t_{1:n}), for mobile UIs. To address this, we extracted English instructions for operating Android devices by processing web pages to identify candidate instructions for how-to questions such as how to change the input method for Android. A web crawling service scrapes instruction-like content from various websites. We then filter the web content using both heuristics and manual screening by annotators.
Annotators identified phrases in each instruction that describe executable actions. They were given a tutorial on the task and were instructed to skip instructions that are difficult to understand or label. For each component in an action description, they selected the span of words that describes the component using a web annotation interface (details are provided in the appendix). The interface records the start and end positions of each marked span. Each instruction was labeled by three annotators: three annotators agreed on 31% of full instructions and at least two agreed on 84%. For consistency at the tuple level, the agreement across all the annotators is 83.6% for operation phrases, 72.07% for object phrases, and 83.43% for input phrases. The discrepancies are usually small, e.g., a description marked as your Gmail address versus Gmail address.
The final dataset includes 32,436 data points from 9,893 unique How-To instructions, split into training (8K), validation (1K) and test (900) instructions. All test examples have perfect agreement across all three annotators for the entire sequence. In total, there are 190K operation spans, 172K object spans, and 321 input spans labeled. The lengths of the instructions range from 19 to 85 tokens, with a median of 59. They describe sequences of one to 19 actions, with a median of 5.

RICOSCA Dataset
Training the grounding model, p(a_j | â_j, s_j), involves pairing action tuples a_j and screens s_j with action descriptions â_j. It is very difficult to collect such data at scale. To get past this bottleneck, we exploit two properties of the task to generate a synthetic command-action dataset, RICOSCA. First, we have precise structured and visual knowledge of the UI layout, so we can spatially relate UI elements to each other and to the overall screen. Second, a grammar grounded in the UI can cover many of the commands and kinds of reference needed for the problem. This does not capture all manners of interacting conversationally with a UI, but it proves effective for training the grounding model.
Rico is a public UI corpus with 72K Android UI screens mined from 9.7K Android apps (Deka et al., 2017). Each screen in Rico comes with a screenshot image and a view hierarchy of a collection of UI objects. Each individual object, c_{j,k}, has a set of properties, including its name (often an English phrase such as Send), type (e.g., Button, Image or Checkbox), and bounding box position on the screen. We manually removed screens whose view hierarchies do not match their screenshots by asking annotators to visually verify whether the bounding boxes of view hierarchy leaves match each UI object on the corresponding screenshot image. This filtering results in 25K unique screens.
For each screen, we randomly select UI elements as target objects and synthesize commands for operating them. We generate multiple commands to capture different expressions describing the operation r̂_j and the target object ô_j. For example, the Tap operation can be referred to as tap, click, or press. The template for referring to a target object has slots Name, Type, and Location, which are instantiated using the following strategies:

• Name-Type: the target's name and/or type (the OK button or OK).
• Absolute-Location: the target's screen location (the menu at the top right corner).
• Relative-Location: the target's location relative to other objects (the icon to the right of Send).

Because all commands are synthesized, the span that describes each part of an action, â_j with respect to t_{1:n}, is known. Meanwhile, a_j and s_j, the actual action and the associated screen, are also known because the constituents of the action are synthesized. In total, RICOSCA contains 295,476 single-step synthetic commands for operating 177,962 different target objects across 25,677 Android screens.
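A minimal sketch of this synthesis for the Name-Type strategy is shown below. The synonym list and template are simplified assumptions for illustration, not the paper's full grammar; the key point is that generated commands come with their phrase spans for free.

```python
import random

# Hypothetical, simplified operation-synonym table (the real grammar is richer).
OPERATION_SYNONYMS = {"TAP": ["tap", "click", "press"]}

def synthesize_command(obj_name, obj_type, operation="TAP", rng=random):
    """Generate one Name-Type command plus the spans describing each part."""
    verb = rng.choice(OPERATION_SYNONYMS[operation])
    reference = f"the {obj_name} {obj_type.lower()}"  # Name-Type template
    command = f"{verb} {reference}"
    # Because the command is generated, span positions are known exactly:
    verb_span = (0, 0)                                   # token index of the verb
    obj_len = len(reference.split())
    object_span = (1, obj_len)                           # inclusive token indices
    return command, {"operation": verb_span, "object": object_span}

cmd, spans = synthesize_command("OK", "Button", rng=random.Random(0))
```

Scaling this over many screens and reference strategies yields a large paired command-action corpus without any human labeling.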

Model Architectures
Equation 7 has two parts. p(â_j | â_{<j}, t_{1:n}) finds the best phrase tuple that describes the action at the jth step given the instruction token sequence. p(a_j | â_j, s_j) computes the probability of an executable action a_j given the best description of the action, â_j, and the screen s_j for the jth step.

Phrase Tuple Extraction Model
A common choice for modeling the conditional probability p(ā_j | ā_{<j}, t_{1:n}) (see Equation 5) is an encoder-decoder such as an LSTM (Hochreiter and Schmidhuber, 1997) or a Transformer (Vaswani et al., 2017). The output of our model corresponds to positions in the input sequence, so our architecture is closely related to Pointer Networks (Vinyals et al., 2015).
Figure 3 depicts our model. An encoder g computes a latent representation h_{1:n} ∈ R^{n×|h|} of the tokens from their embeddings: h_{1:n} = g(e(t_{1:n})). A decoder f then generates the hidden state q_j = f(q_{<j}, ā_{<j}, h_{1:n}), which is used to compute a query vector that locates each phrase of a tuple (r̄_j, ō_j, ū_j) at each step. The phrases in ā_j = [r̄_j, ō_j, ū_j] are assumed conditionally independent given previously extracted phrase tuples and the instruction, so p(ā_j | ā_{<j}, t_{1:n}) = Π_{ȳ∈{r̄,ō,ū}} p(ȳ_j | ā_{<j}, t_{1:n}).
Note that ȳ_j ∈ {r̄_j, ō_j, ū_j} denotes a specific span for y ∈ {r, o, u} in the action tuple at step j. We therefore rewrite ȳ_j as y_j^{b:d} to explicitly indicate that it corresponds to the span for r, o or u, starting at the bth position and ending at the dth position in the instruction, 1 ≤ b < d ≤ n. We parameterize the conditional probability as a softmax over the alignment scores of all candidate spans: p(y_j^{b:d} | ā_{<j}, t_{1:n}) = softmax_{b,d} α(q_j^y, h_{b:d}). As shown in Figure 3, q_j^y denotes task-specific query vectors for y ∈ {r, o, u}. They are computed as q_j^y = φ(q_j, θ_y) W_y, a multi-layer perceptron followed by a linear transformation, where θ_y and W_y are trainable parameters. We use separate parameters for each of r, o and u. W_y ∈ R^{|φ_y|×|h|}, where |φ_y| is the output dimension of the multi-layer perceptron. The alignment function α(·) scores how well a query vector q_j^y matches a span whose vector representation h_{b:d} is computed from the encodings of tokens b through d.
Span Representation. There is a quadratic number of possible spans given a token sequence (Lee et al., 2017), so it is important to design a fixed-length representation h_{b:d} of a variable-length token span that can be quickly computed. Beginning-Inside-Outside (BIO) tagging (Ramshaw and Marcus, 1995), commonly used to indicate spans in tasks such as named entity recognition, marks whether each token is beginning, inside, or outside a span. However, BIO is not ideal for our task because subsequences describing different actions can overlap, e.g., in click X and Y, click participates in both actions click X and click Y. In our experiments we consider several recent, more flexible span representations (Lee et al., 2016, 2017; Li et al., 2019) and show their impact in Section 5.2.
With fixed-length span representations, we can use common alignment techniques in neural networks (Bahdanau et al., 2014; Luong et al., 2015). We use the dot product between the query vector and the span representation: α(q_j^y, h_{b:d}) = q_j^y · h_{b:d}. At each step of decoding, we feed the previously decoded phrase tuples ā_{<j} into the decoder. The input for each decoding step can be either the concatenation or the sum of the vector representations of the three elements in a phrase tuple. The entire phrase tuple extraction model is trained by minimizing the softmax cross-entropy loss between the predicted and ground-truth spans of a sequence of phrase tuples.
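The dot-product alignment above can be sketched in a few lines of numpy: score every candidate span representation against a query vector, then normalize over spans with a softmax. Shapes and names here are illustrative, not the paper's code.

```python
import numpy as np

def score_spans(query, span_reps):
    """query: (h,); span_reps: (num_spans, h).

    Returns a softmax distribution over candidate spans, where each logit is
    the dot product alpha(q, h_{b:d}) = q . h_{b:d}.
    """
    logits = span_reps @ query
    logits = logits - logits.max()   # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

rng = np.random.default_rng(0)
spans = rng.normal(size=(5, 8))      # 5 candidate spans, dimension 8
q = rng.normal(size=(8,))            # one task-specific query vector
p = score_spans(q, spans)
```

Training would then minimize cross-entropy between this distribution and the one-hot ground-truth span.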

Grounding Model
Having computed the sequence of tuples that best describe each action, we connect them to executable actions based on the screen at each step with our grounding model (Fig. 4). In step-by-step instructions, each part of an action is often clearly stated. Thus, we assume the operation r_j, object o_j, and argument u_j are independent given their descriptions and the screen.
p(a_j | â_j, s_j) = p(r_j | r̂_j, s_j) p(o_j | ô_j, s_j) p(u_j | û_j, s_j)
                 = p(r_j | r̂_j) p(o_j | ô_j, s_j)

We simplify with two assumptions: (1) an operation is often fully described by its instruction without relying on the screen information, and (2) in mobile interaction tasks, an argument is only present for the Text operation, so u_j = û_j. We parameterize p(r_j | r̂_j) as a feedforward neural network: p(r_j | r̂_j) = softmax(φ(ē_{r̂_j}, θ_r) W_r), where φ(·) is a multi-layer perceptron with trainable parameters θ_r, and W_r ∈ R^{|φ_r|×|r|} is also trainable, with |φ_r| the output dimension of φ(·, θ_r) and |r| the vocabulary size of the operations. φ(·) takes as input ē_{r̂_j}, the sum of the embedding vectors of each token in the operation description r̂_j: ē_{r̂_j} = Σ_{k=b}^{d} e(t_k), where b and d are the start and end positions of r̂_j in the instruction.
Determining o_j means selecting a UI object from a variable number of objects on the screen, c_{j,k} ∈ s_j where 1 ≤ k ≤ |s_j|, based on the given object description ô_j. We parameterize the conditional probability as a deep neural network with a softmax output layer taking logits from an alignment function: p(o_j = c_{j,k} | ô_j, s_j) = softmax_k α(ô_j, c_{j,k}). The alignment function α(·) scores how well the object description vector ô_j matches the latent representation of each UI object, c_{j,k}; this can be as simple as the dot product of the two vectors. The latent representation ô_j is acquired with a multi-layer perceptron followed by a linear projection.

Contextual Representation of UI Objects. To compute the latent representation of each candidate object, c_{j,k}, we use both the object's properties and its context, i.e., the structural relationship with other objects on the screen. There are different ways to encode a variable-sized collection of structurally related items, including Graph Convolutional Networks (GCNs) (Niepert et al., 2016) and Transformers (Vaswani et al., 2017). GCNs use an adjacency matrix predetermined by the UI structure to regulate how the latent representation of an object is affected by its neighbors. Transformers allow each object to carry its own positional encoding, so the relationships between objects can be learned instead.
The input to the Transformer encoder is a combination of the content embedding and the positional encoding of each object. The content properties of an object include its name and type. We compute the content embedding of an object by concatenating the name embedding, which is the average embedding of the bag of tokens in the object name, and the type embedding. The positional properties of an object include both its spatial position and structural position. The spatial positions include the top, left, right and bottom screen coordinates of the object. We treat each of these coordinates as a discrete value and represent it via an embedding. Such a feature representation for coordinates was used in Image Transformer to represent pixel positions in an image (Parmar et al., 2018). The spatial embedding of the object is the sum of these four coordinate embeddings. To encode structural information, we use the index positions of the object in the preorder and postorder traversals of the view hierarchy tree, and represent these index positions as embeddings in a similar way to coordinates. The content embedding is summed with the positional encodings to form the embedding of each object. We then feed these object embeddings into a Transformer encoder to compute the latent representation of each object, c_{j,k}.
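The embedding recipe above can be sketched as follows. All table sizes and dimensions are assumptions for illustration (the real vocabularies and hyperparameters are in the paper's appendix); the structure mirrors the text: name-mean concatenated with type, plus summed coordinate embeddings and traversal-index embeddings.

```python
import numpy as np

D = 16  # half-dimension, so that name||type matches the positional dimensions
rng = np.random.default_rng(0)
token_emb = rng.normal(size=(100, D))      # name-token vocabulary (assumed size)
type_emb = rng.normal(size=(15, D))        # 15 UI types (14 common + catch-all)
coord_emb = rng.normal(size=(64, 2 * D))   # discretized screen coordinates
index_emb = rng.normal(size=(256, 2 * D))  # traversal index positions

def embed_object(name_token_ids, type_id, coords, pre_idx, post_idx):
    """Combine content, spatial, and structural features into one vector."""
    name_vec = token_emb[name_token_ids].mean(axis=0)       # bag-of-tokens mean
    content = np.concatenate([name_vec, type_emb[type_id]])  # name || type
    spatial = coord_emb[list(coords)].sum(axis=0)            # top+left+right+bottom
    structural = index_emb[pre_idx] + index_emb[post_idx]    # preorder + postorder
    return content + spatial + structural

vec = embed_object([3, 7], type_id=2, coords=(1, 2, 3, 4), pre_idx=5, post_idx=9)
```

These per-object vectors would then be fed as a set into the Transformer encoder, which produces the contextual representations c_{j,k}.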
The grounding model is trained by minimizing the cross entropy loss between the predicted and ground-truth object and the loss between the predicted and ground-truth operation.

Experiments
Our goal is to develop models and datasets to map multi-step instructions into automatically executable actions given the screen information. As such, we use PIXELHELP's paired natural instructions and action-screen sequences solely for testing. In addition, we investigate model quality on the phrase tuple extraction task, which is a crucial building block for overall grounding quality.

Datasets and Metrics
We use two metrics that measure how a predicted tuple sequence matches the ground-truth sequence.
• Complete Match: The score is 1 if the two sequences have the same length and identical tuples [r̂_j, ô_j, û_j] at each step, and 0 otherwise.
• Partial Match: The number of steps of the predicted sequence that match the ground-truth sequence, divided by the length of the ground-truth sequence (ranging between 0 and 1).

We train and validate using ANDROIDHOWTO and RICOSCA, and evaluate on PIXELHELP. During training, single-step synthetic command-action
examples are dynamically stitched to form sequence examples with a certain length distribution. To evaluate the full task, we use Complete and Partial Match on grounded action sequences a_{1:m}, where a_j = [r_j, o_j, u_j].
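The two metrics are straightforward to implement once each step is reduced to a comparable tuple. A minimal sketch, treating each step as an (operation, object, argument) tuple:

```python
def complete_match(pred, truth):
    """1.0 iff both sequences have equal length and identical tuples per step."""
    return 1.0 if pred == truth else 0.0

def partial_match(pred, truth):
    """Fraction of ground-truth steps matched at the same position."""
    matched = sum(1 for p, t in zip(pred, truth) if p == t)
    return matched / len(truth)

# Illustrative sequences (object names are placeholders):
truth = [("TAP", "settings", None), ("TAP", "wifi", None)]
pred = [("TAP", "settings", None), ("TAP", "network", None)]
```

Note that Partial Match still requires positional agreement, so a prediction with an extra early step can miss every subsequent ground-truth step.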
The token vocabulary size is 59K, compiled from both the instruction corpus and the UI name corpus. There are 15 UI types, including 14 common UI object types and a type to catch all less common ones. The output vocabulary for operations includes CLICK, TEXT, SWIPE and EOS.

Model Configurations and Results
Tuple Extraction. For the action-tuple extraction task, we use a 6-layer Transformer for both the encoder and the decoder. We evaluate three different span representations. Area Attention (Li et al., 2019) provides a parameter-free representation of each possible span (one-dimensional area) by summing up the encoding of each token in the subsequence: h_{b:d} = Σ_{k=b}^{d} h_k. The representation of each span can be computed in constant time, invariant to the length of the span, using a summed area table. Previous work concatenated the encodings of the start and end tokens as the span representation (Lee et al., 2016), and a generalized version of it was proposed by Lee et al. (2017). We evaluated these three options and implemented the representation of Lee et al. (2017) using a summed area table, similar to the approach in area attention, for fast computation. For hyperparameter tuning and training details, refer to the appendix.
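The constant-time span-sum trick can be illustrated with a one-dimensional prefix sum (the 1-D analogue of a summed area table): precompute cumulative sums over the token encodings once, then any span sum h_{b:d} is a single subtraction. This is a simplified numpy sketch, not the paper's TensorFlow implementation.

```python
import numpy as np

def build_prefix(encodings):
    """encodings: (n, h). Returns prefix sums with a leading zero row,
    so prefix[i] = sum of encodings[0:i]."""
    zero = np.zeros((1, encodings.shape[1]))
    return np.concatenate([zero, np.cumsum(encodings, axis=0)])

def span_sum(prefix, b, d):
    """Sum of encodings for tokens b..d inclusive (0-indexed), O(1) per span."""
    return prefix[d + 1] - prefix[b]

enc = np.arange(12, dtype=float).reshape(4, 3)  # 4 tokens, dimension 3
prefix = build_prefix(enc)
```

With the O(n) prefix computed once, all O(n^2) candidate spans can be represented without any per-span loop over tokens.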
Table 1 gives results on ANDROIDHOWTO's test set. All the span representations perform well. Encodings of each token from a Transformer already capture sufficient information about the entire sequence, so even using only the start and end encodings yields strong results. Nonetheless, area attention provides a small boost over the others. As a new dataset, there is also considerable headroom remaining, particularly for complete match.

Grounding. For the grounding task, we compare the Transformer-based screen encoder for generating object representations c_{j,k} with two baseline methods based on graph convolutional networks. The Heuristic baseline matches extracted phrases against object names directly using BLEU scores. Filter-1 GCN performs graph convolution without using adjacent nodes (objects), so the representation of each object is computed only from its own properties. Distance GCN uses the distance between objects in the view hierarchy, i.e., the number of edges to traverse from one object to another following the tree structure. This contrasts with the traditional GCN definition based on adjacency, but is needed because UI objects are often leaves in the tree; as such, they are not adjacent to each other structurally but instead are connected through nonterminal (container) nodes. Both Filter-1 GCN and Distance GCN use the same number of parameters (see the appendix for details).
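The edge-count distance used by Distance GCN can be computed from a child-to-parent map via the lowest common ancestor. The tree encoding below is an assumption for illustration; the point is that two leaves in different containers are several edges apart even though they are "close" on screen.

```python
def path_to_root(parent, node):
    """Walk child -> parent links up to the root; returns the full path."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def tree_distance(parent, a, b):
    """Number of edges between a and b in the tree (via their lowest
    common ancestor)."""
    pa, pb = path_to_root(parent, a), path_to_root(parent, b)
    ancestors = set(pa)
    for steps_b, node in enumerate(pb):
        if node in ancestors:           # first shared ancestor = LCA
            return pa.index(node) + steps_b
    raise ValueError("nodes are not in the same tree")

# Two leaves under different containers share only the root:
#      root
#     /    \
#   boxA   boxB
#    |       |
#  leaf1   leaf2
parent = {"leaf1": "boxA", "leaf2": "boxB", "boxA": "root", "boxB": "root"}
```

As the paper notes, developers can structure the same UI differently, so this distance is a noisy signal.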
To train the grounding model, we first train the tuple extraction sub-model on ANDROIDHOWTO and RICOSCA. For the latter, only language-related features (commands and tuple positions in the command) are used in this stage, so screen and action features are not involved. We then freeze the tuple extraction sub-model and train the grounding sub-model on RICOSCA using both the command and screen-action related features. The screen token embeddings of the grounding sub-model share weights with the tuple extraction sub-model.
Table 2 gives full task performance on PIXELHELP. The Transformer screen encoder achieves the best result with 70.59% accuracy on Complete Match and 89.21% on Partial Match, which sets a strong baseline result for this new dataset while leaving considerable headroom. The GCN-based methods perform poorly, which shows the importance of contextual encodings of information from other UI objects on the screen. Distance GCN does attempt to capture context for UI objects that are structurally close; however, we suspect that the distance information derived from the view hierarchy tree is noisy because UI developers can construct the structure differently for the same UI. As a result, the strong bias introduced by the structural distance does not always help. Nevertheless, these models still outperformed the heuristic baseline, which achieved 62.44% for partial match and 42.25% for complete match.

Analysis
To explore how the model grounds an instruction on a screen, we analyze the relationship between words in the instruction that refer to specific locations on the screen and actual positions on the UI screen. We first extract the embedding weights from the trained phrase extraction model for words such as top, bottom, left and right. These words occur in object descriptions such as the check box at the top of the screen. We also extract the embedding weights of object screen positions, which are used to create object positional encodings. We then calculate the correlation between the word embeddings and the screen position embeddings using cosine similarity. Figure 5 visualizes the correlation as a heatmap, where brighter colors indicate higher correlation. The word top is strongly correlated with the top of the screen, but the trend for other location words is less clear. While left is strongly correlated with the left side of the screen, other regions on the screen also show high correlation. This is likely because left and right are not only used for referring to absolute locations on the screen, but also for relative spatial relationships, such as the icon to the left of the button. For bottom, the strongest correlation does not occur at the very bottom of the screen because many UI objects in our dataset do not fall in that region. The region is often reserved for system actions and the on-screen keyboard, which are not covered in our dataset.
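The analysis above amounts to a cosine-similarity grid between one word embedding and the embedding of each discretized screen position. A minimal sketch with random stand-in embeddings (the real ones come from the trained model):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
word_vec = rng.normal(size=8)               # e.g. embedding of the word "top"
position_grid = rng.normal(size=(4, 4, 8))  # one position embedding per grid cell

# Heatmap: similarity of the word to every screen-position embedding.
heatmap = np.array([[cosine(word_vec, position_grid[i, j])
                     for j in range(4)]
                    for i in range(4)])
```

Plotting `heatmap` (e.g. with matplotlib's `imshow`) yields the kind of visualization shown in Figure 5.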
The phrase extraction model passes phrase tuples to the grounding model. When phrase extraction is incorrect, it can be difficult for the grounding model to predict a correct action. One way to mitigate such cascading errors is to use the hidden state of the phrase decoding model at each step, q_j. Intuitively, q_j is computed with access to the encoding of each token in the instruction via the Transformer encoder-decoder attention, which can potentially be a more robust span representation. However, in our early exploration, we found that grounding with q_j performs stunningly well on RICOSCA validation examples but poorly on PIXELHELP. The learned hidden state likely captures characteristics of the synthetic instructions and action sequences that do not manifest in PIXELHELP. As such, using the hidden state for grounding remains a challenge when learning from unpaired instruction-action data.
The phrase model failed to extract correct steps for 14 tasks in PIXELHELP. In particular, it produced extra steps for 11 tasks and extracted incorrect steps for 3 tasks, but did not skip steps for any task. These errors could be caused by the different language styles of the three datasets. Synthesized commands in RICOSCA tend to be brief. Instructions in ANDROIDHOWTO tend to give more contextual description and involve diverse language styles, while PIXELHELP often has a more consistent language style and gives concise descriptions for each step.

Related Work
Previous work (Branavan et al., 2009, 2010; Liu et al., 2018; Gur et al., 2019) investigated approaches for grounding natural language on desktop or web interfaces. Manuvinakurike et al. (2018) contributed a dataset for mapping natural language instructions to actionable image editing commands in Adobe Photoshop. Our work focuses on a new domain of grounding natural language instructions into executable actions on mobile user interfaces. This requires addressing modeling challenges due to the lack of paired natural language and action data, which we supply by harvesting rich instruction data from the web and synthesizing UI commands based on a large-scale Android corpus.
Our work is related to semantic parsing, particularly efforts for generating executable outputs such as SQL queries (Suhr et al., 2018).It is also broadly related to language grounding in the human-robot interaction literature where human dialog results in robot actions (Khayrallah et al., 2015).
Our task setting is closely related to work on language-conditioned navigation, where an agent executes an instruction as a sequence of movements (Chen and Mooney, 2011; Mei et al., 2016; Misra et al., 2017; Anderson et al., 2018; Chen et al., 2019). Operating user interfaces is similar to navigating the physical world in many ways. A mobile platform consists of millions of apps, each of which is implemented independently by different developers. Though platforms such as Android strive for interoperability (e.g., using Intent or AIDL mechanisms), apps are more often than not built by convention and do not expose programmatic ways for communication. As such, each app is opaque to the outside world, and the only way to manipulate it is through its GUIs. These hurdles while working with a vast array of existing apps are like physical obstacles that cannot be ignored and must be negotiated contextually in their given environment.

Conclusion
Our new datasets, models and results provide an important first step on the challenging problem of grounding natural language instructions to mobile UI actions. Our decomposition of the problem means that progress on either part can improve full task performance. For example, action span extraction is related to both semantic role labeling (He et al., 2018) and extraction of multiple facts from text (Jiang et al., 2019), and could benefit from innovations in span identification and multitask learning. Reinforcement learning, as applied in previous grounding work, may help improve out-of-sample prediction for grounding in UIs and improve direct grounding from hidden state representations. Lastly, our work provides a technical foundation for investigating user experiences in language-based human-computer interaction.

Appendix: Span Representation Computation
We compute the representation of each span by using summed area tables. The TensorFlow implementation of the representation is available on GitHub. Algorithm 1 gives the recipe for Start-End Concat (Lee et al., 2016) using Tensor operations. The advanced form (Lee et al., 2017) takes two other features: the weighted sum over all the token embedding vectors within each span, and a span length feature. The span length feature is trivial to compute in constant time. However, computing the weighted sum of each span can be time-consuming if not carefully designed. We decompose the computation into a set of summation-based operations (see Algorithms 2 and 3) so as to use summed area tables (Szeliski, 2010), which were used in Li et al. (2019) for constant-time computation of span representations. These pseudocode definitions are designed based on Tensor operations, which are highly optimized and fast.
â = argmax_{ā∈Ā} Π_j p(ā_j | ā_{<j}, t_{1:n})   (5)

This defines the action phrase-extraction model, which is then used by the grounding model:

Figure 2: PIXELHELP example: Open your device's Settings app. Tap Network & internet. Click Wi-Fi. Turn on Wi-Fi. The instruction is paired with actions, each of which is shown as a red dot on a specific screen.

Figure 3: The Phrase Tuple Extraction model encodes the instruction's token sequence and then outputs a tuple sequence by querying into all possible spans of the encoded sequence. Each tuple contains the span positions of three phrases in the instruction that describe the action's operation, object and optional arguments, respectively, at each step. ∅ indicates the phrase is missing in the instruction and is represented by a special span encoding.

Figure 4: The Grounding model grounds each phrase tuple extracted by the Phrase Extraction model as an operation type, a screen-specific object ID, and an argument if present, based on a contextual representation of UI objects for the given screen. A grounded action tuple can be automatically executed.
b and d are the start and end indices of the object description ô_j. θ_o and W_o are trainable parameters with W_o ∈ R^{|φ_o|×|o|}, where |φ_o| is the output dimension of φ(·, θ_o) and |o| is the dimension of the latent representation of the object description.

Figure 5: Correlation between location-related words in instructions and object screen position embedding.

Figure 6: The web interface for annotators to label action phrase spans in an ANDROIDHOWTO instruction.

Figure 7: The distribution of the number of tokens in ANDROIDHOWTO instructions.

Figure 8: The distribution of the number of steps involved in ANDROIDHOWTO instructions.

Figure 9: The distribution of the length of an object description in ANDROIDHOWTO instructions.