FLIN: A Flexible Natural Language Interface for Web Navigation

AI assistants can now carry out tasks for users by directly interacting with website UIs. Current semantic parsing and slot-filling techniques cannot flexibly adapt to many different websites without being constantly re-trained. We propose FLIN, a natural language interface for web navigation that maps user commands to concept-level actions (rather than low-level UI actions), thus being able to flexibly adapt to different websites and handle their transient nature. We frame this as a ranking problem: given a user command and a webpage, FLIN learns to score the most relevant navigation instruction (involving action and parameter values). To train and evaluate FLIN, we collect a dataset using nine popular websites from three domains. Our results show that FLIN was able to adapt to new websites in a given domain.


Introduction
AI personal assistants, such as Google Assistant, can now interact directly with the UI of websites to carry out human tasks (Tech Crunch, 2019). Users issue commands to the assistant, which executes them by typing, selecting items, clicking buttons, and navigating to different pages of the website. Such an approach is appealing as it can reduce the dependency on third-party APIs and expand an assistant's capabilities. This paper focuses on a key component of such systems: a natural language (NL) interface capable of mapping user commands (e.g., "find an Italian restaurant for 7pm") into navigation instructions that a web browser can execute.
One way to implement such an NL interface is to map user commands directly into low-level UI actions (button clicks, text inputs, etc.). The UI elements appearing in a webpage are embedded by concatenating their DOM attributes (tag, classes, text, etc.). Then, a scoring function or a neural policy (Liu et al., 2018a) is trained to identify which UI element best supports a given command. Learning at the level of UI elements is effective, but only in controlled (UI elements do not change over time (Shi et al., 2017a)) or restricted (single applications (Branavan et al., 2009)) environments. This is not the case on the "real" web, where (i) websites are constantly updated, and (ii) a user may ask an assistant to execute the same task on any website of their choice (e.g., ordering pizza with dominos.com or pizzahut.com). The transient nature and diversity of the web call for an NL interface that can flexibly adapt to environments with a variable and unknown set of actions, without being constantly re-trained.

* Work done while interning at Microsoft Research.
To achieve this goal, we take two steps. First, we conceptualize a new way of designing NL interfaces for web navigation. Instead of mapping user commands into low-level UI actions, we map them into meaningful "concept-level" actions. Concept-level actions are meant to express what a user perceives when glancing at a website UI. In the example shown in Figure 1, the homepage of OpenTable has a concept-level action "Let's go" (where "Let's go" is the label of a search button), which represents the concept of searching for something, which can be specified using various parameters (a date, a time, a number of people and a search term). Intuitively, websites in a given domain (say, all restaurant websites) share semantically-similar concept-level actions, and the semantics of a human task tend to be time invariant. Hence, learning at the level of concept-level actions can lead to a more flexible NL interface. However, while concept-level actions vary less than raw UI elements, they still manifest with different representations and parameter sets across websites. Searching for a restaurant on opentable.com, for example, corresponds to an action "Let's go" which supports up to four parameters; on yelp.com, the same action is called "search" and supports two parameters (search term and location). Websites in one domain may also have different action types (e.g., making a table reservation vs. ordering food).

Figure 1: Web task execution driven by NL commands in the OpenTable website. The user command is mapped to the concept-level action "Let's go", whose execution causes the transition from the home page to the search results page.
Our second insight to tackle this problem is to leverage semantic parsing methods in a novel way. Traditional semantic parsing methods (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2007; Branavan et al., 2009; Lau et al., 2009; Thummalapenta et al., 2012) deal with environments that have a fixed and known set of actions, and hence cannot be directly applied. Instead, we propose FLIN, a new semantic parsing approach where, instead of learning how to map NL commands to executable logical forms (as in traditional semantic parsing), we leverage the semantics of the symbols (names of actions/parameters and parameter values) contained in the logical form (the navigation instruction) to learn how to match it against a given command. Specifically, we model the semantic parsing task as a ranking problem. Given an NL command $c$ and the set of actions $A$ available in the current webpage, FLIN scores the actions with respect to $c$. Simultaneously, for each parameter $p$ of an action, it extracts a phrase $m$ from $c$ that expresses a value of $p$, and then scores $p$'s values with respect to $m$ to find the best value assignment for $p$. Each action, with its associated list of parameter value assignments, represents a candidate navigation instruction to be ranked. FLIN learns a net score for each instruction based on the corresponding action and parameter value assignment scores, and outputs the highest-scored instruction as the predicted navigation instruction.
To collect a dataset for training and testing FLIN, we built a simple rule-based Action Extractor tool that extracts concept-level actions along with their parameters (names and values, if available) from webpages. The implementation and evaluation of this tool are out of scope for this paper. In a complete system, illustrated in Figure 1, we envision the Action Extractor extracting and passing the concept-level actions present in the current webpage to FLIN, which computes a candidate navigation instruction $N$ to be executed by an Action Executor (e.g., a web automation tool such as Selenium (2020) or Ringer (Barman et al., 2016)). Overall, we make the following contributions: (1) we conceptualize a new design approach for NL interfaces for web navigation based on concept-level actions; (2) we build a match-based semantic parser to map NL commands to navigation instructions; and (3) we collect a new dataset based on nine websites (from the restaurant, hotel and shopping domains) and provide empirical results that verify the generalizability of our approach. Code and dataset are available at https://github.com/microsoft/flin-nl2web.
Related Work

Work on NL-guided web task execution includes learning from demonstrations (Allen et al., 2007), building reinforcement learning agents (Shi et al., 2017b; Liu et al., 2018a), training sequence-to-sequence models to map natural language commands into web APIs (Su et al., 2017, 2018), and generating task flows from APIs (Williams et al., 2019). These techniques assume different problem settings (e.g., reward functions) or deal with low-level web actions or API calls. Unlike FLIN, they do not generalize across websites. Prior work has also proposed an embedding-based matching model to map natural language commands to low-level UI actions such as hyperlinks, buttons, menus, etc. Unlike FLIN, that work does not deal with predicting parameter values (i.e., actions are un-parametrized).
Models that jointly perform intent detection and slot filling (Guo et al., 2014; Liu and Lane, 2016; Chen et al., 2019) are not applicable to our problem for three reasons. First, they are trained on a per-application basis using application-specific intent and slot labels, and thus cannot generalize across websites. Second, they semantically label words in an utterance but do not perform value assignment, hence they cannot output executable navigation paths. Third, they perform multi-class classification (i.e., they assume only one intent to be true for a user query) and have no notion of state (e.g., the current webpage), so they cannot deal with intents with overlapping semantics, which may occur across pages of the same website (e.g., in the example shown in Figure 1, the same user query may map to the action "Let's go" on OpenTable's home page or to the action "Find a Table" on the search results page).

Problem Formulation
Let $A_w = \{a_1, a_2, ..., a_n\}$ be the set of concept-level actions available in a webpage $w$. Each action $a \in A_w$ is defined by an action name $n_a$ and a set of $K$ parameters $P_a = \{p_1, p_2, ..., p_K\}$. Each parameter $p \in P_a$ is defined by a name and a domain $dom(p)$ (i.e., a set of values that can be assigned to parameter $p$), and can be either closed domain or open domain.
For closed-domain parameters, the domain is bounded and consists of a finite set of values that $p$ can take; the set is imposed by the website UI, such as the available colors and sizes for a product item or the available reservation times for a restaurant.

For open-domain parameters, the domain is, in principle, unbounded but, in practice, consists of all words/phrases that can be extracted from an NL command $c$. With reference to Figure 1, the "Let's go" (search) action has $n_a$ = "let's go" and $P_a$ = {"time", "date", "people", "location, restaurant, or cuisine"}. The first three parameters are closed domain and the last one (the search term) is open domain. The Action Extractor module (Figure 1) names actions and parameters after labels and texts appearing in the UI (or, if absent, using DOM attributes); it also automatically scrapes the values of closed-domain parameters (from drop-down menus or select lists).
Given the above setting, our goal is to map an NL command $c$ issued in $w$ into a navigation instruction $N$, consisting of the correct action name $n_{a^*}$ corresponding to an action $a^* \in A_w$ and an associated list of $m \le |P_{a^*}|$ correct parameter-value assignments:

$$N = \big(n_{a^*},\ [(p_1, v_1), \dots, (p_m, v_m)]\big), \quad p_i \in P_{a^*},\ v_i \in dom(p_i)$$
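The formulation above can be sketched with a few simple data structures. This is an illustrative sketch only: the class names (`Parameter`, `Action`, `NavigationInstruction`) and the toy domain values are our own, not FLIN's actual code.

```python
# Minimal sketch of the problem formulation's objects. A closed-domain
# parameter carries a finite value set; an open-domain one has domain=None.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Parameter:
    name: str
    domain: Optional[frozenset] = None  # None => open-domain

    @property
    def is_closed_domain(self) -> bool:
        return self.domain is not None

@dataclass
class Action:
    name: str            # n_a
    parameters: list     # P_a

@dataclass
class NavigationInstruction:
    action_name: str     # n_a*
    assignments: dict    # parameter name -> assigned value (m <= |P_a*| entries)

# The "Let's go" search action from the OpenTable example (Figure 1),
# with illustrative (not scraped) domain values:
lets_go = Action("let's go", [
    Parameter("time", domain=frozenset({"6:00 PM", "7:00 PM", "8:00 PM"})),
    Parameter("date", domain=frozenset({"today", "tomorrow"})),
    Parameter("people", domain=frozenset({"1 person", "2 people", "3 people"})),
    Parameter("location, restaurant, or cuisine"),  # open-domain search term
])

instruction = NavigationInstruction(
    "let's go",
    {"people": "2 people", "location, restaurant, or cuisine": "Italian"},
)
```

A command like "find an Italian restaurant for me and my friend" would then be mapped to `instruction` above.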
The FLIN Model
The task of solving the above semantic parsing problem can be decomposed into two sub-tasks: (i) action recognition, i.e., recognizing the action $a \in A_w$ intended by $c$, and (ii) parameter recognition and value assignment, i.e., deciding whether a parameter of an action is expressed in $c$ and, if so, assigning a value to that parameter. A parameter is expressed in $c$ by a mention (a word or phrase). For example, in Figure 1, "me and my friend" is a mention of parameter "people" in $c$, and a correct parsing should map it to the domain value "2 people". Thus, the second sub-task involves first extracting a mention of a given parameter from $c$, and then matching it against a set of domain values to find the correct value assignment. For an open-domain parameter, the extracted mention becomes the value of the parameter and no matching is needed; e.g., in Figure 1, the mention "Italian" should be assigned to the parameter "location, restaurant, or cuisine".

Figure 2: Architecture of the FLIN model.

With reference to Figure 2, FLIN consists of four components, designed to solve the aforementioned two sub-tasks: (1) Action Scoring, which scores each available action with respect to the given command (§4.1); (2) Parameter Mention Extraction, which extracts the mention (phrase) from the command for a given parameter (§4.2); (3) Parameter Value Scoring, which scores a given mention against a closed-domain parameter value, or rejects it if no domain value can be mapped to the mention (§4.3); and (4) Inference (not shown in the figure), which uses the scores of actions and parameter values to infer the action-parameter-value assignment with the highest score as the predicted navigation instruction (§4.4).

Action Scoring
Given a command $c$, we score each action $a \in A_w$ to measure the similarity between $a$ and $c$'s intent. We loop over the actions in $A_w$ and their parameters to obtain a list of action name and parameter-set pairs $(n_a, P_a)$, and then score them with respect to $c$.
To score each pair $(n_a, P_a)$, we learn a neural-network-based scoring function $S_a(\cdot)$ that computes its similarity with $c$. We represent $c$ as a sequence $\{w_1, w_2, ..., w_R\}$ of $R$ words. To learn a vector representation of $c$, we first convert each $w_i$ into a corresponding one-hot vector $x_i$, and then embed each word using an embedding matrix $E_w \in \mathbb{R}^{|V| \times d}$, where $V$ is the word vocabulary and $d$ the embedding size. Next, given the word embedding vectors $\{v_i \mid 1 \le i \le R\}$, we learn forward and backward representations using a Bi-LSTM network (Schuster and Paliwal, 1997); the final hidden states of the forward and backward LSTMs are concatenated to obtain the command representation $v_c \in \mathbb{R}^{2d}$.

Next, we learn a vector representation of $(n_a, P_a)$. We use the same word embedding matrix $E_w$ and Bi-LSTM layer to encode the action name $n_a$ into a vector $v_a = BiLSTM(n_a)$. Similarly, we encode each parameter $p \in P_a$ into a vector $v_p = BiLSTM(p)$, and compute the net parameter semantics of action $a$ as the mean of the parameter vectors ($\bar{v}_p = mean\{v_p \mid p \in P_a\}$). Finally, to learn the overall semantic representation of $(n_a, P_a)$, we concatenate $v_a$ and $\bar{v}_p$ and learn a combined representation using a feed-forward (FF) layer:

$$v_{ap} = \tanh\big(W_a^{\top} [v_a; \bar{v}_p] + b_a\big)$$

where $W_a \in \mathbb{R}^{4d \times 2d}$ and $b_a \in \mathbb{R}^{2d}$ are the weights and biases of the FF layer, respectively.
Given $v_c$ and $v_{ap}$, we compute the intent similarity between $c$ and $(n_a, P_a)$ using cosine similarity:

$$S_a(c, n_a, P_a) = \frac{v_c \cdot v_{ap}}{\|v_c\|\,\|v_{ap}\|}$$

where $\|\cdot\|$ denotes the Euclidean norm of a vector. $S_a(\cdot) \in [0, 1]$ is computed for each $a \in A_w$ (and is used in inference, §4.4).
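As a toy numeric illustration of this scoring step, the cosine similarity can be computed as below; in FLIN the two vectors are the learned Bi-LSTM/FF representations, whereas here they are arbitrary lists of numbers.

```python
# Plain cosine similarity between two dense vectors, as used to compare
# the command representation v_c with the action representation v_ap.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

For example, identical vectors score 1.0 and orthogonal vectors score 0.0.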
The parameters of $S_a(\cdot)$ are learned by minimizing a margin-based ranking objective $L_a$, which encourages the score of each positive $(n_a, P_a)$ pair to be higher than those of negative pairs in $w$:

$$L_a = \sum_{q^+ \in Q^+} \sum_{q^- \in Q^-} \max\big(0,\ \gamma - S_a(c, q^+) + S_a(c, q^-)\big) \quad (3)$$

where $\gamma$ is a margin hyper-parameter, $Q^+$ is a set of positive $(n_a, P_a)$ pairs in $w$, and $Q^-$ is a set of negative $(n_a, P_a)$ pairs obtained by randomly sampling action name and parameter pairs (not in $Q^+$) in $w$.
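A minimal sketch of this hinge-style ranking objective over pre-computed scores follows; the margin value used here is an assumption (the paper does not report it), and in FLIN the scores would be differentiable model outputs rather than plain floats.

```python
# Margin-based ranking loss: each positive score should exceed each
# negative score by at least `gamma`, otherwise a linear penalty accrues.
def margin_ranking_loss(pos_scores, neg_scores, gamma=0.5):
    loss = 0.0
    for sp in pos_scores:
        for sn in neg_scores:
            loss += max(0.0, gamma - sp + sn)
    return loss
```

When positives already outscore negatives by more than the margin, the loss is zero.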

Parameter Mention Extraction
Given a command c and a parameter p, the goal of this step is to extract the correct mention m p of p from c. In particular, we aim to predict the text span in c that represents m p . We formulate this task as a question-answering problem, where we treat p as a question, c as a paragraph, and m p as the answer. We fine-tune a pre-trained BERT (Devlin et al., 2019) model 3 to solve this problem.
As shown in Figure 2 (bottom-right), we represent $p$ and $c$ as a pair of sentences packed together into a single input sequence of the form [CLS] parameter name [SEP] command [SEP], tokenized with WordPiece (Wu et al., 2016). From BERT (we use the pre-trained uncased BERT-base model from https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1), we obtain an output token embedding $T_i$ for each token $i$ in the packed sequence. We only introduce a mention start vector $S \in \mathbb{R}^H$ and a mention end vector $E \in \mathbb{R}^H$ during fine-tuning. The probability of word $i$ being the start of the mention is computed as a dot product between $T_i$ and $S$ followed by a softmax over all of the words in $c$:

$$P_i = \frac{e^{S \cdot T_i}}{\sum_j e^{S \cdot T_j}}$$

The analogous formula with $E$ is used to compute the end probability $P_j$. The position $i$ (position $j > i$) with the highest start (end) probability is predicted as the start (end) index of the mention, and the corresponding tokens in $c$ are combined into a word sequence to extract $m_p$. We fine-tune BERT by minimizing the sum of the negative log-likelihoods of the correct start and end positions. We train BERT to output [CLS] as $m_p$ if $p$ is not expressed in $c$ (no mention is identified, hence $p$ is discarded from being predicted).
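The span-selection step can be sketched as below with toy token embeddings standing in for real BERT outputs; the function names and the greedy argmax-with-`end >= start` tie-breaking are our simplifying assumptions.

```python
# Sketch of BERT-style span prediction: dot each token embedding T_i with a
# start vector S and an end vector E, softmax over the tokens, then pick the
# best start index and the best end index at or after it.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def predict_span(token_embs, S, E):
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    p_start = softmax([dot(S, t) for t in token_embs])
    p_end = softmax([dot(E, t) for t in token_embs])
    i = max(range(len(p_start)), key=p_start.__getitem__)       # best start
    j = max(range(i, len(p_end)), key=p_end.__getitem__)        # best end >= start
    return i, j
```

The returned `(i, j)` indices delimit the tokens that form the mention $m_p$.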

Parameter Value Scoring
Once the mention $m_p$ is extracted for a closed-domain parameter $p$, we learn a neural-network-based scoring function $S_p(\cdot)$ to score each value $v \in dom(p)$ with respect to $m_p$. If $p$ is open-domain, parameter value scoring is not needed.
The process is similar to that of action scoring but, in addition to word-level semantic similarity, we also compute character-level and lexical-level similarity between $v$ and $m_p$. In fact, $v$ and $m_p$ often match only partially at the lexical level. For example, given the domain value "7:00 PM" for the parameter "time", possible mentions are "7 in the evening", "19:00 hrs", "at 7 pm", etc., where partial lexical-level similarity is observed. However, learning word-level and character-level semantic similarities is also important, as "PM" and "evening" (as well as "7:00 PM" and "19:00") are lexically distant but semantically close.
Word-level semantic similarity. We use the same word embedding matrix $E_w$ used in action scoring to learn the word vectors for both $m_p$ and $v$. We use a Bi-LSTM layer (not shared with Action Scoring) to encode the mention (value) into a word-level representation vector $v^{wd}_m$ ($v^{wd}_v$). We compute the word-level similarity between $m_p$ and $v$ as

$$S^{wd}_p(m_p, v) = \cos(v^{wd}_m, v^{wd}_v)$$

Character-level semantic similarity. We use a character embedding matrix $E_c$ to learn the character vectors for each character composing the words in $m_p$ and $v$. To learn the character-level vector representation $v^{char}_m$ of $m_p$, we first learn the word vector for each word in $m_p$ by composing its character vectors in sequence using an LSTM network (Hochreiter and Schmidhuber, 1997), and then compose the word vectors for all mention words using a Bi-LSTM layer to obtain $v^{char}_m$. The character-level vector representation $v^{char}_v$ of $v$ is obtained similarly. Next, we compute the character-level similarity between $m_p$ and $v$ as

$$S^{char}_p(m_p, v) = \cos(v^{char}_m, v^{char}_v)$$

Lexical-level similarity. We use a fuzzy string matching score (based on the Levenshtein distance between the two sequences; we use the fuzzywuzzy library, pypi.org/project/fuzzywuzzy/) and a custom value matching score, computed as the fraction of words in $v$ that appear in $m_p$; we then compute a linear combination of these similarity scores (each score $\in [0, 1]$) as the net lexical-level similarity score, denoted $S^{lex}_p(m_p, v) \in [0, 1]$.

Net value-mention similarity score. It is the mean of the three scores above:

$$S_p(m_p, v) = \frac{1}{3}\big(S^{wd}_p + S^{char}_p + S^{lex}_p\big)$$

The parameters of $S_p(\cdot)$ are learned by minimizing a margin-based ranking objective $L_p$, which encourages the score $S_p(\cdot)$ of each mention and positive value pair to be higher than those of mention and negative value pairs for a given $p$; $L_p$ is defined analogously to $L_a$ (see Eq. 3).

Inference
The inference module takes the outputs of Action Scoring, Parameter Mention Extraction, and Parameter Value Scoring to compute a net score $S_{ap}(\cdot)$ for each action $a \in A_w$ and its associated list of parameter value assignment combinations. Then, it uses $S_{ap}(\cdot)$ to predict the navigation instruction.
Parameter value assignment. We first infer the value to be assigned to each $p \in P_a$. The predicted value $\hat{v}_p$ for a closed-domain $p$ is given by

$$\hat{v}_p = \underset{v \in dom(p)}{\arg\max}\ S_p(m_p, v), \quad \text{subject to } S_p(m_p, \hat{v}_p) \ge \rho$$

where $\rho$ is a threshold score (tuned empirically) for parameter value prediction.
While performing the value assignment for $p$, we consider $S_p(m_p, \hat{v}_p)$ as the confidence score for $p$'s assignment. If $S_p(m_p, v) < \rho$ for all $v \in dom(p)$, we consider the confidence score for $p$ to be 0, implying that $m_p$ refers to a value which does not exist in $dom(p)$; $p$ is discarded and no value assignment is done. If $p$ is an open-domain parameter, $m_p$ is inferred as $\hat{v}_p$ with a confidence score of 1. If all $p \in P_a$ are discarded from the prediction for a parametrized action $a \in A_w$, we discard $a$ as well, as $a$ is no longer executable in $w$.
Once we have the confidence scores of all value assignments for all $p \in P_a$, we compute the average confidence score $S_p(P_a)$ and consider it to be the net parameter value assignment score for $a$.
Navigation instruction prediction. Finally, we compute the overall score for a given action $a$ and its associated list of parameter value assignments as $S_{ap} = \alpha \cdot S_a(c, n_a, P_a) + (1 - \alpha) \cdot S_p(P_a)$, where $\alpha$ is an empirically-tuned linear combination coefficient. The predicted navigation instruction for command $c$ is the action and associated parameter value assignments with the highest $S_{ap}$ score.
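The whole inference step can be sketched as below, using the tuned values $\rho = 0.67$ and $\alpha = 0.4$ reported in §5.1; the function names and the dictionary representation of scores are our own.

```python
# Inference sketch: threshold closed-domain value scores at rho, average the
# surviving confidences, then combine with the action score via alpha.
def assign_values(param_scores, rho=0.67):
    # param_scores: {param_name: {candidate_value: S_p score}}
    assignments, confidences = {}, []
    for p, scores in param_scores.items():
        best_v, best_s = max(scores.items(), key=lambda kv: kv[1])
        if best_s >= rho:
            assignments[p] = best_v
            confidences.append(best_s)
        # else: p is discarded (mention maps to no in-domain value)
    return assignments, confidences

def net_score(action_score, confidences, alpha=0.4):
    s_p = sum(confidences) / len(confidences) if confidences else 0.0
    return alpha * action_score + (1 - alpha) * s_p
```

Running this for every candidate action and keeping the highest `net_score` yields the predicted navigation instruction.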

Evaluation
We evaluate FLIN on nine popular websites from three representative domains: (i) Restaurants (R), (ii) Hotels (H), and (iii) Shopping (S). We collect labelled datasets for each website (§5.1) and perform in-domain cross-website evaluation. Specifically, we train one FLIN model for each domain using one website, and test on the other two websites in the same domain. Ideally, a single model could be trained using the training data of all three domains and applied to all test websites, but we opt for domain-specific training/evaluation to better analyze how FLIN leverages the semantic overlap of concept-level actions (which exists across in-domain websites) to generalize to new websites. We forgo cross-domain evaluation because the semantics of actions and parameters do not significantly overlap across our three domains.

Experimental Setup
To train and evaluate FLIN, we collect two datasets: (i) WebNav consists of (English) command and navigation instruction pairs, and (ii) DialQueries consists of (English) user utterances extracted from existing dialogue datasets paired with navigation instructions.
To collect WebNav, given a website and a task it supports, we first identify which pages are related to the task. For example, in OpenTable we find 8 pages related to the task "making a restaurant reservation": the page for searching restaurants, for browsing search results, for viewing a restaurant's profile, for submitting a reservation, etc. Then, using our Action Extractor tool, we enumerate all actions present in each task-related page.
For each action, the extractor provides a name, parameters, and parameter values, if any. Action names are inferred from various DOM attributes (aria-label, value, placeholder, etc.) and from the text associated with the relevant DOM element. The goal of the Action Extractor is to label UI elements as humans see them. For example, the search box on the OpenTable website is called "Location, Restaurant or Cuisine" (rather than, say, "search input"), which is in fact the placeholder text associated with that input and what users see in the UI. Parameter values are scraped automatically from DOM select elements (e.g., the option value tag). We manually inspect the output of the Action Extractor and correct possible errors (e.g., missing actions). However, for every website we obtain a different action/parameter scheme: there is no generalized mapping between similar actions/parameters across websites, as building such a mapping would require significant manual effort. Table 1 reports the number of pages, actions, and parameters extracted for all websites used in our experiments.
With this data, we construct <page_name, action_name, [parameter_name]> triplets for all actions across all websites, and we ask two annotators to write multiple command templates corresponding to each triplet, with parameter names as placeholders. A command template may be "Book a table for <time>". For closed-domain parameters, the Action Extractor automatically scrapes their values from webpages (e.g., {12:00 pm, 12:15 pm, etc.} for the time parameter), and we ask annotators to provide paraphrases for them (e.g., "at noon"). For open-domain parameters, we ask annotators to provide example values (e.g., "pizza" for a restaurant search term). We assemble the final dataset by instantiating command templates with randomly-chosen parameter value paraphrases, and then split it into train, validation and test sets. Overall, we generate a total of 53,520 command and navigation instruction pairs. We use the train and validation splits of opentable.com, hotels.com and rei.com for model training. Table 1 summarizes the sizes of the train, validation and test splits for all websites.
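The template-instantiation step can be sketched as follows; the function name and the paraphrase-table format are illustrative, not the actual WebNav generation code.

```python
# Fill a command template's <param> placeholders with randomly chosen value
# paraphrases, recording the canonical value for the gold navigation instruction.
import random

def instantiate(template, paraphrases, rng=None):
    # paraphrases: {param: [(surface_form, canonical_value), ...]}
    rng = rng or random.Random(0)
    command, assignment = template, {}
    for param, options in paraphrases.items():
        surface, value = rng.choice(options)
        command = command.replace(f"<{param}>", surface)
        assignment[param] = value
    return command, assignment
```

For instance, instantiating "Book a table for <time>" with the paraphrase "at noon" of the value "12:00 pm" yields the labelled pair ("Book a table for at noon", {"time": "12:00 pm"}).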
The second dataset, DialQueries, consists of real user queries extracted from the SGD dialogue dataset (Datasets, 2020) and from the Restaurants, Hotels and Shopping "pre-built agents" of Dialogflow (dialogflow.com). We extract queries that are mappable to our websites' tasks and adapt them by replacing out-of-vocabulary mentions of restaurants and similar entities.

Training details. All hyper-parameters are tuned on the validation set. Batch size is 50. The number of training epochs is 7 for action scoring, 3 for parameter mention extraction, and 22 for parameter value scoring. One negative example is sampled for $Q^-$ (in Eq. 3) in every epoch. Dropout is 0.1. Hidden units and embedding size are 300. Learning rate is 1e-4. The regularization parameter is 0.001, $\rho = 0.67$ and $\alpha = 0.4$ (§4.4). The Adam optimizer (Kingma and Ba, 2014) is used for optimization. We use a Tesla P100 GPU and TensorFlow for the implementation.
Compared models. There is no direct baseline for this work, as related approaches differ in the type of output or problem settings. As discussed in §2, the embedding-based matching approach does not perform parameter recognition and value assignment. Liu et al. (2018a) require a reward function for neural policy learning. Joint intent detection and slot filling models perform multi-class classification and do not consider the current state (current webpage), and thus cannot deal with similar intents in different webpages; further, they perform slot filling (equivalent to parameter mention extraction) but not parameter value assignment, and are thus unable to output executable paths for web navigation.
Nonetheless, we compare FLIN against two of its variants that use its match-based semantic parsing approach but with the following differences: (i) FLIN-sem uses only word-level and character-level semantic similarity for parameter value scoring (no lexical similarity); and (ii) FLIN-lex uses only lexical-level similarity (no semantic similarity).

Evaluation metrics. We use accuracy (A-acc) to evaluate action prediction and average F1 score (P-F1) to evaluate parameter prediction performance. P-F1 is computed using the average parameter precision and parameter recall over test commands. Given a command, parameter precision is computed as the fraction of parameters in the predicted instruction which are correct, and parameter recall as the fraction of parameters in the gold instruction which are predicted correctly. If the predicted action is incorrect or no action has been predicted for a given test command, we consider both parameter precision and recall to be 0 for the command.
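These per-command metrics can be sketched directly from their definitions; the helper names below are ours, and instructions are represented as parameter-to-value dictionaries for simplicity.

```python
# Per-command parameter precision/recall (the inputs to P-F1), plus
# exact-match (EMA) and 100%-precision (PA-100) checks.
def param_precision_recall(pred, gold):
    # pred/gold: dicts of parameter -> assigned value
    correct = sum(1 for p, v in pred.items() if gold.get(p) == v)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 1.0
    return precision, recall

def is_exact_match(pred_action, gold_action, pred, gold):
    return pred_action == gold_action and pred == gold

def is_pa100(pred_action, gold_action, pred, gold):
    precision, _ = param_precision_recall(pred, gold)
    return pred_action == gold_action and precision == 1.0
```

A prediction that assigns only a correct subset of the gold parameters counts toward PA-100 but not toward EMA.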
We also compute (i) Exact Match Accuracy (EMA), defined as the percentage of test commands where the predicted instruction exactly matches the gold navigation instruction, and (ii) 100% Precision Accuracy (PA-100), defined as the percentage of test commands for which the parameter precision is 1.0 and the predicted action is correct, but parameter recall may be ≤ 1.0. As with parameter precision and recall, while computing EMA and PA-100, if the predicted action is incorrect or no action has been predicted for a test command, we consider both the exact match and PA-100 value to be 0 for that command. The P-F1 (parameter F1) score is computed as the harmonic mean of average parameter precision and average parameter recall (averaged over all test queries). Although we formulate the mapping problem as a ranking one, we do not consider standard ranking metrics such as mean average precision (MAP) or normalized discounted cumulative gain (NDCG) because FLIN outputs only one navigation instruction (instead of a ranked list), given that in a real web navigation system only one predicted action can be executed.

Performance Results

Table 2 reports the performance comparison of FLIN on the WebNav dataset. We evaluate both in-website (model trained and tested on the same website, 2nd-5th columns) and cross-website (model trained on one website and tested on a different one, 6th-13th columns) performance. FLIN and its two variants adapt relatively well to previously-unseen websites thanks to FLIN's match-based semantic parsing approach. FLIN achieves the best overall performance, and is able to adapt to new websites by achieving comparable (or higher) action accuracy (A-acc) and parameter F1 (P-F1) scores. Considering PA-100, in the Restaurants domain, 75.6% of commands in the training website (OpenTable), and 60.3% (bookatable) and 82.4% (yelp) of commands in the two test websites, are mapped into correct and executable actions (no wrong predictions).

PA-100 is generally high also for the other two domains. EMA is lower than PA-100, as it is much harder to predict all parameter value assignments correctly. Regarding the FLIN variants, both FLIN-sem and FLIN-lex generally perform worse than FLIN: by combining both lexical and semantic similarity, FLIN can be more accurate in its parameter value assignments and generalize better.

From a generalizability point of view, the most challenging domain is Hotels. While the performance of action prediction (A-acc) for Hotels is in the 45.5%-93.9% range, EMA is in the 14.6%-64.3% range. The drop is mainly due to commands with many parameters (e.g., check-in date, check-out date, number of rooms, etc.), definitely more than in the queries for the other two domains. Shopping is also more challenging than Restaurants.

We also test FLIN on the real user queries of the DialQueries dataset, available for three websites. As Table 3 shows, despite FLIN not being trained on DialQueries, overall its A-acc is above 50% and its PA-100 is above 46%, which demonstrates the robustness of FLIN in the face of new commands.

Error Analysis
We randomly sampled 135 wrongly-predicted WebNav commands (15 for each of the 9 websites) and classified them into 5 error types (see Table 4).
Overall, 13% of the failures were cases in which no action was predicted (e.g., for the command "only eight options", with ground truth "filter by size{'size'='8'}", no action was predicted). 29% of the failures were action mis-predictions (the predicted action did not match the gold action), mainly caused by multiple actions in the given webpage having overlapping semantics. E.g., the command "options for new york for just 6 people and 2 kids" got mapped to "select hot destination{'destination'='new york'}" instead of "find hotel{'adults'='6'; 'children'='2'; 'destination'='new york'}". Similarly, "apply the kids' shoes filter" got mapped to "filter by gender{'gender'='kids'}" instead of "filter by category{'category'='kids footwear'}".
Together with action mis-predictions, failures in identifying closed-domain parameters (third row in Table 4) were the most common, especially on hotel and shopping websites. This is because these in-domain websites tend to have a more diverse action and parameter space than restaurant websites, leading to action and parameter types that were not observed in the training data. For example, the search action in Hyatt has special rates and use points parameters not present in that of Hotels.com (the training site); and eBay has an action filter by style not present in Rei (the training site). Failures in predicting the value of a correctly-identified closed-domain parameter mention (fourth row in the table) were often due to morphological variations in parameter values not frequently observed in training (e.g., "8:00 in the evening" got mapped to the value '18:00' instead of '20:00').
Errors in extracting open-domain parameters were due to parameter names that are too generic (e.g., "search keyword"), extracted mentions only partially matching gold mentions (e.g., "hyatt" vs. the gold mention "hyatt regency grand cypress"), or multiple formats of the parameter values (e.g., various formats for phone numbers or zip codes).

Conclusion
To generalize to many websites, NL-guided web navigation assistants require an NL interface that can work with new website UIs without being re-trained each time. To this end, we proposed FLIN, a matching-based semantic parsing approach that maps user commands to concept-level actions. While various optimizations are possible, FLIN adapted well to new websites and delivered good performance. We have applied it to restaurant, shopping and hotel websites, but its design can extend to more domains.