Decoupling Structure and Lexicon for Zero-Shot Semantic Parsing

Building a semantic parser quickly in a new domain is a fundamental challenge for conversational interfaces, as current semantic parsers require expensive supervision and lack the ability to generalize to new domains. In this paper, we introduce a zero-shot approach to semantic parsing that can parse utterances in unseen domains while only being trained on examples in other source domains. First, we map an utterance to an abstract, domain-independent logical form that represents the structure of the logical form, but contains slots instead of KB constants. Then, we replace slots with KB constants via lexical alignment scores and global inference. Our model reaches an average accuracy of 53.4% on 7 domains in the OVERNIGHT dataset, substantially better than other zero-shot baselines, and performs as well as a parser trained on over 30% of the target domain examples.


Introduction
Semantic parsing, the task of mapping natural language utterances into executable logical forms, is a key paradigm in developing conversational interfaces (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Kwiatkowski et al., 2011). The recent success of conversational interfaces such as Amazon Alexa, Google Assistant, Apple Siri, and Microsoft Cortana has led to soaring interest in developing methodologies for training semantic parsers quickly in any new domain and from little data. Prior work focused on alleviating data collection by training from weak supervision (Clarke et al., 2010; Liang et al., 2011; Kwiatkowski et al., 2013), or developing protocols for fast data collection through paraphrasing (Berant and Liang, 2014; Wang et al., 2015) or a human-in-the-loop (Iyer et al., 2017). However, all these approaches rely on supervised training data in the target domain and ignore data collected previously for other domains.
Figure 1: A test utterance is delexicalized (1) and mapped to its abstract logical form (2). Slots ("$" variables) are then aligned to the abstract utterance (3), and are filled with the top assignment in terms of local and global scores (4). Logical forms throughout this paper are in λ-DCS (Liang, 2013).
In this paper, we propose an alternative, zero-shot approach to semantic parsing, where no labeled or unlabeled examples are provided in the target domain, but annotated examples from other domains are available. This is a challenging setup because, in semantic parsing, each dataset is associated with its own knowledge-base (KB), and thus all target domain KB constants (relations and entities) are unobserved at training time. Moreover, this is a natural use-case as more and more conversational interfaces are developed in multiple domains.
Our approach is motivated by recent work (Herzig and Berant, 2017; Su and Yan, 2017; Fan et al., 2017; Richardson et al., 2018) that showed that while the lexicon and KB constants in different domains vary, the structure of language composition repeats across domains. Therefore, we propose that by abstracting away the domain-specific lexical items of an utterance, we can learn to map the structure of an abstract utterance to an abstract logical form that does not include any domain-specific KB constants, using data from other domains only. Figure 1 illustrates this approach. A test utterance in the target domain is delexicalized and mapped to an abstract, domain-independent representation, where some content words are replaced by abstract tokens (step 1). Then, a structure-mapping model maps this representation into an abstract logical form that contains slots instead of KB constants (step 2). A major technical challenge at this point is to replace slots in the abstract logical form with KB constants from the target domain. We show that it is possible to learn a domain-independent lexical alignment model that aligns each slot to a word in the original utterance (step 3). This alignment, combined with a global inference procedure (step 4), allows one to find the best assignment of KB constants and produce a final logical form. Importantly, both of our models are trained from data in other domains only.
We show that our zero-shot framework parses 7 different unseen domains from the OVERNIGHT dataset with an average denotation accuracy of 53.4%. This result dramatically outperforms several natural baselines, and achieves the same result as training a parser on over 30% of the fully supervised target domain examples. To our knowledge, this work is the first to train a zero-shot semantic parser that can handle unseen domains. All our code is available at https://github.com/jonathanherzig/zero-shot-semantic-parsing.

Background
Neural Semantic Parsing. Sequence-to-sequence models (Sutskever et al., 2014) were recently proposed for semantic parsing (Jia and Liang, 2016; Dong and Lapata, 2016). In this setting, a sequence of input language tokens \(x_1, \ldots, x_m\) is mapped to a sequence of output logical tokens \(z_1, \ldots, z_n\). We briefly review the model by Jia and Liang (2016), which we use as part of our framework, and also as a baseline.
The encoder is a BiLSTM (Hochreiter and Schmidhuber, 1997) that converts \(x_1, \ldots, x_m\) into a sequence of context sensitive states. The attention-based decoder (Bahdanau et al., 2015; Luong et al., 2015) is an LSTM language model additionally conditioned on the encoder states. Formally, the decoder is defined by:
\[ p(z_j = w \mid x, z_{1:j-1}) \propto \exp(U[s_j, c_j]), \qquad s_{j+1} = \mathrm{LSTM}([\phi^{(out)}(z_j), c_j], s_j), \]
where \(s_j\) are decoder states, \(U\) and the embedding function \(\phi^{(out)}\) are the decoder parameters, and the context vector, \(c_j\), is the result of global attention (Luong et al., 2015). We also employ attention-based copying (Jia and Liang, 2016), but omit details for brevity.
Semantic Parsing over Multiple KBs. Recently, Herzig and Berant (2017), Su and Yan (2017) and Fan et al. (2017) proposed to exploit structural regularities in language across different domains. These works pooled together examples from multiple datasets in different domains, each corresponding to a separate KB, and trained a single sequence-to-sequence model over all examples, sharing parameters across domains. They showed that this substantially improves parsing accuracy. While these works implicitly capture linguistic regularities across domains, they rely on annotated data in the target domain. We, conversely, explicitly decouple structure mapping from the assignment of KB constants, and thus can tackle the zero-shot setting where no target domain examples are available. This is the focus of the next section.

Overview
Following the empirical success of sharing structural information between different semantic parsing domains, we propose in this paper to take a more radical approach and to explicitly decouple semantic parsing into a structure mapping model and a lexicon mapping model. We now provide an overview of our approach and explain how this decoupling facilitates zero-shot semantic parsing.
We assume access to D different source domains, where for every domain d we receive a KB \(K_d\) and a training set of pairs of utterances and logical forms. We further assume a lexicon L that maps each KB constant in \(K_d\) to a short phrase that describes it (e.g., L(PubYear) → "publication year"), as in Wang et al. (2015). Finally, we assume a pre-trained, static embedding function \(\phi(w) \in \mathbb{R}^f\) for every word w, used to measure cross-domain lexical semantic similarity. Our goal is to train a semantic parser that maps a new utterance x to the correct logical form z from a new domain \(d_{new}\) given \(K_{d_{new}}\). Figure 2 describes the flow of our training procedure: we first employ a simple rule-based method to transform training examples to an abstract representation, where content words (in utterances) and KB constants (in logical forms) are delexicalized. We then train the following two models that decouple structure from lexicon: (a) the structure mapper, which maps abstract utterances to abstract logical forms; (b) the aligner, which provides an alignment from abstract logical form tokens to abstract utterance tokens. Training the aligner is challenging because no gold alignments between the abstract utterance and abstract logical form are available. To overcome this challenge we propose a distillation strategy: we obtain noisy supervision by training a state-of-the-art unsupervised alignment model on the D source domains. Then, we train a second supervised alignment model that receives abstract utterances, abstract logical forms, and target noisy alignments as input and learns to predict the noisy alignments.
Once the two models are trained, we can tackle a new domain without training examples (Figure 1). Given an utterance from the target domain, we first abstract it using the delexicalizer, and then predict its abstract structure using the structure mapper. We treat delexicalized logical form tokens as slots to be filled with KB constants. Candidate assignments are then scored locally according to the semantic similarity of a KB constant (represented by its entry value in the lexicon L) to words the slot aligns to according to the aligner. For this we use the pre-trained embedding function φ(·) as the only cross-domain information. Finally, we choose a final assignment of KB constants by exactly maximizing a global scoring function, which takes into account both local alignment scores as well as global constraints.
[Figure 3. Lexical representation examples: "What meetings have no more than 3 attendees?" with logical form Type.Meeting R[λx.count(Attendee.x)].≤.3, and "Which recipe needs no more than two ingredients?"]
We next describe in detail the four components of our framework: the delexicalizer, structure mapper, aligner, and inference procedure.

Delexicalizer
The goal of the delexicalizer is to strip utterances and logical forms of their domain-specific components and preserve domain-independent parts. We note that it is possible that some words contain both domain-specific and domain-general aspects ("cheapest"). However, we conjecture that it is possible to decompose examples in a manner that enables zero-shot semantic parsing.
The output of the delexicalizer is an abstract representation that should manifest structural linguistic regularities across domains (Figure 3). For example, a comparative structure will correspond to the same abstract logical form in different domains. In this representation, used as input to our models, content words and KB constants are transformed to an abstract type. This rule-based preprocessing step is applied to all D source domain training examples (utterances and logical forms), and to target domain utterances at test time. We now describe the process of delexicalization.

Table 1
Source        Category   Abstract Type   Examples
Utterance     Noun       NOUN            "cuisines", "housing", "time"
Utterance     Verb       VERB            "published", "born", "posted"
Utterance     Adjective  ADJ             "high", "cooking", "monthly"
Utterance     Number     NUM             "4", "three"
Utterance     Date       DATE            "2018", "january 2nd"
Utterance     Entity     ENT             "midtown", "alice", "dinner"
Logical Form  Number

Utterances
Table 1 describes the full list of abstraction rules. We delexicalize several categories of content words and keep function words, which describe the utterance structure, in their lexicalized form. Specifically, any verb (see footnote 1) whose lemma is not "be" or "do" is delexicalized. All nouns are delexicalized, except for a small vocabulary of three words ("average", "total", and "number"), which denote a domain-general operation. Adjectives tend to distribute more evenly between domain-specific words and domain-general words, so discriminating between them is harder (e.g., "outdoor", "wide" and "cooking" are domain-specific words while "minimum", "same" and "many" are domain-general words). Thus, we take a statistical approach and only delexicalize adjectives that are unique to the domain (i.e., did not appear in the training set of any other source domain). We also delexicalize dates and numbers, and identify entities in the utterance by string matching against the entities in the KB. These are then delexicalized to their corresponding abstract type (Table 1).
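The utterance rules above can be illustrated with a minimal Python sketch. It assumes tokens arrive already tagged as (word, lemma, tag) triples with coarse tags (NOUN, VERB, ADJ, NUM, DATE, OTHER); the function name `delexicalize`, the tag set, and the single-token entity matching are simplifying assumptions made for illustration, not the released implementation.

```python
# Hypothetical sketch of the utterance abstraction rules; names and tags are illustrative.
KEPT_NOUNS = {"average", "total", "number"}   # domain-general nouns kept lexicalized
KEPT_VERB_LEMMAS = {"be", "do"}               # verbs with these lemmas are kept lexicalized


def delexicalize(tokens, other_domain_adjectives, kb_entities):
    """Map (word, lemma, tag) triples to a list of abstract utterance tokens."""
    out = []
    for word, lemma, tag in tokens:
        w = word.lower()
        if w in kb_entities:                       # string match against KB entities
            out.append("ENT")
        elif tag == "NOUN" and w not in KEPT_NOUNS:
            out.append("NOUN")
        elif tag == "VERB" and lemma not in KEPT_VERB_LEMMAS:
            out.append("VERB")
        elif tag == "ADJ" and w not in other_domain_adjectives:
            out.append("ADJ")                      # adjective unique to this domain
        elif tag in ("NUM", "DATE"):
            out.append(tag)
        else:
            out.append(w)                          # function words stay lexicalized
    return out
```

In this sketch an adjective that also occurs in another source domain (such as "more") falls through to the last branch and stays lexicalized, which mirrors the statistical rule for adjectives.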

Logical Forms
We delexicalize all KB constants to their abstract type, which is given as part of the KB schema (Table 1).

Structure Mapper
As a first step towards predicting the lexical logical form, we map an abstract utterance to an abstract logical form. The model is the neural semantic parser described in Section 2, only here the input and output are the abstract examples in all D domains, which the delexicalizer outputs. The model utilizes a single encoder-decoder pair shared across all domains. As Figure 3 suggests, the model should learn, e.g., that a noun modified by a wh-question often maps to $ENT TYPE, and that "no more than" maps to the ≤ operator.
Footnote 1: Numbers, dates and part-of-speech tags are extracted using Stanford CoreNLP.

Aligner
The output of the structure mapper is an abstract logical form that contains slots instead of KB constants. To predict a complete logical form, we must assign a KB constant to each slot.
We observe that the description of a KB constant that appears in the logical form (Article) is often semantically similar to some word in the utterance ("paper"). Thus, we can obtain signal for the identity of a KB constant by solving an alignment problem: each slot can be aligned to words in the utterance that have similar meaning to that of the gold KB constant. Naturally, in some cases a KB constant is not semantically similar to any utterance word (e.g., the relation Field in Figure 1), which we handle with a global inference procedure (Section 3.5).
Thus, our goal is to learn a model that, given an abstract utterance-logical form pair \((x^{abs}, z^{abs})\), produces an alignment matrix \(A\), where \(A_{ij}\) corresponds to the alignment probability \(p(x^{abs}_j \mid z^{abs}_i)\). A central challenge is that no gold alignments are provided in any domain. Therefore, we adopt a "distillation approach", where we train a supervised model over abstract examples to mimic the predictions of an unsupervised model that has access to the full lexicalized examples.
Specifically, we use a standard unsupervised word aligner (Dyer et al., 2013), which takes all lexicalized examples in all D domains and produces an alignment matrix \(A^*\) for every example, where \(A^*_{ij} = 1\) iff token i in the logical form is aligned to token j in the utterance. Then, we treat \(A^*\) as gold alignments and generate examples \((x^{abs}, z^{abs}, A^*)\) to train the aligner. Learning alignments over abstract representations is possible, as a slot in a specific context tends to align to specific types of abstract words (e.g., Figure 3 suggests that a relation that is aggregated often aligns to the NOUN that appears after the NUM in the abstract utterance).
We now present our alignment model, depicted in Figure 4. The model uses two different BiLSTMs to encode \(x^{abs}\) and \(z^{abs}\) to their context sensitive states \(b_1, \ldots, b_m\) and \(s_1, \ldots, s_n\) respectively. We model the alignment probability \(p_{align}(x^{abs}_j \mid z^{abs}_i)\) with a bilinear form similar to attention (Luong et al., 2015):
\[ p_{align}(x^{abs}_j \mid z^{abs}_i) = \frac{\exp(s_i^\top W b_j)}{\sum_{j'=1}^{m} \exp(s_i^\top W b_{j'})}, \]
where the parameters \(W\) are learned during training. We train the model to minimize the negative log-likelihood of gold alignments while considering only alignments of slots (since we only align slots at test time). The cross-entropy loss for a training example \((x^{abs}, z^{abs}, A^*)\) is then given by:
\[ -\sum_{i \in S_z} \sum_{j=1}^{m} A^*_{ij} \log p_{align}(x^{abs}_j \mid z^{abs}_i), \]
where \(S_z\) are the slot indices in \(z^{abs}\). Our model can be viewed as an attention model, dedicated to aligning logical form tokens to utterance tokens. Using a separate alignment model rather than the attention weights of the structure mapper has two advantages. First, alignments are generated given the entire generated sequence \(z^{abs}\) rather than just a prefix. Second, our model focuses its capacity on the alignment task without worrying about generation of \(z^{abs}\). In Section 4, we will demonstrate that training a dedicated aligner substantially improves performance.
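To make the aligner's scoring and training objective concrete, here is a small pure-Python sketch. It assumes the BiLSTM states are already computed and passed in as plain lists of floats; the function names `align_probs` and `slot_alignment_loss` are hypothetical, and summing the loss over slot rows without normalizing by the number of slots is an assumption of this sketch.

```python
import math


def align_probs(s, b, W):
    """Row i is a softmax over utterance positions j of the bilinear score s_i^T W b_j."""
    probs = []
    for s_i in s:
        # Project s_i through W once: v = s_i^T W.
        v = [sum(s_i[k] * W[k][d] for k in range(len(s_i))) for d in range(len(W[0]))]
        scores = [sum(v[d] * b_j[d] for d in range(len(b_j))) for b_j in b]
        top = max(scores)                          # subtract the max for numerical stability
        exps = [math.exp(x - top) for x in scores]
        total = sum(exps)
        probs.append([e / total for e in exps])
    return probs


def slot_alignment_loss(probs, gold, slot_indices):
    """Negative log-likelihood of gold alignments, summed over slot rows only."""
    loss = 0.0
    for i in slot_indices:
        for j, is_aligned in enumerate(gold[i]):
            if is_aligned:
                loss -= math.log(probs[i][j])
    return loss
```

Restricting the outer loop to `slot_indices` reflects the point made above: non-slot logical form tokens contribute nothing to the loss because only slots are aligned at test time.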

Inference
The aligner provides a distribution over utterance tokens for every slot in the abstract logical form. To compute the final logical form, we must replace each slot with a KB constant. Formally, let \((z^{abs}_{j_1}, \ldots, z^{abs}_{j_l})\) be the sequence of slots in \(z^{abs}\) and denote them for simplicity as \(y = (y_1, \ldots, y_l)\). Our goal is to predict a sequence of KB constants \(c = (c_1, \ldots, c_l)\), where each \(c_i\) is chosen from a candidate set \(C(y_i)\) that is determined by the abstract token \(y_i\) according to Table 1 (e.g., if \(y_i\) is $REL, then \(C(y_i)\) is the set of binary relations).
Our scoring function depends on alignments computed by the aligner. However, because slots are independent in the aligner, we introduce a few global constraints that capture the dependence between different slots. Formally, we wish to find \(c^*\) that maximizes the following scoring function, which depends on the utterance \(x\), the slot sequence \(y\), the abstract logical form \(z^{abs}\), the alignment matrix \(A\) and the embedding function \(\phi\):
\[ c^* = \arg\max_{c} \left( \sum_{i=1}^{l} s_{local}(c_i, y_i) + s_{global}(c, z^{abs}) \right). \]
We now describe our scoring functions in detail.
Local Score. Because inference is applied only at test time, we have access to the lexicalized utterance and not only the abstract one. Thus, the aligner outputs a distribution over words for a slot y (e.g., in Figure 1, $REL DATE aligns with high probability to VERB, which corresponds to the word "published"). Each word, in turn, has different semantic similarity to each KB constant in \(C(y)\). Intuitively, we would like to assign a KB constant that has high similarity with words the slot is aligned to. Thus, we define \(s_{local}\) of a KB constant \(c_i\) for every slot \(y_i\) to be its expected semantic similarity under the alignment distribution:
\[ s_{local}(c_i, y_i) = \sum_{j=1}^{m} p_{align}(x^{abs}_j \mid y_i) \cdot sim_\phi(x_j, c_i). \]
We define \(sim_\phi(x_j, c_i)\) to be the cosine similarity between the embedding \(\phi(x_j)\) and the embedding \(\phi(c_i)\) (scaled to the range [0,1]), where \(\phi(c_i)\) is defined to be the average embedding of all words in \(L(c_i)\), that is,
\[ \phi(c_i) = \frac{1}{|L(c_i)|} \sum_{w \in L(c_i)} \phi(w). \]
Global Score. Utilizing only a local scoring function raises several concerns. First, slots are treated independently and dependencies between slots are ignored, which might result in a final logical form that is globally inconsistent. For example, we could generate the logical form Birthplace.ComputerScience, which is semantically dubious. Second, some KB constants do not align to any word in the utterance and appear in the logical form only implicitly. For example, the logical form in Figure 1 contains the Field relation; however, "field" is implicit in the utterance. Therefore, we define \(exe_K(z)\) to be true iff z executes against K without errors, and define a global score that prevents assignments c that result in a logical form z such that \(exe_K(z)\) is false. Moreover, we can use similar constraints to prevent logical forms that are highly unlikely according to our prior knowledge. Specifically, we define once(z) to be true iff each date, named entity, and number in the logical form z appears exactly once. We then define a global score that prevents logical forms in which once(z) is false. Empirically, we find such assignments to be mostly wrong (e.g., Type.Article (Field.QA Field.QA)).
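The local score described above, the expected similarity between a KB constant and the words its slot aligns to, can be sketched as follows. The embedding table `emb` is a plain dictionary standing in for the pre-trained vectors, and rescaling cosine similarity with (cos + 1) / 2 is an assumption of this sketch, since the text only states that similarities are scaled to [0,1]. All function names are hypothetical, and embeddings are assumed to be nonzero.

```python
import math


def cosine01(u, v):
    """Cosine similarity rescaled from [-1, 1] to [0, 1] (assumed rescaling)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 0.5 * (dot / norm + 1.0)


def constant_embedding(phrase, emb):
    """Average the word vectors of the lexicon phrase describing a KB constant."""
    vecs = [emb[w] for w in phrase.split()]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def local_score(align_row, words, phrase, emb):
    """Expected similarity between a KB constant and the words its slot aligns to."""
    c_vec = constant_embedding(phrase, emb)
    return sum(p * cosine01(emb[w], c_vec) for p, w in zip(align_row, words))
```

Here `align_row` is the aligner's distribution over utterance positions for one slot, and `words` are the lexicalized utterance tokens at those positions.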

Formally, our global scoring function is defined as:
\[ s_{global}(c, z^{abs}) = \begin{cases} 0 & \text{if } exe_K(z^{abs}|c) \text{ and } once(z^{abs}|c), \\ -\infty & \text{otherwise,} \end{cases} \]
where \(z^{abs}|c\) is the result of assigning the KB constants c to the slots in \(z^{abs}\).
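The hard constraints above can be sketched in Python as a function that returns 0 or negative infinity. Logical forms are represented as flat token lists, the executor is passed in as a callable `executes` (a stand-in for running the logical form against the KB), and `single_use` is the set of date, entity, and number constants; all of these representations and names are assumptions made for illustration.

```python
from collections import Counter

NEG_INF = float("-inf")


def fill_slots(z_abs, slot_positions, constants):
    """Return a copy of the abstract logical form with slots replaced by KB constants."""
    z = list(z_abs)
    for pos, c in zip(slot_positions, constants):
        z[pos] = c
    return z


def once(z, single_use):
    """True iff no date, entity, or number constant occurs more than once in z."""
    counts = Counter(tok for tok in z if tok in single_use)
    return all(n == 1 for n in counts.values())


def global_score(constants, z_abs, slot_positions, single_use, executes):
    """0.0 when the filled logical form executes and passes once(), else -inf."""
    z = fill_slots(z_abs, slot_positions, constants)
    return 0.0 if executes(z) and once(z, single_use) else NEG_INF
```

Because the score is either 0 or negative infinity, it never changes the ranking among assignments that satisfy the constraints; it only removes the ones that violate them.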
Inference Algorithm. While each local scoring function can be efficiently maximized independently, the global constraints that depend on the entire assignment c make inference more complicated. However, because the global scoring function introduces hard constraints, an exact and efficient inference algorithm is still possible. Our inference algorithm generates solutions one-by-one sorted by the local scoring function only. Then, it checks for each one whether it satisfies the global constraints defined by s global , and stops once a satisfying solution is found, which is guaranteed to maximize our scoring function. While in the worst case, this procedure is exponential in the size of c, in practice solutions are found after only a few steps. We also always halt after T steps if a solution has not been found.
Algorithm 1 describes the details of our inference procedure. We define cands to be a data structure that contains l lists of candidate KB constants (a list for each slot), sorted according to the local scoring function \(s_{local}\) in descending order. Additionally, getAssign(cands, a) is a function that accesses cands, and retrieves the assignment with indices a. For example, getAssign(cands, \(\{0\}^l\)) retrieves the top scoring local assignment. Last, we define \(a^{inc(i)}\) to be the indices a, where \(a_i\) is incremented by 1.

Algorithm 1 Exact inference algorithm
Input: cands, T
Output: c* - the top scoring assignment
    horizon ← ∅    (max heap)
    a_init ← {0}^l
    push(horizon, a_init)
    for t ← 1 to T do
        a ← pop(horizon)
        c ← getAssign(cands, a)
        if s_global(c, z^abs) = 0 then
            return c
        for i ← 1 to l do
            push(horizon, a^inc(i))
    return NULL

The algorithm proceeds as follows. First we initialize a maximum heap horizon into which we will dynamically push candidate assignments. Then, we iteratively pop the best current assignment from the heap, and check if it satisfies the global constraints. If it does, we return this assignment and stop. Otherwise, we generate the next possible candidates, one from each list (there is no need to add more than one because candidates are sorted). If no satisfying assignment is found after T steps, we return NULL. It is easy to show that when the algorithm returns an assignment it is guaranteed to be the one that maximizes our global scoring function.
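A runnable Python sketch of this best-first search is given below. It follows the steps of Algorithm 1 but makes a few assumptions of its own: candidates are (score, constant) pairs, the constraint check is passed in as a callable `satisfies`, and a `seen` set plus a bounds check are added so that the same index tuple is not pushed twice and indices never run past the end of a candidate list. Python's `heapq` is a min-heap, so scores are negated.

```python
import heapq


def best_first_assignment(cands, satisfies, max_steps):
    """cands[i] is a list of (score, constant) pairs for slot i, sorted by score descending."""
    def total(idx):
        return sum(cands[i][k][0] for i, k in enumerate(idx))

    start = tuple(0 for _ in cands)
    horizon = [(-total(start), start)]      # negate scores to emulate a max heap
    seen = {start}
    for _ in range(max_steps):
        if not horizon:
            break                           # every candidate combination was tried
        _, idx = heapq.heappop(horizon)
        assignment = [cands[i][k][1] for i, k in enumerate(idx)]
        if satisfies(assignment):
            return assignment               # first hit is the best-scoring valid one
        for i in range(len(idx)):
            nxt = idx[:i] + (idx[i] + 1,) + idx[i + 1:]
            if nxt[i] < len(cands[i]) and nxt not in seen:
                seen.add(nxt)
                heapq.heappush(horizon, (-total(nxt), nxt))
    return None
```

Since each candidate list is sorted in descending order, every successor of a popped index tuple has a total score no larger than its parent, which is why the first assignment that satisfies the constraints is also the highest-scoring one.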

Experimental Setup
Data. We evaluated our method on the OVERNIGHT semantic parsing dataset, which contains 13,682 examples of language utterances paired with logical forms across eight domains, which were chosen to explore diverse types of language phenomena. As described, our approach depends on having linguistic regularities repeat across domains. However, two domains contain logical forms that are based on neo-Davidsonian semantics for treating events with multiple arguments. Since such logical forms are completely absent in six domains, it is not possible for our method to generalize to those in our zero-shot approach. Therefore, we do not evaluate on the BASKETBALL domain, in which 98% of the examples contain such logical forms, and omit all examples (68%) that contain such logical forms in the SOCIAL domain. We evaluated on the same train/test split as Wang et al. (2015), using the same accuracy metric, i.e., the proportion of questions for which the denotations of the predicted and gold logical forms are equal. We additionally used the lexicon L they provided with descriptions for KB constants.

Evaluated Models
We evaluated different models (Table 3) according to the following two attributes. First, whether the model is trained on target domain data (in-domain) or on source domain data only (cross-domain). Second, whether the model is the neural semantic parser described in Section 2 trained over the lexical data representation (lexical), or our model trained over the abstract representation (abstract).
As CROSSLEX cannot generate KB constants unseen during training, we additionally implemented CROSSLEXREP. In this model, we added an additional step that modifies the output of CROSSLEX: we replaced a generated KB constant with its most similar KB constant from the target KB that also shares its abstract type.
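The replacement step of CROSSLEXREP can be sketched as a nearest-neighbor lookup restricted by abstract type. The similarity function is passed in as a callable because the text does not specify how similarity between constants is computed; the function name `replace_constant`, its arguments, and the fallback when no constant of the same type exists are assumptions of this sketch.

```python
def replace_constant(generated, target_constants, abstract_type, similarity):
    """Swap a generated KB constant for the most similar target-KB constant of the same type."""
    same_type = [c for c in target_constants
                 if abstract_type[c] == abstract_type[generated]]
    if not same_type:
        return generated                      # nothing compatible: leave the token as is
    return max(same_type, key=lambda c: similarity(generated, c))
```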

Implementation Details
In all experiments, for our embedding function φ(·), we used pre-trained GloVe (Pennington et al., 2014) vectors with dimension 300. In a single experiment we considered one domain as the target domain, while other domains were the source domains (and repeated for all domains). For INLEX, CROSSLEX and CROSSLEXREP we used exactly the same experimental setup as Jia and Liang (2016). For our zero-shot model, we used 20% of the training data as a development set for tuning hyper-parameters. We first tuned parameters for the structure mapper, and used the best setting for tuning the aligner.
We provide the list of hyper-parameters and their values for our zero-shot framework. Structure mapper: number of epochs (22, using early stopping), hidden unit dimension (300), word vector dimension (100), learning rate (0.1 with SGD optimizer), L2 regularization (0.001). At test time, we used beam search with beam size 5, and then picked the highest-scoring logical form that we could infer an assignment for. Aligner: number of epochs (30, using early stopping), hidden unit dimension (250), word vector dimension (100), learning rate (0.0002 with Adam optimizer), dropout rate over hidden states (0.4). For both models, word vectors are updated during training. Inference: we used T = 500 steps, after which we halted.
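The test-time selection rule described above (take the highest-scoring beam candidate for which an assignment can be inferred) can be sketched in a few lines. The beam is assumed to be sorted best-first, and `infer` is a hypothetical callable that returns a list of KB constants or None when inference finds no valid assignment within T steps.

```python
def first_fillable(beam, infer):
    """Return the best-scoring abstract logical form whose slots can be filled."""
    for z_abs in beam:                 # beam is assumed sorted best-first
        constants = infer(z_abs)       # None when no valid assignment is found
        if constants is not None:
            return z_abs, constants
    return None
```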

Results
We trained all models above and evaluated on the test set for all seven domains. Results (Table 2) show that ZEROSHOT substantially outperforms other zero-shot baselines. CROSSLEX performs poorly, as it can only generate KB constants seen during training. CROSSLEXREP performs better, as it can generate KB constants from the target domain; however, it usually fails to generate the correct constant. This highlights the challenge in the zero-shot semantic parsing setting.
For baselines trained on target domain data, INLEX (a re-implementation of Jia and Liang (2016)) achieved an average accuracy of 74.8%, which is comparable to the 74.4% average accuracy they report on our seven domains. Training on the target domain with our method, INABSTRACT achieved 58.5% average accuracy, which shows that while the abstract representation in our framework loses some valuable information, it is still successful. Importantly, the performance of ZEROSHOT (53.4%) is only slightly lower than that of INABSTRACT, showing that our model degrades gracefully and generalizes well across domains compared to CROSSLEX.

Model Ablations
We now measure the effect of different components of our framework on denotation accuracy. We examined the effect of removing components completely, or replacing them with simpler ones. Thus, the ablated models can be viewed as additional baselines. Table 4 shows that ablating each of the components hurts performance. Discarding our two main technical contributions results in 31.2% accuracy compared to 54.5% in the full model. Performing inference with global constraints dramatically improves performance, showing that using the alignment model alone often results in incoherent logical forms. Our dedicated aligner also improves performance compared to alignments learned by the decoder of the structure mapper. This is pronounced without global constraints (a drop from 42.8% to 31.2%), but is less severe when global inference is used (a drop from 54.5% to 48.8%).
Intrinsic Analysis. While we evaluated performance above via denotation accuracy, we now evaluate our framework's modules with different metrics (on the development set). We evaluated the structure mapper by measuring the exact match of the top candidate in the beam to the gold abstract logical form (49.1%). We further evaluated the aligner by measuring alignment accuracy for top candidate alignments, in comparison to the unsupervised aligner output (72.9%).
Finally, we measured inference performance in the following ways. Inference succeeded within T steps in 70% of the cases (some predicted abstract logical forms are not syntactically valid), and took 3.67 steps on average when it succeeded. In addition, the fraction of correct global assignments given an abstract logical form that exactly matches the gold one is 77.0%. To conclude, results show that the structure mapping problem is harder than slot filling, for which we learned good alignments and performed fast and mostly accurate inference.
Valuation. To estimate the value of our zero-shot framework in terms of target domain examples, we plot a learning curve (Figure 5) that shows development set average accuracy for INLEX (trained on target domain data). In comparison, ZEROSHOT utilizes no target domain data, so its accuracy is fixed. As Figure 5 shows, our framework's value is equal to 30% of the target domain training data. In our setting this corresponds to 400 examples manually annotated with full logical forms. Note that this value is gained every time a semantic parser for a new domain is needed. Moreover, our parser can be used as an initial system, deployed to begin training from user interaction directly.
Limitations. We now outline some of the limitations of our approach for zero-shot semantic parsing. We hypothesized that language regularities repeat across domains; however, as mentioned above, neo-Davidsonian semantics occurs mostly in one domain in the OVERNIGHT dataset and thus we were not able to generalize to it. Our parser also obtained low accuracy in BLOCKS. This domain contains mostly spatial language, different from other domains in OVERNIGHT. Specifically, prepositions, which we did not delexicalize, map to relations in the KB (e.g., "below" and "above" map to the relations Below and Above). This shows the challenge involved in decomposing the structure from the lexicon with rules. In addition, since some spatial relations in this domain are semantically similar (Length, Width and Height), we found it hard to rank them correctly during inference. This stresses that in our framework, we assume KB constants to be sufficiently distinguishable in the pre-trained embedding space, which is not always the case.

Related Work
While zero-shot executable semantic parsing is still under-explored, some works focused on the open-vocabulary setting which handles unseen relations by replacing a formal KB with a probabilistic database learned from a text corpus (Choi et al., 2015; Gardner and Krishnamurthy, 2017).
Our abstract utterance representation is related to other attempts to generate intermediate representations that improve generalization, such as dependency trees (Reddy et al., 2016), syntactic CCG parses (Krishnamurthy and Mitchell, 2015), abstract templates (Abujabal et al., 2017; Goldman et al., 2018) or masked entities (Dong and Lapata, 2016). Our abstract logical form representation is similar to that used by Dong and Lapata (2018) to guide the decoding of the full logical form. The main difference with our work is that we focus on a comprehensive abstract representation tailored for zero-shot semantic parsing.
It is worth mentioning other work that inspected various aspects of zero-shot parsing. Bapna et al. (2017) focused on frame semantic parsing, and assumed that relations appear across different domains to learn a better mapping in the target domain. Also in frame semantic parsing, Ferreira et al. (2015) utilized word embeddings to map words to unseen KB relations. Finally, Lake and Baroni (2017) inspected whether neural semantic parsers can handle types of compositionality that were unseen during training. The main difference between their work and ours is that we focus on a scenario where a compositional logical form is generated, but the target KB constants do not appear in any of the source domains.

Conclusion
In this paper we address the challenge of zero-shot semantic parsing. We introduce a model that can parse utterances in unseen domains by decoupling structure mapping from lexicon mapping, and demonstrate its success on 7 domains from the OVERNIGHT dataset.
In future work, we would like to automatically learn a delexicalizer from data, tackle zero-shot parsing when the structure distribution in the target domain is very different from the source domains, and apply our framework to datasets where only denotations are provided.