Learning Latent Semantic Annotations for Grounding Natural Language to Structured Data

Previous work on grounded language learning did not fully capture the semantics underlying the correspondences between structured world state representations and texts, especially those between numerical values and lexical terms. In this paper, we attempt to learn explicit latent semantic annotations from paired structured tables and texts, establishing correspondences between various types of values and texts. We model the joint probability of data fields, texts, phrasal spans, and latent annotations with an adapted semi-hidden Markov model, and impose a soft statistical constraint to further improve the performance. As a by-product, we leverage the induced annotations to extract templates for language generation. Experimental results suggest the feasibility of the setting in this study, as well as the effectiveness of our proposed framework.


Introduction
The meaning of natural language should always be accompanied by a context. Grounded language acquisition aims at learning the meaning of language in the context of an observed world state. A solution framework typically addresses the following subproblems: segmenting the text into meaningful phrasal units, determining which world state information is being referred to, and finding proper alignments from these units to the events or values in the world state.
The task has attracted much attention from the NLP community with a special focus on aligning text descriptions onto processed, structured event records (Snyder and Barzilay, 2007; Liang et al., 2009; Hajishirzi et al., 2011). Various statistical models have been proposed to characterize the interaction between text spans and categorical values (e.g., direction='East') or strings (e.g., person names). The term semantic correspondences, as used in previous work, narrowly describes the process of aligning natural language spans to different data fields. However, there still exists a gap between alignment results and the underlying semantics. People tend to use various phrases to describe information that is inferred from different magnitudes of numerical values in a data field, or from values derived by additional operations over fields. Consider the example description for a basketball game shown in Figure 1. The phrase edged out in the first sentence implies that the Toronto Raptors beat their opponent by a relatively narrow margin. This can only be derived by subtracting the two scores that correspond to the field PTS for both teams in the event table, which yields a relatively small difference of only four points. Previous efforts on learning semantic correspondences that rely on categorical distributions (Liang et al., 2009) or string pattern features (Hajishirzi et al., 2012; Koncel-Kedziorski et al., 2014) cannot accurately capture numerical information, especially the part that does not appear explicitly in the table and needs to be inferred. This kind of language grounding is important both for natural language understanding and for natural language generation.
* Contribution during internship at Microsoft Research Asia.
1 Our implementation is available at https://github.com/hiaoxui/D2T-Grounding.
For language understanding, establishing explicit connections between symbols and values beyond ungrounded symbolic meaning representations will be useful for acquisition and inference of numerical commonsense (Narisawa et al., 2013). For language generation, properly aligned information is the key to acquiring patterns of various lexical choices under different world states (Roy and Reiter, 2005).
In this work, we take a step towards more explicit semantic correspondences between structured data and texts. Rather than only producing coarse alignments between data fields and text spans, we try to detect the latent semantics underlying these alignments by inducing explicit semantic annotations. We make the first attempt to utilize publicly available datasets originally prepared for data-to-text language generation to produce such annotations for words and phrases in natural language without additional supervision.
Specifically, we conduct our study on a recently released dataset of descriptions for NBA basketball games with structured tables of game records. Unlike a few popular datasets that are widely conjectured to have been produced from rules (Reiter, 2017), the summaries are all written by humans. The text contains some numbers and proper nouns, for which correspondences with the data are easier to establish. However, the majority of texts contain many informative words, some of which need to be inferred indirectly from various types of values in the data cells. We want to establish explicit correspondences for them.
We derive a set of semantic labels from original data fields (Sec 4.1). These labels can be executed to establish direct correspondences to one or more values in the structured table. Since the original dataset carries no such annotations, learning must proceed without direct supervision, relying only on the weak, distant signal provided by paired tables and texts. We design a semi-hidden Markov model to address this problem (Sec 4.2), which can align a semantic label to a word span. In the model, we leverage continuous probability distributions to model the correspondences between numerical values and lexical terms, which have not been well captured in previous work. To address the issue of "garbage collection" that commonly appears in statistical alignment models (Sec 4.5), we add a soft statistical constraint via posterior regularization. As a by-product, we also show how the derived semantic annotations can be used to induce descriptive templates for data-to-text generation (Sec 5). Experimental results (Sec 6) suggest the feasibility of the setting in this study, and show the effectiveness of our proposed framework.

Related work
Grounded language acquisition has aroused wide interest in various disciplines (Siskind, 1996; Yu and Ballard, 2004; Gorniak and Roy, 2007; Yu and Siskind, 2013; Chrupała et al., 2015). Later work in the natural language processing community also moved in this direction by relaxing the amount of supervision to enable a model to learn from ambiguous alignments (Kate and Mooney, 2007; Chen and Mooney, 2008). Some research aimed at establishing coarse alignments between simulated robot soccer game records and commentary sentences (Chen and Mooney, 2008; Chen et al., 2010; Bordes et al., 2010; Hajishirzi et al., 2011). For the weather forecast domain, Liang et al. (2009) used a hierarchical hidden Markov model to map utterances to world states, handling segmentation and alignment together. More recently, Koncel-Kedziorski et al. (2014) tried to obtain the correspondences between real commentaries and structured football (soccer) events at multiple resolutions. Our work differs from this line of work in that we aim at producing explicit semantic annotations that capture information from structured tables. To achieve this goal, we need additional scaling or operations so that data fields and values can be faithfully mapped onto texts. This addresses the lack of consideration for the relationship between lexical terms and numerical values. A significant difference of our approach is that our framework can generalize to numerical values or value combinations that are unseen in training, rather than simply reciting co-occurrence patterns of exact values in the training data.
Our work relates to learning executable semantic parsers under weak supervision. Early semantic parsing started from fully supervised training with annotated meaning representations available (Zettlemoyer and Collins, 2005; Ge and Mooney, 2006; Snyder and Barzilay, 2007), but more recent work has focused on reducing the amount of supervision required (Artzi and Zettlemoyer, 2013). The intuition behind weakly supervised executable semantic parsing is that once the latent semantic representation has been executed, one can test whether the execution results match the available weak supervision signals, such as answers to natural language queries (Clarke et al., 2010; Liang et al., 2011) or task completion from instructional navigations (Misra et al., 2017). Such formulations have been adapted for question answering over structured tables (Pasupat and Liang, 2015; Krishnamurthy et al., 2017). However, that research focuses on converting a natural language question into executable table queries and directly retrieving results; it does not require inference involving the numerical commonsense implied by various lexical patterns. A few unsupervised approaches exist (Poon and Domingos, 2009; Poon, 2013), but they are specific to translating language into queries over a highly structured database and cannot be applied to our domain.
Our approach is implemented as assigning tag annotations over text spans, which is conceptually related to fine-grained named entity tagging (Ling and Weld, 2012). Our setting only requires a rather weak and distant form of supervision from paired tables and texts without annotations for fine-grained alignments between phrases and data cells. Similar modeling and learning strategies could potentially be useful for considerably large tag space derived from structured knowledge bases in the future (Choi et al., 2018).
The feasibility of this work is partly due to the availability of data, which mostly comes from the field of data-to-text language generation. Related work in data-to-text generation mainly focused on directly generating summary descriptions for structured data (Mei et al., 2016; Kiddon et al., 2016; Murakami et al., 2017; Wiseman et al., 2017), without establishing underlying semantic correspondences. Texts generated in this way can be fluent but may not conform to the input data, unlike template-based approaches where lexical choices can be directly controlled. In our work, we find the derived semantic correspondences between data and texts to be useful for template induction, either with simple heuristics to automatically extract description patterns (how to say) and corresponding triggers (what & when to say), or with more crafted discriminative learning approaches (cf. Angeli et al. (2010)).

Technical overview
Task Let $S$ be the set of all world states, $W$ be the set of all texts, $O$ be the set of all executable operators, and $V$ be the output space of $O$. A world state $s \in S$ is a table storing some information, or more specifically in this work, a tabular record of a sports game. An operator $o \in O$ can be executed on a world state to retrieve values, i.e., each $o$ can be treated as a mapping $S \to V$. The result of an operation can be a string, a continuous value or a discrete value. Meanwhile, each world state $s$ is accompanied by a piece of description $w \in W$. Here $w$ consists of a sequence of word tokens $\{w_i \in w\}$. We further define a segmentation variable $\pi$, which converts $w$ into a sequence of word spans $c$, where each span of tokens $c^t \in c$ can be interpreted as a phrase. Note that we use superscript $t$ to denote indices of phrases, and subscript $i$ to denote indices of individual words. We further define $l$ as a sequence of latent labels, where the value of each $l^t$ is an operator $o \in O$. For each world state $s$ and corresponding description $w$, we want to jointly find a proper segmentation $\pi$ to obtain $c$, and assign a label to every word span $c^t$.
We conduct this study on the ROTOWIRE subset of the openly available dataset released by Wiseman et al. (2017), containing text descriptions for NBA basketball games with structured tables of game statistics. Take Figure 1 as an example. The proper nouns (e.g. Toronto Raptors) appearing in the sentence can be assigned the tag Team Name in our tag set, which can then be aligned to the Team Name field in the table. Some numbers appearing in the text, such as 15 in the example, can be assigned the label Team Losses, an executable annotation that extracts the number of games previously lost by the mentioned team. What we are more interested in is where the phrase edged out comes from. We aim at a model that is capable of capturing the semantics behind the phrase edged out, which is used to describe an event in which one team has beaten the other with close scores. In our annotation scheme, this phrase should be assigned the tag Team Points Delta, and executing that tag returns the score difference between the two teams.
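To make the notion of an executable tag concrete, the following minimal sketch treats each tag as a function from a world state to a value. The dictionary layout, the row names `winner` and `loser`, the helper names `field` and `delta`, and the loser's statistics are all assumptions made for illustration; only the numbers 120, 15 and the four-point margin echo the running example, and the signed subtraction is one possible convention.

```python
from typing import Callable, Dict, Union

Value = Union[str, int, float]
WorldState = Dict[str, Dict[str, Value]]
Operator = Callable[[WorldState], Value]


def field(row: str, name: str) -> Operator:
    """Operator that reads one cell of the table (hypothetical helper)."""
    return lambda state: state[row][name]


def delta(name: str, row_a: str, row_b: str) -> Operator:
    """Operator that subtracts the same field across two rows (hypothetical helper)."""
    return lambda state: state[row_a][name] - state[row_b][name]


# Illustrative world state; the loser's row is invented for the example.
STATE: WorldState = {
    "winner": {"TEAM_NAME": "Toronto Raptors", "PTS": 120, "LOSSES": 15},
    "loser": {"TEAM_NAME": "Opponent", "PTS": 116, "LOSSES": 20},
}

# A tiny tag set: each tag name maps to an executable operator.
OPERATORS: Dict[str, Operator] = {
    "Team_Name": field("winner", "TEAM_NAME"),
    "Team_Losses": field("winner", "LOSSES"),
    "Team_Points_Delta": delta("PTS", "winner", "loser"),
}

if __name__ == "__main__":
    for tag, op in OPERATORS.items():
        print(tag, "->", op(STATE))
```

Under these assumptions, a phrase such as edged out would be annotated with `Team_Points_Delta`, whose execution yields the margin that the phrase describes.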
The task is challenging in that no additional supervision signal exists. The learning process will mostly rely on statistical co-occurrences of information between the structured data and its text descriptions. Note that we assume a consistent structure (schema) throughout the whole dataset upon which the learning process will be performed.
Model To jointly learn word segmentations $c$ and latent semantic annotations $l$ between world state $s$ and text $w$ in a unified framework, we propose a generative model to characterize the joint distribution $P_s(l, \pi, w; \theta)$, parameterized by $\theta$.
Learning The data contain paired texts and tables only, thus our model must learn segmentations and latent semantic annotations in an unsupervised fashion. The target is to maximize the complete data likelihood
$$\max_{\theta} \prod_{(s, w) \in D} \sum_{l, \pi} P_s(l, \pi, w; \theta),$$
where $D$ represents the whole training data. To reduce the search space in inference and to capture some patterns of content planning in the text descriptions, we adopt a Markov assumption over phrase segments, which leads to a hidden semi-Markov model (semi-HMM) (Murphy, 2002; Sarawagi and Cohen, 2005). The key part is to characterize different types of correspondences (Sec 4.2). We derive an expectation-maximization (EM) algorithm to perform maximum likelihood estimation, and introduce a soft statistical regularization to guide the model towards a better solution (Sec 4.5).
Inference Once the model has been trained, we use a Viterbi-like dynamic programming process to perform MAP inference to segment the texts and to assign the most likely tag to each span.

The set of annotations
We describe here how we derived our set of semantic annotations. There are two kinds of tables for each NBA game in the dataset: Box-Scores and Line-Scores, showing the performance statistics for individual players and for whole teams, respectively. We show part of the tag set in Table 1 and leave the entire tag set to Appendix A. Along with all these tags derived from the original data fields, we also include a special NULL tag, which is supposed to be assigned to non-informative words or words carrying information not contained in the given table.
Note that we impose little prior knowledge in this step. We simply over-generate all possible labels, and let the model figure out which part of them should be eventually used. Although the only compositional operation we used in this work is numeric subtraction, common operations that could produce string, categorical values or numbers could be easily introduced for other domains.

Semi-HMMs with continuous values
As previously mentioned, we model the joint distribution of the word segmentation $c$ and the latent semantic annotations $l$ between a paired world state $s$ and text $w$, which can be factorized as $P_s(l, \pi, w) = P_s(w, \pi \mid l) \cdot P_s(l)$, where we write $P_s(w, \pi \mid l)$ as $P_s(c \mid l)$. In this section, we focus on the probability of the alignments between word spans and labels, namely, $P_s(c \mid l)$.
Following Liang et al. (2009), we consider two aspects. One is salience, which captures the intuition that some fields are mentioned more frequently than others (hence some latent tags are triggered more frequently). The other is (local) coherence, which refers to the tendency of writers to mention certain information in an order that follows some patterns. To capture these two phenomena, we define a Markov model:
$$P_s(l, \pi, w) = \prod_{t} P(l^t \mid l^{t-1}) \, P_s(c^t \mid l^t),$$
where $l^t$ is the annotated label at time stamp $t$, and we assume that the transition probabilities are independent of the world state $s$. It resembles the standard form of HMMs, except for the subscript $s$ in $P_s(c \mid l)$. For different types of correspondences between $l^t$ and $c^t$, we define different probability distributions to model $P_s(c^t \mid l^t)$:
(1) Numerics-to-numerics: The numbers in texts can sometimes be inaccurate due to rounding customs, thus we use a Gaussian model for this type:
$$\mathrm{SoftIndicator}(x, y \mid \sigma) \propto \mathcal{N}(x; y, \sigma),$$
where $\mathcal{N}$ is the Gaussian density. When the output type of tag $l^t$ is numeric and the word span $c^t$ is a number, we set $P_s(c \mid l) = \mathrm{SoftIndicator}(c, v_l \mid \sigma_l)$, where $\sigma_l$ is different for different tags, and $v_l$ is the corresponding value in the table for tag $l$. Note that when $\sigma \to 0$, $\mathrm{SoftIndicator}$ reduces to an indicator function that only allows exact matching.
(2) String-to-string: Exact matching is similarly too strict for strings, since it would fail if the text contains Bob to refer to Bob Smith. We therefore model the probability with a word-overlap score: $P_s(c \mid l) \propto \mathrm{Match}(v_l, c)$, where the $\mathrm{Match}$ function returns the number of shared words between the cell value $v_l$ and the word span $c$.
(3) Category-to-string: For labels that correspond to discrete categorical values, such as Sunday or PG (point guard, a basketball position), we adopt the same method used by Liang et al. (2009), using a multinomial distribution over all word spans for each possible category:
$$P_s(c \mid l) = \nu_{c, v_l},$$
where $v_l$ is again the output value of tag $l$.
(4) Numerics-to-string: When the tag corresponds to a numeric value $v_l$ while the word span $c$ is not a number, the problem resembles speech modeling (Huang et al., 1990). Applying the Bayes rule, we get:
$$P_s(c \mid l) = P(c \mid v_l, l) \propto P(v_l \mid c, l) \, P(c \mid l),$$
where we collapse the relevant part of the world state $s$ into $v_l$. The intuition behind $P(v_l \mid c, l)$ is that when an informative word (e.g. routed) appears in the text, the corresponding values should have different chances of occurring in the world, e.g. $P(30 \mid \text{routed}, \text{Points Delta}) > P(3 \mid \text{routed}, \text{Points Delta})$.
Due to the lack of prior knowledge on the distribution, we also model this term as Gaussian. The result resembles a Gaussian mixture:
$$P_s(c \mid l) \propto \eta_{c,l} \, \mathcal{N}(v_l; \mu_{c,l}, \sigma_{c,l}),$$
where $P(c \mid l) = \eta_{c,l}$ is also multinomial.
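The emission scores for the numeric and string cases can be illustrated with a short sketch. It is a simplified reading of the descriptions above rather than the released implementation: scores are left unnormalized, whitespace tokenization is assumed for matching, and the parameter tables `ETA`, `MU` and `SIGMA` are hypothetical (the two means loosely follow the values reported for narrowly and blow out in the analysis section, the rest is invented).

```python
import math
from typing import Dict


def gaussian_pdf(x: float, mean: float, sigma: float) -> float:
    """Normal density with the given mean and standard deviation."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def soft_indicator(span_number: float, cell_value: float, sigma: float) -> float:
    """Numerics-to-numerics: a Gaussian bump around the table value."""
    return gaussian_pdf(span_number, cell_value, sigma)


def match(cell_value: str, span: str) -> int:
    """String-to-string: number of distinct words shared by cell and span."""
    return len(set(cell_value.split()) & set(span.split()))


def numerics_to_string(span: str, cell_value: float,
                       eta: Dict[str, float],
                       mu: Dict[str, float],
                       sigma: Dict[str, float]) -> float:
    """Numerics-to-string: unnormalized eta[c] * N(v; mu[c], sigma[c])."""
    return eta[span] * gaussian_pdf(cell_value, mu[span], sigma[span])


# Hypothetical per-phrase parameters for a points-difference tag.
ETA = {"narrowly": 0.5, "blow out": 0.5}
MU = {"narrowly": 2.0, "blow out": 26.0}
SIGMA = {"narrowly": 2.0, "blow out": 8.0}

if __name__ == "__main__":
    print(soft_indicator(21, 20, 1.0), match("Bob Smith", "Bob"))
    for margin in (3, 30):
        scores = {c: numerics_to_string(c, margin, ETA, MU, SIGMA) for c in ETA}
        print(margin, max(scores, key=scores.get))
```

With these made-up parameters, a three-point margin favours narrowly and a thirty-point margin favours blow out, which is the behaviour the Gaussian mixture is meant to capture.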

Modeling phrasal spans
Our model supports phrase segmentation. Previously, Liang et al. (2009) treated the words inside a phrase individually and independently. This could be problematic in our scenario. For example, take down is used to describe a team defeating another, while separately both take and down are frequent words in general, making it difficult to assign them the correct label jointly as a whole. Instead, in our model we treat the phrasal word span as a unit. The probability is assigned to the whole span of the phrase instead of to individual words, which breaks the token-wise Markov property (Fig. 2, henceforth Semi-HMM). For efficient parameter estimation, we use a variant of the standard forward-backward algorithm that adds a parameter $k$, the maximum length of word spans, to the Markov chain. We leave detailed descriptions to Appendix B.
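The forward pass of such a span-level model can be sketched as follows. This is a generic semi-Markov forward recursion written under assumptions, not a transcription of Appendix B: the function name, the callable `emit(i, j, label)` that scores the span covering tokens `i` to `j-1`, and the toy parameters are all illustrative.

```python
from typing import Callable, Dict, List


def semi_markov_forward(n: int,
                        labels: List[str],
                        emit: Callable[[int, int, str], float],
                        trans: Dict[str, Dict[str, float]],
                        init: Dict[str, float],
                        k: int) -> float:
    """Sum over all segmentations (spans of length <= k) and labelings.

    alpha[j][l] is the total probability of the first j tokens when the
    last span ends at position j and carries label l.
    """
    alpha = [dict.fromkeys(labels, 0.0) for _ in range(n + 1)]
    for j in range(1, n + 1):
        for d in range(1, min(k, j) + 1):  # d = length of the last span
            i = j - d
            for l in labels:
                e = emit(i, j, l)
                if i == 0:
                    # The span starts the sentence: use the initial distribution.
                    alpha[j][l] += init[l] * e
                else:
                    # Otherwise transition from any label of the previous span.
                    alpha[j][l] += e * sum(alpha[i][p] * trans[p][l] for p in labels)
    return sum(alpha[n].values())


# Toy parameters: two labels, uniform transitions, and an emission score
# that only depends on the span length.
LABELS = ["A", "B"]
INIT = {"A": 0.5, "B": 0.5}
TRANS = {"A": {"A": 0.5, "B": 0.5}, "B": {"A": 0.5, "B": 0.5}}


def toy_emit(i: int, j: int, label: str) -> float:
    return 0.5 ** (j - i)


if __name__ == "__main__":
    print(semi_markov_forward(2, LABELS, toy_emit, TRANS, INIT, k=1))
    print(semi_markov_forward(2, LABELS, toy_emit, TRANS, INIT, k=2))
```

Setting `k=1` recovers the ordinary HMM forward algorithm, while larger `k` adds the paths in which several tokens are emitted as one phrase; the cost grows linearly in `k`.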

Skipping null labels
Preliminary experiments suggest that the initial model has too many words assigned to the NULL tag. Informative alignments may not be adjacent, which breaks the simplest Markov assumptions. In our model, the transition score between two non-NULL labels can be calculated by skipping all the NULLs in between, as shown in Figure 3. This is implemented without breaking the overall Markov property with the following trick used in earlier work on statistical alignments (Brown et al., 1993): Suppose we have m labels (i.e., m latent states). We can design m different NULL labels that share the same emission score, while preserving their original outward-transition probabilities. The type of each NULL label is inherited from the previous label. This might seem wasteful at first sight, since it doubles the number of latent states, but the Markov property is preserved, which simplifies our implementation.
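One way to picture the state expansion is the sketch below, which builds a transition table over 2m states from an m-state table. It is an illustration under assumptions: the single probability `p_null` of emitting a NULL next, and the tuple encoding `("NULL", label)` for a NULL state that remembers the last real label, are choices made here and are not taken from the paper.

```python
from typing import Dict, Hashable, List

State = Hashable


def expand_with_null(labels: List[str],
                     trans: Dict[str, Dict[str, float]],
                     p_null: float) -> Dict[State, Dict[State, float]]:
    """Return transitions over real labels plus one NULL copy per label.

    The state ("NULL", prev) behaves like prev when choosing the next real
    label, so a chain of NULLs never erases the memory of the last
    informative label.
    """
    expanded: Dict[State, Dict[State, float]] = {}
    for prev in labels:
        row: Dict[State, float] = {
            nxt: (1.0 - p_null) * trans[prev][nxt] for nxt in labels
        }
        # Moving into (or staying in) the NULL state that remembers prev.
        row[("NULL", prev)] = p_null
        expanded[prev] = dict(row)
        expanded[("NULL", prev)] = dict(row)
    return expanded


TRANS = {"A": {"A": 0.2, "B": 0.8}, "B": {"A": 0.6, "B": 0.4}}
EXPANDED = expand_with_null(["A", "B"], TRANS, p_null=0.5)

if __name__ == "__main__":
    for state, row in EXPANDED.items():
        print(state, row)
```

Because every `("NULL", prev)` row copies the row of `prev`, any number of intervening NULL tokens leaves the relative preference between the next real labels unchanged.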

Posterior regularization
The structured tables and text descriptions of the dataset were originally crawled from different sources. As a consequence, a non-negligible proportion of texts in fact describes information outside the given table, such as historical records (e.g., "win streak"). Ideally, words in these parts of the text should remain unaligned, or in our setting, be annotated with the NULL tag. However, due to the notorious effect of garbage collection in statistical alignment (Brown et al., 1993), these words tend to be aligned to some irrelevant fields in the table which are rarely mentioned. We address this issue by adding a soft statistical constraint in the form of posterior regularization. With posterior regularization, we can add certain types of statistical constraints to the E-step in the EM procedure, while keeping the inference tractable. The constraints should be of the form
$$\mathbb{E}_{q}[f(l, \pi, w)] \le b,$$
where $q$ is the posterior distribution over the latent variables and the features $f$ should be defined on local cliques for tractable inference. We use projected gradient descent to solve the E-step sub-problem in this work. The statistical constraint we add to the posterior is rather simple: For each sentence, we "encourage" at least a proportion of words to be aligned to NULL labels:
$$\mathbb{E}_{q}\Big[\sum_{i=1}^{n} \mathbb{1}(w_i \text{ is labeled NULL})\Big] \ge r_0 \cdot n, \quad (3)$$
where $r_0$ is an adjustable ratio and $n$ is the length of $w$. We also tried other constraints but found this simple soft regularization to perform well.
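The effect of the NULL-ratio constraint can be illustrated on a deliberately simplified posterior. The sketch below assumes that each token carries an independent probability of being NULL, which is not true of the chain-structured model (there the expectations would come from forward-backward); under that assumption the constrained posterior is an exponential tilting of the original one, and the single dual variable is found by projected gradient ascent. The function names, the step size and the iteration count are illustrative choices.

```python
import math
from typing import List, Tuple


def tilt(p: float, lam: float) -> float:
    """Re-weight a NULL probability p by exp(lam) and renormalize."""
    w = p * math.exp(lam)
    return w / (w + (1.0 - p))


def project_null_ratio(p_null: List[float], r0: float,
                       step: float = 0.5,
                       iters: int = 200) -> Tuple[List[float], float]:
    """Find q close to p with an expected NULL count of at least r0 * n.

    The dual variable lam is kept non-negative; its gradient is the gap
    between the required and the currently expected number of NULLs.
    """
    target = r0 * len(p_null)
    lam = 0.0
    for _ in range(iters):
        expected = sum(tilt(p, lam) for p in p_null)
        lam = max(0.0, lam + step * (target - expected))  # projection onto lam >= 0
    return [tilt(p, lam) for p in p_null], lam


if __name__ == "__main__":
    q, lam = project_null_ratio([0.1, 0.2, 0.3, 0.4], r0=0.5)
    print([round(x, 3) for x in q], round(lam, 3))
```

When the original posterior already satisfies the constraint, the dual variable stays at zero and the posterior is left untouched, which is what makes the regularization soft.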

By-product: template induction
Intuitively, the assigned semantic correspondences could be useful to derive templates and trigger rules for language generation. In this work, we use the most straightforward heuristics to perform template induction, utilizing the established correspondences and inferred parameters. Specifically, we first turn the spans with numerics-to-numerics and string-to-string correspondences into empty slots, replacing them with their tag names. In the example of Figure 1, we could replace Raptors with Team Name, and 120 with Team Points, if they have been correctly aligned. We also need to know when to use each template. We define a template trigger to be a quadruple $(c, l, \mu_{c,l}, \sigma_{c,l})$, where $c$ is a phrase, $l$ is a tag, and $\mu_{c,l}$ and $\sigma_{c,l}$ are estimated Gaussian parameters (see footnote 6). We assign each template a score equal to the minimum probability over all triggers inside it:
$$\mathrm{score}(s, t) = \min_i \mathcal{N}(t_i.l(s); t_i.\mu, t_i.\sigma), \quad (5)$$
where $t = \{t_i\}$ denotes all possible triggers in the template, and the tag $l$ can be executed over the world state $s$ to retrieve a value $l(s)$. In order to extract templates of high quality, we only consider sentences satisfying both of the following conditions: (1) sentences aligned to at most two teams or one player, and (2) sentences with triggers derived from continuous distributions. With the templates and triggers ready for use, we experiment with the following straightforward rules to perform data-to-text generation: For every game, we first generate a sentence describing the scoreline result, followed by three sentences describing other information about team performance. While ensuring that no template is used repeatedly, we then choose the template with the highest score for each of the top ten players sorted by their game points.
6 To use a unified notation, for categorical-value triggers we set $\mu_{c,l} = \arg\max_{v_l} \nu_{c,v_l}$ (defined in (1)) and $\sigma_{c,l}$ = .
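The template scoring rule can be sketched in a few lines. The `Trigger` tuple mirrors the quadruple of phrase, tag, mean and standard deviation; everything else (the flat dictionary world state, the `points_delta` operator, and the Gaussian parameters) is assumed for illustration only.

```python
import math
from typing import Callable, Dict, List, NamedTuple

WorldState = Dict[str, float]


class Trigger(NamedTuple):
    phrase: str
    label: Callable[[WorldState], float]  # executable tag l
    mu: float
    sigma: float


def gaussian_pdf(x: float, mean: float, sigma: float) -> float:
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def template_score(state: WorldState, triggers: List[Trigger]) -> float:
    """Score of a template: the weakest of its triggers under this state."""
    return min(gaussian_pdf(t.label(state), t.mu, t.sigma) for t in triggers)


def points_delta(state: WorldState) -> float:
    """Hypothetical operator: winner's points minus loser's points."""
    return state["winner_pts"] - state["loser_pts"]


# Two single-trigger templates with made-up Gaussian parameters.
NARROW = [Trigger("narrowly", points_delta, 2.0, 2.0)]
BLOWOUT = [Trigger("blow out", points_delta, 26.0, 8.0)]

if __name__ == "__main__":
    close_game = {"winner_pts": 120, "loser_pts": 116}
    rout = {"winner_pts": 120, "loser_pts": 95}
    for name, game in (("close", close_game), ("rout", rout)):
        best = max((NARROW, BLOWOUT), key=lambda t: template_score(game, t))
        print(name, "->", best[0].phrase)
```

Taking the minimum over triggers means that a template is only chosen when every one of its value-dependent phrases is plausible for the game at hand.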

Experimental setup
We conducted experiments on the ROTOWIRE subset of the Wiseman et al. (2017) dataset. In our experiment, we restricted the maximum length of word span to two as a trade-off of speed and performance. Empirically, most of the phrases in the dataset are at a length of at most two. We empirically set the expected NULL ratio to be r 0 = 0.5.
We applied the following pre-processing steps for all systems in comparison: we lemmatized all tokens in the sentences, and filtered out sentences containing fewer than five words, since such short sentences carry little meaning. To utilize the game dates, we converted them from calendar dates to days of the week, e.g. 11/28/2016 is converted to Monday as a categorical value. The ROTOWIRE dataset is very noisy and contains many sentences irrelevant to their corresponding tables, so we filtered out the sentences that contain no team or player names, or that mention more than 2 players, as most of them are irrelevant texts.
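The date conversion and sentence filters are simple enough to sketch directly. The snippet assumes dates in month/day/year form as in the example above, whitespace-tokenized sentences, and single-token team and player names; the function names and the name sets are hypothetical, and lemmatization is omitted.

```python
from datetime import datetime
from typing import List, Set

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def day_of_week(date_str: str) -> str:
    """Map a calendar date such as 11/28/2016 to a categorical weekday."""
    # datetime.weekday() returns 0 for Monday through 6 for Sunday.
    return DAYS[datetime.strptime(date_str, "%m/%d/%Y").weekday()]


def keep_sentence(tokens: List[str],
                  team_names: Set[str],
                  player_names: Set[str],
                  min_len: int = 5,
                  max_players: int = 2) -> bool:
    """Apply the length, name-presence and player-count filters."""
    if len(tokens) < min_len:
        return False
    players = {tok for tok in tokens if tok in player_names}
    teams = {tok for tok in tokens if tok in team_names}
    if not players and not teams:
        return False
    return len(players) <= max_players


if __name__ == "__main__":
    print(day_of_week("11/28/2016"))
    print(keep_sentence("The Raptors win at home".split(), {"Raptors"}, set()))
```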
Following Liang et al. (2009), we first trained a simpler model without Markov dependencies, starting from a uniform initialization, and then used its parameters to initialize our complete model, which we trained until convergence. We also adapted the framework of Liang et al. (2009) to the table schema of the ROTOWIRE dataset and ran experiments with it as our baseline model.

Intrinsic evaluation
It is difficult to evaluate the accuracy of tag assignments for the entire dataset, since the tags are not annotated in the original data. We recruited three human annotators familiar with the domain of basketball games to label 300 sentences from the test set (around 8,000 tokens in total, 30% of which are annotated with tags). For the fraction (18%) on which the annotators did not agree, we treated all the tags proposed by the annotators as correct. Also, because of the ambiguity of annotations, we use the base names of derived tags for the evaluation of numerics-to-string relationships (e.g. Rebounds Delta is reduced to Rebounds). Finally, we calculated the precision and recall for non-NULL tag assignments at the word level.
The results are shown in Table 2. The Liang et al. (2009) framework could still achieve around 65% recall, because a large proportion of correspondences can be easily captured by exact matches and simple categorical distributions. Our model without PR achieves lower precision than the baseline, because the baseline did not model numerics-to-string relationships and encountered less severe issues of garbage collection. We can observe that our initial model indeed outperforms the baseline system in recall, while PR helps considerably in avoiding distraction from irrelevant information that should be tagged as NULL.
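The word-level metric can be sketched as follows, assuming one predicted tag per token and, for the gold side, a set of acceptable tags per token (empty when the token is NULL), which accommodates the cases where annotators proposed different tags. The function name and the toy tags are illustrative.

```python
from typing import List, Set, Tuple


def precision_recall(pred: List[str], gold: List[Set[str]]) -> Tuple[float, float]:
    """Word-level precision and recall over non-NULL tag assignments."""
    # A prediction is correct when it is non-NULL and among the accepted tags.
    correct = sum(1 for p, g in zip(pred, gold) if p != "NULL" and p in g)
    n_pred = sum(1 for p in pred if p != "NULL")
    n_gold = sum(1 for g in gold if g)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    return precision, recall


if __name__ == "__main__":
    pred = ["Team_Name", "NULL", "Team_Points", "Team_Points"]
    gold = [{"Team_Name"}, set(), {"Team_Points", "Opp_Points"}, set()]
    print(precision_recall(pred, gold))
```

In this toy case the last token is a spurious non-NULL prediction, so it lowers precision without affecting recall, which is the pattern attributed to garbage collection.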
We also include more fine-grained results for different types of correspondences, shown in Table 3. As expected, numerics-to-string correspondences are the most difficult part in this study. Another notable observation is that although we found that around 40% of numerics-to-numerics correspondences were ambiguous due to the appearance of identical values in different table cells, our model could still achieve a high accuracy of 95.0%.
[Figure: example sentence "The Boston Celtics (7-5) blew out the Brooklyn Nets (2-11) 120-95 on Friday."]

Extrinsic evaluation
We also tested how the derived templates perform in language generation, compared with those of the baseline under the same heuristics described in Sec 5. We report automatic metrics including BLEU scores and those based on relation extraction as proposed by Wiseman et al. (2017): precision & number of unique relations in generation (RG), precision & recall for content selection (CS), and content ordering (CO) score. These automatic metrics were designed for various aspects of NLG and may not all suit our main focus well, so we also conducted a human evaluation of information correctness (ratings on a 1-5 scale, the higher the better). We asked four human raters who are fluent in English and familiar with basketball terms to rate the outputs for 30 random games. Results are shown in Table 4. We can observe that templates derived from our model indeed outperform those from the baseline. We put some induced templates and generated text examples in the Appendix.
Figure 4 shows some examples produced by our methods. Some of the alignments are meaningful; for example, the model assigned the word perfect the annotation FT Percent, which represents the percentage of free throws. Without PR, our model performed poorly by aligning many common words to rarely mentioned cells. In this example, the FT Made and FT Attempt fields in the input table both have the same value 8, making it difficult for a model without proper local coherence modeling to distinguish between them. Because our initial model without PR cannot annotate NULL correctly, the Markov transition between these two numbers was intercepted by three meaningless tokens. However, after injecting the PR constraint, most of the words that do not mention table information were successfully identified. The model captured the pattern that FT Attempt almost always follows FT Made, allowing it to assign these two labels correctly.

Analysis
We conducted ablation experiments for some of the components (Table 5). When setting the maximum phrase length to k = 1, the model degenerates to a normal HMM. The performance measured by F1-score drops only slightly. One possible reason is that Semi-HMMs tend to output some meaningless combinations of words as phrases, such as the phrase points on in Figure 4 (b), which can lead to many redundant annotations that hurt precision despite helping recall. We also tried disabling the transition probabilities during both training and inference, which naturally led to lower precision and lower recall, as there was no modeling of local coherence. Finally, by removing the NULL-skipping mechanism, we found that the numerics-to-numerics annotation accuracy dropped from 95.0% to 88.8%. Many of the spurious numerics-to-numerics annotations, such as the 8 -of -8 in Figure 4 (d), could be corrected using transition probabilities under the NULL-skipping mechanism (Figure 4 (c)).
One additional advantage of our model is that we can easily verify what the model has captured. For the latent annotation Team Points Delta, we sort its corresponding phrases by the weights $P(c \mid l) / P(c)$ and list the top 12 weighted words in Figure 5. We can observe that most of the displayed phrases have a strong semantic relationship with score differences. More interestingly, we found the mean and variance values estimated by the Gaussian distributions rather informative. When $l$ = Team Points Delta, we observed that $\mu_{l,\text{narrowly}} \approx 2$ while $\mu_{l,\text{blow out}} \approx 26$. We could thus infer the conditions under which some phrases should be used, providing useful insights for lexical choices in language generation.
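The phrase ranking used for this kind of inspection amounts to sorting by a probability ratio. The sketch below assumes two dictionaries holding the label-conditional and the marginal phrase probabilities; the numbers are invented and only serve to show that frequent function words are pushed down by the normalization.

```python
from typing import Dict, List


def rank_phrases(p_c_given_l: Dict[str, float],
                 p_c: Dict[str, float],
                 top: int = 12) -> List[str]:
    """Sort phrases by P(c | l) / P(c), largest ratio first."""
    ratio = {c: p_c_given_l[c] / p_c[c] for c in p_c_given_l}
    return sorted(ratio, key=ratio.get, reverse=True)[:top]


# Hypothetical probabilities for one latent tag.
P_C_GIVEN_L = {"blow out": 0.05, "narrowly": 0.04, "the": 0.10}
P_C = {"blow out": 0.001, "narrowly": 0.002, "the": 0.08}

if __name__ == "__main__":
    print(rank_phrases(P_C_GIVEN_L, P_C))
```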

Conclusion
In this paper, we attempt to learn executable latent semantic annotations from paired structured tables and texts. We model the joint probability of data fields, texts, phrasal spans, and latent annotations with an adapted semi-hidden Markov model and impose a soft statistical constraint via posterior regularization. Experimental results suggest the feasibility of the setting in this study and the effectiveness of our framework. This is a preliminary study on using weak supervision from structured data and texts to address the challenging problem of language grounding. For future study, one could collect large-scale data and texts in other domains where more complex grounding of phrases such as "increasing trends" is needed. To enhance modeling power, unsupervised discriminative models that utilize rich features (Berg-Kirkpatrick et al., 2010) could also be explored. We are also interested in collecting more high-quality parallel data to induce grounded compositional logic representations.