Reversibility reconsidered: finite-state factors for efficient probabilistic sampling in parsing and generation

We restate the classical logical notion of generation/parsing reversibility in terms of feasible probabilistic sampling, and argue for an implementation based on finite-state factors. We propose a modular decomposition that reconciles generation accuracy with parsing robustness and allows the introduction of dynamic contextual factors.


Introduction
The objective of Natural Language Understanding (NLU) is to map linguistic utterances to semantic representations, that of Natural Language Generation (NLG) to map semantic representations to linguistic utterances. In most of NLP practice, these two objectives are handled by different processes, and computational linguists rarely operate at the intersection of the two subdomains.
For a few years around the early nineties, motivated by cognitive, linguistic, and engineering considerations, there was a surge of interest in so-called reversible grammar approaches to NLP, where one and the same grammatical specification could serve both for parsing an utterance x into a logical form z and for generating x from z (Strzalkowski, 1994).
We start with a brief review of this historical non-probabilistic notion of reversibility and point out some of its weaknesses, in particular regarding robustness; in section 3 we give a new probabilistic definition of reversibility; then, in section 4, we argue for a reversibility model based on modular weighted finite-state transducers. We end with a discussion of recent related work.
* Work done while at XRCE.

Classical reversibility
The most direct approaches to NLU attempt to design procedures for semantic parsing that, given an input utterance x, produce a semantic representation z, by following a number of intermediate steps where the surface form is gradually transformed into semantic structure. Such "procedural" approaches to semantic parsing are typically very hard or impossible to invert: starting from a semantic representation z, there is no simple process that is able to find an x which, when given to the parser, would produce z. Formally, a Boolean relation r(x, z) can be such that the question ?∃z r(x, z) is decidable for all x's, while the reciprocal question ?∃x r(x, z) is undecidable for some z's (Dymetman, 1991).
One of the motivations for the emerging paradigm of unification grammars at the end of the eighties was the clean separation they promised between specifying well-formed linguistic structures, both on the syntactic and semantic levels, through a formal description of the relation r(x, z), and producing efficient implementations of the specification. In particular, there was much hope that such formalisms would be conducive to effective reversibility (in contrast to variable assignment, variable unification is inherently symmetrical), that is, to feasible (and if possible efficient) implementations of the parsing problem r(x, ?) and of the generation problem r(?, z).
To some extent, this hope was validated through a number of works at the time, mostly involving machine translation applications, and constraining in more or less explicit ways the specification of r (van Noord, 1990). However, for the non-statistical approaches to parsing then strongly dominant, robustness was an issue: a parser had to either accept or reject a given input x, with no intermediary options, and in order to be able to parse actual utterances, with all their empirical diversity, parsers had to be rather tolerant. In the procedural view of parsing, such robustness issues could often be mitigated through engineering tricks such as ordering the rules from strict to lax, where grammatical constructions were given preference over less conventional ones; however, when trying to move to reversible grammars, these tricks could not be reproduced: if the grammar was able to parse an x into z, then, by design, it was also able to generate x from z, and there was no obvious way, in these non-probabilistic approaches, to distinguish between producing a linguistically correct x or producing a deviant or incorrect one.

Probabilistic reversibility
In the classical non-probabilistic case, a (relative) consensus existed around the fact that a reversible grammar should be, as we indicated above, a formal specification of the relation r(x, z) such that the problems r(x, ?) and r(?, z) were effectively solvable.
Transposing this to the probabilistic world, we propose the following semi-formal definition.
Definition: A probabilistic reversible grammar is a formal specification of a joint probability distribution p(x, z) over logical forms z and utterance strings x such that the conditional distributions $p(z|x) \overset{\mathrm{def}}{=} \frac{p(x,z)}{\sum_{z'} p(x,z')}$ (parsing) and $p(x|z) \overset{\mathrm{def}}{=} \frac{p(x,z)}{\sum_{x'} p(x',z)}$ (generation) can be efficiently sampled from.
Why such focus on sampling? We could have chosen other definitions of parsing (and similarly for generation), for instance the ability to return the most probable z given x, i.e. to return $\operatorname{argmax}_z p(z|x)$; however sampling is the most direct way of providing a concrete view of the underlying probabilistic distribution, and has many applications to learning, so we think the definition above is reasonable (see also footnote 4).
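As a concrete reading of this definition, the following minimal Python sketch treats a small joint table as the specification and obtains both conditionals by normalizing over the other variable. The table entries, the logical-form labels, and the function names `conditional` and `sample` are hypothetical and chosen only for illustration; a realistic grammar would not enumerate p(x, z) explicitly.

```python
import random

# Hypothetical toy joint distribution p(x, z) over (utterance, logical form) pairs.
JOINT = {
    ("what is the battery life", "wad(batLife)"): 0.4,
    ("how long does the battery last", "wad(batLife)"): 0.3,
    ("what is the screen size", "wad(screenSize)"): 0.2,
    ("how long does the battery last", "wad(chargeTime)"): 0.1,
}


def conditional(joint, x=None, z=None):
    """Return p(z|x) when x is fixed (parsing) or p(x|z) when z is fixed (generation)."""
    if (x is None) == (z is None):
        raise ValueError("fix exactly one of x or z")
    if x is not None:
        row = {zz: p for (xx, zz), p in joint.items() if xx == x}
    else:
        row = {xx: p for (xx, zz), p in joint.items() if zz == z}
    total = sum(row.values())  # the sum over z' (resp. x') in the definition
    if total == 0:
        raise ValueError("conditioning event has zero probability")
    return {outcome: p / total for outcome, p in row.items()}


def sample(dist, rng=random):
    """Draw one exact sample from a finite distribution given as a dict."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes], k=1)[0]


# The same specification is used in both directions.
PARSE_DIST = conditional(JOINT, x="how long does the battery last")
GEN_DIST = conditional(JOINT, z="wad(batLife)")
```

Calling `sample(PARSE_DIST)` then plays the role of the parser and `sample(GEN_DIST)` that of the generator, both derived from the single table `JOINT`.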

Finite-state models for reversibility
Finite-state transducers have properties which make them uniquely suited to implementing reversible linguistic specifications in the above sense. Consider a simple weighted string-to-string transducer τ(s, t), where s, t are strings, and where the underlying semiring is the "probabilistic semiring" over the nonnegative reals, addition and multiplication having their usual interpretations. Such a transducer preserves regularity in both the forward and the reverse direction, meaning that the image through τ of any weighted regular language over s (resp. over t) is again a weighted regular language over t (resp. over s). In particular the forward (resp. reverse) image of a fixed string s_0 (resp. t_0) can be computed in a compact form as a weighted finite-state automaton (FSA) over t (resp. s), which we can denote by τ(s_0, ·) (resp. τ(·, t_0)). A weighted FSA can be easily normalized into a probabilistic FSA [3] and, from this probabilistic FSA, exact samplers for the "parser" τ(s_0, ·) and for the "generator" τ(·, t_0) are directly obtained [4].
In general, some of the properties that make weighted FSAs and FSTs, over strings or trees, especially relevant for probabilistic models of language are the following: (i) they allow compact representations of complex probability distributions over linguistic objects (automata) or pairs of linguistic objects (transducers); (ii) they permit efficient exact sampling, and efficient optimization over derivations (but not always over strings); (iii) they support modularity: intersection of automata, composition of transducers, projections of an automaton through a transducer [5].
[3] That is, into a weighted FSA such that the weights of the transitions from each state sum to 1.
[4] While sampling strings from a weighted finite-state automaton is simple, finding the most probable string (not path) in a probabilistic FSA is an NP-hard problem (Casacuberta and de la Higuera, 2000), and one has to resort to the so-called Viterbi approximation (assuming that the most probable path projects into the most probable string). Contrary to popular belief, sampling can sometimes be simpler than optimization.
[5] Outside of the realm of finite-state machines, this modularity is typically impossible to obtain. Thus, in general, the availability of a sampler for a distribution p(x) (resp. a distribution q(x)) does not imply that we can efficiently sample from the product (i.e. intersection) p(x)·q(x), but we can in case p and q are both represented by weighted FSAs.

Conceptual architecture
Armed with these general considerations, let us now propose a conceptual architecture based on a small number of finite-state modules, which attempts to satisfy the definition given above for probabilistic reversibility, to address the problem of robustness that we described earlier, and to support contextual preferences. We illustrate the approach with some simple examples of human-machine dialogues (between a customer and a virtual agent), a domain for which reversibility has high relevance, due to effects such as self-monitoring (Neumann, 1998; Levelt, 1983), interleaving of understanding and generation (Otsuka and Purver, 2003), and lexical entrainment (Brennan, 1996).
The conceptual architecture is shown in Figure 1. Formally, the figure represents a probabilistic graphical model in so-called factor form, where the factors are ω, κ, σ, λ (we have also indicated for future reference the "contextual" factors ζ, µ, which we ignore for now). The factors take as arguments three types of objects: z is a logical form, that is, a structured object which can be naturally represented as a tree; x is a surface string; and y is a latent "underlying" string that corresponds to one of a small collection of "canonical" texts for realizing the logical form z (more about that later).
Each factor is realized through a weighted finite-state machine (acceptor or transducer) over strings or trees (Mohri, 2009; Fülöp and Vogler, 2009; Maletti, 2010; Graehl et al., 2008).
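To illustrate how a weighted string acceptor of this kind can be turned into an exact sampler, here is a minimal Python sketch of normalization followed by sampling. It assumes an acyclic automaton with nonnegative weights (a cyclic one would require solving a linear system for the backward weights instead of a simple recursion); the example automaton and all names (`ARCS`, `FINAL`, `backward_weights`, `normalize`, `sample_string`) are hypothetical.

```python
import random

# Hypothetical acyclic weighted FSA: state -> list of (symbol, weight, next_state).
ARCS = {
    0: [("what", 2.0, 1), ("how", 1.0, 2)],
    1: [("size", 1.0, 3), ("life", 3.0, 3)],
    2: [("long", 1.0, 3)],
    3: [],
}
FINAL = {3: 1.0}  # stopping weights; states not listed have stopping weight 0
START = 0


def backward_weights(arcs, final):
    """beta[q] = total weight of all accepting paths leaving q (acyclic case only)."""
    beta = {}

    def visit(q):
        if q not in beta:
            beta[q] = final.get(q, 0.0) + sum(w * visit(r) for _, w, r in arcs[q])
        return beta[q]

    for q in arcs:
        visit(q)
    return beta


def normalize(arcs, final):
    """Rescale weights so that, at each live state, stop + outgoing arcs sum to 1."""
    beta = backward_weights(arcs, final)
    prob_arcs, prob_stop = {}, {}
    for q, out in arcs.items():
        if beta[q] == 0:
            continue  # dead state: no accepting path starts here
        prob_arcs[q] = [(a, w * beta[r] / beta[q], r) for a, w, r in out if beta[r] > 0]
        prob_stop[q] = final.get(q, 0.0) / beta[q]
    return prob_arcs, prob_stop


def sample_string(prob_arcs, prob_stop, start, rng=random):
    """Draw one string with probability proportional to its total weight."""
    q, out = start, []
    while True:
        options = [None] + prob_arcs[q]  # None stands for "stop here"
        weights = [prob_stop[q]] + [p for _, p, _ in prob_arcs[q]]
        pick = rng.choices(options, weights=weights, k=1)[0]
        if pick is None:
            return out
        symbol, _, q = pick
        out.append(symbol)


PROB_ARCS, PROB_STOP = normalize(ARCS, FINAL)
```

After normalization, each local choice depends only on the current state, so sampling takes time linear in the length of the sampled string, in contrast with the search for the most probable string discussed in footnote 4.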
The λ factor is a string automaton that represents a standard n-gram language model (typically domain-specific), in other words a probability distribution over utterances x. Symmetrically, the regular tree automaton ω represents a distribution over logical forms z, which can be seen as playing a similar role to the language model, but at the semantic level, namely telling us what the possible or likely logical forms are in a certain domain.
The "canonical factor" κ is a weighted tree-to-string transducer (Graehl et al., 2008), which implements a relation between logical forms z and a small number of latent "canonical" texts y realizing these logical forms. For example, κ may associate the logical form (dialog act) z = wad(batLife, iphone6), with wad an abbreviation for "what is the value of this attribute on this device?" and batLife an abbreviation for "battery life", with such a canonical text (among a few others) as: What is the battery life of the Iphone 6?
The "similarity factor" σ is a weighted string-to-string finite-state transducer which gives scores to x, y according to a notion of similarity. It has the role of "bridging" the gap between the actual utterances x and the latent canonical utterances y. The intention behind the similarity factor is to "decouple" the task of modeling some possible realizations of a given logical form from the task of recognizing that a given more or less well-formed input is a variant of such a realization. This factor relates the two strings y and x, where y is a possible canonical utterance in the limited repertory produced by κ, and x is an actual utterance, in particular any utterance that could be produced by a human speaker. So, for instance, if the user's utterance is x = What about battery duration on this Iphone 6?, we would like this x to have a significant similarity with the canonical utterance y = What is the battery life of the Iphone 6? but a negligible similarity with another canonical utterance such as y = What is the screen size of the Galaxy Trend?.
Overall, the canonical factor κ(z, y) concentrates more on a core "generation model", namely on producing some well-formed output y from a logical form z, while the similarity factor σ(y, x) allows relating an actual user input x to a possible output y of the κ model. The main import of σ is then to allow the core generation model defined by κ to be exploited for robust semantic parsing.
Different instantiations of this scheme can be employed. In some preliminary experiments that we have performed, σ is a simple edit-distance transducer (Mohri, 2003) which penalizes the discrepancies between x and y differently: strongly for some salient content words or named entities of the domain, weakly for less relevant content words and for non-content words, with limited use of local paraphrases (which can also be implemented through σ). This strategy seems to work reasonably well when the semantic repertory of the domain is restricted, because a large number of possible variants for x are "attracted" to the same underlying semantics. In domains where small nuances of expression may result in distinct semantics, the division of work between κ and σ may be different.
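The kind of scoring that such a similarity factor performs can be sketched with a word-level weighted edit distance. This is only an illustration under stated assumptions: the word classes, the penalty values, and the paraphrase pair are hypothetical rather than those of the experiments, and the sketch keeps only the cheapest alignment (a Viterbi-style simplification), whereas a transducer over the probabilistic semiring would sum over all alignments.

```python
import math

# Hypothetical salient domain words and penalties (illustrative values only).
SALIENT = {"battery", "life", "duration", "screen", "size",
           "iphone", "galaxy", "trend", "6"}
COST_SALIENT, COST_OTHER, COST_PARAPHRASE = 3.0, 0.5, 0.2
# Hypothetical local paraphrases, treated as near-matches.
PARAPHRASES = {frozenset({"life", "duration"})}


def tokenize(text):
    return text.lower().replace("?", "").split()


def word_cost(word):
    """Cost of inserting or deleting a word: high for salient words, low otherwise."""
    return COST_SALIENT if word in SALIENT else COST_OTHER


def substitution_cost(a, b):
    if a == b:
        return 0.0
    if frozenset({a, b}) in PARAPHRASES:
        return COST_PARAPHRASE
    return max(word_cost(a), word_cost(b))


def edit_cost(y, x):
    """Cheapest word-level edit script turning canonical text y into utterance x."""
    ys, xs = tokenize(y), tokenize(x)
    # d[i][j] = cheapest way to turn the first i words of y into the first j words of x
    d = [[0.0] * (len(xs) + 1) for _ in range(len(ys) + 1)]
    for i in range(1, len(ys) + 1):
        d[i][0] = d[i - 1][0] + word_cost(ys[i - 1])
    for j in range(1, len(xs) + 1):
        d[0][j] = d[0][j - 1] + word_cost(xs[j - 1])
    for i in range(1, len(ys) + 1):
        for j in range(1, len(xs) + 1):
            d[i][j] = min(
                d[i - 1][j] + word_cost(ys[i - 1]),                          # delete from y
                d[i][j - 1] + word_cost(xs[j - 1]),                          # insert into x
                d[i - 1][j - 1] + substitution_cost(ys[i - 1], xs[j - 1]),   # substitute
            )
    return d[-1][-1]


def similarity(y, x):
    """Unnormalized similarity score in (0, 1], decreasing with the edit cost."""
    return math.exp(-edit_cost(y, x))


X_USER = "What about battery duration on this Iphone 6?"
Y_BATTERY = "What is the battery life of the Iphone 6?"
Y_SCREEN = "What is the screen size of the Galaxy Trend?"
```

With these hypothetical settings, `similarity(Y_BATTERY, X_USER)` is much larger than `similarity(Y_SCREEN, X_USER)`, because the second pair disagrees on several salient words while the first differs mostly in function words and one paraphrase.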

Parsing and Generation
To understand the reversibility properties of the model of Figure 1, let us first simplify the description by assuming that z, instead of being a tree, is actually a string. Then both ω and λ are string automata, and both κ and σ string-to-string transducers. Such a specification satisfies our definition of probabilistic reversibility, exploiting well-known compositionality properties of weighted finite-state machines over strings (Mohri, 2009). For parsing, we start from a fixed x_0, and can project it through σ into a weighted FSA over y; in turn we can project this automaton through κ onto an FSA over z, and finally intersect this automaton with ω, obtaining a final weighted "x_0-parser" automaton over z, representing a probability distribution from which we can draw exact samples as explained above. Generation works in exactly the reverse way, starting from a z_0 and eventually building a "z_0-generator" automaton over x.
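The two directions can be mimicked on a toy scale by replacing each automaton or transducer with a finite table. The sketch below assumes that the factor form means the joint weight of (x, z) is the product ω(z)·κ(z, y)·σ(y, x)·λ(x) summed over the latent y; all table entries and names are hypothetical, and the explicit enumeration stands in for the projection and intersection operations that finite-state machines would perform compactly.

```python
# Hypothetical finite stand-ins for the four factors.
Z_BAT, Z_SCR = "wad(batLife,iphone6)", "wad(screenSize,galaxyTrend)"
Y_BAT = "what is the battery life of the iphone 6"
Y_SCR = "what is the screen size of the galaxy trend"
X_1 = "what about battery duration on this iphone 6"
X_2 = "what is the battery life of the iphone 6"
X_3 = "screen size of the galaxy trend"

OMEGA = {Z_BAT: 0.6, Z_SCR: 0.4}                      # weights over logical forms z
KAPPA = {(Z_BAT, Y_BAT): 1.0, (Z_SCR, Y_SCR): 1.0}    # z -> canonical text y
SIGMA = {(Y_BAT, X_1): 0.5, (Y_BAT, X_2): 1.0,        # similarity of y and x
         (Y_SCR, X_1): 0.01, (Y_SCR, X_3): 0.8}
LAMBDA = {X_1: 0.2, X_2: 0.5, X_3: 0.3}               # weights over utterances x


def joint_weight(z, x):
    """Unnormalized weight of (x, z), summing the factor product over the latent y."""
    return sum(
        OMEGA.get(z, 0.0) * k * SIGMA.get((y, x), 0.0) * LAMBDA.get(x, 0.0)
        for (zz, y), k in KAPPA.items() if zz == z
    )


def normalized(weights):
    total = sum(weights.values())
    if total == 0:
        raise ValueError("no analysis with nonzero weight")
    return {k: w / total for k, w in weights.items() if w > 0}


def parser(x0):
    """Distribution over logical forms z for a fixed utterance x0."""
    return normalized({z: joint_weight(z, x0) for z in OMEGA})


def generator(z0):
    """Distribution over utterances x for a fixed logical form z0."""
    return normalized({x: joint_weight(z0, x) for x in LAMBDA})


PARSE = parser(X_1)
GENERATE = generator(Z_BAT)
```

In this toy setting the non-canonical input `X_1` is still parsed, mostly as `Z_BAT`, while generation from `Z_BAT` prefers the canonical wording `X_2`; this is the intended division of labor between robustness in parsing and well-formedness in generation.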
In the actual proposal, z is a tree, meaning that ω is a tree automaton, and κ a tree-to-string transducer. While finite-state tree automata correspond to a single concept, and share all the nice properties of string automata (Comon et al., 2007), the situation with tree-to-tree or tree-to-string transducers is more complicated (Maletti, 2010; Graehl et al., 2008): several variants exist, only some of which support the operations that our conceptual model requires (composition with the string transducer σ and intersection with the tree automaton ω). In particular, the "linear non-deleting top-down tree transducers" defined in (Maletti, 2010) have the requisite properties.

Contextual factors
We now briefly come back to the factors ζ (tree automaton) and µ (string automaton) of Figure 1, which highlight the usefulness of our modular finite-state architecture. These factors play similar roles to ω and λ, but they evolve dynamically with the context. In dialogue applications, utterances can often only be interpreted by reference to the current dialogue state (e.g. "ten hours" in the context of a question about battery life), and the ζ factor can be used as a compact representation of the current expectations of the dialogue manager about the next logical form, to be combined with the actual customer's utterance. Symmetrically, the µ factor can be used to represent such phenomena as lexical entrainment (Brennan, 1996), where the agent's utterance is oriented towards using similar wordings to the customer's.
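The effect of a contextual factor such as ζ can be pictured as one more multiplicative term applied to the context-free parse weights. The following short Python sketch assumes finite tables in place of tree automata; the logical-form labels, the weights, and the function name `apply_context` are hypothetical.

```python
def apply_context(parse_weights, zeta):
    """Multiply context-free parse weights by the contextual factor zeta, then renormalize."""
    combined = {z: w * zeta.get(z, 0.0) for z, w in parse_weights.items()}
    total = sum(combined.values())
    if total == 0:
        raise ValueError("context rules out every candidate logical form")
    return {z: w / total for z, w in combined.items()}


# Hypothetical: out of context, "ten hours" is equally compatible with two logical forms.
NO_CONTEXT = {"inform(batLife, 10h)": 0.5, "inform(chargeTime, 10h)": 0.5}
# Hypothetical expectations of the dialogue manager after a question about battery life.
ZETA = {"inform(batLife, 10h)": 0.9, "inform(chargeTime, 10h)": 0.1}

WITH_CONTEXT = apply_context(NO_CONTEXT, ZETA)
```

Because ζ enters only as an extra factor, it can be replaced at every dialogue turn without touching the other modules; the µ factor would act in the same way on the utterance side.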

Related work
The unique formal properties of finite-state machines, which favor modular decompositions of complex tasks, have long been exploited in Computational Linguistics. Tree transducers in particular have gained popularity in Statistical Machine Translation, starting with (Yamada and Knight, 2001), as described in the surveys (Maletti, 2010; Razmara, 2011).
The reversibility properties of finite-state transducers have been exploited to a more limited extent, starting with applications of non-weighted string-to-string transducers to morphological analysis and generation (Beesley, 1996).
Concerning the application of weighted finite-state tree machines to NLU/NLG reversibility, our proposal is strongly related on the one hand to the approach of (Jones et al., 2012), who explicitly propose tree-to-string transducers as a tool for modelling semantic parsing and for training on semantically annotated data, and on the other hand to (Wong, 2007; Wong and Mooney, 2007), who focus more directly on the problem of inverting a semantic parser into a generator. Wong et al. do not explicitly use tree-based transducers, but rather a formalism inspired by SCFGs (synchronous context-free grammars), which essentially corresponds to a form of tree-to-string transducer. In relation to reversibility considerations, presentations in terms of synchronous formalisms have the interest that they are intrinsically symmetrical. Such formalisms have tight relations to tree transducers (Shieber, 2004); one recently proposed generalization, "Interpreted Regular Tree Grammars" (Koller and Kuhlmann, 2011), allows multiple (possibly more than two) synchronized views of an underlying abstract derivation tree, and has the advantage of permitting a uniform treatment of strings and trees.
One important aspect in which our proposal differs from these previous approaches is in proposing to decouple the "core" task of mapping logical forms to well-formed latent canonical realizations from the task of relating these realizations to actual utterances, through an additional "similarity" transducer acting as a bridge.
This idea of a bridge is however close to another line of work in semantic parsing, not transducer-based, namely (Berant and Liang, 2014; Wang et al., 2015). There, a simple generic grammar is used to generate canonical realizations from a repertory of possible logical forms (expressed in a variant of lambda calculus). Given an input to parse, simple heuristics are used to select a finite list of potential logical forms which are then ranked according to the (paraphrase-based) similarity of their associated canonical realization with the input. Thus in this approach, a form of generation plays an important role, not for its own sake, but as a tool for semantic parsing.

Conclusion
Because of their unique compositional properties, finite-state modules are a natural choice for implementing our definition of reversibility as efficient bidirectional sampling from a common specification. In this piece we have argued in favor of an architecture realizing this definition and displaying robustness and contextuality.