NeuroLogic Decoding: (Un)supervised Neural Text Generation with Predicate Logic Constraints

Conditional text generation often requires lexical constraints, i.e., which words should or shouldn't be included in the output text. While the dominant recipe for conditional text generation has been large-scale pretrained language models that are finetuned on the task-specific training data, such models do not learn to follow the underlying constraints reliably, even when supervised with large amounts of task-specific examples. We propose NeuroLogic Decoding, a simple yet effective algorithm that enables neural language models -- supervised or not -- to generate fluent text while satisfying complex lexical constraints. Our approach is powerful yet efficient. It handles any set of lexical constraints that is expressible under predicate logic, while its asymptotic runtime is equivalent to conventional beam search. Empirical results on four benchmarks show that NeuroLogic Decoding outperforms previous approaches, including algorithms that handle a subset of our constraints. Moreover, we find that unsupervised models with NeuroLogic Decoding often outperform supervised models with conventional decoding, even when the latter is based on considerably larger networks. Our results suggest the limit of large-scale neural networks for fine-grained controllable generation and the promise of inference-time algorithms.


Introduction
Text generation applications often need to incorporate semantic constraints, i.e., what words should and shouldn't appear in the output generation. Consider the task of generating a recipe from a set of ingredients (Kiddon et al., 2016), such as 'garlic,' 'steak,' and 'soy sauce' (Figure 1). A generated recipe should cover all of those ingredients, without hallucinating any new ones (such as 'pork' or 'beans'). This restriction, like the others shown in Figure 1 for other applications, can be modeled by a set of lexical constraints expressed as a predicate logic formula.

Figure 1: Overview of several constrained generation tasks (panels: Recipe Generation (Kiddon et al., 2016); Evaluating Gender Bias in MT (Stanovsky et al., 2019), contrasting the inflections 'Bäckerin'/'Bäcker' when translating "The physician told the baker that she had cancer."). For instance, generating a short description from a given set of concepts (COMMONGEN (Lin et al., 2020)) requires using each of those words at least once; this can be expressed as a logical expression (here, '(food ∨ foods) ∧ . . .'). Our proposed NEUROLOGIC DECODING algorithm handles all predicate logic constraints efficiently, with the same asymptotic runtime as beam search.
The dominant paradigm today for performing such constrained generation is to start with a pre-trained language model and then finetune it on a dataset of task-specific examples. However, pretrained language models struggle to learn to follow these constraints, even when the finetuning dataset is large. For example, on the aforementioned recipe generation task, a GPT2 model finetuned on hundreds of thousands of recipes still hallucinates extra ingredients. In stark contrast, humans need to see only a few examples (or even none) to generate the desired output satisfying all the logical constraints, e.g., writing a recipe that mentions each ingredient (butter, steak, etc.) without using new ones.
We hypothesize that this mismatch is due to a fundamental under-specification of finetuning. If we finetune one of today's state-of-the-art language models on a dataset, the likelihood of it generating sequences from the same distribution should increase. Yet there is no guarantee that this improvement in likelihood will come from improvements on the fundamental task of constrained generation, as opposed to picking up on dataset-specific patterns such as language style. In fact, we present analysis suggesting that 'worst-case' learning behavior is common in practice: when we increase the finetuning data fed to a GPT2 model (with beam-search decoding) by an order of magnitude, constraint-satisfaction shows little improvement.
To address this issue, we propose NEUROLOGIC DECODING, which effectively enforces the satisfaction of given lexical constraints by controlling the decoding stage of sequence generation. These constraints can be any predicate logic formula, which crucially includes both positive constraints (the word 'butter' must be generated somewhere) and negative constraints ('bean' cannot be generated). These simpler constraints can then be combined through logical connectives to handle more complex requirements such as inflection or synonyms ('beef' or 'steak' both satisfy the constraint of referring to the steak). While beam search aims to maximize the likelihood of the generated sequence, our method searches for optimal output sequences among the strings that also satisfy the given constraints. It does so efficiently: we convert the hard logic constraints into a soft penalty term in the decoding objective, and augment beam search to diversify the subset of partially satisfied clauses; these clauses are tracked to reuse computation. NEUROLOGIC DECODING thus effectively and efficiently controls text generation without requiring any modification of the model structure or training pipeline.
We evaluate our method on four different text generation tasks: generative commonsense reasoning (COMMONGEN;Lin et al., 2020), recipe generation (Kiddon et al., 2016), data-grounded dialogue response generation (Wen et al., 2015), and posthoc correction of gender bias in MT (Stanovsky et al., 2019). Empirical results demonstrate that NEUROLOGIC DECODING ensures the satisfaction of given constraints while maintaining high generation quality, in turn leading to new SOTA results in both the supervised and zero-shot setting.

Background ∨ Related Work
Today, the dominant approaches for conditional text generation (generating an output sequence y given an input sequence x) are large pretrained left-to-right language models that are then finetuned on a dataset of examples (x, y). These models are typically trained to maximize the conditional probability P_θ(y|x), yet how best to decode from them remains a challenge.
The two main approaches for decoding today involve sampling from or maximizing P θ (y|x). Analysis suggests that maximization-based approaches -like much of the past work in constrained decoding -work particularly well for closed-ended conditional generation (Holtzman et al., 2020). The vast search space of natural language makes exact maximization infeasible, so approximations such as beam search (Och and Ney, 2004) are frequently used.

Prior Decoding Approaches
The observation that language models (especially the non-pretrained ones of the past) frequently violate important constraints has led to several approaches for constrained decoding, which we discuss below. There are a few key desiderata for such a decoding algorithm: a) Expressivity. The algorithm should be able to handle a variety of instance-level lexical constraints, ideally both positive and negative. b) Efficiency. Decoding should remain tractable; the relevant factors can include the length of the sequence N, the number of beam-search beams k, and the number of constraints C.
c) Language quality. The algorithm should nonetheless generate fluent and coherent language, without an exceptionally low conditional probability under the generator.
Earlier approaches, which we discuss below and in Table 1, exclusively handle tasks with positive constraints where a sequence of at most length N must be generated, while containing a set of C required words (or clauses of words). However, for the tasks we study in this paper, many of the helpful constraints are negative, often prohibiting a long list of irrelevant words that we should not use. This makes minimal dependence on C paramount. As such, we group approaches by their runtime in the number of constraints C.
2.1.1 An approach exponential in C: CBS Anderson et al. (2017) propose constrained beam search (CBS), where constraint satisfaction is recorded using a finite-state machine with 2^C states (one for each possible assignment of completed constraints). Beam search is then done over all states with k candidates per state, and candidates move from one state to another. In the end, we retrieve the best-scoring sequence from the 'all-constraints-satisfied' state. This method suffers from an exponential complexity of O(Nk·2^C), making it infeasible for language generation with many constraints.
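As a concrete sketch of why CBS scales poorly, the finite-state bookkeeping can be written down directly (the function names here are ours, for illustration only):

```python
def num_cbs_states(num_constraints: int) -> int:
    # CBS keeps one finite-state-machine state per subset of
    # already-satisfied constraints: 2^C states in total.
    return 2 ** num_constraints

def cbs_work_per_step(num_constraints: int, beam_size: int) -> int:
    # Beam search maintains k candidates in *each* state, so every
    # decoding step expands O(k * 2^C) hypotheses.
    return beam_size * num_cbs_states(num_constraints)
```

With just C = 20 constraints (e.g., a long list of forbidden ingredients) and k = 4, a single decoding step would already touch over four million hypotheses.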
2.1.2 An approach linear in C: GBS Hokamp and Liu (2017) propose grid beam search (GBS), which groups hypotheses by the number of constraints they satisfy, resulting in C + 1 separate groups. Each group stores at most k candidates that are expanded at each timestep; candidates move between groups based on the number of constraints they satisfy. Though GBS has a faster runtime of O(NkC), the choice of data structure can hurt language quality: the decoder has limited capability to generate sequences in which the constraints are filled in a non-greedy order.

Approaches without explicit dependence on C
Dynamic beams for GBS Post and Vilar (2018) propose an approach to reduce GBS's explicit dependence on C. Beam search is done over a single beam, with the slots of this beam dynamically allocated over the C + 1 groups explicitly used by GBS. This approach was made GPU-efficient by Hu et al. (2019a). Still, the language quality issue of GBS remains, and can be worse in practice as fewer hypotheses are considered at each step.
Local editing with sampling Miao et al. (2019) propose Constrained Generation by Metropolis-Hastings sampling (CGMH). This approach begins a sentence by inserting all positive-constraint keywords, in random order. Operations are then randomly sampled to replace, insert, or delete words from the sentence; the probabilities are computed on top of the language model. This approach thus has a runtime (in terms of number of generator calls) independent of the number of constraints; yet in practice it can involve repeated deletions and insertions, reducing efficiency. Moreover, generation quality is highly sensitive to the order of the keywords in the initial state and sampled edits.

Applications of Lexically Constrained Generation
Many prior studies on conditional text generation and revision can be viewed as applications of lexically constrained generation. Examples include incorporating pre-specified lexical constraints (Anderson et al., 2017; Post and Vilar, 2018), user-provided terminology constraints (Hasler et al., 2018; Dinu et al., 2019), and noisy automatic constraints into translation output. Another major use case of lexically constrained decoding is paraphrase generation (Hu et al., 2019a; Kajiwara, 2019; Hu et al., 2019b; Miao et al., 2019), where words in the source sentence that require paraphrasing can be negatively constrained in the output to enforce paraphrasing. Further applications lie in image captioning for novel scenes or out-of-domain objects (Anderson et al., 2017) and image captioning with explicit grounding to objects in the scene (Ren et al., 2015; Krause et al., 2016).

Our Method: NeuroLogic Decoding
NEUROLOGIC DECODING generalizes past work by enabling a language model to handle any predicate logic constraint while yielding high-quality text. Like beam search, we search over left-to-right sequences to maximize the conditional probability P_θ(y|x), but here we introduce machinery that tracks the logical constraints efficiently and diversely (covering different ways and orders in which the constraints can be satisfied). We formalize the notion of logical constraints and present our algorithm for finding approximately-optimal solutions to the constrained optimization problem in Section 3.1.

Algorithm
NEUROLOGIC DECODING accepts lexical constraints in Conjunctive Normal Form (CNF):

$$(D_1 \lor \cdots \lor D_i) \land \cdots \land (D_k \lor \cdots \lor D_n),$$

where each $D_i$ represents a single positive or negative lexical constraint $D(a_i)$ or $\neg D(a_i)$, indicating whether word/phrase $a_i$ must be included in or omitted from the output (respectively). Any combination of such constraints expressible as a predicate logic formula can be converted to CNF, and thus handled by NEUROLOGIC DECODING. Notationally, we refer to each individual constraint $D_i$ as a literal, and to a disjunction of literals (a bracketed OR term above) as a clause, denoted $C_j$, with $L$ being the total number of clauses. Our method seeks optimal sequences in which all clauses are satisfied, formally

$$\hat{y} = \arg\max_{y \in \mathcal{Y}} P_\theta(y \mid x) \quad \text{s.t.} \quad C_j = 1, \;\; \forall j \in \{1, \dots, L\}.$$

Past work on constrained optimization introduces penalty functions (Fiacco, 1976) to approximate the constrained problem with an unconstrained one, by adding a high-cost penalty term to the objective for violation of constraints:

$$\hat{y} = \arg\max_{y \in \mathcal{Y}} \; P_\theta(y \mid x) \;-\; \alpha \sum_{j=1}^{L} \big(1 - C_j\big),$$

where a high $\alpha$ results in a significant objective reduction for any violated clause. Intuitively, this objective trades off sequence likelihood (term 1) against the number of satisfied clauses (term 2, since minimizing violations maximizes the number of satisfied clauses). We take a related approach, attempting to find sequences that do well on both terms. While exhaustive search is intractable, we use a modified beam search to find approximately-optimal solutions for this objective. At each decoding step, we fill the beam with the most likely (term 1) candidates across a range of values for term 2, maintaining a balance of candidates that are likely and candidates that satisfy many constraints. In the end, we take the candidates with the most clauses satisfied (term 2) and select the most probable (term 1) one from this group.
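The penalty-based objective above can be sketched as a scoring function (a minimal sketch of the scoring only; the search machinery is described in the following sections):

```python
def penalized_objective(log_likelihood: float,
                        clause_satisfied: list,
                        alpha: float = 10.0) -> float:
    # Term 1: the sequence (log-)likelihood under the model.
    # Term 2: subtract alpha for every clause C_j that is violated,
    # so a high alpha makes any violation very costly.
    violations = sum(1 for c in clause_satisfied if not c)
    return log_likelihood - alpha * violations
```

With a high alpha, a slightly less fluent hypothesis that satisfies all clauses outscores a fluent one that violates even a single clause.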

Tracking Clause Satisfaction
When considering whether a generation hypothesis satisfies some clause $C_j$ during generation, there are fundamentally three cases. Case 1: if a positive literal $D(a_i)$ in the clause is satisfied, the full clause is permanently satisfied by the hypothesis; by definition, the hypothesis contains $a_i$, and further decoding cannot change this. In contrast, if only negative literals $\neg D(a_i)$ are satisfied (case 2), the clause may become unsatisfied if further decoding produces the word/phrase $a_i$. Finally, case 3 is the simple case in which the clause is not (yet) satisfied.
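These three cases can be checked directly from a hypothesis's tokens; a small sketch (the representation is ours: a clause as a list of (phrase, is_positive) literals):

```python
def clause_case(hypothesis: set, clause: list) -> int:
    """Return 1, 2, or 3 for the cases described above."""
    # Case 1: some positive literal D(a_i) already holds -- irreversible,
    # since decoding only appends tokens.
    if any(pos and a in hypothesis for a, pos in clause):
        return 1
    # Case 2: satisfied only through negative literals ~D(a_i) -- this can
    # still flip if a_i is generated later.
    if any(not pos and a not in hypothesis for a, pos in clause):
        return 2
    # Case 3: the clause is not (yet) satisfied.
    return 3
```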
Fundamentally, we only need to keep track of clauses in case 2 or 3, as we already know the value of any clause in case 1 for the final generation. Thus, we only track constraints that appear in a clause in case 2 or 3 for some hypothesis in the beam.

Figure 2: Selecting hypotheses to fill the next beam from the k × |V| candidates. Valid candidates are grouped into buckets based on the satisfaction status of their clauses, and the highest-scoring candidate within each bucket is iteratively picked to fill the next beam. The valid candidates are the intersection of the candidates whose score is within the top-α and the candidates whose number of satisfied clauses is within the top-β.
To efficiently track the occurrence of a constrained word/phrase as generation proceeds, we keep track of the matched partial prefix of each literal in the ongoing generation by maintaining a pointer into its token(s). At the beginning, the pointer is set to the head of the literal. At each time step, if the next generated token matches the token at the pointer, we advance the pointer; otherwise we reset the pointer to the head of the literal. A special case arises when the next generated token does not match the token at the pointer but does match another token of the literal; in that case, we set the pointer after the longest prefix of the literal that is also a suffix of the current generation. If the pointer reaches the end, the literal's word/phrase has appeared in the decoded sequence, and we mark the literal as satisfied or violated depending on whether it is positive or negative, updating its clause's status accordingly.
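The pointer update can be sketched as follows (a simplified re-implementation that tracks the length of the matched prefix rather than an explicit pointer):

```python
def advance_prefix(matched: int, literal: list, next_token: str) -> int:
    """Return the new matched-prefix length of `literal` after
    generating `next_token`; len(literal) means fully matched."""
    recent = literal[:matched] + [next_token]
    # Find the longest prefix of the literal that is a suffix of the
    # generation so far (this also covers the plain match/reset cases).
    for n in range(min(len(literal), len(recent)), 0, -1):
        if recent[-n:] == literal[:n]:
            return n
    return 0
```

For the two-token literal ['soy', 'sauce'], generating 'soy' twice in a row keeps the match length at 1 (anchored on the second 'soy') rather than resetting to 0.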

Optimization
At each time step, the decoding model generates a distribution over the vocabulary V for each of the k hypotheses in the current beam, resulting in a candidate score matrix of size k × |V|. Standard beam search fills the beam at time step t + 1 with the candidates corresponding to the k best scores. Our method, in contrast, selects k candidates that span a range of trade-offs between score and number of satisfied clauses.

Along with the score matrix, we produce a constraint state for each of the k × |V| new candidate hypotheses h, based on the next token considered. We discard any h with an unsatisfiable clause, i.e., a clause whose literals are all negative and all appear in h, to focus only on candidates that might still satisfy all constraints.
Next, we select a beam of k candidates for the next decoding step. Here we aim to balance high decoding score with a high number of satisfied constraints, making progress on satisfying clauses without producing degenerate solutions. We achieve this in two ways.
First, we filter the candidate hypotheses to those in the top tier of both satisfied constraints and decoding score. Specifically, we drop any candidate that is not in the top-α by decoding score or not in one of the top-β bins by number of satisfied clauses; α and β are adjustable parameters that decide how diverse a set of candidates the next beam is chosen from. Second, we select the beam of k candidates. We aim to represent a range of natural sentences (i.e., high decoding score) while also encouraging constraint satisfaction, and so include candidates that optimize for both. We start by binning candidates by the set of clauses they satisfy.
We then proceed in rounds of filling the beam until we reach k candidates. In each round j, we take the j-th candidate from each bin, as ranked by decoding score, and then fill the beam from this set in order of decoding score. In the first round, this means taking the top-scoring candidate for each unique set of satisfied clauses. By taking one example from each bin (defined by the set of satisfied clauses), we ensure that the beam contains candidates satisfying a range of different constraints. By selecting for decoding score, both within each bin and when filling the beam, we prioritize natural and well-formed sentences. Over many steps, this yields high-quality hypotheses that satisfy the desired constraints.
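The round-based filling procedure can be sketched as follows (the candidates are assumed to be pre-filtered to the top-α/top-β region described above):

```python
def fill_beam(candidates, k):
    """candidates: (score, satisfied_clauses) pairs, with
    satisfied_clauses a frozenset. Returns up to k candidates."""
    bins = {}
    for cand in sorted(candidates, key=lambda c: -c[0]):
        bins.setdefault(cand[1], []).append(cand)  # bin by clause set
    beam, rnd = [], 0
    while len(beam) < k:
        # Round rnd: take the (rnd+1)-th best candidate from every bin...
        picks = [b[rnd] for b in bins.values() if rnd < len(b)]
        if not picks:
            break
        # ...and add them to the beam in order of decoding score.
        picks.sort(key=lambda c: -c[0])
        beam.extend(picks[: k - len(beam)])
        rnd += 1
    return beam
```

In the first round, the top-scoring candidate from each bin enters the beam, so hypotheses covering different subsets of clauses all survive.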
In the end, we bin by number of clauses satisfied, then take the highest-scoring hypothesis from the highest-clause bin, ensuring the final generation satisfies as many clauses as possible.

Experiments I: Constrained Commonsense Generation

COMMONGEN (Lin et al., 2020) is a benchmark dataset designed as a test of generative commonsense reasoning. Given a set of common concepts (e.g., dog, frisbee, catch, throw), the task is to generate a coherent sentence describing an everyday scenario using these concepts (e.g., "a man throws a frisbee and his dog catches it").

Problem Formulation
The input is an unordered set of k concepts x = {a 1 , a 2 , . . . , a k }, where each concept a i is a common object (noun) or action (verb). The expected output is a simple, grammatical sentence y ∈ Y that describes a common scenario using all given concepts in x with correct morphological inflections.
To apply NEUROLOGIC DECODING, we impose that each $a_i$ must appear in the output $y$ under some morphological inflection. Let $\tilde{a}_i = \{\tilde{a}_{i,1}, \dots, \tilde{a}_{i,|\tilde{a}_i|}\}$ denote all inflections of $a_i$. The output $y$ covers concept $a_i$ if at least one of $\{\tilde{a}_{i,1}, \dots, \tilde{a}_{i,|\tilde{a}_i|}\}$ appears in it, where $D(\tilde{a}_{i,j})$ is a boolean function indicating whether $y$ contains $\tilde{a}_{i,j}$, as defined above. The equivalent predicate logic expression is

$$\bigwedge_{i=1}^{k} \Big( D(\tilde{a}_{i,1}) \lor D(\tilde{a}_{i,2}) \lor \cdots \lor D(\tilde{a}_{i,|\tilde{a}_i|}) \Big).$$

Approach and Baseline The standard pipeline for this problem is to treat it as conditional sentence generation. We experiment with several recent pre-trained language models, all finetuned with their default hyperparameters. We compare against commonly used decoding methods, including beam search and sampling, as well as previously proposed constrained decoding methods. We use several widely-used automatic metrics, such as BLEU, ROUGE, and METEOR, which mainly measure surface similarity, and also include metrics specially designed for captioning tasks, such as CIDEr and SPICE. Following Lin et al. (2020), we report concept Coverage: the average percentage of input concepts that are present in the lemmatized outputs.
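Constructing the COMMONGEN constraint is then mechanical; a sketch (the inflection lists here are written by hand for illustration, whereas a real system would derive them with a morphology tool):

```python
def commongen_constraint(inflections: dict) -> list:
    # One clause per concept: a disjunction of positive literals,
    # one for each inflected form of that concept.
    return [[(form, True) for form in forms]
            for forms in inflections.values()]

clauses = commongen_constraint({
    "catch": ["catch", "catches", "caught", "catching"],
    "dog":   ["dog", "dogs"],
})
```

Any single inflection appearing in the output satisfies its concept's clause.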

Results I: NeuroLogic vs Other Decoding with the Best Supervised Model
In Table 4, we first present comparisons across different decoding methods based on a supervised sequence-to-sequence model, UniLM. As will be seen in later experiments, UniLM is the best-performing model for COMMONGEN in the supervised setting. A key observation is that all of the previous constrained decoding methods (Hokamp and Liu, 2017; Post and Vilar, 2018; Hu et al., 2019a) attain high constraint satisfaction at the cost of generation quality: their generation quality is lower than that of unconstrained decoding methods such as top-p sampling (Holtzman et al., 2020) and greedy decoding.
These observations demonstrate that the improved logical expressiveness of NEUROLOGIC, together with its effective search strategy, leads to generations of higher quality that also satisfy the constraints most effectively.

Results II: NeuroLogic across Supervised Models

Table 2 presents experiments across various state-of-the-art pre-trained language models. In this experiment, all models are supervised on the COMMONGEN training data. Under each column, α → β shows the performance using conventional beam search (α) compared to the enhanced performance using NEUROLOGIC DECODING (β). As before, NEUROLOGIC improves performance across all models and all metrics, without exception. Moreover, not only does NEUROLOGIC improve the coverage of constraint satisfaction, it also improves generation quality across all metrics. The improvement is especially substantial when generation quality is relatively low due to smaller model capacity or a less efficient model architecture or pre-training.

Results III: NeuroLogic with Unsupervised Models
In this experiment, we test how well NEUROLOGIC works with unsupervised pre-trained language models, with and without domain adaptation.

Figure 3: Performance (y-axis) of supervised GPT2-Large on COMMONGEN, with a varying amount of training data for supervision (x-axis). The orange line denotes decoding with NEUROLOGIC, and the blue line denotes decoding with conventional beam search. Even after being supervised on 100% of the training data, the supervised GPT2 does not successfully learn the COMMONGEN constraints ('coverage') and is even outperformed by the zero-shot GPT2 (i.e., using 0% training data) with NEUROLOGIC.

Table 3 presents experimental results of zero-shot (i.e., unsupervised) constrained generation. With unconstrained decoding, we have no controllability over the unsupervised language models, as they ignore the problem input and generate irrelevant text.
With NEUROLOGIC, on the other hand, we dramatically improve performance on all metrics. Figure 5 shows some generated examples.

In the zero-shot setting, without any finetuning, the language style of pre-trained LMs might differ from that of COMMONGEN. To further improve performance, we conduct language-domain adaptation by finetuning the pre-trained language models on the training-data portion of COMMONGEN. We observe that after domain adaptation, NEUROLOGIC in the zero-shot setting outperforms unconstrained generation with supervised finetuned LMs, which suggests that inference-time algorithms can provide a more compute-efficient avenue for drawing better output from neural models.

Results IV: Ablation
The amount of training data Figure 3 compares the performance (y-axis) of supervised GPT2 with NEUROLOGIC (orange line) compared to supervised GPT2 with conventional beam search (blue line) as a function of the increasing training dataset size (x-axis).
Notably, even after being supervised on 100% of the training data, the supervised GPT2 does not successfully learn the COMMONGEN constraints ('coverage') and is even outperformed by the zeroshot GPT2 (i.e., using 0% training data) with NEU-ROLOGIC.
The model size Figure 4 compares the performance (y-axis) of GPT2 with varying model sizes (x-axis). Regardless of model size, NEUROLOGIC (purple and black lines) boosts performance considerably over conventional beam search (blue line). Moreover, with NEUROLOGIC, the performance of unsupervised models (black line) becomes comparable to that of supervised models (purple line).
Remarkably, unsupervised models with NEU-ROLOGIC based on smaller networks (black line) often outperform supervised models with conventional beam search based on considerably larger networks (blue line).

Experiments II: Recipe Generation
While COMMONGEN is a sentence-level generation task, we next evaluate a paragraph-level generation task: cooking recipe generation. Given a recipe title (i.e., the name of the dish) and the list of ingredients, the task is to generate the correct cooking instructions for the recipe.

Figure 5: Generation examples from different models (GPT-2, UniLM, BART, T5, GPT, CTRL) in the supervised and zero-shot settings, with and without NEUROLOGIC DECODING.

Problem Formulation
The input is a recipe title and an unordered set of ingredients E = {e_1, ..., e_|E|}, where each e_i can be a single- or multi-word ingredient phrase (e.g., 'onions', 'black pepper'). Let G denote the set of all ingredients. The expected output is a paragraph y ∈ Y that describes multi-step cooking instructions.
To apply NEUROLOGIC DECODING, we constrain the output $y$ to contain all given ingredients $e_i \in E$ and no other ingredients, i.e., no ingredients in $G \setminus E$. Ingredients can be referred to with generic terms (e.g., 'vegetables' may refer to 'onions', 'green beans', or 'carrots'), and we denote the generic names for ingredient $e_i$ as $e_i^T$. Formally, our lexical constraint is

$$\bigwedge_{e_i \in E} \Big( D(e_i) \lor \bigvee_{e' \in e_i^{T}} D(e') \Big) \;\land\; \bigwedge_{e_j \in G \setminus E} \neg D(e_j).$$

Dataset We use Recipe1M+, a large-scale, structured corpus of over one million cooking recipes. On average, each recipe has 118 words and 9 ingredients. Around 70% of the data is in the train set, and the remainder is split equally between the validation and test sets.
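A sketch of the recipe constraint construction (the generic-term mapping is hand-written here for illustration):

```python
def recipe_constraint(used, generic, vocabulary):
    """used: given ingredients E; generic: ingredient -> generic names
    e_i^T; vocabulary: all known ingredients G. Returns CNF clauses."""
    # Positive clauses: each given ingredient, or one of its generic
    # names, must be mentioned somewhere in the recipe.
    clauses = [[(e, True)] + [(g, True) for g in generic.get(e, [])]
               for e in used]
    # Negative clauses: every other known ingredient is forbidden.
    clauses += [[(e, False)] for e in sorted(vocabulary - set(used))]
    return clauses
```

Note that the negative side contributes one single-literal clause per forbidden ingredient, which is why efficiency in the number of constraints C matters for this task.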
Approach and Baseline RecipeGPT (Lee et al., 2020) is an online recipe generation model whose generation module is a GPT-2 model finetuned on the Recipe1M+ dataset. Its default decoding algorithms are beam search and sampling, which serve as the baselines for evaluating our method. In addition, we compare against previously proposed constrained decoding methods applied to RecipeGPT. Besides common evaluation metrics for generation tasks, we introduce explicit measures of the coverage of given ingredients and the usage of extra ingredients.

Experiments III: Data-Grounded Dialogue Response Generation

In this task, we generate a natural language response given a query type (e.g., informing or querying) and a list of facts to convey (e.g., a hotel's name and address).
Problem Formulation The input is a query type and an unordered set of facts F = {f_1, ..., f_|F|}, where each f_i contains an attribute and a value (e.g., accepts_credit_cards="yes", name="red victorian bed breakfast"). The expected output is a dialogue response y ∈ Y containing the given information.
The lexical constraint here is that every given fact $f_i$ must be included in the response $y$ in a proper natural language form $f_i^N$. We use a very simple template to turn $f_i$ into its natural language form $f_i^N$ (e.g., the natural language form of accepts_credit_cards="no" is "doesn't accept credit cards"). Formally, the constraint is

$$\bigwedge_{i=1}^{|F|} D\big(f_i^{N}\big).$$

Dataset We use the hotel and restaurant dialogue system corpus and the same train-development-test split from Wen et al. (2016). There are 8 query types and 12 types of attributes.
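A toy version of such a template (the table below is illustrative, not the paper's actual template set):

```python
TEMPLATES = {
    ("accepts_credit_cards", "yes"): "accepts credit cards",
    ("accepts_credit_cards", "no"):  "doesn't accept credit cards",
}

def fact_to_phrase(attribute: str, value: str) -> str:
    # Slot-specific template if one exists; free-text values such as
    # name="red victorian bed breakfast" pass through unchanged.
    return TEMPLATES.get((attribute, value), value)

def dialogue_constraint(facts):
    # One single-literal positive clause per fact: D(f_i^N).
    return [[(fact_to_phrase(a, v), True)] for a, v in facts]
```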
Approach and Baseline The standard paradigm for dialogue generation is to treat it as conditional sentence generation and finetune a seq2seq model. While this pipeline works effectively with existing data, once we receive new user queries with new query types or new attributes, the seq2seq model cannot generate plausible responses; this situation arises frequently for a deployed dialogue response generation system. Thus, we are interested in zero-shot dialogue generation. We give a hand-crafted initial prompt to a pre-trained LM based on the query type, and apply NEUROLOGIC DECODING to force the given facts to be included in the generation. The pre-trained LM we use here is GPT2 (Radford et al., 2019). The baselines we compare against are seq2seq finetuned LMs with vanilla beam search, including GPT-2 (Radford et al., 2019) and T5 (Raffel et al., 2019). We also compare with the previous SOTA (Kiddon et al., 2016) on dialogue response generation.
Result Table 6 presents the experimental results. Zero-shot generation with the proposed method outperforms or matches the supervised baselines. This suggests that plugging NEUROLOGIC DECODING into pretrained LMs can yield a powerful dialogue generation system; we do not need massive finetuning, with its extra computational cost, to get there.

Experiments IV: Post-hoc MT Revision for Reducing Gender Bias
Learned models exhibit social bias when their training data encode stereotypes. In the many languages that associate biological and grammatical gender, the gender of an animate object can be identified via morphological markers. Indeed, many MT systems, such as Google Translate and Microsoft Translator, exhibit such biases, e.g., translating nurses as women and programmers as men regardless of context. We propose reducing this bias through NEUROLOGIC DECODING.
Problem Formulation We adopt the task setup and dataset of Stanovsky et al. (2019). The input x is an English sentence describing a scenario with people N = {n_1, ..., n_|N|} who are identified by role. The desired output is a translation y that uses the correct gender inflection in the target language (here, German or French). We obtain indicators of each person's gender through coreference resolution, linking each entity with their gendered pronoun. We then constrain the correctly-gendered human entity to appear in the output y. For a human entity $n_i$, let $n_i^F$ denote its female inflection in the target language and $n_i^M$ its male inflection. Let F denote the set of human entities identified as female and M the set identified as male. Formally, the lexical constraint is

$$\bigwedge_{n_i \in F} \Big( D(n_i^F) \land \neg D(n_i^M) \Big) \;\land\; \bigwedge_{n_i \in M} \Big( D(n_i^M) \land \neg D(n_i^F) \Big).$$

Dataset We use the same dataset as Stanovsky et al. (2019), which in turn is built over the English-only coreference gender-bias studies Winogender (Rudinger et al., 2018) and WinoBias (Zhao et al., 2018).
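The constraint construction mirrors the formula above; a sketch with hand-supplied German inflections (the entity names and forms here are illustrative):

```python
def gender_constraint(female, male, fem_form, masc_form):
    """female/male: entities by inferred gender; fem_form/masc_form map
    an entity to its gendered inflection in the target language."""
    clauses = []
    for n in female:
        # Require the feminine form, forbid the masculine one.
        clauses += [[(fem_form[n], True)], [(masc_form[n], False)]]
    for n in male:
        # And vice versa for male-identified entities.
        clauses += [[(masc_form[n], True)], [(fem_form[n], False)]]
    return clauses

clauses = gender_constraint(
    female=["baker"], male=["physician"],
    fem_form={"baker": "Bäckerin", "physician": "Ärztin"},
    masc_form={"baker": "Bäcker", "physician": "Arzt"},
)
```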

Approach and Baseline
We use strong transformer-based MT models (Junczys-Dowmunt et al., 2018) as baselines for both English-to-French and English-to-German. We use the metrics of Stanovsky et al. (2019) to measure gender bias in the translated text: Accuracy refers to correctly translating a person's gender, and ∆S is the difference in performance (F1) between stereotypical and non-stereotypical gender roles (lower is better). We report performance with vanilla beam search and with NEUROLOGIC DECODING. In addition to our default setting, where gender is inferred by a coreference resolution model, we also report results in an oracle setting where NEUROLOGIC DECODING is given ground-truth gender information about the N people.
Result Our results are shown in Table 7. When provided gender markers from a coreference model, NEUROLOGIC DECODING increases the accuracy of handling gender correctly by 30.5 percentage points for German and 28.0 points for French. This even outperforms commercial translation systems: the best commercial result, over any language or system, is Microsoft Translator for German with 74.1% accuracy, whereas NEUROLOGIC DECODING enables the baseline model to reach 91% accuracy. Performance increases by an additional 4 points (German) and 8.9 points (French) when ground-truth gender markers are used during constrained decoding. Finally, the diagnostic results also show that NEUROLOGIC DECODING is particularly effective at reducing (over)reliance on stereotypical gender roles, with a decrease in ∆S of 9 F1 (German) and 17.6 F1 (French). These results suggest that NEUROLOGIC DECODING is a plug-and-play approach for reducing gender bias in existing translation systems.

Conclusion
We propose NEUROLOGIC DECODING, an efficient and general method for generation with arbitrary positive and negative lexical constraints. We demonstrate its straightforward application to four different tasks as an extension of existing models, showing broad and consistent improvements in decoding quality.