COD3S: Diverse Generation with Discrete Semantic Signatures

We present COD3S, a novel method for generating semantically diverse sentences using neural sequence-to-sequence (seq2seq) models. Conditioned on an input, seq2seq models typically produce semantically and syntactically homogeneous sets of sentences and thus perform poorly on one-to-many sequence generation tasks. Our two-stage approach improves output diversity by conditioning generation on locality-sensitive hash (LSH)-based semantic sentence codes whose Hamming distances highly correlate with human judgments of semantic textual similarity. Though it is generally applicable, we apply COD3S to causal generation, the task of predicting a proposition's plausible causes or effects. We demonstrate through automatic and human evaluation that responses produced using our method exhibit improved diversity without degrading task performance.


Introduction
Open-ended sequence generation problems such as dialog, story generation, image captioning, or causal generation pose a practical challenge to neural sequence-to-sequence (seq2seq) models, as they necessitate a diverse set of predicted outputs. The typical decoding method for seq2seq models is beam search, which produces a set of candidate sequences that generally have high syntactic, lexical, and semantic overlap.
Recent methods for improved diversity in generation make slight modifications to the neural architecture or beam search algorithm (Xu et al., 2018; Li et al., 2016b), or impose lexical constraints during decoding (Post and Vilar, 2018; Hu et al., 2019a). Shu et al. (2019) propose the use of sentence codes, a technique in which generation is conditioned on a discrete code that aims to induce diversity in syntax or semantics. While their approach is effective for syntactic codes, it is less so for semantics. In this work, we introduce an improved method for diverse generation conditioned on inferred sentence codes that explicitly capture meaningful semantic differences. We use the contextual sentence embeddings from Sentence-BERT (SBERT; Reimers and Gurevych, 2019), the cosine distances between which correlate highly with human scalar judgments of semantic textual similarity (STS). We construct discrete codes from these embeddings using locality-sensitive hashing (Indyk and Motwani, 1998; Charikar, 2002), producing short binary signatures whose Hamming distances well preserve the cosine distances between inputs.
Our method induces a bitwise hierarchy of semantic bins whose similarities in signature imply similarities in semantics. Conditioning generation on a signature as a target-side prefix indicates the bin into which the generated sequence falls. We implement a two-stage decoding process that (1) infers the most relevant signatures and (2) decodes sequences via separate prefix-conditioned beams. We term our method COD3S: COnstrained Decoding with Semantic Sentence Signatures.
We demonstrate the effectiveness of COD3S in the context of causal sequence generation (Li et al., 2020) through BLEU- and cosine-based diversity measures as well as human evaluation.

Related Work
We draw inspiration from recent work in multilingual machine translation (MT) (Ha et al., 2016) and domain adaptation (Chu and Dabre, 2019) in which a language code (e.g. en, de) is prepended to the target to guide generation. Our method for encoding sentence diversity is closely related to MT work by Shu et al. (2019), who condition generation on prefixed sentence codes. They improve the syntactic diversity of sampled translations using codes produced from improved semantic hashing (Kaiser and Bengio, 2018) with a TreeLSTM-based autoencoder. Their experiments with semantic coding via clustering of BERT (Devlin et al., 2019) and FastText (Bojanowski et al., 2017) embeddings lead to negligible or negative effects. Outside of MT, Keskar et al. (2019) in a similar vein condition on manually categorized "control codes" that specify style and content, and Mallinson and Lapata (2019) condition on annotated syntactic or lexical change markers that can be learnt from data. We refer readers to Ippolito et al. (2019) for an overview of diverse decoding methods. Few to our knowledge explicitly and effectively encode open-domain semantic diversity.
Text-based causal knowledge acquisition is a well-studied challenge in NLP (Radinsky et al., 2012). Recent efforts have investigated open-ended causal generation using neural models (Bosselut et al., 2019; Li et al., 2020). The latter train a conditional generation model to propose cause or effect statements for a given proposition. The model is trained on the co-released corpus CausalBank, which comprises causal statements harvested from English Common Crawl (Buck et al., 2014).
Applications of LSH (Indyk and Motwani, 1998; Charikar, 2002) in NLP began with Ravichandran et al. (2005), who demonstrated its use in fast lexical similarity comparison; later, Van Durme and Lall (2010) showed such hashing could be performed online. More similar to our use case, Petrović et al. (2010) binned tweets via LSH to enable fast first story detection. Most related to ours is work by Guu et al. (2018), who describe a generative sentence model that edits a 'prototype' sentence using lexically similar ones retrieved via LSH.

COD3S Approach
Our signature construction method, depicted in Figure 1(a), produces a sequence of bits that collectively imply a highly specific bin of sentences with similar semantic meaning. This is accomplished by encoding sentences into high-dimensional vectors that encode degrees of semantic difference and then discretizing the vectors in a way that approximately preserves those differences.

Semantic Embedding Model
We embed a sentence using the contextual encoder Sentence-BERT (SBERT; Reimers and Gurevych, 2019), a siamese network trained to produce embeddings whose cosine similarity approximates the semantic textual similarity (STS) of the underlying sentences. We select this single-sentence encoder over other popular encoders, e.g. BERT, which are best suited to encoding concatenated sentence pairs and therefore do not produce individual embeddings whose semantic differences are retrievable under vector similarity metrics (Reimers and Gurevych, 2019; Shu et al., 2019). The cosine similarity of embeddings from SRoBERTa-L, the instance of SBERT that we use as our COD3S encoder, has a Spearman ρ correlation of .863 with human STS judgments from STSbenchmark (Cer et al., 2017). We provide a list of cosine/STS correlations using other models in Appendix E.

Discretization via LSH
Locality-sensitive hashing (LSH; Indyk and Motwani, 1998) maps high-dimensional vectors into low-dimensional sketches for quick and accurate similarity comparison under measures such as cosine or Euclidean distance. We use the popular variant by Charikar (2002), which computes a discrete b-bit signature; Appendix A provides an overview of this approach. The Hamming distance between two LSH signatures approximates the cosine distance of the underlying vectors:

    D(LSH(u), LSH(v)) / b ≈ θ(u, v) / π,

where θ(u, v) is the angle between u and v.

Two-Stage Decoding
As previous work associates the strength of a causal relationship with pointwise mutual information (PMI) (Gordon et al., 2012), we modify our objective to maximize the MI between x and each of s and y; we adapt the MMI-bidi objective from Li et al. (2016a):

    ŝ = argmax_s [ log p(s | x) + λ_s log p(x | s) ]    (1)

    ŷ = argmax_y [ log p(y | x, ŝ) + λ_y log p(x | y) ]    (2)

As shown in Figure 1(b), we first decode the k-best distinct sentence codes ŝ_1, . . ., ŝ_k as in Eq. 1. We then perform k conditional inferences in Eq. 2; we take the 1-best sentence from each to produce ŷ_1, . . ., ŷ_k. For both signature and sentence decoding, we follow Li et al. and sample an n-best list from the forward score log p(s | x) (resp. log p(y | x, ŝ)) before re-ranking with the added λ-weighted backward score. We approximate the forward scores using length-normalized beam search with beam size 100 for signatures and 40 for sentences. While log p(s | x) and log p(y | x, s) can be scored using a single forward model, we find it beneficial to train two, so that the first only learns to score signatures.
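As a concrete illustration of the signature construction described above, the following is a minimal sketch (not our released code) of hashing SBERT embeddings into 16-bit signatures with random hyperplane LSH; the SentenceTransformer checkpoint and helper names are illustrative assumptions:

```python
# A minimal sketch of the signature construction: SBERT embeddings are
# discretized into b-bit LSH signatures via random hyperplane projections
# (Charikar, 2002). Checkpoint and helper names are our assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

rng = np.random.default_rng(0)
encoder = SentenceTransformer("stsb-roberta-large")  # an SRoBERTa-L-style SBERT model

def make_hyperplanes(b: int, dim: int) -> np.ndarray:
    # Each of the b random normal vectors defines one hyperplane, i.e. one bit.
    return rng.standard_normal((b, dim))

def lsh_signature(embedding: np.ndarray, planes: np.ndarray) -> np.ndarray:
    # Bit i is the sign of the dot product with hyperplane i.
    return (planes @ embedding >= 0).astype(np.uint8)

def hamming(u: np.ndarray, v: np.ndarray) -> int:
    return int(np.sum(u != v))

sents = ["She hired a lawyer.",
         "She retained an attorney.",
         "I learned how to play the board game."]
embs = encoder.encode(sents)
planes = make_hyperplanes(b=16, dim=embs.shape[1])
sigs = [lsh_signature(e, planes) for e in embs]

# D(LSH(u), LSH(v)) / b approximates theta(u, v) / pi, so the paraphrase
# pair should typically differ in fewer bits than the unrelated pair.
print(hamming(sigs[0], sigs[1]), hamming(sigs[0], sigs[2]))
```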
Hamming Distance Threshold As sentences whose signatures differ by few bits tend to have highly similar semantics, we impose a threshold heuristic for decoded signatures ŝ_1, . . ., ŝ_k: min_{i≠j} D(ŝ_i, ŝ_j) > t, where D(•) is Hamming distance. We enforce this using a greedy algorithm that considers higher-scoring signatures first, keeping those that satisfy the threshold given the currently kept set and removing those that violate it. Taken as a whole, our decoding approach aims to generate the single highest-scoring applicable response that falls in each of the k-best inferred, sufficiently different semantic bins. The threshold parameter thus provides a way to effectively tune the model to a desired level of semantic diversity.
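The sketch below puts the two-stage decode together: MMI-bidi reranking for Eqs. 1 and 2 plus the greedy Hamming-distance threshold. The beam-search and backward-scoring callables are hypothetical hooks standing in for the trained seq2seq models (not fairseq API), and the default λ values are placeholders:

```python
# A sketch of the full two-stage decode under the stated assumptions.
import numpy as np

def mmi_rerank(cands, fwd_scores, bwd_score, lam):
    # Re-rank an n-best list by forward log-prob plus lambda-weighted
    # backward log-prob, best first.
    scored = [(f + lam * bwd_score(c), c) for c, f in zip(cands, fwd_scores)]
    return [c for _, c in sorted(scored, key=lambda p: p[0], reverse=True)]

def threshold_filter(signatures, t, k):
    # Greedily keep higher-ranked signatures; drop any within t bits of a
    # signature already kept.
    kept = []
    for sig in signatures:
        if all(np.sum(sig != other) > t for other in kept):
            kept.append(sig)
        if len(kept) == k:
            break
    return kept

def cod3s_decode(x, sig_beam, sent_beam, bwd_sig, bwd_sent,
                 t=2, k=3, lam_s=1.0, lam_y=0.5):
    # Stage 1 (Eq. 1): n-best signatures from the forward model, MMI-reranked,
    # then filtered to k sufficiently different semantic bins.
    sigs, fwd = sig_beam(x, beam=100)
    sigs = mmi_rerank(sigs, fwd, lambda s: bwd_sig(x, s), lam_s)
    sigs = threshold_filter(sigs, t, k)
    # Stage 2 (Eq. 2): one MMI-reranked sentence per kept signature prefix.
    outputs = []
    for s in sigs:
        ys, fwd_y = sent_beam(x, prefix=s, beam=40)
        outputs.append(mmi_rerank(ys, fwd_y, lambda y: bwd_sent(x, y), lam_y)[0])
    return outputs
```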

Experiments
We apply COD3S to the task of open-ended causal generation for free-form textual inputs as considered by Li et al. (2020). Given an input statement, the model must suggest a diverse set of possible causes or effects. We train models on sentence pairs from Li et al.'s released dataset, CausalBank, which is scraped from Common Crawl using templatic causal patterns. Following their work, we use 10 million sentence pairs that match the patterns "X, so Y" to train cause-to-effect models and "X because Y" for effect-to-cause models.
We experiment with 16-bit LSH signatures of SBERT embeddings. After prepending target-side bit signatures, pairs are encoded with byte-pair encoding (BPE; Sennrich et al., 2016) using a vocabulary size of 10K. We train Transformer models (Vaswani et al., 2017) using the FAIRSEQ library (Ott et al., 2019). Appendix B provides details for reproducibility.

Evaluation We show that COD3S induces sensible inference of diverse but relevant semantic bins and causal statements. Examples of generation are shown in Table 3 and additionally in Appendix C. We quantitatively compare COD3S against the outputs of regular seq2seq beam search, as well as of lexically constrained decoding with disjunctive positive constraints (DPC) and random sample decoding (S2S-RS) provided by Li et al. We include in the comparison instances of COD3S with and without MMI reranking, as well as with random sampling in place of beam search.

Automatic Diversity Metrics To measure lexical diversity, we set ∆(y, y′) to be the sentences' inverse (100 minus) BLEU-1 and -2 scores. To measure semantic diversity, we set ∆ to be the cosine distance between their SBERT embeddings. Higher scores imply greater diversity. Following Li et al., we evaluate on 100 examples from an out-of-distribution dev split of the Choice of Plausible Alternatives dataset (COPA; Gordon et al., 2012), with results shown in Table 2. In both cases, COD3S outperforms all other methods except random sampling, the addition of which also improves the diversity of COD3S itself. We also use the SBERT diversity score to count semantically diverse outputs by marking as duplicates those for which the embedding of the completed phrase ("X . . . Y") falls below some distance threshold from that of an earlier candidate. Table 2 (lower) shows that both the best COD3S model and random sampling produce far more semantically distinct statements than the beam search baseline.
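A hedged sketch of these pairwise diversity measures, with ∆ computed as inverse BLEU-1/-2 (lexical) or SBERT cosine distance (semantic) and averaged over all candidate pairs; the nltk and sentence-transformers usage is our assumption, not the exact evaluation code:

```python
# Sketch of the automatic diversity measures under the stated assumptions.
from itertools import combinations

import numpy as np
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("stsb-roberta-large")  # illustrative checkpoint
smooth = SmoothingFunction().method1

def inverse_bleu(a, b, weights=(1.0,)):
    # 100 minus BLEU between two candidates, on a 0-100 scale;
    # weights=(1.0,) gives BLEU-1, weights=(0.5, 0.5) gives BLEU-2.
    score = sentence_bleu([a.split()], b.split(),
                          weights=weights, smoothing_function=smooth)
    return 100.0 * (1.0 - score)

def lexical_diversity(candidates, weights=(1.0,)):
    pairs = list(combinations(candidates, 2))
    return sum(inverse_bleu(a, b, weights) for a, b in pairs) / len(pairs)

def semantic_diversity(candidates):
    # Mean pairwise cosine distance between SBERT embeddings.
    embs = encoder.encode(candidates)
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    dists = [1.0 - float(embs[i] @ embs[j])
             for i, j in combinations(range(len(candidates)), 2)]
    return float(np.mean(dists))
```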

Human Evaluation Our automatic metrics quantify diversity without tracking task effectiveness, which we evaluate by collecting judgments on Amazon Mechanical Turk. We ask workers to judge the plausibility of responses as causal completions (on a 0-5 Likert scale). For all methods except COD3S, we use the exact outputs evaluated in Li et al. (2020) and provided to us by the authors. The response sets for these models contain the top 3 decoded sentences under perplexity (PPL). We compare these to the top 3 as well as the top 10 sentences decoded by COD3S with and without MMI re-ranking (signature and sentence, no random sampling), ordered by PPL of the signature tokens. This discrepancy in per-model outputs reflects that we seek to evaluate COD3S, which is specifically crafted to produce a large set of distinct viable candidates, as directly as possible against the Li et al. (2020) responses from models that are not necessarily crafted with the same aim. Naturally occurring propositions have far more than 10 plausible and distinct causes and effects, and so we would hope that the 10th output of our one-to-many model would have similar quality to the 1st of the other models.
Results are shown in Figure 2. We observe that the top 1 and 3 COD3S responses according to PPL (blue) are comparable to, albeit slightly lower on average than, those of the other models. This may partially be attributed to the difficulty of the signature inference step, in which the differences in the top 100 predicted binary sequence PPLs are typically small. A COD3S 'oracle' that conditions generation on the gold answer's signature (which often has low predicted likelihood) performs more competitively (green).
We find that at least 1 of the top 3 signatures predicted by COD3S yields a competitively plausible sentence; when we take the highest plausibility score from the top 3 of each model under their respective PPL orderings (red), we find COD3S and the baseline S2S to be interchangeable. If we expand to the larger set of 10 outputs for COD3S models, we find that the mean of the 3 highest plausibility scores (faded purple) for the MMI model is comparable to the 1-best of the base seq2seq (red) and better than the mean of the top 3 by PPL (faded blue) for any model. This indicates that the 10-output set, which under automatic metrics contains higher numbers of semantically diverse statements, also contains at worst a set of 3 outputs that are better than the 3 from models not designed for one-to-many diverse prediction.
Qualitative Analysis Table 3 shows examples of models predicting and re-ranking sentences within inferred signature bins. Candidate predictions listed in order of MMI score reflect the ability of MMI-based reranking to select the candidates within a bin that are most relevant to the input. Outputs are shown beneath a representative bin medoid, i.e. the sentence whose embedding minimizes the total cosine distance to all other training sentences that fall in the bin. The two-step inference process depicted here allows for a level of interpretability at the signature level, as sampling training sentences from the inferred semantic bin gives a snapshot of an inferred semantic space that can be more informative than individual sentences alone.
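A minimal sketch of medoid selection as described above, assuming an SBERT encoder from the sentence-transformers package (the checkpoint name is illustrative):

```python
# Sketch: among the training sentences whose signature falls in a bin, the
# medoid minimizes the summed cosine distance to all other members.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("stsb-roberta-large")

def bin_medoid(bin_sentences):
    embs = encoder.encode(bin_sentences)
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    cos_dist = 1.0 - embs @ embs.T  # pairwise cosine distances
    return bin_sentences[int(np.argmin(cos_dist.sum(axis=1)))]
```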
Future work might explore alternative methods for signature inference. The bit sequence likelihoods predicted by COD3S are often clumped together and/or biased towards signatures that intuitively do not apply to an input but are overrepresented in the training set. We also observe that although MMI decoding discourages bland, context-insensitive statements, the model still tends towards a small set of generic predicates, e.g. 'having,' 'knowing,' or 'being able to.'

Conclusion
We have outlined COD3S, a method for producing semantically diverse statements in open-ended generation tasks. We design sentence LSH signatures that encode bitwise the semantic similarity of underlying statements; conditioning generation on different signatures yields outputs that are semantically heterogeneous. COD3S leads to more diverse outputs in a multi-target generation task in a controllable and interpretable manner, suggesting the potential of semantically guided diverse decoding for a variety of text generation tasks in the future.

A Random Hyperplane LSH Details
The popular LSH variant introduced by Charikar (2002) leverages random hyperplane projections to compute discrete b-length bit signatures. Each individual bit is determined from the sign of the dot product between a given embedding and one of a set of b pre-computed random normal vectors. One geometric intuition is that the hyperplane implied by each random normal vector partitions the full embedding space in half, and the sign of the dot product designates the partition into which the input embedding falls. This is illustrated in Figure 3 using a simplified case with a 2-D vector v and three random vectors r_1, r_2, r_3 indicating partitions of the Cartesian plane (figure adapted from slides of Van Durme and Lall (2010) with permission of the authors). The number of matching bits in the signatures of two vectors u, v provides an estimate of their hash collision probability, i.e. the likelihood that they fall in the same partition of any random hyperplane. This probability is provably monotonically increasing with the vectors' inner product (Charikar, 2002; Li et al., 2013). Goemans and Williamson (1995) similarly prove that the Hamming distance between signatures is proportional to the angle between the vectors, which correlates highly with cosine distance barring high discrepancies in vector norms.

B Training Details

We train models with FAIRSEQ using the transformer_iwslt_de_en architecture:

    fairseq-train \
        --arch transformer_iwslt_de_en \
        --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
        --optimizer adam --adam-betas "(0.9, 0.98)" --clip-norm 0.1 \
        --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
        --dropout 0.1 --weight-decay 0 \
        --bpe sentencepiece --max-epoch 10 --share-all-embeddings

We use 6 encoder and decoder layers with 512-dimensional hidden states and shared embedding layers (a total of 36.6M trainable parameters). Signature tokens are assigned special tokens during BPE encoding. We train models for 10 epochs with an early stopping patience of 2 validations. We use the Adam optimizer (Kingma and Ba, 2015) with 0.1-smoothed cross-entropy loss, a 5e-4 learning rate with inverse square root scheduling, 0.1 dropout, and 0.1 norm clipping. All other training parameters were the FAIRSEQ defaults at the time of submission. We observe performance drops when 1) the norm clipping threshold is not sufficiently low, 2) the BPE vocabulary size is 32K instead of 10K, and 3) weight decay is set to .001. Training takes roughly 12 hours on two 24GB Titan RTX GPUs for each of four models (two forward, two backward for MMI reranking).
Backward scoring models for MMI-bidi are trained with the opposite dataset as their corresponding forward models; we find training most effective when the data's syntactic direction ("X . . . Y") matches the direction of inference (X → Y). In other words, all C→E models are trained on "X, so Y" data regardless of their use as forward or backward scoring models. We used the "X because Y" training split from Li et al. (2020). We constructed the 10M "X so Y" examples ourselves: we took a 20M random sample of all such examples in the dataset, filtered to remove sentence pairs a) containing numerical or special characters or b) containing either a source or target with greater than 12 tokens, and then downsampled the remaining set to a 10M/4K/4K train/dev/test split. Following Li et al. (2020), the same 100 "X because Y" pairs were used to evaluate models of both inference directions.
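A minimal sketch of the filtering and splitting described above, under our own assumptions about tokenization (whitespace) and the character filter; the token-length and split-size thresholds follow the text:

```python
# Sketch of the "X, so Y" preprocessing under the stated assumptions.
import random
import re

def keep_pair(cause: str, effect: str) -> bool:
    for side in (cause, effect):
        # Drop sides with numerical or special characters (assumed filter).
        if re.search(r"[^a-zA-Z' .,]", side):
            return False
        # Drop sides longer than 12 (whitespace) tokens.
        if len(side.split()) > 12:
            return False
    return True

def build_split(pairs, train_n=10_000_000, dev_n=4_000, test_n=4_000, seed=0):
    filtered = [p for p in pairs if keep_pair(*p)]
    random.Random(seed).shuffle(filtered)
    need = train_n + dev_n + test_n
    sample = filtered[:need]
    return (sample[:train_n],
            sample[train_n:train_n + dev_n],
            sample[train_n + dev_n:need])
```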

C Decoding According to Semantic Bins
We experimented with bit lengths of 8, 16, and 32, and found the middle value to best balance specificity with accuracy. We also explored a variant that merged signatures into a single token rather than treating them as one token per bit, but found the model to perform qualitatively worse. We experimented with Hamming distance heuristic thresholds of 0 through 6 and found the best value (2) for 16-bit COD3S using qualitative analysis of side-by-side predictions. The MMI-bidi λ_s, λ_y values were found using simple grid search, comparison of automatic metrics, and side-by-side analysis. The nature of the output set is sensitive only to large changes (orders of magnitude) in λ_s values, as the likelihoods of signature sequences are rather close in value; however, smaller, 0.1-increment changes to the sentence weight λ_y had a greater effect on the relevance and specificity of output causes/effects. This comports with results from previous applications of MMI-bidi decoding for sentences (Li et al., 2016a).
Table 7 shows side-by-side outputs of models with and without MMI re-ranking conditioned on the same n-best inferred signatures. Table 4 shows results of automatic diversity evaluation on the in-distribution training sample from CausalBank following Li et al. (2020). Table 5 provides a tabular version of the human plausibility scores depicted in Figure 2.

D Counting Semantically Distinct Outputs using SBERT
We construct a method for automatically counting the number of semantically diverse sentences in a candidate cause/effect set. We encode each prediction with the context of the input by taking the SBERT embedding of the completed sentence "X {because, so} Y." We then rule out all sentences whose embedding cosine distance from that of a higher-ranked candidate is lower than some threshold. We use a simple grid search over various threshold values and find that a value of .1 yields a sensitivity to paraphrastic cause/effect predictions similar to that of a human reader. As other tasks might merit different such thresholds, we provide multiple such counts in Table 2.

[Table residue: three example effects (e.g., "the woman hired a lawyer"; gold cause: "she decided to sue her employer") with candidate causes listed per inferred semantic bin alongside each bin's representative medoid; the bin containing the gold answer is marked "(Gold bin)".]
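A minimal sketch of this duplicate counting, assuming candidates arrive ranked best-first and a sentence-transformers SBERT encoder (checkpoint name illustrative):

```python
# Sketch: a candidate is a duplicate if the embedding of its completed
# sentence ("X because/so Y") falls within a cosine-distance threshold
# (0.1 above) of a higher-ranked candidate.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("stsb-roberta-large")

def count_distinct(input_x, candidates, connective="because", threshold=0.1):
    completed = [f"{input_x} {connective} {y}" for y in candidates]
    embs = encoder.encode(completed)
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    kept = []
    for i, emb in enumerate(embs):  # candidates assumed ranked best-first
        if all(1.0 - float(emb @ embs[j]) >= threshold for j in kept):
            kept.append(i)
    return len(kept)
```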

Figure 1: Overview of the COD3S method. In training (a), the target side is prefixed with a discrete signature computed using locality-sensitive hashing (LSH) of the target's SBERT embedding. At inference (b), a beam search is conditioned on each of k decoded signatures.

Figure 2: Results of human evaluation of plausibility. Ratings are shown in comparison to the gold answer and the less plausible alternative from COPA. Mean/max ratings per input are presented for 1- and 3-best outputs ranked by forward score (PPL). To demonstrate that COD3S produces plausible responses from many semantic bins, we also show max ratings from top-10 outputs.

Figure 3: Computation of a 2-D vector v's LSH bit signature as the signs of the dot products with b random normal vectors r_1, . . ., r_b. Formally, given a set of high-dimensional vectors in R^D, we randomly sample b ≪ D random vectors r_1, . . ., r_b from the D-dimensional Gaussian distribution. Then, given a high-dimensional embedding v, we construct the b-bit signature LSH(v) = [LSH_1(v), . . ., LSH_b(v)] using the hash functions LSH_i(v) = 1 if ⟨r_i, v⟩ ≥ 0, and 0 otherwise.

Figure 4: Interface shown to Amazon Mechanical Turk workers during collection of plausibility judgments.

Table 3: Examples of generation conditioned on semantic bins. Predictions are ranked according to maximum mutual information (MMI) and shown alongside the given bin's representative medoid.

Table 5: Tabular form of human evaluation results displayed in Figure 2.
Table 6 shows example cases of duplicate detection among generated candidate sets.

Table 9: Analysis of bin clusters using the effects of 10 million CausalBank "X because Y" pairs.