Meeting the 2020 Duolingo Challenge on a Shoestring

What follows is a brief description of two systems, called gFCONV and c-VAE, which we built in response to the 2020 Duolingo Challenge. Both are neural models that perturb the sentence representation the encoder generates, with an eye on increasing the diversity of the sentences that emerge from the process. Importantly, we decided not to turn to external sources for extra ammunition, curious to know how far we could go while confining ourselves to the data released by Duolingo. gFCONV works by taking over a pre-trained sequence model and intercepting the output its encoder produces on its way to the decoder. c-VAE is a conditional variational auto-encoder that seeks diversity by blurring the representation the encoder derives. Experiments on a corpus constructed out of the public dataset from Duolingo, containing some 4 million pairs of sentences, found gFCONV to be a consistent winner over c-VAE, though both suffered heavily from low recall.


Introduction
A major driver for our participation in the challenge was the curiosity to see whether recent approaches to sentence encoding with the variational auto-encoder (VAE) have any relevance to the generation of diverse sentences. (Bowman et al., 2016) were the first to explore the use of VAE in language generation. The work demonstrated that VAE provides a continuous code space for sentences, where any randomly picked data point in the space can be decoded to yield a coherent sentence, which is significant given that conventional RNNs do not provide such a capability. The problem with VAE, however, is that it has no mechanism to ensure that the meaning of the source sentence is passed over to the output, which often causes a sentence to be altered, or deformed beyond recognition. While VAE is a popular approach people turn to as a way to diversify the sentences a model generates, no definitive answer has been found on how to control or tame what it spews out. A typical solution is to fuse a VAE code with the output of a regular sentence encoder, in order to encourage the decoder to output a sentence that retains some semantic features present in the source sentence (Gupta et al., 2017). Also noteworthy is a recent work by (Guu et al., 2018), who, building on an idea similar to VAE, model the distribution of cosine similarities between word vectors for the input and target. (Li et al., 2015) is something of an oddball in the pursuit of diversity in sentence generation. The authors argued that we could achieve diversity by discouraging the decoder from selecting candidates that are similar to the input. A clear advantage they have over others is that their scheme does not involve any learning and is straightforward to implement.
The idea that one can view a latent representation as a sample drawn from some probabilistic distribution inspired people to explore its potential in a wide range of tasks and domains. (Miao et al., 2015), while working on document modeling, suggested using VAE as a way to get a compact representation of a document. (Fang et al., 2019) argued for using a sample-based distribution rather than a Gaussian distribution for the latent code, to better express the holistic properties of the source sentence.
In this work, we focus on two approaches, both based on VAE: one that attempts to achieve diversity by generalizing the sentence representation produced by the encoder, and another that randomly perturbs the encoder's output during sentence generation. We report here their respective performance on a test corpus we carved out of the official training data. For the final submission, we went along with the latter approach.

Translation as Paraphrase
Our effort revolved around two questions: (1) how best to incorporate the likelihood scores of target translations that were provided as part of the training data, and (2) how to avoid relying on any external resource while building a solution. We wanted to know how far we could go using only the data made available to us at the competition, and nothing more. Our answer to the first question takes advantage of the fact that the set of translations associated with each English prompt can be considered an equivalence class, in the sense that if we take any pair from the set, we can substitute one for the other without significantly affecting the meaning. We may take the likelihood that a human picks a particular sentence (call it X) as a good translation of some prompt (P) as the probability of its being a paraphrase of some other sentence (say Y) from the group of possible translations of which X is part. The intuition here is that if X is more typical as a translation of P, it is more likely to serve as a paraphrase of whatever other way we may have to express P in the target language. Following this idea, we created training data by randomly sampling pairs of sentences (both in the same language) that appear as alternate translations of a given prompt, in accordance with their popularity ratings. For each prompt, we sampled 2,000 pairs of translations (which may include pairs consisting of identical sentences), resulting in 4,601,000 training instances (covering 2,300 prompts, plus those provided in the development and test sets) (Mayhew et al., 2020). 1 We set aside 100 prompts as a private development set and another 100 for testing. We included in each training instance the English prompt as well as its translation, in order to prevent the paraphrases the algorithm generates from diverging from the meaning of the prompt (Table 1).

Figure 1 shows a schematic picture of how our approach works. We feed into the system a prompt and its translation, which we assume to be given (via AWS, for example); out come its paraphrases (or translations in varied styles). The model we built is essentially Fairseq's convolution-to-sequence architecture of the type called 'fconv_iwslt_de_en' (call it FCONV), which features 4 convolutional layers for the encoder and 3 for the decoder. 2 The embedding dimension for the input and output tokens was set to 256. We did not use pre-trained embeddings for either of the languages we dealt with, nor did we make any architectural change to FCONV; we simply trained it as given. The departure comes in the testing phase. Following (Guu et al., 2018), we applied Gaussian noise to the output of the encoder as it was sent to the decoder (Fig. 2).

Figure 2: gFCONV

1 For this year's challenge, we worked only on the English-Japanese track. We included both the test and development sets as part of the training data, as a way to prevent the algorithm from stumbling upon unknown tokens in the test set. We do not see this as much of a problem because each prompt in the development and test sets carries no more than one translation, i.e. a training pair we get from the development and test sets has the same sentence for both source and target. We made use of MeCab for tokenizing sentences in Japanese.
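The pair-sampling procedure can be sketched as below. This is a minimal illustration: `translations` and `weights` are our own stand-ins for a prompt's list of accepted translations and their likelihood ratings, not names from the released data format.

```python
import random

def sample_training_pairs(translations, weights, n_pairs=2000, seed=0):
    """Sample (source, target) paraphrase pairs from the translations of
    a single prompt, weighted by their human-likelihood ratings.  Pairs
    of identical sentences are allowed, mirroring the setup described in
    the text."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        # Each side is drawn independently, so more likely translations
        # show up in proportionally more training pairs.
        src = rng.choices(translations, weights=weights, k=1)[0]
        tgt = rng.choices(translations, weights=weights, k=1)[0]
        pairs.append((src, tgt))
    return pairs
```

Sampling both sides from the same rating distribution is what lets the popularity of a translation double as the probability of its serving as a paraphrase target.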
u = E(x) + ε,   ε ∼ N(0, k)    (1)

where x is an input and E(x) is the output of applying the encoder E to x; u denotes the input to the decoder, and k is the variance of the Gaussian noise ε. A larger noise means a greater disruption in the latent representation coming from the encoder, which we hoped would lead to an increase in the diversity of the sentences being generated. We randomly sampled the noise from a normal distribution with the mean set to 0 and the variance ranging from 0 to 0.6. In what follows, we refer to this scheme as gFCONV.

Figure 3: Conditional VAE
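The gFCONV perturbation of Eqn. 1 amounts to a one-line change at inference time. Below is a minimal NumPy sketch of the idea, not the actual Fairseq hook we used; the function name is ours.

```python
import numpy as np

def perturb_encoder_output(enc_out, k, rng=None):
    """gFCONV-style perturbation: u = E(x) + eps, with eps ~ N(0, k).
    `enc_out` is the encoder output (an array of any shape) and `k` is
    the variance of the Gaussian noise.  Setting k = 0 recovers the
    vanilla FCONV behaviour."""
    rng = rng or np.random.default_rng(0)
    if k == 0.0:
        return enc_out
    # np.random samples take a standard deviation, hence sqrt(k).
    noise = rng.normal(loc=0.0, scale=np.sqrt(k), size=enc_out.shape)
    return enc_out + noise
```

Because the noise is injected only at test time, the same trained checkpoint can be decoded repeatedly under different k to trade faithfulness against diversity.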
We also looked at a conditional variational auto-encoder (c-VAE), a close cousin of gFCONV, for the sake of comparison. While both aim at building a latent representation that embraces the notion of uncertainty, c-VAE differs from the variance-based approach in that it seeks a probabilistic distribution that defines the range of representations the encoder churns out. In terms of formulae, this comes to the following (see also Fig. 3 for a visual intuition).
u = E(x) + r · z    (2)

Here z = µ + ϵ · υ, with ϵ ∼ Unif[0, 1). µ and υ are a mean and a variance, defined as µ = g(x) and υ = f(x), respectively, where x is an input and g and f are arbitrary functions over x. E(x) again denotes the output of the encoder. µ and υ are learnable parameters, which means they need to be trained before the scheme works; it is worth noting that gFCONV, by contrast, has no extra learnable parameters. r is a hyper-parameter to be set manually, which determines the degree of contribution of z to the latent representation of x: we combine E(x) with the sampled representation z to build the final encoder output. Our decision to condition VAE on E(x) is motivated by a frequent observation in the past literature that VAE is poor at preserving the meaning of the source sentence, often transforming it beyond recognition. Conditioning VAE on the input is a popular trick to discourage the algorithm from straying too far from the source.
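As a rough sketch, the c-VAE latent can be assembled as below. We sample ϵ from Unif[0, 1) as stated above, and assume an additive blend of E(x) and r·z for the final encoder output; the function and argument names are ours.

```python
import numpy as np

def cvae_latent(enc_out, mu, upsilon, r, rng=None):
    """Conditional-VAE style latent code.  `mu` and `upsilon` stand for
    the learned mean g(x) and variance f(x); `enc_out` is E(x); `r`
    scales the contribution of the sampled code z to the final
    representation handed to the decoder."""
    rng = rng or np.random.default_rng(0)
    # Reparameterization: z = mu + eps * upsilon, eps ~ Unif[0, 1).
    eps = rng.uniform(0.0, 1.0, size=np.shape(mu))
    z = mu + eps * upsilon
    # Condition on the encoder output so the meaning of the source
    # sentence is not lost in the sampling.
    return enc_out + r * z
```

With r = 0 the decoder sees the plain encoder output, so r plays a role analogous to k in gFCONV: it dials in how much uncertainty the decoder is exposed to.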
Implementation-wise, c-VAE was based on FCONV, from which we also built gFCONV. We kept all the hyper-parameter settings intact, e.g. the number of layers, and the size and number of filters. We did not apply any scheduled annealing to the weight of the KL term in the loss function.
For gFCONV, we varied the variance parameter k (Eqn. 1) from 0.00 to 0.60 in increments of 0.05. For each value of k, we ran gFCONV on the test set 100 times, letting the model output 80 alternative translations for each prompt (setting k to 0 reduces gFCONV to a vanilla FCONV). This resulted in a pool of 8,000 candidates for a given prompt under a particular value of k, out of which we retained only those that had a non-zero similarity to the gold translations by AWS. 3 We measured the similarity using LASER, 4 along with pre-trained word embeddings from FastText, 5 which LASER requires. We were interested in how the variance affected the performance, in particular how it contributed to improving the diversity.
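The filtering step can be sketched as follows. Here `embed` is a stand-in for the LASER sentence-embedding call (the actual LASER API differs); it is assumed to map a list of sentences to a 2-D array of embeddings.

```python
import numpy as np

def filter_candidates(candidates, golds, embed, threshold=0.0):
    """Keep candidate translations whose best cosine similarity to any
    gold translation exceeds `threshold` (the text retains candidates
    with non-zero similarity).  `embed` maps a list of sentences to a
    2-D array of sentence embeddings, one row per sentence."""
    if not candidates:
        return []
    C = embed(candidates)
    G = embed(golds)
    # Normalise rows so that dot products become cosine similarities.
    C = C / np.linalg.norm(C, axis=1, keepdims=True)
    G = G / np.linalg.norm(G, axis=1, keepdims=True)
    best = (C @ G.T).max(axis=1)  # best gold match per candidate
    return [c for c, s in zip(candidates, best) if s > threshold]
```

Taking the maximum over gold translations means a candidate survives as long as it resembles at least one acceptable translation, which suits a task where each prompt has many valid targets.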

Results and Discussion
Results are provided in Table 2; the numbers shown were produced using the official scorer. In the following discussion, we concentrate on unweighted scores, as our interest here is in how much we improved the raw recall under the current setup; note that weighted scores do not shed light on the true diversity of the sentences we garnered. Looking at Table 2, we see gFCONV gaining on the vanilla FCONV, whose performance is represented by the numbers at k = 0.00. At k = 0.25, the raw recall jumps from 3.83 to 9.57, Micro F1 from 6.97 to 12.22, and Macro F1 from 11.37 to 13.18. Compare the difference between Micro and Macro F1 at k = 0.00 with that at k = 0.25: the difference for the latter is much smaller, which suggests that under gFCONV the performance is more stable across test items than under the vanilla FCONV. A large divergence, as at k = 0.00, indicates wild ups and downs in performance, with the model doing beautifully on some prompts while failing miserably on others. In contrast to Micro F1, Macro F1 is blind to how many candidate translations there are for each prompt, and so may not give an accurate picture of how the model is doing on each prompt.
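To make the distinction between the two averages concrete, here is a minimal sketch of unweighted micro- and macro-averaged F1 over per-prompt translation sets; the official scorer differs in its details.

```python
def f1(tp, n_pred, n_gold):
    """Harmonic mean of precision and recall; 0 when undefined."""
    if tp == 0 or n_pred == 0 or n_gold == 0:
        return 0.0
    p, r = tp / n_pred, tp / n_gold
    return 2 * p * r / (p + r)

def micro_macro_f1(gold, pred):
    """gold, pred: dicts mapping each prompt to a set of translations.
    Micro-F1 pools counts across prompts, so prompts with many gold
    translations weigh more; macro-F1 averages per-prompt F1 scores,
    treating every prompt equally."""
    tp = sum(len(gold[p] & pred.get(p, set())) for p in gold)
    n_pred = sum(len(pred.get(p, set())) for p in gold)
    n_gold = sum(len(gold[p]) for p in gold)
    micro = f1(tp, n_pred, n_gold)
    macro = sum(
        f1(len(gold[p] & pred.get(p, set())),
           len(pred.get(p, set())), len(gold[p]))
        for p in gold
    ) / len(gold)
    return micro, macro
```

A model that does well on easy prompts and returns nothing on hard ones drags the macro average down while leaving the micro average comparatively intact, which is why a wide gap between the two signals fluctuating per-item performance.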
As with gFCONV, we ran c-VAE on the test set 100 times, obtaining 100 distinct pools of candidate translations for each prompt. 6 We report in Table 3 figures that represent performance on all the results combined, in the manner we described for gFCONV. We varied r (in Eqn. 2) from 0.1 to 0.5 in 0.1 increments. We observe that c-VAE is somewhat behind gFCONV (in terms of the divergence between Micro and Macro F1), though performing well over the baseline (numbers in red). The large gap between (unweighted) Micro and Macro F1 again shows that the model suffers from a fluctuating performance, swinging wildly from one test item to another. The final submission for the official evaluation was prepared using gFCONV at k = 0.10, under the pseudonym 'darkside,' with the official results shown in Table 4. 7

Conclusions
We discussed two approaches to the Duolingo Challenge. One is gFCONV, which takes over a pre-trained sequence model and intercepts and perturbs the output its encoder produces on its way to the decoder. The other is c-VAE, a conditional variational auto-encoder, which seeks diversity by blurring the representation the encoder derives. Both approaches, we found, outperformed the vanilla FCONV. We also noted a large discrepancy between Micro and Macro F1, suggesting that the models' performance is uneven, fluctuating wildly from item to item. Moreover, there were some test prompts for which the models were not able to find any translations. We recognize that this is an area we need to scrutinize to further improve the performance. In the long run, it would be interesting to see whether we can bring to the task recent developments in VAE such as (Bouchacourt et al., 2018).