On task effects in NLG corpus elicitation: a replication study using mixed effects modeling

Task effects in NLG corpus elicitation recently started to receive more attention, but are usually not modeled statistically. We present a controlled replication of the study by Van Miltenburg et al. (2018b), contrasting spoken with written descriptions. We collected additional written Dutch descriptions to supplement the spoken data from the DIDEC corpus, and analyzed the descriptions using mixed effects modeling to account for variation between participants and items. Our results show that the effects of modality largely disappear in a controlled setting.


Introduction
Natural Language Generation (NLG) systems are increasingly trained on the basis of datasets of human-produced examples, for example in the recent E2E-challenge (Dušek et al., 2018), or in automatic image description (Bernardi et al., 2016). The quality of the system output depends to a large extent on the quality of the data that is used to train the system, which in turn depends on the way that data is collected. A recent trend in NLG is to study task effects in the creation of corpora for natural language generation (Baltaretu and Castro Ferreira, 2016; van Miltenburg et al., 2017; Ilinykh et al., 2018). However, there does not seem to be an established methodology to investigate whether differences in task design lead to any significant differences in the output. This paper uses a tightly controlled approach to study task effects in NLG.
As a case study, we look at the effects of modality in an image description task. In their exploratory study, Van Miltenburg et al. (2018b) found that spoken and written descriptions differ in several ways, with the main result being that speakers have a greater tendency to show themselves through the use of 'egocentric language' (Akinnaso, 1982). The problem with this study is that it did not use matched corpora (containing exactly the same images) and their experiment did not control for the demographics of the participants. Therefore this paper presents a controlled replication of the study by Van Miltenburg et al. (2018b), to see if its findings are robust.
We carried out a between-subjects study where participants were assigned either to the SPOKEN or the WRITTEN condition. All participants were asked to describe the same images. For the former condition, we used the data from the Dutch Image Description and Eye-tracking Corpus (DIDEC; van Miltenburg et al. 2018a). For the latter condition, we collected additional data using a similar sample of participants. We analyzed the effects of modality on the elicited descriptions using mixed-effects models, controlling for variation in participants and images used to elicit the descriptions. We only found a significant effect for prepositions (used more in written descriptions); other effects disappear in a controlled setting.
This paper contributes to our understanding of the linguistic aspects of image descriptions (e.g., Ferraro et al. 2015; van Miltenburg et al. 2016; Alikhani and Stone 2019). Still, the main takeaway from our study is methodological: for studying task effects in elicitation tasks, we should control for individual variation and the effects of the stimuli used in the experiment. We hope that this study can serve as an example for the use of mixed effects modeling in natural language generation.
[...] (DIDEC; van Miltenburg et al. 2018a). This dataset contains 307 different images from the MS COCO dataset, with 14-16 spoken descriptions per image. The authors measured the following kinds of dependent variables:
LENGTH: Token length (in syllables or in characters), description length (in tokens). Both are measured after tokenizing the text.
PART-OF-SPEECH: (Attributive) adjectives, adverbs, prepositions. These are detected using a part-of-speech tagger (SpaCy 2.0.4).
SEMANTIC CATEGORIES: Negations (no, not), pseudo-quantifiers (few, lots), consciousness-of-projection terms (seem, appear, maybe), positive allness terms (all, every), and self-reference terms (I, me, my). These are detected by matching word tokens with a word list. Table 1 provides an overview.
OTHER: Propositional Idea Density (PID; Turner and Greene 1977), which corresponds to the average number of propositional ideas per word in a text, and is computed through an external tool (Marckx, 2017). Mean-segmental type-token ratio (MSTTR; Johnson 1944), which is a measure of diversity (the average number of types per segment).
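To make the word-list matching concrete, here is a minimal Python sketch of how the length and semantic-category variables could be computed for a single description. It is not the original study's script: the word lists contain only the English examples quoted above, and the tokenizer and the function names (`tokenize`, `category_counts`, `description_length`) are assumptions introduced for illustration.

```python
import re

# Illustrative word lists: only the examples quoted in the text.
# The complete lists used in the study are not reproduced here.
CATEGORIES = {
    "negation": {"no", "not"},
    "pseudo_quantifier": {"few", "lots"},
    "consciousness_of_projection": {"seem", "appear", "maybe"},
    "positive_allness": {"all", "every"},
    "self_reference": {"i", "me", "my"},
}


def tokenize(text):
    """Rough tokenizer (assumption): lowercase runs of letters/apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def description_length(description):
    """Description length in tokens."""
    return len(tokenize(description))


def category_counts(description):
    """Count how many tokens of a description fall in each word list."""
    tokens = tokenize(description)
    return {
        name: sum(1 for tok in tokens if tok in words)
        for name, words in CATEGORIES.items()
    }


example = "I see a few sheep, maybe not all of them"
counts = category_counts(example)
length = description_length(example)
```

Exact matching on word forms is the simplest option; a real implementation would also have to decide how to handle inflected forms such as "seems".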
Findings. Van Miltenburg et al. (2018b) found no consistent differences between spoken and written descriptions for token length, MSTTR, PID, or the use of adjectives or prepositions. The authors did find that spoken descriptions are longer, and contain more adverbs, negations, positive allness terms, self-reference terms, pseudo-quantifiers, and consciousness-of-projection terms. This led them to conclude that speakers have a greater tendency to show themselves through the use of 'egocentric language' (Akinnaso, 1982). What the authors mean by this is that spoken descriptions are not just neutral and detached, but that they also tend to communicate something about the observer who generated the description. For example, if a participant says that some entity X looks like or might be a sheep (i.e., describing the entity using consciousness-of-projection terms), then their description also signals their uncertainty about whether X is a sheep or not. Written descriptions typically avoid this kind of language (Akinnaso, 1982).
Limitations. The original study did not control for the content of the images, or for the demographics of the participants. Furthermore, it did not control for the setting: the DIDEC dataset was collected in a laboratory setting, whereas the written sample was collected through a crowdsourcing task. This makes it hard to determine whether the results were actually due to the difference in modality, and not due to any other difference. Hence we set out to provide a controlled replication.

The current study
The current study was set up to provide a more controlled comparison between spoken and written image descriptions. We collected written descriptions for the images from the Dutch Image Description and Eye-tracking Corpus, so that we could compare these written descriptions to the existing spoken data. We used a different sample of participants from the exact same population (the Tilburg University participant pool) to generate the descriptions, so that we could isolate the effect of modality on the generated descriptions.
Participants. Our participants were 48 Dutch students (33 women, 15 men, with a mean age of 21.6) who earned course credits for their participation. Our study followed standard ethical procedures. We obtained IRB approval for this study, and all participants were asked for their informed consent. Participants were allowed to quit the experiment at any stage and still earn credits.
Materials. We used the same 307 images (originally from MS COCO) that were used for the creation of the DIDEC dataset. In the original task, participants provided spoken descriptions for 102 or 103 images in one session. However, written language is typically slower to produce than spoken language; data from Van Miltenburg et al. (2017) shows that the median time for crowdworkers to write 5 descriptions is 294 seconds. Extrapolating from this, we expect that it would take 49 minutes to write descriptions for 50 images. To ensure that participants are able to finish the experiment within one hour (and to avoid fatigue), we shortened the lists to 51 or 52 images.
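The extrapolation above is simple arithmetic, spelled out in the sketch below. It assumes that writing time scales linearly with the number of images, which is the assumption implicit in the text; the function name `estimated_minutes` is introduced for illustration.

```python
# Median time for crowdworkers to write 5 descriptions (from the text).
MEDIAN_SECONDS_FOR_FIVE = 294
SECONDS_PER_DESCRIPTION = MEDIAN_SECONDS_FOR_FIVE / 5  # 58.8 seconds


def estimated_minutes(n_images):
    """Linear extrapolation of writing time to n_images descriptions."""
    return n_images * SECONDS_PER_DESCRIPTION / 60


fifty_images = estimated_minutes(50)   # about 49 minutes
full_list = estimated_minutes(103)     # about 101 minutes for an original-length list
short_list = estimated_minutes(52)     # about 51 minutes for a shortened list
```

Under the same assumption, a full list of 103 images would take well over an hour and a half to describe in writing, which is why halving the lists keeps the session within one hour.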
Design. We used a single-factor (modality) between-subjects design, where the participants who took part in the DIDEC study serve as the SPOKEN group, and we collect additional responses for the WRITTEN condition. Because both sets of participants are sampled from the same population, we can compare their descriptions for the same images to examine the effect of modality. However, we do note that a within-subjects design would have more statistical power, since we would also have information about the effects of modality for each participant. Our choice for a between-subjects design was motivated by economic reasons: it would have been very time-consuming to build a new corpus of spoken image descriptions.
Procedure. The elicitation task is similar to the one carried out by Van Miltenburg et al. (2018a) for the DIDEC dataset. We implemented the task using Qualtrics, so as to have a simple web interface. The participants sat in a computer room with 20 computers. They were not allowed to communicate with each other. After reading the instructions and signing the consent form, participants first carried out a practice trial, after which they could ask clarification questions. For the main task, participants were presented with a list of images, and asked to describe each of the images in one short but complete sentence.
Dependent variables. Our dependent variables are almost the same as in the original study; we ignore MSTTR for reasons of space. We modified the original (public) scripts to prepare the results for our analysis. Whereas the original study reported average results over the aggregated data (per 1000 tokens or per description), we measure the variables for each individual description.
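The difference between aggregated and per-description measurement can be illustrated with a short Python sketch. The input format (lists of `(token, tag)` pairs), the tag name `ADP` for prepositions, and both function names are assumptions for illustration; the study's own scripts are not reproduced here.

```python
def per_description_counts(tagged_descriptions, tag):
    """One count per description: the unit of analysis in the current study."""
    return [sum(1 for _, t in desc if t == tag) for desc in tagged_descriptions]


def rate_per_1000_tokens(tagged_descriptions, tag):
    """A single aggregate figure, as reported in the original study."""
    total_tokens = sum(len(desc) for desc in tagged_descriptions)
    total_hits = sum(per_description_counts(tagged_descriptions, tag))
    return 1000 * total_hits / total_tokens if total_tokens else 0.0


# Hypothetical tagged descriptions, invented for this example.
descriptions = [
    [("a", "DET"), ("cat", "NOUN"), ("on", "ADP"), ("a", "DET"), ("mat", "NOUN")],
    [("two", "NUM"), ("dogs", "NOUN"), ("run", "VERB")],
]

counts = per_description_counts(descriptions, "ADP")   # one value per description
aggregate = rate_per_1000_tokens(descriptions, "ADP")  # one value for the whole set
```

The aggregate collapses everything into one number per corpus, whereas the per-description counts keep one observation per participant and image, which is what a model with participant and image effects needs.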

Statistical analysis
In addition to the effects of modality (SPOKEN and WRITTEN), our observations (the individual descriptions) are influenced by two other factors, namely PARTICIPANT and IMAGE. To capture the random effects of both participants and images, we use a linear mixed effect model (Baayen et al. 2008; see Winter 2013 for a tutorial). We used the lme4 package (Bates et al., 2015) to build our models in R (R Core Team, 2017) and the lmerTest package (Kuznetsova et al., 2017) to provide p-values for linear mixed effect models. We created a separate model for each dependent variable and assessed the effect of modality for significance (see footnote 5). When the effect is significant, the null hypothesis of no difference between the means of the written and spoken condition is rejected (implying there is a task effect). For each model, we specify the relevant type of distribution. We model sentence length, token length, and propositional idea density as continuous data, and assume a standard Gaussian distribution. The other variables correspond to count data, modeled using the Poisson distribution (through the glmer function).
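As a sketch of what such models look like, the equations below write out one plausible specification, with modality as the fixed effect and participants and images as crossed random effects. The random-intercepts-only structure is an assumption, since the text does not say whether random slopes were included. In lme4 formula notation this would correspond to something like `y ~ modality + (1 | participant) + (1 | image)`, where the column names are hypothetical. Here $p$ indexes participants, $i$ indexes images, and $\mathrm{written}_p$ is 1 for participants in the WRITTEN condition and 0 otherwise.

```latex
% Continuous outcomes (sentence length, token length, PID): Gaussian model
y_{pi} = \beta_0 + \beta_1\,\mathrm{written}_{p} + u_p + v_i + \varepsilon_{pi},
\qquad u_p \sim \mathcal{N}(0, \sigma_u^2),\quad
v_i \sim \mathcal{N}(0, \sigma_v^2),\quad
\varepsilon_{pi} \sim \mathcal{N}(0, \sigma^2)

% Count outcomes: Poisson model with a log link
y_{pi} \mid u_p, v_i \sim \mathrm{Poisson}(\lambda_{pi}),
\qquad \log \lambda_{pi} = \beta_0 + \beta_1\,\mathrm{written}_{p} + u_p + v_i
```

In both cases $\beta_1$ is the effect of the written modality. In the Gaussian model it is a difference on the scale of the outcome; in the Poisson model it is a difference on the log scale, so $\exp(\beta_1)$ is a multiplicative change in the expected count.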

Results
We collected 2457 descriptions from 48 participants. Table 2 provides general statistics about the spoken and written descriptions. Descriptive statistics are provided in Table 3. Compared to the spoken descriptions, written descriptions are longer, have longer tokens, and (with the exception of consciousness-of-projection terms) contain more terms from each category. The direction of these differences is surprising, because they are opposite to our expectations (again with the exception of consciousness-of-projection terms). For example, we expected spoken descriptions to be longer than written ones, indicated as 's>w' in Table 3. To assess whether these observed differences generalize outside of this particular dataset, we assessed their statistical significance using the linear mixed effect models described earlier.
Table 3: All models with their dependent variables, whether we expect a difference (less/greater than: difference, equals: no difference), the mean results for the spoken descriptions, difference between written and spoken descriptions (w-s), the data type used in our analysis, the fixed effect (β) of the written modality on the outcome, the standard error, statistic, and the p-value for the model.
Footnote (beginning missing): [...] scores (one per participant) and see whether there is a significant difference in the scores between the two conditions.
Footnote 5: We use the traditional significance level of α = 0.05, and correct for multiple comparisons using the Bonferroni method: k=12 models, α=0.00427.
Model convergence. Initially, the models for token length (syllables), self-reference, and allness terms failed to converge (i.e., find stable estimates of the effects). We addressed this issue in two ways: 1. For the token length model, we used a different optimizer (bobyqa); 2. For self-reference and allness terms, we modeled the presence or absence of the relevant terms with a binomial distribution. After this, only the model for positive allness failed to converge, likely because only 30 out of 7,056 descriptions contained positive allness terms, which is too few positive examples.
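The second fix amounts to recoding a count variable as a binary indicator before fitting a binomial model. The Python sketch below shows only that recoding step and the resulting prevalence; the helper names `to_presence` and `prevalence` are assumptions, and the model fitting itself (done in R in the study) is not shown.

```python
def to_presence(counts):
    """Recode per-description counts as presence (1) or absence (0)."""
    return [1 if c > 0 else 0 for c in counts]


def prevalence(indicators):
    """Proportion of descriptions in which the term category is present."""
    return sum(indicators) / len(indicators)


# Hypothetical counts for four descriptions, invented for this example.
indicators = to_presence([0, 2, 1, 0])

# Figures from the text: 30 of 7,056 descriptions contain positive allness terms.
allness_prevalence = prevalence([1] * 30 + [0] * (7056 - 30))
```

With well under one percent positive cases, the outcome is almost constant, which is consistent with the convergence problem reported for positive allness terms.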
Main results. The last four columns of Table 3 show the effect of modality on the dependent variables (full models are in the supplementary materials). We only found a significant effect of modality on the use of prepositions: written descriptions use more prepositions than spoken ones.
We found no significant effect of modality on any of the other dependent variables. (Note that this is partly due to the Bonferroni correction we applied earlier. If we had not corrected for multiple comparisons, we would have judged the models for sentence length, pseudo-quantifiers, consciousness-of-projection terms, and self-reference terms to be significant at α = 0.05.) This means that while those models may capture general tendencies in the data, there are no consistent differences between spoken and written language for these variables.
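The corrected threshold can be checked with a few lines of Python. The sketch takes k = 12 models and α = 0.05 from the paper and computes the Bonferroni threshold α/k. The Šidák threshold is added here purely for comparison and is not something the text says was used. Plain α/k gives about 0.00417, whereas the Šidák formula gives about 0.00427, the value reported in the paper; which computation produced the reported value is not stated.

```python
ALPHA = 0.05   # family-wise significance level (from the text)
K_MODELS = 12  # number of models tested (from the text)


def bonferroni_threshold(alpha, k):
    """Per-test threshold under the Bonferroni correction: alpha / k."""
    return alpha / k


def sidak_threshold(alpha, k):
    """Per-test threshold under the Sidak correction: 1 - (1 - alpha)^(1/k)."""
    return 1 - (1 - alpha) ** (1 / k)


def is_significant(p_value, threshold):
    """A test counts as significant when its p-value is below the threshold."""
    return p_value < threshold


bonferroni = bonferroni_threshold(ALPHA, K_MODELS)
sidak = sidak_threshold(ALPHA, K_MODELS)
```

Either way, a p-value between the corrected threshold and 0.05 would pass an uncorrected test but not the corrected one, which is the situation described for sentence length, pseudo-quantifiers, consciousness-of-projection terms, and self-reference terms.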
Model interpretation. Although most of our analyses do not show significant differences, we can still interpret the way they capture the overall distribution of the data. The strongest nonsignificant effect is observed for sentence length; on average, written descriptions are β=2.6 words longer than spoken ones.

Discussion
We will now briefly summarize and explain our results, before discussing their implications.

Summary of the results
We aimed to replicate the findings by van Miltenburg et al. (2018b), who looked at modality effects in the elicitation of NLG corpus data. Like the original authors, we found no significant difference for token length, PID, or the use of adjectives. While Van Miltenburg et al. did not find a consistent difference in the use of prepositions across English and Dutch, we replicate their finding that written Dutch descriptions contain more prepositions than spoken ones. This is in line with earlier findings by Drieman (1962) and Chafe and Danielewicz (1987).
All other effects disappear in a controlled setting. This is not to say that there is no effect of modality, but that the effect is smaller than can be detected with these variables and this elicitation method. We are unsure how the original effects emerged, but a likely explanation lies in the differences between the datasets used in the original study (which contain different images, and were collected in a different setting, with less comparable participants). This shows the importance of setting up a controlled study, where such differences are minimized, and we can isolate the factor that we are interested in (here: modality).

Rarity and the need for guidelines
One other factor contributing to the difficulty of finding statistically significant effects for modality is that many of the phenomena under investigation are infrequent. Positive allness terms are the most extreme case, occurring in 0.4% of all spoken descriptions. But attributive adjectives, negations, pseudo-quantifiers, consciousness-of-projection terms, and self-reference terms also occur in less than half of the spoken descriptions.
It appears that only changing the modality is not enough to observe a (strong) task effect. If we want participants to produce different kinds of descriptions, they will probably need guidelines, with explicit instructions to change their behavior. But this raises the next issue: what should those guidelines look like?

Usefulness of different modalities
One of the reasons cited by van Miltenburg et al. (2018b) to look at spoken image descriptions is that they might provide more natural examples of how people generally talk about images. After all, speech is a more primary form of language (cf. Biber 1988, Chapter 1). Their naturalness would make spoken descriptions more suitable for training voice-operated image description systems.
Our results show that changing the modality of the elicitation task does not necessarily yield qualitatively different descriptions, let alone more natural descriptions. Importantly, our study does not say anything about the usefulness of typical features of spoken language. User studies may still find that emulating the spoken style (as presented in the literature) positively or negatively affects users' appreciation of the output. After establishing desirable properties that image descriptions should have, we can define guidelines for what image descriptions should look like. We may then be able to alter the elicitation task in such a way that participants provide suitable descriptions. Here, the question arises: how do you know whether the elicitation task is successful? This brings us to our next point.

Statistics as a manipulation check
Our failure to replicate the effects of modality for all variables (except the use of prepositions) suggests that (at least for those other variables) it does not matter in which modality you collect the descriptions; they will look more or less the same. In other words, our study served as a manipulation check, to see if the manipulation (changing the modality of the elicitation task) had the desired effect (changing the style of the descriptions). In this case, the manipulation turned out to be unsuccessful. We hope that our study provides a good example for showing (or refuting) the robustness of different task effects in NLG. Note that, for a check like this to be possible, one needs to establish a metric or set of metrics that can be used to quantify the phenomenon of interest.

Conclusion
We presented a controlled study to evaluate task effects in an NLG elicitation task, namely image description. We used mixed effects models to filter out the effects of participants and individual stimuli. Using these models, we learned that modality alone has a minimal effect on the content of the descriptions. Thus, a stronger manipulation is needed to obtain different kinds of descriptions. The methodology used in this paper is suitable for running pilot studies to check whether task manipulations are successful. We hope that future studies will adopt this methodology, so as to ensure fruitful data collection.