Longitudinal Studies of Variation Sets in Child-directed Speech

One of the characteristics of child-directed speech is its high degree of repetitiousness. Sequences of repetitious utterances with a constant intention, variation sets, have been shown to be corre ...


Background and motivation
Child-directed speech has many characteristics that set it apart from adult-directed language, such as shorter utterances, lower speech rate, fewer disfluencies, lower syntactic complexity, greater modulation of F0 and high repetitiousness (Broen, 1972). Here is an example of the latter property from our data: You can put the animals there. You can take the pig and the cat and put them there. Can you put them there? Good. Can you put the pig there too?
Sequences of such (partial) self-repetitions with a constant intention have been called variation sets, and have been shown to account for a large proportion of the language that children hear (Küntay and Slobin, 1996; Clark, 2009, p. 37). Why does this phenomenon occur? To some extent, repetitiousness may serve simply to capture and maintain the child's attention, but our intuitions tell us that it is likely to also facilitate language learning for infants. For example, it may allow for effective segmentation of phonetic material (Bard and Anderson, 1983), and it has been shown to be a predictor of syntax growth (Hoff-Ginsberg, 1986; Hoff-Ginsberg, 1990; Waterfall, 2006). In a similar vein, investigating social and attentional cues in word learning, Frank et al. (2012) point out that the temporal proximity and continuity of repetitious language create supportive contexts where partial understanding of individual utterances can lead to fuller understanding. But variation sets have also been shown to benefit artificial language learning. In an experiment on this, Onnis et al. (2008) showed that adults exposed to input with variation sets performed better in phrase segmentation and phrase-boundary judgement tasks than a control group who heard the same input in scrambled order without variation sets. They note that "[f]rom a computational standpoint, the key characteristic of variation sets is that local mechanisms of alignment and comparison allow even memory-limited learners to discover structure that they would otherwise miss" (Onnis et al., 2008, p. 424).

Related work
Early studies of child-directed language dealing with partial and exact repetition include Broen (1972), Snow (1972), Kaye (1980) and Hoff-Ginsberg (1986; 1990). For example, Broen (ibid., pp. 29, 43) tracked "clusters of sequential sentences" where "the meaning remains constant". Snow (ibid.) found more partial and exact repetitions to 2-year-olds than to 10-year-olds. Küntay and Slobin (1996) introduced the term "variation set", by which they meant a contiguous sequence of repetitions with varying form but constant intention. They pointed out that the core of a variation set (and the main vehicle for expressing the intention) is almost always a verb, with optionally expressed arguments. (In the above example from the MINGLE-3 corpus, this verb would be "put".) The possible variations were taken to be "(1) lexical substitution and rephrasing, (2) addition and deletion of specific referential terms, and (3) reordering of constituents" (Küntay and Slobin, 2002, p. 6). Their definition did not include exact repetitions, however. Furthermore, it appears that in order for a new utterance to be considered a member of an existing variation set, the new utterance has to satisfy the above conditions for all of the previous utterances taken to be in the set.
Küntay and Slobin's study was based on transcripts of everyday interaction between a Turkish-speaking mother and her child over a seven-month period, during which the child was between 1;8 and 2;3 years. The finding was that 21% of the utterances occurred within variation sets, and that these sets were positively associated with children's acquisition of specific verbs. A follow-up study of transcripts of another Turkish-speaking mother and a child (at age 1;3 and 2;0 years) showed how the communicative functions of the variation sets changed as a function of age (Küntay and Slobin, 2002). Waterfall (2006) provided the first longitudinal study of variation sets in English, based on 12 mother-child dyads with children between 1;2 and 2;6 years. Waterfall's (2006, p. 21) definition of variation set is somewhat different from Küntay and Slobin's, though it is not clear what effect that has in practice. Basically, she defines a variation set as a sequence of utterances that belong to the same conversational turn, relate to the same event or situation, "have similar or related meanings", and share at least one noun or verb. Again, it appears that these conditions should hold between all utterances within the set, and like Küntay and Slobin, she did not include exact repetitions. Also, she allowed up to four non-related intervening utterances in a variation set. Waterfall found that children's production of nominal and verbal structures was correlated with peaks in the parents' use of that structure in variation sets. She also found a moderate decrease in the proportion of utterances that are part of variation sets as a function of age, from 17% at 1;2 years to 12% at 2;6 years.
Attempts at automatic extraction of variation sets naturally focus on form rather than function. Brodsky et al. (2007) suggest a simple definition of a variation set as a sequence of utterances where each successive pair of utterances has a lexical overlap of at least one element, excluding words on a stoplist (which includes high-frequency words). Variation sets are thus extracted by comparing pairs of successive utterances for repeated words, resulting in sets with at least one non-stoplisted word in common. Using an automated procedure of this kind, Brodsky et al. obtain a proportion of 21.5% of the words in Waterfall's (2006) corpus occurring in variation sets, and 18.3% of the words in the English CHILDES collection (MacWhinney, 2000). Similar studies have been performed by Onnis et al. (2008) and Waterfall et al. (2010). For example, when Onnis et al. used an automated procedure based on Waterfall's (2006) criteria on the Lara corpus from CHILDES (involving one child between 1;9 and 3;3 years), they obtained a proportion of 27.9% of the utterances being inside variation sets.
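The pairwise lexical-overlap criterion attributed to Brodsky et al. (2007) above can be illustrated with a minimal Python sketch. The stoplist, the tokenisation and the function names are assumptions made for illustration only; the text does not specify them, and the sketch does not handle intervening utterances.

```python
from typing import List, Set

# Hypothetical stoplist of high-frequency words (assumption, not from the paper).
STOPLIST: Set[str] = {"the", "a", "you", "can", "and", "it", "there", "them", "too"}


def content_words(utterance: str) -> Set[str]:
    """Lower-case the tokens, strip basic punctuation and drop stoplisted words."""
    tokens = (w.strip(".,!?").lower() for w in utterance.split())
    return {w for w in tokens if w and w not in STOPLIST}


def overlap_sets(utterances: List[str]) -> List[List[int]]:
    """Group indices of successive utterances sharing >= 1 non-stoplisted word."""
    sets: List[List[int]] = []
    current: List[int] = []
    for i in range(len(utterances)):
        prev_shares_word = bool(
            current
            and content_words(utterances[current[-1]]) & content_words(utterances[i])
        )
        if prev_shares_word:
            current.append(i)
        else:
            if len(current) > 1:  # a set needs at least two utterances
                sets.append(current)
            current = [i]
    if len(current) > 1:
        sets.append(current)
    return sets


example = [
    "You can put the animals there.",
    "You can take the pig and the cat and put them there.",
    "Can you put them there?",
    "Good.",
    "Can you put the pig there too?",
]
print(overlap_sets(example))
```

On the example utterances from the introduction, the first three utterances are grouped through the shared word "put"; because this simplified sketch allows no intervening utterances, "Good." ends the set.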

The problem
For the purpose of this work, we assume that variation sets play a role in language learning for infants, but we are agnostic as to the precise nature of this role. Rather, the aim is to investigate the longitudinal behaviour of variation sets using a definition which subsumes earlier work but where the repetitiousness may also be, on the one hand, semantic (with no or very little surface repetition) and, on the other hand, prosodic or non-verbal (while displaying exact repetition). To obtain a baseline for the behaviour of this phenomenon in Swedish, we develop a gold standard for variation sets. To facilitate further empirical investigation, we introduce a surface algorithm which we evaluate on the gold standard and apply to Croatian, English and Russian.

Criteria for variation sets

Basic criteria
A starting-point for our work is Küntay and Slobin's (1996; 2002) definition, which takes variation sets to be sequences of utterances with the same communicative intention but with small differences in form. Basically, our definition subsumes Küntay and Slobin's, but we extend it in certain ways. First, along with Brodsky et al. (2007), we extract variation sets (whether manually in the gold standard or automatically using the algorithm) by comparing successive pairs of utterances: first-second, second-third, etc. Also, up to two intervening utterances (such as interjections) by the parent are allowed at any time in a sequence (similarly to Snow (1972, p. 251) and Brodsky et al. (2007)). Furthermore, we allow for verbal input from the child within variation sets. The rationale for this is that our data covers the ages 0;7-2;9 years (see Section 3), and especially in the early dyads the children are still learning to take turns. As for constant intention, we make one exception, following Küntay and Slobin (2002): we include in variation sets question-answer sequences where the parent provides both the question and the answer.

Surface and semantic repetitiousness
A difference compared to previous work that we are aware of is that we aim at capturing a continuous scale of surface and semantic repetitiousness, where, at one extreme, the repetitiousness may be purely semantic without any surface similarity at all. Here is an example of this from our data, with approximate translations: Titta här då! (But look here!) Har du sett vilka tjusiga byxor? (Have you seen the fancy pants?) Kolla! (Look!) The intention in each of these utterances is to make the child look at the pants, but there is no overlap whatsoever in form between the utterances.
Figure 1: Phonetic and prosodic analysis of a repetition of the Swedish phrase "Var är gummiankan?" ("Where is the rubber duck?"), uttered by a male speaker. The utterances are ordered with the first utterance on top. The y-axis represents frequency (semitones) and the x-axis represents time (seconds). Black thick horizontal lines show stylised intonation based on tonal perception.

Multimodal variation
Contrary to many previous studies of variation sets (Küntay and Slobin, 1996; Küntay and Slobin, 2002; Waterfall, 2006), we include exact (verbatim) repetitions in our definition of variation sets. This is motivated by the result of a study that we made of three dyads in the multimodally annotated MINGLE-3 corpus (see Section 3). Whenever word-for-word repetition occurred in the written transcript of the three dyads, we found consistent patterns of prosodic variation in the parents' speech, involving pitch, timing and/or stress, and typically also variation of their non-verbal cues, involving eye gaze direction, deictic gestures or object manipulation. Figure 1 shows a phonetic and prosodic analysis of a variation set from our data with three exact repetitions of the Swedish utterance "Var är gummiankan?" ("Where is the rubber duck?"). The vertical line indicates time-synchronized starts of the repetitions. In the analysis window for each repetition, a black thick horizontal line shows stylized intonation based on tonal perception (perceived pitch). A downwards tilted line means falling intonation (from brighter to deeper voice), upwards tilted means rising intonation. In the background, the waveform and intensity (thin line) can be seen. The annotation rows beneath each repetition contain phonetic transcription in IPA (top row) and syllable segmentation (second row). The third row at the bottom contains an orthographic annotation. Here, the first utterance (shown at the top of the figure) initially displays relatively flat intonation, and then rising intonation with a peak on the first syllable in the noun "ankan" ("duck"), with a fall on the last syllable. In contrast, the second utterance has shorter duration and falling intonation throughout. Finally, the third utterance has completely flat intonation, with duration similar to the first utterance but with a prolongation of the first syllable, corresponding to the adverb "var" ("where").
Although this is just a small study, the fact that variation is here being systematically manifested through prosody and/or non-verbal cues when the wording is constant fits well with our general impression of exact repetitions. It is because of this multimodal variation that we include verbatim repetitions in variation sets.
Our data come from the MINGLE-3 corpus (Björkenstam and Wirén, 2014), consisting of 18 longitudinal dyads with three children (two girls, one boy) recorded between the ages of 7 and 33 months with six dyads per child, all of which are multimodally annotated. The complete duration of the 18 dyads is 7:29 hours (mean duration 24:58 minutes). The video and audio recordings were made from naturalistic parent-child interaction in a recording studio at the Phonetics Laboratory at Stockholm University (Lacerda, 2009). The children were interacting alternately with their mothers (10 dyads) and fathers (8 dyads). The scenario was free play. The ELAN annotation tool (Wittenburg et al., 2006) was used for transcription of parent and child utterances, as well as annotation of eye gaze, deictic gestures and object manipulation (Björkenstam and Wirén, 2014). The transcripts have been automatically annotated with part-of-speech and morphosyntactic tags using Stagger (Östling, 2013), followed by manual correction.

Creating a gold standard
The manual annotation of variation sets started with analysis of four dyads, based on a guideline according to the criteria in Section 2. The same criteria were applied throughout all age groups. The annotations were made in ELAN, using timelines to code the extensions of variation sets across utterances, and taking into account both verbal and non-verbal input from parent and child from transcriptions, audio and video.
Each of the four dyads was annotated by two coders independently. The resulting annotations were merged, and a third annotator marked cases of disagreement. This resulted in an inter-annotator agreement (measured as set overlap between annotators) of 78%. The remaining 14 dyads were annotated by one annotator. During this phase, a classification of communicative intention based on the Inventory of Communicative Acts-Abridged (Ninio et al., 1994) was added. This classification was evaluated by comparing four representative dyads annotated by three independent annotators, resulting in a Fleiss's kappa of 0.63.
Table 1: Results of the longitudinal study of Swedish variation sets (also used as gold standard in Table 3). The third row shows the proportions of child-directed utterances that are in variation sets. Each figure is obtained by first calculating the proportion per dyad and then averaging the proportions over all dyads in the respective age group. Boldface indicates statistically significant difference to boldfaced neighbour (z-test of sample proportions; respectively, z = 8, p < 0.0001; z = 2.3, p < 0.02; z = 8.2, p < 0.0001; two-tailed). The fourth row shows the proportions of exact repetitions within variation sets. Each figure is obtained by first calculating the proportion per variation set and averaging over the dyad, then averaging over all dyads in the respective age group.

Results: Gold standard variation sets
In order to obtain a baseline for how the proportion of utterances that are in variation sets varied as a function of the age of the children, we grouped the dyads according to child age into the following four data sets:
Age group 1: 0;7-0;9 (7-9 months)
Age group 2: 1;0-1;2 (12-14 months)
Age group 3: 1;4-1;7 (16-19 months)
Age group 4: 2;3-2;9 (27-33 months)
As shown in Table 1, our gold standard displayed a consistent decrease in the proportion of utterances in variation sets over time, from 50% for age group 1 to 14% for group 4. The proportion of verbatim repetitions in variation sets also decreased, from 24% for age group 1 to 10% for group 4.
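The figures in Table 1 are described as per-dyad proportions averaged over the dyads of an age group, rather than proportions pooled over all utterances. The following small Python sketch shows the difference between the two ways of averaging. The counts are invented toy numbers, not data from the study, and the function names are hypothetical.

```python
from statistics import mean
from typing import List, Tuple

# Each dyad: (utterances inside variation sets, total child-directed utterances).
# Toy numbers for illustration only (assumption, not from the corpus).
Dyad = Tuple[int, int]


def per_dyad_average(dyads: List[Dyad]) -> float:
    """Average of per-dyad proportions: every dyad counts equally."""
    return mean(inside / total for inside, total in dyads)


def pooled_proportion(dyads: List[Dyad]) -> float:
    """Pooled proportion: longer dyads weigh more (shown only for contrast)."""
    return sum(inside for inside, _ in dyads) / sum(total for _, total in dyads)


toy_group = [(50, 100), (10, 40)]
print(per_dyad_average(toy_group))   # mean of 0.50 and 0.25
print(pooled_proportion(toy_group))  # 60 / 140
```

Averaging per dyad keeps a single long recording session from dominating the figure for its age group.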

Automatic extraction of variation sets
The method that we use for extracting variation sets is deliberately surface-based to allow us to determine how far this can bring us relative to our gold standard, which is based on both surface and semantic criteria. As mentioned above, the algorithm performs a stepwise comparison of pairs of successive utterances. The criterion for including two successive utterances in a variation set is that the similarity between them (regarded as strings) does not fall below a certain threshold. Additionally, following Brodsky et al. (2007) and others, we allow for sequences of maximally two intervening dissimilar utterances that do not obey this condition.
For string comparison, we used Ratcliff-Obershelp pattern recognition (Black, 2004) as implemented in the Python module difflib. We refer to the variation-set extraction algorithm using this as "difflib ratio", DLR. When comparing two strings, the matcher returns a value between 0 and 1. A value of 1 corresponds to an exact repetition, and 0 corresponds to two utterances without any overlap of words. By using this value as a parameter, we can obtain a threshold for the desired degree of similarity. The threshold can either be selected arbitrarily, or learned from evaluation against the gold standard variation sets. When evaluated against the gold standard, the optimal similarity threshold was 0.55 (see Figure 2).
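A minimal sketch of an extraction procedure along these lines is given below, using `difflib.SequenceMatcher.ratio()` with the threshold 0.55 and at most two intervening dissimilar utterances. Several details are assumptions rather than the authors' implementation: utterances are compared as lower-cased character strings, each candidate is compared with the most recent member of the set, intervening utterances are not counted as members, and the function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List


def similarity(a: str, b: str) -> float:
    """Ratcliff-Obershelp similarity in [0, 1], as computed by difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def extract_variation_sets(
    utterances: List[str], threshold: float = 0.55, max_gap: int = 2
) -> List[List[int]]:
    """Return lists of utterance indices forming candidate variation sets."""
    sets: List[List[int]] = []
    start, n = 0, len(utterances)
    while start < n:
        members = [start]
        j = start + 1
        # Look ahead past at most `max_gap` dissimilar utterances.
        while j < n and j - members[-1] <= max_gap + 1:
            if similarity(utterances[members[-1]], utterances[j]) >= threshold:
                members.append(j)
            j += 1
        if len(members) > 1:
            sets.append(members)
            start = members[-1] + 1  # continue after the extracted set
        else:
            start += 1
    return sets


utts = [
    "Where is the duck?",
    "Where is the duck?",
    "Look!",
    "Where is the rubber duck?",
    "Good.",
]
print(extract_variation_sets(utts))
```

In this toy input, the exact repetition and the later partial repetition are grouped together, with "Look!" treated as an intervening utterance.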
We experimented with including information from the part-of-speech tagging of the transcripts (see Section 3) in such a way that the pair of strings compared consisted of both the words and their part-of-speech tags. Our intuition was that this might give us a more refined analysis, for example, by distinguishing cases of homonymy. This version of the algorithm turned out not to improve performance, however (see Figure 2), and was therefore dropped.
We evaluated the algorithm against the gold standard variation sets using two kinds of metrics, which we refer to as strict and fuzzy matching. Strict matching requires exact matching on the utterance level of the extracted variation set and the corresponding gold standard set, whereas fuzzy matching allows for partial overlaps of the extracted variation set and the gold standard set. In the example in Table 2, only utterances 3 and 4 are members of the gold standard variation set, whereas the algorithm extracts utterances 1-4. Hence, the strict matching metric treats this extracted set as a false positive, whereas the fuzzy matching metric treats it as a true positive. As for fuzzy matching, we need a way of calculating precision for different degrees of overlap with the gold set. The measure we have adopted for this purpose is mean average precision (MAP); see Croft et al. (2009, p. 313).
Table 3 summarizes the results of extraction of variation sets relative to the gold standard according to the strict and fuzzy metrics. Strict F-score reaches 0.56 and fuzzy F-score reaches 0.82 for age group 1, but F-scores gradually decrease with increasing age. Apparently, the variation displayed in the parents' speech becomes less amenable to surface methods as the children grow older. An indirect sign of this increased complexity in variation sets is that the proportion of exact repetitions decreases as the children grow older, as shown in Table 1.
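The difference between the strict and fuzzy criteria can be illustrated with the following Python sketch. It is a simplification under stated assumptions: fuzzy matching is reduced to a binary "any shared utterance" test, so the MAP-based weighting of partial overlaps mentioned above is not reproduced, and the function names are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Tuple


def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def evaluate(
    extracted: Iterable[Iterable[int]],
    gold: Iterable[Iterable[int]],
    fuzzy: bool = False,
) -> Tuple[float, float, float]:
    """Return (precision, recall, F-score) over sets of utterance indices."""
    ext: List[FrozenSet[int]] = [frozenset(s) for s in extracted]
    gld: List[FrozenSet[int]] = [frozenset(s) for s in gold]

    def matches(a: FrozenSet[int], b: FrozenSet[int]) -> bool:
        # Strict: identical membership. Fuzzy (simplified): any shared utterance.
        return bool(a & b) if fuzzy else a == b

    hits_ext = sum(any(matches(e, g) for g in gld) for e in ext)
    hits_gold = sum(any(matches(g, e) for e in ext) for g in gld)
    precision = hits_ext / len(ext) if ext else 0.0
    recall = hits_gold / len(gld) if gld else 0.0
    return precision, recall, f_score(precision, recall)


# The Table 2 situation: gold set = utterances 3-4, extracted set = utterances 1-4.
gold_sets = [[3, 4]]
extracted_sets = [[1, 2, 3, 4]]
print(evaluate(extracted_sets, gold_sets, fuzzy=False))
print(evaluate(extracted_sets, gold_sets, fuzzy=True))
```

With these inputs, the strict criterion counts the extracted set as a false positive, while the simplified fuzzy criterion counts it as a true positive.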

Extraction of variation sets in Croatian, English and Russian
To investigate the behaviour of variation sets in other languages, we ran the algorithm with longitudinal corpora in Croatian (footnote 9), English (footnote 10) and Russian (footnote 11) from CHILDES (MacWhinney, 2000). Although it was not possible to find a perfect correspondence with the age groups for Swedish, Table 4 shows how the selection of languages and transcripts from CHILDES partly matches the Swedish data. As shown in Table 4, both the proportion of variation sets and the proportion of exact repetitions, as far as can be seen, decrease consistently for Croatian, English and Russian.
Table 4: Results of the algorithm for automatic variation-set extraction applied to Croatian, English and Russian child-directed utterances from CHILDES. The rows show the number of utterances in each age group, the average proportion of utterances that are in variation sets, and the average proportion of exact repetitions in the variation sets, with figures having been calculated in the same way as in Table 1.
Footnote 9: Kovacevic: Vjeran, files 20 (0;10 years), 23 (1;2 years), 33 (1;7 years).
Footnote 10: Lara, files 1-09-13 (1;9 years), 2-06-00 (2;6 years).
Footnote 11: Protassova: Varv, files 01 (1;6 years), 04 (1;10 years), 06 (2;4 years).

Discussion
In our study of the Swedish gold standard, we obtained statistically significant decreases in the proportion of utterances within variation sets as a function of age between all age groups, from 50% for age group 1 to 14% for age group 4 (see Table 1). These differences were also more consistent than in Waterfall (2006), who obtained an overall decrease from 17% for 1;2 years to 12% for 2;6 years (ibid., p. 125). Waterfall's age span was shorter than ours, but its decrease was still less pronounced within the comparable age interval. It is also interesting to see that we obtained the largest proportion of variation sets for the youngest age group (0;7-0;9 years), which was not covered by Waterfall. The fact that we see larger age-related differences in our data does not seem to be attributable to the inclusion of exact repetitions in our variation sets, judging from the proportions of these in Table 1. In any case, and as argued in Section 2, extending the definition of variation sets in this way is motivated by an in-depth analysis of a subset of these utterances in our multimodally annotated corpus. We conjecture that when an utterance is repeated verbatim, there is instead multimodal variation that increases the information and helps the child learn from the utterance. As far as we know, our longitudinal figures on proportions of exact repetitions are also the first that have been reported.
Our automatic algorithm for variation set extraction is deliberately surface-based in order to test how far this kind of method can bring us. An independent advantage is that it is easily replicable since it is based on a standard library for string comparison. The algorithm reaches a fuzzy F-score of up to 0.82 (strict: up to 0.56) relative to the Swedish gold standard in spite of only using criteria related to form. The F-score drops as a function of age, however (see Table 3); we conjecture that this is due to the relation between form and intention becoming less transparent with increased age. That is, as the child develops and learns more language, the parents' variation gets more complex. One way of handling this complexity would be by generalizing the algorithm to recognize intention.
Since the algorithm only uses form-based criteria, it is in principle also language-independent. We obtain consistent decreases of the proportions of utterances in variation sets also when we apply the algorithm to Croatian, English and Russian corpora of child-directed language (see Table 4). Although in this case we have no evaluations, it is interesting to see that the behaviour corresponds to what we expected.

Conclusion
We have investigated the longitudinal behaviour of variation sets in child-directed speech according to a generalised definition. Variation sets appear to function as a device for effective communication and learning with young children: the speaker repeats the same content while varying the wording, prosody and/or non-verbal cues in order to maximise the chance of comprehension. With increasing age and language comprehension, there is less need for such repetitiousness.
Our study of Swedish covered a larger age span and displayed a more consistent decrease than Waterfall's (2006) study of American English. Our automatic algorithm seems to usefully approximate manual extraction of variation sets at least for lower age groups, and an advantage is that the algorithm is easily replicable. Applications of the algorithm to Croatian, English and Russian displayed similar decreases in the proportions of utterances in variation sets as a function of age. We also found that the proportions of exact repetitions are similarly decreasing as a function of age for all languages, and we have demonstrated how multimodal cues seem to provide other dimensions of variation in these utterances.