Filled Pauses in User-generated Content are Words with Extra-propositional Meaning

In this paper, we present a corpus study investigating the use of the fillers äh (uh) and ähm (uhm) in informal spoken German youth language and in written text from social media. Our study shows that filled pauses occur in both corpora as markers of hesitations, corrections, repetitions and unfinished sentences, and that the form as well as the type of the fillers are distributed similarly in both registers. We present an analysis of fillers in written microblogs, illustrating that äh and ähm are used intentionally and can add a subtext to the message that is understandable to both author and reader. We thus argue that filled pauses in user-generated content from social media are words with extra-propositional meaning.


Introduction
In spoken communication, we find a high number of utterances that are disfluent, i.e. that include hesitations, repairs, repetitions etc. Shriberg (1994) estimates the ratio of disfluent sentences in spontaneous human-human communication to be in the range of 5-6%.
One particular type of disfluency is the filled pause (FP), such as äh (uh) and ähm (uhm). FP are a frequent phenomenon in human communication and can have multiple functions. They can occur at any position in an utterance and are used when a speaker encounters planning and word-finding problems (Maclay and Osgood, 1959; Arnold et al., 2003; Goffman, 1981; Levelt, 1983; Clark, 1996; Barr, 2001; Clark and Fox Tree, 2002), or as strategic devices, e.g. as floor-holders or turn-taking signals (Maclay and Osgood, 1959; Rochester, 1973; Beattie, 1983). Filled pauses can function as discourse-structuring devices, but they can also express extra-propositional aspects of meaning beyond the propositional content of the utterance, e.g. as markers of uncertainty or politeness (Fischer, 2000; Barr, 2001; Arnold et al., 2003).
Examples (1)-(6) illustrate the use of FP to mark repetitions (1), repairs (2), breaks (3) and hesitations (4) (the last one often used to bridge word finding problems). FPs can also express astonishment (5), excitement or negative sentiment (6). Extralinguistic reasons also come into play, such as the lack of concentration due to fatigue or distraction, which might lead to a higher ratio of FP in the discourse.
(1) I will uh I will come tomorrow.
(2) I will leave on Sat uh on Sunday.
(3) I think I uh have you seen my wallet?
(4) I have met Sarah and Peter and uhm Lara.
(5) Sarah is Michael's sister. Uh? Really?
(6) A: He cheated on her. B: Ugh! That's bad!
The role of fillers in spoken language has been discussed in the literature (for an overview, see Corley and Stewart (2008)). Despite this, work on processing disfluencies in NLP has mostly considered them as mere performance phenomena and focused on disfluency detection to improve automatic processing (Charniak and Johnson, 2001; Johnson and Charniak, 2004; Qian and Liu, 2013; Rasooli and Tetreault, 2013; Rasooli and Tetreault, 2014). Far fewer studies have focused on the information that disfluencies contribute to the overall meaning of the utterance. An exception is Womack et al. (2012), who consider disfluencies as extra-propositional indicators of cognitive processing.
In this paper, we take a similar stance and present a study that investigates the use of filled pauses in informal spoken German youth language and in written, but conceptually oral text from social media, namely Twitter microblogs. We compare the use of FP in computer-mediated communication (CMC) to that in spoken language, and present quantitative and qualitative results from a corpus study showing similarities as well as differences between FP in the spoken and the written register. Based on our findings, we argue that filled pauses in CMC are words with extra-propositional meaning.
The paper is structured as follows. Section 2 gives an overview of the different properties of spoken language and written microblogs. In section 3 we present the data used in our study and describe the annotation scheme. Section 4 reports our quantitative results, which we discuss in section 5. We complement our results with a qualitative analysis in section 6, and conclude in section 7.

Filled Pauses in Spoken and Written Registers
Clark and Fox Tree (2002) propose that FP are words with meaning, but so far there is no conclusive evidence to prove this. While experimental results have shown that disfluencies do affect the comprehension process (Brennan and Schober, 2001; Arnold et al., 2003), this does not prove that listeners have access to the meaning of a FP during language comprehension. It could also mean that FP are produced "unintentionally [...], but at predictable junctures, and listeners are sensitive to these accidental patterns of occurrence" (Corley and Stewart, 2008, p. 12).
To show that fillers are words in a linguistic sense, i.e. lexical units that have a specific semantics that is understandable to both speaker and hearer, one would have to show that speakers are able to produce them intentionally and that recipients are able to interpret the intended meaning of a filler.
Assuming that fillers are not linguistic words but simply noise in the signal, caused by the high demands on cognitive processing in spoken online communication, we would not expect to find them in medially written communication such as user-generated content from social media, where the production setting does not put the same time pressure on the user as oral face-to-face communication does. However, a search for fillers on Twitter easily proves this wrong, yielding many examples of the use of FP in medially written text (7).
(7) Oh uh.. I got into the evolve beta.. yet I have no idea what this game is.. uhm..
Both informal spoken dialogues and microblogs can be described as conceptually oral, meaning that both display a high degree of interactivity, signalled by the use of backchannel signals and question tags, and are highly informal, with grammatical features that deviate from those of the written standard variety (e.g. violations of word order constraints, case marking, etc.). Both registers show a high degree of expressivity, e.g. interjections and exclamatives, and make use of extra-linguistic features (spoken language: gestures, facial expressions, voice modulation; microtext: emoticons, hashtags, use of uppercased words for emphasis, and more).
Differences between the two registers concern the spatio-temporal setting of the interaction. While spoken language is synchronous and takes place in a face-to-face setting, microblogging usually involves a spatial distance between users and is typically asynchronous, but also allows users to have a quasi-synchronous conversation. Quasi-synchronous here means that it is possible to communicate in real time when both (or all) communicating partners are online at the same time, tweeting and re-tweeting in quick succession, but without the need for turn-taking devices, as there is a strict first-come-first-serve order for the transmission of the dialogue turns. As a result, microblogging does not put the same time pressure on the user but permits them to monitor and edit the text. This should rule out the use of FP as markers of disfluencies such as repairs, repetitions or word-finding problems, and also the use of FP as strategic devices to negotiate who takes the next turn. Accordingly, we would not expect to observe any fillers in written microblogs if their only functions were the ones specified above.
However, regardless of the limited space for tweets, microbloggers make use of FP in microtext. This suggests that FP do indeed serve an important communicative function, with a semantics that must be accessible to both the blogger and the recipient.

Annotation Experiment
This section describes the data and setup used in our annotation experiment.

Data
The data we use in our study comes from two different sources. For spoken language, we use the KiezDeutsch-Korpus (KiDKo) (Wiese et al., 2012), a corpus of self-recordings of every-day conversations between adolescents from urban areas. All informants are native speakers of German. The corpus contains spontaneous, highly informal peer group dialogues of adolescents from multiethnic Berlin-Kreuzberg (around 266,000 tokens excluding punctuation) and a supplementary corpus with adolescent speakers from monoethnic Berlin-Hellersdorf (around 111,000 tokens). On the normalisation layer where punctuation is included, the token counts add up to around 359,000 tokens (main corpus) and 149,000 tokens (supplementary corpus).
The first release of KiDKo (Rehbein et al., 2014) includes the transcriptions (aligned with the audio files), a normalisation layer, and a layer with part-of-speech (POS) annotations as well as non-verbal descriptions and the translation of Turkish code-switching.
The data was transcribed using an adapted version of the transcription inventory GAT 2 (Selting et al., 1998), also called GAT minimal transcript, which uses uppercased letters to encode the primary accent and hyphens in round brackets to mark silent pauses of varying length.
The microblogging data consists of German-language Twitter messages from different regions of Germany, and includes 7,311,960 tweets with 105,074,399 tokens. For retrieving the tweets we used the Twitter Search API, which allows one to specify the user's location by giving a latitude and a longitude pair as parameters for the search. Over a time period of 6 months we collected tweets from 48 different locations. The corpus was automatically augmented with a tokenisation layer and POS tags.
A string search in both corpora, looking for variants of äh and ähm (including upper- and lowercased spelling variants with multiple ä, with and without an h, and with one or more m) shows the following distribution (Table 1). Filled pauses are far less frequent in microblogs compared to spoken language, but due to the large amount of data we can easily extract more than 10,000 instances from the Twitter corpus. Note that the tweets in our corpus come from different registers like news, ads, public announcements, sports, and more, with only a small portion of private communication. When constraining the corpus search to the subsample of private tweets, we will most likely find a higher proportion of FP in the social media data.
In summary, we observe a higher amount of FP in spoken language than in Twitter microblogs. However, in both corpora variants of äh outnumber ähm by roughly the same factor. This observation is compatible with the results of Womack et al. (2012), who report that around 60% of the FP in their corpus of English diagnostic medical narratives are nasal filled pauses (uhm, hm) and around 40% are non-nasal (uh, er, ah).
Table 2: Labels used for annotating the fillers (B: between utterances; I: integrated in the utterance).
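The string search described above can be approximated with a regular expression. The following is a minimal sketch, not the query actually used for the study: the pattern, the word-boundary restriction and the helper name `find_fillers` are assumptions made for illustration. The word boundaries in particular are stricter than a plain substring search and skip filler-like strings inside longer tokens.

```python
import re

# Assumed pattern: one or more "ä", an optional "h", then zero or more "m",
# matched case-insensitively and only as a whole token (word boundaries).
FILLER_RE = re.compile(r"\bä+h?m*\b", re.IGNORECASE)


def find_fillers(text):
    """Return (surface form, type) pairs for each filler-like token in text."""
    hits = []
    for match in FILLER_RE.finditer(text):
        form = match.group(0)
        # Forms containing an "m" are treated as nasal (ähm), all others as non-nasal (äh).
        kind = "nasal" if "m" in form.lower() else "non-nasal"
        hits.append((form, kind))
    return hits


if __name__ == "__main__":
    print(find_fillers("Ich komme äh morgen, ÄHM, vielleicht."))
```

Because of the word boundaries, ordinary words such as während or ähnlich are not matched, but any token consisting only of ä characters is still counted as a non-nasal filler, so manual filtering of the hits would remain necessary.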

Annotating Fillers in Spoken Language and in Microtext
To be able to compare the use of fillers in spoken language with the one in Twitter microtext, we extract samples from the two corpora including 500 utterances/tweets with at least one use of äh and 500 tweets with at least one instance of ähm. At the time of the investigation, the transcription of KiDKo was not yet completed, and we only found 360 utterances including an ähm in the finished transcripts. For annotation, we used the BRAT rapid annotation tool (Stenetorp et al., 2012). Our annotation scheme is shown in Table 2. We distinguish between different categories of fillers, namely between FP that mark repetitions, repairs, hesitations, or that occur at the end of an unfinished utterance/tweet (breaks). We also annotated variants of äh and ähm which were used as question tags or interjections, but do not consider them as part of the disfluency markers we are interested in. The Unknown label was used for instances which either do not belong to the filler class and shouldn't have been extracted, such as example (8), or which couldn't be disambiguated, usually due to missing context.

(8) Hääähähh !!!
Each filler is labelled with its category and position. By position we mean the position of the filler in the utterance or tweet. Here we distinguish between fillers which occur between (B) utterances/at the beginning or end of tweets (example 9b) and those which are integrated (I) in the utterance/tweet (9a). The numbers in the first column of Table 2 correspond to examples (1)-(6).
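The combination of category and position can be illustrated with a small sketch for tweets. This is only a rough heuristic under stated assumptions, not the procedure followed by the annotators, who assigned the labels manually: the function names `filler_position` and `make_label` are hypothetical, the tweet is assumed to be tokenised already, and a tweet is treated as a single utterance.

```python
import string


def filler_position(tokens, index):
    """Return "B" if tokens[index] is the first or last non-punctuation token, else "I".

    Assumption: the tweet is one utterance, so "between utterances" reduces to
    "at the beginning or end of the tweet".
    """
    def is_punct(tok):
        return all(ch in string.punctuation for ch in tok)

    content = [i for i, tok in enumerate(tokens) if not is_punct(tok)]
    if index not in content:
        raise ValueError("index must point to a non-punctuation token")
    return "B" if index in (content[0], content[-1]) else "I"


def make_label(category, position):
    """Combine a category name and a position code, e.g. REPAIR and I give REPAIR-I."""
    return f"{category}-{position}"


if __name__ == "__main__":
    tokens = "ich komme äh morgen".split()
    print(make_label("REPAIR", filler_position(tokens, 2)))
```

Tweets that contain several utterances would need a segmentation step first, which is exactly where human annotators can disagree.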

Inter-Annotator Agreement
The data was divided into subsamples of 100 utterances/tweets. Each sample was annotated by three annotators. Table 3 shows the inter-annotator agreement (Fleiss' κ) on the KiDKo and Twitter samples. We report agreement for all but three samples, which we used to train the annotators, to refine the guidelines, and to discuss problems with the annotation scheme.
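For readers unfamiliar with the measure, Fleiss' κ for a fixed number of annotators per item can be computed as in the following sketch. It is a generic textbook implementation, not the script used to produce Table 3, and the labels in the usage example are invented for illustration.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Compute Fleiss' kappa.

    ratings: list of items, each a list with one label per annotator.
    Every item must have the same number of annotators (at least two).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Proportion of agreeing annotator pairs for this item.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )

    observed = agreement_sum / n_items
    # Chance agreement from the overall label distribution.
    expected = sum((t / (n_items * n_raters)) ** 2 for t in category_totals.values())
    if expected == 1.0:
        # Only one label was ever used; kappa is undefined, this sketch returns 1.0.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Invented labels for three annotators on four items.
    sample = [
        ["HESITATION-I", "HESITATION-I", "HESITATION-I"],
        ["REPAIR-I", "REPAIR-I", "REPAIR-B"],
        ["BREAK-I", "REPAIR-I", "REPAIR-I"],
        ["HESITATION-B", "HESITATION-B", "HESITATION-B"],
    ]
    print(round(fleiss_kappa(sample), 3))
```

Since each label combines category and position, a disagreement on either dimension counts as a full disagreement in this computation.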
As we had only 360 instances of ähm from KiDKo, we divided them into three samples with 100 utterances and a fourth sample with 60 utterances. Table 3 shows that the annotation of fillers is not an easy task. The disagreements in the annotations concern both the category and the position of the FP. In some cases the annotators agree on the label but disagree on the position of the filler (10a). This can be explained by the fact that spoken language (and sometimes also tweets) does not come with sentence boundaries, and it is often not clear where we should segment the utterance. In example (10a) two annotators interpreted the reparandum as part of the utterance and thus assigned REPAIR-I, while the third annotator analysed am Samstag (on Saturday) as a new utterance, resulting in the label REPAIR-B. More often, however, the disagreements concern the category of the filler, as in (10b) where two annotators analysed the utterance as a repair while the third annotator interpreted it as a break followed by a new start. The results show that the annotation of fillers in KiDKo seems to be much harder, with average κ scores around 0.1 lower than for the tweets.
Table 4 shows that the ranking for the different categories of äh and ähm is the same in both corpora (11). Hesitations are the most frequent category marked by äh and ähm, followed by repairs and breaks. Repetitions are less frequent, especially in the written microblogs, as are äh and ähm as question tags and interjections.
However, we can also observe a substantial difference between the spoken and the written register. In the latter, the two most frequent categories, hesitations and repairs, make up more than 90% of all instances of äh and ähm, while in spoken language these two categories only account for 76-77% of all occurrences of the two fillers. A possible explanation is that breaks and repetitions in spoken language are either performance phenomena or caused by discourse strategies (e.g. floor-holding), both of which are superfluous in asynchronous written communication. This still leaves us with the question of why hesitations and repairs occur in written text at all. We will come back to this question in section 6.

Quantitative Results
The next question we ask is whether the two forms, äh and ähm, are used interchangeably or whether the use of each form is correlated with its function. As shown in Table 5, hesitations and breaks are more often marked by ähm while äh occurs more frequently as a marker of repairs and repetitions. This observation holds for both the spoken and the written register. 72.8% and 80.0% of all instances of ähm occur in the context of a hesitation in KiDKo and Twitter, respectively, while only 59.0% (KiDKo) and 65.8% (Twitter) of the non-nasal fillers äh are used to mark a hesitation. A Fisher's exact test shows that for hesitations and repairs, the differences are statistically significant with p < 0.01 and p < 0.05, while for breaks and repetitions, the differences might be due to chance.
Next we look at the syntactic position where those fillers occur in the text. We would like to know how often FP are integrated in the utterance and how often they occur between utterances. Fox et al. (2010) present a cross-linguistic study on self-repair in English, German and Hebrew, and observe that self-corrections in English often include the repetition of whole clauses, i.e. English speakers "recycle" back to the subject pronouns (Fox et al., 2010: 2491). In their German data this pattern was less frequent. Fox et al. (2010) conclude that morpho-syntactic differences between the languages have an influence on the self-repair practices of the speakers.
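A test of this kind compares, for one category at a time, how often each filler form does or does not mark that category. The sketch below implements a two-sided Fisher's exact test for a 2x2 table with the standard library only. It is an illustration under assumptions: the text does not state whether a one- or two-sided test was used, and the counts in the usage example are invented, not taken from Table 5.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the probabilities of all tables with the same margins that are
    at most as probable as the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance so that ties are not lost to rounding.
    p_value = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-7)
    )
    return min(1.0, p_value)


if __name__ == "__main__":
    # Invented counts: rows are ähm / äh, columns are hesitation / other category.
    print(fisher_exact_two_sided(73, 27, 59, 41))
```

Running one such test per category, as described above, means several comparisons on the same sample, which is worth keeping in mind when reading the individual p-values.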
Our findings are consistent with Fox et al. (2010) in that we mostly observe the repetition of words, not of clauses (Table 6). Nearly all fillers which mark repetitions are integrated in the utterance or tweet, only a few occur between utterances/tweets. Fillers as markers of repairs are also mostly integrated.
For hesitations, the most frequent category, we get a more diverse picture. In our spoken language data, äh and ähm are more often integrated in the utterance, while for tweets FP as hesitation markers mostly appear at the beginning or end of the tweet.
So far, our quantitative investigation has shown some striking similarities in the use of filled pauses in the two corpora. In both registers, the ranking of the different disfluency types marked by the FP was the same. Furthermore, we showed that speakers/users are sensitive to the surface form of a FP and prefer to use äh in repairs and ähm in hesitations, regardless of the medium they use for communication.

Discussion
In this section we will look at related work on FP and try to put our findings into context. Previous work on the difference between nasal and non-nasal fillers (Barr, 2001;Clark and Fox Tree, 2002) has described nasal fillers such as uhm, hm as indicators of a high cognitive load, while their non-nasal variants indicate a lower cognitive load during speech production. Clark and Fox Tree (2002) have proposed the filler-as-word hypothesis, stating that FP like uh and uhm are words in a linguistic sense with the basic meaning that a minor (uh) or major (uhm) delay in speaking is about to follow. This analysis is based on a corpus study showing that silent pauses following a nasal filler are longer than silent pauses after a non-nasal filler. Beyond the basic meaning, FP can have different implicatures, depending on the context they are used in, such as indicating that the speaker wants to keep the floor, is planning the next (part of the) utterance, or wants to cede the floor. To illustrate this, Clark and Fox Tree (2002) use goodbye which has the basic meaning "express farewell" but, when uttered while someone is approaching the speaker, can have the implicature "Go away".
We take the filler-as-word hypothesis of Clark and Fox Tree (2002) as our starting point and see how adequate it is to describe the use of FP in written microblogs (section 6). However, we try to avoid the term implicature which seems problematic in this context, as we are not dealing with implicatures built on regular lexical meanings but rather with implicatures on top of non-propositional meaning. As a side-effect, the implicatures based on filled pauses are not cancellable.
The analysis of Clark and Fox Tree (2002) is not uncontroversial (see, e.g., Womack et al. (2012) for a short discussion on that matter). O'Connell and Kowal (2005) criticise that the corpus study of Clark and Fox Tree (2002) is based on pause length as perceived by the annotators (instead of being analysed by means of acoustic measurements). Furthermore, it might be possible that the semantics of FP to indicate the length of a following delay only applies to English. Belz and Klapi (2013) have measured pause lengths after nasal and non-nasal fillers in German L1 and L2 dialogues from a MAP task and could not find a similar correlation between filler type and pause length.
In summary, it is not clear whether the different findings are due to methodological issues, or might be particular to certain languages and text types. Shriberg (1994, p. 130) suggests that for English, models of disfluencies based on the ATIS corpus, a corpus of task-oriented dialogues about air travel planning, might not be able to predict the behaviour of disfluencies in spoken language corpora with data recorded in a less restricted setting.
The MAP task corpora used in Belz and Klapi (2013), for example, include dialogues where one speaker instructs another speaker to reproduce a route on a map. Due to the functional design, the content of the dialogues is constrained to solving the task at hand and thus the language is expected to differ from the one used in the London-Lund corpus (Svartvik, 1990), a corpus of personal communication that was used by Clark and Fox Tree (2002).
Fox Tree (2001) presents a perception experiment showing that uh helps listeners to recognize upcoming words, while the nasal um doesn't. In our study we found a strong correlation between the category of the filler and its form (nasal vs. non-nasal). Nasal fillers were mostly used in the context of hesitations, which is consistent with their ascribed basic function as indicators of longer pauses (Clark and Fox Tree, 2002). The tendency to use äh within repairs might be explained by Fox Tree (2001)'s findings that non-nasal fillers help to recognise the next word. Thus, we would expect a preference for non-nasal FP to be used as an interregnum before the repair.
Other evidence comes from Brennan and Schober (2001) who present experiments where the subjects had to follow instructions and select objects on a graphical display. They showed that insertions of uh after a mid-word interruption in the instruction helped the subjects to correctly identify the target object, as compared to the same instruction where the filler was replaced by a silent pause. They conclude that fillers help to recover from false information in repairs.
So far, our findings are consistent with previous work outlined above, but do not rule out other explanations. A major argument against the analysis of FP as linguistic words is that so far there is no conclusive evidence that speakers do produce them intentionally (Corley and Stewart, 2008).
Our corpus study provides this evidence by showing that FP in CMC are produced deliberately and intentionally. Furthermore, we observed a statistically significant correlation between filler form (nasal or non-nasal) and filler category, which also points at äh and ähm being separate words with distinguishable meanings.
In the next section, we show that FP in CMC can add a subtext to the original message that can be understood by the recipients, and that the information they add goes beyond the contribution made by nonverbal channels such as facial expressions or gestures. We illustrate this, based on a qualitative analysis of our Twitter data.

Extra-propositional Meaning of FP in Social Media Text
New text from social media provides us with a good test case to investigate whether filled pauses are words with (extra-propositional) meaning, as the production of written text is to a far greater extent subject to self-monitoring processes. This means that we can confidently rule out that the use of fillers in tweets is due to performance problems caused by the time pressure of online communication. Another important point is that communication on Twitter is not synchronous but can be time-delayed and works on a first-come-first-serve basis. This is quite important, as it means that we can also exclude the discourse-strategic functions of FP (e.g. floor-holding and turn-taking) as possible explanations for the use of fillers in user-generated microtext.
We conclude that there have to be other explanations for the use of filled pauses as markers of hesitations and repairs in microblogs. Consider the following examples (12)-(14).
(14) [...] Freund.
[...] friend.
"I'm asking for uhm a friend."
The fillers in the examples above add a new layer of meaning to the tweet which results in an interpretation different from the one we get without the filler. While a simple "Congratulations!" as answer to the message "I'm married now" would be interpreted as a polite phrase, the mere addition of the filler implies that this tweet should not be taken at face value and has a subtext along the lines "Actually, I really feel sorry for you". The same is true for (13) where the subtext can be read as "In fact, we're talking about some other body parts here". In example (14), the subtext added by the filler will most probably be interpreted as "I'm really asking for myself but won't admit it". 9
In the next examples (15)-(17), also hesitations, the filler is used to express the author's uncertainty about the proposition. Thus, the most general commonality between the examples above is that the speaker does not make a commitment concerning the truth content of the message.
The following examples (18)-(21) show instances of äh and ähm in repairs where the FP occur as interregnum between reparandum and repair. 10
I will leave you on Sat uh on Sunday
REPARANDUM INTERREGNUM REPAIR
9 In fact, this adds an interesting meta-level to the utterance, as by inserting the filler the author draws attention to the fact that there is something she seemingly wants to hide.
10 We follow the terminology of Shriberg (1994).
The tweet author enacts a slip of the tongue, either by using homonymous or near-homonymous words (Diskus (discus) - Discos (discos), hängst (hang) - Hengst (stallion)) or by using analogies and conventionalised expressions (off - on, resist - contradict). The "mistake" was made with humorous intention and is then corrected. The filler again takes the slot of the interregnum and serves as a marker of the intended pun. In the next set of examples, (22)-(24), a taboo word or word with a strong negative connotation is reformulated into something more socially acceptable (minister of propaganda → district mayor; madness → spirit; tantalise → educate). Often, this is done with a humorous intention, but also to express negative sentiment (e.g. in (22)
These examples show that the use of äh and ähm in tweets is intentional and highly edited. The two forms are used to express the speaker's uncertainty about the propositional content of the message, or as a signal that the speaker does not warrant the truth of the message. Other functions include the use of fillers as markers of humorous intentions and of negative sentiment (see Table 7). Note that the meanings are not necessarily distinct but often overlap.
We thus argue that FP in user-generated content from social media are linguistic words that are produced intentionally and have an extra-propositional meaning that can be understood by the recipients.

Meaning         Description
UNCERTAINTY     Speaker is uncertain about the propositional content
TRUTH CONTENT   Speaker does not warrant the truth content of the proposition
HUMOR           Marker of humorous intention
EVALUATION      Marker of negative sentiment
Table 7: Extra-propositional meaning of fillers in CMC.

Conclusions
The results from our corpus study show that fillers in user-generated text from social media are linguistic words that are produced intentionally and function as carriers of extra-propositional meaning. This finding has consequences for work on Sentiment Analysis and Opinion Mining in social media text, as it shows that FP are used as a marker of irony and humour in Twitter, and also indicate uncertainty and negative sentiment. Thus, filled pauses might be useful features for irony detection, sentiment analysis, or to assess the strength of an opinion in online debates.