Functions of Silences towards Information Flow in Spoken Conversation

Silence is an integral part of the most frequent turn-taking phenomena in spoken conversations. Silence is sized and placed within the conversation flow, and it is coordinated by the speakers along with the other speech acts. The objective of this analytical study is twofold: to explore the functions of silences lasting one second or longer with respect to information flow in a dyadic conversation, using the sequences of dialog acts present in the turns surrounding the silence itself; and to design a feature space useful for clustering the silences with a hierarchical concept formation algorithm. The resulting clusters are manually grouped into functional categories based on their similarities. We observe that silence plays an important role in response preparation and can also indicate speakers’ hesitation or indecisiveness. We also observe that long silences are sometimes used deliberately to force a response from the other speaker, which makes silence a multi-functional and important catalyst for information flow.


Introduction
Silence is a multifaceted natural phenomenon in human conversations that carries information rich in meaning and function. Even though "silence" is generally defined as the absence of speech (Jaworski, 1993) or a break in the conversation flow, its occurrence has the power to deliver a message and to trigger a human response, like any other conversational behavior. Silence in human conversations provides insights into thought processes, emotion, and attitude (Richmond et al., 1991), among others. At the same time, silence is used to convey power (dominance) (Saunders, 1985; Tannen, 1990), to show respect, and to manage conflicts.
Along with speech, silence is an integral part of human interaction, and the two complement and provide information about each other. In the words of Bruneau (1973): "Silence is to speech as the white of this paper is to this print." Because the reasons for silence are limitless, it also has many functions. One example is "eloquent silence", which includes the use of silence at funerals, at religious ceremonies, as a legal privilege, or in response to a rhetorical question (Ephratt, 2008). Apart from this, silence can be used to indicate topic avoidance, lack of information to provide a response, agreement, disagreement, anger, frustration, uncertainty, hesitancy, and others.
Over the years, researchers have studied silence with respect to, among other things, its location in a conversation (Richmond et al., 1991; Jensen, 1973), its role in a conversation (Cappella, 1980; Zimmermann and West, 1996; McLaughlin and Cody, 1982), and how its duration changes with different emotions (Alam et al., 2016). Silence has also been studied as a method of non-verbal communication (Kogure, 2007; Bruneau, 2008), along with its practices in different cultures (Richmond et al., 1991) and in different contexts. It has also been observed as a powerful tool for conflict management (Oduro-Frimpong, 2007) and studied within the context of psychotherapy (Frankel et al., 2006; Gale and Sanchez, 2005; Ladany et al., 2004; Ronningstam, 2006).
Unlike research on speech, studies on silence are either definitional (theoretical) or descriptive. Even within speech research communities, very few studies have analyzed the function of silence in a methodological manner. Generally, in a dialog system, silence is not acknowledged as a form of interaction; rather, its function in a conversation is seen as a "pause" or a "gap", whereas speech is viewed as the primary carrier of information. Thus, a further study of silence and its functions is important, as silence often does serve as a message, or at least as a means of offering contextual cues to the surrounding speech.
Therefore, the goal of this paper is to analyze the function of long silences (one second or longer) occurring between and within speakers in dyadic spoken conversations. Our focus is on understanding the perceived reasons for such functions with respect to the information flow in spoken conversations.
The paper is organized as follows. An overview of the experimental methodology used in this study is given in Section 2. We present an analysis of our findings in Section 3 and provide conclusions in Section 4.

Methodology
The methodology followed for grouping functions of the long silences is shown in Figure 1. The pipeline includes a data preparation phase, which involves extraction and selection of the long silence instances; followed by the feature design and extraction phase. The next phase in the pipeline is unsupervised clustering of the selected silence instances, which are later grouped into hierarchical clusters for the analysis of their functions.

Data Preparation
For this study, we have used conversations from the SISL Human-Human Conversational Discourse Corpus. The data is a subset of a large Italian call-center corpus in which call center agents are engaged in conversations with real customers. The customers call to solve a specific problem or to seek information. The inbound Italian phone conversations are recorded on two separate audio channels at 16-bit quality with an 8 kHz sample rate. The collected conversations (≈ 10K) have an average duration of 396.6±197.9 seconds.
To analyze the role of silence in the information flow of a conversation, we have selected 10 conversations that contain manual dialog act annotations (Chowdhury et al., 2016b) following the Dialog Act Markup Language (DiAML) (Bunt et al., 2010, 2012) annotation scheme. The details of the dimensions and the communicative functions considered for the annotation are given in Table 1. Dimensions such as Discourse Structuring and Speech and Turn Management are mapped to the tag Other, as they are very infrequent.

Extraction of Silence
Silence positions, as well as turn types, such as speakers' turns, overlapping turns, are extracted using the turn segmentation and labeling system (Chowdhury et al., 2016a). The input to the system is the audio of the conversation, the manual transcription and speaker information.
The forced-aligned transcription is obtained using an in-domain Automatic Speech Recognition (ASR) system (Chowdhury et al., 2014). Lexical information from these forced-aligned transcripts is used to extract turn-taking sequences. The pipeline uses the time-aligned output as tokens to create Inter-Pausal Units (IPUs) for each input channel. IPUs are defined as segments of consecutive tokens with gaps of no less than 50 ms in between. Using the time information of inter-IPUs and intra-IPUs, we then define steady conversation segments, where each segment maintains a steady time-line for both interlocutor channels. The labels of each silence segment are then defined by a set of rules as follows:
• Pause (P): Gaps between the turns of the same speaker lasting no less than 0.5 seconds. P_A and P_C represent the agent's and the customer's pauses, respectively.
• Lapse between speakers (L_B): Floor switches between the speakers with a silence duration of 2 seconds or more.
• Lapse within speaker (L_W): Gaps between the same speaker's turns with a silence duration of 2 seconds or more.
• Switch (SS): Floor switches between the speakers with a silence duration of less than 2 seconds, or with overlapping frames of not more than 20 ms. This category is also known as gaps.
The labeled turn sequences are then used to select silence instances for the analysis.
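The labeling rules above can be sketched as a small function. This is a minimal illustration rather than the turn segmentation system of Chowdhury et al. (2016a): the function name, the use of a negative gap to represent an overlap, and the choice to let the 2-second lapse label take precedence over the pause label are all assumptions made here for the example.

```python
from typing import Optional

PAUSE_MIN_S = 0.5      # minimum same-speaker gap labeled as a pause
LAPSE_MIN_S = 2.0      # silences this long are labeled as lapses
MAX_OVERLAP_S = 0.020  # overlaps of up to 20 ms still count as a switch


def label_silence(gap_s: float, same_speaker: bool) -> Optional[str]:
    """Label the gap between two consecutive turns (hypothetical helper).

    gap_s is the time from the end of one turn to the start of the next;
    a negative value means the two turns overlap (assumed convention).
    """
    if same_speaker:
        if gap_s >= LAPSE_MIN_S:
            return "L_W"  # lapse within speaker
        if gap_s >= PAUSE_MIN_S:
            return "P"    # pause
        return None       # shorter same-speaker gaps get no label (assumption)
    if gap_s >= LAPSE_MIN_S:
        return "L_B"      # lapse between speakers
    if gap_s >= -MAX_OVERLAP_S:
        return "SS"       # switch, also known as a gap
    return None           # longer overlaps are overlapping turns, not silences
```

The agent and customer variants of the pause label (P_A, P_C) would follow from the speaker identity, which this sketch leaves out.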

Silence Filtering
From the 10 conversations, we have extracted 433 instances of silence with a duration greater than or equal to 1 second. The instances are categorized into two groups:
• Between-Speaker Silences (B): These instances of silence include gaps between different speakers' turns that are greater than or equal to 1 second. B = {S_l, L_B}, where S_l stands for gaps ≥ 1 second and < 2 seconds, whereas L_B are lapses between speakers ≥ 2 seconds.
• Within-Speaker Silences (W): These instances of silence include pauses between the same speaker's turns that are greater than or equal to 1 second. W = {P_l, L_W}, where P_l stands for pauses ≥ 1 second and < 2 seconds, whereas L_W are lapses within a speaker ≥ 2 seconds.
For the initial analysis, the instances of long silences that occur after or before overlapping speech (61 silence instances) are ignored. As a result, the analysis is performed on 372 instances.
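As a hedged sketch of this filtering step, the function below assigns a silence to group B or W and to its sub-label, and drops instances shorter than 1 second or adjacent to overlapping speech. The function name and the `next_to_overlap` flag are hypothetical; how adjacency to overlap is detected is not specified in the text and is left to the caller.

```python
from typing import Optional, Tuple


def group_long_silence(
    duration_s: float, same_speaker: bool, next_to_overlap: bool = False
) -> Optional[Tuple[str, str]]:
    """Return (group, sub-label) for a long silence, or None if filtered out."""
    if duration_s < 1.0 or next_to_overlap:
        return None  # too short, or adjacent to overlapping speech
    is_lapse = duration_s >= 2.0
    if same_speaker:
        return ("W", "L_W" if is_lapse else "P_l")
    return ("B", "L_B" if is_lapse else "S_l")
```

With the counts reported above, the 433 extracted instances minus the 61 adjacent to overlapping speech leave the 372 instances used in the analysis.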

Feature Design and Extraction
Even though silence is an inherently valueless phenomenon that possesses no function on its own, individual instances of silence gain their meaning and function from the surrounding context. Consequently, modeling the functions of silences requires a conceptualization of the context and features that capture it. Dialog acts carry specific communicative functions such as question, answer, expression of agreement, disagreement, etc. Since dialog acts are assigned to the speech segments (turns) that surround the long silences, they provide information that can be used to model the context of silence instances.

Table 1 (excerpt). Social Obligations Management: Initial-Greeting, Return-Greeting; Initial-Self-Intro, Return-Self-Intro; Apology, Accept-Apology; Thanking, Accept-Thanking; Initial-Goodbye, Return-Goodbye.
The dialog act dimensions and communicative functions listed in Table 1 are used as features for the analysis of between and within speaker silence instances. Each turn preceding or following a silence is transformed into a feature vector using one-hot representation for dialog acts.
The vectors encode information such as the following. Feedback, a joint dimension of auto-feedback and allo-feedback, is encoded as fb ∈ {0, 1}, where fb = 0 represents the absence of feedback dialog acts in the turn and fb = 1 their presence. Similarly, the vector includes other dialog act dimensions such as Time Management (tm) and Social Obligations Management (s). The General dimension is split into two: (a) information-seeking functions (q) and (b) information-providing and action-discussion functions (ac). The motivation behind this split is to set apart information-seeking dialog acts, which impose an expected pattern on the recipient, i.e., an obligation to provide the requested information.
Since, according to the DiAML annotation standard, a turn can contain several dialog acts, the vector representation specifically encodes the last dialog act of the preceding turn (lact) and the first dialog act of the turn following the long silence (fact). Both lact and fact ∈ χ, where χ = {Ac, Q, F, TimeM, Ap, Thank, Int, Other, None}. In the set χ, Ac represents communicative functions from the information-providing and action-discussion functions; Q represents information-seeking functions; F represents Feedback (auto-feedback and allo-feedback) functions; TimeM represents Time Management functions; Ap represents apology and accept-apology functions; Thank represents thanking and accept-thanking; Int represents initial and return greetings, self-introductions, and goodbyes; Other represents all the dialog acts not used for the analysis. None, on the other hand, indicates the absence of dialog acts.
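The encoding described above can be illustrated with a short sketch. It assumes each turn has already been reduced to a list of labels from χ in spoken order; the function names, the `prev_`/`next_` key prefixes, and the mapping from χ labels to the binary dimension flags are assumptions for illustration and not the authors' implementation.

```python
from typing import Dict, List, Union

# Assumed mapping from chi labels to the binary dimension flags in the text.
LABEL_TO_DIM = {
    "F": "fb", "TimeM": "tm",
    "Ap": "s", "Thank": "s", "Int": "s",
    "Q": "q", "Ac": "ac",
}
DIMS = ("fb", "tm", "s", "q", "ac")


def turn_flags(acts: List[str], prefix: str) -> Dict[str, int]:
    """One 0/1 flag per dimension: does the turn contain such a dialog act?"""
    present = {LABEL_TO_DIM[a] for a in acts if a in LABEL_TO_DIM}
    return {f"{prefix}_{dim}": int(dim in present) for dim in DIMS}


def silence_features(
    prev_acts: List[str], next_acts: List[str]
) -> Dict[str, Union[int, str]]:
    """Feature dictionary for one silence, built from its surrounding turns."""
    features: Dict[str, Union[int, str]] = {}
    features.update(turn_flags(prev_acts, "prev"))
    features.update(turn_flags(next_acts, "next"))
    # Last act before the silence and first act after it; "None" if absent.
    features["lact"] = prev_acts[-1] if prev_acts else "None"
    features["fact"] = next_acts[0] if next_acts else "None"
    return features
```

For example, `silence_features(["F", "Q"], ["Ac"])` describes a silence that follows a question and precedes an information-providing act, with `lact` set to `"Q"` and `fact` set to `"Ac"`.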

Unsupervised Annotation of Silence Function
The described representation of silence instances is used for clustering with the Cobweb clustering algorithm (Fisher, 1987), a well-known concept formation system designed to model human concept learning. Cobweb constructs clusters using a "concept hierarchy" that optimally and incrementally accounts for the observed regularities in a set of instances. In other words, given a set of silence instances, Cobweb discovers a classification scheme that covers the patterns with respect to the provided feature vectors. Instead of forming concepts at a single level of abstraction, Cobweb groups instances into a classification tree where leaves represent similar instances and internal nodes represent broad concepts. The generality of a concept increases as the nodes approach the root of the tree. Each cluster is characterized by a probabilistic description. The classification tree is constructed incrementally by inserting the instances into the tree one by one. When adding an instance, the algorithm traverses the tree top-down, starting from the root. At each node, there are four possible operations: (a) insert, (b) create, (c) merge, and (d) split. The operation is selected according to the highest category utility (CU) metric (Gluck and Corter, 1985). The metric is derived from categorization studies in cognitive psychology and is shown in Equation 1.
Category utility, CU, attempts to maximize both (a) the probability that instances in the same category have feature values in common; and (b) the probability that instances in different categories have different feature values.
In the equation, Pr[f_i = v_ij] represents the marginal probability that feature f_i has value v_ij, whereas Pr[f_i = v_ij | C_l] represents the conditional probability that feature f_i has value v_ij, given that the instance belongs to cluster C_l. CU(C_l) estimates the quality of an individual cluster.
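For reference, the standard form of category utility for nominal features can be written with the symbols defined here as follows. This is a hedged reconstruction of the usual definition, not a copy of the authors' equations: the weighting by the cluster prior Pr[C_l] and the equation layout are assumptions. The second line is the average over k clusters.

```latex
\begin{align}
CU(C_l) &= \Pr[C_l] \sum_{i} \sum_{j}
  \left( \Pr[f_i = v_{ij} \mid C_l]^2 - \Pr[f_i = v_{ij}]^2 \right) \tag{1} \\
CU(C_1, C_2, \ldots, C_k) &= \frac{1}{k} \sum_{l=1}^{k} CU(C_l) \tag{2}
\end{align}
```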
To measure the quality of the overall clustering of the silences, we calculate the average category utility function CU(C_1, C_2, ..., C_k), as shown in Equation 2.
In the equation, k is the total number of categories. Overfitting is controlled by the factor 1/k. Therefore, for each set (B, W), we applied the Cobweb clustering algorithm implemented in (Hall et al., 2009) with acuity A = 1.0 and a cutoff threshold of C = 0.0028.
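To make the metric concrete, the following sketch computes per-cluster and average category utility for nominal feature vectors. It assumes the standard prior-weighted form of the metric, non-empty clusters, and instances given as feature-to-value dictionaries; the function name is hypothetical, and this is not the implementation used in the study.

```python
from collections import Counter
from typing import Dict, List, Tuple

Instance = Dict[str, str]


def category_utility(clusters: List[List[Instance]]) -> Tuple[List[float], float]:
    """Return (per-cluster CU values, their average over the k clusters)."""
    instances = [x for cluster in clusters for x in cluster]
    n = len(instances)
    # Sum over features and values of the squared marginal probabilities.
    marginal = Counter((f, v) for x in instances for f, v in x.items())
    baseline = sum((count / n) ** 2 for count in marginal.values())
    per_cluster = []
    for cluster in clusters:
        # Same sum, but with probabilities conditioned on this cluster.
        cond = Counter((f, v) for x in cluster for f, v in x.items())
        within = sum((count / len(cluster)) ** 2 for count in cond.values())
        prior = len(cluster) / n
        per_cluster.append(prior * (within - baseline))
    return per_cluster, sum(per_cluster) / len(clusters)
```

A partition that separates instances by feature value scores higher than one that mixes them, which is the property Cobweb relies on when it chooses among the insert, create, merge, and split operations.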

Resulting Clusters
For between-speaker silences (B), we have obtained 24 leaf clusters, whereas for within-speaker silences (W), we have obtained 26 leaf clusters. The distribution of dialog act sequences in each cluster is given in Tables 2 and 3.

Categorization of Silence Functions
Assuming that each cluster represents a specific function of a silence, the clusters are manually grouped with respect to their parents in the classification tree. The manual grouping of silence clusters is performed considering the conversation scenarios surrounding the silence events.
Table 2: Preceding (column two) and succeeding (column three) turn communicative function sequences for each cluster, with their frequency in parentheses, for between-speaker silences. The first column of the table gives the classification tree's leaf id of the corresponding cluster.
• A mode of information flow (IF): These silences can be either (1) conversational silences, where both speakers are exchanging information or feedback, or (2) forced (deliberate) silences, where the current speaker uses silence as a tool to force the interlocutor to respond. The member clusters of this group are IF = {11, 13, 15, 17, 18, 19, 22, 25, 29, 37, 38}.
• Silences in other categories (B-Oth): These are silences motivated by factors not considered in this study, such as discourse structuring. This group includes clusters B-Oth = {43, 45, 46, 47}.
The above-mentioned categories are presented in Examples 1 and 2. In Example 1, we observe that the caller asks the call center operator for the reason behind an action, and the act is followed by a long silence of 1.41 seconds. After the interval, the operator passes on some information regarding the caller's earlier query. From the operational point of view, the interval might have been used either to acquire information or to structure it. Similarly, in the RP_2 scenario in Example 1, after the operator informs the caller that the 'electric power' will not be activated, the caller takes a long silence of 1.38 seconds before responding to the given information with another question. This silence could again have been used for preparing the answer, or it might be the time taken by the responding speaker to compose the next action. In Example 2, we present a scenario where the silence category IF is used deliberately to force the other speaker to reply.
The silences in both examples may also serve other cognitive functions, such as controlling emotional attitudes. However, as the focus of this study is to understand the function of long silences in the information flow, these cognitive functions are not considered.
Example 1. Example of silence category RP: RP_1
caller: al distacco perfetto ora eh eh su che base mi perdoni
caller: the complete interruption ... perfect! now ehm ehm due to what reason, excuse me?
(1.14) Category - RP_1
operator: ah ascolti qui ci sono una serie di fatture malgrado
operator: Listen (please) we have here a number of unpaid bills in spite of
Example of silence category RP: RP_2
operator: la luce non gliela riprist non viene ripristinata
operator: the electric power will not be reactiv will not be reactivated (
The Within-Speaker Silence cluster groups are:
• Organizational silence (CS): The long pause used for the purpose of organizing the information flow in the conversation. This group contains clusters of silences where a speaker is providing information and the silence between turns can be time taken to think, to find information, or to compose and plan the next turn. CS = {2, 3, 19, 20, 21, 22, 23, 24, 25, 29, 33, 34, 39}.
• Indecision or Hesitation silence (H): In this group of silences, the speaker is either confused about some information, needs clarification, or has some queries. The member clusters of this group are H={7, 8, 9, 10, 11, 12, 36
In Example 3, we present dialog scenarios with assigned categories. It is observed that the top three long silence intervals are used either to plan the next turn or to take time to think, whereas in the last (shortest) silence of 1.01 seconds, before threatening the operator, the caller either hesitates, feels bad, or is not sure whether a threat is going to work.
The duration distribution statistics for each category of silence functions are presented in Tables 4 and 5. For the between-speaker silence categories in Table 4, the median durations of silence categories RP_2 and B-Oth are longer than those of RP_1 and IF. For the within-speaker silence categories, the median duration of the H category is longer than that of CS. One explanation is that a speaker might need more time to take the next turn when facing indecision or hesitation, or when needing clarification about something.

Conclusion
The main focus of this analytical study is to explore the functions of long silences within and between speakers with respect to the information flow in a conversation. To find such functions, the study utilizes the sequences of dialog act tags present in the left and right context (the speaker turns) surrounding the silence itself, and designs a feature vector to represent a long silence. These feature vectors are then used to cluster silences with a well-known hierarchical concept formation system (Cobweb), which is designed to model different aspects of human concept learning. Following the clustering, we have manually grouped the clusters into functional categories and have studied their significance and duration distribution.
The functions of silence we observe vary from response preparation to hesitation to asking some queries. It is also observed that sometimes these long silences are used deliberately to get a forced response from another speaker. They can also indicate the indecisiveness of the current speaker.
Even though most of the research from speech communities ignores silences, our observations show that by considering the function of long silences, we can better understand the information flow in the conversation, as silences do contribute to explaining the information presented by the speech signals. Silence also has the potential to explain long-term behavioral traits and short-term states.
This study is our first attempt to analyze, understand, and group the functions of long silences in dyadic conversations. The observed functions, such as hesitations, are also related to another speech phenomenon: disfluencies. In future work, we plan to address the relationship between speech disfluencies and long silences. This analysis will help us to understand the factors and contexts that represent cues of the silence function, which is necessary for designing computational models of such a simple yet informative event in human conversation.