Improving User Impression in Spoken Dialog System with Gradual Speech Form Control

This paper examines a method to improve the user impression of a spoken dialog system by introducing a mechanism that gradually changes the form of utterances every time the user uses the system. In some languages, including Japanese, the form of utterances changes according to the social relationship between the speaker and the listener. Thus, this mechanism can be effective for expressing the system’s intention to reduce the social distance to the user; however, the actual effect of this method when introduced into a dialog system has not been investigated sufficiently. In this paper, we conduct dialog experiments and show that controlling the form of system utterances can improve the users’ impression.


Introduction
Demand for spoken dialog systems, including AI speakers and personal assistant systems, has risen (Bellegarda, 2014). Not only conventional task-oriented dialog systems (Aust et al., 1995; Zue et al., 2000) but also non-task-oriented systems (Bickmore and Picard, 2005; Meguro et al., 2010; Yu et al., 2016; Akasaki and Kaji, 2017) have attracted attention in recent years. For such dialog systems to become ubiquitous in society, it is important to improve the user's impression of the dialog with the system. Miyashita et al. (2008) conducted a study that aimed to increase the user's intention to talk with the system by gradually increasing the behaviors of a robot that express intimacy. Their study showed that, owing to the robot's behavior, users felt the robot was friendlier and had a stronger desire to keep using it. This research showed that expressing intimacy with the user is effective in promoting the user's desire to use the system.
In this research, we focus on the linguistic form of system utterances to improve the user impression. Several languages, including Japanese, have a mechanism called "honorifics" by which the speech form changes according to the relative social position of, or the social distance to, the dialog partner (Brown and Ford, 1961). Honorifics are often treated as one of the categories of politeness (Brown and Levinson, 1978, 1987), although several objections have been raised (Ide, 1989; Agha, 1994). Brown and Levinson (1987) claimed that the speaker can choose a strategy according to the politeness level, depending on the social distance or relative power between the speakers. In Japanese, speakers try to reduce the social distance by gradually decreasing the use of the honorific form.
This paper examines the effectiveness of introducing such a mechanism into a dialog system. Kim et al. (2012) conducted human-robot interaction experiments in Korean and indicated that the robot is perceived as friendlier when it addresses the user in the familiar form, but that the effect of the speech form itself was limited. In contrast, we investigate the effect of changing the speech form on the user impression, including friendliness.

Changing Form of System Utterances Considering Social Distance

Expressions of Japanese for social distance, politeness and familiarity
This study exploits the expressions of Japanese that convey politeness and the social distance between the speaker and the listener. Thus, we first briefly explain this mechanism of Japanese. The Japanese language has a system of speech forms called "honorifics (keigo)", which uses the linguistic form to indicate the social relationship between the speaker and the listener, or between the speaker and the persons referred to in the utterance. For example, the verb tsukuru (to make) can be used as either tsukuru (normal form) or tsukuri-masu (polite form). Another way of expressing closeness is to use ending particles, as in tsukuri-masu (polite, far) versus tsukuri-masu-yo (polite, closer). In addition to the honorifics, it is possible to express closeness using different wording, such as hai (a positive answer or a backchannel, polite) and un (casual). When the interlocutors are familiar with each other, the form of utterances becomes less polite, closer and more casual. In this experiment, we defined the "honorific form" as polite, less close and formal expressions, and the "normal form" as less polite, closer and casual expressions.

Gradual control of system speech form based on speech level shift
Changes of the speech form are caused by several factors, such as social entrainment (Hirschberg, 2008). One of the main factors is a change in the social distance. It has been reported that when two persons converse on several occasions, the proportion of the honorific form decreases and that of the normal form increases as they have more conversations (Ikuta, 1983). This phenomenon is called "speech level shift" or "speech style shift" (Ikuta, 1983; Hasegawa, 2004). The "speech level" or "speech style" refers to the expressions in the utterances that convey the closeness of the interlocutors. Thus, the "speech level shift" means the switching of the speech level that occurs across conversations between the same persons.
To make the dialog system express that the system and the user are gradually becoming friendlier, we propose a method that uses the speech level shift. In the experiment, the subjects talked with the system on three consecutive days and evaluated their impression of the system and of the dialog with it. We changed the speech level step by step over the three-day experiment, as shown in Table 1. In Japanese, it is natural to use the honorific form when persons meet for the first time; thus, all of the system utterances were in the honorific form in the first conversation.

The experimental system is based on an example-based dialog system (Takeuchi et al., 2007; Lee et al., 2009), which is commonly used for non-task-oriented systems. A computer-based female agent was employed. In the example-based dialog system, the system calculates the similarities between the user's utterance and the example sentences in the database, and then selects the response corresponding to the most similar example. This study employed the cosine similarity for the similarity calculation.
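The response selection described above can be illustrated with a minimal sketch. The text specifies only that cosine similarity is used; the bag-of-words representation, the whitespace tokenization, the function names and the tiny database below are assumptions made for illustration (Japanese input would need a morphological analyzer in place of `split()`). The example pairs are loosely modelled on the dialog example at the end of the paper.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Word counts from lowercased, whitespace-split text (assumed representation)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty utterance is treated as similar to nothing
    return dot / (norm_a * norm_b)


def select_response(user_utterance, database):
    """Return the response paired with the example most similar to the utterance."""
    query = bag_of_words(user_utterance)
    _, best_response = max(
        database,
        key=lambda pair: cosine_similarity(query, bag_of_words(pair[0])),
    )
    return best_response


# Hypothetical example-response pairs, for illustration only.
EXAMPLE_DB = [
    ("where did you run", "The jogging path by the Hirose river."),
    ("have you ever participated in a marathon",
     "Yes, I have participated in a half-marathon several times."),
]

print(select_response("Where did you run yesterday", EXAMPLE_DB))
```

Because the response is chosen only by similarity to stored examples, an utterance far from every example still receives the nearest response, which is one way a conversation can break down.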

Topic-dependent example-response database for non-task-oriented dialog
The example-response databases for the experiments were constructed from actual dialogs between the system and users (Kageyama et al., 2017). We focused on chatting between friends, which is one type of non-task-oriented dialog, and prepared four databases corresponding to different dialog topics. To collect the dialog data, the users asked the agent what she had done yesterday, on the assumption that she led a humanlike life. The topics of the database were cooking, movies, and meal. A dialog example is given in Appendix A.
The number of pairs in each constructed database ranged from 1,000 to 1,125. The responses of the system were composed in the honorific form.

Preparation of the system utterances in normal form
The databases of the normal form were constructed by rewriting the response sentences of the collected databases. Twenty-six persons rewrote the sentences into the normal form. For consistency, the rewriters were given the rewriting rules shown in Appendix B.

Experimental condition
The experiments were conducted in a sound-proof chamber on 3 consecutive days. The participants interacted with the system once a day, and each participant made 10 utterances per session so that the number of interchanges was controlled. The topic of the conversation differed from day to day, and the order of the topics was determined randomly for each participant. The rate of system utterances in the honorific and normal forms was changed according to Table 1. After each conversation, the participants evaluated their impression of the spoken dialog system using a questionnaire. For comparison, we prepared dialog systems that spoke only in the honorific form or only in the normal form on all three days. These two systems are denoted as "Honorific" and "Normal" hereafter. Each of the three systems was used by 14 participants, and thus the total number of participants was 42 (3 systems × 14 participants). Each group contained 7 male and 7 female participants.

We first presented to the participants all the topics the dialog system could handle, and instructed them to ask what the agent had done yesterday with respect to the specified topic. We also presented a dialog example to the participants. Then the participants conversed with the system on the presented topic. The participants were allowed to make self-disclosure utterances. We expected the system and the participant to converse within the given topic, but the conversation broke down when the participant made an unanticipated utterance. The participants were instructed to keep talking with the system until they had made the specified number of utterances, even when the conversation broke down.
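The per-participant session schedule described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the Day 1 and Day 3 rates of the proposed system follow from the text (all honorific in the first conversation, the same form as "Normal" on Day 3), but the Day 2 value of 0.5 is a placeholder because the actual rates are those of Table 1. The names `NORMAL_FORM_RATE` and `plan_sessions`, and the random ordering of forms within a session, are likewise hypothetical.

```python
import random

TOPICS = ["cooking", "movies", "meal"]  # topics named in the text

# Rate of normal-form utterances on Days 1-3 for each system.
# ASSUMPTION: the Day 2 value for "Proposed" (0.5) is a placeholder;
# the real schedule is given in Table 1.
NORMAL_FORM_RATE = {
    "Proposed": [0.0, 0.5, 1.0],
    "Honorific": [0.0, 0.0, 0.0],
    "Normal": [1.0, 1.0, 1.0],
}


def plan_sessions(system, rng, utterances_per_day=10):
    """Build a three-day plan: a random topic order and a speech form per utterance."""
    topics = TOPICS[:]
    rng.shuffle(topics)  # topic order is randomized per participant
    plan = []
    for day, (topic, rate) in enumerate(zip(topics, NORMAL_FORM_RATE[system]), start=1):
        n_normal = round(rate * utterances_per_day)
        forms = ["normal"] * n_normal + ["honorific"] * (utterances_per_day - n_normal)
        rng.shuffle(forms)  # assumed: forms are mixed at random within a session
        plan.append({"day": day, "topic": topic, "forms": forms})
    return plan


if __name__ == "__main__":
    for session in plan_sessions("Proposed", random.Random(0)):
        print(session["day"], session["topic"], session["forms"].count("normal"))
```

Passing a seeded `random.Random` keeps each participant's plan reproducible, which is convenient when the same schedule must be replayed across the three days.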

Procedure of dialog experiments
The experimental procedure is as follows:
Step 1: The topic is announced to the participant.
Step 2: The participant asks the system what the agent did yesterday.
Step 3: The participant makes 10 interchanges with the system.
Step 4: The participant answers a questionnaire on the impression of the dialog.
Step 5: Steps 1 to 3 are repeated for 3 consecutive days, changing the topic every day.

Evaluation method
At the end of every conversation, the participants answered the following four questions on a five-point Likert scale from one (not at all) to five (very much).
Satisfaction: How satisfied the participant was with the dialog
Friendliness: How friendly the participant felt the dialog system was
Impression of speech form: How appropriate the participant felt the system's speech form was
Intention of talk: How strongly the participant wants to use the system again

In addition, after the last experiment, we asked the participants who talked with the proposed system whether or not they had noticed the changes of the speech form.

Table 2 shows the rates of correct answers made by the system in the experiments. The correctness was judged by the participant based on the naturalness of the response to the question. As shown in the table, the rate of correct answers of each system over the three-day experiments is about 70%, which is almost equal to the previous results (Kageyama et al., 2017). A one-way ANOVA with the speech-form condition as the factor showed no significant difference. Therefore, the effect of response errors on the subjective evaluation is considered to be almost equal across the systems.

Figure 1 shows the average scores of the subjective evaluation per day. The graph shows that the subjective scores of the proposed system tend to increase day by day, whereas those of the "Honorific" and the "Normal" systems tend to be flat. The scores of "Proposed" and "Honorific" are almost the same on the first day because all of the utterances were in the honorific form. Interestingly, we can observe a difference between the scores of "Proposed" and "Normal" on Day 3 even though both systems spoke in the same form. This result reflects the effect of changing the form of the utterances with the number of interactions.
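The one-way ANOVA on the correct-answer rates mentioned above can be sketched in a few lines. This is a generic textbook computation of the F statistic, not the authors' analysis script; the function name is hypothetical and the numbers in the usage lines are toy values chosen to make the arithmetic easy to follow, not data from the experiments. If each participant contributes one value, three systems with 14 participants each would give 2 and 39 degrees of freedom.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way layout.

    `groups` is a list of lists, one list of observations per condition.
    Assumes at least two groups and non-zero within-group variance.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: spread of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares: spread of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy values only (NOT experimental data): three conditions, three observations each.
f_stat, df_b, df_w = one_way_anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
print(f_stat, df_b, df_w)
```

The resulting F value would then be compared against the F distribution with the reported degrees of freedom to decide significance.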

Experimental results of subjective evaluation
Here, we conducted a two-way ANOVA with the speech-form condition and the number of interactions as factors, and obtained a significant difference for the speech-form factor in Satisfaction (p ≤ 0.01, F = 3.07), Impression of speech form (p = 0.01, F = 3.07), and Friendliness (p ≤ 0.01, F = 3.07). Then, we conducted Tukey-Kramer tests to investigate the differences between the conditions. The results are summarized in Table 3.
As shown in the table, "Proposed" surpassed "Honorific" in terms of Impression of speech form, and surpassed "Normal" in terms of Satisfaction and Friendliness. These results suggest that the proposed system tends to obtain better subjective scores than the simple systems that do not change the form of utterances.

Perception of changes of speech form
In the experiments, 5 out of the 14 participants who used the proposed system did not perceive the changes of the speech form. Here, we compared the scores between the group of participants who perceived the changes of the form (denoted as "Perceived") and the group who did not (denoted as "Not-perceived"). Figure 2 shows the variation of the average scores of each group. From the figure, we can observe that all of the subjective scores of "Not-perceived" tend to increase in the same way as the scores of "Perceived." This result suggests that the proposed method may be able to improve the user impression even without the user being aware of the change.

Figure 2: Score depending on perception

Conclusion
In this paper, we examined a method to improve the user impression by changing the form of system utterances according to the number of uses.
The dialog experiments showed that the proposed method can improve subjective scores such as satisfaction compared to the simple systems that do not change the speech form, even when the user did not perceive the changes of the expression.
In future work, we will examine a method to change the form of the sentences considering the relationship between the speakers (Li et al., 2016).

User: Where did you run?
System: The jogging path by the Hirose river.
User: Did you run alone?
System: Yes, I ran on the jogging path by the Hirose river.
User: Have you ever participated in a marathon?
System: Yes, I have participated in a half-marathon several times.