Recognition of Distress Calls in Distant Speech Setting: a Preliminary Experiment in a Smart Home

This paper presents a system that recognizes distress speech in the homes of seniors in order to provide reassurance and assistance. The system is intended to be integrated into a larger Ambient Assisted Living (AAL) system and uses a single microphone at a fixed position in a non-intimate room. The paper details the automatic speech recognition system, which must cope with distant speech conditions and expressive speech. Privacy is preserved by running the decoding on-site rather than on a remote server. Furthermore, the system was biased to recognize only a set of sentences defined after a user study. The system was evaluated in a smart space reproducing a typical living room, where 17 participants played scenarios, including falls, during which they uttered distress calls. The results showed a promising error rate of 29% while emphasizing the challenges of the task.


Introduction
Life expectancy has increased in all countries of the European Union over the last decade. As a result, the proportion of people aged 75 or over has strongly increased, and solutions are needed to satisfy the wish of elderly people to live as long as possible in their own homes. Ageing can cause functional limitations that, if not compensated by technical assistance or environmental management, lead to activity restriction [1] [2]. Smart homes, that is, housing equipped with sensors and actuators, are a promising way to help elderly people live independently in their own homes [3] [4] [1] [5]. Another aspect is the increasing risk of distress situations: falling is one of the main fears and lethal risks, but a blocked hip or fainting are also concerns. The most common solution is the use of kinematic sensors worn by the person [6], but this imposes constraints on everyday life, and worn sensors are not always a good solution because some people forget or refuse to wear them. Nowadays, one of the best suited interfaces is the voice-user interface (VUI), whose technology has reached maturity; it avoids worn sensors by relying on microphones set up in the home, and it allows hands-free and distant interaction [7]. VUIs have been shown to be useful for systems integrating speech commands [8].
The use of speech technologies in the home requires addressing particular challenges due to this specific environment [9]. A rising number of smart home projects consider speech processing in their design. They relate to wheelchair command [10], vocal command for people with dysarthria [11] [8], companion robots [12], and vocal control of appliances and devices [13]. Due to experimental constraints, few systems have been validated with real users in realistic conditions, as was done in the SWEET-HOME project [14], during which a dedicated voice-based home automation system was able to drive a smart home through vocal commands with typical users [15] and with elderly and visually impaired people [16].
In this paper, we present an approach to provide assistance in a smart home for seniors in distress situations in which they cannot move but can talk. The challenge stems from expressive speech, which differs from standard speech: is it possible to use state-of-the-art ASR techniques to recognize expressive speech? In our approach, we address the problem by using the microphone of a home automation and social system placed in the living room, with ASR decoding and voice call matching. In this way, the user is able to command the environment without having to wear a specific device for fall detection or to reach a device for physical interaction (e.g., a remote control that is too far from the user when needed). Although microphones in a home are a real intrusion on privacy, in contrast to current smartphones we address the problem by using an in-home ASR engine rather than a cloud-based one (private conversations do not leave the home). Moreover, the limited vocabulary ensures that only speech relevant to the command of the home is correctly decoded. Finally, another strength of the approach is that it has been evaluated in realistic conditions. The paper is organised as follows. Section 2 presents the method for speech acquisition and recognition in the home. Section 3 presents the experimentation and the results, which are discussed in Section 5.

Method
The distress call recognition is to be performed in the context of a smart home equipped with e-lio, a dedicated system for connecting elderly people with their relatives, as shown in Figure 1. e-lio is equipped with one microphone for video conferencing. The typical setting and the distress situations were determined after a sociological study conducted by the GRePS laboratory [17], in which a representative set of seniors was included.
From this sociological study, it appears that this equipment is set on a table in the living room in front of the sofa. In this way, an alert could be given if the person falls because of the carpet or cannot stand up from the sofa. This paper presents only the audio part of the study; for more details about the global audio and video system, the reader is referred to [18].

Speech analysis system
The audio processing was performed by the software CIRDOX [19] whose architecture is shown in Figure 2. The microphone stream is continuously acquired and sound events are detected on the fly by using a wavelet decomposition and an adaptive thresholding strategy [20]. Sound events are then classified as noise or speech and, in the latter case, sent to an ASR system. The result of the ASR is then sent to the last stage which is in charge of recognizing distress calls.
In this paper, we focus on the ASR system and present different strategies to improve the recognition rate of the calls. The remainder of this section presents the methods employed at the acoustic and decoding levels.

Acoustic modeling
The Kaldi speech recognition toolkit [21] was chosen as the ASR system. Kaldi is an open-source, state-of-the-art ASR system with a large number of tools and strong support from the community. In the experiments, the acoustic models were context-dependent classical three-state left-right HMMs. Acoustic features were based on Mel-frequency cepstral coefficients (MFCC): 13 MFCC coefficients were first extracted and then expanded with delta and double-delta features and energy (40 features). Acoustic models were composed of 11,000 context-dependent states and 150,000 Gaussians. State tying was performed using a decision tree based on a tree-clustering of the phones. In addition, off-line fMLLR linear transformation acoustic adaptation was performed.
The acoustic models were trained on 500 hours of transcribed French speech from the ESTER 1&2 (broadcast news and conversational speech recorded on the radio) and REPERE (TV news and talk shows) challenges, on 7 hours of transcribed French speech from the SH (SWEET-HOME) corpus [22], which consists of recordings of 60 speakers interacting in the smart home, and on 28 minutes of the Voix-détresse corpus [23], which is made of recordings of speakers eliciting a distress emotion.
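The feature dimension quoted above can be checked with a small sketch: 13 static coefficients, 13 deltas and 13 double deltas give 39 values, and one energy term brings the total to 40. The code below is only an illustration in plain Python, not the Kaldi implementation; the regression window of two frames and the way the energy term is appended are assumptions, since the text does not specify them.

```python
def deltas(frames, window=2):
    """Regression-based delta features over a list of equal-length frames.

    Frames requested beyond the edges are replaced by the nearest valid frame.
    The window size is an assumption, not a value taken from the paper.
    """
    n_frames = len(frames)
    denom = 2 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(n_frames):
        row = []
        for d in range(len(frames[t])):
            num = 0.0
            for k in range(1, window + 1):
                nxt = frames[min(t + k, n_frames - 1)][d]
                prv = frames[max(t - k, 0)][d]
                num += k * (nxt - prv)
            row.append(num / denom)
        out.append(row)
    return out


def expand(mfcc, energy):
    """Stack static, delta and double-delta coefficients plus one energy term."""
    d1 = deltas(mfcc)
    d2 = deltas(d1)
    return [m + a + b + [e] for m, a, b, e in zip(mfcc, d1, d2, energy)]


# Toy input: 5 frames of 13 coefficients each (arbitrary values).
mfcc = [[float(t + c) for c in range(13)] for t in range(5)]
energy = [1.0, 2.0, 3.0, 4.0, 5.0]
features = expand(mfcc, energy)
print(len(features), len(features[0]))
```

With this toy input, every frame ends up with 13 + 13 + 13 + 1 = 40 values, matching the dimension reported in the text.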

Subspace GMM Acoustic Modelling
Both the GMM and the Subspace GMM (SGMM) model the emission probability of each HMM state with a Gaussian mixture model, but in the SGMM approach, the Gaussian means and the mixture component weights are generated from the phonetic and speaker subspaces along with a set of weight projections.
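The SGMM model [24] is described by the following equations:

$$p(x \mid j) = \sum_{m=1}^{M_j} c_{jm} \sum_{i=1}^{I} w_{jmi}\, \mathcal{N}(x;\ \mu_{jmi}, \Sigma_i)$$

$$\mu_{jmi} = M_i v_{jm}$$

$$w_{jmi} = \frac{\exp(w_i^T v_{jm})}{\sum_{i'=1}^{I} \exp(w_{i'}^T v_{jm})}$$

where $x$ denotes the feature vector, $j \in \{1..J\}$ is the HMM state, $i$ is the Gaussian index, $m$ is the substate (state $j$ having $M_j$ substates) and $c_{jm}$ is the substate weight. Each state $j$ is associated with vectors $v_{jm} \in \mathbb{R}^S$ ($S$ is the phonetic subspace dimension) from which the means $\mu_{jmi}$ and the mixture weights $w_{jmi}$ are derived, and it has a shared number of Gaussians, $I$. The phonetic subspace $M_i$, the weight projections $w_i^T$ and the covariance matrices $\Sigma_i$, i.e., the globally shared parameters $\Phi_i = \{M_i, w_i^T, \Sigma_i\}$, are common across all states. These parameters can be shared and estimated over multiple recording conditions.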
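To make the role of the shared parameters concrete, the following sketch evaluates the likelihood of one feature vector for a single state, deriving each mean as M_i v_jm and each weight as a softmax over w_i^T v_jm, as in the standard SGMM formulation. It is a minimal illustration: the covariances are taken to be diagonal for brevity (an assumption, the model itself uses full matrices Σ_i), the speaker subspace is ignored, and all numerical values are arbitrary.

```python
import math


def softmax(scores):
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Product of a matrix (list of rows) with a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def diag_gauss_pdf(x, mean, var):
    """Density of a Gaussian with diagonal covariance (a simplification)."""
    log_p = 0.0
    for xd, md, vd in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vd) + (xd - md) ** 2 / vd)
    return math.exp(log_p)


def sgmm_state_likelihood(x, substates, M, w, variances):
    """p(x | j) for one state j.

    substates: list of (c_jm, v_jm) pairs for the state
    M:         I projection matrices M_i, each of size D x S
    w:         I weight-projection vectors w_i, each of length S
    variances: I diagonal covariances, each of length D
    """
    total = 0.0
    for c_jm, v_jm in substates:
        scores = [sum(a * b for a, b in zip(w_i, v_jm)) for w_i in w]
        weights = softmax(scores)              # w_jmi
        for i, w_jmi in enumerate(weights):
            mean = matvec(M[i], v_jm)          # mu_jmi = M_i v_jm
            total += c_jm * w_jmi * diag_gauss_pdf(x, mean, variances[i])
    return total


# Toy model: D = 2, S = 2, I = 2, one substate.
M = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]]
w = [[0.0, 0.0], [0.0, 0.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
substates = [(1.0, [1.0, 0.0])]
p = sgmm_state_likelihood([1.0, 0.0], substates, M, w, variances)
print(p)
```

Only the vectors v_jm and the substate weights are state-specific here; M, w and the covariances are the shared parameters, which is what allows them to be estimated on pooled data from several recording conditions.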
A generic mixture of I Gaussians, called the Universal Background Model (UBM), models all the speech training data and is used to initialize the SGMM.
Our experiments aim at obtaining the SGMM shared parameters using SWEET-HOME data (7 h), Voix-détresse (28 min) and clean data (ESTER+REPERE, 500 h). For the GMM part, the three training data sets are simply merged into a single one. [24] showed that the model is also effective with large amounts of training data. Therefore, three UBMs were trained on SWEET-HOME data, Voix-détresse and clean data, respectively. These three UBMs contained 1K Gaussians and were merged into a single one, mixed down to 1K Gaussians (the closest pairs of Gaussians were merged [25]). The aim is to bias the acoustic model specifically towards the smart home and expressive speech conditions.
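The mixing-down step can be illustrated with a toy sketch: the components of several UBMs are pooled, then the closest pair is repeatedly replaced by a single moment-matched Gaussian until the target size is reached. The sketch works on one-dimensional Gaussians and measures closeness by the distance between means; both choices are assumptions made for readability, and the actual criterion of [25] may differ.

```python
def merge_pair(g1, g2):
    """Moment-matched merge of two weighted 1-D Gaussians (weight, mean, var)."""
    w1, m1, v1 = g1
    w2, m2, v2 = g2
    w = w1 + w2
    mean = (w1 * m1 + w2 * m2) / w
    second_moment = (w1 * (v1 + m1 * m1) + w2 * (v2 + m2 * m2)) / w
    return (w, mean, second_moment - mean * mean)


def mix_down(components, target):
    """Pool components, renormalize the weights, merge closest pairs.

    Closeness is the absolute distance between means, a hypothetical
    criterion chosen only for this illustration.
    """
    total = sum(wgt for wgt, _, _ in components)
    comps = [(wgt / total, m, v) for wgt, m, v in components]
    while len(comps) > target:
        best = None
        for a in range(len(comps)):
            for b in range(a + 1, len(comps)):
                dist = abs(comps[a][1] - comps[b][1])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        merged = merge_pair(comps[a], comps[b])
        comps = [c for k, c in enumerate(comps) if k not in (a, b)]
        comps.append(merged)
    return comps


# Two toy "UBMs" with two components each, mixed down to three components.
ubm_a = [(0.5, 0.0, 1.0), (0.5, 4.0, 1.0)]
ubm_b = [(0.5, 0.2, 1.0), (0.5, 8.0, 1.0)]
mixed = mix_down(ubm_a + ubm_b, target=3)
print(mixed)
```

The merged component keeps the total weight, mean and variance of the pair it replaces, so the overall mixture statistics are preserved while the component count goes down.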

Recognition of distress calls
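The recognition of distress calls consists in computing the phonetic distance between an ASR hypothesis and a list of predefined distress calls. Each ASR hypothesis H is phonetized, and every predefined call T is aligned to H using the Levenshtein distance. The deletion, insertion and substitution costs were determined empirically, while the cumulative distance γ(i, j) between T_i and H_j is given by Equation 1:

$$\gamma(i, j) = d(T_i, H_j) + \min\{\gamma(i-1, j-1),\ \gamma(i-1, j),\ \gamma(i, j-1)\} \qquad (1)$$

where T_i and H_j are the i-th and j-th phonetic symbols of the call and of the hypothesis, and d is the local cost.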
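The alignment described above can be sketched as a weighted Levenshtein distance between two phoneme sequences, followed by the selection of the closest predefined call. This is a minimal illustration: the unit costs stand in for the empirically determined costs, which are not reported here, and the phonetic transcriptions and the function names are hypothetical.

```python
def phonetic_distance(call, hypothesis, sub_cost=1.0, ins_cost=1.0, del_cost=1.0):
    """Weighted Levenshtein distance between two phoneme sequences.

    The default unit costs are placeholders, not the values used in the paper.
    """
    n, m = len(call), len(hypothesis)
    gamma = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        gamma[i][0] = gamma[i - 1][0] + del_cost
    for j in range(1, m + 1):
        gamma[0][j] = gamma[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = 0.0 if call[i - 1] == hypothesis[j - 1] else sub_cost
            gamma[i][j] = min(
                gamma[i - 1][j - 1] + local,   # match or substitution
                gamma[i - 1][j] + del_cost,    # call symbol missing in hypothesis
                gamma[i][j - 1] + ins_cost,    # extra symbol in hypothesis
            )
    return gamma[n][m]


def closest_call(hypothesis, calls):
    """Return (label, distance) of the predefined call nearest to the hypothesis."""
    scored = [(phonetic_distance(p, hypothesis), label) for label, p in calls.items()]
    distance, label = min(scored)
    return label, distance


# Hypothetical, simplified phonetic transcriptions of two distress calls.
calls = {
    "aidez-moi": ["e", "d", "e", "m", "w", "a"],
    "au secours": ["o", "s", "@", "k", "u", "R"],
}
# An ASR hypothesis in which one phoneme was lost.
hypothesis = ["e", "d", "e", "m", "a"]
print(closest_call(hypothesis, calls))
```

Working at the phoneme level means that a hypothesis with a truncated or slightly distorted word can still land close to the intended call.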
The decision whether to select a detected sentence is then taken according to a detection threshold on the aligned symbol score (phonemes) of each identified call. This approach tolerates some recognition errors, such as errors on word endings or slight variations. Moreover, in many cases a misdecoded word is phonetically close to the correct one (because their pronunciations are close). From this, the CER (Call Error Rate, i.e., the distress call error rate) is defined as:

$$\mathrm{CER} = \frac{\text{Number of missed calls}}{\text{Number of calls}} \qquad (2)$$

This measure was chosen because of the content of the Cirdo-set corpus used in this study. Indeed, this corpus is made of sentences and interjections. All sentences are calls for help, without any other kind of sentences such as home automation orders or colloquial sentences, and therefore it is not possible to determine a false alarm rate in this framework.
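As a small illustration of the decision step and of the CER measure, the sketch below accepts a call when the share of correctly aligned phonemes reaches a threshold, then counts the missed calls. The threshold value, the normalized score and the toy distances are assumptions made for the example; the paper does not report the values actually used.

```python
def is_detected(distance, call_length, threshold=0.8):
    """Accept a call when enough of its phonemes were aligned correctly.

    The threshold is a hypothetical value, not the one used in the paper.
    """
    score = 1.0 - distance / call_length
    return score >= threshold


def call_error_rate(detections):
    """CER = number of missed calls / number of calls.

    detections holds one boolean per uttered distress call (True = detected).
    """
    if not detections:
        raise ValueError("at least one call is required")
    missed = sum(1 for detected in detections if not detected)
    return missed / len(detections)


# Toy data: (distance to the closest call, length of that call in phonemes).
alignments = [(0.0, 10), (1.0, 10), (4.0, 10), (2.0, 12)]
detections = [is_detected(d, n) for d, n in alignments]
print(detections)
print(call_error_rate(detections))
```

Because every utterance in the corpus is a call for help, only misses can be counted this way; a false alarm rate would require utterances that are not distress calls.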

Live Experiment
An experiment was run in the experimental platform of the LIG laboratory, in a room whose setting corresponds to Figure 1 and which was equipped with a sofa, a carpet, 2 chairs, a table and e-lio. A Sennheiser SKM 300 G2 ME2 omnidirectional microphone was set on the cupboard. In these conditions, the microphone was more than 2 meters away from the speaker (distant speech conditions). The audio analysis system consisted of the CIRDOX software presented in Section 2, which continuously recorded and analysed the audio stream to detect the calls.

Scenarios and experimental protocol
The scenarios were elaborated after field studies made by the GRePS laboratory [17]. These studies made it possible to specify the fall context, the movements during the fall and the person's reaction once on the floor. Phrases uttered during and after the fall were also identified, for example "Blast! What's happening to me? Oh shit, shit!". The protocol was as follows [18]. Each participant was introduced to the context of the research and was invited to sign a consent form. The participants played four fall scenarios, one blocked hip scenario and two other scenarios called "true-false", added to challenge the automatic detection of falls by the video analysis system. Participants under 60 wore a simulator which hampered their mobility and reduced their vision and hearing in order to simulate the physical condition of aged people. Figure 3 shows a young participant wearing the simulator at the end of a fall scenario. The average duration of an experiment was 2h 30min per person. The experiment was very tiring for the participants, and it was necessary to include rehearsals before starting the recordings so that each participant felt comfortable and was able to fall safely.

Voice commands and distress calls
The sentences of the AD80 corpus [19] served as a basis to develop the language model used by our system. This corpus was recorded by 43 elderly people and 52 non-aged people in our laboratory and in a nursing home to study the automatic recognition of speech uttered by aged speakers. It is made of 81 casual sentences, 31 vocal commands for home automation and 58 distress sentences. An excerpt of these sentences in French is given in Table 2; the distress sentences identified in the field study reported in Section 3.1.1 were included in the corresponding part of AD80. Some of these distress sentences were integrated into the scenarios, with the exception of the two "true-false" scenarios.

Acquired data: Cirdo-set
In this paper, we focus on the detection of distress calls; therefore, we do not consider the audio events detected and analyzed on the fly but only the full recordings of each scenario. These data sets were transcribed manually using Transcriber [26], and the speech segments were then extracted for analysis.
The targeted participants were elderly people who were still able to play the fall scenarios safely. However, recruiting this population was very difficult, and part of the participants were people under 60 years old, who were invited to wear a special suit [18] which hampered their mobility and reduced their vision but had no effect on speech production. Overall, 17 participants were recruited (9 men and 8 women). Among them, 13 participants were under 60 and wore the simulator. The aged participants were between 61 and 83 years old.
While playing the scenarios, some participants produced sighs, grunts, coughs, cries, groans, panting or throat clearings. These sounds were not considered during the annotation process. In the same way, speech mixed with sounds produced by the fall was ignored. In the end, each speaker uttered between 10 and 65 short sentences or interjections ("ah", "oh", "aïe", etc.), as shown in Table 1.
Sentences were often close to those identified during the field studies ("je peux pas me relever - I can't get up", "e-lio appelle du secours - e-lio call for help", etc.), but some were different ("oh bein on est bien là tiens - oh I am in a sticky situation"). In practice, participants cut some sentences (e.g., inserted a delay between "e-lio" and "appelle ma fille - call my daughter") and uttered some spontaneous sentences, interjections or non-verbal sounds (e.g., groans).

Off line experiments
The methods presented in Section 2 were run on the Cirdo-set corpus presented in Section 3.1.3.
The SGMM model presented in Section 2.2 was used as the acoustic model. The generic language model (LM) was estimated from French newswire collected in the Gigaword corpus. It was a 1-gram model with 13,304 words. Moreover, to reduce the linguistic variability, a 3-gram domain language model (the specialized language model) was learnt from the sentences used during the corpus collection described in Section 3.1.1, with 99 1-grams, 225 2-grams and 273 3-grams. Finally, the language model used for decoding was a combination of the generic LM and the specialized LM.
The interest of such a combination is to bias the recognition towards the domain LM while, when the speaker deviates from the domain, the general LM makes it possible to avoid recognizing sentences that would lead to "false-positive" detections. Results on manually annotated data are given in Table 3. The most important performance measures are the Word Error Rate (WER) of the overall decoded speech and of the specific distress calls, as well as the Call Error Rate (CER, cf. Equation 2). Considering distress calls only, the average WER is 34.0%, whereas it is 39.3% when all interjections and sentences are taken into account.
Unfortunately, as mentioned above, the corpus used does not allow a False Alarm Rate to be determined. Previous studies based on the AD80 corpus showed recall, precision and F-measure equal to 88.4%, 86.9% and 87.2%, respectively [19]. Nevertheless, that corpus was recorded in very different conditions (text reading in a studio) from those of Cirdo-set.
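The effect of combining the generic and the domain language models described in this section can be illustrated with a linear interpolation of word probabilities, which is one common way to combine two LMs. The sketch below is an assumption about the mechanism rather than a description of the authors' setup: the interpolation weight, the toy probability tables and the function name are all hypothetical, and only unigram probabilities are shown.

```python
def interpolate(word, domain_lm, generic_lm, domain_weight):
    """Linear interpolation of two unigram probabilities.

    domain_weight is a hypothetical value; the combination weights
    are not given in the text.
    """
    p_domain = domain_lm.get(word, 0.0)
    p_generic = generic_lm.get(word, 0.0)
    return domain_weight * p_domain + (1.0 - domain_weight) * p_generic


# Toy probability tables (the generic one is only an excerpt).
domain_lm = {"e-lio": 0.5, "appelle": 0.3, "secours": 0.2}
generic_lm = {"appelle": 0.01, "secours": 0.001, "bonjour": 0.01}

for word in ("secours", "bonjour"):
    print(word, interpolate(word, domain_lm, generic_lm, domain_weight=0.9))
```

A word from the distress vocabulary keeps a high probability, while an out-of-domain word still receives a small non-zero probability from the generic model, so it can be decoded as itself instead of being forced onto a domain sentence.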

Discussion
These results are quite different from those obtained with the AD80 corpus (with aged speakers and speaker adaptation), for which the WER was 14.5% [19]. There are important differences between the recording conditions used for AD80 and those of the Cirdo-set corpus used in our study that can explain this performance gap:
• AD80 is made of readings by speakers sitting in a comfortable position in front of a PC and the microphone;
• AD80 was recorded at close range, whereas Cirdo-set was recorded in a distant setting;
• Cirdo-set was recorded by participants who had fallen on the floor or were blocked on the sofa. They were encouraged to speak in the same way as they would if they were really in these situations. Obviously, we obtained expressive speech, but there is no evidence that the pronunciation would be the same as in real conditions of a fall or a blocked hip.
Regarding the CER, its global value of 26.8% shows that 73.2% of the calls were correctly recognized. Furthermore, with the exception of one speaker (CER=71.4%), the CER is always below 50%, so more than 50% of the calls were recognized. For 6 speakers, the CER was below 20%. This suggests that a distress call could be detected if the speaker is able to repeat the call two or three times. However, if the system does not identify the first distress call because the person's voice is altered by stress, it is likely that this person will feel more and more stressed, and as a consequence subsequent calls would be more difficult to identify. In the same way, our corpus was recorded in realistic conditions but not in real conditions, and frail elderly people may not be adequately simulated by healthy adults. Even a relatively small number of missed distress calls could render the system unacceptable to potential users, and therefore further efforts in this regard are needed.
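The remark about repeated calls can be made concrete with a back-of-the-envelope computation. Under the strong assumption that successive calls are missed independently with the same probability, the chance of missing all of them is the CER raised to the number of attempts. The paragraph above explains why this assumption is optimistic, since stress may make later calls harder to recognize, so the figures are best read as a lower bound on the miss rate rather than a prediction.

```python
def miss_probability(cer, attempts):
    """Probability that every one of `attempts` calls is missed.

    Assumes independent attempts with a constant miss rate, which is
    an idealization and not a claim made by the paper.
    """
    if not 0.0 <= cer <= 1.0:
        raise ValueError("cer must lie in [0, 1]")
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return cer ** attempts


# Global CER reported in the discussion.
for attempts in (1, 2, 3):
    print(attempts, round(miss_probability(0.268, attempts), 4))
```

With the global CER of 26.8%, two independent attempts would leave about a 7% chance of missing the call entirely, and three attempts about 2%.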

Conclusion and perspectives
This study focuses on automatic speech recognition applications in smart homes, that is, in distant speech conditions and especially in realistic conditions that differ greatly from those of corpus recordings in which the speaker reads a text.
In this paper, we presented the Cirdo-set corpus, made of distress calls recorded in distant speech conditions and in realistic conditions of a fall or a blocked hip. The WER obtained at the output of the dedicated ASR was 36.3% for the distress calls. Thanks to a filtering of the ASR hypotheses at the phonetic level, more than 70% of the calls were detected.
These results, obtained in realistic conditions, give a fairly accurate idea of the performance that can be achieved with state-of-the-art ASR systems for end-user and specific applications. They were obtained in the particular case of the recognition of distress calls, but they can be extended to other applications in which expressive speech must be considered because it is inherently present.
As stated above, the results obtained are not sufficient to allow the system to be used in real conditions, and two research directions can be considered. Firstly, speech recognition performance may be improved thanks to acoustic models adapted to expressive speech. This may be achieved through the recording of corpora in real conditions, but this is a very difficult task. Secondly, it may be possible to recognize the repetition, at regular intervals, of speech events that are phonetically similar. This last method does not require correct recognition of the speech. Our future studies will address this problem.