Syntactic parsing of chat language in contact center conversation corpus

Chat language is often referred to as Computer-mediated communication (CMC). Most previous studies on chat language have been dedicated to collecting "chat room" data, as it is the kind of data most accessible on the Web. This kind of data falls under the informal register, whereas in this paper we are interested in understanding the mechanisms of a more formal kind of CMC: dialog chat in contact centers. The particularities of this type of dialog and the type of language used by customers and agents are the focus of this paper, as a step towards understanding this new kind of CMC data. The challenges in processing chat data come from the fact that Natural Language Processing tools such as syntactic parsers and part-of-speech taggers are typically trained on mismatched conditions. We describe in this study the impact of such a mismatch on a syntactic parsing task.


Introduction
Chat language has received attention in recent years as part of the general social media galaxy. More precisely, it is often referred to as Computer-mediated communication (CMC).
This term refers to any human communication that occurs through the use of two or more electronic devices, such as instant messaging, email or chat rooms. According to (Jonsson, 1997), who conducted an early study on data gathered through the Internet Relay Chat protocol and through emails: "electronic discourse is neither writing nor speech, but rather written speech or spoken writing, or something unique".
Recent projects in Europe, such as the CoMeRe (Chanier et al., 2014) or the STAC (Asher, 2011) projects, gathered collections of CMC data in several languages in order to study this new kind of language. Most of the effort has been dedicated to "chat room" data, as it is the kind of data most accessible on the Web. (Achille, 2005) constituted a corpus in French. (Forsyth and Martell, 2007) and (Shaikh et al., 2010) describe similar corpora in English. (Cadilhac et al., 2013) have studied the relational structure of such conversations through a deep discursive analysis of chat sessions in an online video game.
This kind of data falls under the informal register, whereas in this paper we are interested in understanding the mechanisms of a more formal kind of CMC: dialog chat in contact centers. This study is carried out in the context of the DATCHA project, a collaborative project funded by the French National Research Agency, which aims at performing unsupervised knowledge extraction from very large databases of Web chat conversations between operators and clients in customer contact centers. As the proportion of online chat interaction is constantly growing in companies' Customer Relationship Management (CRM), it is important to study such data in order to increase the scope of Business Analytics. Furthermore, such corpora can help us build automatic human-machine online dialog systems. Among the few works that have been published on contact center chat conversations, (Dickey et al., 2007) propose a study from the perspective of the strategies adopted by agents in favor of mutual comprehension, with a focus on discontinuity phenomena, trying to understand the reasons why miscomprehension can arise. (Wu et al., 2012) propose a typology of communication modes between customers and agents through a study on a conversation interface. In this paper we are interested in evaluating syntactic parsing on such data, with a particular focus on the impact of language deviations.
After a description of the data and the domain in section 2, we introduce the issue of syntactic parsing in this particular context in section 3. Then a detailed analysis of language deviations observed in chat conversations is proposed in section 4. Finally, experiments on part-of-speech (pos hereafter) tagging and syntactic parsing are presented in section 5.

Chat language in contact centers
In the book entitled "Digital textuality" (Trimarco, 2014), the author points out that "[. . . ] it would be more accurate to examine Computer Mediated Communication not so much by genre (such as email, discussion forum, etc. . . ) as in terms of communities". The importance of the relation between participants is also pointed out in (Kucukyilmaz et al., 2008). The authors insist on the fact that chat messages are targeted at a particular individual and that the writing style of a user not only varies with his personal traits, but also heavily depends on the identity of the receiver (corresponding to the notion of sociolinguistic awareness). Customer-agent chat conversations could be considered as being closer to customer-agent phone conversations than to informal chat-room conversations. However, the medium induces intrinsic differences between digital talk and phone conversations. The two main differences described in (Trimarco, 2014) are related to turn taking and synchronicity issues on the one side, and the use of semiotic resources such as punctuation or emoticons on the other.
In the case of assistance contact centers, customers engage in a chat conversation in order to solve a technical problem or to ask for information about their contract. The corpus used in this study has been collected from the online assistance of Orange (the main French telecom operator) for Orange TV customers who contact the assistance for technical problems or information on their offers. In certain cases, the conversation follows a linear progress (as in the example given in Figure 1). In other cases, the agent can perform some actions (such as line tests) that take some time, or the client can be asked to perform some operations on his installation, which also implies latencies in the conversation flow. In all cases, a chat conversation is logged: the timestamp at the beginning of each line corresponds to the moment when the participant (agent or customer) presses the Enter key, i.e. the moment when the message becomes visible to the other participant.
A conversation is a succession of messages, where several consecutive messages can be posted by the same participant. The temporal information only concerns the moment when the message is sent and there is no clear evidence on when writing starts. There is no editing overlap in the Conversation Interface as the messages appear sequentially but it can happen that participants write simultaneously and that a message is written while the writer is not aware of the preceding message.
As one can see in the example in Figure 1, chat conversations are dissimilar from edited written text in that they contain typos, agrammaticalities and other informal writing phenomena. They are similar to speech in that a dialog with a focused goal is taking place, and participants take turns for solving that goal, using dialogic idiomatic terms which are not found in typical written text. They differ from speech in that there are no disfluencies, and that the text of a single turn can be repaired before being sent. We argue that these differences must be considered as relevant as the two differences pointed out by (Trimarco, 2014).
All these properties, along with the particular type of language used by customers and agents, are the focus of this paper, as a step towards understanding this new kind of CMC data. The challenges in processing chat come from the fact that analysis tools such as syntactic parsers and pos taggers are typically trained on mismatched conditions. We describe in this study the impact of such a mismatch on these two tasks.

Syntactic parsing of chat language
An accurate analysis of human-human conversation should have access to a representation of the text content that goes beyond surface-level analyses such as keyword search.
In the DATCHA project, we perform syntactic parsing as well as semantic analysis of the textual data in order to produce high-level features that will be used to evaluate human behaviors. Our target is not a perfect and complete syntactic and semantic analysis of the data, but rather a level of analysis sufficient to qualify and compare conversations. We believe that the current models used in the fields of syntactic and semantic parsing are mature enough to go beyond the normative data found in benchmark corpora and to process text that comes from CRM chat. The experience we gathered on parsing speech transcriptions in the framework of the DECODA (Bazillon et al., 2012) and ORFEO (Nasr et al., 2014) projects showed that current parsing techniques can be successfully used to parse disfluent speech transcriptions.
Yes that's right
CUST error code S03
AGENT No problem, I will send you another card to your home address.
CUST can I come and get it today
AGENT You can't get a card from an Orange store because they can only proceed to exchanges.
CUST ok thank you for sending it as soon as possible you have my coordinates
AGENT Yes I have them in your record.
CUST ok fine within 48h maximum 72h for the card
AGENT You will receive it according to delivery time at the address in your record.
CUST ok fine thank you
AGENT You're welcome
AGENT Before you go, do you have any other question?
CUST no thank you
Figure 1: Example of conversation in the TV assistance domain, in its original form (above) and a translation without errors (below)
Syntactic parsing of non-canonical textual input in the context of human-human conversations has mainly been studied on textual transcriptions of spontaneous speech. In such data, the variation with respect to canonical written text comes mainly from syntactic structures that are specific to spontaneous speech, as well as disfluencies, such as filled pauses, repetitions and false starts. Our input has some of the specificities of spontaneous speech but adds new ones. More precisely, we find in our data syntactic structures found in speech (such as a loose integration of micro-syntactic units into macro structures) but, for obvious reasons, we do not find other features that are characteristic of speech, such as repetitions and restarts. On the other hand, we find in our data many orthographic errors. The following example, taken from our corpus, illustrates the specific nature of our data:
ces deja se que j ai fait les pile je les est mit tou a l heure elle sont neuve
All words highlighted can be considered as erroneous either lexically or syntactically. This sentence could be paraphrased by:
c'est déjà ce que j'ai fait, les piles je les ai mises tout à l'heure, elles sont neuves
Such an utterance features an interesting mixture of oral and written characteristics: the syntax is close to oral, but there are neither repetitions nor false starts. Orthographic errors are numerous and some of them are challenging for a syntactic parser.
We present in this paper a detailed analysis of the impact of all these phenomena on syntactic parsing. Other types of social media data have been studied in the literature. In particular, tweets have lately received more attention. (Ritter et al., 2011) for example provide a detailed evaluation of a pos tagger on tweets, with the final objective of performing Named Entity detection. They showed that the performance of a classical tagger trained on generic news data drops when applied to tweets and that adaptation with in-domain data helps increase this performance. More recently, (Kong et al., 2014) described a dependency parser for tweets. However, to the best of our knowledge, no such study has been published on social media data from formal online Web conversations.

A study on orthographic errors in agent/customer chat dialogs
Chat conversations are unique from several perspectives. In (Damnati et al., 2016), we conducted a study comparing contact center chat conversations and phone conversations, both in the domain of technical assistance for Orange customers.
The comparative analysis showed significant differences in terms of interaction flow. While chat conversations are on average twice as long in terms of effective duration, phone conversations contain on average four times more turns than chat conversations. This can be explained by several factors: chat is not an exclusive activity and latencies are more easily accepted than in an oral conversation. Chat utterances are formulated in a more direct style. Additionally, the fact that an utterance is visible on the screen and remains visible reduces misunderstanding and the need for reformulation turns in an interaction. Regarding the language itself, both media induce specific noise that makes it difficult for automatic Natural Language Understanding systems to process them. Phone conversations are prone to spontaneous speech effects such as disfluencies, and the need to perform Automatic Speech Recognition generates additional noise. When processing online chat conversations, these issues disappear. However, the written utterances themselves can contain errors, be they orthographic and grammatical errors or typographic deviations due to high-speed typing, poor orthographic skills and inattention.
In this study we focus on a corpus of 91 chat conversations that have been fully annotated with correct orthographic form, lemma and pos tags. The annotator was advised to correct misspelled words but she/he was not allowed to modify the content of a message (adding a missing word or suppressing an irrelevant word). In order to compare the original chat conversations with [...] Instead of being counted as two errors, agglutinations and splits are counted as one substitution. The evaluation is given in terms of Substitution Error Rate (SER), which is the number of substitutions relative to the total number of words, and Message Error Rate (MER), which is the number of messages that contain at least one substitution relative to the total number of messages. As we are interested in the impact of language deviations on syntactic parsing of the messages, the latter rate should also be looked at carefully.
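The two rates defined above can be sketched as follows. This is a minimal illustration, not the evaluation script used in the study: the function name `error_rates` and the input layout (one list of aligned original/corrected pairs per message) are assumptions, as is the choice of the number of aligned units as the denominator of the SER. An agglutination or a split is represented as a single pair, so that it counts as one substitution.

```python
def error_rates(messages):
    """Compute (SER, MER) over a list of messages.

    Each message is a list of (original, corrected) pairs, one pair per
    aligned unit. Assumption: an agglutination or a split is stored as a
    single pair (e.g. ("bon jour", "bonjour")), so it counts once.
    """
    n_units = 0          # total number of aligned units (words)
    n_substitutions = 0  # units whose original form differs from the corrected one
    n_bad_messages = 0   # messages with at least one substitution

    for message in messages:
        subs = sum(1 for original, corrected in message if original != corrected)
        n_units += len(message)
        n_substitutions += subs
        if subs > 0:
            n_bad_messages += 1

    ser = n_substitutions / n_units if n_units else 0.0
    mer = n_bad_messages / len(messages) if messages else 0.0
    return ser, mer


# Toy data (invented for illustration, not taken from the corpus).
toy = [
    [("ces", "c'est"), ("deja", "déjà"), ("fait", "fait")],
    [("ok", "ok"), ("merci", "merci")],
]
ser, mer = error_rates(toy)
print(f"SER={ser:.2f} MER={mer:.2f}")
```

On this toy input, two of the five units are substitutions and one of the two messages contains an error, which illustrates why a low SER can coexist with a much higher MER.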
As can be seen in Table 1, the overall proportion of misspelled words is not very high (4.5%). However, 27.2% of the turns contain at least one misspelled word. The number of words written by Agents is almost twice as large as the number of words produced by Customers. In fact, Agents have access to predefined utterances that they can use in various situations. They are also encouraged to formulate polite sentences that tend to increase the length of their messages, while Customers usually adopt a more direct and concise style. Consequently, Agents weigh more in the overall SER and MER evaluation, artificially lowering these rates. In fact, as would be expected, Agents make far fewer mistakes and the distribution of their errors among conversations is quite balanced, with a low standard deviation. The situation is different for Customers, where both SER and MER have a high standard deviation (respectively 8.7% and 21.5%). The proportion of misspelled words depends on each Customer's linguistic skills and/or attention when typing.
In order to further study the impact of errors on Syntactic Analysis modules, we propose, as a preliminary study, to examine in more detail the various types of substitutions encountered in the corpus. We make a distinction between the following types of deviations:
• DIACR for diacritic errors, which are common in French as accents can be omitted, added or even substituted (à -> a, très -> trés, energie -> énérgie).
• APOST for missing or misplaced apostrophe.
• AGGLU for agglutinations of two words into one.
• SPLIT for a word split into two words.
• INFL for inflection errors. Morpho-syntactic inflection in French is error prone as it is common that different inflected forms of the same word are homophones (question -> questions). Among these errors, it is very common (Véronis and Guimier de Neef, 2006) to find past participles replaced by infinitives for verbs that end with er (j'ai changé -> j'ai changer).
• SWITCH for two letters that are switched.
• OTHER for all the other errors.
These types of errors are automatically evaluated in this order and are exclusive (e.g. DEL1C corresponds to words which have one missing character and are not of any preceding type). Table 2 presents the proportion of each type of error observed in the corpus. As can be seen, diacritic deviations are predominant. Overall, the second source of deviations is the use of an erroneous inflection of the same word. It represents a higher proportion for Agents than for Customers. Erroneous use of apostrophes is frequent for Customers but almost never occurs for Agents. Agglutinations are more frequent than splits, and constitute more than 11% of deviations for Agents.
Table 3 presents the distribution of language deviations by pos category. Observing this distribution can give hints on the problems that can be encountered for pos tagging and syntactic parsing. As one can see, function words are generally less error prone than content words. Apart from present participles, which are always well written, only proper names and imperative verbs have an SER below the overall SER of 4.5%. But these categories are not highly represented in our data. All other content word categories have an SER above the overall SER. The most error-prone category is past participle verbs, which are, as already mentioned, often confused with the infinitive form and which are also prone to inflection errors.
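The ordered, exclusive typing of deviations described above can be sketched as a cascade of tests, where the first matching test decides the label. This is only an illustration under stated assumptions: the function names are hypothetical, the `same_lemma` flag stands in for a comparison of lemmas (which the annotation provides but which is not modeled here), and single-character types such as DEL1C, which are mentioned in the text but not detailed in the list, simply fall under OTHER.

```python
import unicodedata


def strip_accents(text):
    """Remove combining diacritical marks (é -> e, à -> a)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


def strip_apostrophes(text):
    """Remove straight and typographic apostrophes."""
    return text.replace("'", "").replace("\u2019", "")


def is_switch(a, b):
    """True if b is a with exactly two adjacent letters swapped."""
    if len(a) != len(b):
        return False
    diff = [i for i in range(len(a)) if a[i] != b[i]]
    return (len(diff) == 2 and diff[1] == diff[0] + 1
            and a[diff[0]] == b[diff[1]] and a[diff[1]] == b[diff[0]])


def classify(original, corrected, same_lemma=False):
    """Return the first deviation type that applies, in the order of the list.

    `same_lemma` is an assumption of this sketch: the caller says whether
    both forms share a lemma, which is how INFL is approximated here.
    """
    if original == corrected:
        return "CORRECT"
    if strip_accents(original) == strip_accents(corrected):
        return "DIACR"
    if strip_apostrophes(original) == strip_apostrophes(corrected):
        return "APOST"
    if " " not in original and " " in corrected:
        return "AGGLU"   # one written word stands for two corrected words
    if " " in original and " " not in corrected:
        return "SPLIT"   # two written words stand for one corrected word
    if same_lemma:
        return "INFL"
    if is_switch(original, corrected):
        return "SWITCH"
    return "OTHER"


print(classify("deja", "déjà"), classify("pile", "piles", same_lemma=True))
```

Because the tests are applied in a fixed order, a form that is both missing an accent and wrongly inflected would be labeled by whichever test comes first, which is what makes the types exclusive.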

Corpus description
In order to evaluate the impact of errors on pos tagging and parsing, the corpus has been split into two sub-corpora (DEV and TEST) of similar sizes.
Conversations have been extracted from logs in chronological order, meaning that they are representative of real conditions, with a variety of call motives and situations. Hence splitting the corpus into two parts following the chronological order reduces the risk of over-fitting between the DEV corpus and the TEST corpus.
Table 3: Language deviation by pos: proportion of each pos in the corpus and corresponding Substitution Error Rate
[...] corrected version. All conversations have been anonymized and personal information has been replaced by a specific label (one label for Customer names, one for Agent names, one for phone numbers and another one for addresses). Hence, the entities concerned by this anonymization step do not account for lexical variety. It is interesting to notice that the number of different words on the Full corpus drops from 2381 when computed on the raw corpus to 2173 (15.3% relative) when computed on the corrected corpus. The proportion of words occurring just once is also reduced when computed over the manually corrected tokens. The statistics of the TEST corpus are comparable. However, the lexical intersection of both corpora is not very high, as 10.3% of word occurrences in the TEST corpus are not observed in the DEV corpus (9.1% for Agents and 19.8% for Customers). When computing these rates over the manually corrected tokens, the overall percentage goes down to 9.0% (8.6% for Agents and 17.3% for Customers). These last figures remain high and show that the lexical diversity, although amplified by spelling errors, is inherent to the data and the domain, with a variety of situations encountered by Customers. Adapting our pos tagger on the DEV corpus is a reasonable experimental approach, as the preceding observations exclude the risk of over-fitting bias at the lexical level.
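The lexical statistics discussed above (number of different words, and proportion of TEST word occurrences not observed in DEV) can be computed along the following lines. This is a minimal sketch: the function names and the representation of a corpus as a flat list of tokens are assumptions, and the toy tokens are invented for illustration.

```python
def vocabulary_size(tokens):
    """Number of different word forms in a token list."""
    return len(set(tokens))


def unseen_rate(dev_tokens, test_tokens):
    """Proportion of TEST word occurrences whose form never appears in DEV."""
    dev_vocabulary = set(dev_tokens)
    unseen = sum(1 for token in test_tokens if token not in dev_vocabulary)
    return unseen / len(test_tokens) if test_tokens else 0.0


# Toy data (not taken from the corpus).
dev = ["la", "carte", "ne", "marche", "pas"]
test = ["la", "carte", "est", "bloquée"]
print(vocabulary_size(dev), unseen_rate(dev, test))
```

Running the same two functions once on the raw tokens and once on the manually corrected tokens gives the kind of comparison reported in this section.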

Tagging
The pos tagger used for our experiments is a standard Conditional Random Fields (CRF) (Lafferty et al., 2001) tagger which obtains state-of-the-art results on traditional benchmarks. We use a coarse tagset made of 18 different parts of speech.
Three different taggers based on the same architecture are evaluated. The first one, T_F, is trained on the French Treebank (Abeillé et al., 2003), which is composed of newspaper articles. The second one, T_D, is trained on our DEV corpus, and the third one, T_FD, on the union of the French Treebank and our DEV corpus.
Taggers are usually evaluated with an accuracy metric, which is based on the comparison, for every token, of its tag in the output of the tagger (the hypothesis) and its tag in the human annotated corpus (the reference). In our case, the number of tokens in the reference and the hypothesis is not the same, due to agglutinations and splits. In order to account for these phenomena in the evaluation metric, we define conventions that are depicted in Table 5: in case of an agglutination, the tag of the agglutinated token t in the hypothesis is compared to the tag of the first token in the reference (see left part of Table 5, where the two tags compared are in bold face). In case of a split, the tag of the first token in the hypothesis is compared to the tag of the token in the reference (see right part of Table 5).
The taggers have been evaluated on the TEST corpus. The results are displayed in Table 6, which shows several interesting phenomena.
First, the three taggers obtain significantly different results. T_F, which is trained on the French Treebank, obtains the lowest results: 86.59% accuracy on the customer part of the corpus and 90.23% on the agent part. Adding the DEV corpus to the French Treebank has a beneficial impact on the results: accuracy reaches respectively 88.83% and 95.51%. The best results are obtained by T_D, with 90.38% and 96.50% accuracy, despite the small size of the DEV corpus on which it is trained.
Second, as could be expected, the results are systematically higher on the corrected versions of the corpora. The results are around 4.5 points higher on the customer side and around 1 point higher on the agent side. These figures constitute the upper bound of the tagging accuracy that can be expected if the corpus is automatically corrected prior to tagging.
Third, the results are higher on the agent side. This was also expected from the analysis of the errors in both parts of the corpus (see Table 1).
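The accuracy computation with the agglutination and split conventions defined above can be sketched as follows. The input layout is an assumption of this sketch: each aligned unit is a pair (reference tags, hypothesis tags), with two reference tags for an agglutination and two hypothesis tags for a split. Counting one comparison per aligned unit is also an assumption. The tag names in the toy example are illustrative only.

```python
def tagging_accuracy(units):
    """Accuracy over aligned units, following the conventions above.

    Each unit is (reference_tags, hypothesis_tags):
      - regular token:  (["N"], ["N"])
      - agglutination:  (["DET", "N"], ["DET"])  one written token, two reference tokens
      - split:          (["ADV"], ["ADV", "P"])  two written tokens, one reference token
    In every case the first hypothesis tag is compared to the first reference tag.
    """
    if not units:
        return 0.0
    correct = sum(1 for reference, hypothesis in units
                  if reference[0] == hypothesis[0])
    return correct / len(units)


# Toy example: one tagging error out of four aligned units.
toy_units = [
    (["CLS"], ["CLS"]),
    (["V"], ["N"]),             # tagging error
    (["DET", "N"], ["DET"]),    # agglutination, first reference tag matches
    (["ADV"], ["ADV", "P"]),    # split, first hypothesis tag matches
]
print(tagging_accuracy(toy_units))
```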
Tables 7 and 8 give a finer view of the influence of errors on the pos tagging accuracy for tagger T_D. Each line of the [...] error types of Table 2. The second column corresponds to the number of occurrences of tokens that fall under this category. The third column is the number of tokens of this status that were correctly tagged, column four is the accuracy for this status and column five is the contribution to the error rate. Table 7 shows that misspelled tokens are responsible for roughly 40% of the tagging errors. Among errors, the DIACR type has the highest influence on the pos accuracy: it corresponds to 13% of the errors, followed by agglutination. Table 8 shows that erroneous tokens account for 20% of the errors on the agent side, and that the first cause of token deviation that provokes tagging errors is DIACR.
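The columns described above (occurrences, correctly tagged tokens, accuracy, and contribution to the error rate) can be derived from a list of tagged tokens as sketched below. The function name, the input layout (one `(status, correctly_tagged)` pair per token) and the use of a CORRECT status for well-spelled tokens are assumptions of this sketch, and the toy counts are invented.

```python
from collections import defaultdict


def error_breakdown(tokens):
    """Per-status statistics for a list of (status, correctly_tagged) pairs.

    Returns {status: {"occurrences", "correct", "accuracy", "error_share"}},
    where error_share is the fraction of all tagging errors that fall on
    tokens of this status.
    """
    occurrences = defaultdict(int)
    correct = defaultdict(int)
    for status, correctly_tagged in tokens:
        occurrences[status] += 1
        correct[status] += int(correctly_tagged)

    total_errors = sum(occurrences[s] - correct[s] for s in occurrences)
    rows = {}
    for status in occurrences:
        errors = occurrences[status] - correct[status]
        rows[status] = {
            "occurrences": occurrences[status],
            "correct": correct[status],
            "accuracy": correct[status] / occurrences[status],
            "error_share": errors / total_errors if total_errors else 0.0,
        }
    return rows


# Toy data: 4 tagging errors in total, 2 of them on misspelled tokens.
toy_tokens = ([("CORRECT", True)] * 6 + [("CORRECT", False)] * 2
              + [("DIACR", True), ("DIACR", False), ("AGGLU", False)])
for status, row in error_breakdown(toy_tokens).items():
    print(status, row)
```

Summing `error_share` over all statuses other than CORRECT gives the proportion of tagging errors attributable to misspelled tokens.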

Parsing
The parser used in our experiment is a transition-based parser (Yamada and Matsumoto, 2003; Nivre, 2003). It is a dependency parser that takes as input tokens with their pos tag and selects for every token a syntactic governor (which is another token of the sentence) and a syntactic label. The prediction is based on several features that combine lexical information and pos tags. Orthographic errors therefore have a double impact on the parsing process: through the errors they provoke in the pos tagging process and through the errors they provoke directly in the parsing process. The parser was trained on the French Treebank. Contrary to taggers, a single parser was used for our experiments since we do not have hand-corrected syntactic annotation of the DATCHA corpus.
In order to evaluate the parser, we have parsed our DEV corpus with corrected tokens and gold pos tags and considered the syntactic structures produced to be our reference. The results that are given below should therefore be taken with caution. Their absolute value is not reliable (it is probably overestimated) but they can be compared with one another.
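To make the notion of transition-based dependency parsing more concrete, the sketch below applies a given sequence of transitions to a sentence and returns, for every token, its governor and label. It assumes the arc-standard transition system, which is one common choice and not necessarily the one used in our parser. In a real parser, each transition is predicted by a classifier from lexical and pos features; here the sequence is simply given. The function name and the labels in the example are illustrative.

```python
def apply_transitions(n_tokens, transitions):
    """Apply arc-standard transitions; return {dependent: (governor, label)}.

    Tokens are numbered 1..n_tokens; 0 is an artificial root.
    Each transition is (action, label) with action in:
      "SHIFT": move the next buffer token onto the stack (label ignored)
      "LEFT":  the top of the stack governs the token just below it
      "RIGHT": the token just below the top governs the top of the stack
    """
    stack = [0]
    buffer = list(range(1, n_tokens + 1))
    arcs = {}
    for action, label in transitions:
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFT":
            dependent = stack.pop(-2)
            arcs[dependent] = (stack[-1], label)
        elif action == "RIGHT":
            dependent = stack.pop()
            arcs[dependent] = (stack[-1], label)
        else:
            raise ValueError(f"unknown action: {action}")
    return arcs


# "la carte fonctionne": la <- carte <- fonctionne <- root
sequence = [
    ("SHIFT", None), ("SHIFT", None), ("LEFT", "det"),
    ("SHIFT", None), ("LEFT", "suj"), ("RIGHT", "root"),
]
print(apply_transitions(3, sequence))
```

Since each decision relies on the word forms and pos tags of the tokens on the stack and in the buffer, a misspelled token can mislead the parser both directly and through a wrong pos tag.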
The metric used to evaluate the output of the parser is the Labeled Attachment Score (LAS), which is the ratio of tokens for which the correct governor along with the correct syntactic label have been predicted. The conventions of Table 5 defined for the tagger were also used for evaluating the parser.
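The LAS defined above can be computed as in the following sketch, which assumes that reference and hypothesis have already been aligned token by token (after applying the agglutination and split conventions) and that each token is described by a (governor index, label) pair. The function name and the toy analysis are illustrative.

```python
def labeled_attachment_score(reference, hypothesis):
    """Ratio of tokens with both the correct governor and the correct label.

    `reference` and `hypothesis` are aligned lists of (governor, label) pairs.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must be aligned")
    if not reference:
        return 0.0
    correct = sum(1 for ref, hyp in zip(reference, hypothesis) if ref == hyp)
    return correct / len(reference)


# Toy example: the second token has the right governor but a wrong label.
ref = [(2, "det"), (3, "suj"), (0, "root")]
hyp = [(2, "det"), (3, "obj"), (0, "root")]
print(labeled_attachment_score(ref, hyp))
```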
Three series of parsing experiments were conducted. The first one takes as input the tokens as they appear in the raw corpus and the pos tags predicted with our best tagger (T_D). These experiments correspond to the most realistic situation, with original tokens and predicted pos tags. The second series of experiments takes as input the corrected tokens and the predicted pos tags. Its purpose is to estimate an upper bound of the parsing accuracy when using an orthographic corrector prior to tagging and parsing. The third experiment takes as input raw tokens and gold pos tags. It corresponds to an artificial situation; its purpose is to evaluate the influence of orthographic errors on parsing, independently of tagging errors. Table 9 shows that the influence of orthographic errors on parsing is limited: most parsing errors are due to pos tagging errors.
The table also shows that the difference in parsing accuracy between the customer part of the corpus and the agent part is higher than it was for tagging. This can be explained by the fact that, from the syntactic point of view, agent utterances are probably closer to the data on which the parser has been trained (journalistic data) than customer utterances.
Table 9: LAS of the parser output for three types of input: original tokens (O) and predicted pos tags, corrected tokens (C) and predicted pos tags, and original tokens and gold pos tags, computed on the TEST corpus for the customer and the agent parts of the corpus.

Conclusion
We study in this paper orthographic mistakes that occur in data collected in contact centers. A typology of mistakes is proposed and their influence on part-of-speech tagging and syntactic parsing is studied. We also show that taggers and parsers trained on standard journalistic corpora yield poor results on such data and that the addition of a limited amount of annotated data can significantly improve the performance of such tools.