Foreebank: Syntactic Analysis of Customer Support Forums

We present a new treebank of English and French technical forum content which has been annotated for grammatical errors and phrase structure. This double annotation allows us to empirically measure the effect of errors on parsing performance. While it is slightly easier to parse the corrected versions of the forum sentences, the errors are not the main factor in making this kind of text hard to parse.


Introduction
The last five years have seen a considerable amount of research on parsing web and social media text, with new treebanks being created (Foster et al., 2011; Seddah et al., 2012; Mott et al., 2012; Kong et al., 2014) and new parsing systems developed (Petrov and McDonald, 2012; Kong et al., 2014). In this paper we explore a particular source of user-generated text, namely posts from technical support forums, which are a popular means for customers to resolve their queries about a product. An accurate parser for this kind of text can be used to inform forum-level question answering, machine translation and quality estimation of machine translation.
We create a 2000-sentence treebank called Foreebank, which contains sentences from the Symantec Norton English and French technical support forums. The phrase structure of the sentences is annotated and any grammatical errors are marked in the trees. Marking the grammatical errors allows us to precisely measure the amount of grammatical noise in this kind of text, and the joint error and syntactic annotation allows us to determine its effect on parsing. Foster (2010) explored the effect of spelling errors on parsing performance of conversational forum text. We extend this study to include grammatical errors, focusing on more technical content. Foster et al. (2008) explored the effect of artificially generated grammatical errors on Wall Street Journal parsing. We concentrate on forum text rather than newspaper text and, crucially, examine the effect of real grammatical errors. We find that the level of grammatical noise is lower than expected, with capitalisation and punctuation errors being the most frequent. While correcting all the errors does result in a performance increase of 1.6 F1 points for English and 0.8 points for French, the major challenge in parsing these sentences seems not to be "bad language" (Eisenstein, 2013) per se.
The main contribution of the paper is the Foreebank data set itself, but we also carry out preliminary parsing experiments: evaluating the accuracy of a PCFG-LA parser on Foreebank, examining the effect of grammatical errors on parsing, and experimenting with different training sets.

Related Work
Other treebanks of English web text include the English Web Treebank (aka Google Web Treebank) (Mott et al., 2012), the small treebank of tweets and football discussion forum posts described in Foster et al. (2011) and the tweet dependency bank described in Kong et al. (2014). The English Web Treebank is a corpus of over 250K words, selected from blogs, newsgroups, emails, local business reviews and Yahoo! answers. It adapts the Penn Treebank (Marcus et al., 1994) and Switchboard (Taylor, 1996) annotation guidelines to address the phenomena specific to this type of text. The annotation of the 1000-sentence treebank described in Foster et al. (2011) is based on the Penn Treebank, whereas the annotation of the treebank described in Kong et al. (2014) is dependency-based. The French Social Media Bank developed by Seddah et al. (2012) is a treebank of 1,700 French sentences from various types of social media including Facebook, Twitter and discussion forums (video game and medical). An extended version of the FTB-UC annotation guidelines (Candito and Crabbé, 2009) is employed during annotation, and subcorpora containing particularly noisy utterances are identified.
The main difference between Foreebank and other web/social media treebanks is that grammatical errors in the Foreebank sentences are marked and corrected as part of the annotation process. Error annotation not only provides more insight into this type of text but it also enables us to directly measure the effect of these errors on parsing accuracy and leaves open the possibility of performing joint parsing and error detection by directly learning the error annotation during parser training.
A learner corpus (Granger, 2008) contains utterances produced by language learners and serves as a resource for second language acquisition, computational linguistics and computer-aided language learning research. Examples include the International Corpus of Learner English (Granger, 1993), the Cambridge Learner Corpus (Nicholls, 1999; Yannakoudakis et al., 2011), the NUS Corpus of Learner English (Dahlmeier et al., 2013) and the German Falko corpus (Lüdeling, 2008; Rehbein et al., 2012). We can compare Foreebank to a learner corpus since both contain utterances that are potentially ungrammatical and because in a learner corpus the errors are often annotated, as they are in Foreebank. In the last five years, there have been several shared tasks in grammatical error correction, including the Helping Our Own (HOO) shared tasks of 2011 and 2012 (Dale and Kilgarriff, 2011; Dale et al., 2012) and the CoNLL 2013 and 2014 shared tasks (Ng et al., 2013; Ng et al., 2014). With the exception of HOO 2011, all of these shared tasks involve error-annotated sentences from learner corpora. Annotation schemes vary but most involve marking the span of an error, classifying the error according to some taxonomy designed with L2 utterances in mind, and sometimes providing the correction or "target hypothesis" (Hirschmann et al., 2007).
Regarding syntactic annotation of learner data, Dickinson and Ragheb (2009) propose a dependency annotation scheme based on the CHILDES scheme (Sagae et al., 2007) developed for first language learners. They treat the developing language of learners as an interlanguage, as suggested by Díaz-Negrillo et al. (2010), and annotate it as is. They use two POS tags and two dependency labels for error cases: one for the surface form and one for the intended form. Rosén and De Smedt (2010) criticise the approach of Dickinson and Ragheb (2009) involving "annotating language text as is" arguing that interpretation of the language is required at all annotation levels. They use NorGram, a Norwegian Lexical-Functional Grammar, to annotate a learner corpus with constituency structure, functional structure and semantic structure, in order to provide a means to search for contexts in which learner errors occur. Nagata et al. (2011) describe an English learner corpus which has been manually annotated with shallow syntax, introducing two new POS tags and two new chunk labels for errors.

Building the Foreebank
The Foreebank treebank contains 1000 English sentences and 1000 French sentences. The English sentences come from the Symantec Norton technical support user forum. Half of the French sentences come from the French Norton forum and the other half are human translations of sentences from the English forum. Four annotators were involved in the annotation process. Their main task was to correct automatically parsed phrase structure trees using an annotation tool developed for this project. The English annotators were guided by the Penn Treebank bracketing guidelines and a Foreebank-adapted version of the English Web Treebank bracketing guidelines. The French annotators used the French treebank (FTB) (Abeillé et al., 2003) guidelines, following the SPMRL strategy for multiword expressions (Seddah et al., 2013; Candito and Crabbé, 2009). The two primary annotators, one for French and one for English, annotated all the data for their language. The two secondary annotators annotated a 100-sentence subset. Inter-annotator agreement was calculated by measuring Parseval scores on this subset.
Figure 1: The Foreebank annotation of "i tried customize too , but i cant find them .." corrected as "I tried Customize too , but I ca n't find them ...".

[Table 1: Error suffixes used in Foreebank. Columns: Suffix, Explanation, Example.]
Prior to correcting a parse tree produced by the automatic parser, the annotators are asked to correct any errors they find in the sentence. 4 The corrected text is entered in a field of the annotation tool. As part of the syntactic annotation process, errors are marked by appending an error suffix to the preterminals of the affected words in the tree. The error suffixes used in Foreebank are listed in Table 1 and an example tree from Foreebank is shown in Figure 1. There are three kinds of substitution error suffixes: C for marking problems with capitalisation, S for marking spelling errors and W for marking the wrong form of a word, which encompasses inflection errors (they instead of them), real-word spelling errors (test instead of text) and lexical choice errors (desk instead of chair). The POS tag of the corrected form is used in the tree instead of the POS tag of the incorrect form. 5 Although this annotation scheme contains fewer error types than the taxonomies used for learner corpora, its granularity increases when the error suffixes are interpreted in the syntactic context in which they occur. For example, we can distinguish a missing determiner (DT D) from a missing preposition (IN D).
4 Minimal correction is encouraged to prevent annotators from rewriting the sentence in their preferred writing style. Instead they are instructed to just focus on fixing the errors.
5 An alternative would have been to use one POS tag for the erroneous form and one for the corrected form, either combined à la Nagata et al. (2011) or separate à la Dickinson and Ragheb (2009).
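The way a suffixed label combines a tag with an error type can be illustrated with a minimal Python sketch. It makes several assumptions that are not confirmed by the text: the suffix is taken to be attached to the tag with an underscore (the extracted text only shows forms such as "DT D"), the names `split_label`, `describe` and `KNOWN_SUFFIXES` are hypothetical, and only the suffixes explicitly named in this section are listed.

```python
from typing import NamedTuple, Optional

# Error suffixes named in the text. The "_" separator is an assumption:
# the extracted text only shows forms such as "DT D" and "NN B".
KNOWN_SUFFIXES = {
    "C": "capitalisation error",
    "S": "spelling error",
    "W": "wrong word form",
    "D": "deleted (missing) token",
    "M": "merged sentences",
    "B": "broken (split) token",
}
SEPARATOR = "_"  # hypothetical encoding of the suffix


class Label(NamedTuple):
    tag: str                # tag of the corrected form
    suffix: Optional[str]   # error suffix, or None if the node is error-free


def split_label(label: str) -> Label:
    """Split a node label such as 'DT_D' into its tag and error suffix."""
    tag, sep, suffix = label.rpartition(SEPARATOR)
    if sep and tag and suffix in KNOWN_SUFFIXES:
        return Label(tag, suffix)
    return Label(label, None)


def describe(label: str) -> str:
    """Read the suffix in its syntactic context, i.e. together with the tag."""
    tag, suffix = split_label(label)
    if suffix is None:
        return f"{tag}: no error"
    return f"{tag}: {KNOWN_SUFFIXES[suffix]}"
```

Under these assumptions, `describe("DT_D")` and `describe("IN_D")` yield different readings of the same suffix, which is the sense in which the tag context adds granularity.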
The "sentences" that the annotators see are the result of passing the forum text through an automatic sentence splitter (NLTK 6) and an in-house tokeniser. This is another important difference between Foreebank and the English Web Treebank (EWT). In the EWT, sentence boundary detection and tokenisation were carried out manually before annotation. Both approaches are valid, but ours was chosen in order to stay closer to the more realistic scenario of less than perfect automatic preprocessing tools. This means that there is a special class of errors, resulting from noisy sentence splitting and tokenisation, that the annotators must mark during annotation.
There are two types of sentence splitting errors: merged sentences such as (1) in which a sentence boundary was not detected before the word When due to the use of a comma instead of a full stop, and split sentences such as (2).
(1) 7. Combofix will start, When it is scanning don't move the mouse cursor inside the box,
(2) The questions to <CompanyName>:
6 http://www.nltk.org/
Merged sentences are gathered under one root node with the error suffix M (e.g. S M), and split sentences are annotated as if they are standalone. Tokenisation problems can also be categorised as merged (3) or split (4 and 5). Merged tokens are treated as a combination of a spelling error (whenI instead of when) and a deleted token (I).
When the split is morphological as in (4), the resulting tokens are tagged with the POS tag of the whole intended token, along with the error suffix B (for "broken"). So in (4), both anti and vir would be tagged as NN B. When there is no such clean morphological break (as in (5)), the first token is treated as a spelling error and the second as an extraneous token.
(3) whenI tried to use ...
Table 2 presents the average and the maximum sentence length in Foreebank and, for comparison, in the WSJ and FTB. It also gives the out-of-vocabulary (OOV) rate of these data sets with respect to the WSJ and FTB. The Foreebank sentences are shorter on average than the WSJ and FTB sentences. The table also shows that the OOV rate of Foreebank with respect to WSJ/FTB is high: 33.3% for English and 39.1% for French. These numbers can be compared to the OOV rate of the WSJ test set with respect to its training set, which is 13.2%, and that of the FTB test set, which is 20.6%. The higher OOV rate for the French Foreebank compared to the English is most likely due to the larger size of the WSJ compared to the FTB. The OOV rate of the English Foreebank is more than 2.5 times that of the WSJ test set, while the OOV rate of the French Foreebank is less than twice that of the FTB test set. This suggests that a bigger performance drop due to unknown words should be expected in parsing the English Foreebank sentences than the French.
The last four columns in Table 1 display the absolute and relative frequency of each error suffix. In sum, capitalisation is the major error type in Foreebank, especially in the French data. Deleted tokens are also a major source of problems on the English side. Most of the capitalisation errors involve proper nouns (e.g. product names) and most of the deleted tokens are cases of missing punctuation. Overall, the errors occur on only a small fraction of the tokens in both data sets. We also calculate the edit distance between each Foreebank sentence and its correction by summing the number of error suffixes and dividing by the maximum of the original and corrected sentence lengths. The average edit distance for the English section of Foreebank is 0.04 and for the French section 0.03. Despite the existence of some near-incomprehensible sentences, the overall error level is very low.
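The normalised edit distance described above can be sketched in Python as follows. The function names and the toy sentence are hypothetical and are not taken from Foreebank; the sketch also assumes that sentence length is counted in tokens and that an empty sentence pair has distance zero.

```python
from typing import Iterable, Sequence, Tuple


def normalised_edit_distance(num_error_suffixes: int,
                             original_tokens: Sequence[str],
                             corrected_tokens: Sequence[str]) -> float:
    """Number of error suffixes divided by the longer of the two sentences."""
    longest = max(len(original_tokens), len(corrected_tokens))
    if longest == 0:
        return 0.0  # assumption: an empty pair counts as distance zero
    return num_error_suffixes / longest


def average_edit_distance(
    sentences: Iterable[Tuple[int, Sequence[str], Sequence[str]]]
) -> float:
    """Mean distance over (suffix count, original, corrected) triples."""
    distances = [normalised_edit_distance(n, o, c) for n, o, c in sentences]
    return sum(distances) / len(distances) if distances else 0.0


# Hypothetical example: two marked errors in a four-token sentence.
original = ["i", "cant", "find", "them"]
corrected = ["I", "can't", "find", "them"]
distance = normalised_edit_distance(2, original, corrected)
```

Dividing by the longer of the two sentences keeps the value between 0 and 1 as long as each token carries at most one suffix, which makes sentences of different lengths comparable.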

Parsing the Foreebank
We first evaluate newswire-trained parsers on Foreebank, using our in-house PCFG-LA parser with the max-rule parsing algorithm (Petrov and Klein, 2007) and 6 split-merge cycles. The English model is trained on the entire WSJ and the French model on the entire FTB. For comparison, we also parse the WSJ and FTB test sets, for which we additionally use models trained only on the training sections. We remove the error suffixes and any D-suffixed nodes (representing deleted words) from the gold Foreebank trees before evaluation. The results are shown in Table 3. As expected, we see a large drop for both languages when we move from in-domain data to Foreebank. The drop is smaller for French than for English: English falls 14.2 F1 points, from 89.6 to 75.4, and French 5.3 points, from 81.3 to 76. This suggests either that the French parsing model generalises better to the forum text or that the FTB test set is more distant from its training set than the WSJ test set is. The second hypothesis is more likely because 1) it is in line with the OOV rates observed in Section 4, and 2) the performance of the English and French parsers is close on Foreebank but further apart on the newswire test sets. The effect of using the entire WSJ and FTB instead of only their training sections is also worth noting. While adding the WSJ development and test sets (about 5,500 sentences, a 14% increase) improves the F1 of English parsing by 1.6 points, the 2,500 FTB development and test sentences (a 25% increase) have little effect on French parsing, suggesting that these additional sentences either are still too few or do not bring new information to the parsing model.
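The preprocessing of the gold trees mentioned above (removing error suffixes and D-suffixed nodes) could look roughly like the following Python sketch. It is not the authors' code: the underscore separator, the bracketed input format, the toy tree and the decision to drop constituents that become empty are all assumptions made for illustration.

```python
import re

SEPARATOR = "_"                             # hypothetical suffix separator
SUFFIXES = {"C", "S", "W", "D", "M", "B"}   # suffixes named in the text


def parse(text):
    """Parse a bracketed tree into nested lists: [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                            # skip "("
        tree = [tokens[pos]]                # node label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                tree.append(node())
            else:
                tree.append(tokens[pos])    # terminal word
                pos += 1
        pos += 1                            # skip ")"
        return tree

    return node()


def clean(tree):
    """Copy a tree without error suffixes and without D-suffixed nodes.

    Returns None when the node itself has to be removed.
    """
    label = tree[0]
    tag, sep, suffix = label.rpartition(SEPARATOR)
    if sep and tag and suffix in SUFFIXES:
        if suffix == "D":
            return None                     # deleted word: drop the node
        label = tag                         # keep the node, strip the suffix
    kept = []
    for child in tree[1:]:
        if isinstance(child, str):
            kept.append(child)
        else:
            cleaned = clean(child)
            if cleaned is not None:
                kept.append(cleaned)
    if not kept:
        return None                         # assumption: drop emptied constituents
    return [label] + kept


def to_string(tree):
    """Serialise nested lists back into a bracketed tree."""
    parts = [c if isinstance(c, str) else to_string(c) for c in tree[1:]]
    return "(" + " ".join([tree[0]] + parts) + ")"


# Hypothetical gold tree, not taken from Foreebank.
gold = "(S (NP (PRP_C i)) (VP (VBD tried) (NP (DT_D the) (NN_S setings))) (._D .))"
evaluation_tree = to_string(clean(parse(gold)))
```

After cleaning, the tree covers only the tokens the parser actually sees and uses plain tags, so it can be compared directly with parser output.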
Since the annotators correct the errors made by the forum users, we are able to parse the corrected versions of the Foreebank sentences and examine how accurately they are parsed compared to the original sentences. We use the WSJ all and FTB all parsing models described above. Correcting the user errors before parsing leads to an improved parsing F1 of 78.6 for the English sentences, an increase of 1.6 points (2%). A smaller impact is observed on the French sentences, where the edited sentences receive an F1 of 77.1 (an increase of 0.8 points). Referring to the distribution of error suffixes in Table 1, this suggests that the inserted and deleted tokens may have a larger effect on parser error than the substituted tokens, as their number is higher for English. Many substitution errors are capitalisation errors, typically involving a confusion between proper and common nouns, which tends not to affect the surrounding tree.
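The relative improvement quoted above can be checked with a few lines of Python. The baseline scores are not stated directly in this paragraph; the sketch derives them by subtracting the reported gain from the reported F1 of the corrected sentences, and the function name is hypothetical.

```python
def relative_gain(corrected_f1: float, absolute_gain: float) -> float:
    """Relative improvement (in percent) over the implied baseline F1."""
    baseline = corrected_f1 - absolute_gain   # F1 on the uncorrected sentences
    return 100.0 * absolute_gain / baseline


english = relative_gain(78.6, 1.6)   # implied baseline of about 77.0
french = relative_gain(77.1, 0.8)    # implied baseline of about 76.3
```

The English value comes out at roughly 2%, in line with the figure given in the text, and the French value at roughly 1%.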
The simplest method to improve the accuracy of parsing Foreebank is to use it as supplementary training data. We do this using 5-fold cross-validation, in which Foreebank is randomly split into five parts, with each part used for the evaluation of the parsers trained on WSJ/FTB plus the other four parts. The results are shown in Table 4. Combining the larger treebank and Foreebank improves the F1 by 2.6 points for English and 3.2 for French. Considering that Foreebank is much smaller than the WSJ/FTB, these gains are encouraging. We try to overcome the small size of Foreebank by 1) using the EWT as training data, and 2) increasing the weight of Foreebank by training on multiple copies of it. The EWT is not a substitute for the WSJ but it does provide a modest improvement when used in conjunction with Foreebank and WSJ. The replication of Foreebank ...
In all experiments up to now, we have excluded the error suffixes from the Foreebank trees (during training and testing). We next try to directly learn trees containing the error suffixes (except for deleted tokens). That is, we use the original Foreebank trees containing the error suffixes for training and evaluate against Foreebank trees containing the error suffixes. The second last row of Table 4 shows the 5-fold CV results when the version of Foreebank without the error suffixes is used for training, and the last row shows the results when the error suffixes are included. Including the suffixes decreases the accuracy, most likely due to the increased data sparsity caused by the suffixed tags.
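The cross-validation setup with supplementary training data, including the replication of the in-domain trees, can be sketched in Python as below. This is an illustration under assumptions rather than the experimental code: the function name, the `copies` and `seed` parameters and the interleaved way of forming the folds are hypothetical, and trees are represented simply as strings.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def cross_validation_splits(
    base_treebank: Sequence[str],
    forum_treebank: Sequence[str],
    folds: int = 5,
    copies: int = 1,
    seed: int = 0,
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (training, test) pairs for k-fold cross-validation.

    Each test set is one part of the forum treebank; the training set is the
    whole base treebank plus `copies` copies of the remaining parts.
    """
    shuffled = list(forum_treebank)
    random.Random(seed).shuffle(shuffled)               # random split
    parts = [shuffled[i::folds] for i in range(folds)]  # k roughly equal parts
    for held_out in range(folds):
        in_domain = [
            tree
            for index, part in enumerate(parts)
            if index != held_out
            for tree in part
        ]
        training = list(base_treebank) + in_domain * copies
        yield training, parts[held_out]


# Toy data standing in for WSJ/FTB trees and Foreebank trees.
base = [f"base-{i}" for i in range(10)]
forum = [f"forum-{i}" for i in range(10)]
splits = list(cross_validation_splits(base, forum, folds=5, copies=3))
```

Setting `copies` above 1 increases the weight of the in-domain trees relative to the larger treebank without ever placing a held-out tree in the training data.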