Enriching ASR Lattices with POS Tags for Dependency Parsing

Parsing speech requires a richer representation than 1-best or n-best hypotheses, e.g. lattices. Moreover, previous work shows that part-of-speech (POS) tags are a valuable resource for parsing. In this paper, we therefore explore a joint modeling approach of automatic speech recognition (ASR) and POS tagging to enrich ASR word lattices. To that end, we manipulate the ASR process from the pronouncing dictionary onward to use word-POS pairs instead of words. We evaluate ASR, POS tagging and dependency parsing (DP) performance demonstrating a successful lattice-based integration of ASR and POS tagging.


Introduction
Parsing speech is an essential part (Chow and Roukos, 1989; Moore et al., 1989; Su et al., 1992; Chappelier et al., 1999; Collins et al., 2004) of spoken language understanding (SLU) and is difficult because spontaneous speech and syntax clash (Ehrlich and Hanrieder, 1996; Charniak and Johnson, 2001; Béchet et al., 2014). Pipeline approaches that concatenate a speech recognizer, a POS tagger and a parser often rely on n-best hypotheses decoded from lattices. While n-best hypotheses cover more of the hypothesis space than the 1-best hypothesis, they are redundant and incomplete. Lattices, on the other hand, efficiently represent all hypotheses under consideration and therefore allow recovery from more ASR errors. Recent work on recurrent neural network architectures with lattices as input (Ladhak et al., 2016; Su et al., 2017) promises the use of enriched lattices in SLU.
The main contribution of this work is establishing a joint ASR and POS tagging approach using the Kaldi (Povey et al., 2011) toolkit. To that end, we enrich the ASR word lattices with POS labels for all possible hypotheses on the word level. This enables subsequent natural language processing (NLP) machinery to use these syntactically richer lattices. We present our proposed method in detail including Kaldi specifics and address problems that occur when data that requires both speech and text information is used. Our results show a slight but consistent improvement of the joint model throughout the evaluations in ASR, POS tagging and DP performance.

Resources
We need a data resource with rich annotations for training our integrated model. Since the training process requires audio transcriptions, POS labels and gold-standard syntax annotations, all of these need to be available. Following the general premise in data-driven methods that more data is better data, we choose the Switchboard-1 Release 2 corpus (Godfrey et al., 1992) with about 2400 dialogs. The Switchboard (SWBD) corpus has more recently been furnished with the NXT Switchboard annotations (Calhoun et al., 2010). NXT provides a plethora of annotations and, most importantly for our work, an alignment of the Treebank-3 (Marcus et al., 1999) text and the SWBD transcriptions. While the Treebank-3 corpus provides syntax and POS tags, the transcriptions are timestamped. The alignment of these two resources offered by the NXT corpus contains all necessary annotations.

Audio
Kaldi's SWBD s5c recipe subsets the SWBD (LDC97S62) corpus into various training and development sets for acoustic model (AM) and language model (LM) training. For ASR evaluation, the s5c recipe uses a separate evaluation corpus, LDC2002S09, of previously unreleased SWBD conversations (Linguistic Data Consortium, 2002), which was not available to us. Likewise unavailable were the Fisher corpora LDC2004T19 (Cieri et al., 2004) and LDC2005T19 (Cieri et al., 2005), which contain transcripts of conversational telephone speech for language modeling. We therefore use only the available SWBD data (the training set in the s5c recipe) and split it into training, development and evaluation sets. Rather than following the s5c subsets, we split our sets after the Treebank-3 splits proposed by Charniak and Johnson (2001). This leaves less training data than the standard s5c recipe, but yields splits common in parsing; as a consequence, our results are not directly comparable to other results generated from the Kaldi s5c recipe. A data summary of our SWBD splits is given in Table 1. The lmdev section of the SWBD corpus serves as the LM's development set and was "reserved for future use" (Charniak and Johnson, 2001, p. 121).
Both transcripts are represented in NXT, with an NXT pointer to link equivalent words in the two versions. Section 5.1 describes the method used to create the alignment between the two transcriptions. We refer to the words from the Treebank3 transcript as words and the words from the MS-State transcript as phonwords, since the MS-State transcript words have start and end times in the audio file and hence are slightly more phonetically grounded. The double inclusion does result in redundancy, but has the advantage of retaining the internal consistency of prior annotations.
For the most part, the MS-State transcription is more accurate than the Treebank3, so the other option would have been to attach all of the annotations that were derived from the Treebank transcription to the MS-State transcription and discard the original Treebank transcription. However, attaching the Treebank annotations exactly as they are would have made the resource difficult for the end-user to interpret. For instance, where the MS-State transcription adds words to the original, the syntactic annotation would appear inconsistent. On the other hand, creating new annotations to cover the changed portions of the transcription would have been time-consuming for little gain and would have greatly complicated the relationship between the NXT-format data and the original. Figure 1 shows our solution diagrammatically. As can be seen, where there are differences in the representation of a word in the two transcripts (e.g. in the treatment of contractions like doesn't), one Treebank3 'word' may link to more than one MS-State 'phonword', or vice versa.
An extract of the XML representation of 'words' and 'phonwords' is given below (doesn't from Fig. 1).
Fig. 1: Representation of the MS-State and Treebank3 Switchboard transcripts in NXT. Words in the Treebank3 transcript are represented by 'word' elements in one NXT layer, while those in the MS-State transcript are represented by 'phonword' elements in an independent layer. Representations of the same word in the two transcripts are linked by an NXT pointer labeled 'phon'. In some cases, such as contractions, words are tokenized differently in the two transcripts, so there may be multiple 'words' pointing at a 'phonword' or vice versa. Note that the star (*) shows that this structure is the expansion of the abbreviated word/phonword structure shown in Fig.
One aspect of choosing the Treebank-3 over the MS-State transcription is the incongruity of utterances (cf. Calhoun et al., 2010, ch. 3.3, p. 393ff). Training and evaluation become easier if the utterances are congruent in the transcription and the Treebank-3 data with the syntactic parses. We decided to directly base the transcriptions on these annotations.

Syntax annotation
The linguistic structure annotated in the SWBD Treebank-3 section is available through the NXT Switchboard annotations and is based on the Treebank-3 text. Choosing the Treebank-3 transcription as the gold standard for the ASR system directly yields Treebank-style tokens in the recognized speech. The POS tagset (Calhoun et al., 2010, p. 394) consists of the 35 POS tags in the Treebank-3 tagset. Disfluencies in the SWBD corpus are annotated following Shriberg (1994), and they are present in the Treebank-3 annotations.

Proposed method
First, we describe in detail the ASR component, which is based on the default Kaldi s5c recipe and generates POS-enriched word lattices. Second, we introduce the POS taggers considered for the pipeline system. Third, we briefly characterize the dependency parser in our experiments.

ASR with POS tagging
Starting from the s5c recipe, all but the acoustic modeling part underwent significant changes. The pronouncing dictionary (or lexicon), LM and resulting decoding graph now all contain word-POS pairs rather than words. We outline this process step by step.
Corpus setup: Our model does not access resources other than the Switchboard-1 Release 2 (LDC97S62, with updates and corrected speaker information) data, the MS-State transcription and the Switchboard NXT corpus as described in Section 2. All transcription-based resources are lowercased, as they are in the s5c recipe scripts.
Transcription generation: To get a Treebank-style transcription, we query the NXT annotation corpus for pointers from MS-State tokens to Treebank-3 tokens. With this mapping, we pick the POS tags for the Treebank-3 orthography and the timestamps for the MS-State words. An example of the POS-tagged gold standard transcription is: "are|VBP you|PRP ready|JJ now|RB".
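As an illustration of the word-POS pair format, the following minimal sketch splits a POS-enriched transcription into pairs and strips the tags again, as is needed for a word-only evaluation. The function names and the choice of splitting on the last "|" are our assumptions for illustration and are not taken from the recipe scripts.

```python
def split_pairs(line, sep="|"):
    """Split a POS-enriched transcription into (word, tag) tuples."""
    pairs = []
    for token in line.split():
        if sep not in token:
            raise ValueError("token without POS tag: " + token)
        # Assumption: the tag follows the LAST separator in the token.
        word, _, tag = token.rpartition(sep)
        pairs.append((word, tag))
    return pairs


def strip_tags(line, sep="|"):
    """Return the word-only version of a POS-enriched transcription."""
    return " ".join(word for word, _ in split_pairs(line, sep))


gold = "are|VBP you|PRP ready|JJ now|RB"
print(split_pairs(gold))
print(strip_tags(gold))
```

Keeping word and tag in a single token means that any tool operating on whitespace-separated tokens can process the enriched transcription without modification.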
POS-enriched lexicon: We first extend the lexicon with some handcrafted lexical additions for contractions of auxiliaries and adjust for tokenization differences between the source MS-State format and the target Treebank-3 format. The pronunciation of the resulting partial words is taken from the respective full entries in the dictionary supplied with the MS-State transcriptions. The lexical unit "won't", for example, is mapped to the pronunciation "w ow n t" in the MS-State version, but cannot readily be merged from the existing partial words ("wo" and "n't") in the MS-State lexicon and is therefore a lexical addition. Other auxiliaries, such as "can't", which needs to be split as "ca n't" to conform with the Treebank-3 tokenization, and partial words in general are added automatically during lexicon conversion wherever all partials exist.
For all gold standard occurrences of word-POS combinations, we copy the words' pronunciations for all of the POS tags they occur with. Partial words starting with a hyphen are automatically added to the lexicon without the hyphen to account for tokenization differences. Duplicate word-POS pairs are excluded. Figure 2 shows part of the resulting POS-enriched lexicon, where "read" occurs with four different POS tags and two distinct pronunciations. We use "<unk>|XX" for unknown tokens. Note that our scheme can overgenerate word-POS combinations, as it does not check whether the pronunciation variation occurs with all POS tags of a word (compare left and right parts of Figure 2).
Figure 2, POS-enriched lexicon entries:
read|VB r eh d
read|VB r iy d
read|VBD r eh d
read|VBD r iy d
read|VBN r eh d
read|VBN r iy d
read|VBP r eh d
read|VBP r iy d
Figure 2, original lexicon entries:
read r eh d
read r iy d
Language modeling: LM training is performed on the train set with the lmdev set as held-out data. We train the LM on the POS-enriched transcription directly. See Figure 3 for example trigrams. Different from the s5c recipe, we compute trigram and bigram LMs with SRILM (Stolcke, 2002) and "<unk>|XX" as the unknown token. As discussed in Section 2, we did not use SWBD-external resources for mixing and interpolating our LMs. We use SRILM with modified Kneser-Ney smoothing (Chen and Goodman, 1999) with interpolated estimates, and use only words occurring in the specified vocabulary and not in the count files. We report LM perplexity (PPL) on the lmdev held-out data in Table 2. Note that the joint model LM in Table 2 encounters 150 OOV tokens (e.g. hyphenated numerals like "thirty-seven"). The PPLs increase slightly for the joint model because the vocabulary has n entries for each word, where n is the number of POS tags the word occurs with.
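The lexicon expansion described above, in which every pronunciation of a word is copied to each POS tag the word occurs with, can be sketched as follows. This is a minimal illustration under our own assumptions: the function name, the dictionary-based lexicon representation and the silent skipping of words without a lexicon entry are hypothetical and do not reproduce the actual conversion scripts.

```python
from collections import defaultdict


def enrich_lexicon(lexicon, tagged_transcripts, sep="|"):
    """Build word|POS lexicon entries from a word-level lexicon.

    lexicon: dict mapping a word to a list of pronunciation strings.
    tagged_transcripts: iterable of lines made of word|POS tokens.
    Returns a sorted list of (word|POS, pronunciation) entries.
    """
    # Collect every POS tag each word occurs with in the gold transcripts.
    tags_by_word = defaultdict(set)
    for line in tagged_transcripts:
        for token in line.split():
            word, _, tag = token.rpartition(sep)
            tags_by_word[word].add(tag)

    # Copy ALL pronunciations to ALL observed tags. Using a set drops
    # duplicate pairs; the cross product is what can overgenerate.
    entries = set()
    for word, tags in tags_by_word.items():
        # Assumption: words missing from the lexicon are skipped here.
        for pron in lexicon.get(word, []):
            for tag in tags:
                entries.add((word + sep + tag, pron))
    return sorted(entries)


lexicon = {"read": ["r eh d", "r iy d"]}
transcripts = ["i|PRP read|VBD it|PRP", "to|TO read|VB", "read|VBD"]
for pair, pron in enrich_lexicon(lexicon, transcripts):
    print(pair, pron)
```

The cross product of tags and pronunciations makes the overgeneration explicit: in this toy input, "read|VB" receives the pronunciation "r eh d" even though that combination may never be spoken.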
Acoustic modeling: We use the original s5c recipe and only adjust the training, development and evaluation splits after Charniak and Johnson (2001). The AM is a triphone model (with one context phone to the left and right) trained with the speaker-adaptive training (SAT; Anastasakos et al., 1996; Povey et al., 2008) technique using feature-space maximum likelihood linear regression (fMLLR; Gales, 1998). We train this tri4 AM on the training split in Table 1 with duplicate utterances removed.

Baseline POS tagging
We perform POS tagging with three out-of-the-box taggers, two of them with pretrained models, and choose the best one for our baseline pipeline model. The first tagger is NLTK's (Bird et al., 2009) former default maximum entropy-based (ME) POS tagger with the pretrained model trained on WSJ data from the PTB (for an overview, see Taylor et al., 2003); we term it ME.pre. We also train an ME POS tagger that is implemented after Ratnaparkhi (1996) on the first 70,000 sentences of our SWBD training split, described in Section 2, and denote our self-trained model by ME.70k. We configure the ME classifier to use the optimized version of MEGAM (Daumé III, 2004) for speed.
The second tagger is NLTK's current default tagger, based on a greedy averaged perceptron (AP) tagger developed by Matthew Honnibal. We name the AP tagger with the pretrained NLTK model AP.pre, and the same tagger trained on the full training split AP.
To have an NLTK-external industry-standard POS tagger in our comparison, we also run spaCy's POS tagger (see https://spacy.io/; we used spaCy version 1.0.3) with its pretrained English model (also trained with AP learning).

Dependency parsing
In this work, we compare dependency parsing results of (a) the 1-best hypothesis of the baseline tri4 ASR system with the self-trained AP POS tagger and (b) the 1-best hypotheses of our joint model. We use a greedy neural-based dependency parser reimplemented after the greedy baseline in Weiss et al. (2015).
The parser's training set is the gold standard data of the training split and is identical for the tri4 and the Joint-POS model, with 62728 trainable sentences out of 63304 (= 99.09%). In this evaluation, we tune the parser based on development data and use word- and POS-based features. The parser implementation uses averaged stochastic gradient descent, proposed independently by Ruppert (1988) and Polyak and Juditsky (1992), with momentum (Rumelhart et al., 1986). We do not embed any external information.

Results
Our evaluation includes intermediate ASR and POS tagging results and a DP-based evaluation. We evaluate partially correct ASR hypotheses with a simplistic scoring method that allows imprecise scoring when the recognized sequence of tokens does not match the gold standard.

ASR
We test our joint ASR and POS model against the default tri4 model in an ASR-only evaluation of the 1-best hypotheses. As we generate the word-POS pairs jointly and they are part of the ASR hypotheses, we strip the POS tags for the word-only evaluation in Table 3. We evaluate the ASR step based on word error rate (WER) and sentence error rate (SER).
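For reference, WER and SER can be computed as in the following minimal sketch. It uses a plain Levenshtein distance over whitespace-separated tokens and is not the scoring tool used in the recipe; the function names and the example sentences are ours.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference token
                cur[j - 1] + 1,          # insertion of a hypothesis token
                prev[j - 1] + (r != h),  # substitution (or match, cost 0)
            ))
        prev = cur
    return prev[-1]


def wer_ser(refs, hyps):
    """Return (WER, SER) as fractions for parallel lists of sentences."""
    errors = ref_tokens = wrong_sentences = 0
    for ref, hyp in zip(refs, hyps):
        ref_toks, hyp_toks = ref.split(), hyp.split()
        dist = edit_distance(ref_toks, hyp_toks)
        errors += dist
        ref_tokens += len(ref_toks)
        wrong_sentences += dist > 0
    return errors / ref_tokens, wrong_sentences / len(refs)


refs = ["are you ready now", "i like rock"]
hyps = ["are you ready", "i like rock"]
wer, ser = wer_ser(refs, hyps)
print(f"WER {100 * wer:.2f}%  SER {100 * ser:.2f}%")
```

Because the function only compares tokens for equality, the same code applies to word-POS pair tokens such as "are|VBP", in which case a correct word with a wrong tag counts as a substitution.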

Recall that these results are not directly comparable to other ASR results on the SWBD corpus, because our data splits contain less training data and we use the Treebank-3 transcription. In the original s5c recipe, unaltered apart from the splits (see Section 2.1), the WER on the eval set with the original MS-State transcriptions (48926 tokens, 4331 utterances) is 26.51% with a SER of 67.91%. Compared to the baseline, the results of our Joint-POS model are slightly better for the dev set and eval set in SER, and for the eval set also in WER.

POS tagging
We present an evaluation of our joint model's performance up to the baseline model's POS tagging step. We compare against the POS tagger performance on the 1-best ASR hypotheses in the pipeline approach. As the 1-best hypotheses of the joint and pipeline models can differ, we evaluate the POS tagging step on ASR output against the word-POS pair Treebank-3 gold standard by means of WER.
Table 4 shows that the Joint-POS model consistently outperforms the baseline POS taggers on both sets. The pretrained models clearly have not been trained on speech data and unsurprisingly perform poorly. Our self-trained ME and AP models improve by at least 6% in WER and 15% in SER over the pretrained models. The margin by which our joint model surpasses the self-trained AP tagger is small, with an improvement of 0.25% WER on the dev set and 0.58% WER on the eval set. The self-trained AP tagger performed best among the baseline taggers, and we therefore use it for the DP-based evaluation in the next section.

DP
We evaluate our joint ASR-POS model on the target task by running a dependency parser on POS-tagged 1-best hypotheses. In the competing pipeline model, we score the output of the default tri4 ASR 1-best hypotheses tagged by the AP tagger we trained ourselves. All results in Table 5 and Table 6 show that our joint model does profit from the joint ASR and POS modeling in our approach.
Table 5 features evaluations of six different development and evaluation sets. The sets named dev and eval are the common subsets of token-level correct hypotheses that the pipeline and joint model share and on which they can therefore be compared directly. The sets indexed with a P or J are the token-level correct hypotheses for the pipeline and joint model respectively. As the models are not identical with respect to their 1-best hypotheses that match the Treebank-3 data, we also present the results using all available correctly tokenized ASR hypotheses. Our Joint-POS model consistently outperforms the pipeline tri4 approach, by between 0.24% (eval, UAS) and 1.11% (dev, UAS) on the common subsets. The results are similar for the non-matching subsets.
Note that the results in Table 5 are for the small subset of utterances with a correct token sequence, i.e. where the (converted and filtered) Treebank-3 sentence tokens match the ASR hypothesis words exactly. This restriction allows an evaluation with LAS and UAS because the tokenization is identical and we have gold data for this correct token sequence. To (a) have a more extensive evaluation on all the utterances we have hypotheses for and (b) be able to compare the pipeline and joint approach on the hypothesis coverage and close misses of the correct tokenization, too, we present Table 6.
We cannot use the standard parsing evaluation measures that depend on a correct word sequence to get scores on imperfectly recognized utterances.
We address this problem with a simple but imprecise solution: (1) parse the development and evaluation set using the parser models previously trained and tuned on the common sets (see Table 5); (2) evaluate the parser predictions on the ASR hypotheses against the gold Treebank-3 data with an imprecise scoring method that allows for a mismatch of the gold and predicted token sequence. We introduce two simple scores, unlabeled score (US) and labeled score (LS), with their names derived from UAS and LAS respectively (see Table 6). Recall that UAS requires a relation's head and dependent to match, including their positions, and LAS additionally requires a matching label (or dependency type) on that relation.
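To make the position-dependent definition of UAS and LAS concrete, the following minimal sketch computes both for a single sentence. The representation (one (head index, label) pair per token, with 0 as the root index) and the dependency labels in the example are illustrative assumptions and do not reflect the annotation scheme of our data.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) in percent for one sentence.

    gold, pred: lists with one (head_index, label) pair per token.
    Both lists must describe the same token sequence, which is why
    these scores are unavailable for misrecognized utterances.
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("UAS/LAS need identical, non-empty token sequences")
    # UAS: the head position matches; LAS: head position and label match.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    las_hits = sum(g == p for g, p in zip(gold, pred))
    return 100.0 * uas_hits / len(gold), 100.0 * las_hits / len(gold)


# Hypothetical analysis of "are you ready now" (head indices are 1-based).
gold = [(3, "cop"), (3, "nsubj"), (0, "root"), (3, "advmod")]
pred = [(3, "cop"), (3, "nsubj"), (0, "root"), (1, "advmod")]
print(attachment_scores(gold, pred))
```

The length check is the point of the sketch: as soon as the ASR hypothesis inserts or deletes a token, head indices no longer refer to the same words and the comparison is undefined.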
The imprecision in the US and LS scoring stems from ignoring the positions of head and dependent in the utterance completely. We iterate over the utterances and for every token (or dependent) look up its head (word) and count this relation as a US match if the lookup is successful. When there is a US match, we also check for a matching label and count that as an LS match. The US and LS counts are normalized by the number of tokens in the Treebank-3 reference. The improvement our Joint-POS model shows over the pipeline tri4 model is small for all scores, but consistent.
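One possible reading of the US and LS scoring described above is sketched below. Several details are assumptions on our part for the purpose of illustration: we iterate over the predicted relations, represent each relation as a (dependent word, head word, label) triple, use the pseudo-word "ROOT" as the head of the root token, and consume each gold relation at most once so that repeated words are not counted twice.

```python
from collections import Counter


def us_ls_matches(gold, pred):
    """Count US and LS matches for one utterance, ignoring positions.

    gold, pred: lists of (dependent_word, head_word, label) triples.
    """
    gold_pairs = Counter((dep, head) for dep, head, _ in gold)
    gold_triples = Counter(gold)
    us = ls = 0
    for dep, head, label in pred:
        if gold_pairs[(dep, head)] > 0:        # head lookup succeeds: US match
            gold_pairs[(dep, head)] -= 1
            us += 1
            if gold_triples[(dep, head, label)] > 0:  # label agrees: LS match
                gold_triples[(dep, head, label)] -= 1
                ls += 1
    return us, ls


def corpus_us_ls(gold_utts, pred_utts):
    """Return (US, LS) in percent, normalized by reference token count."""
    us = ls = total = 0
    for gold, pred in zip(gold_utts, pred_utts):
        u, l = us_ls_matches(gold, pred)
        us, ls, total = us + u, ls + l, total + len(gold)
    return 100.0 * us / total, 100.0 * ls / total


# Hypothetical example: the ASR hypothesis drops the first word "i".
gold = [("i", "like", "nsubj"), ("like", "ROOT", "root"), ("rock", "like", "dobj")]
pred = [("like", "ROOT", "root"), ("rock", "like", "dobj")]
print(corpus_us_ls([gold], [pred]))
```

In this example the deletion costs one of three reference relations, so both scores drop even though the remaining structure is parsed correctly; UAS and LAS could not be computed for this utterance at all.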

Table 6: Parsing results on full dev and eval sets. LAS, UAS, LS and US are given as percentages. The dev set has 3994 utterances with 44760 tokens and the eval set has 3912 utterances with 43277 tokens. Best scores per set in boldface.

DP-based analysis
We tentatively analyze in which cases the joint model does better than the pipeline approach. We first give absolute counts for how often this is the case in Table 7. While the Joint-POS model receives higher counts for all scores, there are also a considerable number of cases where the pipeline model makes fewer mistakes. We pick all examples randomly from the instances counted in the All column of Table 7 and focus on short sentences for presentability.
In the following examples, we highlight the important differences in boldface. In Figure 4, we see a fully correct Joint-POS analysis. While the pipeline approach also recognizes the correct word sequence, a POS tagging error causes the parse to be erroneous on two arcs. This error affects all four scores (UAS, LAS, US and LS), as the parsing model not only misclassifies the label, but also attaches the head of "there" incorrectly. We visualize the error's effect in a correct vs incorrect tree comparison. In Figure 5, we observe a recognition error in the pipeline tri4 model that causes a different reading and syntactic structure. While the hypothesis is acceptable spontaneous speech (e.g. "I like rock.. and like some country music."), "and" would not be the subject of the sentence. The third graph visualization in Figure 6 illustrates an ASR deletion error on the first word. The pipeline tri4 model handles the error gracefully, but nonetheless receives lower US and LS scores because of the token mismatch. If we had not allowed the imprecise evaluation, we would not have observed this kind of error.
The example in Figure 7 also has an ASR error in the pipeline approach at its core. The example utterance in Table 8 contains ASR errors in both models' hypotheses with subsequent errors in POS tagging and parsing. We can glean that discourse interjections like "uh.. uh.." can be misrecognized as regular words, an error characteristic of spontaneous speech. Note that the joint model gets the word "families" right, but as an object instead of the subject. The pipeline model produces four word errors in sequence and "families" does not appear in its hypothesis.

Related work
Spoken language poses a variety of problems for NLP. The recognition of spoken language can suffer from poor recording equipment, noisy environments, unclear speech or speech pathologies. It also exhibits spontaneity, ungrammaticality and disfluencies, e.g. repairs and restarts (cf. Shriberg, 1994). Hence, in addition to ASR errors, downstream tasks such as parsing have to deal with these difficulties of conversational speech, whether the ASR output is in the form of n-best sequences or lattices. Jørgensen (2007) removes disfluencies prior to parsing and finds that their removal improves the performance of both a dependency and a head-driven lexicalized statistical parser on SWBD. In a more general joint approach of disfluency detection and DP, Honnibal and Johnson (2014), in contrast to Jørgensen (2007), make use of the disfluency annotations and report strong results for both disfluency annotation and DP. Rasooli and Tetreault (2013) extend the arc-eager transition system (Nivre, 2008) with actions that handle reparanda, discourse markers and interjections, thereby also explicitly using marked disfluencies on SWBD for joint DP and disfluency detection. Where Rasooli and Tetreault (2013) and Honnibal and Johnson (2014) work with SWBD text data, Yoshikawa et al. (2016) are close to our setting and assume ASR output text as parser input. Yoshikawa et al. (2016) create an alignment that enables the transfer of gold treebank data to ASR output texts and add three actions to manage disfluencies and ASR errors to the arc-eager shift-reduce transition system of Zhang and Nivre (2011). While they do not parse lattices or confusion networks directly (lattices can be converted to confusion networks, see Mangu et al., 2000), Yoshikawa et al. (2016) use information from word confusion networks to discover erroneous regions in the ASR output. Charniak and Johnson (2001) parse SWBD after removing edited speech that they identify with a linear classifier. Additionally, Charniak and Johnson (2001) introduce a relaxed edited parsing metric that considers a simplified gold standard constituent parse (removed edited words are added back into the constituent parse for evaluation). Johnson and Charniak (2004) model speech repairs in a noisy channel model utilizing tree adjoining grammars (TAGs). Source sentence probabilities in the noisy channel are computed with a bigram LM and rescored with a syntactic parser for a more global view on the source sentence. The noisy channel is then formalized as a TAG that maps source sentences to target sentences, where repairs are treated as the cleaned target side of the reparanda on the source side. Besides the words themselves, Johnson and Charniak (2004) use POS tags for the alignment of reparandum and repair, which indicates their usefulness in detecting disfluencies. Approaching spontaneous speech issues from another angle, Béchet et al. (2014) adapt a parser trained on written text by means of an interactive web interface (Bazillon et al., 2012) in which users can modify POS and dependency tags by writing regular expressions.
Natural speech poses specific problems, but also comes with acoustic information that can improve the parsing of speech through its incorporation (Tran et al., 2017) or through reranking (Kahn et al., 2005). Handling disfluencies following Charniak and Johnson (2001), Kahn et al. (2005) rerank the n-best parses using a set of prosodic features in the reranking framework of Collins (2000). Kahn et al. (2005) find that combining prosodic features with non-local syntactic features increases F-scores in the relaxed edited metric of Charniak and Johnson (2001). Kahn and Ostendorf (2012) present an approach that automatically recognizes speech, segments a stream of words (e.g. a conversation side/speaker turn) into sentences and parses these. A reranker that can take into account ASR posteriors for n-best ASR hypotheses as well as parse-specific features for m-best parses can then jointly optimize towards WER (n hypotheses) or SParseval (Roark et al., 2006) (n × m hypotheses) metrics (Kahn and Ostendorf, 2012). Ehrlich and Hanrieder (1996) describe an agenda-driven chart parser that considers an acoustic word-level score from a word lattice and can combine a sentence-spanning analysis from partial hypotheses if a full parse is unobtainable. Tran et al. (2017) use speech and text domain cues for constituent parsing in an attention-based encoder-decoder approach based on Vinyals et al. (2015). They show that word-level acoustic-prosodic features learned with convolutional neural networks improve performance.

Discussion
Replacing words with word-POS pairs throughout the ASR process, as described in Section 3.1, increases the search space considerably. We focus on establishing the feasibility of this approach here and do not detail techniques to address this complexity issue. Including prior distributions of word-POS pair occurrences could help disambiguation early on in lattice creation. The LM in the joint model relies on word-POS pairs as well, and a smoothing approach that backs off to n-grams of words instead of n-grams of word-POS pairs would counter the increased sparsity due to the combination of words and their POS tags in the LM part. We only explore instances of errors the joint and pipeline models make in our analysis. A systematic error analysis identifying advantages and disadvantages of the joint model would be interesting, especially with regard to the errors involving contractions and disfluencies. As a negative example for our joint model, we observed the separation of "didn't" as "did" plus "n't" as an ASR error for "did it". A qualitative analysis of error types could indicate whether this is a random or systematic error, and the same is true of the positive examples in Section 5.

Conclusion
We have demonstrated a method to jointly perform POS tagging and ASR on speech. The tagging and parsing evaluations of the pipeline model vs our joint model confirm the successful integration of POS tags into speech lattices. While the improvements over the pipeline approach are small, we enrich lattices with POS tags that allow for lattice-based NLP in future work.