The role of context in neural pitch accent detection in English

Prosody is a rich information source in natural language, serving as a marker for phenomena such as contrast. In order to make this information available to downstream tasks, we need a way to detect prosodic events in speech. We propose a new model for pitch accent detection, inspired by the work of Stehwien et al. (2018), who presented a CNN-based model for this task. Our model makes greater use of context by using full utterances as input and adding an LSTM layer. We find that these innovations improve accuracy on pitch accent detection for American English speech in the Boston University Radio News Corpus from 87.5% to 88.7%, a state-of-the-art result. We also find that a simple baseline that predicts a pitch accent on every content word yields 82.2% accuracy, and we suggest that this is the appropriate baseline for this task. Finally, we conduct ablation tests showing that pitch is the most important acoustic feature for this task and this corpus.


Introduction
Prosody is a rich information source with the potential to improve performance in many spoken NLP tasks (Roesiger et al., 2017; Niemann et al., 1998). In order to make prosodic information available to downstream tasks, many models have been proposed to predict which words in an utterance carry pitch accents: word-level prosodic prominences signaled by a deviation from the speaker's usual pitch, duration, intensity, or some combination of these three features. Identifying pitch accents is helpful since they are often used to signal important or unexpected information. For example, pitch accents in English typically fall on content words, which are generally more informative. When a pitch accent falls on a function word, it indicates that the word is unusually informative, as in the sentence They ran out of toilet paper even before the quarantine, where before is more informative because it contrasts with what might be a default assumption (e.g., during).
Previous pitch accent prediction models include rule-based models (Brenier et al., 2005), traditional machine learning models (Wightman and Ostendorf, 1994; Levow, 2005; Gregory and Altun, 2004), and neural models (Stehwien et al., 2018). Stehwien et al. (2018) (henceforth, SVS18) showed that neural methods can perform comparably to traditional methods using a relatively small amount of speech context: just a single word on either side of the target word. However, since pitch accents are deviations from a speaker's average pitch, intensity, and duration, we hypothesize that, as in some non-neural models (e.g., Levow, 2005), a wider input context will allow the model to better determine the speaker's baseline for these features and therefore improve its ability to detect deviations. In addition, we hypothesize that a recurrent model (rather than the CNN used by SVS18) will also improve performance, since it is better adapted to processing long-distance dependencies.
In this paper, we test these hypotheses by building a new neural pitch accent prediction model that takes in prosodic speech features, text features, or both. Our main contribution is showing that these context-enhancing innovations in the speech-only model improve performance on a corpus of American English speech, yielding higher accuracy than SVS18 and all previous models on this dataset. We also find that a baseline of simply labeling all content words with pitch accents is very robust, matching the performance of the text-only model. We argue that this more robust content-word baseline is the correct baseline for this task. We find that our speech-only model is able to outperform this baseline by detecting some of the cases where a speaker deviates from the predictions of the content-word baseline, and we provide an analysis of which acoustic features contribute most to this improvement.

Models
We build models to predict which words carry a pitch accent, given prosodic speech features, text features, or both as input. The variants are shown in Figure 1 and described below, along with the ways in which we varied the amount of context available to the speech-only model in particular.
Speech-only model. Like SVS18's model, our speech encoder begins with several CNN layers that take a series of frames $f_1, f_2, \ldots, f_n$ as input, where each frame $f_i$ is a vector of 6 acoustic-prosodic features (see §3). These frames are encoded by the CNN, which reduces the overall number of frames by passing a kernel over the input with a stride of 2, resulting in frames $f'_1, f'_2, \ldots, f'_k$. However, rather than predicting the label for a single token at a time, as SVS18 do, our model labels the whole sequence at once. In order to divide the output of the CNN into word tokens, we use the token timestamps provided in the corpus to split the frames at places corresponding to word boundaries in the input, similar to the approach taken in Tran et al. (2018). Each resulting span of frames $[f'_i, f'_{i+1}, \ldots, f'_j]$ contains a different number of frames, since tokens vary in length. To obtain token representations of identical size, we sum across all frames for a given token: $t = \mathrm{sum}(f'_i, \ldots, f'_j)$. Each token embedding $t_1, \ldots, t_m$ is then passed into a bidirectional LSTM, and finally a feed-forward layer that outputs a prediction for each token. The model's hyperparameters are described in detail in Appendix A.1.
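To make the architecture concrete, here is a minimal PyTorch sketch of the speech-only model. This is a sketch under stated assumptions: the layer sizes, kernel width, number of CNN layers, and the boundary bookkeeping are illustrative, not the paper's exact hyperparameters (see Appendix A.1).

```python
import torch
import torch.nn as nn

class SpeechPitchAccentModel(nn.Module):
    def __init__(self, n_feats=6, hidden=128):
        super().__init__()
        # CNN frame encoder; each stride-2 layer halves the frame count.
        self.cnn = nn.Sequential(
            nn.Conv1d(n_feats, hidden, kernel_size=11, stride=2, padding=5),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=11, stride=2, padding=5),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(hidden, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, 2)  # accented vs. unaccented

    def forward(self, frames, boundaries):
        # frames: (n_frames, n_feats) for one utterance.
        # boundaries: list of (start, end) frame indices per token, derived
        # from the corpus word timestamps, downsampled to the CNN's stride.
        x = self.cnn(frames.T.unsqueeze(0)).squeeze(0).T   # (k_frames, hidden)
        # Sum-pool the encoded frames within each token's span to get one
        # fixed-size embedding per token.
        tokens = torch.stack([x[s:e].sum(dim=0) for s, e in boundaries])
        h, _ = self.lstm(tokens.unsqueeze(0))              # (1, n_tokens, 2*hidden)
        return self.out(h).squeeze(0)                      # (n_tokens, 2)
```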
Our full model takes an entire utterance as input and predicts all labels at once, but we also experiment with using only three tokens or a single token as input. In these cases, the model only predicts the label for the central input token. The three-token scenario is designed to be most similar to SVS18's model.

Text-only model. The text-only model is a simple bidirectional LSTM. An embedding for each token is passed to the BiLSTM, and a prediction is made at each timestep. We followed SVS18 in using pretrained 300d GloVe word embeddings (Pennington et al., 2014), although using pretrained embeddings did not improve performance much over randomly initialized embeddings.
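A corresponding sketch of the text-only model, again with illustrative sizes; in practice the embedding layer would be initialized from the pretrained 300d GloVe vectors.

```python
import torch.nn as nn

class TextPitchAccentModel(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)  # init from 300d GloVe
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, 2)

    def forward(self, token_ids):        # (batch, n_tokens)
        h, _ = self.lstm(self.emb(token_ids))
        return self.out(h)               # (batch, n_tokens, 2) per-token logits
```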
Speech+text model. The speech-only and text-only models both include a bidirectional LSTM, so for the combined model, we simply concatenate the embedding for each token generated by the CNN encoder with the pretrained text embedding for that token before passing them to the LSTM.
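The fusion step itself is just a concatenation; a one-function sketch with hypothetical tensor names:

```python
import torch

def fuse_tokens(speech_tok: torch.Tensor, glove_tok: torch.Tensor) -> torch.Tensor:
    # speech_tok: (n_tokens, d_speech) token embeddings from the CNN encoder
    # glove_tok:  (n_tokens, 300) pretrained GloVe embeddings
    # The concatenated vectors are passed to the shared BiLSTM as before.
    return torch.cat([speech_tok, glove_tok], dim=-1)
```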
Baselines. In addition to a majority class baseline, we also report results on a content-word baseline, where all content words (non-stopwords as identified by NLTK) are labelled as carrying a pitch accent. We also report a duration-only baseline, where the input features to the speech-only model are all replaced with the value 1, so the model can only tell how many frames each token contains.
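The content-word baseline is simple enough to show in full; a minimal sketch using NLTK's English stopword list, as in the paper (the tokenization shown is an assumption):

```python
from nltk.corpus import stopwords  # requires nltk.download('stopwords')

STOPWORDS = set(stopwords.words('english'))

def content_word_baseline(tokens):
    # Label 1 (pitch-accented) for every content word, 0 for stopwords.
    return [0 if tok.lower() in STOPWORDS else 1 for tok in tokens]

print(content_word_baseline(
    "they ran out of toilet paper even before the quarantine".split()))
```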

Data and experimental setup
We train and test all models using data from the Boston University Radio News Corpus (hereafter BURNC), a speech corpus of General American English that is partially annotated with prosodic information. The annotated subsection of the corpus that we use includes five speakers (three female, two male), all trained radio journalists reading pre-written news segments. The data we use amount to approximately 2.75 hours of speech. Though this is a limited amount of data, BURNC is one of very few corpora with available prosodic annotations, and it enables us to compare with previous studies that use this resource, including SVS18.
For the speech-only model, we follow SVS18 in using the OpenSMILE toolkit to extract six features, which fall into three broad categories: pitch features (smoothed F0), intensity features (RMS energy, loudness), and voicing features (zero-crossing rate, voicing probability, and harmonics-to-noise ratio). These features are extracted from frames of varying sizes, with frames offset by 10ms. The speech-only model has no access to phone-level or spectral information that might allow it to make predictions based on word identity. The transcription of the speech in this corpus includes marked breaths, which we use to segment the corpus into utterances. Note that there are no explicit correlates of duration in this feature set, though the model has access to the absolute duration of each token via the number of input frames per token. In future work, we could follow Tran et al. (2018) in giving an explicit feature for the duration of a given token normalized by the average duration of that token in the corpus.
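A sketch of that proposed normalized-duration feature; the (word, start, end) input format is a hypothetical stand-in for the corpus timestamps:

```python
from collections import defaultdict

def normalized_durations(utterances):
    # utterances: list of utterances, each a list of (word, start_sec, end_sec).
    total, count = defaultdict(float), defaultdict(int)
    for utt in utterances:
        for word, start, end in utt:
            total[word.lower()] += end - start
            count[word.lower()] += 1
    mean = {w: total[w] / count[w] for w in total}
    # Each token's duration relative to the corpus mean for that word type.
    return [[(end - start) / mean[word.lower()] for word, start, end in utt]
            for utt in utterances]
```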
For the text-only model, we follow SVS18 in removing contractions (e.g. we'll → we), though we diverge in leaving hyphenated tokens in place (e.g. eighty-eight remains eighty-eight).
We perform tenfold cross-validation for all experiments and report the average performance across folds. For results on the test set, we repeat this tenfold cross-validation five times with different random seeds. We report accuracy as our primary metric since this task is a balanced binary classification task.
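A sketch of this protocol, with `train_and_eval` as a hypothetical callable that trains a model on one split and returns its accuracy:

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(examples, train_and_eval, seed=0):
    # Tenfold cross-validation; returns mean accuracy across the folds.
    accs = [train_and_eval(train_idx, test_idx)
            for train_idx, test_idx in
            KFold(n_splits=10, shuffle=True, random_state=seed).split(examples)]
    return float(np.mean(accs))

# Test-set results average five such runs with different seeds:
# scores = [cross_validate(data, train_and_eval, seed=s) for s in range(5)]
```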

Results and discussion
Development set results for the speech-only model using different input contexts and architectures are shown in Table 1. The results confirm our hypotheses: it helps both to include more input context (full utterances rather than only three tokens as in SVS18) and to use an LSTM to permit better use of that context. Note that our full-utterance CNN-only model actually has more parameters than the CNN+LSTM model (∼14M vs. ∼12M), so the improvements of the latter are not just due to model size.
In contrast, development set experiments with the text-only model found little effect of context or architecture (see Appendix A.2), and indeed even our best text-only model is not much better than the content-word baseline, which in turn outperforms SVS18's text-only model (as shown in Table 2 for the test set and Appendix A.2 for the dev set). This suggests that although text-only context might help identify pitch-accented words in principle, even powerful neural models are not able to exploit the right information well (or perhaps require discourse-level context, which we did not provide). This conclusion is further supported by an additional analysis in which we progressively reduced the vocabulary size of the text-only model from 3000 down to 5, with all other word types mapped to 'UNK'. As shown in Figure 2, performance was steady until the vocabulary dropped below 100 words. This strongly suggests that either word frequency or the strongly correlated content/function word distinction is the main source of information for the text-only model. Of course, absolute word duration is also strongly correlated with frequency and with the content/function distinction, and we note that the duration-only speech model also achieves a similar accuracy to the content-word baseline (Table 2).
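A sketch of the vocabulary-reduction analysis: keep only the k most frequent word types and map the rest to 'UNK' before training the text-only model (the function name is illustrative).

```python
from collections import Counter

def truncate_vocab(token_lists, k):
    # Count word-type frequencies over all training utterances.
    freq = Counter(tok for toks in token_lists for tok in toks)
    keep = {w for w, _ in freq.most_common(k)}
    return [[tok if tok in keep else 'UNK' for tok in toks]
            for toks in token_lists]

# Performance stayed roughly flat from k=3000 down to about k=100.
```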
Overall, our best speech-only model outperforms SVS18 and all previous models on this dataset (Table 2), and combining speech plus text yields a small additional improvement. Our analysis shows in particular that the speech-only model outperforms the text-only model in places where the speaker's realization deviates from the content-word baseline: the speech-only model can correctly detect some pitch accents that fall on function words (as in (1a), that; pitch accents are labeled as 1) and some content words that are unaccented (as in (1b), Mary).
(1) a. [example utterance with a pitch accent on the function word that]
b. [example utterance with the content word Mary unaccented]

If we only consider the tokens where the speaker's production deviates from the content-word baseline, the speech-only model achieves 66.7% accuracy, vs. only 38.2% for the text-only model.

Speech feature ablation tests
The duration-only baseline in Table 2 shows that the speech model performs quite well given only information about token length, without access to prosodic features, but also that the prosodic features contribute to the speech-only model's full performance.
In order to determine the relative importance of the prosodic features, we group them into those related to pitch (smoothed F0), intensity (RMS energy, loudness), and voicing (harmonics-to-noise ratio, zero-crossing rate, voicing probability), and ablate one or two sets of features at a time. We test these models both with full-utterance context and with a more limited three-token context, as well as with the full CNN+LSTM architecture and the more limited CNN-only architecture. The results of these experiments on the development set can be seen in Figure 3.
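One simple way to implement such an ablation is to zero out the ablated feature columns before training; the column ordering below is an assumption, and the paper's exact mechanism may differ.

```python
import numpy as np

FEATURE_GROUPS = {
    'pitch': [0],          # smoothed F0
    'intensity': [1, 2],   # RMS energy, loudness
    'voicing': [3, 4, 5],  # ZCR, voicing probability, HNR
}

def ablate(frames: np.ndarray, groups):
    # frames: (n_frames, 6) acoustic features; groups: feature sets to remove.
    out = frames.copy()
    for g in groups:
        out[:, FEATURE_GROUPS[g]] = 0.0
    return out
```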
Pitch seems to play the biggest role of these features, with its ablation leading to the lowest performance in all cases. Voicing appears to be the weakest feature set, actually harming model performance in one case: intensity and voicing features combined underperform intensity features alone.
All three groups of prosodic features seem equally dependent on the inclusion of context, with the removal of the LSTM and the restriction to a three-token context leading to proportionally similar drops in performance. This supports our hypothesis that acoustic correlates of prosody cannot be evaluated in isolation: a high pitch or intensity is only meaningfully high compared to some lower pitch or intensity.

Conclusions
This work demonstrates some important principles for predicting pitch accents from text and speech. First, we show that a speech-only model benefits from having utterance-level context. Second, we show that both the text-only and the speech-only model derive at least some of their performance from being able to distinguish function words from content words; in fact, our BiLSTM-based text model can hardly outperform a content-word baseline. Finally, we show that a speech-only model can successfully predict pitch accents in cases where a text-only model cannot, and that combining text and speech provides only a small additional benefit. These results indicate that the speech-only model uses information available in the prosodic features to surpass the content-word baseline, and that knowing the actual words provides little further useful information.

A.1 Hyperparameters

Figure 4: The performance of the speech-only model given different CNN hyperparameters, tested on the development set using tenfold cross-validation. When varying the CNN filter width, the number of CNN layers was kept fixed at 3; when varying the number of CNN layers, the filter width was kept fixed at 11 frames.
Many of our hyperparameter experiments focused on changes to the CNN that should allow it to process a wider swath of the input at once: adjusting the filter width and adjusting the number of CNN layers. Neither change showed a significant positive effect, and both were harmful when taken to the extreme. As can be seen in Figure 4, given a constant depth of 3 CNN layers, the very narrowest kernels underperformed, but widening the kernel did not consistently produce better performance and eventually degraded it. Likewise, adding CNN layers, which increases the number of input frames viewed by the final CNN layer, was actively harmful to performance beyond a depth of 3 layers.

A.2 Development set results
Development set accuracy (%):

             Speech   Text   Speech+text
Our model     89.1    84.5      89.8