Encoder-decoder models for latent phonological representations of words

We use sequence-to-sequence networks trained on sequential phonetic encoding tasks to construct compositional phonological representations of words. We show that the output of an encoder network can predict the phonetic durations of American English words better than a number of alternative forms. We also show that the model’s learned representations map onto existing measures of words’ phonological structure (phonological neighborhood density and phonotactic probability).


Introduction
The representation of linguistic categories is a fundamental problem in (psycho)linguistics and natural language processing. The formation of complex representations from more basic components is relevant at all levels of linguistic representation: semantic, syntactic, and phonological. Finding good representations for words' phonological structure is critical in psycholinguistics, where we wish to understand the phonological structure of the lexicon, which has been shown to be relevant for language comprehension and production.
The distributional hypothesis defines a word by the context in which it occurs (Harris, 1954; Firth, 1957). This approach has been extended more recently to other types of compositional structures, for example in characterizing the meanings and forms of sentences (Cer et al., 2018; Joulin et al., 2017; Conneau et al., 2017; Devlin et al., 2018). In this paper we explore whether distributional approaches can capture important phonological dependencies.
Specifically, we test the extent to which recurrent encoder-decoder models (Cho et al., 2014; Sutskever et al., 2014) can learn representations that characterize the phonological structure of the lexicon while also having linguistic and psychological validity (Sibley et al., 2008). We propose that this approach can be used to learn viable lexical-level phonological representations. The output of the encoder component of our model yields promising results in the prediction of phonetic duration, outperforming a number of alternate phonological representations of words.

Quantifying a word's phonology
Given a set of discrete phonetic symbols, i.e. graphemes with conventionalized pronunciations such as the International Phonetic Alphabet, it is trivial to represent any word's pronunciation as a sequence of such symbols. Conversely, relating sequences of such symbols (viz. words) to each other, as well as to the entire lexicon, is less obvious. This challenge has led to a proliferation of measurements that characterize a word's phonetic or phonological relationship with all other words in the lexicon. We summarize some salient examples below, and briefly discuss some of their shortcomings.

Metrics insensitive to serial order
Phonological neighborhood density (PND). This measure is defined as the number of words having a Levenshtein edit distance of one from a given word (in terms of phonetic or phonological symbols) (Luce and Pisoni, 1998; Levenshtein, 1966). Under this definition, a word like "cat" has many neighbors, while a word like "molt" has fewer. This measure is simple to calculate and a wide variety of resources exist for obtaining these measures across many languages (Marian et al., 2012; Baayen et al., 1993; Luce and Pisoni, 1998).
While conceptually simple, PND is insensitive to the position of a segment within a word (e.g. word-initial versus word-final substitutions), and so "sat" and "cab" are treated as equally similar to "cat". Additionally, identifying a word's phonological neighbors using the Levenshtein distance metric requires specifying how many sounds can be added, deleted, or substituted, and potentially the allowable edit distance (see footnote 2), increasing the number of choice points in determining what a "neighborhood" is.
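The PND definition above can be made concrete with a short sketch. It assumes a lexicon stored as a dictionary from words to phone lists; the toy lexicon and its ARPAbet-like transcriptions are invented for illustration and are not taken from any of the cited resources.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phone sequences (lists of symbols)."""
    prev = list(range(len(b) + 1))
    for i, phone_a in enumerate(a, start=1):
        curr = [i]
        for j, phone_b in enumerate(b, start=1):
            cost = 0 if phone_a == phone_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def neighborhood_density(word, lexicon):
    """Count lexicon entries exactly one edit away from `word`."""
    target = lexicon[word]
    return sum(1 for other, phones in lexicon.items()
               if other != word and edit_distance(target, phones) == 1)


# Hypothetical toy lexicon (transcriptions are illustrative only).
toy_lexicon = {
    "cat": ["k", "ae", "t"],
    "sat": ["s", "ae", "t"],
    "cab": ["k", "ae", "b"],
    "at": ["ae", "t"],
    "molt": ["m", "ow", "l", "t"],
}
cat_density = neighborhood_density("cat", toy_lexicon)
molt_density = neighborhood_density("molt", toy_lexicon)
```

Note that "sat" and "cab" each add exactly one to the density of "cat", regardless of where in the word the substitution occurs, which is the position insensitivity described above.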
Frequency-weighted phonological neighborhood density. This is an augmented version of PND that weights phonological neighbors in proportion to their lexical frequencies (standardly estimated from large corpora; Marian et al., 2012). So, a more common word like "hat" would contribute more to the neighborhood density of "cat" than a less common word like "cap", even though they are at equal string edit distance. Whether and to what extent density measures should be frequency-weighted is an empirical question, though these measures seem to better reflect psycholinguistic processes than frequency-insensitive measures.
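As a minimal sketch of the frequency-weighted variant, the function below sums the corpus frequencies of a word's one-edit neighbors. The raw-count weighting, the toy lexicon, and the frequency values are all assumptions made for illustration; published resources may use log frequencies or other transforms.

```python
def is_neighbor(a, b):
    """True if phone sequences a and b differ by exactly one edit."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a                      # ensure len(a) <= len(b)
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1                           # skip the shared prefix
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]    # a single substitution
    return a[i:] == b[i + 1:]            # a single insertion or deletion


def weighted_density(word, lexicon, freq):
    """Sum the frequencies of all one-edit neighbors of `word`."""
    target = lexicon[word]
    return sum(freq[other] for other, phones in lexicon.items()
               if other != word and is_neighbor(target, phones))


# Hypothetical transcriptions and invented frequency counts.
toy_lexicon = {
    "cat": ["k", "ae", "t"],
    "hat": ["hh", "ae", "t"],
    "cap": ["k", "ae", "p"],
    "dog": ["d", "ao", "g"],
}
toy_freq = {"cat": 50, "hat": 120, "cap": 30, "dog": 80}
cat_weighted = weighted_density("cat", toy_lexicon, toy_freq)
```

Here "hat" contributes four times as much as "cap" to the density of "cat", although both are at the same edit distance.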
Feature-wise similarity. In the phonological literature it is standard to represent segments as collections of articulatory or acoustic features, e.g. [+voice], [-obstruent] (Chomsky (1968) is the canonical reference). Some linguists (e.g. Frisch (1996), inter alia) have posited that words like "cat" and "cap", which differ only in the place of articulation of their final segments (alveolar versus labial), should be considered more similar than e.g. "cat" and "can", which differ in both voicing and manner of articulation. This measure of similarity is potentially controversial, as there are theoretical and empirical questions as to which features to include, or even whether phonetic features exist at all (Stevens and Blumstein, 1981; Marslen-Wilson and Warren, 1994).

Metrics incorporating serial order
All of the previously described measures effectively characterize words as unordered collections of segments. These characterizations are incomplete because they fail to capture the fact that words unfold over time in usage. Representing the positions of phones within a word is critical for explaining a number of aspects of language processing. For example, the beginnings of words contribute more strongly than their ends to psycholinguistic effects that are attributed to their phonological representations (Levelt et al., 1999; Sevald and Dell, 1994, inter alia), and a word's phonological similarity to the rest of the words in the lexicon has important consequences for speech comprehension (Buz and Jaeger, 2016; Metsala, 1997). Some computational models encode segments as a function of their linear position within a syllable, e.g. in an onset-vowel-coda format (e.g. Dell, 1986; Sevald and Dell, 1994). Other approaches include segment n-grams to encode local aspects of serial order (e.g. Seidenberg and McClelland, 1989; Davis, 2010) and the oft-lamented Wickelphone (Houghton and Hartley, 1996). Most closely related to the present approach, some work has demonstrated the viability of sequence encoder models for representing sequences of characters or phonetic segments (Sibley et al., 2008).
Footnote 2: See e.g. Suárez et al. (2011), who allow edit distances greater than one and track the mean distance to a fixed number of neighbors.

Incorporating variability into representations
Psycholinguistic measures that quantify words' phonological properties in the lexicon generally ignore their variability in pronunciation. In usage, segmental context, or lexical factors such as word frequency, can significantly influence the phonetic realization of a given phone, ranging from assimilatory processes (Ohala, 1990a) to massive reduction and even complete omission (Pitt et al., 2005; Johnson, 2004, inter alia). For example, there are over 200 distinct transcriptions of the word "and" in the Buckeye corpus (Pitt et al., 2005), and its normative, dictionary pronunciation (i.e. [ænd]) only accounts for 3% of its realizations. Measures such as PND rely on single, fixed pronunciations (generally normative/dictionary-based) and corpus-derived lexical frequencies to estimate how many similar-sounding words a given word has, but take no account of variability in realization. As there is evidence that listeners remember and can access/use individual exemplars of perceived speech (Pierrehumbert, 1980; Goldinger, 1998), it seems natural to model distinct realizations within the lexical network. The variability in a word's realizations may especially matter for identifying phonological competitors (Luce and Pisoni, 1998; Marian et al., 2012; Vaden et al., 2009). For example, words like "sand" and "and" may rarely compete during lexical access, given that "and" is rarely pronounced similarly to "sand." By incorporating the variability available in naturalistic speech corpora, we hope to provide a better characterization of a word's phonological properties and its relation to the lexicon.

Latent phonological representations
Representing arbitrary-length sequences of phones with a single distributed representation has a number of potential practical and conceptual advantages. On the practical side, these representations have a fixed dimensionality, so finding meaningful groupings or clusters is computationally more tractable than directly clustering variable-length sequences. Moreover, projecting these sequences into a latent space offers the potential of discovering hidden relationships or variables that affect phonological or lexical structure.
Our aim in this paper is to test whether and to what extent recent approaches to building sentence representations can also be applied to the phonological domain. Both simpler and more complex latent representations can be constructed to characterize the phonological forms of words. We first discuss potential "naïve" means of accomplishing this, and then move into discussion of our proposed model.

Principal components on bag-of-n-phones
A number of document classification schemes and information retrieval tasks have treated documents as a product of the vector representations of words learned by principal components analysis (PCA; Landauer and Dumais, 1997). We apply this to the phonetic domain as well. By analogy to a bag of words, we refer to bag-of-phones (unigram features) and bag-of-n-phones (higher-order segment co-occurrence categories), which can then be fed into a dimensionality reduction algorithm like principal components analysis (PCA) as an approximate composition function to produce latent phonological representations of words.
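A minimal sketch of bag-of-n-phones feature extraction follows. The boundary symbols "<w_s>" and "<w_e>" and the function names are placeholders chosen for this illustration. The resulting count vectors are the kind of input that would then be passed to a dimensionality reduction step, which is not shown here.

```python
from collections import Counter


def bag_of_n_phones(phones, n_values=(1, 2), boundaries=False):
    """Count phone n-grams of the requested orders in one transcription."""
    if boundaries:
        # Hypothetical word-boundary markers (names are placeholders).
        phones = ["<w_s>"] + list(phones) + ["<w_e>"]
    counts = Counter()
    for n in n_values:
        for i in range(len(phones) - n + 1):
            counts[tuple(phones[i:i + n])] += 1
    return counts


def to_count_vector(counts, vocabulary):
    """Project n-gram counts onto a fixed, ordered feature vocabulary."""
    return [counts.get(feature, 0) for feature in vocabulary]


cat = bag_of_n_phones(["k", "ae", "t"])
act = bag_of_n_phones(["ae", "k", "t"])

# Unigram-only bags cannot distinguish the two orderings.
same_unigrams = (bag_of_n_phones(["k", "ae", "t"], n_values=(1,))
                 == bag_of_n_phones(["ae", "k", "t"], n_values=(1,)))

vocabulary = sorted(set(cat) | set(act))
cat_vector = to_count_vector(cat, vocabulary)
```

The comparison of "cat" with its reordering shows why unigram features are insensitive to serial order, while bigram features recover some local ordering.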

doc2vec
Another dimensionality reduction method extends the continuous bag-of-words algorithm used to learn word vectors (Mikolov et al., 2013) to the document domain. Specifically, the model learns to compose (predict) a document (i.e. a word) from its phonological contents. doc2vec (Le and Mikolov, 2014) has been used in information retrieval and natural language processing applications (Lau and Baldwin, 2016) and so may be a viable way to obtain lexical phonological representations. As with bag-of-phones, this model is insensitive to serial order.

Sequential representations
Encoder-decoder or sequence-to-sequence (seq2seq henceforth) neural network architectures have shown considerable success in encoding sentences (viz. sequences of words) for tasks such as machine translation (Sutskever et al., 2014; Cho et al., 2014). These methods may be appropriate as a means of composing segmental representations, as they are intrinsically sensitive to ordering, easily take usage frequencies into account (directly from training corpora), and have been shown to be effective learners of sequential distributional properties of their training data.

Seq2seq model
We trained seq2seq models to either reproduce their input, or to recover (predict) normative (dictionary) pronunciations from the phonetic transcriptions of words in the Buckeye corpus (Pitt et al., 2005), a dataset of monologues provided in response to interviewer questions about the talkers' hometown of Columbus, Ohio. The corpus contains approximately 300,000 words.
Data inclusion criteria. There are some transcription errors in the Buckeye corpus, and so we excluded combinations of phones that did not occur at least ten times. This removes many errors, but a few remain. For example, the segment "h" occurs in some transcriptions but is not part of the character set of the transcription dictionary, and is thus likely an error of omission for actual digraphs from the dictionary (e.g. "th", "hh"). Despite the presence of these remaining errors, we do not correct the transcriptions of any words. In total, 57 phone/segment categories are represented. Full documentation of the coding scheme used in the corpus can be read in Pitt et al. (2005). For bag-of-n-phones features, we add the additional characters "w s" and "w e" as word boundary characters, signaling the starts and ends of words, respectively.
There are no standard train/dev/test splits for the Buckeye corpus (Pitt et al., 2005), and so we restricted ourselves to a randomly selected 80/20 train/test split for training all models.
Model architecture. Methodologically, we approach the problem with an eye to restricting the computational power of our model, and to restricting the space of hyperparameters to explore. To this end, our models use a basic recurrent encoder-decoder architecture, with an input-side embedding layer, and single-layer, unidirectional LSTMs (Hochreiter and Schmidhuber, 1997) on the encoder and decoder sides. The encoder takes as input a sequence of phone indices (e.g. "cat" → [11, 1, 20]), embeds them, and encodes the sequence in the space defined by the LSTM. The encoder LSTM's final hidden state is provided as input to the decoder, whose task is to "unroll" this latent representation. The outputs of the decoder LSTM are successively fed through a softmax, sequentially outputting class probabilities for each character class in the phone vocabulary, which are then decoded via simple argmax (see Figure 1).
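The final decoding step (softmax followed by argmax) can be sketched in isolation. In this illustration the per-step scores are supplied directly rather than produced by an LSTM, and the end-of-sequence symbol "<eos>" is an assumption, since the stopping criterion is not described here.

```python
import math


def softmax(scores):
    """Convert raw scores into class probabilities (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_decode(step_scores, index_to_phone, eos="<eos>"):
    """Pick the most probable phone at each step, stopping at `eos`."""
    decoded = []
    for scores in step_scores:
        probs = softmax(scores)
        best = max(range(len(probs)), key=probs.__getitem__)
        phone = index_to_phone[best]
        if phone == eos:       # hypothetical end-of-sequence marker
            break
        decoded.append(phone)
    return decoded


# Invented scores over a four-symbol vocabulary, one row per decoder step.
index_to_phone = ["<eos>", "k", "ae", "t"]
step_scores = [
    [0.1, 2.0, 0.3, 0.2],
    [0.0, 0.5, 3.1, 0.4],
    [0.2, 0.1, 0.3, 2.5],
    [4.0, 0.1, 0.2, 0.3],
]
decoded = greedy_decode(step_scores, index_to_phone)
```

In a trained decoder, each row of scores would depend on the decoder LSTM state at that step; only the output side of the pipeline is shown.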
The number of training epochs was determined empirically as the point at which training loss reached an asymptote (25 epochs). We used a cross-entropy loss function and the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.001. Other Adam parameters were at default values in the DyNet Python implementation as of this writing (version 2.0.3; Neubig et al., 2017). All hyperparameters were selected on the basis of asymptoting loss on a small subset of the training set. The embedding layer had 32 dimensions, and the encoder and decoder LSTMs were 64-dimensional.
Tasks. We trained two models to perform slightly different decoding tasks: the Normative Decoder model and the Observed Decoder model. In both tasks, the inputs are transcriptions of observed realizations of words in the Buckeye corpus, which include e.g. phonetic changes and omissions. The Normative Decoder's task is to output the word's normative pronunciation (e.g. [k, ae, tq] → [k, ae, t]), while the Observed Decoder model is trained as a sequential autoencoder (e.g. Chung et al., 2016); the task is to reproduce the input sequence exactly. Both are potentially viable approaches to the creation of lexical phonological representations and show similar performance in the downstream tasks reported on below. This similarity may be useful for researchers who only have access to normative pronunciations.
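The difference between the two tasks lies only in the decoding target, which the following sketch makes explicit. The token list, the second example pair, and the helper names are invented for illustration; the index assignment is simply alphabetical and does not reproduce the indices used in our models.

```python
def build_vocab(sequences):
    """Map each phone symbol to an integer index (alphabetical order)."""
    phones = sorted({p for seq in sequences for p in seq})
    return {p: i for i, p in enumerate(phones)}


def make_pairs(tokens, task):
    """Build (input, target) pairs from (observed, normative) transcriptions."""
    if task == "observed":
        # Sequential autoencoder: reproduce the observed input exactly.
        return [(observed, observed) for observed, _ in tokens]
    if task == "normative":
        # Recover the dictionary pronunciation from the observed input.
        return [(observed, normative) for observed, normative in tokens]
    raise ValueError(f"unknown task: {task}")


def encode(phones, vocab):
    """Convert a phone sequence into a list of indices."""
    return [vocab[p] for p in phones]


# Hypothetical tokens: (observed realization, normative pronunciation).
tokens = [
    (["k", "ae", "tq"], ["k", "ae", "t"]),
    (["ae", "n"], ["ae", "n", "d"]),
]
vocab = build_vocab([seq for pair in tokens for seq in pair])
observed_pairs = make_pairs(tokens, "observed")
normative_pairs = make_pairs(tokens, "normative")
first_input_ids = encode(normative_pairs[0][0], vocab)
```

Both tasks share the same encoder inputs, so the encoder representations from either model can be extracted in the same way after training.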
We evaluated the performance of the model on the 20% held-out portion of the corpus.

Lexical representations
Once the model is trained, any sequence of phones can be input to the encoder, yielding a latent phonological representation of that sequence. As with character-based NLP models, the comparatively low dimensionality of the input space (57 segments) mitigates sparsity issues. Consequently, we can obtain latent phonological representations not just of vocabulary words seen in training but also of rare, out-of-vocabulary (OOV) words and non-words. We plot some aspects of the learned representations in Figures 2 and 3. One pattern that is particularly apparent is that the left-to-right serial nature of the encoder leads to representations that strongly encode the final segment, for both consonants and vowels.

Evaluation
As a preliminary investigation of the information encoded in the learned lexical representations, we assess their ability to model phonetic duration, which is known to be sensitive to phonotactic probability and phonological overlap (Gahl et al., 2012; Buz and Jaeger, 2016; Yiu and Watson, 2015; Goldrick and Larson, 2008; Vitevitch and Luce, 2005), in addition to other factors like contextual predictability (e.g. Cohen Priva and Jaeger, 2018; Seyfarth, 2014). We show that the encoder creates sequence representations that are useful for predicting word duration, and compare the success of the encoder to several other models, described below.

Predicting word duration
Ultimately we are interested in whether latent phonological representations have predictive validity for phonetic cues, potentially in conjunction with other phonological and lexical representations. Word duration has been shown to be strongly related to phonological structure (Gahl et al., 2012), because duration may reflect the mechanics of the phonological sequencing process in language production (Yiu and Watson, 2015; Fox et al., 2015) or because speakers lengthen words in dense neighborhoods to promote the listener's understanding (Tily and Kuperman, 2012).
We built a series of nested statistical models designed to predict whole-word phonetic duration. The durations were obtained by summing up the durations of each of the annotated phonetic segments for an individual word, which are themselves derived from time stamps extracted from the Buckeye metadata. Whole-word durations were log transformed due to their positive skew; failing to account for this can make statistical inference more difficult (Campbell, 1992). All models were constructed using ridge (L2-penalized) regression using the scikit-learn package in Python (version 0.2.0; Pedregosa et al., 2011). We report goodness of fit in all cases by R² values (the coefficient of determination; provided automatically by the score function within the ridge regression model object).
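The duration target and the goodness-of-fit measure can be sketched without any regression library. The segment time stamps below are invented, and the functions are illustrative helpers rather than the code used for the reported analyses.

```python
import math


def word_duration(segment_times):
    """Sum segment durations given (start, end) time stamps in seconds."""
    return sum(end - start for start, end in segment_times)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


# Invented time stamps for the three segments of one word token.
segments = [(0.10, 0.18), (0.18, 0.30), (0.30, 0.36)]
duration = word_duration(segments)
log_duration = math.log(duration)   # regression target

# R² compares predictions against always predicting the mean.
targets = [-1.2, -0.9, -1.6, -0.7]
mean_target = sum(targets) / len(targets)
r2_perfect = r_squared(targets, targets)
r2_mean = r_squared(targets, [mean_target] * len(targets))
r2_worse = r_squared(targets, [0.0] * len(targets))
```

Because R² is measured relative to a mean-only predictor, a model whose held-out predictions are worse than the mean receives a negative value.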
All duration models were trained on the same 80/20 split that was used to train the encoder-decoder. Consequently, there were 282,742 observations (words) during training, and 70,686 words at test. The vocabulary for the bag-of-words representations was estimated from the training data. All models are summarized in Table 1.

Baseline models
Word embeddings. A word's distributional properties, such as its part of speech and meaning; latent part-of-speech; or word-frequency information may reliably predict a word's duration (Seyfarth, 2014; Turnbull et al., 2018; Priva, 2015). Consequently, we incorporate 100-dimensional word embeddings into the regression models. We obtained these word embeddings from gensim's (Řehůřek and Sojka, 2010) skip-gram implementation trained on the Fisher corpus (Cieri et al., 2004), which we selected due to its size, which is critical for generating good word embeddings (Antoniak and Mimno, 2018), and because it belongs to the same domain as the Buckeye corpus (conversational speech).
The skip-gram model used a context window of 5 words and a negative sampling size of 5. We used a zero vector to represent OOV words (e.g. Columbus, Ohio-specific place names that would not occur in the Fisher corpus). Word embeddings were, on their own, not a strong predictor of word duration (R² = 0.082) on the test set, but nevertheless account for some of the variance in word duration.
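The zero-vector treatment of OOV words can be sketched as a dictionary lookup with a fallback. The embedding table, its three dimensions, and the example OOV word are invented; combining feature sets by concatenation is an assumption about how predictors from different representations could be joined for regression.

```python
def lookup_embedding(word, embeddings, dim):
    """Return the word's vector, or a zero vector if the word is OOV."""
    return list(embeddings.get(word, [0.0] * dim))


def combine_features(*blocks):
    """Concatenate several feature blocks into one predictor vector."""
    combined = []
    for block in blocks:
        combined.extend(block)
    return combined


# Hypothetical 3-dimensional embedding table (real vectors had 100 dims).
toy_embeddings = {"cat": [0.2, -0.1, 0.4]}
known = lookup_embedding("cat", toy_embeddings, dim=3)
unknown = lookup_embedding("olentangy", toy_embeddings, dim=3)

# Join the embedding with an invented phone-count vector.
predictors = combine_features(unknown, [1, 0, 2])
```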
Bag-of-phones models. Bag-of-words representations are a useful and informative baseline in other NLP tasks, especially text classification (Wang and Manning, 2012). We obtained bag-of-phone representations by learning a vocabulary on the training data and creating sparse count vectors in which the features represent individual phones. A simple bag-of-uniphones model, which ignores order information, has greater predictive power than word embeddings on the test set (R² = 0.140). This shows that it is possible to at least partly predict the duration of a given word's realization from relatively unstructured phonological information.
Bag-of-n-phones. Unlike bag-of-words representations, bag-of-n-grams encode localized order information. We constructed n-gram features of phone combinations (bag-of-n-phones) of lengths 2 to 5, using a cutoff frequency of 10 observations. These more complex representations performed similarly to the simpler bag-of-phones model on the test set (R² = 0.140).
We also tested whether incorporating word boundary information into these models ("w s" and "w e" phones) would induce boundary-sensitive phonotactics, but this also did not provide additional gains over simpler models (R² = 0.138 and R² = 0.140).
Principal components analysis over bag-of-n-phones. Following from the previous section, we take our bag-of-n-phones representations and feed them into a truncated singular value decomposition model to obtain latent representations of words ("documents"). This representation explained a slightly greater amount of variance in word duration than word embeddings (R² = 0.106). However, this method performed far worse than the bag-of-phones and bag-of-n-phones models described in the previous section, indicating that some information is lost in this dimensionality reduction method.
doc2vec. Our doc2vec model vectors were trained to predict a word from a phonological representation. The resulting vectors had the same dimensionality as the PCA vectors and the encoder output of the seq2seq models. Surprisingly, doc2vec performed the worst of the models that we considered (R² = -0.05).
seq2seq. The outputs of the encoders for the Observed and Normative decoder models were among the best we considered, both on their own and in conjunction with other measures. Interestingly, the Observed Decoder provides a much closer fit to phonetic duration than word embeddings, bag-of-phones, PCA, doc2vec, and the Normative Decoder representations. When combined with bag-of-phones and word embedding information, the Observed Decoder representations explain the greatest amount of variance in word duration (R² = 0.181), suggesting that these latent phonological representations encode useful information for characterizing word form.
The disparity between the Observed and Normative decoder models may be a consequence of the Normative model's more difficult learning problem. One potential explanation is that despite training the two models for equal lengths of time (25 epochs), the Normative decoder was not trained to the same criterion as the Observed decoder. Future work should explore whether the worse performance of the Normative decoder model is due to the precision of its representations or due to what is embedded in the representations themselves.

Probing phonological structure
While it is clear that seq2seq representations of the phonological forms of words are partially predictive of a phonetic phenomenon (duration), whether the representations encode anything useful about the lexicon requires further investigation. In this section, we explore whether characterizing the similarity space of these phonological word vectors can approximate standard measures of a word's phonological properties. The results show that the vectors produce coherent clusters of words with different phonological properties. We also show that there are correlations between our measures and phonotactic probability.

Latent phonological neighborhood density
While it is not commonly the case that similarity scores follow a normal distribution, in our case, the similarity scores for words are by visual spot inspection roughly symmetric and normally distributed, so we chose to characterize individual words w_i by the mean and standard deviation of their similarity scores to every other word in the lexicon. Although not a priori obvious, one possibility is that these metrics correlate with other lexical metrics. For example, a wide standard deviation could mean that a word has a number of different ways it can be similar to other words, whereas a narrow standard deviation suggests that the word is fairly unique.
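The two word-level summary statistics can be sketched as follows. Cosine similarity and the population form of the standard deviation are assumptions made for this illustration, since the similarity function is not specified in this section; the two-dimensional vectors are invented.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def latent_density(word, vectors):
    """Mean and standard deviation of a word's similarity to all others."""
    sims = [cosine(vectors[word], vec)
            for other, vec in vectors.items() if other != word]
    mean = sum(sims) / len(sims)
    # Population standard deviation (an assumption of this sketch).
    sd = math.sqrt(sum((s - mean) ** 2 for s in sims) / len(sims))
    return mean, sd


# Invented 2-dimensional "encoder outputs" for three word types.
toy_vectors = {
    "word_a": [1.0, 0.0],
    "word_b": [0.0, 1.0],
    "word_c": [-1.0, 0.0],
}
mean_sim, sd_sim = latent_density("word_a", toy_vectors)
```

With real encoder outputs, the same computation would be run once per word type over the full vocabulary.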

The similarity structure of the lexicon
The distributions of similarity scores show some interesting properties. Unlike the measurements of phonological neighborhood density provided in Vaden et al. (2009), which follow a quasi-Zipfian distribution, a histogram of the mean word-lexicon similarities across the whole vocabulary shows a very different pattern. In particular, there appear to be three distinct clusters of similarity scores, as shown in Figure 4. Words in the first cluster, which show negative average similarity scores, were highly frequent words, typically encompassing function words (e.g. but, about, the). The second cluster appeared to include less high-frequency terms (e.g. day, brain, wants). Finally, the rightmost cluster typically had higher similarity scores, representing low frequency and longer words (e.g. devices, widely, element). Going forward, a meta-model will be necessary to determine what factors determine a word's mean lexicon-similarity value.

Correlation with existing phonological properties
Ideally, a new measure of phonological form should relate to measures already known to affect speech production. For example, a significant correlation between an existing measure and the mean or standard deviation of a word's similarity to all the other words in the lexicon would suggest that our measures characterize the lexicon in a similar way to existing measures. Similarly, because our latent representations encode sequences, we expect them to correlate with phonotactic probability (Vitevitch and Luce, 2004). So, as a final set of analyses, we sought to test whether and to what extent the Observed decoder learns representations that can tell us about a word's relationship to the rest of the lexicon.
There are two measures of interest that have received some attention in the speech production literature. For the present analyses, we reference the phonological neighborhood density metrics as well as the phonotactic probability scores for words in Buckeye that are also in the Irvine Phonotactic Online Dictionary (IPhOD; Vaden et al., 2009). We show that our measures (both mean and standard deviation) strongly correlate with phonotactic probability and IPhOD's additional PND measure. This suggests that the vectors' usefulness extends to researchers who wish to explore the phonological similarity structure of the lexicon for psycholinguistic research.
Phonological neighborhood density. Given the importance of phonological neighborhood density (PND) in speech production (Luce and Pisoni, 1998; Vitevitch and Luce, 2005; Metsala, 1997; Mirman, 2011), we correlated the (log) number of phonological neighbors with our latent density scores and phonetic duration. A phonological neighbor is a word that differs by a single sound (either an addition, a substitution, or a deletion; Levenshtein, 1966). PND ((log) # of neighbors, Figure 5) has a strong negative correlation with mean word-lexicon similarity (greater mean similarity translates to fewer neighbors; ρ = -0.59) while the standard deviation of word-lexicon similarity shows a non-linear relationship with neighborhood density.
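For readers who wish to run a comparable analysis, a rank correlation can be computed as below. This sketch assumes that ρ denotes Spearman's rank correlation, which is not stated explicitly in the text; the input values are invented.

```python
import math


def ranks(values):
    """Assign 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                       # extend over a run of ties
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Invented values: higher mean similarity paired with fewer neighbors.
mean_similarity = [0.10, 0.25, 0.40, 0.55]
log_neighbors = [3.2, 2.9, 1.5, 0.7]
rho = spearman(mean_similarity, log_neighbors)
```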
Phonotactic probability. Phonotactic proba-bility is a measure of the phonological typicality of a word, computed from product of uni-phone and bi-phone probabilities of that word pronunciation, in the same fashion that sentence probabilities are computed in a standard bigram language model Luce, 2004, 2005). In our final analysis, we compare the mean and standard deviation of a word's similarity to all other word types, including alternate pronunciations of the same word, to existing measures of phonotactic probability. As with phonological neighborhood density, we see significant positive correlations between our phonological similarity measures (both means and standard deviations; ρ = 0.41 and ρ = 0.13, respectively) between phonotactic probabilities, which we visualize in Figure 5.

Conclusion
The results presented here suggest that encoder-decoder models are a promising framework for composing segment-based representations of words. The models also characterize words' phonological forms relative to the rest of the lexicon. We believe that encoder-decoder models' usefulness extends beyond that of many existing approaches, as they can seamlessly generate gestalt representations for out-of-vocabulary words and even nonce words. Our approach has a number of potential advantages for the cognitive modeling of language processing in both comprehension and production tasks, or indeed in any task that can be modeled with phonological word representations. Importantly, the encoder-decoder modeling framework is flexible, learning both from observed, quasi-phonetic realizations of words as well as from idealized, normative (dictionary-based) pronunciations, and allows for many variations in expressivity and computational power. The reported correlations between phonological neighborhood density, phonotactic probability, latent phonological similarity, and phonetic duration motivate a need to better understand the embedding representations themselves. We have presented considerable evidence that the models capture some non-trivial dependencies between phonetic segments that can characterize word forms. Going forward, we believe that our latent phonological representations may be useful for designing stimuli, or provide an alternative to standard covariates in psycholinguistic experiments such as phonological neighborhood density and phonotactic probability. Finally, our results on the Normative Decoder suggest that these representations can also be learned for low-resource languages that have only a pronunciation dictionary, assuming that there is a corresponding corpus of conversational data. In sum, we have demonstrated that our approach is useful for modeling phonological structure.