Acquiring language from speech by learning to remember and predict

Classical accounts of child language learning invoke memory limits as a pressure to discover sparse, language-like representations of speech, while more recent proposals stress the importance of prediction for language learning. In this study, we propose a broad-coverage unsupervised neural network model to test memory and prediction as sources of signal by which children might acquire language directly from the perceptual stream. Our model embodies several likely properties of real-time human cognition: it is strictly incremental, it encodes speech into hierarchically organized labeled segments, it allows interactive top-down and bottom-up information flow, it attempts to model its own sequence of latent representations, and its objective function only recruits local signals that are plausibly supported by human working memory capacity. We show that much phonemic structure is learnable from unlabeled speech on the basis of these local signals. We further show that remembering the past and predicting the future both contribute to the linguistic content of acquired representations, and that these contributions are at least partially complementary.


Introduction
How children acquire language from the environment is one of the fundamental mysteries of cognitive science. Much theoretical, experimental, and computational research into this question has focused on acquiring abstractions over lower-order symbols, such as acquiring morphemes from phoneme sequences or syntactic structures from word sequences (Chomsky, 1965; Gold, 1967; Elman, 1991; Saffran et al., 1996; Albright, 2002; Klein and Manning, 2004; Goldwater et al., 2009; Christodoulopoulos et al., 2012, inter alia). Children, however, do not get symbolic input; symbolic representations at any level of granularity constitute abstractions inferred from highly variable, noisy, and information-rich perceptual signals like audition and vision. This work joins a growing computational literature exploring the kinds of architectures and learning objectives that best support acquisition of linguistic representations directly from the speech signal without supervision (Versteegh et al., 2015; Dunbar et al., 2017). Such models can be used to test questions about language acquisition under more realistic assumptions about the input signal, especially to the extent that they reflect known constraints on human cognition (Shain and Elsner, 2019; Beguš, 2020).
This study uses computational modeling to examine two influential and possibly complementary ideas about how people learn abstract representations, including language, from data: learning to remember, and learning to predict. Both hypotheses have been advocated by prior work in language acquisition, cognitive neuroscience, and computational modeling, yet their relative contributions to language learning are not yet clear. Our model permits precise manipulation of memory and prediction pressures during acquisition, allowing direct comparison of these hypotheses.
In so doing, we implement several constraints on real-time language processing that have not been simultaneously present in prior modeling of this domain: (1) we jointly segment and label the speech signal without supervision; (2) the learning objective is applied incrementally during real-time processing using only locally available feedback; (3) the encoded signal is segmental, sparse, and hierarchically organized; (4) segments are represented featurally as patterns of activation, rather than discrete category symbols; and (5) the system is optimized by modeling its own state at multiple timescales, rather than by modeling the data alone.
Results show a systematic improvement along multiple measures of phoneme induction quality from both learning to remember and learning to predict, suggesting that these two kinds of signals may play complementary roles during child language acquisition. The contributions of this work are as follows:
• We propose a novel deep neural encoder-decoder for unsupervised speech processing that is incremental, segmental, and useful for testing hypothesized cognitive constraints.
• We show empirically that memory-based and prediction-based signals contribute separately to the acquisition of linguistic regularities, simultaneously supporting two existing classes of theories about the learning pressures that underlie human language acquisition.

Memory, Prediction, and Learning
Many proposals from the language acquisition literature appeal to memory pressures as a learning signal (Newport, 1990; Pinker, 1991; Carstairs-McCarthy, 1994; Rissanen and Ristad, 1994; Baddeley et al., 1998; Goldsmith, 2003; Yang, 2005, inter alia). For example, Baddeley et al. (1998) invoke constraints on working memory, arguing that because the speech signal is too rich to support full retention during real-time language processing (Baddeley and Hitch, 1974), infants are guided toward phonemic representations, which constitute an efficient encoding of that signal. Meanwhile, classical theories of language acquisition such as Newport (1990) and Pinker (1991) invoke constraints on long-term memory, arguing that linguistic regularities constitute compressed descriptions of the learner's input and that their discovery reduces the amount of information that must be idiosyncratically stored. Artificial language learning patterns in humans (Kersten and Earles, 2001) and recent computational modeling of the speech domain (e.g. Lee and Glass, 2012; Lee et al., 2015; Kamper et al., 2015; Elsner and Shain, 2017; Kamper et al., 2017a; Shain and Elsner, 2019) have supported a contribution from memory constraints to language learning. This position also aligns with an extensive computational neuroscience literature on sparse coding, which holds that biological neurons are tuned for memory-efficient representations of recent stimuli (Attneave, 1954; Olshausen and Field, 1996, 2004; Sheridan et al., 2017).
Nonetheless, debate exists about the role of memory in language learning. For example, Rohde and Plaut (1999) fail to replicate findings from Elman (1993) in favor of Newport (1990). In addition, Perfors (2012) fails to find evidence that memory bottlenecks encourage discovery of underlying linguistic regularities in adults and argues that such limitations only support language learning in concert with strong inductive priors. Furthermore, evidence suggests that mental representations during language processing preserve acoustic details over and above symbolic codes (Andruski et al., 1994;McMurray et al., 2002). Related work has called into question both the memory efficiency of human mental representations and the severity of long-term memory limits. For example, experimental evidence indicates that human mental representations contain redundant information, both of language (Baayen et al., 1997) and of other constructs such as logical relations (Piantadosi et al., 2016). In addition, recent estimates of mental storage requirements indicate that lexical information, especially semantics, already requires vastly more storage than e.g. phonemes and syntax, suggesting little added memory benefit from optimizing the efficiency with which regularities are stored (Mollica and Piantadosi, 2019). Finally, recent computational evidence linking memory bottlenecks to success in unsupervised speech processing has relied on storage of arbitrarily long acoustic sequences in their full detail in order to compute reconstruction losses (Kamper et al., 2015;Elsner and Shain, 2017). This design is inconsistent with known constraints on the storage duration (< 1s) of unanalyzed acoustic traces in human working memory (Baddeley and Hitch, 1974;Cowan, 1984). It is thus not yet clear (1) how strongly memory pressures constrain mental representations of speech or (2) how much they encourage language learning. 
Memory efficiency is not the only objective that can be constructed to learn abstractions over data without supervision. It has also been proposed that language learning may be driven by optimizing prediction of future input (Rohde and Plaut, 1999;Johnson et al., 2013;Phillips and Ehrenhofer, 2015;Apfelbaum and McMurray, 2017). This proposal aligns with an extensive neuroscience literature arguing that predictive coding for future inputs is a "canonical computation" of the human brain (Keller and Mrsic-Flogel, 2018) and may better characterize the tuning of biological neurons than sparse coding (Singer et al., 2018), possibly because prediction affords advantages in critical tasks (Nijhawan, 1994) and may help organisms filter noise from the perceptual signal by focusing attention on features relevant to prediction (Bialek et al., 2001). Additional support for a role of prediction in language learning comes from the success of incremental language models in natural language processing, which optimize prediction of future words (Ney et al., 1994;Heafield et al., 2013;Jozefowicz et al., 2016;Radford et al., 2019). Language models support dramatic performance improvements in language processing tasks (Radford et al., 2019) and have been shown to both (1) acquire linguistic abstractions without direct supervision (Linzen et al., 2016) and (2) covary with human language comprehension measures (Frank and Bod, 2011;Goodkind and Bicknell, 2018;van Schijndel and Linzen, 2018). Finally, experimental evidence indicates that infants chunk the speech stream at points of low transition probability, suggesting that predictive signals are exploited to learn word-like units (Saffran et al., 1996).
We address these questions computationally by manipulating the presence or absence of memory and prediction pressures in the joint objective of an unsupervised incremental speech processing model, allowing us to quantify the contributions of these two hypothesized learning signals under realistic constraints on real-time processing.

Recurrent, Hierarchical, and Segmental Speech Processing in Humans
Artificial recurrent neural networks such as those employed here were initially proposed as algorithmic-level (Marr, 1982) models of activity in biological neural networks (Little, 1974; Hopfield, 1982), and subsequent studies support ubiquitous recurrence in the cortex (Harris and Mrsic-Flogel, 2013). In addition, influential theories of biological neural information processing argue that biological neural circuits integrate information at multiple hierarchically organized timescales (Kiebel et al., 2008; Hasson et al., 2015; Norman-Haignere et al., 2020). Further neuroscientific evidence indicates that segmentation of the time dimension plays a critical role in human cognition, both in domain-general event processing (Zacks et al., 2001; Jensen, 2006, inter alia) and in speech processing specifically (Sanders and Neville, 2003; Cunillera et al., 2006, 2009; Kooijman et al., 2013; Lee and Cho, 2016, inter alia). Segmentation or "chunking" also plays a central role in several theories of language comprehension (Sanford and Sturt, 2002; Hale, 2006; Frank and Christiansen, 2018) and learning (Monaghan and Christiansen, 2010; McCauley and Christiansen, 2019). Our model incorporates these notions architecturally, with segment boundaries implemented by "detector neurons" that govern information flow between neural populations at larger and smaller timescales (Masquelier, 2018).

Modeling the Mental State
Many theories of linguistic structure posit multiple, hierarchically organized levels of representation (Chomsky, 1957;Goldsmith, 1976). Such theories predict the existence of abstractions over abstractions, latent structures that describe the distribution of other latent structures. This idea accords with recent theories of generalized Bayesian learning in biological agents, in which neural populations are thought to model the activity of other neural populations within their Markov blanket (Friston, 2010). The notion of learning through modeling other elements of the agent's own mental state has been exploited in symbolic computational models of language acquisition (Lee and Glass, 2012;Lee et al., 2015), but not in the context of artificial neural zero-resource speech models, which have so far derived their objective exclusively from the data (Kamper et al., 2017a;Elsner and Shain, 2017). Our approach incorporates this idea by optimizing higher layers to predict the sequence of activations at lower layers.

Related Computational Approaches
This work is part of a growing interest in unsupervised representation learning from raw speech. A symbolic Bayesian framework for joint unsupervised phoneme segmentation and clustering is proposed by Lee and Glass (2012) and extended by Lee et al. (2015). Their system infers a Dirichlet process hidden Markov model to learn a symbolic sequential encoding of the speech stream. A disadvantage of this approach for the present research question is that the categorically distributed phone labels lack any notion of featural relatedness, contrary to widely held assumptions about natural language phonology (Clements, 1985). In addition, the learning signal derives from a next-frame prediction objective, making it difficult to use the model to factorially manipulate memory and prediction pressures. Another recent framework for unsupervised phone segmentation identifies boundaries at points of high surprisal in a frame-level language model (Michel et al., 2017). This approach does not generate segment encodings and cannot straightforwardly be used to test claims about the role of memory in language learning.

Model
Like many prior ANN zero-resource speech processing models (e.g. Kamper et al., 2015, 2017a; Elsner and Shain, 2017; Shain and Elsner, 2019), we employ an encoder-decoder framework. However, unlike previous approaches, our model decodes incrementally and hierarchically, with each layer decoding its inputs at its own timescale over a short window backward into the past and/or forward into the future. The model is thus required not only to describe the input signal (speech), but also its own sequence of latent representations (e.g. phones, words, etc.), much as people are implicitly thought to do in prior symbolic work on unsupervised language learning (Goldwater et al., 2009; Lee et al., 2015). Our encoder model closely follows Chung et al. (2017), and thus the primary technical contribution of this work lies in the cascaded incremental decoder and the layerwise incremental objective described below, both of which are designed to encourage repurposable segment representations based on locally available information. Although encodings are ultimately the quantity of interest in unsupervised encoder-decoder models, prior work has shown that decoder design can be a major determinant of acquired representations (McCoy et al., 2018, 2020). The overall design is schematized in Figure 1. Code is available at https://github.com/coryshain/dnnseg.

Encoder
Our encoder closely follows a hierarchical multiscale extension (HM-LSTM, Chung et al., 2017) of long short-term memory (LSTM) networks (Hochreiter and Schmidhuber, 1997). The encoder consists of multiple LSTM layers linked by discrete boundary neurons that govern memory retention and information flow between layers. When a boundary neuron fires in layer l, it terminates a segment. Layer l then ejects its hidden state representation to layer l + 1, receives top-down input from the hidden state of layer l + 1, and resets its cell state (incremental memory) in order to process the next segment. The hidden state at a boundary thus constitutes a label for the segment terminating at that boundary, which is used to summarize the content of the segment when communicating with other layers. When the boundary neuron at layer l does not fire, layer l + 1 is inert and simply copies its representation forward. As a result, higher layers track information at longer timescales than lower layers, and the segmentation behavior at l determines the input timescale at l + 1. Each layer proceeds by segmenting and labeling its input signal at a timescale learned from data, resulting in a hierarchical sequence of labeled segments. As argued in Chung et al. (2017), this design enforces a trade-off between recurrent information (which is erased by segmentation) and top-down information (which is made available by segmentation). Although the linguistic quality of discovered HM-LSTM segments is not systematically examined in the original proposal (Chung et al., 2017) and recent analysis has called it into question (Kádár et al., 2018), our results indicate that HM-LSTMs can discover segmental structure from speech, at least at the phonemic level.
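The boundary-gated update described above can be sketched as follows. This is a schematic illustration of the control flow only (copy when the layer below is mid-segment; flush recurrent memory after a segment label is ejected; otherwise update): the toy `cell` stands in for the full LSTM update, the hard threshold stands in for the trainable boundary neuron, and all names here are illustrative rather than taken from the released implementation.

```python
import numpy as np

def cell(x, h, W):
    """Toy recurrent update standing in for the full LSTM cell."""
    return np.tanh(W @ np.concatenate([x, h]))

def hm_layer_step(x_t, h_prev, z_below, z_prev, W, w_b):
    """One timestep of one layer in an HM-LSTM-style encoder.

    z_below: 1 if the layer below just closed a segment, else 0.
    z_prev:  1 if THIS layer closed a segment at the previous step.
    Returns (new hidden state, boundary decision at this step).
    """
    if not z_below:
        # Layer below is mid-segment: this layer is inert, copies forward.
        return h_prev, 0
    if z_prev:
        # A segment label was just ejected upward: reset recurrent memory
        # (the cost of a boundary) before processing the next segment.
        h_prev = np.zeros_like(h_prev)
    h_t = cell(x_t, h_prev, W)
    # Boundary "detector neuron": hard-thresholded readout of the state.
    z_t = int(w_b @ h_t > 0.0)
    return h_t, z_t

# Minimal run: layer 1 sees every input frame (z_below is always 1).
rng = np.random.default_rng(0)
d_in, d_h = 4, 8
W = rng.standard_normal((d_h, d_in + d_h)) * 0.5
w_b = rng.standard_normal(d_h)
h, z = np.zeros(d_h), 0
boundaries = []
for t in range(50):
    h, z = hm_layer_step(rng.standard_normal(d_in), h, 1, z, W, w_b)
    boundaries.append(z)
```

In the real model the discrete boundary decision must remain trainable, which Chung et al. (2017) handle with a slope-annealed straight-through-style estimator rather than the non-differentiable threshold used in this sketch.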

Decoder
The decoder consists of two multi-layer attentional sequence-to-sequence (seq2seq) LSTMs with L layers each, one backward-directional (memory) and one forward-directional (prediction). The LSTMs respectively decode the B previous input segment labels and the F following input segment labels given an encoder representation at layer l and input time t. In addition, the predicted sequence from the layer above serves as attention values to inform decoding at the current layer, at a timescale determined by the segmentation patterns of the current layer and the layer below. The internal behavior of the decoder is thus tightly coupled with the segmental behavior of the encoder, providing direct feedback into the encoder decisions. In addition, the label sequence of the decoder at all layers must support both (1) decoding of the perceptual signal (the data), since top-down connections allow higher-level representations to inform lower-level ones, and (2) decoding of lower-level state sequences (see Appendix C for a formal definition of the decoder).

Figure 1: Incremental layerwise encoder-decoder framework. Shown here with 3 layers and a forward/backward window size of 3. Segment boundaries are shown in cyan. Gray arrows indicate information flow through the encoder, as governed by the boundary decisions. Colored arrows indicate information flow from encodings to decoder targets in the backward (orange) and forward (green) directions, starting from the encoded timestep at the center of the figure.

Objective
We employ an incremental layerwise objective that both reconstructs backward and predicts forward from time t at layer l over the segment labels from layer l − 1, at a timescale defined by the segmentation behavior of layer l − 1. Thus only the first layer decodes at the timescale of the data; higher layers l decode at the timescale of l − 1, and representations associated with non-boundaries in l − 1 are ignored by the objective. The objective scans incrementally over the time dimension and imposes a forward and backward cost at every segment boundary identified by layer l − 1. As a result, the first layer ("phonemes") is responsible for incrementally decoding the local past and future realization of the acoustic stream, the second layer is responsible for decoding the local past and future realization of the "phoneme" sequence, etc. Although it is possible to backpropagate into the decoding targets (i.e. encoder representations) at higher layers, thereby encouraging the encoder to discover more predictable segment sequences, we found in practice that doing so resulted in a form of mode collapse in which labels became insensitive to the data and converged to a single value for all timesteps. For this reason, we stop the gradients into decoding targets and backpropagate only into the decoder predictions. Thus, the objective encourages encodings at higher layers to change to better predict structures at lower layers, but does not alter the representations at those lower layers to make them more uniform and therefore easier to predict.
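The windowing structure of this objective can be illustrated with a small helper (an illustrative sketch with hypothetical names, not the paper's code): given the boundary decisions and segment labels of layer l − 1, it collects, at each boundary, the up-to-B previous and up-to-F following segment labels that layer l must reconstruct and predict. Whether the current segment is itself included in the backward window is a detail the text leaves open; here we include it.

```python
def layerwise_targets(boundaries, labels, B, F):
    """Backward/forward decoding targets for layer l at each boundary of
    layer l-1. Non-boundary timesteps contribute nothing to the objective.

    boundaries: per-timestep 0/1 flags from layer l-1 (1 = segment ends here)
    labels:     per-timestep segment labels (read only at boundary timesteps)
    Returns a list of (t, backward_window, forward_window) triples.
    """
    ends = [t for t, z in enumerate(boundaries) if z]
    seq = [labels[t] for t in ends]  # the segment-label sequence of layer l-1
    out = []
    for i, t in enumerate(ends):
        back = seq[max(0, i + 1 - B): i + 1]  # last B labels, incl. current
        fwd = seq[i + 1: i + 1 + F]           # next F labels
        out.append((t, back, fwd))
    return out
```

At the first layer the "labels" are simply the acoustic frames (every timestep is a boundary of the data), so the same scheme reduces to local reconstruction and prediction of the speech signal itself.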

Experimental Design
We assess the contribution of memory and prediction pressures to phoneme acquisition by (1) manipulating these pressures on models exposed to speech data from two unrelated languages (Xitsonga and English) and (2) evaluating the effect of these manipulations on multiple measures of phoneme induction quality.

Data
We use the Zerospeech 2015 (Versteegh et al., 2015) challenge data in English and Xitsonga, a Bantu language spoken in South Africa. The Xitsonga data come from the NCHLT corpus (De Vries et al., 2014) and contain 2h29m07s of read speech from 24 speakers. The English data come from the Buckeye Corpus (Pitt et al., 2005) and contain 4h59m05s of spontaneous speech from 12 speakers. For English, we additionally include the official development set in training, which contains 1h39m45s of spontaneous speech from 2 speakers, also from the Buckeye Corpus. English development set performance was used for model development and tuning, but the development set is not included in the evaluations presented here. Xitsonga lacks a development set, so designs selected on the English development set are applied directly to Xitsonga for evaluation. Before fitting, we convert the source audio files into a cochleagram-based spectral representation that approximates the signal generated by the human auditory system (McDermott and Simoncelli, 2011).

Experimental Manipulations
We seek to assess the contribution of both memory and prediction pressures to the content of model representations. To this end, our principal manipulations are the backward (B ∈ {0, 5, 25, 50}) and forward (F ∈ {0, 1, 5, 10}) window lengths used by the decoder, which respectively impose a pressure to efficiently remember and accurately predict. Note that the condition B = 0, F = 0 (no reconstruction or prediction) is not well defined because the objective is 0 at any parameterization and thus has no gradient, and we therefore exclude it from consideration in these results. In addition, we manipulate the number of encoder layers (L ∈ {2, 3, 4}). This is because it is unclear a priori which layers of the encoder are expected to correspond to quantities of interest like phonemes or words, since the representations are unsupervised and the model could additionally or instead discover e.g. subphonemic, morphemic, phrasal, intonational, and other kinds of structures. Although detection of these and other levels of linguistic representation is of interest and is the target of future work, the annotations provided by the Zerospeech 2015 data support phoneme-level and word-level analyses only, and we concentrate our evaluation there. Varying the number of layers allows us to investigate which layers emergently discover more phoneme-like units, and under what conditions.
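The resulting factorial design, with the degenerate B = 0, F = 0 cell removed, can be enumerated directly. This assumes a full cross of the listed values, which the text implies but does not state outright:

```python
from itertools import product

B_VALS = [0, 5, 25, 50]   # backward (memory) window lengths
F_VALS = [0, 1, 5, 10]    # forward (prediction) window lengths
L_VALS = [2, 3, 4]        # number of encoder layers

# Exclude B = 0, F = 0: with no reconstruction or prediction targets,
# the objective is identically zero and provides no gradient.
CONFIGS = [(B, F, L)
           for B, F, L in product(B_VALS, F_VALS, L_VALS)
           if not (B == 0 and F == 0)]
# (4 * 4 - 1) * 3 = 45 configurations per language
```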

Evaluation
Because our model generates a segmental encoding of the speech signal, we apply two classes of evaluation in this study: phoneme segmentation and phoneme-level probing classification. The segmentation evaluation measures the degree of correspondence between the model-generated segment boundaries and expert-annotated phoneme boundaries, using a boundary F-measure which assigns a true positive for up to one predicted boundary that falls within some tolerance of each gold boundary, false positives for all other predicted boundaries, and false negatives for all gold boundaries that lack a predicted boundary within the tolerance. Following Lee and Glass (2012), we use a tolerance of 20ms. The classification evaluation measures the amount of signal in model-generated encodings as to (1) the true identity of the phoneme being encoded and (2) the cluster of phonological features associated with that phoneme (Hayes, 2011). Following e.g. Shain and Elsner (2019) and Chrupała et al. (2020), we do so using probing classifiers. In particular, for each layer of each model's encoder, we fit linear classifiers to (1) the phoneme labels and (2) the phonological feature labels associated with the gold phoneme segment corresponding to each phone boundary. We extract the gold and predicted phoneme encodings at the human-annotated phoneme boundaries, regardless of whether the model segmented at that location. This supports direct comparison of metrics across models, since the set of evaluated segments is held constant. Phonological features are extracted at the same timepoints, following the procedure described in Shain and Elsner (2019). 
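The segmentation metric just described can be made concrete. The function below is an illustrative implementation of the stated matching rule (at most one predicted boundary per gold boundary within the tolerance), not the official evaluation code; the greedy nearest-match strategy is our own choice.

```python
def boundary_f(pred, gold, tol=0.02):
    """Boundary F-measure with a +/- tol (seconds) tolerance.

    Each gold boundary may be matched by at most one predicted boundary
    within the tolerance; matched pairs are true positives, unmatched
    predictions false positives, unmatched gold boundaries false negatives.
    """
    pred, gold = sorted(pred), sorted(gold)
    used = [False] * len(pred)
    tp = 0
    for g in gold:
        # greedily match the nearest unused predicted boundary within tol
        best, best_d = None, tol
        for i, p in enumerate(pred):
            if not used[i] and abs(p - g) <= best_d:
                best, best_d = i, abs(p - g)
        if best is not None:
            used[best] = True
            tp += 1
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For example, `boundary_f([0.10, 0.31, 0.52], [0.10, 0.30, 0.70])` matches two of three boundaries on each side at the 20ms tolerance, giving P = R = F = 2/3.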
Although our model is designed to support joint discovery of multiple layers of representation, we find empirically that no model appreciably improves at any layer in word boundary F-score over a baseline that segments only at the ends of voice activity regions, and qualitative inspection does not indicate systematic correspondence to an unannotated level of representation such as syllables, morphemes, or intonational units. Despite differences in segmentation rate, and thereby in word boundary precision-recall trade-off, models generally converge on similar (low) word boundary F-scores, and thus our manipulations are not informative about word learning. Probe-based classification metrics are not well suited to word-level evaluation due to the size of the vocabulary. Though human speech processing involves units between the phoneme and word level, detailed analysis of such units is difficult due to the lack of annotation in the corpus. We believe poor word discovery at higher layers may be due in part to the fact that non-initial layers have both a non-stationary objective (the evolving representations of the layer below) and slower learning dynamics, perhaps making it difficult for these layers to "catch up" with moving targets (Ioffe and Szegedy, 2015). We leave exploration of possible remedies to future research and focus here only on the phoneme level.
While it is a priori unclear which layer of the encoder is expected to encode phonemes (for example, the initial layers may encode sub-phonemic units), we find systematically better phoneme segmentation and classification performance in the first layer of the network. For simplicity, we therefore only present metrics from this layer.
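The probing evaluation can be sketched as follows. We use a ridge least-squares probe with one-hot targets for simplicity; the specific solver and regularization are our own assumptions for illustration, not necessarily the linear classifier used in the evaluation, and the paper reports macro-averaged F-measure rather than the accuracy computed here.

```python
import numpy as np

def fit_linear_probe(Z, y, n_classes, reg=1e-3):
    """Fit a linear map from encodings Z (n x d) to one-hot class targets
    by ridge-regularized least squares; predict by argmax."""
    Y = np.eye(n_classes)[y]
    Zb = np.hstack([Z, np.ones((len(Z), 1))])       # append a bias column
    A = Zb.T @ Zb + reg * np.eye(Zb.shape[1])
    return np.linalg.solve(A, Zb.T @ Y)

def probe_accuracy(W, Z, y):
    Zb = np.hstack([Z, np.ones((len(Z), 1))])
    return float((np.argmax(Zb @ W, axis=1) == np.asarray(y)).mean())

# Sanity check on synthetic "encodings": two well-separated clusters
# (standing in for two phoneme categories) should be nearly perfectly
# decodable by a linear probe.
rng = np.random.default_rng(0)
Z = np.vstack([rng.normal(0, 1, (100, 8)), rng.normal(4, 1, (100, 8))])
y = np.array([0] * 100 + [1] * 100)
W = fit_linear_probe(Z, y, n_classes=2)
```

Because the probe is linear, high probe performance indicates that phoneme identity (or a phonological feature) is linearly readable from the encodings, rather than merely recoverable by a powerful nonlinear decoder.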
We report performance improvements from each model relative to (1) baseline U (untrained), an architecturally matched model left at random initialization (Chrupała et al., 2020), and (2) baseline X (cross-language), the architecturally matched model trained on the opposite language. These two baselines quantify different contributions of the acquisition process. Baseline U quantifies architectural inductive bias: how well does the architecture alone guide linguistic representations, without learning? Baseline X quantifies modality inductive bias: how well does general knowledge of human speech guide linguistic representations, without exposure to the target language? Improvement against either of these baselines supports language learning from experience, over and above any prior knowledge that might more efficiently be innately encoded.

Results and Discussion
Boundary and macro-averaged phoneme and feature classification F-measures from the best-performing configuration on the English development set (B = 25, F = 1, L = 3) are given in Table 1. English boundary performance (F = 65.3) approaches previously reported unsupervised phoneme segmentation scores on different and therefore not directly comparable datasets (Lee and Glass, 2012; Michel et al., 2017, both around F = 75). The overall segmentation performance in Xitsonga is considerably worse than that of English, consistent with prior evidence that word segmentation in the Zerospeech 2015 Xitsonga partition is harder than English (e.g. Kamper et al., 2017a). By contrast, classification metrics in Xitsonga are better than in English, which is again consistent with prior findings of stronger unsupervised classification performance in Xitsonga (Shain and Elsner, 2019).

For Xitsonga, baseline X is the architecturally matched English-trained model; for English, baseline X is the architecturally matched Xitsonga-trained model. We do not evaluate directly against a previous state of the art because no state of the art exists for unsupervised phoneme segmentation and classification in the Zerospeech 2015 data. A previous model that performed the same task (Lee and Glass, 2012) achieved an average boundary F-score of 76.1 on a different dataset that used a different boundary annotation standard (automatic forced alignment instead of human annotation); to our knowledge, that dataset is no longer publicly available. A recent segmentation-only model (Michel et al., 2017) achieved a boundary F of 75 on the TIMIT dataset (Fisher et al., 1986). However, because TIMIT is restricted to 10 unique utterances of English, we believe Zerospeech 2015, which contains more linguistically diverse speech from two unrelated languages, is a better dataset for investigating language acquisition patterns.
The difference in relative performance between segmentation and classification in the two languages could be due in part to differences in register: the English data is spontaneously produced, while the Xitsonga data is read speech. Longer average phoneme duration (100ms vs 70ms) and cleaner articulations in Xitsonga could plausibly give rise to this asymmetry, and further investigation is left to future work. The model substantially outperforms the untrained baseline (U) on all metrics and outperforms the cross-language baseline (X) on all metrics but boundary F in Xitsonga, which could be due in part to the larger size of the English-language training set. Results therefore indicate that the reconstruction and prediction objectives have contributed to unsupervised discovery of phonemic patterns in both languages.
Segmentation and macro-averaged classification F-measures by language and experimental condition are given in Figure 2. Results show a contribution of both memory (B > 0) and prediction (F > 0), with a similar distribution of relative performance between the two languages, supporting the existence of language-general influences of prediction and memory on phoneme learning.
As shown in Figure 2, models without memory pressures (B = 0) find substantially worse boundaries than models with memory pressures. There also appears to be a ceiling effect of backward reconstruction size, with a jump in performance at B = 25 but no systematic improvement at B = 50. Importantly, at layer 1, B = 25 covers a 250ms interval, which falls within even conservative estimates of the storage duration of unanalyzed auditory traces in humans (Cowan, 1984). The B = 25 objective could therefore plausibly be used during online speech processing. Prediction pressures also support discovery of phoneme boundaries, as shown by the generally worse boundary performance of F = 0 vs. F > 0 in both languages.
In addition, Figure 2 shows that memory and prediction both modulate phoneme classification performance, with a roughly convex performance surface around a peak at B = 25, F = 10 for English and B = 25, F = 5 for Xitsonga. A similar peak emerges in the feature classification results for English, along with a local feature classification peak in Xitsonga for L > 2. A 250ms auditory memory window thus supports both phoneme segmentation and classification in our models, with additional benefits from predicting over short intervals (Singer et al., 2018). For feature classification, the primary determinant of performance across languages is the prediction objective, with performance generally increasing up to F = 5. There is also an effect of encoder depth in these results, such that encoders with more layers (L > 2) tend to perform better across metrics, despite the fact that all metrics reflect performance at the first layer. This result supports a contribution of multiscale modeling, even if the segmentation behavior at higher layers does not clearly correspond to a theory-driven level of representation (see section 4.3). As the baseline comparisons show, memory and prediction modulate not only absolute performance, but also the utility of language experience.

Figure 4 reports performance differences by metric against baseline X (cross-language). English segmentation is substantially helped by experience with (i.e. training on) English, especially under strong memory pressures. However, Xitsonga segmentation is generally worse for the Xitsonga-trained model than the English-trained one. This might be due to the fact that the English training set is larger, and/or to low overall levels of segmentation performance in Xitsonga. While we leave further investigation of this exception to future work, the classification metrics still show a clear benefit of in-domain training in both languages, but only in the presence of prediction pressures.
The baseline X results bear on the degree to which speech processing patterns can plausibly be innately specified. Although the set of phonological categories and features is classically regarded as universal (Chomsky and Halle, 1968; Clements, 1985), it is well known that the "same" phonological abstraction (e.g., voicing) can be phonetically cashed out in different ways depending on the language (e.g., Gordon and Ladefoged, 2001; Gordon et al., 2002). Our results suggest that, at least between Xitsonga and English, this variation is both (1) constrained enough to permit recognition of non-trivial patterns from speech in other languages on the basis of general, possibly innate processing biases, and (2) substantial enough to give rise to a benefit of direct experience with the target language, even for language-general constructs like phoneme categories and phonological features.

We use linear regression on the combined metrics to quantitatively evaluate the contribution of both memory and prediction pressures to phoneme acquisition. Results show significant positive contributions to acquisition from memory pressures (p = 0.006), prediction pressures (p < 0.001), and multiscale encoding (p < 0.001; see Appendix G for details).

The boundary precision/recall trade-off illuminates the mechanisms by which memory and prediction pressures affect learning (Figure 5). Without memory pressures (B = 0), segmentation rates are high, resulting in high recall and low precision. Introducing memory pressures (B > 0) slows the segmentation rate, resulting in a more balanced P/R trade-off. Without prediction pressures (F = 0), segmentation rates are generally low, resulting in higher precision and lower recall. Introducing prediction pressures (F > 0) increases the segmentation rate, again resulting in a more balanced trade-off. To understand this pattern, recall that a boundary in our model represents both a cost
(flushing the memory cell) and a benefit (injecting top-down feedback). The cost of forgetting is plausibly greater for reconstruction than prediction, since only the current layer has had direct access to the sequence of reconstruction targets. By contrast, the benefit of top-down feedback is plausibly greater for prediction than reconstruction, since the prediction can condition on contextual representations at multiple timescales. In our segmental model of speech processing, the objectives therefore induce countervailing biases that boost signal for phonological constructs, supporting their joint influence on phoneme acquisition from speech.

Conclusion
We proposed an unsupervised deep neural model of speech processing that is incremental, segmental, and optimized by local feedback. We manipulated the model's objective function in order to investigate prior hypotheses about the role in human language acquisition of memory constraints on the one hand and predictive processing on the other. Results support a role for both memory and prediction pressures in acquiring phonemes from speech. Both objectives inform the model's segmentation behavior and the content of its segment encodings. In addition, results suggest that these two mechanisms coordinate to support phoneme discovery by introducing countervailing pressures toward retention of previously encountered signals (memory) and consultation of top-down signals (prediction).

B Encoder Definition

• The recurrent connection includes both the previous segment label and the current segment length in addition to the previous hidden state. We found this to be helpful during model development, and we hypothesize that this is because doing so removes the need for this information to be encoded by the model.

• We implement the case-wise reasoning of the segmentation decisions using multiplicative masking rather than logical selection. This is intended to boost signal into the boundary decisions.
• We enforce hierarchical segmentation behavior by multiplicatively masking the segmentation decision at layer l with the segmentation decision at layer l − 1, thus preventing higher layers from segmenting where lower layers do not.
• We compute boundaries during training via Bernoulli sampling rather than rounding. We found this to substantially improve performance on the development set, and we hypothesize that sampling may improve the straight-through gradient estimates by ensuring that the segmentation decision is unbiased with respect to the underlying segmentation probability.
• We renormalize the preactivations s (l) t by the incoming boundary decisions (eq. A7). We found this to be helpful during model development, and we hypothesize that this is because it avoids fluctuation in the scale of preactivations as a function of the boundaries.
• We do not apply the Chung et al. (2017) technique of slope annealing, i.e., gradually increasing the steepness of the sigmoid activation function to reduce bias in the straight-through estimator. We did not find an appreciable benefit from slope annealing during development, and it had a tendency to produce training instability. Eliminating it also reduces experimenter degrees of freedom by removing design decisions about the annealing function.
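As a concrete illustration of the boundary mechanics in the notes above, the following sketch combines Bernoulli-sampled boundary decisions with hierarchical masking. The function names, shapes, and probabilities are hypothetical, and the straight-through gradient trick is omitted since plain NumPy does not track gradients:

```python
import numpy as np

def sample_boundaries(probs, rng):
    """Sample hard 0/1 boundary decisions from per-frame probabilities.
    Sampling (rather than rounding) keeps the hard decision unbiased:
    E[boundary] equals the underlying probability."""
    return (rng.random(probs.shape) < probs).astype(float)

def mask_hierarchically(boundaries_by_layer):
    """Mask each layer's boundaries by the layer below, so a higher
    layer can only segment where all lower layers also segment."""
    masked = [boundaries_by_layer[0]]
    for b in boundaries_by_layer[1:]:
        masked.append(b * masked[-1])
    return masked

rng = np.random.default_rng(0)
b1 = sample_boundaries(np.full(10000, 0.3), rng)  # layer-1 boundaries
b2 = sample_boundaries(np.full(10000, 0.5), rng)  # layer-2 boundaries
b1m, b2m = mask_hierarchically([b1, b2])
```

The empirical boundary rate of `b1` approximates its underlying probability (unbiasedness), and `b2m` never segments where `b1m` does not.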

C Decoder Definition
The decoder consists of two attentional seq2seq LSTMs with L layers each, one backward-directional (memory) and one forward-directional (prediction). Given a backward window size B and a forward window size F, each backward decoder layer generates reconstructions Y^{B(l)}_t ∈ R^{B×D_{l−1}} and each forward decoder layer generates predictions Y^{F(l)}_t ∈ R^{F×D_{l−1}}, corresponding respectively to the B preceding and F following segment labels of layer l − 1 at time t. The initial decoder hidden and cell states (h^{dB(l)}_{t,0} and c^{dB(l)}_{t,0} for the backward decoder; h^{dF(l)}_{t,0} and c^{dF(l)}_{t,0} for the forward decoder) are generated using multilayer feedforward transforms f^{hB(l)}, f^{cB(l)}, f^{hF(l)}, and f^{cF(l)}. Decoder states are doubly time-indexed by t, i, where t indexes the encoder timestamp (i.e., the input timestep at which decoding begins) and i indexes the decoder timestamp (i.e., progress through the B or F decoder frames). The decoder takes as input a periodic positional encoding e_i, generated following Vaswani et al. (2017).
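The periodic positional encoding e_i can be sketched as the standard sinusoidal encoding of Vaswani et al. (2017). The NumPy rendering below is illustrative: the dimensionality of 128 matches Appendix E, but the function name and shapes are our own:

```python
import numpy as np

def positional_encoding(num_steps, dim):
    """Sinusoidal positional encoding of Vaswani et al. (2017):
    even dimensions get sin(i / 10000^(2k/dim)), odd dimensions
    get the corresponding cos."""
    positions = np.arange(num_steps)[:, None]         # (num_steps, 1)
    freqs = 10000.0 ** (-np.arange(0, dim, 2) / dim)  # (dim/2,)
    angles = positions * freqs                        # (num_steps, dim/2)
    pe = np.empty((num_steps, dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# One row per decoder frame, e.g. B = 25 reconstruction frames.
e = positional_encoding(num_steps=25, dim=128)
```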
Attention weights a^{B(l)} and a^{F(l)} are computed using a Gaussian kernel k(i; μ, σ²). The kernel is applied to decoder time, with concentration σ^{B(l)}, σ^{F(l)} = 0.25 and with location μ_{t,i} ∈ R⁺ computed by transforming the previous decoder state using a feedforward transform f^{qB(l)}, f^{qF(l)} and adding the result to the previous attention location, with μ_{t,0} = 1. Unit-normalized attention vectors are computed from the timestamp vectors t_B = (1, . . . , B) and t_F = (1, . . . , F). The attention weights are thus constrained to march monotonically in time from t into the decoded past or future predicted segment labels from the layer above. Using fixed concentration 0.25 yields an effective kernel width [−2σ, 2σ] of one timestep, ensuring that the bulk of the attention kernel either falls on a single segment label or straddles two consecutive segment labels, and preventing the decoder from spreading its attention over many higher-level segments. This design encourages one-to-many temporal alignment between decoded segment labels and decoded inputs, while allowing the decoder to determine how long to attend to a predicted segment label before moving on to the next one. At the final (top) layer, no top-down predictions are available, so the context vectors are omitted (or, equivalently, set to 0). The inputs to the decoder x^{dB(l)} (and analogously x^{dF(l)}) are constructed as the vertical concatenation of e, w, and the previously generated decoder output, and a standard LSTM state update is applied. The decoder is only applied to elements of f^{f(l)}(1, T) (i.e., only to frames where layer l segments) and only decodes the last B elements of f^{f(l−1)}(1, t) and the first F elements of f^{f(l−1)}(t + 1, T); that is, it decodes only the B preceding and F following segment labels from layer l − 1, ignoring labels at non-boundaries. Therefore, like encoding, decoding is multiscale, taking place at the timescale of the encoder representations.
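A minimal sketch of this monotonic Gaussian attention, assuming a fixed location increment in place of the model's learned transform of the decoder state:

```python
import numpy as np

def gaussian_attention(timestamps, mu, sigma=0.25):
    """Unit-normalized attention weights from a Gaussian kernel
    k(i; mu, sigma^2) evaluated at integer decoder timestamps."""
    w = np.exp(-0.5 * ((timestamps - mu) / sigma) ** 2)
    return w / w.sum()

B = 5
t_B = np.arange(1, B + 1, dtype=float)  # timestamp vector (1, ..., B)

# The location starts at mu_{t,0} = 1 and advances by a nonnegative
# increment; in the model the increment comes from the decoder state,
# here a fixed 0.5 step for illustration.
rows = [gaussian_attention(t_B, 1.0 + 0.5 * i) for i in range(4)]
for a in rows:
    # With sigma = 0.25, nearly all mass falls on at most two
    # adjacent segment labels.
    assert np.sort(a)[-2:].sum() > 0.999
```

The assertion verifies the narrow-kernel property claimed in the text: each attention vector concentrates on one segment label or straddles two consecutive ones.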

D Objective
Each decoder layer contributes two terms to the objective, a backward (reconstruction) objective and a forward (prediction) objective. Layer 1 decodes the data and uses a squared error loss. Layers 2, . . . , L decode the representations from the layer below, which are tanh-activated and thus constrained to the interval (−1, 1); encoder features h^{e(l)}_t are deterministically cast into bitwise feature probabilities p^{e(l)}_t and decoded using a sigmoid cross-entropy loss. The backward and forward targets and model predictions, each indexed by encoder time t, decoder time i, and dimension d, define the backward and forward loss components L^{B(l)} and L^{F(l)}. The overall loss L sums these components over layers:

L = Σ_{l=1}^{L} (L^{B(l)} + L^{F(l)})    (A48)
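The structure of the objective (squared error at layer 1, sigmoid cross-entropy above, summed backward and forward terms over layers) can be sketched as follows; the shapes and the random targets are hypothetical:

```python
import numpy as np

def squared_error(y, y_hat):
    # Layer-1 loss: squared error against cochleagram frames.
    return ((y - y_hat) ** 2).sum()

def bitwise_probs(h):
    # Deterministically cast tanh features in (-1, 1) to (0, 1).
    return (h + 1.0) / 2.0

def sigmoid_xent(p, logits):
    # Higher-layer loss: sigmoid cross-entropy against bitwise
    # feature probabilities from the layer below.
    q = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-12
    return -(p * np.log(q + eps) + (1 - p) * np.log(1 - q + eps)).sum()

# Overall loss (eq. A48): sum of backward and forward terms over layers.
rng = np.random.default_rng(0)
layer_losses = []
for l in range(1, 4):  # L = 3 layers; shapes are illustrative
    if l == 1:
        L_B = squared_error(rng.random((25, 50)), rng.random((25, 50)))
        L_F = squared_error(rng.random((10, 50)), rng.random((10, 50)))
    else:
        p = bitwise_probs(np.tanh(rng.standard_normal((25, 128))))
        L_B = sigmoid_xent(p, rng.standard_normal((25, 128)))
        L_F = sigmoid_xent(p, rng.standard_normal((25, 128)))
    layer_losses.append(L_B + L_F)
total_loss = sum(layer_losses)
```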

E Implementation Details
We apply the following implementation decisions in this study:

• D_l = 128 for 1 ≤ l ≤ L

• One hidden layer of 128 units for all feedforward transforms

• Positional encoding dimensionality of 128

• Exponential linear unit (ELU) activations for all internal feedforward layers (Clevert et al., 2015)

• Glorot uniform initialization for bottom-up, top-down, and feedforward encoder and decoder weight matrices (Glorot and Bengio, 2010)

• Orthogonal initialization for recurrent weight matrices (Saxe et al., 2013)

• Adam optimizer (Kingma and Ba, 2014) with learning rate 0.001, a minibatch size of 8, and default TensorFlow parameters

F Data Preprocessing
We convert the audio recordings into sequences of 50-dimensional cochleagrams (Brown and Cooke, 1994; McDermott and Simoncelli, 2011), each representing 10ms of audio data. Although this differs from the standard automatic speech recognition pipeline based on Mel frequency cepstral coefficients (Mermelstein, 1976), it is motivated for our study because the model is unsupervised. Since we wish to test theories about cognition by extracting features from the acoustic stream without supervision, it is critical not only that the speech representation contain features that support identification of linguistic units, but also that it emphasize those features in a manner plausibly similar to that of the human auditory system. Cochleagrams support this goal by incorporating more recent insights about human auditory perception (McDermott and Simoncelli, 2011). Our implementation uses the pycochleagram library (https://github.com/mcdermottLab/pycochleagram).
We L2 normalize the cochleagrams in order to encourage the decoder to focus on the spectral power envelope rather than absolute variation in loudness, since the former plausibly contains more linguistic signal. This procedure is supported by evidence of loudness constancy in human auditory perception, suggesting that similar kinds of normalization may take place in the brain (Zahorik and Wightman, 2001). We additionally z-transform the normalized cochleagrams over time within each audio file, since this proved beneficial during model development.
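The two normalization steps can be sketched in NumPy as follows; the function name is hypothetical, and the per-channel treatment of the z-transform is our assumption (the text specifies only z-transformation over time within each file):

```python
import numpy as np

def preprocess(cochleagram):
    """L2-normalize each 10ms frame, then z-transform each channel
    over time within the file. Frame-wise L2 normalization discards
    absolute loudness in favor of the spectral power envelope."""
    norms = np.linalg.norm(cochleagram, axis=1, keepdims=True)
    normalized = cochleagram / np.maximum(norms, 1e-8)
    mu = normalized.mean(axis=0, keepdims=True)
    sd = normalized.std(axis=0, keepdims=True)
    return (normalized - mu) / np.maximum(sd, 1e-8)

rng = np.random.default_rng(0)
frames = rng.random((1000, 50)) * 5.0  # fake file: (frames, channels)
out = preprocess(frames)
```

After preprocessing, each channel has zero mean and unit variance over time, so absolute loudness differences between files cannot dominate the reconstruction loss.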
The source audio files contain many non-speech regions that are not of direct relevance for this study. We use the voice activity detection (VAD) intervals provided with the Zerospeech 2015 challenge data to remove these regions as a preprocess, and we force boundaries at the ends of VAD intervals. This greatly speeds training by removing irrelevant data, and it aligns with neuroscientific evidence of a prelinguistic capacity to detect human voices (Belin et al., 2000;Fecteau et al., 2005;Blasi et al., 2011;Pernet et al., 2015).
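A sketch of this VAD preprocessing, assuming millisecond-aligned intervals and 10ms frames; the function name and interval format are hypothetical:

```python
import numpy as np

FRAME_MS = 10  # each cochleagram frame spans 10ms

def apply_vad(frames, vad_intervals_ms):
    """Drop non-speech frames and force a segment boundary at the end
    of each VAD interval. vad_intervals_ms holds (start_ms, end_ms)
    pairs marking speech regions."""
    kept, bounds = [], []
    for start_ms, end_ms in vad_intervals_ms:
        chunk = frames[start_ms // FRAME_MS : end_ms // FRAME_MS]
        kept.append(chunk)
        b = np.zeros(len(chunk))
        b[-1] = 1.0  # forced boundary at the interval's final frame
        bounds.append(b)
    return np.concatenate(kept), np.concatenate(bounds)

frames = np.zeros((1000, 50))  # 10s of fake cochleagram input
speech, boundaries = apply_vad(frames, [(0, 2000), (5000, 5500)])
```

Here 10 seconds of input reduce to 2.5 seconds of speech frames with one forced boundary per interval.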

G Regression Model Design and Results
We use linear regression to test the relationship between performance and memory pressures, prediction pressures, and multiscale encoding. To do so, we combine raw boundary, phoneme classification, and feature classification metrics, along with deltas in these metrics over baselines U and X, into a single vector of performance statistics, each of which measures one aspect of the contribution of these dimensions to phoneme learning in our unsupervised models. To improve the normality of performance metrics bounded on the interval [0, 1], as well as the comparability of performance across metrics, we (1) cast the metrics onto the interval (−1, 1), (2) apply a normalizing transform, and (3) z-score the transformed vectors within each metric type. We use binary coding for our predictors of interest: presence/absence of memory pressures (B > 0), presence/absence of prediction pressures (F > 0), and presence/absence of multiscale segmental encoding (L > 2). We also include categorical controls for comparison type (full, full − baseline U, full − baseline X) and metric type (boundary, phoneme, feature). Results, shown in Table A1, support a contribution of all three critical variables to phoneme acquisition.
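The predictor coding and within-metric standardization can be sketched as follows; the example configurations and scores are hypothetical, not values from the study:

```python
import numpy as np

def zscore_within(values, groups):
    """Z-score performance statistics within each metric type so that
    boundary, phoneme, and feature metrics are comparable."""
    out = np.empty_like(values, dtype=float)
    for g in np.unique(groups):
        idx = groups == g
        v = values[idx]
        out[idx] = (v - v.mean()) / v.std()
    return out

# Binary coding of the predictors of interest: memory pressures
# (B > 0), prediction pressures (F > 0), multiscale encoding (L > 2).
# The (B, F, L) configurations below are illustrative.
configs = np.array([(0, 0, 2), (25, 5, 2), (25, 10, 3), (50, 0, 3)])
X = np.column_stack([
    (configs[:, 0] > 0).astype(float),  # memory
    (configs[:, 1] > 0).astype(float),  # prediction
    (configs[:, 2] > 2).astype(float),  # multiscale
])

scores = np.array([0.10, 0.50, 0.90, 0.20, 0.60, 1.00])
mtypes = np.array(["boundary"] * 3 + ["phoneme"] * 3)
z = zscore_within(scores, mtypes)
```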