Pretrained Language Model Embryology: The Birth of ALBERT

While the behaviors of pretrained language models (LMs) have been thoroughly examined, what happens during pretraining is rarely studied. We thus investigate the developmental process from a set of randomly initialized parameters to a totipotent language model, which we refer to as the embryology of a pretrained language model. Our results show that ALBERT learns to reconstruct and predict tokens of different parts of speech (POS) at different rates during pretraining. We also find that linguistic knowledge and world knowledge do not generally improve as pretraining proceeds, nor does downstream task performance. These findings suggest that the knowledge of a pretrained model varies during pretraining, and that more pretrain steps do not necessarily provide a model with more comprehensive knowledge. We provide source code and pretrained models to reproduce our results at https://github.com/d223302/albert-embryology.


Introduction
The world of NLP has undergone a tremendous revolution since the proposal of contextualized word embeddings. Some big names are ELMo (Peters et al., 2018), GPT (Radford et al.), and BERT (Devlin et al., 2019), along with its variants (Sanh et al., 2019; Liu et al., 2019b; Lan et al., 2019). Performance boosts on miscellaneous downstream tasks have been reported from finetuning these totipotent pretrained language models. To better grasp what has been learned by these contextualized word embedding models, probing is generally applied to the pretrained models and the models finetuned from them. Probing targets range from linguistic knowledge, including semantic roles and syntactic structures (Liu et al., 2019a; Tenney et al., 2019, 2018; Hewitt and Manning, 2019), to world knowledge (Petroni et al., 2019).
While previous work focuses on what knowledge has been learned after the pretraining of transformer-based language models, few studies delve into their dynamics during pretraining. What happens during the training process of a deep neural network has been widely studied (Gur-Ari et al., 2018; Frankle et al., 2019; Raghu et al., 2017; Morcos et al., 2018). Some previous works also study the training dynamics of an LSTM language model (Saphra and Lopez, 2018, 2019), but the training dynamics of large-scale pretrained language models are not well studied. In this work, we probe ALBERT (Lan et al., 2019) during its pretraining phase every N parameter update steps and study what it has learned and what it can achieve so far. We perform a series of experiments, detailed in the following sections, to investigate the development of predicting and reconstructing tokens (Section 3), how linguistic and world knowledge evolve over time (Section 4, Section 6), and whether amassing that information guarantees good downstream task performance (Section 5).
We have the following findings based on ALBERT:

• The prediction and reconstruction of tokens with different POS tags have different learning speeds. (Section 3)

• Semantic and syntactic knowledge is developed simultaneously in ALBERT. (Section 4)

• Finetuning from a model pretrained for 250k steps gives a decent GLUE score (80.23), and further pretrain steps only raise the GLUE score as high as 81.50. (Section 5)

• While ALBERT does generally gain more world knowledge as pretraining goes on, the model seems to be dynamically renewing its knowledge about the world. (Section 6)

While we only include the detailed results of ALBERT in the main text, we find that the results also generalize to two other transformer-based language models, ELECTRA (Clark et al., 2019) and BERT, which are quite different from ALBERT in terms of pretraining task and model architecture. We put the detailed results of ELECTRA and BERT in the appendix.

Pretraining ALBERT
ALBERT is a variant of BERT with cross-layer parameter sharing and factorized embedding parameterization. We initially chose ALBERT as our subject because of its parameter efficiency, which becomes a significant issue when we need to store 1,000 checkpoints during the pretraining process.
To investigate what happens during the pretraining process of ALBERT, we pretrained an ALBERT-base model ourselves. To maximally reproduce the results in Lan et al. (2019), we follow most of the training hyperparameters in the original work, only modifying some to fit our limited computation resources. We also follow Lan et al. (2019) in using English Wikipedia as our pretraining data, but we use the Project Gutenberg Dataset (Lahiri, 2014) instead of BookCorpus. The total size of the pretraining corpus is 16GB. Pretraining was done on a single Cloud TPU V3 and took eight days to finish 1M pretrain steps, costing around 700 USD. More details on pretraining are given in Appendix B.1.

Learning to Predict the Masked Tokens and Reconstruct the Input Tokens
During the pretraining stage of a masked LM (MLM), the model learns to predict masked tokens based on the remaining unmasked part of the sentence, and it also learns to reconstruct the identities of unmasked tokens from their output representations. Better prediction and reconstruction results indicate that the model is able to utilize contextual information. To maximally reconstruct the input tokens, the output representations must keep sufficient information regarding token identities. We investigate the behavior of mask prediction and token reconstruction for tokens of different POS during the early stage of pretraining, using the POS tagging in OntoNotes 5.0 (Weischedel et al., 2013). For the mask prediction part, we mask a whole word (which may contain multiple tokens) of an input sentence, feed the masked sentence into ALBERT, and predict the masked token(s). We evaluate the prediction performance by calculating the prediction accuracy based on the POS of the word; the predicted token(s) must exactly match the original token(s) to be deemed an accurate prediction. As for the token reconstruction part, the input to the model is simply the original, unmasked sentence. We find that ALBERT first learns to reconstruct function words, e.g., determiners and prepositions, and then gradually learns to reconstruct content words in the order of verb, adverb, adjective, noun, and proper noun. We also find that different forms and tenses of a verb do not share the same learning schedule, with third-person singular present being the easiest to reconstruct and present participle being the hardest (shown in Appendix C.2).

Figure 1: Rescaled accuracy of token reconstruction and mask prediction during pretraining. We rescale the accuracy of each line by the accuracy when the model is fully pretrained, i.e., the accuracy after 1M pretrain steps. Token reconstruction is evaluated every 1K pretrain steps, and mask prediction every 5K steps.
The prediction results in Figure 1(b) reveal that learning mask prediction is generally more challenging than token reconstruction. ALBERT learns to predict masked tokens in an order similar to token reconstruction, though much more slowly and less accurately. We find that BERT also learns mask prediction and token reconstruction in a similar fashion, with results provided in Appendix C.4.
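The per-POS evaluation described above can be sketched as follows. This is a minimal sketch: the function names and toy data are our own, not from the released codebase, and the rescaling mirrors the normalization used for Figure 1.

```python
from collections import defaultdict

def per_pos_accuracy(predictions, golds, pos_tags):
    """Exact-match accuracy of predicted tokens, grouped by POS tag.

    A masked word may span several tokens, so predictions and golds are
    sequences of token tuples; a prediction counts as correct only if
    every token matches the original word exactly.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for pred, gold, pos in zip(predictions, golds, pos_tags):
        total[pos] += 1
        if tuple(pred) == tuple(gold):
            correct[pos] += 1
    return {pos: correct[pos] / total[pos] for pos in total}

def rescale(curve, final_accuracy):
    """Rescale an accuracy curve by the fully pretrained model's accuracy,
    as done for the lines in Figure 1."""
    return [acc / final_accuracy for acc in curve]

# Toy usage: one determiner predicted correctly, one multi-token verb missed.
acc = per_pos_accuracy(
    predictions=[("the",), ("run", "ning")],
    golds=[("the",), ("walk", "ing")],
    pos_tags=["DET", "VERB"],
)
```

Grouping by POS first and rescaling afterwards keeps the curves of rare and frequent word classes comparable on one plot, which is what makes the different learning speeds visible.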

Probing Linguistic Knowledge Development During Pretraining
Probing is widely used to understand what kind of information is encoded in embeddings of a language model. In short, probing experiments train a task-specific classifier to examine if token embeddings contain the knowledge required for the probing task. Different language models may give different results on different probing tasks, and representations from different layers of a language model may also contain different linguistic information (Liu et al., 2019a;Tenney et al., 2018).
Our probing experiments are modified from the "edge probing" framework of Tenney et al. (2018). Hewitt and Liang (2019) previously showed that probing models should be selective, so we use linear classifiers for probing. We select four probing tasks for our experiments: part-of-speech (POS) tagging, constituent (const) tagging, coreference (coref) resolution, and semantic role labeling (SRL). The first two tasks probe the syntactic knowledge hidden in token embeddings, and the last two inspect the semantic knowledge provided by token embeddings. We use the annotations provided in OntoNotes 5.0 (Weischedel et al., 2013) in our experiments.
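As a minimal sketch of this probing setup (the names are ours, not from the edge-probing codebase, and training of the probe is omitted), a span is represented by average-pooling its token embeddings and then classified by a single linear layer:

```python
import numpy as np

def span_representation(token_embeddings, start, end):
    """Average-pool the contextual embeddings of tokens in [start, end)."""
    return token_embeddings[start:end].mean(axis=0)

class LinearProbe:
    """A linear classifier probe: logits = W x + b.

    The probe is kept linear so that high probing accuracy reflects
    information already present in the embeddings, rather than capacity
    added by the probe itself (the selectivity argument of
    Hewitt and Liang, 2019).
    """
    def __init__(self, weights, bias):
        self.W, self.b = weights, bias

    def predict(self, span_vector):
        return int(np.argmax(self.W @ span_vector + self.b))
```

In the actual experiments `W` and `b` would be trained on the probing task's labels while the language model's parameters stay frozen; the probe is then re-trained from scratch on each pretraining checkpoint.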
The probing results are shown in Figure 2b. We observe that all four tasks show similar trends during pretraining, indicating that semantic and syntactic knowledge are developed simultaneously. For the syntax-related tasks, the performance of both POS tagging and constituent tagging improves very quickly in the first 100k pretrain steps, and no further improvement can be seen throughout the remaining pretraining process, though performance fluctuates from time to time. We also observe an interesting phenomenon: the probed performance of SRL peaks at around 150k steps and slightly decays over the remaining pretraining process, suggesting that some probing-related information in particular layers dwindles while the ALBERT model strives to advance its performance on the pretraining objective. The loss of the pretraining objective is shown in Figure 2a.
Scrutinizing the probing results of different layers (Figure 3 and Appendix D.3), we find that the behaviors of different layers differ slightly. While the layers closer to the output perform worse than the layers closer to the input at the beginning of pretraining, their performance rises drastically and eventually surpasses that of the layers closer to the input; however, they start to decay after reaching their best performance. This implies that the layers closer to the output of ALBERT learn faster than those closer to the input. This phenomenon is also revealed by the attention patterns across different layers during pretraining: Figure 4 shows that the diagonal attention pattern (Kovaleva et al., 2019) of layer 8 emerges earlier than that of layer 2, with the pattern of layer 1 looming last.

Does Expensive and Lengthy Pretraining Pay Off?

While finetuning from pretrained language models yields good GLUE scores, whether the performance gain on downstream tasks is proportional to the resources spent on additional pretrain steps is unknown. This drives us to explore the downstream performance of the ALBERT model before it is fully pretrained. We choose the GLUE benchmark (Wang et al., 2018) for downstream evaluation, excluding WNLI, following Devlin et al. (2019). We illustrate the downstream performance of the ALBERT model during pretraining in Figure 5. While the GLUE score gradually increases as pretraining proceeds, the performance after 250k steps does not pale in comparison with a fully pretrained model (80.23 vs. 81.50). From Figure 5b, we also observe that most GLUE tasks reach results comparable with their fully pretrained counterparts after 250k pretrain steps, except for MNLI and QNLI, indicating that NLI tasks do benefit from more pretrain steps when the training set is large.

We also finetuned BERT and ELECTRA models as pretraining proceeds, and we observe similar trends. The GLUE scores of the BERT and ELECTRA models rise drastically in the first 100k pretrain steps, and the performance increases more slowly afterward. We put the detailed results of these two models in Appendix E.4.
We conclude that it may not be necessary to train an ALBERT model until its pretraining loss converges to obtain exceptional downstream performance. The majority of its capability for downstream tasks has already been acquired in the early stage of pretraining. Note that our results do not contradict previous findings in Devlin et al. (2019).

World Knowledge Development During Pretraining

The results are shown in Figure 6, in which we observe that world knowledge is indeed built up during pretraining, while performance fluctuates occasionally. It is clear that while some types of knowledge stay static during pretraining, some vary drastically over time, and the fully pretrained model (at 1M steps) may not contain the most world knowledge. We infer that the world knowledge of a model depends on the corpus it has seen recently, and that it tends to forget knowledge it saw long ago. These results imply that it may not be sufficient to draw conclusions on ALBERT's potential as a knowledge base merely from the final pretrained model's behavior. We also provide qualitative results in Appendix F.2.

Conclusion
Although finetuning from pretrained language models yields phenomenal downstream performance, the reason behind it is not fully understood. This work aims to unveil the mystery of the pretrained language model by looking into how it evolves. Our findings show that the learning speeds for reconstructing and predicting tokens differ across POS. We find that the model acquires semantic and syntactic knowledge simultaneously at the early pretraining stage. We show that the model is already prepared for finetuning on downstream tasks at its early pretraining stage. Our results also reveal that the model's world knowledge does not stay static even when the pretraining loss converges. We hope our work brings more insight into what makes a pretrained language model a pretrained language model.

A Modifications from the Reviewed Version
We made some modifications in the camera-ready version, mostly based on the reviewers' recommendations and for better reproducibility.
• We add the result of BERT and ELECTRA in Section 3, Section 4, and Section 5.
• We reimplement the source code for Section 4 and renew the experiment results accordingly. While the exact values are slightly different, the general trends are the same and do not affect our observations.
• We add the results of coreference resolution in our probing experiments, following the reviewers' suggestion.
• We polish our wordings and presentations in text and figures.

B.1 ALBERT
As mentioned in the main text, we only modified a few hyperparameters to fit our limited computation resources, listed in Ta

C.2 Mask Predict and Token Reconstruction of Different Verb Forms
We provide supplementary materials for Section 3. In Figure 7, we observe that ALBERT learns to reconstruct and predict verbs of different forms at different times. The average occurrence rates of verbs in different forms, from high to low, are V-es, V-ed, V, V-en, and V-ing, which coincides with the order in which they are learned.

Figure 7: Token reconstruction (7a) and mask prediction (7b) accuracy. We also rescale the accuracy as in Figure 1.

C.3 How Does Occurrence Frequency Affect Learning Speed of A Word?
In the main text, we observe that words of different POS are learned at different times during pretraining. We also point out that the learning speed of different POS roughly corresponds to their occurrence rates. However, it is not clear to what extent a word's occurrence frequency affects how soon the model learns to reconstruct or mask-predict it. We provide a deeper analysis of the relationship between the learning speed of a word and its occurrence frequency.

C.4 Mask Predict and Token Reconstruction of BERT
We provide the results of BERT's token reconstruction and mask prediction in Figure 9. We observe that content words are learned later than function words, while the learning speed is faster than ALBERT's. To be more specific, we say a word type A is learned faster than another word type B if either the learning curve of A rises from 0 earlier than that of B, or the rescaled learning curve of A is steeper than that of B.
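This "learned faster" criterion can be sketched in code. Note that the steepness proxy below (the maximum one-interval increase of the rescaled curve) is our own concrete choice for illustration; the text does not pin down a specific measure.

```python
def first_rise_step(curve, steps, eps=0.0):
    """Return the first pretrain step at which the curve rises above eps,
    or None if it never does."""
    for step, value in zip(steps, curve):
        if value > eps:
            return step
    return None

def learned_faster(curve_a, curve_b, steps):
    """Word type A is learned faster than B if its (rescaled) learning
    curve rises from 0 earlier, or, on a tie, rises more steeply.
    Steepness is approximated here by the largest single-interval gain."""
    rise_a = first_rise_step(curve_a, steps)
    rise_b = first_rise_step(curve_b, steps)
    if rise_a is not None and (rise_b is None or rise_a < rise_b):
        return True
    if rise_a == rise_b and rise_a is not None:
        steepness = lambda c: max(b - a for a, b in zip(c, c[1:]))
        return steepness(curve_a) > steepness(curve_b)
    return False
```

Applied to two rescaled curves sampled at the same checkpoints, this reproduces the ordering we describe qualitatively for function versus content words.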

D.1 Probing Model Details
As mentioned in the main text, we modified and reimplemented the edge probing (Tenney et al., 2018) models in our experiments. The modifications are detailed as follows:
• We remove the projection layer that projects representations output by the language model to the probing model's input dimension.
• We use average pooling to obtain span representations, instead of self-attention pooling.
• We use linear classifiers instead of 2-layer MLP classifiers.
• We probe the representation of a single layer, instead of concatenating or scalar-mixing representations across all layers.
Since our probing models are much simpler than those in Tenney et al. (2018), our probing results might be inferior to the original work. The number of probing model parameters in our experiments is approximately 38K for POS tagging, 24K for constituent tagging, and 100K for SRL.

D.2 Dataset
We use OntoNotes 5.0, which can be downloaded from https://catalog.ldc.upenn.edu/LDC2013T19. The statistics of this dataset are in Table 5.

D.3 SRL, Coreference Resolution, and Constituent Labeling Results
Here in Figure 10, we provide the probing results of SRL, coreference resolution, and constituent labeling.

D.4 Probing Results of BERT and ELECTRA
We provide the probing results of BERT and ELECTRA in Figure 11. All the probing experiments of ALBERT, BERT, and ELECTRA share the same set of hyperparameters and model architectures.
We observe a similar trend as with ALBERT: the probing performance rises quite quickly and plateaus (or even slightly decays) afterward. We also find that the performance drop of the layers closer to ELECTRA's output layer is clearly observable, which may spring from its discriminative pretraining objective.

E.2 Finetune Details
We use the code in https://github. Here we provide the performance of individual tasks in the GLUE benchmark on the development set in Figure 12, along with the performance on SQuAD 2.0 (Rajpurkar et al., 2018).

E.4 Downstream performance of BERT and ELECTRA
We use the same hyperparameters as in Table 7 to finetune the BERT and ELECTRA models. Except for the performance of BERT on SQuAD 2.0, all other results are comparable with those finetuned from the official Google pretrained models. We can observe from Figure 13 and Figure 12 that all three models' performance on downstream tasks shows similar trends: performance skyrockets during the initial pretraining stage, and the returns gradually diminish later. From Figure 13c, we also find that among the three models, ALBERT plateaus the earliest, which may result from its parameter-sharing nature.

F.1 Dataset Statistics
In our world knowledge experiment, we only use 1-1 relations (P1376 and P36) and N-1 relations (the remaining relations in Table 8). Among those relations, we only ask our model to predict objects ([Y] in the templates in Table 8) that consist of a single token, following Petroni et al. (2019). From those relations, we report world knowledge that behaves differently during pretraining in Figure 6: knowledge that is learned during pretraining (e.g., P176), knowledge that is never learned during the whole pretraining process (e.g., P140), knowledge that is once learned and then forgotten during pretraining (e.g., P138), and knowledge that keeps oscillating during pretraining (e.g., P407). The statistics of all world knowledge evaluated are listed in Table 8.
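A sketch of this cloze-style evaluation, following the LAMA setup of Petroni et al. (2019): a template's subject slot is filled, the object slot is masked, and we check whether the model's top-1 prediction recovers the single-token object. Here `predict_fn` is a hypothetical stand-in for the masked-LM's top-1 decoder, not an API from any specific library.

```python
def fill_template(template, subject):
    """Build a cloze query from a LAMA-style template, e.g.
    '[X] is produced by [Y].' with subject 'lumia 800' becomes
    'lumia 800 is produced by [MASK].'"""
    return template.replace("[X]", subject).replace("[Y]", "[MASK]")

def precision_at_1(predict_fn, facts, template):
    """Fraction of (subject, object) facts whose single-token object is
    the model's top-1 prediction for the masked position.

    predict_fn(query) -> str stands in for the masked LM's decoder;
    only single-token objects are queried, so one [MASK] suffices."""
    hits = sum(predict_fn(fill_template(template, subject)) == obj
               for subject, obj in facts)
    return hits / len(facts)
```

Running this per relation at every checkpoint yields one precision@1 curve per relation, which is how the learned / never-learned / forgotten / oscillating categories above can be distinguished.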

F.2 Qualitative Results and Complete World Knowledge Results
We provide qualitative examples for Section 6 in Table 9. We also provide the complete results of all world knowledge we use in Figure 14.

Table 9: Example results of world knowledge evolution during pretraining.
100K: the lumia 800 is produced by nokia. / hamburg airport is named after it.
200K: nokia lumia 800 is produced by nokia. / hamburg airport is named after hamburg.
500K: nokia lumia 800 is produced by nokia. / hamburg airport is named after him.
1M: nokia lumia 800 is produced by nokia. / hamburg airport is named after him.
We can observe that the model successfully predicts the object in the Nokia example from 100K steps on and does not forget it during the rest of pretraining. On the other hand, the model correctly predicts Hamburg in the second example only at 200K steps, and fails at the other pretrain steps.