Punctuation Restoration using Transformer Models for High- and Low-Resource Languages

Punctuation restoration is a common post-processing problem for Automatic Speech Recognition (ASR) systems. It is important for improving the readability of the transcribed text for human readers and for facilitating downstream NLP tasks. Current state-of-the-art approaches address this problem using different deep learning models. Recently, transformer models have proven their success in downstream NLP tasks, yet these models have been explored very little for the punctuation restoration problem. In this work, we explore different transformer-based models and propose an augmentation strategy for this task, focusing on a high-resource (English) and a low-resource (Bangla) language. For English, we obtain results comparable to the state of the art, while for Bangla, ours is the first reported work, which can serve as a strong baseline for future research. We have made our developed Bangla dataset publicly available for the research community.


Introduction
Due to recent advances in deep learning methods, the accuracy of Automatic Speech Recognition (ASR) systems has increased significantly (e.g., 3.4% WER on the LibriSpeech noisy test set (Park et al., 2020)). The improved performance of ASR has enabled the development of voice assistants (e.g., Siri, Cortana, Bixby, Alexa, and Google Assistant) and their wider use at the user end. Among the different components (e.g., acoustic and language models) and pre- and post-processing steps, punctuation restoration is one of the post-processing steps that needs to be dealt with to improve readability and to utilize the transcriptions in subsequent NLP applications (Jones et al., 2003; Matusov et al., 2007). This is because state-of-the-art NLP models are mostly trained on punctuated texts (e.g., texts from newspaper articles or Wikipedia). Hence, the lack of punctuation significantly degrades performance. For example, there is a performance difference of roughly 10% for a Named Entity Recognition system when the model is trained with newspaper texts and tested on transcriptions (Alam et al., 2015).
To address this issue, most of the earlier efforts on the punctuation restoration task used lexical, acoustic, prosodic, or a combination of these features (Gravano et al., 2009; Levy et al., 2012; Zhang et al., 2013; Xu et al., 2014; Szaszák and Tündik, 2019; Che et al., 2016a). Lexical features have been widely used because a model can be trained on any punctuated text (e.g., publicly available newspaper articles or content from Wikipedia) and because of the availability of such text at large scale. This is a reasonable choice, as producing punctuated transcribed text is a costly procedure.
In terms of machine learning models, conditional random fields (CRFs) were widely used in earlier studies (Lu and Ng, 2010; Zhang et al., 2013). Lately, deep learning models such as Long Short-Term Memory (LSTM) networks, Convolutional Neural Networks (CNNs), and transformers have also been applied to this task (Che et al., 2016b; Gale and Parthasarathy, 2017; Zelasko et al., 2018; Wang et al., 2018).
A variety of transformer-based language models (e.g., BERT (Devlin et al., 2019a), RoBERTa (Liu et al., 2019)) have emerged, but they have not been explored widely for this problem. Hence, we aim to explore different architectures and fine-tune pre-trained models for this task, focusing on English and Bangla. Punctuation restoration models are usually trained on clean texts but applied to noisy ASR output. As such, performance may degrade due to errors introduced by ASR models that are not present in the training data. We design an augmentation strategy (see Section 4.1.2) to address this issue. For English, we train and evaluate the models using the IWSLT reference and ASR test datasets. Our proposed augmentation strategy yields a 3.8% relative improvement in F1 score on ASR transcriptions for English and obtains state-of-the-art results. For Bangla, there has been no prior reported work on punctuation restoration, and no resources are available. Therefore, we prepare a training dataset from a news corpus and provide strong baselines for news, reference, and ASR transcriptions. To shed light on the current state of the art in punctuation restoration, our contributions in this study are as follows:

1. We explore transformer-based language models for the punctuation restoration task.
2. We propose an augmentation strategy.
3. We prepare training and evaluation datasets for Bangla and provide strong benchmark results.
4. We have made our source code and datasets publicly available.

We organize the rest of the paper as follows. In Section 2, we discuss recent works based on lexical features. We describe the English and Bangla datasets used in this study in Section 3. Experimental details are provided in Section 4. We compare our results against other published results on the IWSLT dataset and provide benchmark results on the Bangla dataset in Section 5. We conclude the paper in Section 6.

Related Work
Recent lexical-feature-based approaches for the punctuation restoration task are predominantly based on deep neural networks. Che et al. (2016b) used pre-trained word embeddings to train a feedforward deep neural network and a CNN. Their results showed improvements over a CRF-based approach that uses purely textual data.
Since context is important for this type of task, several studies explored recurrent neural network (RNN) based architectures combined with CRFs and pre-trained word vectors. For instance, Tilk and Alumäe (2016) used a bidirectional RNN with an attention mechanism to improve performance over DNN and CNN models. In another study, Gale and Parthasarathy (2017) used a character-level LSTM architecture to achieve results that are competitive with a word-level CRF-based approach. Yi et al. (2017) combined a bidirectional LSTM with a CRF layer and an ensemble of three networks. They further used knowledge distillation to transfer knowledge from the ensemble of networks to a single DNN.
Transformer-based approaches have been explored in several studies (Yi and Tao, 2019; Nguyen et al., 2019). Yi and Tao (2019) combined pre-trained word and speech embeddings, which improved performance compared to a model based on word embeddings alone. Nguyen et al. (2019) used a transformer architecture to restore both punctuation and capitalization. Punctuation restoration is also important for machine translation. Wang et al. (2018) used a transformer-based model for spoken language translation and achieved significant improvements over CNN and RNN baselines, especially on the joint punctuation prediction task.
More recent approaches are based on pre-trained transformer models. Makhija et al. (2019) used a pre-trained BERT (Devlin et al., 2019a) model with a bidirectional LSTM and a CRF layer to achieve state-of-the-art results on reference transcriptions. Yi et al. (2020) used adversarial multi-task learning with an auxiliary part-of-speech tagging task on top of a pre-trained BERT model.
In this study, we also explore transformer-based models; however, unlike prior works that studied a single architecture (BERT), we experiment with several models. We also propose a novel augmentation scheme that improves performance. Our augmentation is closely related to the techniques proposed by Wei and Zou (2019b), who consider synonym replacement, random insertion, random swap, and random deletion. While their work targets text classification tasks, we propose a modified version for this study, which is a sequence labeling task. We do not use synonym replacement and random swap, as such errors do not usually appear in speech transcriptions.

English Dataset
We use the IWSLT dataset for English punctuation restoration, which consists of transcriptions of TED Talks (Cettolo et al., 2013; Federico et al., 2012). Though this dataset was originally released in the IWSLT evaluation campaign in 2012, Che et al. (2016b) later prepared and released a refined version of it publicly. For this study, we use the same train, development, and test splits released by Che et al. (2016b). The training and development sets consist of 2.1M and 296K words, respectively. Two test sets are provided, with manual and ASR transcriptions containing 12,626 and 12,822 words, respectively. These are taken from the test data of the IWSLT2011 ASR dataset. A detailed description of the dataset can be found in (Che et al., 2016b). There are four labels, including three punctuation marks: (i) Comma: includes commas, colons, and dashes; (ii) Period: includes full stops, exclamation marks, and semicolons; (iii) Question: only the question mark; and (iv) O: any other token.
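The four-way label scheme above can be sketched as a small preprocessing step that turns punctuated text into per-token training labels. The mapping of raw marks to labels follows the description above; the function itself (names, and tokenization by whitespace) is an illustrative assumption, not the authors' code.

```python
# Map raw punctuation marks to the four labels described above.
LABEL_OF = {
    ",": "Comma", ":": "Comma", "-": "Comma",
    ".": "Period", "!": "Period", ";": "Period",
    "?": "Question",
}

def label_tokens(text: str):
    """Turn punctuated text into (token, label) pairs for training."""
    pairs = []
    for raw in text.split():
        word = raw.rstrip(".,:;!?-")      # strip trailing punctuation
        trailing = raw[len(word):]
        label = LABEL_OF.get(trailing[:1], "O") if trailing else "O"
        pairs.append((word, label))
    return pairs
```

A model then predicts one of these four labels for each (unpunctuated) input token.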

Bangla Dataset
To the best of our knowledge, there are no publicly available resources for the Bangla punctuation restoration task. Hence, we prepare a dataset using a publicly available corpus of Bangla newspaper articles (Khatun et al., 2019). This corpus is available in train and test splits. For our task, we selected 4,000 and 500 articles from their train split for the training and development sets, respectively, and 200 articles from their test split for the test set. The training, development, and test sets consist of 1.38M, 180K, and 88K words, respectively.
Additionally, we prepare two test datasets consisting of manual and ASR transcriptions to evaluate performance. We collected 65 minutes of speech excerpts extracted from four Bangla short stories (i.e., monologue read speech), which were manually transcribed with punctuation. We obtained ASR transcriptions for the same audio using the Google Cloud Speech API. Note that the Google Speech API does not provide punctuation for Bangla, so the obtained ASR transcriptions were then manually annotated with punctuation. We computed the Word Error Rate (WER) of the ASR transcriptions against our manual transcriptions, which results in 14.8% WER. The manual and ASR transcriptions consist of 6,821 and 6,417 words, respectively. Similar to English, we consider four labels for Bangla, i.e., Period, Comma, Question, and O.
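The reported 14.8% WER is the standard word-level edit-distance measure. A minimal sketch of that computation (our own illustration, not the authors' exact tooling):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference length,
    computed via Levenshtein edit distance over whitespace-separated words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)
```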
In Table 1, we present the label distributions for both English and Bangla, with the percentage of each punctuation mark given in parentheses. In general, the proportion of questions is low (less than 1%), which we observe in both the English and Bangla news data. However, it is much higher in the Bangla manual and ASR transcriptions. This is because these texts are selected from short stories in which people often engage in conversation and ask each other questions. The literary style of the stories differs from news, and as a result, the proportion of Period is also higher in the Bangla manual and ASR transcriptions. This results in a much smaller average sentence length in these datasets, as can be seen in Table 2. We can compare these numbers with English as reported in (Zelasko et al., 2018), where the authors reported 79.1% O tokens in training data collected from conversational speech. We have 78.82% O tokens in our reference test data. This suggests that our transcribed data are closer in distribution to natural conversations.

Experiments
For this study, we explored different transformer-based models for both English and Bangla. In addition, we used a bidirectional LSTM (BiLSTM) on top of the pre-trained transformer network, and an augmentation method to improve the performance of the models.

Models and Architectures
In Figure 1, we show the general network architecture used in our experiments. We obtain a d-dimensional embedding vector from the pre-trained language model for each token. This is used as input to a BiLSTM layer with h hidden units, which allows the network to make effective use of both past and future context for prediction. The outputs of the forward and backward LSTM layers are concatenated at each time step and fed to a fully connected layer with four output neurons, corresponding to the three punctuation marks and the O token.
As can be seen in the figure, the input sentence "when words fail music speaks" does not have any punctuation, and the task of the model is to predict a Comma after the word "fail" and a Period after the word "speaks" to produce the output sentence "when words fail, music speaks." We measure the performance of the models in terms of precision (P), recall (R), and F1-score (F1).
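The architecture described above (pretrained-LM token embeddings, a BiLSTM with h hidden units, and a four-way linear classifier) can be sketched in PyTorch roughly as follows. The class name and default dimensions are assumptions for illustration; in the paper the embeddings come from a fine-tuned pre-trained transformer rather than being passed in precomputed.

```python
import torch
import torch.nn as nn

class PunctuationRestorer(nn.Module):
    """Sketch of Figure 1: LM embeddings -> BiLSTM -> linear layer over
    4 labels (O, Comma, Period, Question)."""

    def __init__(self, d: int = 768, h: int = 768, num_labels: int = 4):
        super().__init__()
        # In the paper, h is set equal to the token embedding dimension d.
        self.lstm = nn.LSTM(d, h, batch_first=True, bidirectional=True)
        # Forward and backward states are concatenated, hence 2 * h inputs.
        self.classifier = nn.Linear(2 * h, num_labels)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, d) from the pretrained LM
        lstm_out, _ = self.lstm(token_embeddings)   # (batch, seq_len, 2h)
        return self.classifier(lstm_out)            # (batch, seq_len, num_labels)
```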

Pretrained Language Model
Transfer learning has long been popular in computer vision, and the emergence of transformers (Vaswani et al., 2017) has paved the way for transfer learning in NLP applications. These models are trained on large text corpora (e.g., BERT is trained on 800M words from the BooksCorpus and 2,500M words from Wikipedia), and their success has been demonstrated by fine-tuning them on downstream NLP tasks. In our experiments, we used such pre-trained language models for the punctuation restoration task. We briefly discuss the monolingual (English) and multilingual language models used in this study.
BERT (Devlin et al., 2019a) is designed to learn deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. It uses a multi-layer bidirectional transformer encoder architecture (Vaswani et al., 2017) and is pre-trained with two objectives: masked language modeling (MLM) and next sentence prediction (NSP).
RoBERTa (Liu et al., 2019) is a replication study of BERT pretraining showing that improvements can be made by using larger datasets and vocabularies and by training on longer sequences with bigger batches. It uses dynamic masking of tokens, i.e., the masking pattern is generated each time a sequence is fed to the model instead of being fixed beforehand. The authors also remove the NSP task and use only the MLM loss for pretraining.
ALBERT (Lan et al., 2020) incorporates two parameter reduction techniques to design an architecture with significantly fewer parameters than a traditional BERT architecture. The first is factorized embedding parameterization: the V × H embedding matrix is decomposed into two smaller matrices of sizes V × E and E × H, where V is the vocabulary size, E is the word-piece embedding size, and H is the hidden layer size. This reduces the number of embedding parameters from O(V × H) to O(V × E + E × H), which is significant when E ≪ H. The second is parameter sharing across layers, which prevents the number of parameters from growing with depth. The NSP task introduced in BERT is replaced by a sentence-order prediction (SOP) task in ALBERT.
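To make the factorization concrete, here is the embedding parameter count for ALBERT-like sizes. V = 30,000, E = 128, and H = 768 are assumed illustrative values (roughly those of ALBERT-base), not figures from this paper.

```python
V, E, H = 30_000, 128, 768

bert_style = V * H              # one V x H embedding matrix, O(V * H)
albert_style = V * E + E * H    # V x E lookup plus E x H projection, O(V*E + E*H)

# The factorization cuts ~23.0M embedding parameters down to ~3.9M.
print(bert_style, albert_style)
```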
DistilBERT (Sanh et al., 2019) uses knowledge distillation from BERT to train a model that has 40% fewer parameters and is 60% faster while retaining 97% of language understanding capabilities of the BERT model.The training objective is a linear combination of distillation loss, supervised training loss, and cosine distance loss.
Multilingual Models MLM has also been utilized for learning language models from large-scale multilingual corpora. The multilingual BERT model (mBERT) is trained on the more than 100 languages with the largest Wikipedias. To account for the variation in Wikipedia size across languages, data is sampled using exponentially smoothed weighting (with a factor of 0.7) so that high-resource languages like English are under-sampled relative to low-resource languages. Word counts are weighted the same way, so that low-resource language vocabularies are up-weighted.
Cross-lingual models (XLM) (Conneau and Lample, 2019) use MLM in multiple language settings, similar to BERT.Instead of using a pair of sentences, an arbitrary number of sentences are used with text length truncated at 256 tokens.
XLM-RoBERTa (Conneau et al., 2020) is trained with a multilingual MLM objective similar to XLM but on a larger dataset.It is trained in one hundred languages, using more than two terabytes of filtered Common Crawl data (Wenzek et al., 2020).

Augmentation
For this study, we propose an augmentation method inspired by Wei and Zou (2019a), as discussed earlier. Our method is based on the types of errors an ASR system makes during recognition: insertion, substitution, and deletion.
Due to the lack of large-scale manual transcriptions, punctuation restoration models are typically trained on written text, which is well formatted and correctly punctuated. Hence, the trained model lacks knowledge of the typical errors an ASR system makes. To expose the model to such characteristics, we use an augmentation technique that simulates these errors and dynamically creates new sequences on the fly within a batch. Dynamic augmentation differs from the traditional augmentation approaches widely used in NLP (Wei and Zou, 2019a); however, it is widely used in computer vision for image classification tasks (Cubuk et al., 2020).
The three kinds of augmentation, corresponding to the three possible errors, are as follows:

1. Substitution: we replace a token with another token. In our experiments, we randomly replace a token with the special unknown token.
2. Deletion: we randomly delete some tokens from the processed input sequence.
3. Insertion: we add the unknown token at random positions in the input.
We hypothesize that the three error types are not equally prevalent, and hence different augmentations will affect performance differently. With this in mind, we process the input text using three tunable parameters: (i) α, the probability that a token position is changed; (ii) α_sub, the probability that a change is a substitution; and (iii) α_del, the probability that a change is a deletion. The probability of insertion is then 1 − (α_sub + α_del).
When applying a substitution, we replace the token at that position with the unknown token and leave the target punctuation mark unchanged. For a deletion, both the token and the punctuation label at that position are removed. For an insertion, we insert the unknown token with the O label at that position.
Since deletion and insertion operations may make the sequence shorter or longer than the fixed sequence length used during training, we pad or truncate as necessary.
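Putting the three operations and the parameters α, α_sub, and α_del together, the augmentation can be sketched as follows. This is a minimal reconstruction from the description above; the token/label representation and the `<unk>` placeholder are our own assumptions, and padding/truncation to the fixed length is omitted.

```python
import random

UNK = "<unk>"  # stands in for the model-specific unknown token

def augment(tokens, labels, alpha=0.15, alpha_sub=0.4, alpha_del=0.4):
    """With probability alpha a position is perturbed; the perturbation is a
    substitution with probability alpha_sub, a deletion with probability
    alpha_del, and an insertion otherwise (probability 1 - alpha_sub - alpha_del)."""
    out_tokens, out_labels = [], []
    for tok, lab in zip(tokens, labels):
        if random.random() < alpha:
            r = random.random()
            if r < alpha_sub:                 # substitution: keep the label
                out_tokens.append(UNK)
                out_labels.append(lab)
            elif r < alpha_sub + alpha_del:   # deletion: drop token and label
                continue
            else:                             # insertion: UNK with label O
                out_tokens.extend([UNK, tok])
                out_labels.extend(["O", lab])
        else:
            out_tokens.append(tok)
            out_labels.append(lab)
    return out_tokens, out_labels
```

Because the perturbation is sampled per batch, each epoch sees a different corrupted view of the same training text.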

Experimental Settings
We used the pre-trained models available in HuggingFace's Transformers library (Wolf et al., 2019); more details about the different architectures can be found on the HuggingFace website. For tokenization, we used the model-specific tokenizers.
During training, we used a maximum sequence length of 256. Each sequence starts with a special start-of-sequence token and ends with a special end-of-sequence token. Since the tokenizers use byte-pair encoding (Sennrich et al., 2016), a word may be tokenized into subword units. If adding the subword tokens of a word would make the sequence length exceed 256, we exclude those tokens from the current sequence and start the next sequence with them. We use the padding token after the end-of-sequence token to fill the remaining slots of the sequence. Padding tokens are masked so that no attention is performed on them. We use a batch size of 8 and shuffle the sequences before each epoch. Our chosen learning rates are 5e-6 for large models and 1e-5 for base models, optimized using the development set. The LSTM dimension h is set to the token embedding dimension d. All models are trained with the Adam (Kingma and Ba, 2015) optimizer for 10 epochs. Other parameters are kept at the default settings discussed in (Devlin et al., 2019b). The model with the best performance on the development set is used for evaluating the test datasets.

Table 3: Results on IWSLT2011 manual (Ref.) and ASR transcriptions of the test sets. Highlighted rows are the comparable results between ours and previous studies. Overall best results are in bold, and the best F1 for an individual punctuation mark is in bold italics.
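The sequencing scheme described above (start/end tokens, a 256-token budget, padding, and an attention mask over the padding) can be sketched as follows. For simplicity, this sketch splits at token granularity rather than keeping all subword pieces of a word together, and the special-token ids (`bos`, `eos`, `pad`) are placeholders, not a specific tokenizer's ids.

```python
def make_sequences(token_ids, max_len=256, bos=0, eos=2, pad=1):
    """Chunk a long token-id stream into fixed-length model inputs.

    Each chunk is wrapped in BOS/EOS, padded to max_len, and paired with an
    attention mask (1 for real tokens, 0 for padding)."""
    body = max_len - 2                      # room left after BOS and EOS
    sequences = []
    for i in range(0, len(token_ids), body):
        chunk = token_ids[i:i + body]
        seq = [bos] + chunk + [eos]
        attention_mask = [1] * len(seq) + [0] * (max_len - len(seq))
        seq = seq + [pad] * (max_len - len(seq))
        sequences.append((seq, attention_mask))
    return sequences
```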

Results on English Dataset
In Table 3, we report our experimental results along with previous results on the same dataset. We provide the results obtained using the BERT, RoBERTa, ALBERT, DistilBERT, mBERT, and XLM-RoBERTa models without augmentation. Large variants of the models perform better than the base models, and monolingual models perform better than their multilingual counterparts. RoBERTa achieves better results than the other models, as it was trained on a larger corpus and has a larger vocabulary. Our best result is obtained using the RoBERTa model with augmentation, with parameters α = 0.15, α_sub = 0.4, and α_del = 0.4. The performance gain from augmentation comes from improved precision.
We obtained state-of-the-art results on both test sets in terms of overall F1 score (highlighted rows). On the Ref. test set, we obtained the best result for Comma and comparable results for Question (highlighted using bold italics). However, SAPR (Wang et al., 2018) performed much better than the other methods for Period on this data. On the ASR test set, our result is marginally better than Yi et al. (2020) in terms of overall F1 score. Our model performed better for Period but comparatively worse for Comma and Question. Overall, our model has better recall than precision on this dataset.

Results on Bangla Dataset
In Table 4, we report results on the Bangla test sets comprised of news, manual, and ASR transcriptions. Since no monolingual transformer model is publicly available for Bangla, we explored different multilingual models. We obtained the best result using the XLM-RoBERTa (large) model, as it is trained with more text for low-resource languages like Bangla and has a larger vocabulary for them. This is consistent with the findings reported in (Conneau et al., 2020), where the authors report improvements over multilingual BERT and XLM in cross-lingual understanding tasks for low-resource languages. We apply augmentation to the XLM-RoBERTa model, and the best result is obtained with augmentation parameters α = 0.15, α_sub = 0.4, and α_del = 0.4. However, the performance gain from augmentation is marginal on the Bangla dataset. Overall, performance on the news test set is better than on the manual and ASR data. Performance for Comma is lower than for Period and Question. Compared to English, we notice a performance drop of about 10% for Period and Question, but for Comma the drop is more than 30% on the ASR test set.

For many applications (e.g., semi-automated subtitle generation), it is of utmost importance to reduce the time and effort of human labelers and make the manual annotation process faster. In such cases, identifying the correct positions of the punctuation marks is important, as reported in (Che et al., 2016b). For Bangla, we wanted to understand what can be gained by merging punctuation classes and only identifying their positions. For this purpose, we evaluate performance in 3-class and 2-class settings. We combine Period and Question to form the 3-class setting; Comma is further merged with these to form the 2-class setting, i.e., punctuation or no punctuation. In Table 5, we report the results of the binary and multiclass settings using the XLM-RoBERTa (large) model coupled with augmentation. As can be seen, the model performs well at predicting punctuation positions. For manual (Ref.) and ASR transcriptions, we see a significant gain as the number of classes is reduced from four toward two. This could be because, as the number of classes is reduced, the classifier's complexity decreases, which leads to an increase in the model's performance. The performance gain is comparatively lower for news when merging four classes into three; however, it increases significantly when reduced to two. Considering these findings, we believe this type of model can help human annotators in such applications.
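The class merging used for the 3-class and 2-class evaluations amounts to a simple label mapping, sketched below; the merged label names are our own.

```python
# 4 -> 3 classes: Period and Question merged into one class.
TO_3_CLASS = {"O": "O", "Comma": "Comma", "Period": "PeriodQ", "Question": "PeriodQ"}

# 4 -> 2 classes: any punctuation vs. none.
TO_2_CLASS = {"O": "O", "Comma": "Punct", "Period": "Punct", "Question": "Punct"}

def merge_labels(labels, mapping):
    """Map a sequence of 4-class labels into a coarser label set for evaluation."""
    return [mapping[label] for label in labels]
```

Applying the mapping to both predictions and gold labels before scoring gives the merged-class results reported in Table 5.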

Ablation Studies
We experimented with using a CRF layer after the linear layer to predict the most probable tag sequence instead of using a softmax layer. However, we did not notice any performance improvement, and even observed a slight decrease on the ASR test data. The results using the RoBERTa large model are reported in Table 6.
We also analyzed the effect on performance when the substitution, insertion, and deletion augmentations are applied in isolation. These results are reported in Table 7 for the RoBERTa large model. We explored substitution with a random token from the vocabulary (reported in the row Substitution (α = 0.15, random)); however, it performed worse than substituting with the unknown token. We notice that the performance gain from the different augmentations is larger on the ASR test set than on the reference test set.

Discussion
For English, we obtained state-of-the-art results on manual and ASR transcriptions using our augmentation technique coupled with the RoBERTa-large model. There is still a large gap between the results on manual and ASR transcriptions. In Figure 2, we report the confusion matrices (in percentages) for manual and ASR transcriptions. From the figure, we observe that for ASR transcriptions, in a high proportion of cases Question and Comma are predicted as O and Period. We will investigate this finding further in future work.
Compared to English, the performance for Bangla is relatively low. We hypothesize that several factors are responsible for this. First, pre-trained monolingual language models for English usually perform better than multilingual models. Even in the multilingual models, the proportion of English content in the training data is higher, so the models are expected to perform better for English. Second, and perhaps more important, is the nature of the training data. For Bangla, due to the lack of punctuated transcribed data, we used a news corpus for training. Hence, the trained model does not learn the nuances of transcriptions, which reduces prediction accuracy. Third, our ASR transcriptions are taken from story excerpts containing monologue and a significant amount of conversation (dialogue), which varies in complexity (e.g., dialogue has interruptions and overlaps, and both short and long utterances). An aspect of this complexity is also evident in Table 1, where the proportion of Period is almost double that of the news data and the proportion of Question is more than six times greater. For English, on the other hand, both the training and test data are taken from TED talks, so there is no such discrepancy between the data distributions.
As for English, we also examined the error cases for Bangla. In Figure 3, we report the confusion matrices. We observed the same phenomenon as for English, but much more pronounced: Question and Comma are predicted as O and Period for the news, manual, and ASR transcriptions.

Conclusion
In this study, we explored different transformer models for a high- and a low-resource language (i.e., English and Bangla). In addition, we proposed an augmentation technique that improves performance on noisy ASR texts. There had been no previously reported results or resources for punctuation restoration in Bangla; our study, findings, and developed resources will enrich and push the current state of the art for this low-resource language. We have released the created Bangla dataset and code for the research community.

Figure 1: A general model architecture for our experiments.

Figure 2: Confusion matrix (in percentage) for English test datasets.

Table 1: Distributions of the English and Bangla datasets. The number in parentheses represents the percentage.

Table 2: Average sentence length (Avg.) with standard deviation (Std.) for each language.

Table 4: Results on the Bangla test datasets.

Table 5: Results on the Bangla test datasets with merged classes.

Table 7: Results of augmentation on the IWSLT2011 Ref. and ASR test data with the RoBERTa-large model.