Attentively Embracing Noise for Robust Latent Representation in BERT

Modern digital personal assistants interact with users through voice. They therefore rely heavily on automatic speech recognition (ASR) to convert speech to text and perform further tasks. We introduce EBERT, which stands for EmbraceBERT, with the goal of extracting more robust latent representations for the task of noisy ASR text classification. Conventionally, BERT is fine-tuned for downstream classification tasks using only the [CLS] starter token, with the remaining tokens being discarded. We propose using all encoded transformer tokens and further encoding them with a novel attentive embracement layer and a multi-head attention layer. This approach uses the otherwise discarded tokens as a source of additional information, and uses the multi-head attention in conjunction with the attentive embracement layer to select important features from clean data during training. This allows for the extraction of a robust latent vector, resulting in improved classification performance during testing when presented with noisy inputs. We show the impact of our model on both the Chatbot and Snips corpora for intent classification with ASR error. Results, reported as F1-scores averaged over 10 runs, show that our model significantly outperforms the baseline model.


Introduction
In recent years, machine learning methods have significantly contributed to the development of automatic speech recognition (ASR) (Deng and Li, 2013; Padmanabhan and Johnson Premkumar, 2015; Benkerzaz et al., 2019; Nassif et al., 2019). The fast advances in the ASR field can be explained by the urgent need to efficiently bridge the gap between humans and the ever-evolving technologies surrounding us. As technology becomes smaller and more portable, it becomes clear that the only feasible way of interaction is through a digital personal assistant with a voice interface (Tulshan and Dhage, 2018). However, high error rates still prevail and pose a tremendous hurdle to the widespread adoption of speech technology by users worldwide (Zavareh et al., 2013; Errattahi et al., 2018).
Researchers in the computational linguistics community analyze noisy text (Subramaniam, 2010) obtained through informal settings, such as social networks, or through processing techniques, such as optical character recognition (Granet et al., 2018; Tomokiyo et al., 2018), ASR (Vinciarelli, 2005; Agarwal et al., 2007; Shrestha et al., 2019), or machine translation models (Sergio et al., 2018). We focus on noisy ASR text classification. Researchers in this area have simulated the noise introduced by ASR systems and investigated the performance of Naive Bayes and Support Vector Machine (SVM) classifiers in noisy text classification (Vinciarelli, 2005; Agarwal et al., 2007). Shrestha et al. (2019) have also investigated the impact on performance of using text embedded with speech data with neural networks and logistic regression. These studies, however, focus mainly on the impact that noise has on the performance of shallow machine learning models, such as SVMs or vanilla neural networks, rather than on extracting a more robust input representation for improved performance. More recently, the transformer-based (Vaswani et al., 2017) model BERT (Devlin et al., 2018) has achieved state-of-the-art results in various natural language processing tasks. This inspired Hrinchuk et al. (2020) to suggest the use of a transformer encoder-decoder architecture for the task of ASR correction, and Liao et al. (2020) to improve readability, also using transformer-based sequence-to-sequence architectures for the same task.
The pre-trained model BERT is fine-tuned for downstream tasks by extracting an encoded transformer token for each corresponding input token. In BERT's classic text classification approach, only the sentence starter [CLS] token is used for classification. However, we hypothesize that the discarded tokens can be used as a source of additional information, and should therefore also be used. We also search for inspiration outside of the noisy text analytics domain into the multimodal data domain, with concepts from EmbraceNet (Choi and Lee, 2019), which is proposed to improve robustness in the absence of data. The authors achieve that with an embracement layer that randomly selects important features from each modality vector and meshes them into one robust vector called embraced vector. We propose using the otherwise discarded tokens as a source of additional information and further encode them with a novel attentive embracement and multi-head attention layer. The combination of these layers allows the model to select important features from clean data during training, so that when presented with noisy inputs during testing, it is able to extract a robust latent vector for improved classification performance. We call our proposed model EmbraceBERT (EBERT).
In summary, our contributions are three-fold:
• We propose a novel model, called EBERT, for the extraction of robust latent representations from noisy text. This model further encodes all transformer-encoded tokens, used as a source of additional information, with a combination of a novel attentive embracement layer and a multi-head attention layer. This approach allows important features to be selected from clean data during training so as to improve performance during testing, when presented with noisy inputs.
• We show improved performance in the noisy ASR text classification task when compared to the baseline model. The dataset used for evaluation is the Chatbot Natural Language Understanding (NLU) Evaluation Corpus for intent classification (clean data) and two variations with ASR error. The model is evaluated in three settings: (1) trained and tested with clean data, (2) trained with clean and tested with noisy data, and (3) trained and tested with noisy data.
• We provide an extensive ablation study to show the importance of each module in our method, and we also release our PyTorch code.

Related Works

BERT for Text Classification
BERT (Devlin et al., 2018), short for Bidirectional Encoder Representations from Transformers, is a powerful language representation model that achieves state-of-the-art results on various natural language processing tasks, including text classification (Devlin et al., 2018). Fig. 1(a) shows that this model is built on transformers (Vaswani et al., 2017), the first fully attentional model to be used in sequence-modeling tasks, which surpassed the existing best models by introducing a self-attention mechanism for input representations and positional encoding for sequence ordering. The self-attention mechanism consists of a scaled dot-product attention that maps a query $Q$ and the available information, in the form of key-value pairs $K$-$V$, to an output. This mapping is shown in Eq. (1):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V \tag{1}$$

where $d_k$ is the dimension of $K$. Note that in self-attention, $Q$, $K$, and $V$ all originate from the same input. BERT attains such impressive results through two pre-training tasks: masking and prediction of a small percentage of the input tokens, and prediction of the likelihood that sentence A is followed by sentence B. The combination of these two strategies greatly increases context awareness, enabling it to be a high-performing language model. Regarding downstream tasks, BERT can be fine-tuned by using the encoded transformer tokens (green tokens in Fig. 1(a)), which are extracted for every input token and include a special starter symbol called the [CLS] token (Devlin et al., 2018). These encoded tokens contain their distributional contextual representation in the sentence, and the choice of which ones are used depends on the task at hand. For the text classification task, the [CLS] token is used as a feature vector representing the entire sentence, which can then be classified with a simple feedforward classifier, as shown in Eq. (2):

$$p(c \mid T_{[CLS]}) = \mathrm{softmax}(W\, T_{[CLS]}) \tag{2}$$

where $p$ is the probability of class $c$ given the [CLS] token $T_{[CLS]}$ and $W$ is the trainable weight matrix. In this classic approach, the remaining tokens are discarded. Although we concede that the [CLS] token might be a good single representation of the input for clean data, it lacks the same representation efficiency in noisy data. We hypothesize that the remaining tokens can be used to alleviate this issue by serving as a source of additional information to improve the model's robustness and performance.
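As a rough illustration of the scaled dot-product attention in Eq. (1) and the [CLS] classifier in Eq. (2), the following pure-Python sketch may help. It assumes nested lists in place of tensors and a softmax classifier; the helper names (`softmax`, `scaled_dot_product_attention`, `classify_cls`) are hypothetical and not taken from the released code.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Q is n_q x d_k, K is n_k x d_k, V is n_k x d_v (lists of lists)."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return outputs


def classify_cls(t_cls, W):
    """Assumed softmax classifier on the [CLS] vector; W has one row per class."""
    logits = [sum(w * t for w, t in zip(row, t_cls)) for row in W]
    return softmax(logits)
```

In self-attention the same token matrix is passed as `Q`, `K`, and `V`, for example `scaled_dot_product_attention(X, X, X)`.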

EmbraceNet for Incomplete Data
EmbraceNet (Choi and Lee, 2019), shown in Fig. 1(b), is a model proposed to improve robustness in classification tasks involving incomplete multimodal data. The authors assume that each modality $M_n$, $n \in \{1, 2, \dots, N\}$, is represented with features $x_n$ of different sizes, so their first measure is to project those features into vectors $d_n$ of the same length through a docking layer, sometimes also called an adapter or projection layer. Following that layer is an embracement layer, which randomly selects important features from each feature vector $d_n$ and adds them in order to obtain an embraced vector $e$ that can be used for data classification. The authors show good performance using the embraced vector in classification tasks with partially missing data in bi-modal MNIST, sensor, and activity recognition datasets. However, there are a few limitations of EmbraceNet that we address in this work. First, since the probability $p$ depends on the number of modalities available, the user must indicate which data are missing and adjust $p$ accordingly. Second, the selection probability $p$ is the same for every modality, whereas we believe some might be more important than others. Lastly, the docking layer is redundant for our purpose, since the considered tokens have the same size.
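The conventional embracement step described above can be sketched as follows. This is a minimal illustration, assuming already-docked vectors of equal length and a uniform probability over the available modalities; the function names `uniform_probs` and `embrace` are hypothetical. It also makes the first limitation visible: the caller has to supply the availability flags by hand.

```python
import random


def uniform_probs(available):
    """Equal selection probability for available modalities, zero for missing ones."""
    count = sum(1 for a in available if a)
    if count == 0:
        raise ValueError("at least one modality must be available")
    return [1.0 / count if a else 0.0 for a in available]


def embrace(vectors, probs, rng):
    """For each feature index, copy that component from one randomly chosen vector.

    This is equivalent to drawing a one-hot selection per index, masking each
    vector with its selections, and summing the masked vectors.
    """
    indices = range(len(vectors))
    embraced = []
    for i in range(len(vectors[0])):
        n = rng.choices(indices, weights=probs, k=1)[0]
        embraced.append(vectors[n][i])
    return embraced
```

For instance, with three docked vectors of which the second is missing, `embrace(vectors, uniform_probs([True, False, True]), random.Random(0))` only ever copies components from the first and third vectors.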

Proposed Model
In this section, we propose the novel EBERT model for the task of noisy ASR text classification. As the name suggests, our model is based on BERT, and as such, its first layers are transformer layers that encode the input embeddings $E$ into tokens $T$. Conventionally, when fine-tuning BERT for downstream classification tasks, only the [CLS] starter token is used, with the remaining tokens $T_i$, $\forall i \in \{1, \dots, N\}$, being discarded. Although we concede that the [CLS] token might be a good single representation of the input for clean data, it lacks the same representation efficiency in noisy data.
We tackle this issue by proposing the use of all encoded tokens as a source of additional information to improve the model's robustness and performance. Now, instead of a single vector that can directly be used for classification, we are faced with multiple ones. Our solution, shown in Fig. 2, is to introduce a multi-head attention layer with $Q$, $K$, $V$ obtained from $T$ to extract additional important information $d$, followed by a novel attentive embracement layer to obtain the embraced vector $e$. Note that we consider each token $T$, which can be missing or incorrect, as a different modality in order to use the embracement layer. The embraced vector is then given to a projection layer together with the [CLS] token, resulting in a single robust representation vector $T_C$ that can be used for classification. The projection layer is given in Eq. (3), where $W_{proj}$ is the trainable projection layer weight matrix. This token can then be classified with a simple feedforward classifier, as shown in Eq. (4):

$$p(c \mid T_C) = \mathrm{softmax}(W_C\, T_C) \tag{4}$$

where $p$ is the probability of class $c$ given the final classification token $T_C$ and $W_C$ is the trainable classification layer weight matrix.
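The last two stages (projection, then classification) could look roughly like the sketch below. It rests on two assumptions that the text does not confirm: that the projection layer is a linear map applied to the concatenation of $e$ and $T_{[CLS]}$, and that the classifier is a softmax over a linear layer. All function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def linear(W, x):
    """Matrix-vector product; W is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def project_and_classify(e, t_cls, W_proj, W_c):
    """Return the final token T_C and the class probabilities.

    ASSUMPTION: the embraced vector and the [CLS] token are concatenated
    before the projection; other ways of combining them are possible.
    """
    t_c = linear(W_proj, e + t_cls)      # list concatenation, then projection
    probs = softmax(linear(W_c, t_c))    # assumed softmax classifier
    return t_c, probs
```

Here `W_proj` needs as many columns as `len(e) + len(t_cls)`, and `W_c` has one row per class.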

Figure 2: Proposed EBERT model: hierarchical structure of transformer layers and multi-head attention followed by a novel attentive embracement layer to obtain the embraced vector e from tokens T. The embraced vector is then given to a projection layer together with the [CLS] token, resulting in a single robust representation vector T_C that can be used for classification.
In the conventional embracement layer (Fig. 3(a)), the embraced vector $e = [e^{(1)}, e^{(2)}, \dots, e^{(l)}]$ is obtained by first drawing a vector $r^{(i)}$ from a multinomial distribution, as in Eq. (5):

$$r^{(i)} \sim \mathrm{Multinomial}(1, p) \tag{5}$$

where $i$ denotes the $i$-th component of the vectors $e$, $r$, $d$, and $d'$, which all have the same length, and $p$ holds the probability of each token being chosen to compose $e^{(i)}$, as in Eq. (6):

$$p = [p_1, p_2, \dots, p_N]^\top \tag{6}$$

with $\sum_n p_n = 1$. Now, assume that each vector has length $l$ and that the sampling vector is $r_n = [r_n^{(1)}, r_n^{(2)}, \dots, r_n^{(l)}]$. In order to obtain $d'_n$, the element-wise product is calculated as in Eq. (7):

$$d'_n = r_n \circ d_n \tag{7}$$

Finally, the embraced vector is obtained as in Eq. (8):

$$e = \sum_{n=1}^{N} d'_n \tag{8}$$

In the proposed attentive embracement layer (Fig. 3(b)), the embraced vector $e = [e^{(1)}, e^{(2)}, \dots, e^{(l)}]$ is obtained in a similar manner, except for the probability $p$ of each token being chosen to compose $e^{(i)}$. In the conventional method, every token has the same probability of being selected, so the selection is entirely random. This approach has two limitations. First, in order to know the probability of each token, the user must indicate which data are missing and adjust $p$ accordingly. Second, not all features in the data have similar importance, and this needs to be addressed for a more robust account of the data. We tackle both issues by introducing an attention layer that obtains the importance of each token $T_i$, $i \in \{1, \dots, N\}$, compared to the $T_{[CLS]}$ token, to be used in the selection process (Eq. (9)), where $T_{[CLS]}$ is the query and $T_i$, $\forall i \in \{1, \dots, N\}$, is the context component in this attention module.
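To make the difference from the conventional layer concrete, here is a small sketch of an attentive selection step. The exact attention form is an assumption: it uses a scaled dot product between the [CLS] token (query) and each token (context), normalized with a softmax, which is only one plausible reading of the description. The names `attention_probs` and `attentive_embrace` are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_probs(t_cls, tokens):
    """ASSUMED form: softmax of scaled dot products between [CLS] and each token."""
    scale = math.sqrt(len(t_cls))
    scores = [sum(q * k for q, k in zip(t_cls, tok)) / scale for tok in tokens]
    return softmax(scores)


def attentive_embrace(t_cls, tokens, features, rng):
    """Sample each component of e from the feature vectors d_n using attention-based p.

    `tokens` are the encoded tokens T_i used to score importance, and
    `features` are the vectors d_n whose components are embraced.
    """
    probs = attention_probs(t_cls, tokens)
    indices = range(len(features))
    embraced = []
    for i in range(len(features[0])):
        n = rng.choices(indices, weights=probs, k=1)[0]
        embraced.append(features[n][i])
    return embraced, probs
```

Because the probabilities are computed from the tokens themselves, no manual flag for missing data is needed, and tokens that align with [CLS] are sampled more often.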

Dataset
The dataset used to evaluate the model's performance is the Chatbot NLU Evaluation Corpus for intent classification, introduced by Braun et al. (2017) to test NLU services. It is a publicly available² benchmark composed of sentences obtained from a German Telegram chatbot used to answer questions about public transport connections. The dataset has two intents, namely Departure Time and Find Connection, with 100 train and 106 test samples³. Even though English is the main language of the benchmark, the dataset contains a few German station and street names. The original dataset contains clean data, so in order to include ASR noise, we apply a text-to-speech (TTS) module followed by a speech-to-text (STT) module to that data. This process is shown in Fig. 4 and is efficient because the available TTS and STT modules are imperfect, resulting in sentences with reasonable levels of ASR noise. The TTS module is called macsay⁴, named after the terminal command say used to convert text to speech on the Mac OS platform. We obtain two datasets with ASR noise by applying two distinct STT modules after the TTS module: witai⁵, freely available and maintained by Wit.ai, and sphinx⁶, an open-source Python functionality in the CMU Sphinx speech recognition engine. The TTS and STT modules are chosen according to code availability and whether they are freely available or have high daily usage limits. The noise level in the sentences is measured quantitatively with the Word Error Rate (WER) metric, a common metric used to evaluate ASR systems, where lower scores mean lower noise levels⁷.

Figure 4: Diagram of the 2-step process to obtain text with ASR error from clean data.
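The two-step noising process can be expressed as a simple round trip. The sketch below is only schematic: `add_asr_noise` is a hypothetical helper that takes the TTS and STT steps as callables, and `fake_tts` and `fake_stt` are toy stand-ins, not the macsay, witai, or sphinx modules used in the paper.

```python
def add_asr_noise(sentences, tts, stt):
    """Round-trip each clean sentence through TTS and then STT.

    `tts` maps text to audio and `stt` maps audio back to text; real
    backends would be plugged in here.
    """
    return [stt(tts(sentence)) for sentence in sentences]


# Toy stand-ins so the sketch runs without any audio software.
def fake_tts(text):
    """Pretend 'audio' is just the UTF-8 bytes of the text."""
    return text.encode("utf-8")


def fake_stt(audio):
    """Pretend recognizer that lowercases and mishears one German place name."""
    words = audio.decode("utf-8").lower().split()
    return " ".join("maryland" if w == "marienplatz" else w for w in words)
```

With the stand-ins, `add_asr_noise(["how to get to Marienplatz"], fake_tts, fake_stt)` yields a lowercased sentence in which the station name has been replaced, mimicking the kind of error an imperfect recognizer introduces.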

Training Specifications
The proposed model is fine-tuned in an end-to-end manner from the pre-trained weights of BERT_BASE (uncased, 12 transformer blocks, hidden size of 768, and 12 self-attention heads). The model is fine-tuned on a Titan X GPU for 100 epochs with the Adam optimizer, a learning rate of $2 \times 10^{-5}$, a maximum sequence length of 128, and a batch size of 8. The baseline model is BERT_BASE, with 109.5M parameters, and the proposed model has between 109.5M and 117.2M parameters (see Table 2 for more detailed information). Each model is run 10 times, with the results reported as mean and standard deviation.
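For reference, the hyperparameters above can be collected in a small configuration object, together with a helper for the mean and standard deviation over runs. This is a sketch only: the class and function names are hypothetical, the checkpoint string `bert-base-uncased` is an assumed identifier, and the use of the sample standard deviation is an assumption, since the text does not say which variant is reported.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass(frozen=True)
class FineTuneConfig:
    """Fine-tuning settings as listed in the text."""
    pretrained_model: str = "bert-base-uncased"  # assumed checkpoint name
    epochs: int = 100
    learning_rate: float = 2e-5
    max_seq_length: int = 128
    batch_size: int = 8
    num_runs: int = 10


def summarize_runs(f1_scores):
    """Return (mean, sample standard deviation) of per-run F1-scores."""
    return mean(f1_scores), stdev(f1_scores)
```

A larger batch size, as used for the Snips experiments, would be expressed as `FineTuneConfig(batch_size=32)`.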

Results
For the performance evaluation, we calculate the F1-scores for all models fine-tuned on the Chatbot corpus for intent classification in three settings: (1) trained and tested with clean data, (2) trained with clean and tested with noisy data, and (3) trained and tested with noisy data (witai or sphinx STT noise). Results are shown in Table 1, ordered from lower to higher noise. For a fair comparison, we consider two baseline models: BERT using only the [CLS] token for classification and BERT using all tokens, with structures shown in Fig. 5. Results show improved performance with BERT using all tokens in settings containing noisy text, with the biggest improvement in setting 2.
We consider a few combinations of the proposed model, which are grouped into the conventional embracement layer approach or the novel attentive approach, with an expanded analysis in the following section. Results in Table 1 show that our model outperforms the baseline models in all settings. In settings 1 and 2, the best model is the one with conventional embracement and multi-head attention layers, with larger improvements on datasets with higher noise levels. From lower to higher noise, clean data shows an improvement of 0.19 points, 'witai' (WER of 3.11) an improvement of 2.83 points, and 'sphinx' (WER of 6.58) an improvement of 3.30 points. In setting 3, the model with attentive embracement and multi-head attention layers achieves the best performance, with 'witai' showing an improvement of 1.70 points and 'sphinx' 1.42 points. We hypothesize that the conventional embracement obtains better results in setting 2 because it retains more general characteristics from the data by giving equal weights to all tokens, regardless of their importance. In the attentive embracement approach, by contrast, if a token deemed important is absent during testing, this might hinder performance. On the other hand, attentive embracement shows the strength of token differentiation in setting 3 and its ability to extract a robust representation of noisy inputs for improved text classification.

Table 1: [...] We also group the models into conventional and attentive embracement for more clarity.

⁴ https://ss64.com/osx/say.html
⁵ https://wit.ai
⁶ https://cmusphinx.github.io/wiki/
⁷ More details on the WER metric and examples from the dataset with ASR noise can be found in the Appendix.

Ablation Study
In this section, we provide an extensive ablation study on the models that use all tokens for text classification. Fig. 5 shows the different architectures that let BERT use all tokens in the classification task. The simplest architecture, Fig. 5(a), consists of adding a dimensionality reduction layer, either attention or projection, to obtain a single token for classification. However, the limitation of that approach is that it considers the [CLS] token as having the same importance as the remaining tokens. Since this token is used by itself in the conventional BERT classification task, it is fair to assume that the starter token holds very meaningful features and should be held in higher regard. Thus, the second approach in Fig. 5 [...] Table 2 shows that the approach in Fig. 5(c) outperforms the other combinations in all settings. We hypothesize that by extracting important features from the remaining tokens, when compared to the starter token, the model is able to retain meaningful information that would otherwise have been discarded.
Fig. 6 shows the different architectures used in the ablation study of EBERT. Here, we investigate the importance of the conventional and attentive embracement layers, the multi-head attention layer, and different types of dimensionality reduction methods. We also evaluate the architecture that concatenates the intermediate feature vector obtained from the attention layer and the one obtained from the embracement layer. The architecture in Fig. 6(a) extracts intermediate information from a multi-head attention layer and a conventional embracement layer, followed by a dimensionality reduction layer, which can be either projection or attention. In Fig. 6(b), we concatenate the intermediate feature vector obtained with the best-performing architecture in Fig. 5(c) and the intermediate feature vector obtained from a multi-head attention layer followed by a conventional embracement layer. Figs. 6(c) and (d) are similar to (a) and (b), respectively, except that the embracement layer follows an attentive approach.

Figure 5: [...] (panel labels: Attention or Projection Layer; BERT+att; BERT+proj)

Figure 6: Different architectures used in the ablation study of EBERT: (a) multi-head attention and conventional embracement layer followed by an attention (EBERT+att) or projection (EBERT+proj) layer for dimensionality reduction, (b) concatenation of (a) and the best performing architecture in Fig. 5(c) (EBERTconcatt+att or EBERTconcatt+proj), (c) and (d) are the same as (a) and (b), respectively, but with an attentive embracement layer. Note that the models used in the ablation study for EBERT without the multi-head attention layer (EBERT(no QKV)) are essentially the same except for the absence of the mentioned layer.

Table 2 shows that EBERT, using the architecture in Fig. 6(d) with an attentive embracement layer and a projection layer for dimensionality reduction, outperforms all of the models in the third setting. In setting 1, the best performance is obtained by the same model with a conventional embracement layer. Lastly, in setting 2, EBERT with conventional embracement and a projection layer, Fig. 6(a), achieves the best scores. This study shows the importance of combining the embracement and multi-head attention layers to achieve high performance in settings with clean and noisy data. This is especially true in settings with noisy data in both training and testing. Furthermore, Table 2 also shows that this increase in performance is achieved with just a slight increase in computational complexity, measured in the number of parameters.

Table 2: [...] (1) trained and tested with clean data, (2) trained with clean and tested with noisy data, and (3) trained and tested with noisy data (witai or sphinx STT noise). Results are shown in terms of mean and standard deviation over 10 runs. Note that the BERT baseline model only uses the [CLS] token for text classification, whereas the remaining models use all tokens including [CLS]. We also group the models into conventional and attentive embracement for more clarity.

Study on a larger dataset
As part of our ablation study, we also provide detailed evaluation results on the Snips NLU Corpus (Coucke et al., 2018)⁸, a large dataset containing 15K crowd-sourced queries with a total of 7 intents, namely Add To Playlist, Book Restaurant, Get Weather, Play Music, Rate Book, Search Creative Work, and Search Screening Event, totaling 13,784 train and 700 test samples⁹. The noisy version of this dataset includes ASR error added to the data with the two-step process previously shown in Fig. 4. We train the same models listed in Table 2 on the Snips dataset with the training specifications discussed in the Training Specifications section. The only modification is that we use a larger batch size of 32 for faster training. The ablation study in Table 3 shows that our proposed approach of using all tokens for text classification results in better performance in all three settings when trained on the larger dataset. In setting 1, all but one EBERT variant outperform the baseline model. We also evaluate the models' performance on noisy data, with the results showing improvements of +0.47 for 'witai' and +2.13 for 'sphinx' on unseen noisy data (setting 2), and an improved classification result when trained with noisy data (setting 3), albeit with a smaller improvement margin.

Conclusion
We proposed a novel BERT-based model, called EBERT, that improves robustness in the task of noisy ASR text classification. Conventionally, when fine-tuning BERT for downstream classification tasks, only the [CLS] starter token is used, with the remaining tokens being discarded. Our model used those otherwise discarded tokens as a source of additional information, together with a multi-head attention layer and an attentive embracement layer, to more efficiently extract robust latent representations from noisy data. We evaluated our model in three settings: trained and tested with clean data, trained with clean and tested with noisy data, and trained and tested with noisy data. Results on the Chatbot corpus for intent classification, reported as F1-scores averaged over 10 runs, showed that our model outperforms the baseline models in all settings. In settings 1 and 2, the best model was the one with conventional embracement and multi-head attention layers, with larger improvements on datasets with higher noise levels. In setting 3, the model with attentive embracement and multi-head attention layers achieved the best performance. We hypothesize that conventional embracement obtained better results in setting 2 because it retains more general characteristics from the data by giving equal weights to all tokens, regardless of their importance. On the other hand, attentive embracement showed the importance of token differentiation in setting 3 and its ability to extract a robust representation of noisy inputs. This increase in performance was achieved with just a slight increase in computational complexity, measured in the number of parameters.

Table 3: [...] Table 2 but on a larger dataset, so for more details on this study, please check the caption in Table 2.
We additionally provided an extensive ablation study showing the importance of combining the embracement and multi-head attention layers to achieve high performance in settings with both clean and noisy data, and we also reinforced the importance of our approach on a larger dataset. In the future, we plan to evaluate our approach with other BERT variations and to consider the effect of using clean and noisy data together in the training stage.

A Datasets
A.1 Chatbot NLU Evaluation Corpus
Table 4 shows the dataset distribution between classes and train/test splits on the Chatbot NLU Evaluation Corpus (Braun et al., 2017), and Table 5 shows some examples of clean sentences and their respective noisy versions obtained with different TTS-STT combinations, and therefore with varying rates of noise.

Clean sentence | STT | WER | Noisy sentence
-------------- | --- | --- | --------------
[...] | witai | 3.11 | "how can i get from quiddestraße."
[...] | sphinx | 6.58 | "can i get from los at lives three."
"how to get from alte heide to marienplatz" | witai | 3.11 | "how to get from altona to maryland."
"how to get from alte heide to marienplatz" | sphinx | 6.58 | "call now from outside to memory and flaps."
"next bus from central station" | witai | 3.11 | "next bus from central station."
"next bus from central station" | sphinx | 6.58 | "there are strong central station."

Table 5: Examples of sentences from the Chatbot NLU Corpus with different TTS(macsay)-STT(witai, sphinx) combinations and their respective WER scores, which denote the level of ASR noise in the text. The datasets are shown in order of lower to higher noise.

A.2 Snips NLU Corpus
Table 6 shows the dataset distribution between classes and train/test splits on the Snips NLU Corpus (Coucke et al., 2018), and Table 7 shows some examples of clean sentences and their respective noisy versions obtained with different TTS-STT combinations.

Clean sentence | STT | WER | Noisy sentence
-------------- | --- | --- | --------------
[...] | witai | 2.66 | "what are the movie times."
[...] | sphinx | 7.24 | "we're tarzan movie times."
"will Custer National Forest be chillier at seven Pm" | witai | 2.66 | "will there national forest be chillier at seven pm."
"will Custer National Forest be chillier at seven Pm" | sphinx | 7.24 | "will cause a national far is still you're upset him."
"Book a reservation for an oyster bar" | witai | 2.66 | "reservation for an oyster bar."
"Book a reservation for an oyster bar" | sphinx | 7.24 | "for reservations for an oyster bar."

Table 7: Examples of sentences from the Snips NLU Corpus with different TTS(macsay)-STT(witai, sphinx) combinations and their respective WER scores, which denote the level of ASR noise in the text. The datasets are shown in order of lower to higher noise.

B WER metric
The noise level in the sentences is measured quantitatively with the WER metric, a common metric used to evaluate ASR systems, where lower scores mean lower noise levels. This metric calculates the minimum number of edits required to change a candidate sentence into a reference sentence. Mathematically speaking, WER considers word substitutions $S$, deletions $D$, and insertions $I$ relative to the total number of words $N$ in the reference, as shown in Eq. (10):

$$\mathrm{WER} = \frac{S + D + I}{N} = \frac{S + D + I}{S + D + C} \tag{10}$$

where $C$ is the number of correct words. The lower the WER, the better, since it means fewer edits are needed to convert the sentence to the gold standard. The lowest possible value of this metric is 0, and there is no ceiling on its maximum value.
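The metric can be computed with a word-level edit distance. The sketch below is one straightforward way to do so, assuming whitespace tokenization and no text normalization (casing and punctuation are compared as-is); the function name `wer` is hypothetical and this is not necessarily the tool used to obtain the scores reported here.

```python
def wer(reference, hypothesis):
    """Word error rate: minimum word edits (S + D + I) divided by reference length N."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For example, `wer("next bus from central station", "there are strong central station")` gives 0.6 (three substitutions over five reference words), and a hypothesis much longer than the reference can push the value above 1, which is why the metric has no upper bound.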