Improving Natural Language Understanding by Reverse Mapping Bytepair Encoding

We propose a method called reverse mapping bytepair encoding, which maps named-entity information and other word-level linguistic features back to subwords during the encoding procedure of bytepair encoding (BPE). We apply this method to the Generative Pre-trained Transformer (OpenAI GPT) by adding a weighted linear layer after the embedding layer. We also propose a new model architecture, the multi-channel separate transformer, which trains wordpieces and additional features without parameter sharing. Evaluation on the Story Cloze, RTE, SciTail and SST-2 datasets demonstrates the effectiveness of our approach.


Introduction
Recently, language models have been widely used as feature extractors for many NLU tasks. Statistical language models learn the joint probability function of sequences of words in a language (Bengio et al., 2003). Training on large corpora from different domains gives LMs the ability to generalize and enables them to capture latent or deep semantics. Normally, statistical language models maximize the following objective:
$$\sum_{i} \log P(w_i \mid w_{i-k}, \ldots, w_{i-1}; \Theta)$$
where $w_i$ is the i-th word of the corpus, k is the size of the context window, and the conditional probability P is modeled using a neural network with parameters Θ.
However, this formal simplicity means that they cannot deal well with ambiguous words or low-frequency words. Low-frequency words may appear only once or a few times in the whole corpus, so their embeddings may overfit or underfit the corpus. In practice, we usually remove them from the vocabulary and replace them with an "UNK" label, which partially loses their meaning. For example, in "15 million tonnes of rubbish are produced daily in Cairo.", the number "15" can hardly be trained to a proper representation even if it is replaced by a specific label representing numbers. Besides, out-of-vocabulary (OOV) words are a major problem at prediction time. Several word segmentation techniques have been proposed (Su and Lee, 2017) to solve these problems.
BPE has been proposed to handle these problems as well, and it has proved highly effective in many works (Radford et al., 2018; Devlin et al., 2018). However, it was originally designed to handle the open-vocabulary problem in machine translation, based on the intuition that various word classes are translatable via units smaller than words, for instance names (via character copying or transliteration), compounds (via compositional translation), and cognates and loanwords (via phonological and morphological transformations). It therefore gives more weight to morphological and compounding characteristics than to semantic ones. The unique meanings of some proper nouns are lost when it is simply applied to NLU tasks. Consider the following textual entailment example: Example 1. In this example, proper nouns play an important role, but they are divided into wordpieces shared with other words; in particular, they share common pieces such as "i" and "o". Inferring from the encoded sequence can therefore be quite difficult.
Apart from this, such a method pays too little attention to key words and phrases. Here is another example:
Example 2.
t. Cairo is now home to some 15 million people - a burgeoning population that produces approximately 10,000 tonnes of rubbish per day, putting an enormous strain on public services...
h. 15 million tonnes of rubbish are produced daily in Cairo.
t→h: not entailment
In this case, "15 million tonnes" is the key phrase, while "15", "million" and "tonnes" also appear separately in the context. Word-level or subword-level information alone is clearly not enough.
To alleviate these issues, we propose a reverse mapping bytepair encoding method that integrates prior knowledge into subwords. The prior knowledge considered here includes named-entity information, part-of-speech (POS) tags and dependency parsing labels. Our method has two forms, both of which modify the encoding procedure of BPE. In the conventional form, we first process the target sentence with an NER tagger, a POS tagger and a dependency parser. We then segment the sentence with the original BPE algorithm. Note that the taggers and the parser work at the word level, while BPE works at the wordpiece level. Finally, we encode every wordpiece as a combination of itself and the linguistic features of its parent word. To avoid over-reliance on the performance of external tools and the resulting error propagation, we modify this form into named-entity phrase based reverse mapping, which adds named-entity phrases to the vocabulary while scanning the whole corpus and encodes a wordpiece as the combination of itself and the named-entity phrase it comes from. We evaluate our method on a set of NLU tasks by applying it to GPT. More specifically, we add a weighted layer after the embedding layer to obtain different weighted combinations of the inputs. We also propose a new model architecture, the multi-channel separate transformer, which trains wordpieces and additional features without parameter sharing. We evaluate our approach on two natural language inference tasks (RTE, SciTail), a question answering task (Story Cloze Test) and a classification task (SST-2), showing the benefits of our approach.

Related Work
Improving natural language understanding requires better techniques for modeling natural language. Many researchers have worked on better capturing the semantic and morphological information of word vectors (Mikolov et al., 2013a,b; Huang et al., 2012; Levy and Goldberg, 2014b; Pennington et al., 2014). Utilizing internal information has been widely studied, and most of these works exploit structural information between words and smaller units (Chen et al., 2015; Iacobacci et al., 2015; Yu et al., 2017; Xu et al., 2018). Another related research direction is to use external knowledge. Researchers at Microsoft (Song et al., 2011) applied a large and rich probabilistic knowledge base to machine learning algorithms and obtained a significant improvement in tweet clustering accuracy. However, such a method needs huge human and material resources to build a high-quality knowledge base with extremely wide coverage. Recently, a novel language representation model called ERNIE (Zhang et al., 2019) has been proposed. ERNIE introduces knowledge-related tasks in the pre-training process and uses graph embedding methods to obtain embedded representations of entities. By comparison, our method does not rely on external knowledge bases, which can be a double-edged sword. In addition, our method integrates syntactic information and allows principled integration of named-entity information in an easier way.
There is also a class of methods that, instead of relying heavily on external knowledge, takes advantage of the linguistic features that exist in natural language. Levy (Levy and Goldberg, 2014a) generalized the skip-gram model to include arbitrary contexts obtained by dependency parsing. The dependency-based embeddings are less topical and exhibit more functional similarity than the original skip-gram embeddings. Multimodal representations of Chinese characters (Su and Lee, 2017) have also been studied, and that research showed the effectiveness of glyph features in some cases. Sennrich proposed an approach that employs linguistic features for neural machine translation (NMT). These features include lemmas, subword tags, morphological features, POS tags and dependency labels. Unlike us, they focused on improving NMT at the morphological level, and they did not consider named entities. Another work relevant to ours is (Nallapati et al., 2016), which proposed a feature-rich encoder to capture keywords in summarization. Their model employs a word-level vocabulary, and the words are embedded by concatenating several kinds of features, including POS tags, NER tags and discretized TF and IDF values. In contrast, our approach works at the subword level and leverages linguistic features at the semantic level.
There are many kinds of neural networks that can deal with NLU tasks. Recently, pre-trained language models or language representation models have been widely used and have brought significant improvements (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2018). Training on large corpora inevitably faces the large-vocabulary problem and the OOV problem, so researchers have proposed several methods to handle them. FastText employed n-grams and could therefore predict embeddings for unseen (zero-shot) words. However, n-gram splitting is an arbitrary method of word segmentation. Sennrich adapted the original byte pair encoding (BPE) compression algorithm to NMT with significant success. This method iteratively merges the most frequent adjacent characters or character sequences in a word, following the learned token ranks, until no further merge is possible. It can therefore encode any word with a pre-learned token vocabulary. Figure 1 illustrates the ranking process of BPE. GPT (Radford et al., 2018) applied BPE in its training process and achieved impressive results on a series of NLU tasks. Due to the greedy nature of the BPE algorithm, its results are deterministic, which leads to insufficient robustness in machine translation. Kudo (Kudo, 2018) identified this problem and proposed a subword regularization method to handle it. He trained the NMT model with multiple subword segmentations sampled probabilistically and obtained improvements, especially in low-resource and out-of-domain settings. This study provides an interesting insight, but we argue that it may not bring significant improvement in NLU tasks. In fact, BPE was proposed to model open-vocabulary translation in NMT, which is inconsistent with the goal of some natural language understanding tasks. That motivates our work.

Figure 1: Illustration of the BPE compression process. The byte pair "aa" occurs most often, so it is replaced by "Z", a byte that does not appear in the corpus. The final sequence is XdXac with X=ZY, Y=ab, Z=aa; it cannot be compressed further because no pair of bytes occurs more than once.
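The compression procedure illustrated by Figure 1 can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the input string "aaabdaaabac" is assumed (it is the classic example that yields the rules shown above and is not given in the text), the replacement symbols are assumed not to occur in the input, and ties between equally frequent pairs are broken lexicographically, which is an arbitrary choice made here so that the rules match the figure.

```python
from collections import Counter


def byte_pair_compress(data, symbols="ZYXWVU"):
    """Repeatedly replace the most frequent adjacent pair with an unused symbol.

    `symbols` is a hypothetical pool of characters assumed absent from `data`.
    Pair counts include overlapping occurrences, which is enough for a sketch.
    """
    rules = {}
    for symbol in symbols:
        pairs = Counter(data[i:i + 2] for i in range(len(data) - 1))
        if not pairs:
            break
        # Tie-break on the pair itself (assumption, see lead-in).
        best = max(pairs, key=lambda p: (pairs[p], p))
        if pairs[best] < 2:
            break  # no pair occurs more than once: nothing left to compress
        data = data.replace(best, symbol)
        rules[symbol] = best
    return data, rules


def byte_pair_expand(data, rules):
    """Undo the replacements, newest rule first."""
    for symbol, pair in reversed(list(rules.items())):
        data = data.replace(symbol, pair)
    return data


compressed, rules = byte_pair_compress("aaabdaaabac")
restored = byte_pair_expand(compressed, rules)
```

The BPE variant used for subword segmentation applies the same idea inside words, learning an ordered list of merges on a training corpus and replaying it at encoding time.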

Methods
In this section, we introduce our reverse mapping bytepair encoding method and the procedure for applying it to GPT. Reverse mapping bytepair encoding has two forms: label based (LB-BPE) and named-entity phrase based (NPB-BPE). Figure 2 provides a visual illustration. We apply them in the fine-tuning process of GPT by adding a weighted layer after the embedding layer. In addition, we propose a multi-channel separate transformer (MCST) to evaluate the utility of the introduced features and to give some insight into the semantic capture capabilities of the transformer. The model architecture is shown in Figure 3.

Label Based Reverse Mapping Bytepair Encoding
We extend a token t generated by BPE to four parts {t, t_pos, t_ner, t_dep}, where t is the original token encoded by BPE, t_pos is the POS tag of the word that generates this token, and t_ner and t_dep are defined similarly. Not all words carry named-entity labels, so we tag tokens without NER labels as "NaN". We regard t_pos, t_ner and t_dep as additional features that cannot be fully captured by language models. The tag schemes come from the spaCy Annotation Specifications. We use spaCy to tokenize, tag and parse the datasets in our experiments. We use only the dependency labels instead of the whole dependency parse tree for two reasons: first, the tree would bring complex changes to the input architecture; second, we assume the transformer architecture is able to establish patterns that capture the dependencies between tokens generated by BPE.

Named-entity Phrase Based Bytepair Encoding
As mentioned in the Introduction, some key words are low-frequency. For example, the person name "Kevin Federline" can be encoded as "kevin</w>", "feder" and "line</w>", and its meaning is changed. Passing the key word information to tokens is a simple and feasible way to solve this problem. There are many ways of deciding which words are key words, so the problem can be defined as follows:
Definition 1. Find an algorithm A such that, for all k ∈ S, A(k) = 1 if k is a key word and A(k) = 0 otherwise,
where S represents a sentence. This is a binary classification problem. In practice, these key words are usually people's names, organizations or other types of named entities. Therefore, we choose an NER algorithm as algorithm A and modify the encoding process of the BPE algorithm, as shown in Algorithm 1, to pass the entity attribute to tokens. By reverse mapping the named-entity words or phrases that appear in the corpus back to BPE tokens, key information can be preserved. Some kinds of named entities (e.g., DATE, TIME, PERCENT, MONEY, QUANTITY, ORDINAL) contain numbers and usually express general attributes, such as "22-year-old". In practice, we add a switch to control whether these words are processed with NPB-BPE.
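The named-entity phrase based encoding described above might look roughly like the following sketch. It is not the authors' Algorithm 1, which is not reproduced in this text. The toy segmentation table, the function names, the span format (start, end, label) and the use of "NaN" for pieces outside any entity are all assumptions made for illustration; in practice the spans would come from an NER tool and the pieces from a trained BPE model.

```python
# Entity types that the switch described in the text can filter out.
NUMERIC_TYPES = {"DATE", "TIME", "PERCENT", "MONEY", "QUANTITY", "ORDINAL"}

# Hypothetical stand-in for a trained BPE model.
TOY_BPE = {"kevin": ["kevin</w>"], "federline": ["feder", "line</w>"]}


def toy_bpe(word):
    """Segment a word; unknown words fall back to single characters."""
    word = word.lower()
    if word in TOY_BPE:
        return TOY_BPE[word]
    return list(word[:-1]) + [word[-1] + "</w>"]


def npb_bpe_encode(words, entities, phrase_vocab, skip_numeric=True):
    """Pair every wordpiece with the named-entity phrase it comes from.

    `entities` holds (start, end, label) word spans, end exclusive.
    New phrases are added to `phrase_vocab` while scanning.
    """
    phrase_of = [None] * len(words)
    for start, end, label in entities:
        if skip_numeric and label in NUMERIC_TYPES:
            continue  # the switch: leave numeric entities to plain BPE
        phrase = " ".join(words[start:end]).lower()
        phrase_vocab.setdefault(phrase, len(phrase_vocab))
        for i in range(start, end):
            phrase_of[i] = phrase
    encoded = []
    for word, phrase in zip(words, phrase_of):
        for piece in toy_bpe(word):
            encoded.append((piece, phrase or "NaN"))
    return encoded


words = ["Kevin", "Federline", "is", "22-year-old"]
entities = [(0, 2, "PERSON"), (3, 4, "DATE")]
vocab = {}
npb_tokens = npb_bpe_encode(words, entities, vocab)
```

With the switch on, the three pieces of "Kevin Federline" share one phrase entry, while "22-year-old" is left to ordinary BPE.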
The main difference between LB-BPE and NPB-BPE is that LB-BPE makes straightforward use of additional linguistic information that cannot easily be captured by statistical language models in a specific language, while NPB-BPE provides a way to share important concepts within a corpus.
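For comparison, the label based form can be sketched as follows: every wordpiece inherits the POS, NER and dependency labels of its parent word, with "NaN" for words that carry no named-entity label. The tags below are written by hand in the style of spaCy output, and the toy segmentation table and function names are hypothetical; this is an illustrative sketch, not the authors' implementation.

```python
# Hypothetical stand-in for a trained BPE model.
TOY_BPE = {
    "kevin": ["kevin</w>"],
    "federline": ["feder", "line</w>"],
    "sings": ["sing", "s</w>"],
}


def toy_bpe(word):
    """Segment a word; unknown words fall back to single characters."""
    word = word.lower()
    if word in TOY_BPE:
        return TOY_BPE[word]
    return list(word[:-1]) + [word[-1] + "</w>"]


def lb_bpe_encode(tagged_words):
    """Expand each (word, pos, ner, dep) into (t, t_pos, t_ner, t_dep) tuples."""
    encoded = []
    for word, pos, ner, dep in tagged_words:
        for piece in toy_bpe(word):
            # Word-level labels are copied onto every piece of the word.
            encoded.append((piece, pos, ner or "NaN", dep))
    return encoded


# Hand-written tags standing in for tagger and parser output (assumption).
tagged = [
    ("Kevin", "PROPN", "PERSON", "compound"),
    ("Federline", "PROPN", "PERSON", "nsubj"),
    ("sings", "VERB", None, "ROOT"),
]
lb_tokens = lb_bpe_encode(tagged)
```

Here "feder" and "line</w>" both carry the PERSON label of "Federline", so the entity attribute survives the split into wordpieces.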

Training Process
GPT has been followed by many studies because of its excellent generalization performance, and we briefly introduce it in the next subsection. Because pre-training is expensive, we reuse the parameters of the OpenAI pre-trained language model, train new embeddings, and fine-tune all parameters during the fine-tuning process. Figure 4 shows the modified preprocessing procedure. Since the input is enhanced by different kinds of new features, we propose two different training processes.

Generative Pre-trained Transformer
The Generative Pre-trained Transformer (Radford et al., 2018), also known as OpenAI GPT or simply GPT, is a language representation model that is pre-trained on large corpora and fine-tunes all pre-trained parameters on the downstream tasks. Its main architecture was originally described in (Vaswani et al., 2017). Unless otherwise stated, the GPT mentioned in this paper refers to a 12-layer decoder-only transformer with masked self-attention heads (768-dimensional states and 12 attention heads). For the position-wise feed-forward networks, we use 3072-dimensional inner states. Other hyperparameters are set as in the GPT paper (Radford et al., 2018) for comparison.
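The sizes quoted above can be collected in a small configuration object, which also makes the implied per-head dimension explicit. The class and field names are hypothetical and chosen only for this sketch; the numeric values are the ones stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GPTConfig:
    """Model sizes stated in the text; field names are illustrative."""
    n_layer: int = 12    # transformer decoder blocks
    d_model: int = 768   # dimension of the hidden states
    n_head: int = 12     # masked self-attention heads per block
    d_ff: int = 3072     # inner dimension of the feed-forward networks

    @property
    def head_dim(self):
        """Dimension handled by each attention head."""
        if self.d_model % self.n_head != 0:
            raise ValueError("d_model must be divisible by n_head")
        return self.d_model // self.n_head


config = GPTConfig()
```

With these values each head works on 64 dimensions, and the feed-forward inner size is four times the state size.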

Sharing Parameters
With this method, we treat the additional features like positional encodings. We add these tags to the vocabulary and randomly initialize their embeddings in the embedding layer, so the input is a weighted combination of embeddings, where E_t, E_p and E_li represent the token embedding matrix, the position embedding matrix and the linguistic feature embedding matrices, C_l is the collection of all types of linguistic features, and w_wt, w_wp and w_wli are the corresponding weight scalars. The subsequent data flow is the same as in GPT (Liu et al., 2018). We minimize an objective that combines two losses, where C represents the labeled dataset, L_2(C) is the cross-entropy loss of classification on C, L_1(C) is the cross-entropy loss of the language model on C, and λ is a weight parameter between 0 and 1.
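The two formulas referred to in this paragraph are not reproduced in the text. One plausible reading, consistent with the symbols defined here, is given below as a sketch only. The selector matrices T, P and F_{l_i} (one-hot rows picking the token, position and feature embeddings for each input position) and the name h_0 for the input to the first transformer block are notation introduced for this reconstruction, and the additive form of the objective is assumed by analogy with the fine-tuning objective of the GPT paper.

```latex
% Weighted combination of token, position and linguistic feature embeddings
h_0 = w_{wt}\, T E_t \;+\; w_{wp}\, P E_p \;+\; \sum_{l_i \in C_l} w_{wl_i}\, F_{l_i} E_{l_i}

% Fine-tuning objective to minimize, with 0 \le \lambda \le 1
L(C) = L_2(C) + \lambda\, L_1(C)
```

Under this reading, setting every weight scalar to 1.0 reduces the first expression to a plain sum of embeddings, as in the original GPT input with the linguistic features added.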

Multi-channel Separate Transformer
GPT uses a multi-layer transformer decoder as the language model, and it learns patterns in the language simply by observing token-level sequences. In this part, we use LB-BPE to encode sequences, which gives us multiple input channels. We employ two stand-alone multi-layer transformer decoders to separate the parameters learned from the different perspectives. One is the 12-layer pre-trained transformer released with OpenAI GPT; the other is an entirely new one, which has a different number of layers from the former and is responsible for the linguistic features.

Setup
We follow the model hyperparameters given in the GPT paper (Liu et al., 2018). We use learned position embeddings whose length varies with the downstream task. A pilot experiment shows that setting the weight of each channel in the weighted layer to 1.0 is almost the best choice. We use a 16G ASPEED Graphics Family (rev 30) card for training.

Datasets
Story Cloze Test. Story Cloze Test (Mostafazadeh et al., 2017) is a new commonsense reasoning framework for evaluating story understanding, story generation, and script learning. This test requires a system to choose the correct ending to a four-sentence story. We choose it as part of our evaluation corpora because it requires the model to capture rich linguistic phenomena.

RTE and SciTail. RTE and SciTail are natural language inference tasks, also called textual entailment tasks: given a text t and a hypothesis h, t entails h if, typically, a human reading t would infer that h is most likely true. The relation is directional. These two datasets differ from each other in size and domain.
SST-2. SST-2 (Socher et al., 2013) is a sentiment analysis dataset containing 56.4k movie reviews, each with a binary label.

Table 2
Model    Story Cloze (Acc%)    RTE (Acc%)    SciTail (Acc%)    SST-2 (Acc%)
jose.fonollosa's model    87.
The details of the models for comparison in this table can be found in the leaderboards on the official websites of the corresponding datasets. Each row in the second block represents an LB-BPE result with a different combination of features. NER denotes named-entity labels, POS denotes part-of-speech tags and DEP denotes syntactic dependency parsing tags. NPB-BPE * denotes filtering out named entities of the types DATE, TIME, PERCENT, MONEY, QUANTITY and ORDINAL.

Evaluation of Label Based Reverse Mapping Bytepair Encoding
We evaluate LB-BPE on these datasets and conduct multiple controlled trials with different combinations of linguistic features, using the parameter-sharing training procedure. Details are given in Table 2. The results show that incorporating named-entity features usually works well, while POS tags can sometimes bring a significant performance improvement. Using as many external features as possible is not a good strategy, most likely because of error propagation. However, combining named-entity information with other features may bring a small boost. Compared with GPT, our best results achieve absolute increases of 1.58% on Story Cloze Test, 6.4% on RTE, 0.69% on SciTail and 0.8% on SST-2.

Evaluation of Named-entity Phrase Based Bytepair Encoding
We evaluate NPB-BPE with the parameter-sharing training procedure. Details are shown in Table 2.
Overall, NPB-BPE improves performance on all datasets without relying heavily on external features. On Story Cloze Test, neither kind of NPB-BPE is worse than LB-BPE (NER). On RTE, NPB-BPE * nearly matches the best performance of LB-BPE. SciTail shows little improvement, probably because its sentences focus on expressing concepts and knowledge, so the key words are more often verbs and nouns than named entities. Unexpectedly, however, applying NPB-BPE * to SST-2 gains a 0.6% absolute increase. We also collect statistics on the named entities in these datasets; details are shown in Table 3. We observe a positive correlation between the percentage of named entities and the performance improvement.

Case Study
We compare the attention weights of GPT and our LB-BPE (NER+POS) approach on a textual entailment example, as shown in Figure 6. GPT labels it as not entailment, while our approach labels it as entailment, which is the right answer. The attention weights come from the same attention head in the last transformer block. Changes in some parts lead to the correct answer.

Conclusion and Future Work
In this paper, we introduce a simple approach called reverse mapping bytepair encoding to improve natural language understanding based on pre-trained language models. Reverse mapping bytepair encoding has two forms: one is label based and the other is named-entity phrase based. Both forms add extra information to the tokens generated by BPE; the former employs linguistic features directly, and the latter provides a way to share concepts through named-entity phrases. In addition, in the second form, we reduce the problem we are trying to solve to a binary classification problem that judges whether a word is a key word. By applying both forms to the fine-tuning process of GPT, we gain about 1-6% improvement on downstream NLU tasks. We also propose a new model architecture named MCST, and experiments based on it show its effectiveness in some cases. Besides, the experimental results show that linguistic features usually act at a higher level than words and wordpieces. Our method is easy to apply to existing language models. Several directions remain for future work. Language models take many forms, and we have only tested our approach on GPT; a follow-up direction is to find out whether it is generic enough. In this paper, we do not pre-train the linguistic features, and doing so might improve the results. Several studies focus on incorporating knowledge into systems to improve their performance, so we look forward to finding a smooth way to combine named entities with prior knowledge or knowledge graphs.