UER: An Open-Source Toolkit for Pre-training Models

Existing works, including ELMo and BERT, have revealed the importance of pre-training for NLP tasks. Since no single pre-training model works best in all cases, it is necessary to develop a framework that can deploy various pre-training models efficiently. For this purpose, we propose an assemble-on-demand pre-training toolkit, Universal Encoder Representations (UER). UER is loosely coupled and encapsulates rich modules. By assembling modules on demand, users can either reproduce a state-of-the-art pre-training model or develop a pre-training model that remains unexplored. With UER, we have built a model zoo containing pre-trained models based on different corpora, encoders, and targets (objectives). With proper pre-trained models, we achieve new state-of-the-art results on a range of downstream datasets.


Introduction
Pre-training has been well recognized as an essential step for NLP tasks, since it yields remarkable improvements on a range of downstream datasets (Devlin et al., 2018). Instead of training models on a specific task from scratch, pre-training models are first trained on general-domain corpora and then fine-tuned on downstream tasks. Thus far, a large number of works have been proposed to find better pre-training models. Existing pre-training models mainly differ in three aspects: the model encoder, the pre-training target, and the fine-tuning strategy.
Using a proper target is one of the keys to the success of pre-training. While the language model is most commonly used (Radford et al., 2018), many works seek better targets, such as the masked language model (cloze test) (Devlin et al., 2018) and machine translation (McCann et al., 2017).
Using a proper fine-tuning strategy is also important to the performance of pre-training models on downstream tasks. A commonly used strategy is to regard pre-trained models as feature extractors (Kiros et al., 2015).
There are many open-source implementations of pre-training models, such as Google's BERT, ELMo from AllenAI, and GPT and BERT from HuggingFace. However, these works usually focus on the designs of one or a few pre-training models. Due to the diversity of downstream tasks and computational resource constraints, no single pre-training model works best in all cases. BERT is one of the most widely used pre-training models. It exploits two unsupervised targets for pre-training, but in some scenarios supervised information is critical to the performance of downstream tasks (Conneau et al., 2017; McCann et al., 2017). Besides, in many cases BERT is ruled out due to its efficiency issues. For these reasons, one often has to adopt different pre-training models in different application scenarios.
In this work, we introduce UER, a general framework that facilitates the development of various pre-training models. UER maintains model modularity and supports research extensibility. It consists of four components: subencoder, encoder, target, and downstream-task fine-tuning. The architecture of UER (the pre-training part) is shown in Figure 1. Ample modules are implemented in each component. Users can assemble different modules to implement existing models such as BERT (right part of Figure 1), or develop a new pre-training model by implementing customized modules. Clear and robust interfaces allow users to assemble (or add) modules with as few restrictions as possible.
With the help of UER, we build a Chinese pre-trained model zoo based on different corpora, encoders, and targets. Different datasets have their own characteristics, and selecting proper models from the model zoo can largely boost performance on downstream datasets. In this work, we use Google's BERT as the baseline model. We provide several use cases based on UER, and the results show that our models either achieve new state-of-the-art performance or achieve competitive results with efficient running speed.
UER is built on PyTorch and supports a distributed training mode. Clear instructions and documentation are provided to help users read and use the UER codebase. The UER toolkit and the model zoo are publicly available at https://github.com/dbiir/UER-py.

Pre-training for deep neural networks
Using word embeddings to initialize a neural network's first layer is one of the most commonly used strategies for NLP tasks (Mikolov et al., 2013; Kim, 2014). Inspired by the success of word embedding, some recent works try to initialize entire networks (not just the first layer) with pre-trained parameters (Howard and Ruder, 2018; Radford et al., 2018). They train a deep neural network on a large corpus and fine-tune the pre-trained model on specific downstream tasks. One of the most influential works among them is BERT (Devlin et al., 2018). BERT extracts text features with 12/24 Transformer layers and exploits the masked language model task and the sentence prediction task as training targets (objectives). The drawback of BERT is that it requires expensive computational resources. Thankfully, Google makes its pre-trained models publicly available, so we can directly fine-tune on Google's models to achieve competitive results on many NLP tasks.

NLP toolkits
Many NLP models have tens of hyper-parameters and various tricks, some of which exert large impacts on final performance. It is sometimes impossible to report all details and their effects in a research paper, which may lead to a huge gap between research papers and code implementations. To address this problem, some works implement a class of models within one framework. Such works include OpenNMT (Klein et al., 2017) and fairseq (Ott et al., 2019) for neural machine translation; glyph (Zhang and LeCun, 2017) for classification; NCRF++ (Yang and Zhang, 2018) for sequence labeling; and Hyperwords (Levy et al., 2015) and ngram2vec (Zhao et al., 2017) for word embedding, to name a few.
Recently, we have witnessed many influential pre-training works such as GPT, ULMFiT, and BERT. We believe it is useful to develop a framework that facilitates reproducing and refining those models. UER provides the flexibility to build pre-training models with different properties.

Architecture
In this section, we first introduce the core components of UER and the modules implemented in each component. Figure 1 illustrates UER's framework and its detailed modules (the pre-training part). UER's modular design largely facilitates the use of pre-training models. At the end of this section, we give case studies illustrating how to use UER effectively.

Subencoder
This layer learns word vectors from subword features. For English, we use characters as subword features. For Chinese, we use radicals and pinyin as subword features. As a result, the model can be aware of the internal structure of words. Subword information has been explored in many NLP tasks such as text classification (Zhang and LeCun, 2017) and word embedding (Joulin et al., 2016). In the pre-training literature, ELMo exploits a subencoder layer. In UER, we implement RNN and CNN subencoders, and use mean pooling or max pooling over hidden states to obtain fixed-length word vectors.
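As an illustration, a CNN subencoder with max pooling might look like the following minimal PyTorch sketch; the class and argument names are hypothetical stand-ins, not UER's actual interfaces.

```python
import torch
import torch.nn as nn

class CnnSubencoder(nn.Module):
    """Hypothetical sketch: convolve character embeddings and
    max-pool them into a fixed-length word vector."""
    def __init__(self, char_vocab_size, char_dim=32, word_dim=64):
        super().__init__()
        self.char_embed = nn.Embedding(char_vocab_size, char_dim)
        self.conv = nn.Conv1d(char_dim, word_dim, kernel_size=3, padding=1)

    def forward(self, char_ids):
        # char_ids: (num_words, chars_per_word)
        x = self.char_embed(char_ids)       # (W, C, char_dim)
        x = self.conv(x.transpose(1, 2))    # (W, word_dim, C)
        return x.max(dim=2).values          # max pooling -> (W, word_dim)

sub = CnnSubencoder(char_vocab_size=100)
word_vecs = sub(torch.randint(0, 100, (5, 7)))  # 5 words, 7 chars each
print(word_vecs.shape)  # torch.Size([5, 64])
```

Replacing `max(dim=2)` with `mean(dim=2)` gives the mean-pooling variant mentioned above.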

Encoder
This layer learns features from word vectors. UER implements a series of basic encoders, including LSTM, GRU, CNN, GatedCNN, and AttentionNN. Users can use these basic encoders directly or combine them: the output of one encoder can be fed into another, forming networks of arbitrary depth. UER provides ample examples of combining basic encoders (e.g., CNN + LSTM), and users can also build custom combinations from the basic encoders in UER.
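Chaining encoders can be sketched as below; the two encoder classes are simplified stand-ins for UER's modules, written so that each consumes and produces a `(batch, seq, hidden)` tensor and can therefore be stacked freely.

```python
import torch
import torch.nn as nn

class CnnEncoder(nn.Module):
    """Toy 1-D convolutional encoder over a sequence of word vectors."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)

    def forward(self, x):                    # x: (batch, seq, dim)
        return self.conv(x.transpose(1, 2)).transpose(1, 2)

class LstmEncoder(nn.Module):
    """Toy LSTM encoder returning per-position hidden states."""
    def __init__(self, dim):
        super().__init__()
        self.lstm = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, x):
        out, _ = self.lstm(x)
        return out

# The output of one encoder is fed into the next (CNN + LSTM).
encoders = nn.Sequential(CnnEncoder(64), LstmEncoder(64))
h = encoders(torch.randn(2, 10, 64))
print(h.shape)  # torch.Size([2, 10, 64])
```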
Currently, the Transformer (a structure based on multi-headed self-attention) has become a popular text feature extractor and has proven effective for many NLP tasks. We implement a Transformer module and integrate it into UER. With the Transformer module, we can easily implement models such as GPT and BERT.
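A Transformer can fill the same encoder slot. This minimal sketch uses PyTorch's built-in Transformer layers rather than UER's own module:

```python
import torch
import torch.nn as nn

# Two stacked multi-headed self-attention layers acting as the encoder.
layer = nn.TransformerEncoderLayer(d_model=64, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

h = encoder(torch.randn(2, 10, 64))  # (batch, seq_len, hidden)
print(h.shape)  # torch.Size([2, 10, 64])
```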

Target (objective)
Using a suitable target is key to the success of pre-training. Many papers in this field propose their own targets and show their advantages over others. UER offers a range of targets; users can choose one of them, or use multiple targets with different weights. In this section we introduce the targets implemented in UER.
• Language model (LM). The language model is one of the most commonly used targets. It trains the model to predict the current word given the previous words.
• Masked LM (MLM, also known as the cloze test). The model is trained to predict a masked word given its surrounding words. MLM utilizes both left and right contexts to predict words, whereas LM only considers the left context.
• Autoencoder (AE). The model is trained to reconstruct the input sequence as closely as possible.
The above targets are related to word prediction; we call them word-level targets. Some works show that introducing sentence-level tasks into the targets can benefit pre-training models (Logeswaran and Lee, 2018; Devlin et al., 2018).
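The cloze-style masking behind MLM can be sketched in a few lines. The mask id and the simplified masking rule (a single mask probability, omitting BERT's 80/10/10 replacement mix) are illustrative assumptions, not UER's exact implementation.

```python
import random

MASK_ID = 103  # hypothetical id of the [MASK] token

def mask_tokens(token_ids, mask_prob=0.15, seed=1):
    """Hide roughly mask_prob of the positions behind [MASK] and keep
    the original ids as prediction labels (-100 = position not scored)."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tid in token_ids:
        if rng.random() < mask_prob:
            masked.append(MASK_ID)  # the model sees [MASK] here ...
            labels.append(tid)      # ... and must recover the original id
        else:
            masked.append(tid)
            labels.append(-100)
    return masked, labels

masked, labels = mask_tokens(list(range(100, 120)))
```

The loss is then computed only at the masked positions, using both left and right context, which is exactly what distinguishes MLM from the left-to-right LM target.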
• Next sentence prediction (NSP). The model is trained to predict whether two sentences are continuous. The sentence prediction target is much more efficient than word-level targets: it involves neither sequential decoding of words nor a softmax layer over the entire vocabulary.
The above targets are unsupervised tasks (also known as self-supervised tasks). However, supervised tasks can provide additional knowledge that raw corpora cannot.
• Neural machine translation (NMT). CoVe (McCann et al., 2017) proposes to use NMT to pre-train models. The implementation of the NMT target is similar to the autoencoder: both involve encoding source sentences and sequentially decoding the words of target sentences.
Most pre-training models use the above targets individually, but it is worth trying multiple targets at the same time. Some targets are complementary to each other, e.g., word-level and sentence-level targets (Devlin et al., 2018), or unsupervised and supervised targets. In the experiments section, we demonstrate that proper selection of targets is important. UER gives users the flexibility to try different targets and their combinations.
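Combining multiple targets can be as simple as a weighted sum of their losses. This toy sketch uses dummy loss values and hypothetical target names; it only illustrates the weighting scheme, not UER's loss code.

```python
def combine_targets(losses, weights):
    """Weighted sum of per-target losses.
    losses: dict mapping target name -> loss value; weights: same keys."""
    assert losses.keys() == weights.keys()
    return sum(weights[name] * losses[name] for name in losses)

# Dummy losses for three hypothetical targets with user-chosen weights.
total = combine_targets({"mlm": 2.0, "nsp": 0.5, "cls": 1.0},
                        {"mlm": 1.0, "nsp": 0.5, "cls": 2.0})
print(total)  # 4.25
```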

Fine-tuning
UER exploits a fine-tuning strategy similar to that of ULMFiT, GPT, and BERT. Models on downstream tasks share structures and parameters with pre-training models, except that they have different target layers. The entire model is fine-tuned on the downstream task. This strategy performs robustly in practice. We also find that the feature-extractor strategy produces inferior results with models such as GPT and BERT.
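The strategy amounts to reusing the pre-trained embedding/encoder and swapping only the target layer for a task-specific output layer, with all parameters left trainable. The sketch below uses a trivial stand-in encoder; the names are illustrative, not UER's API.

```python
import torch
import torch.nn as nn

class Classifier(nn.Module):
    """Downstream model: shares the pre-trained encoder's structure and
    parameters; only the output (target) layer is new."""
    def __init__(self, pretrained_encoder, hidden, num_labels):
        super().__init__()
        self.encoder = pretrained_encoder            # reused as-is
        self.output = nn.Linear(hidden, num_labels)  # new target layer

    def forward(self, x):
        h = self.encoder(x)          # (batch, seq, hidden)
        return self.output(h[:, 0])  # classify from the first position

# Stand-in for a pre-trained encoder; nothing is frozen, so the whole
# network is updated during fine-tuning.
encoder = nn.Sequential(nn.Linear(64, 64), nn.Tanh())
model = Classifier(encoder, hidden=64, num_labels=2)
logits = model(torch.randn(2, 10, 64))
print(logits.shape)  # torch.Size([2, 2])
```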
Most pre-training works involve two stages: pre-training and fine-tuning. UER supports three stages: 1) pre-training on a general-domain corpus; 2) pre-training on the downstream dataset; 3) fine-tuning on the downstream dataset. Stage 2 enables models to get familiar with the distribution of the downstream dataset (Howard and Ruder, 2018; Radford et al., 2018). It is also called the semi-supervised fine-tuning strategy in the work of Dai and Le (2015), since stage 2 is unsupervised and stage 3 is supervised.
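The three stages can be outlined as follows; `pretrain` and `finetune` are placeholder stubs that merely record the schedule, not UER functions.

```python
log = []

def pretrain(model, corpus):
    log.append(("pretrain", corpus))   # unsupervised target, e.g. LM/MLM

def finetune(model, dataset):
    log.append(("finetune", dataset))  # supervised, task-specific target

def three_stage_training(model, general_corpus, task_corpus, labeled_data):
    pretrain(model, general_corpus)  # stage 1: general-domain corpus
    pretrain(model, task_corpus)     # stage 2: unlabeled downstream text
    finetune(model, labeled_data)    # stage 3: labeled downstream data
    return model

three_stage_training("model", "wikipedia", "task_corpus", "task_labels")
print(log)
```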

Case Studies
In this section, we show how UER facilitates the use of pre-training models. First, we demonstrate that UER can build most pre-training models easily. As shown in the code listing below, only a few lines are required to construct models with UER's interfaces. In practice, users can assemble different subencoder, encoder, and target modules without any coding: modules are specified through the --subencoder, --encoder, and --target options. More details are available in the quickstart and instructions of UER's GitHub project. UER provides ample modules, and users can try different module combinations according to their downstream datasets. Besides the modules implemented by UER, users can also develop customized modules and integrate them into UER seamlessly.

Experiments
To evaluate the performance of UER, experiments are conducted on a range of datasets, each of which falls into one of four categories: sentence classification, sentence pair classification, sequence labeling, and document-based QA. The BERT-base uncased English model and the BERT-base Chinese model are used as baselines.
In Section 4.1, UER is tested on several evaluation benchmarks to demonstrate that it produces models as intended. In Section 4.2, we apply pre-trained models from our model zoo to different downstream datasets; significant improvements are observed when proper encoders and targets are selected. Due to space constraints, we put some content in UER's GitHub project, including dataset and corpus details, system speed, and part of the qualitative/quantitative evaluation results.

Reproducibility
This section uses English and Chinese benchmarks to test UER's BERT implementation. For English, we use the sentence and sentence pair classification datasets from the GLUE benchmark (dev set) (Wang et al., 2019). For Chinese, we use five datasets (provided by ERNIE) covering sentiment analysis, sequence labeling, question pair matching, natural language inference, and document-based QA.

Influence of targets and encoders
In this section, we give examples of selecting pre-trained models for given downstream datasets. Three Chinese sentiment analysis datasets are used for evaluation: Douban book review, Online shopping review, and Chnsenticorp.
First, we use UER to pre-train on a large-scale Amazon review corpus with different targets. The parameters are initialized with the BERT-base Chinese model. The target of the original BERT consists of MLM and NSP. However, NSP is not suitable for sentence-level reviews (we would have to split reviews into multiple parts), so we remove the NSP target. In addition, Amazon reviews come with users' ratings, so we can also exploit a classification (CLS) target for pre-training (similar to InferSent). We fine-tune these pre-trained models (with different targets) on the downstream datasets. The results are shown in Table 4.

BERT requires heavy computational resources. To achieve better efficiency, we use UER to substitute a 2-layer LSTM encoder (embedding size 512, hidden size 1024) for the 12-layer Transformer encoder. We still use the above sentiment analysis datasets for evaluation. The model is first trained on a mixed large corpus with the LM target, and then trained on the large-scale Amazon review corpus with the LM and CLS targets. Due to space constraints, this section only uses sentiment analysis datasets as examples to analyze the influence of different targets and encoders. More tasks and pre-trained models are discussed in UER's GitHub project.

Conclusion
This paper describes UER, an open-source toolkit for pre-training on general-domain corpora and fine-tuning on downstream tasks. We demonstrate that UER largely facilitates the implementation of different pre-training models. With the help of UER, we pre-train models based on different corpora, encoders, and targets, and make these models publicly available. By using proper pre-trained models, we can achieve significant improvements over BERT, or achieve competitive results with efficient training speed.

# Implementation of BERT.
embedding = BertEmbedding(args, vocab_size)
encoder = BertEncoder(args)
target = BertTarget(args, vocab_size)

# Implementation of GPT.
embedding = BertEmbedding(args, vocab_size)
encoder = GptEncoder(args)
target = LmTarget(args, vocab_size)

# Implementation of Quick-thoughts.
embedding = Embedding(args, vocab_size)
encoder = GruEncoder(args)
target = NspTarget(args, None)

# Implementation of InferSent.
embedding = Embedding(args, vocab_size)
encoder = LstmEncoder(args)
target = ClsTarget(args, None)

Table 1 :
Eight pre-training models and their differences. Due to space constraints in the table, the fine-tuning strategies of the different models are described as follows: Skip-thoughts, Quick-thoughts, and InferSent regard pre-trained models as feature extractors; the parameters before the output layer are frozen. CoVe and ELMo transfer word embeddings to downstream tasks, with the other parameters in the neural networks uninitialized. ULMFiT, GPT, and BERT fine-tune the entire network on downstream tasks.

Table 2 :
The performance of HuggingFace's implementation and UER's implementation on GLUE benchmark.

Table 3 :
The performance of ERNIE's implementation and UER's implementation on ERNIE benchmark.

The BERT baseline (BERT-base Chinese) is pre-trained on Chinese Wikipedia. We can observe that pre-training on the Amazon review corpus improves the results significantly, and that the CLS target achieves the best results in most cases.

Table 4 :
Performance of pre-training models with different targets.

Table 5 :
Performance of pre-training models with different encoders.

Table 5 lists the results for different encoders. Compared with the BERT baseline, the LSTM encoder achieves comparable or even better results when proper corpora and targets are selected.