Single Model Ensemble using Pseudo-Tags and Distinct Vectors

Model ensemble techniques often improve task performance in neural networks; however, they require additional time, memory, and management effort. In this study, we propose a novel method that replicates the effects of a model ensemble with a single model. Our approach creates K virtual models within a single parameter space using K distinct pseudo-tags and K distinct vectors. Experiments on text classification and sequence labeling tasks on several datasets demonstrate that our method emulates or outperforms a traditional model ensemble with 1/K of the parameters.


Introduction
A model ensemble is a promising technique for increasing the performance of neural network models (Lars and Peter, 1990; Anders and Jesper, 1994). This method combines the outputs of multiple models that are individually trained on the same training data. Recent submissions to natural language processing (NLP) competitions are primarily composed of neural network ensembles (Bojar et al., 2018; Barrault et al., 2019). Despite its effectiveness, a model ensemble is costly: because it handles multiple models, it requires more time for training and inference, more memory, and greater management effort. The model ensemble technique therefore cannot always be applied to real systems, as many systems, such as edge devices, must work with limited computational resources.
In this study, we propose a novel method that replicates the effects of the ensemble technique with a single model. Following the principle that aggregating multiple models improves performance, we create multiple virtual models in a shared parameter space. Our method virtually inflates the training data K times by appending K distinct pseudo-tags to all input data, and it incorporates K distinct vectors that correspond to the pseudo-tags. Each pseudo-tag k ∈ {1, . . . , K} is attached to the beginning of the input sentence, and the k-th vector is added to the embedding vectors of all tokens in the input sentence. Fig. 1 presents a brief overview of our proposed method. Intuitively, this operation shifts the embedding of the same data to the k-th designated subspace and can be interpreted as explicitly creating K virtual models in a shared space. We thus expect to obtain the same (or similar) effects as an ensemble of K models from the K virtual models generated within a single model. Experiments on text classification and sequence labeling tasks reveal that our method outperforms single models of the same parameter size in all settings. Moreover, our technique emulates or surpasses the normal ensemble with 1/K of the parameters on several datasets.

Related Work
The neural network ensemble is a widely studied method (Lars and Peter, 1990; Anders and Jesper, 1994; Hashem, 1994; Opitz and Shavlik, 1996); however, these studies have focused mainly on improving performance while ignoring costs, such as computation time, memory space, and management effort.
Several methods have been proposed to overcome the shortcomings of traditional ensemble techniques. For training, snapshot ensembles (Huang et al., 2017) construct multiple models from a single training run by converging into multiple local minima along the optimization path. For inference, knowledge distillation (Hinton et al., 2015) transfers the knowledge of an ensemble model into a single model. These methods still use multiple models during either training or inference, and thus only partially eliminate the costs of the traditional ensemble.
The incorporation of pseudo-tags is a standard technique widely used in the NLP community (Rico et al., 2016; Melvin et al., 2017). However, to the best of our knowledge, our approach is the first attempt to incorporate pseudo-tags as identification markers of virtual models within a single model.
The approach most similar to ours is dropout (Srivastava et al., 2014), which stochastically omits each hidden unit during each mini-batch while utilizing all units at inference. Huang et al. (2017) interpreted this technique as implicitly training an exponential number of virtual models within the same network. In contrast to dropout, our method explicitly utilizes virtual models with shared parameters, which, as discussed in Section 5, is complementary to dropout.

Base Encoder Model
The target tasks of this study are text classification and sequence labeling. The input is a sequence of tokens (i.e., a sentence). Here, x_t denotes the one-hot vector of the t-th token in the input. Let E ∈ R^{D×|V|} be the embedding matrix, where D is the dimension of the embedding vectors and V is the vocabulary of the input.
We obtain the embedding vector e_t at position t by e_t = E x_t. We introduce the notation e_{1:T} to represent the list of vectors (e_1, e_2, . . . , e_T) that corresponds to the input sentence, where T is the number of tokens in the input. Given e_{1:T}, the feature (or hidden) vectors h_t ∈ R^H for all t ∈ {1, . . . , T} are computed by an encoder neural network ENC(·), where H denotes the dimension of the feature vectors. Namely,

h_{1:T} = ENC(e_{1:T}).  (1)

Finally, the output y given input x_{1:T} is estimated as y = φ(h_{1:T}), where φ(·) represents a task-dependent function (e.g., a softmax function for text classification or a conditional random field layer for sequence labeling). Note that the form of the output y differs depending on the target task.
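The pipeline above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the mean-pooling encoder and two-class softmax head are stand-ins for the Transformer encoder ENC(·) and the task-dependent φ(·), and all sizes, token ids, and weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D, V = 8, 100                 # embedding dim, vocabulary size (illustrative)

E = rng.normal(size=(D, V))   # embedding matrix E in R^{D x |V|}

def encode(e_seq):
    """Stand-in for ENC(.): mean over positions.
    (The paper uses a Transformer encoder instead.)"""
    return e_seq.mean(axis=0, keepdims=True)      # h in R^{1 x H}, H = D here

def phi(h_seq):
    """Stand-in for the task-dependent output layer:
    a softmax over 2 classes with a hypothetical weight matrix W."""
    W = rng.normal(size=(h_seq.shape[1], 2))
    logits = h_seq.mean(axis=0) @ W
    z = np.exp(logits - logits.max())
    return z / z.sum()

tokens = [3, 14, 15]                              # token ids of the input sentence
e_seq = np.stack([E[:, t] for t in tokens])       # e_t = E x_t for each position
y = phi(encode(e_seq))                            # class distribution
```

The one-hot products E x_t reduce to column lookups E[:, t], which is how embedding layers are implemented in practice.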

Single Model Ensemble using Pseudo-Tags and Distinct Vectors
In this section, we introduce the proposed method, which we refer to as SINGLEENS. Fig. 1 presents an overview of the method. The main principle of this approach is to create different virtual models within a single model, for which we incorporate pseudo-tags and predefined distinct vectors. For the pseudo-tags, we add special tokens {⟨k⟩}_{k=1}^{K} to the input vocabulary, where the hyper-parameter K represents the number of virtual models. For the predefined distinct vectors, we leverage mutually orthogonal vectors {o_k}_{k=1}^{K}, where the orthogonality condition requires o_k · o_{k'} = 0 for all pairs (k, k') with k ≠ k'. Finally, we assume that every input sentence starts with one of the pseudo-tags ⟨k⟩, and we add the corresponding orthogonal vector o_k of the attached pseudo-tag to the embedding vectors at all positions. The new embedding vectors ẽ^{(k)}_{0:T} are written in the following form:

ẽ^{(k)}_t = e_t + o_k for all t ∈ {0, 1, . . . , T},  (2)

where e_0 is the embedding vector of the pseudo-tag ⟨k⟩. We substitute e_{1:T} in Eq. 1 with ẽ^{(k)}_{0:T} in the proposed method.
An intuitive explanation of the role of pseudo-tags is that they allow a single model to explicitly recognize differences in otherwise homogeneous input, while the purpose of the orthogonal vectors is to linearly shift the embeddings in each virtual model's designated direction. By combining these elements, we believe that we can define virtual models within a single model and effectively use a local region of the space for each virtual model. Aggregating these virtual models can then imitate an ensemble.
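The construction described above can be illustrated with a small numpy sketch. This is an assumption-laden toy, not the paper's code: the vocabulary slots for the pseudo-tags, the dimensions, and the token ids are hypothetical, and the mutually orthogonal vectors come from a QR decomposition rather than the authors' exact initializer.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, V = 3, 8, 100               # number of virtual models, dims (illustrative)
E = rng.normal(size=(D, V))       # embedding matrix, pseudo-tags included

# K mutually orthogonal vectors o_k: rows of an orthogonal matrix Q
# obtained via QR decomposition (so o_k . o_k' = 0 for k != k').
Q, _ = np.linalg.qr(rng.normal(size=(D, D)))
o = Q[:K]

TAG_IDS = [97, 98, 99]            # hypothetical vocabulary slots for <1>, <2>, <3>

def virtual_model_input(tokens, k):
    """Prepend pseudo-tag <k> and add o_k at every position,
    i.e. e~_t^(k) = e_t + o_k for t = 0..T."""
    seq = [TAG_IDS[k]] + tokens
    e = np.stack([E[:, t] for t in seq])   # e_0 is the pseudo-tag embedding
    return e + o[k]                        # broadcast o_k over all positions

x = [3, 14, 15]                                          # one sentence
inputs = [virtual_model_input(x, k) for k in range(K)]   # K views of the same data
```

Because only the pseudo-tag embedding and the added o_k differ between the K views, the same shared encoder sees the same sentence shifted into K designated subspaces.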

Experiments
To evaluate the effectiveness of our method, we conducted experiments on two tasks: text classification and sequence labeling. We used the IMDB (Andrew et al., 2011) and Rotten (Bo and Lillian, 2005) datasets for text classification, and the CoNLL-2000 and CoNLL-2003 datasets for sequence labeling. We used the Transformer model (Vaswani et al., 2017) as the base model for all experiments; its token vector representations were empowered by the pretrained vectors of GloVe (Jeffrey et al., 2014), BERT (Devlin et al., 2018), or ELMo (Matthew et al., 2018). These models are referred to as TFM:GLOVE, TFM:BERT, and TFM:ELMO, respectively (see Appendix A for detailed experimental settings). For TFM:BERT, we incorporated the feature (or hidden) vectors of the final layer of the BERT model as the embedding vectors while adopting the drop-net technique (Zhu et al., 2020). All models have dropout layers, which allows us to assess the complementarity of our method and dropout.
We compared our method (SINGLEENS) to a single model (SINGLE), a normal ensemble (NORMALENS), and a normal ensemble in which each component has approximately 1/K of the parameters (1/K ENS); because BERT requires a fixed number of parameters, we could not reduce the parameters exactly for 1/K TFM:BERT (see Appendix A for detailed experimental settings). Although other ensemble-like methods discussed in Section 2 could have been compared (e.g., snapshot ensembles, knowledge distillation, or applying dropout at test time to generate and aggregate predictions), they are imitations of a normal ensemble, and we assumed that the results of a normal ensemble were an upper bound. We used K = 9 for reporting the primary results of NORMALENS, 1/K ENS, and SINGLEENS. We thus prepared nine pseudo-tags {⟨k⟩}_{k=1}^{9}, trained and initialized in the same manner as the other embeddings. We created untrainable distinct vectors {o_k}_{k=1}^{9} using the implementation of Saxe et al. (2013) provided in PyTorch's default function, torch.nn.init.orthogonal. We empirically selected the scaling of the distinct vectors from {1, 3, 5, 10, 30, 50, 100}, choosing the scale closest to that of the model's embedding vectors. We obtained the final predictions of the K ensemble models by averaging the outputs of individual models for text classification and by voting for sequence labeling. The results were obtained by averaging five distinct runs with different random seeds.

Table 2: Test F1 score and parameter size for sequence labeling tasks. Similarly to NORMALENS, SINGLEENS improved the score even at high performance levels.
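The two aggregation rules described above, averaging class distributions for text classification and per-token majority voting for sequence labeling, might be sketched as follows; the inputs are hypothetical model outputs, not results from the paper.

```python
import numpy as np

def ensemble_classify(prob_list):
    """Average the K virtual models' class distributions and pick the argmax."""
    return int(np.mean(prob_list, axis=0).argmax())

def ensemble_label(tag_seqs):
    """Majority vote per token over the K predicted tag sequences."""
    out = []
    for token_tags in zip(*tag_seqs):           # tags predicted for one position
        vals, counts = np.unique(token_tags, return_counts=True)
        out.append(str(vals[counts.argmax()]))  # most frequent tag wins
    return out
```

For example, three class distributions [0.6, 0.4], [0.2, 0.8], [0.1, 0.9] average to [0.3, 0.7], so the ensemble predicts class 1 even though one model preferred class 0.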

Evaluation of text classification
Data We followed the settings used in the implementation by Kiyono et al. (2018) for data partition. 4 Our method, SINGLEENS inflates the training data by K times. During the inflation, the k-th subset is sampled by bootstrapping (Efron and Tibshirani, 1993) with the corresponding k-th pseudotag. For NORMALENS and 1/K ENS, we attempted both bootstrapping and normal sampling, and a higher score was reported.  for SINGLE and SINGLEENS, respectively, for TFM:GLOVE, and 0.34 and 0.11, respectively, for TFM:BERT. These results support the claim that explicit operations for defining K virtual models have a significant effect for a single model and are complementary to normal dropout. Through the series of experiments, we observed that the number of iterations of SINGLEENS was 1.0˜1.5 times greater than that of SINGLE.

Evaluation of sequence labeling
Data We followed the task settings of CoNLL-2000 and CoNLL-2003 (the statistics of the datasets are presented in Appendix B). We inflated the training data nine times for SINGLEENS, and normal sampling was used for NORMALENS and 1/K ENS. Because bootstrapping was not effective for this task, those results are omitted.
Results As displayed in Table 2, SINGLEENS surpassed SINGLE by 0.44 and 0.14 on CoNLL-2003 and CoNLL-2000, respectively, for TFM:ELMO with the same parameter size. However, NORMALENS produced the best results in this setting. The standard deviations of the single model and our method were 0.08 and 0.05, respectively, on CoNLL-2000.

Analysis
In this section, we investigate the properties of our proposed method. Unless otherwise specified, we use TFM:BERT and TFM:ELMO on IMDB and CoNLL-2003 for the analysis.

Significance of pseudo-tags and distinct vectors
To assess the significance of using both pseudo-tags and distinct vectors, we conducted an ablation study of our method, SINGLEENS. We compared our method with the following three settings: 1) Only pseudo-tags, 2) Random distinct vectors, and 3) Random noise. The first setting (Only pseudo-tags) attaches the pseudo-tags to the input without adding the corresponding distinct vectors. The second setting (Random distinct vectors) randomly shuffles the correspondence between the distinct vectors and pseudo-tags in every iteration during training. The third setting (Random noise) adds random vectors in place of the distinct vectors, to clarify whether the effect of incorporating distinct vectors is essentially identical to random noise injection or stems from the explicit definition of virtual models in a single model. Table 3 shows the results of the ablation study. Using both pseudo-tags and distinct vectors, which matches the setting of SINGLEENS, leads to the best performance, whereas the effect is limited or negative if we use pseudo-tags alone or distinct vectors and pseudo-tags without correspondence. This observation indicates that the increase in performance can be attributed to the combined use of pseudo-tags and distinct vectors, and not merely to data augmentation.
We can also observe from Table 3 that the performance of SINGLEENS was higher than that of 3) Random noise. Note that the vectors added by SINGLEENS are fixed to a small number K, whereas Random noise adds a large number of different vectors. This observation therefore supports our claim that the explicit definition of virtual models by distinct vectors has substantial positive effects that are largely irrelevant to random noise injection. It also supports the assumption that SINGLEENS is complementary to dropout: dropout randomly uses sub-networks by stochastically omitting each hidden unit, which can be interpreted as a variant of Random noise, and it has no specific operation for defining an explicitly prepared number of virtual models as SINGLEENS does. We conjecture that this difference yields the complementarity that allows our proposed method and dropout to co-exist.
Vector addition We investigated where the distinct vectors should be added: 1) Emb, 2) Hidden, and 3) Emb + Hidden. Emb adds the distinct vectors only to the embeddings, Hidden adds them only to the final feature vectors, and Emb + Hidden adds them to both. As illustrated in Table 4, adding the vectors to the embeddings is sufficient for improving performance, whereas adding them to the hidden vectors has an adverse effect. This observation can be explained by the architecture of the Transformer: because the Transformer employs residual connections (He et al., 2015), the distinct vectors in the embeddings are recursively propagated through the entire network without being absorbed as non-essential information.
Comparison with normal ensembles To evaluate the behavior of our method, we examined the relationship between performance and the number of models used for training. Our experiments revealed that using more than nine models did not yield significant performance improvements; thus, we only assessed results for up to nine models. Figs. 2 and 3 present the metrics on Rotten and CoNLL-2003, respectively. The performance of our method increased with the number of models, which is a general feature of normal ensembles. Notably, on Rotten, the accuracy of our method rose while that of the other methods did not. Investigation of this behavior is left for future work.

Conclusion
In this paper, we proposed a single-model ensemble technique called SINGLEENS, whose principle is to explicitly create multiple virtual models within a single model. Our experiments demonstrated that the proposed method outperformed single models on both text classification and sequence labeling tasks. Moreover, our method with TFM:BERT surpassed the normal ensemble on the IMDB and Rotten datasets with 1/K of the parameters. These results indicate that explicitly creating virtual models within a single model improves performance. The proposed method is not limited to the two aforementioned tasks but can be applied to other NLP tasks, such as machine translation, as well as to tasks in other fields, such as image recognition. Further theoretical analysis may elucidate the mechanisms of the proposed method.