Task-oriented Domain-specific Meta-Embedding for Text Classification

Meta-embedding learning, which combines complementary information from different word embeddings, has shown superior performance across various Natural Language Processing tasks. However, existing meta-embedding methods still ignore domain-specific knowledge, which results in unstable performance on specific domains. Moreover, the importance of general and domain word embeddings depends on the downstream task, and how to regularize meta-embeddings to adapt to downstream tasks remains an unsolved problem. In this paper, we propose a method to incorporate both domain-specific and task-oriented information into meta-embeddings. We conducted extensive experiments on four text classification datasets, and the results show the effectiveness of our proposed method.


Introduction
Building semantic representations (Zhao et al., 2017; Li et al., 2019; O'Neill and Bollegala, 2020) of words is a vital procedure in various Natural Language Processing (NLP) tasks. Over recent years, many pre-trained word embeddings have emerged, such as pre-trained Word2Vec (Mikolov et al., 2013) and pre-trained GloVe (Pennington et al., 2014). Despite their usefulness, some previous works find that the performance of different pre-trained word embeddings varies significantly across tasks (Hill et al., 2014). To obtain stable and better performance, Yin and Schütze (2015) proposed the meta-embedding learning task, which aims to obtain a robust and superior word embedding (i.e., meta-embedding) by combining different pre-trained word embeddings.
Most previous meta-embedding methods neglect the importance of domain-specific information and use the same embedding for each word in all domain-specific datasets (Bollegala and Bao, 2018; Coates and Bollegala, 2018; Bollegala et al., 2017). It is beneficial to incorporate domain-specific information into general word embeddings and provide different word representations for different domains, which has been shown to improve performance in other tasks (Bollegala et al., 2015; Xu et al., 2018).
This leads us to explore how to combine general and domain-specific information in meta-embedding learning. Intuitively, the importance of the general and domain embeddings depends on the specific domain. For example, in the computer domain, for domain-specific words (e.g., "mouse"), we should preserve their domain information but discard their general information. On the other hand, some general words (e.g., "we", "people") may not obtain high-quality domain embeddings due to insufficient domain data; in this situation, their general word embeddings are preferable. However, most previous meta-embedding methods are unsupervised, so it is hard for them to learn which embedding is preferable. We consider it necessary to use supervision from a downstream task to address this limitation. Specifically, we focus on text classification (TC) and use the words' category distributions of a TC dataset to guide the meta-embedding learning process.
In this paper, we propose a supervised autoencoder method, named Task-oriented Domain-specific AutoEncoded Meta-Embedding (TDAEME), to learn meta-embeddings for text classification. TDAEME combines both general and domain word embeddings in a supervised manner, implemented as a supervised autoencoder. Specifically, TDAEME predicts each word's category distribution, which makes it easier for the downstream classifier to extract useful information from our task-oriented domain-specific meta-embedding. We evaluate TDAEME on four text classification datasets, and the results demonstrate the effectiveness of our method.

Related Work
Yin and Schütze (2015) first proposed a meta-embedding learning method (1TON) to combine the complementary information of multiple pre-trained word embeddings into one meta-embedding. Bollegala and Bao (2018) further improved 1TON by applying an autoencoder framework with three different objective functions to model multiple pre-trained word embeddings; the three resulting models are called DAEME, CAEME, and AAEME. Bollegala et al. (2017) proposed an unsupervised locally linear method for learning meta-embeddings from a set of source embeddings. However, all the above methods only model the information in pre-trained word embeddings, which were trained on unlabeled text, and ignore domain information. A related work, dynamic meta-embedding proposed by Kiela et al. (2018), addresses meta-embedding learning as a supervised learning paradigm. However, their method is built into downstream models, which is quite different from our proposed method. Our method is model-independent: the obtained meta-embeddings can be used as features in any downstream model.

One contemporary work also uses a supervised autoencoder for meta-embedding learning (O'Neill and Bollegala, 2020). However, their motivation and contribution are different from ours. O'Neill and Bollegala (2020) aim to enhance the meta-embedding with word similarity information, so they use the similarity score between words as the supervision signal in meta-embedding learning, while we focus on a more specific task (i.e., text classification); our model uses the words' category information as the supervision signal, which is specifically designed for the classification task. In text classification, words within the same category should be close to each other in the representation space, and using similarity information may pull together two words from different categories (e.g., "learning" and "education" have high similarity but mainly appear in two different categories, "AI" and "sociology", respectively).

Method
Suppose that we have a word embedding set $S = \{S_1, S_2, \ldots, S_n\}$ with a vocabulary $V$, and a labeled text classification dataset $X$ which contains a training set $X_{train}$ and a test set $X_{test}$; we denote its vocabulary as $V_X$ and its categories as $C_X$ with $|C_X| = L$. We aim to learn a task-oriented domain-specific meta-embedding $m(w)$ for each word $w \in V \cap V_X$. The architecture of TDAEME is visualized in Figure 1.

Extraction Component
The extraction component is used to project different word embeddings into one coherent vector space. For each word $w$ in the source embedding vocabulary $V$, let $S_i(w)$ denote the $i$-th source embedding of $w$. We first use $n$ encoders to extract the semantic information of each source embedding into a $d_M$-dimensional vector space, denoted as $E_i(w)$:
$$E_i(w) = f_i(S_i(w)),$$
where $f_i$ is the $i$-th encoder function for the $i$-th source embedding. Then we compute the task-oriented domain-specific meta-embedding $m(w)$ of word $w$:
$$m(w) = E_1(w) \oplus E_2(w) \oplus \cdots \oplus E_n(w),$$
where $\oplus$ is the concatenation operator.
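As a concrete illustration, a minimal PyTorch sketch of the extraction component might look as follows; the module name `SourceEncoders` and the layer shapes are our own assumptions, not the authors' released code:

```python
import torch
import torch.nn as nn

class SourceEncoders(nn.Module):
    """One linear + ReLU encoder per source embedding (illustrative sketch)."""
    def __init__(self, source_dims, meta_dim):
        super().__init__()
        # f_i: projects the i-th source embedding into a d_M-dimensional space
        self.encoders = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, meta_dim), nn.ReLU()) for d in source_dims]
        )

    def forward(self, sources):
        # sources[i]: tensor of shape (batch, source_dims[i])
        encoded = [f(s) for f, s in zip(self.encoders, sources)]  # E_i(w) = f_i(S_i(w))
        return torch.cat(encoded, dim=-1)                         # m(w): concatenation of all E_i(w)
```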

Reconstruction Component
In this component, we take $m(w)$ as input and predict all $n$ source embeddings $D_i(w)$:
$$D_i(w) = g_i(m(w)),$$
where $g_i$ is the $i$-th decoder function that predicts the $i$-th source embedding from $m(w)$. The objective of this component can be represented as $L_R$:
$$L_R = \sum_{w \in V} \sum_{i=1}^{n} \lambda_i \, \lVert S_i(w) - D_i(w) \rVert^2,$$
where $S_i(w)$ and $D_i(w)$ are the $i$-th source and predicted embeddings of word $w$, and $\lambda_i$ is a hyperparameter that adapts the weight of the different source embeddings.
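Continuing the PyTorch sketch above, the decoders and a reconstruction loss could be written as below; the squared-error distance is an assumption on our part, chosen as the standard autoencoder reconstruction objective:

```python
class SourceDecoders(nn.Module):
    """One linear decoder g_i per source, predicting S_i(w) from the meta-embedding m(w)."""
    def __init__(self, meta_dim, source_dims):
        super().__init__()
        total = meta_dim * len(source_dims)        # m(w) is the concatenation of n encodings
        self.decoders = nn.ModuleList([nn.Linear(total, d) for d in source_dims])

    def forward(self, meta):
        return [g(meta) for g in self.decoders]    # D_i(w) = g_i(m(w))

def reconstruction_loss(sources, decoded, lambdas):
    # weighted squared error between each source embedding and its reconstruction
    return sum(lam * ((s - d) ** 2).sum(dim=-1).mean()
               for lam, s, d in zip(lambdas, sources, decoded))
```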

Adaption Component
In this component, we make $m(w)$ predict its category distribution on the downstream dataset. This encourages words with the same category to be close in the meta-embedding vector space. Formally, for each word $w$ that appears both in the vocabulary $V$ (the vocabulary of all source embeddings) and in $V_X$ (the vocabulary of the classification dataset $X$), its category distribution $T_X(w)$ is defined as:
$$T_X^{C_j}(w) = \frac{t_X^{C_j}(w)}{\sum_{k=1}^{L} t_X^{C_k}(w)},$$
where $T_X^{C_j}(w)$ is the normalized document frequency of word $w$ in the $j$-th category and $t_X^{C_j}(w)$ is the number of documents in class $C_j$ that contain $w$.
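A possible way to compute these category distributions from a labeled training set is sketched below; the function name and argument layout are illustrative, following the definition above:

```python
from collections import defaultdict
import numpy as np

def category_distributions(train_docs, labels, num_classes):
    """train_docs: list of token lists; labels: parallel list of class indices.
    Returns {word: length-L vector of normalized per-class document frequencies}."""
    counts = defaultdict(lambda: np.zeros(num_classes))
    for tokens, label in zip(train_docs, labels):
        for w in set(tokens):          # document frequency: each word counted once per document
            counts[w][label] += 1.0
    return {w: c / c.sum() for w, c in counts.items()}
```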
An extra decoder is employed to predict the category distribution $P_X(w)$ of word $w$ from $m(w)$:
$$P_X(w) = g_A(m(w)),$$
where $g_A$ is the extra decoder function. The objective of this component, $L_A$, measures the discrepancy between the predicted distribution $P_X(w)$ and the empirical distribution $T_X(w)$ over all words $w \in V \cap V_X$.
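One plausible instantiation of this component, building on the PyTorch sketch above, uses a softmax decoder and a cross-entropy objective; the cross-entropy choice is our assumption, since the section only states that $L_A$ compares $P_X(w)$ with $T_X(w)$:

```python
import torch.nn.functional as F

class CategoryDecoder(nn.Module):
    """Extra decoder mapping the meta-embedding m(w) to a category distribution."""
    def __init__(self, meta_total_dim, num_classes):
        super().__init__()
        self.proj = nn.Linear(meta_total_dim, num_classes)

    def forward(self, meta):
        return F.log_softmax(self.proj(meta), dim=-1)   # log P_X(w)

def adaption_loss(log_pred, target_dist):
    # cross-entropy between the empirical distribution T_X(w) and the prediction P_X(w)
    return -(target_dist * log_pred).sum(dim=-1).mean()
```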

Joint Learning
The extraction component is shared between the reconstruction component and the adaption component, so we use a joint learning framework to optimize $L_R$ and $L_A$ jointly. The final objective function $L$ is:
$$L = L_R + \alpha \, L_A,$$
where $\alpha$ is a hyperparameter that balances the reconstruction component and the adaption component.
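Tying the pieces together, a hedged sketch of one joint training loop is shown below; `batches` is an assumed iterator over mini-batches of source vectors and target distributions, the embedding sizes and number of categories are hypothetical, and the exact weighting of the two losses may differ from the authors' implementation:

```python
# hypothetical sizes: two 300-dimensional sources, d_M = 300, L = 4 categories
encoders = SourceEncoders(source_dims=[300, 300], meta_dim=300)
decoders = SourceDecoders(meta_dim=300, source_dims=[300, 300])
cat_decoder = CategoryDecoder(meta_total_dim=600, num_classes=4)

params = [p for m in (encoders, decoders, cat_decoder) for p in m.parameters()]
optimizer = torch.optim.Adam(params, lr=0.001)
alpha = 1e-4                                     # weight of the adaption component

for sources, target_dist in batches:             # assumed mini-batches of S_i(w) and T_X(w)
    meta = encoders(sources)                     # m(w)
    l_r = reconstruction_loss(sources, decoders(meta), lambdas=[1.0, 1.0])
    l_a = adaption_loss(cat_decoder(meta), target_dist)
    loss = l_r + alpha * l_a                     # L = L_R + alpha * L_A
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```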

Source Word Embeddings
We use GloVe and CBOW as the two general word embeddings in our experiments.
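For illustration, the two source embeddings could be loaded with gensim (version 4 or later); the file paths below are placeholders:

```python
from gensim.models import KeyedVectors

# placeholder paths; GloVe vectors must first be converted to word2vec text format
glove = KeyedVectors.load_word2vec_format("glove.300d.w2v.txt")
cbow = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# words for which both source embeddings are available (the vocabulary V)
shared_vocab = set(glove.key_to_index) & set(cbow.key_to_index)
```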

Datasets
To evaluate the effectiveness of our proposed model TDAEME, we conduct extensive experiments on four English text classification datasets: 20NewsGroup (Lang, 1995), 5AbstractsGroup, IMDB (Maas et al., 2011), and TREC (Li and Roth, 2002). The statistics of the datasets are given in Table 1. We did not split off a validation set; see the Experimental Settings section for details.

Baseline Methods
We consider the following meta-embedding approaches as baselines: (1) Concatenation (CONC): Yin and Schütze (2015) propose that the concatenation of the source embeddings is an effective method for creating meta-embeddings.
(2) Averaging (AVG): Coates and Bollegala (2018) proposed averaging the source word embeddings of a word as a method for creating meta-embeddings without increasing the representation dimensionality.
(3) Original AEMEs: Bollegala and Bao (2018) proposed three autoencoder-based approaches, DAEME, CAEME, and AAEME, for learning meta-embeddings from multiple pre-trained source embeddings. We use the code released by the authors (https://github.com/CongBao/AutoencodedMetaEmbedding) in our experiments. (4) LLE: Bollegala et al. (2017) proposed an unsupervised locally linear method for learning meta-embeddings from a set of source embeddings. We use the code released by the authors (https://github.com/LivNLP/LLE-MetaEmbed) in our experiments.

Experimental Settings
We use the average of word embeddings to represent each document. We train a linear classifier using Liblinear (Fan et al., 2008) to test the classification performance of each embedding. Since the goal is to evaluate the embeddings, we did not tune the hyperparameters of the classifier on a validation set and simply evaluate test-set performance with the default hyperparameters. To train our proposed model TDAEME, we use a linear neural layer with the ReLU (Nair and Hinton, 2010) activation function as each encoder and a linear neural layer as each decoder. We employ Adam (Kingma and Ba, 2014) with mini-batches of size 128 and a learning rate of 0.001 as the optimizer. We also apply masking noise (Vincent et al., 2010) that randomly sets 0.05% of the input elements to zero. α is set to 1e-4. We manually tuned the hyperparameters of TDAEME according to its training loss L. The computing infrastructure we used is a PC with a GTX 980 Ti.
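The evaluation pipeline described above could be sketched as follows; `embeddings`, `train_docs`, `test_docs`, `y_train`, and `y_test` are assumed to be prepared beforehand, and scikit-learn's `LinearSVC` (which uses the liblinear backend) stands in for the Liblinear classifier:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def doc_vector(tokens, embeddings, dim):
    """Average the embeddings of a document's in-vocabulary tokens."""
    vecs = [embeddings[w] for w in tokens if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# embeddings: {word: meta-embedding}; train_docs / test_docs are lists of token lists
dim = len(next(iter(embeddings.values())))
X_train = np.stack([doc_vector(d, embeddings, dim) for d in train_docs])
X_test = np.stack([doc_vector(d, embeddings, dim) for d in test_docs])

clf = LinearSVC().fit(X_train, y_train)          # default hyperparameters, no validation set
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```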

Result
Overall Performance. We use accuracy as the metric in our experiments. Table 2 shows the evaluation results. Compared with the two general source embeddings, meta-embedding learning methods perform better in most cases, which demonstrates the effectiveness of meta-embedding methods. Moreover, meta-embedding methods that involve learning (i.e., the AEMEs) perform better than learning-free methods (i.e., CONC and AVG). Compared with the three original AEME models and LLE, our proposed TDAEME makes a further improvement on the text classification task, which demonstrates the effectiveness of the domain-specific and task-oriented information.
Ablation. The last five rows in Table 2 show the ablation results. In most cases, combining one more high-quality general word embedding does not harm performance. The results of the last two ablation methods indicate that both the domain embeddings and the adaption component provide a significant boost compared to the raw AEMEs. Moreover, TDAEME achieves the best results among all ablation methods. This indicates that the domain-specific and task-oriented information are beneficial to each other, and that our joint learning method can successfully model these two types of information.

Impact of Dimensionality
We also conducted an experiment on meta-embedding dimensionality. We investigate the performance of AAEME and TDAEME on the TREC dataset with 100, 200, and 300 meta-embedding dimensions, respectively. The results are shown in Figure 2. We find that TDAEME outperforms AAEME in all cases, and that TDAEME is less sensitive to dimension reduction than AAEME.

Compared with Contextualized Embeddings
Contextualized embeddings such as BERT and ELMo can outperform previous state-of-the-art models on multiple natural language understanding (NLU) benchmarks. We conduct an experiment to compare our TDAEME with ELMo (Peters et al., 2018). To make a fair comparison, we use ELMo to obtain sentence embeddings and perform classification with the same linear SVM classifier. Table 3 shows the results. We observe that TDAEME achieves competitive performance against the contextualized embeddings.

Conclusion
In this paper, we propose a meta-embedding learning approach called Task-oriented Domain-specific Autoencoded Meta-Embedding (TDAEME), which leverages task-oriented supervision to improve the combination of general and domain embeddings.
We conducted experiments on four text classification datasets and the results show the effectiveness of our proposed method.