Text Categorization by Learning Predominant Sense of Words as Auxiliary Task

Distributions of the senses of words are often highly skewed and are strongly influenced by the domain of a document. This paper follows this assumption and presents a method for text categorization that leverages the predominant sense of words depending on the domain, i.e., domain-specific senses. The key idea is that the features learned from predominant senses can discriminate the domain of the document and thus improve the overall performance of text categorization. We propose a multi-task learning framework based on the neural network model, transformer, which trains a model to simultaneously categorize documents and predict a predominant sense for each word. The experimental results using four benchmark datasets show that our method is comparable to the state-of-the-art categorization approach, and that our model works especially well for the categorization of multi-label documents.


Introduction
Text categorization has been intensively studied since neural network methods have attracted much attention. Most of the previous work on text categorization relies on representation learning where words are mapped to an implicit semantic space (Liu et al., 2017a). Word2Vec is a typical model for this representation (Mikolov et al., 2013). It learns a vector representation for each word and captures semantic information between words. Pre-training with this model improves overall performance in many NLP tasks including text categorization. However, the drawback of the implicit representation is that it often does not work well on polysemous words.
The sense of a word depends on the domain in which it is used, and the same word can be used differently in different domains. Distributions of the senses of words are often highly skewed, and the predominant sense of a word depends on the domain of a document (McCarthy et al., 2007; Jin et al., 2009). Consider the noun "court". Its predominant sense differs between documents from the "judge/law" and "sports" domains: in the former it would be "an assembly (including one or more judges) to conduct judicial business", and in the latter "a specially marked horizontal area within which a game is played", as described in WordNet 3.1. This indicates that the meaning is a strong clue for assigning a domain to the document. However, in the implicit semantic space created by a neural language model such as Word2Vec, a word is represented as one vector even if it has several senses.
It is often the case that a word which is polysemous in general is not polysemous in a restricted subject domain, so restricting the subject domain makes polysemy less problematic. However, even in texts from a restricted subject domain such as the Wall Street Journal corpus (Douglas and Janet, 1992), one encounters quite a large number of polysemous words. Several authors focused on the problem and proposed new types of deep contextualized word representations, such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018), that model not only syntax but also semantics including polysemy. These methods work very well in many NLP tasks such as question answering and sentiment analysis, but they are trained in an unsupervised manner and do not explicitly map each sense of a word to its domain. Motivated by this problem, we propose a method for text categorization that complements implicit representation by leveraging the predominant sense of a word.
We propose a multi-task learning method based on the encoder structure of the neural network model, transformer (Vaswani et al., 2017). The transformer relies on a self-attention mechanism. It can directly capture the relationship between two words regardless of their distance, which is effective for detecting features that discriminate the predominant sense of a word in the domain. In the multi-task model, the auxiliary predominant sense prediction task helps text categorization by learning a common feature representation of predominant senses for text categorization. The model adopts a multi-task objective function and is trained to simultaneously categorize texts and predict a predominant sense for each word. In this way, the predominant sense information can also help the model to learn better sense/document representations. The experimental results using four benchmark datasets support our conjecture that predominant sense identification helps to improve the overall performance of the text categorization task.
The main contributions of our work can be summarized as follows:
(1) We propose a method for text categorization that complements implicit representation by leveraging a predominant sense of a word.
(2) We introduce a multi-task learning framework based on the neural network model, transformer.
(3) We show that predominant sense identification helps to improve the overall performance of the text categorization task, and that our model is especially effective for the categorization of multi-label documents.

Text Categorization Framework
Our multi-task learning framework for predominant sense prediction and text categorization is illustrated in Figure 1.

Text Matrix by the Transformer Encoder
As shown in Figure 1, we use the transformer encoder to represent the text matrix (Vaswani et al., 2017). It is based on self-attention networks: each word is connected to every other word in the same sentence via self-attention, which makes it possible to obtain rich information for predicting domain-specific senses.
The encoder $e$ typically stacks six identical layers. Each layer uses multi-head attention and a two-layer feed-forward network, combined with layer normalization and residual connections. For each word within a sentence, including the word itself, the multi-head attention computes attention weights, i.e., a softmax distribution, as shown in Eq. (1).
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \tag{1}$$
The inputs are queries $Q$ and keys $K$ of dimension $d_k$, and values $V$ of dimension $d_v$; $\sqrt{d_k}$ is the scaling factor. The inputs are linearly projected $h$ times so that the model can jointly attend to information from different representations, and the results are concatenated. Let the output of multiHead($Q$, $K$, $V$) be $M_{attn}$. On top of the multi-head attention, there is a feed-forward network that consists of two layers with a ReLU activation. Each encoder layer takes the output of the previous layer as input, which allows it to attend to all positions of the previous layer. We obtain the output matrix $M_{trf}$ shown in Figure 1 as the output of the transformer encoder.
Figure 1: Multi-task Learning for Predominant Sense Prediction and Text Categorization: "make" and "bank" marked with red show the target words. "make%2:40:01::" and "bank%1:14:00::" show the sense indices obtained from WordNet 2.0 and indicate the predominant senses of "make" and "bank" in the economy domain, respectively.
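The attention weights of Eq. (1) can be illustrated with a minimal pure-Python sketch of scaled dot-product attention as described by Vaswani et al. (2017). It covers a single head without the learned linear projections, and the function names and toy matrices are illustrative assumptions rather than the authors' implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V.

    Q is n_q x d_k, K is n_k x d_k, V is n_k x d_v (lists of lists).
    Returns the attended vectors and the attention weights.
    """
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    d_v = len(V[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        outputs.append([sum(w_j * v[c] for w_j, v in zip(w, V)) for c in range(d_v)])
    return outputs, weights

# Toy example: two words, d_k = d_v = 2 (values chosen only for illustration).
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out, attn = scaled_dot_product_attention(Q, K, V)
```

Each row of `attn` is a distribution over all words in the sentence, which is what lets a target word draw on distant context words.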

Domain-Specific Sense Prediction
Each target word vector, i.e., the vector of a word which should be assigned a domain, is extracted from the matrix $M_{trf}$ and passed to the fully connected layer FC$_{dss}$. In Figure 1, "make" and "bank" denote the target words. The weight matrix of FC$_{dss}$ is denoted by $W_{dss} \in \mathbb{R}^{d_{model} \times d_{dss}}$, where $d_{dss}$ is the number of output dimensions, which equals the number of domain-specific senses over all of the target words. The predicted sense vector $y^{(dss)}$ is obtained by applying FC$_{dss}$ to the target word vector. We compute the loss function by using $y^{(dss)}$ and its true domain-specific sense vector $t^{(dss)}$, which is represented as a one-hot vector. The loss function is defined by Eq. (4).
$n$ refers to the minibatch size and $n_w$ shows the number of words in a document. $n_{dss}$ is the number of target words within the minibatch and $\theta$ refers to the parameters used in the network. $y_{iws}$ and $t_{iws}$ show the predicted value of the $s$-th domain-specific sense for the $w$-th target word in the $i$-th document within the minibatch and its true value (1 or 0), respectively. As shown in Figure 1, we obtain the text matrix $M_{dss}$ by replacing each target vector ("make" and "bank") in the matrix $M_{trf}$ with its domain-specific sense vector ("make%2:40:01::" and "bank%1:14:00::").
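The sense prediction step can be sketched as follows. This is a minimal illustration that assumes a softmax over the output of FC_dss and a cross-entropy loss averaged over target words; the exact forms of the prediction and loss equations are not recoverable from the text here, so `predict_sense`, `sense_loss`, and the toy weights are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def predict_sense(target_vec, W_dss):
    """Map a target word vector (d_model) to a distribution over d_dss senses.

    W_dss is d_model x d_dss. Applying softmax is an assumption made here.
    """
    d_dss = len(W_dss[0])
    logits = [sum(target_vec[i] * W_dss[i][s] for i in range(len(target_vec)))
              for s in range(d_dss)]
    return softmax(logits)

def sense_loss(predictions, targets):
    """Cross-entropy against one-hot targets, averaged over target words (assumed)."""
    total = 0.0
    for y, t in zip(predictions, targets):
        total -= sum(t_s * math.log(y_s) for y_s, t_s in zip(y, t) if t_s > 0)
    return total / len(predictions)

# Toy example: d_model = 2, three candidate domain-specific senses.
W_dss = [[2.0, 0.0, -1.0],
         [0.0, 1.0, 0.5]]
target_vec = [1.0, 0.5]
y = predict_sense(target_vec, W_dss)
t = [1, 0, 0]  # one-hot true sense
loss = sense_loss([y], [t])
```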

Text Categorization
We merged all the vectors of the matrix $M_{dss}$ per dimension and obtained one document vector $D_{sum}$. We passed it to the fully connected layer FC$_{tc}$. The number of dimensions $d_{tc}$ of the output vector obtained by FC$_{tc}$ equals the total number of domains. Let the prediction vector $y^{(tc)}$ be $W_{tc} \times D_{sum}$, where $W_{tc} \in \mathbb{R}^{d_{model} \times d_{tc}}$ indicates the weight matrix of FC$_{tc}$. We applied the softmax function for the single-label categorization task. Similarly, we used the sigmoid function $\sigma(x) = \frac{1}{1+e^{-x}}$ for the multi-label categorization problem. The training objective is to minimize the loss in Eq. (6), where Single-label and Multi-label denote the loss functions for single-label and multi-label prediction, respectively. $n$ refers to the minibatch size, $\theta$ shows the parameters used in the network, and $t^{(tc)}$ refers to the true domain vector of a document. In the single-label case, the domain whose probability score is the maximum is regarded as the predicted domain. When the test data is a multi-label problem, we set a threshold value $\lambda$, and domains whose probability score exceeds the threshold value are selected.
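The two decision rules can be sketched as follows, assuming that "merged per dimension" means an element-wise sum (as the name D_sum suggests). The helper names, the toy weights, and the threshold values are illustrative assumptions, not settings reported in the paper.

```python
import math

def sum_pool(M_dss):
    """Element-wise sum of all word vectors: one document vector D_sum."""
    return [sum(col) for col in zip(*M_dss)]

def fc(vec, W):
    """Fully connected layer without bias; W is d_model x d_tc."""
    return [sum(vec[i] * W[i][c] for i in range(len(vec))) for c in range(len(W[0]))]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_single(M_dss, W_tc):
    """Single-label: pick the domain with the maximum softmax probability."""
    probs = softmax(fc(sum_pool(M_dss), W_tc))
    return max(range(len(probs)), key=probs.__getitem__)

def predict_multi(M_dss, W_tc, lam):
    """Multi-label: keep every domain whose sigmoid score exceeds the threshold."""
    probs = [sigmoid(z) for z in fc(sum_pool(M_dss), W_tc)]
    return [c for c, p in enumerate(probs) if p > lam]

# Toy example: two word vectors (d_model = 2) and three domains.
M_dss = [[1.0, 0.0],
         [0.5, -1.0]]
W_tc = [[1.0, -1.0, 0.5],
        [0.0, 1.0, 1.0]]
single = predict_single(M_dss, W_tc)
multi_strict = predict_multi(M_dss, W_tc, lam=0.5)
multi_loose = predict_multi(M_dss, W_tc, lam=0.4)
```

Lowering the threshold admits more domains per document, so it trades precision for recall in the multi-label setting.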

Multi-task Learning
We assume that the auxiliary predominant sense prediction task helps the text categorization task by learning a common feature representation of predominant senses for text categorization. The model adopts the multi-task objective function shown in Eq. (7). It is trained to simultaneously categorize texts and predict a predominant sense for each word.
$\theta^{(sh)}$ in Eq. (7) refers to the parameters shared by the two tasks. $\theta^{(dss)}$ and $\theta^{(tc)}$ stand for the parameters estimated in domain-specific sense prediction and in text categorization, respectively. Given a corpus, the parameters of the network are trained to minimize the value obtained by Eq. (7).
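One plausible reading of the multi-task objective, written under the assumption that the two task losses are simply added without weighting (no weighting is stated in the text), is given below. The symbols $\mathcal{L}_{dss}$ and $\mathcal{L}_{tc}$ are shorthand introduced here for the domain-specific sense loss of Eq. (4) and the text categorization loss of Eq. (6).

```latex
\mathcal{L}(\theta)
  = \mathcal{L}_{dss}\left(\theta^{(sh)}, \theta^{(dss)}\right)
  + \mathcal{L}_{tc}\left(\theta^{(sh)}, \theta^{(tc)}\right)
```

Under this reading, gradients from both tasks flow into the shared parameters $\theta^{(sh)}$, while $\theta^{(dss)}$ and $\theta^{(tc)}$ are each updated by a single task.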

Dataset
We performed the experiments on four benchmark datasets having domains to evaluate the properties [...]. The data for domain-specific sense prediction is based on the senses provided by the all-words task in SensEval-2 (Palmer et al., 2001) and SensEval-3 (Snyder and Palmer, 2004). Magnini et al. (Magnini and Cavaglia, 2000; Magnini et al., 2002) created a lexical resource where WordNet 2.0 synsets were annotated with Subject Field Codes (SFC). In particular, 96% of WordNet synsets for nouns are annotated. We assigned each domain described in their SFC list to the senses of the all-words task in the SensEval-2 and SensEval-3 data. Moreover, we assigned SFC labels to the four benchmark datasets having domains. The SFC consists of 115,424 words assigned 168 domain labels, which include some of the four datasets' domains. We manually mapped these domains to SFC labels, as shown in Tables 1, 2, 3, and 4.
The dataset statistics are summarized in Table 5 and examples of domain-specific sense-tagged [...] Aug 20th, 1996, to Aug 19th, 1997. RCV1 is much larger than the other three datasets. We thus reserved eight months of the RCV1 data to learn the word-embedding model. The model is also used for the other three datasets because they are of the same genre as RCV1, news stories. We divided the remaining data into three parts. The division is the same as for the other three datasets: we reserved 60% of the data to train the models, 20% for tuning hyperparameters, and the remaining 20% to test the models. All the documents are tagged by using the Stanford CoreNLP Toolkit (Manning et al., 2014).
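The 60/20/20 division can be sketched as below. The text does not say how documents were assigned to each part, so the seeded random shuffle and the function name `split_60_20_20` are assumptions made only for illustration.

```python
import random

def split_60_20_20(documents, seed=0):
    """Split documents into train (60%), tuning (20%), and test (remaining 20%).

    The random shuffle is an assumption; the paper does not describe the
    assignment procedure.
    """
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    # Integer arithmetic avoids floating-point rounding of the boundaries.
    n_train = len(docs) * 6 // 10
    n_dev = len(docs) * 2 // 10
    train = docs[:n_train]
    dev = docs[n_train:n_train + n_dev]
    test = docs[n_train + n_dev:]
    return train, dev, test

train, dev, test = split_60_20_20(range(10))
```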

Baselines
We compared our method to three baseline methods: (i) TRF-Single, text categorization based on the transformer but without domain-specific sense prediction, (ii) TRF-Sequential, a method that first predicts domain-specific senses and then classifies documents by using the result, and (iii) TRF-Delay-Multi, a model that first learns the predominant sense model until it becomes stable and then learns text categorization simultaneously. The last is a mix of TRF-Sequential, with fully separated training, and TRF-Multi, with fully simultaneous training. For multi-label text categorization using the RCV1 data, we chose XML-CNN as a baseline method because it is simple but powerful and attained the best or second-best results compared to seven existing methods including Bow-CNN (Johnson and Zhang, 2015) on six benchmark datasets where the label-set sizes are up to 670K (Liu et al., 2017a). The original XML-CNN is implemented in Theano, while we implemented our method in Chainer. To avoid the influence of the difference in libraries, we implemented XML-CNN in Chainer and used it as a baseline, following the author-provided implementation. To make a fair comparison, we used fastText (Joulin et al., 2017) as the word-embedding tool for all of the methods.
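The difference between the transformer-based training strategies can be summarized by which loss terms are optimized at each epoch. The sketch below is a schematic reading of the descriptions above, not the authors' code: the function `active_losses`, the labels "dss" and "tc", the fixed switch epoch `ep` (standing in for "until stable"), and the use of the same switch point for TRF-Sequential are all assumptions.

```python
def active_losses(strategy, epoch, ep=75):
    """Return the loss terms optimized at a given epoch (0-based).

    "dss" is the domain-specific sense loss and "tc" the text categorization
    loss. `ep` is a hypothetical switch epoch for the staged strategies.
    """
    if strategy == "TRF-Single":
        return ("tc",)                      # categorization only
    if strategy == "TRF-Multi":
        return ("dss", "tc")                # both tasks from the start
    if strategy == "TRF-Sequential":
        # Fully separated: senses first, then categorization on the result.
        return ("dss",) if epoch < ep else ("tc",)
    if strategy == "TRF-Delay-Multi":
        # Senses first, then both tasks simultaneously.
        return ("dss",) if epoch < ep else ("dss", "tc")
    raise ValueError(f"unknown strategy: {strategy}")

schedule = {s: (active_losses(s, 0), active_losses(s, 100))
            for s in ("TRF-Single", "TRF-Sequential", "TRF-Delay-Multi", "TRF-Multi")}
```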

Model settings and evaluation metrics
The hyperparameters which are commonly used in all of the methods and the hyperparameters estimated for each method are shown in Tables 7 and 8, respectively. These hyperparameters are optimized by using a hyperparameter optimization framework called Optuna. They were independently determined for each dataset. In the experiments, we ran each model five times and obtained the averaged performance. We used standard recall, precision, and F1 measures. We further computed Macro-averaged F1 and Micro-averaged F1 and used them throughout the experiments.
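For reference, the two averaged measures can be computed as in the following sketch, which uses the standard definitions: Micro-F1 pools true positives, false positives, and false negatives over all domains, while Macro-F1 averages the per-domain F1 scores. The function names and the toy labels are illustrative only.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(gold, pred, labels):
    """gold and pred are lists of label sets, one set per document."""
    counts = {label: [0, 0, 0] for label in labels}  # tp, fp, fn
    for g, p in zip(gold, pred):
        for label in labels:
            if label in p and label in g:
                counts[label][0] += 1
            elif label in p:
                counts[label][1] += 1
            elif label in g:
                counts[label][2] += 1
    macro = sum(f1(*counts[label]) for label in labels) / len(labels)
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1(tp, fp, fn)
    return micro, macro

# Toy multi-label example with three domains.
labels = ["sports", "economy", "politics"]
gold = [{"sports"}, {"economy"}, {"economy", "politics"}]
pred = [{"sports"}, {"sports"}, {"economy"}]
micro, macro = micro_macro_f1(gold, pred, labels)
```

Because Macro-F1 weights every domain equally, a domain that is never predicted (here "politics") lowers it more than it lowers Micro-F1.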

Results
The performance of all methods in Micro-averaged F1 and Macro-averaged F1 on the four datasets is summarized in Tables 9 and 10, respectively. Overall, both Micro and Macro-averaged F1 obtained by each method were very high except for the RCV1 data, because the other datasets consist of at most five domains and are single-label problems. The Micro and Macro-F1 obtained by TRF-Single were better than those obtained by XML-CNN except for the APW corpus. This shows that text categorization based on the encoder of the transformer is effective. Sequential learning does not work well for text categorization: the average Macro-F1 obtained by TRF-Sequential (89.41%) was slightly worse than that of TRF-Single (89.74%), while the Micro-averaged F1 obtained by TRF-Sequential (90.02%) was slightly better than that of TRF-Single (89.89%). TRF-Delay-Multi was worse than TRF-Sequential. In particular, as shown in Tables 9 and 10, the results on RCV1 were worse than those of TRF-Single. One possible reason for this result is that predominant sense identification is a more difficult task than text categorization. As shown in Table 5 [...]
Table 8: Model settings for each method: "TRF-Seq." and "TRF-Delay" show TRF-Sequential and TRF-Delay-Multi, respectively. "fr" refers to filter region and "f" shows Filters. "wd" indicates Weight Decay. "h" shows multi-attention layers and "e" is a stack of encoders. "ep" refers to the number of epochs in the predominant sense prediction used in the baseline (iii). For instance, 75 indicates that we run the predominant sense prediction task until 75 epochs, and then run multi-task learning.
Overall, the results obtained by TRF-Multi were the best among the methods in both Micro and Macro-averaged F1. This indicates that the predominant sense information obtained through multi-task learning can help the model to learn better sense/document representations.
On RCV1, the overall performance of each method was worse than on the other data because the categorization task is more difficult, i.e., it is a multi-label problem. However, TRF-Multi is still better than the other methods. The improvement was 1.49% ∼ 9.49% by Micro-F1 and 1.79% ∼ 15.23% by Macro-F1. Tables 11 and 12 show the Micro and Macro-F1 of predominant sense prediction, respectively. The overall performance of multi-task learning was better than that of TRF-Seq. (TRF-Delay) by both measures except for Micro-F1 on the AG data. This confirms our conjecture that training the model to simultaneously categorize texts and predict domain-specific senses is effective for sense prediction.

Figures 2 and 3 show Micro and Macro-F1 against the number of epochs on each of the four datasets. As we can see from these figures, on the 20News and AG corpora, all models except XML-CNN show similar learning stability in both the Micro and Macro-F1 curves. On RCV1, we have the same observation for Micro-F1 except for TRF-Delay-Multi, and there is no significant difference in stability between TRF-Multi and TRF-Sequential for Macro-F1. On APW, TRF-Multi is similar to XML-CNN in that both are stable after 60 epochs. In summary, TRF-Multi is more stable across the datasets and in both measures.
We also examined how the ratio of predominant-sense tagged training data affects categorization performance. For each domain and each predominant sense, we counted the total number of documents and sampled 5% to 80% of the training documents. The results by Micro and Macro-F1 are illustrated in Figures 4 and 5, respectively.
Except on 20News and for TRF-Delay-Multi on RCV1, the Micro-F1 values do not differ significantly among methods, and performance is maintained until the ratio of training data decreases to 40%. Similarly, when the ratio is larger than 20%, the Macro-F1 values on APW and AG obtained by all the methods except XML-CNN do not differ significantly. The Micro and Macro-F1 curves on 20News and the Macro-F1 curve on RCV1 show that more training data helps performance. This is reasonable because the average number of training documents per domain on 20News is 3,409, which is far smaller than in the other datasets. In addition, RCV1 is a multi-label problem.
The curves obtained by TRF-Multi drop slowly compared to the other methods, and it keeps the best performance by both evaluation measures even at the ratio of 5%. From the observations, [...]

Related Work
Deep learning techniques have been very successful at automatically extracting context-sensitive features from a textual corpus. Many authors have attempted to apply deep learning methods including CNN (Kim, 2014; Zhang and Wallace, 2015; Wang et al., 2017), the attention-based CNN (Yang et al., 2016), bag-of-words based CNN (Johnson and Zhang, 2015), and the combination of CNN and recurrent neural network (RNN) (Zhang et al., 2016) to text categorization. Most of these approaches demonstrated that neural network models are powerful for learning effective features from textual input. However, when learning word vectors, most of them allow only a single context-independent representation for each word even if it has several senses. Peters et al. addressed the issue and proposed a model of deep contextualized word representation called ELMo derived from a bidirectional LSTM (Peters et al., 2018). They reported that their representation model significantly improves the state-of-the-art across six NLP problems. Similarly, Devlin et al. proposed a model of deep contextualized word representation called BERT that can deal with syntax and semantics including polysemy (Devlin et al., 2018). These methods attained remarkable results in many NLP tasks. However, they do not explicitly map each sense of a word to its domain because they are trained in an unsupervised manner. Moreover, their models need a large corpus, which leads to a heavy computational workload. Our model utilizes existing domain-specific senses (Magnini and Cavaglia, 2000; Magnini et al., 2002) as rough but explicit word representation data. It enables us to learn feature representations for both predominant senses and text categorization with a small amount of data.
Similar to the text categorization task, the recent upsurge of deep learning techniques has also contributed to improving the overall performance on Word Sense Disambiguation (WSD) (Yuan et al., 2016; Raganato et al., 2017; Peters et al., 2018). Melamud et al. proposed a method called Context2Vec which learns each sense annotation in the training data by using a bidirectional LSTM trained on an unlabeled corpus (Melamud et al., 2016). More recently, Vaswani et al. introduced the first full-attentional architecture called Transformer. It utilizes only the self-attention mechanism and demonstrated its effectiveness on neural machine translation.
Since then, the transformer has been successfully applied to many NLP tasks including semantic role labeling (Strubell et al., 2018) and sentiment analysis (Ambartsoumian and Popowich, 2018). To the best of our knowledge, this is the first approach for predicting domain-specific senses based on a transformer that is trained with multi-task learning.
In the context of predominant sense prediction, several authors have attempted to use domain-specific knowledge to disambiguate senses and showed that such knowledge outperforms generic supervised WSD (Agirre and Soroa, 2009; Faralli and Navigli, 2012; Taghipour and Ng, 2015). McCarthy et al. proposed a statistical method for assigning predominant noun senses (McCarthy et al., 2004, 2007). They find words with a similar distribution to the target word from parsed data. They tested 38 words from the two domains of Sports and Finance in the Reuters corpus (Rose et al., 2002). Similarly, Lau et al. (2014) proposed a fully unsupervised topic modeling-based approach to sense frequency estimation. Faralli and Navigli (2012) attempted to perform domain-driven WSD by a pattern-based method with a minimally-supervised framework. While conceptually similar, our model differs from these approaches in that it uses supervised learning, adopting existing domain-specific sense tags for creating the data.
In the context of multi-task learning, many authors have attempted to apply it to NLP tasks (Collobert and Weston, 2008; Glorot et al., 2011; Liu et al., 2016). Liu et al. proposed adversarial multi-task learning, which prevents the shared and private latent feature spaces from interfering with each other (Liu et al., 2017b). Xiao et al. proposed a multi-task CNN which introduces a gate mechanism to reduce the interference (Xiao et al., 2018). They reported that their approach can learn selection rules automatically and gains a great improvement over baselines in experiments on nine text categorization datasets. Both of them focused only on text categorization tasks in the multi-task setting and used word embeddings initialized with Word2Vec or GloVe vectors. Aiming at text categorization with relatively small amounts of training data, we demonstrated that the predominant sense of a word is effective for text categorization in a multi-task learning framework that combines domain-specific sense identification and text categorization. This enabled us to obtain better explicit feature representations to classify documents.

Conclusion
We have presented an approach to text categorization that leverages the predominant sense of a word depending on the domain. We empirically showed that predominant sense identification helps to improve the overall performance of text categorization in the framework of multi-task learning. The comparative results with the baselines showed that our model is competitive, as the improvement was 1.49% ∼ 9.49% by Micro-F1 and 1.79% ∼ 15.23% by Macro-F1. Moreover, we found that our model works especially well for the categorization of multi-label documents.
Future work will include: (i) incorporating lexical semantics such as named entities for further improvement, (ii) comparing our model to other deep contextualized word representations such as ELMo and BERT, and (iii) applying the method to other domains for quantitative evaluation.