Enhancing Local Feature Extraction with Global Representation for Neural Text Classification

For text classification, traditional local feature driven models learn long-range dependencies by deep stacking or hybrid modeling. This paper proposes a novel Encoder1-Encoder2 architecture, in which global information is incorporated into local feature extraction from the very beginning. In particular, Encoder1 serves as a global information provider, while Encoder2 acts as a local feature extractor whose output is fed directly into the classifier. Two modes are also designed for their interaction. Thanks to its awareness of global information, our method learns better instance-specific local features and thus avoids complicated upper-layer operations. Experiments conducted on eight benchmark datasets demonstrate that our proposed architecture improves local feature driven models by a substantial margin and outperforms the previous best models in the fully-supervised setting.


Introduction
Text classification is a fundamental task in natural language processing and is widely used in applications such as spam detection, sentiment analysis and topic classification. One of the mainstream approaches first utilizes explicit local extractors to identify key local patterns and then classifies based on them. In this paper, we refer to this line of research as local feature driven models.
Many proposed methods fall into this category. N-grams have traditionally been exploited in statistical machine learning approaches (Pang et al., 2002; Wang and Manning, 2012). For deep neural networks, encoding local features into low-dimensional distributed n-gram embeddings (Joulin et al., 2016; Qiao et al., 2018) and simply bagging them has proven effective and highly efficient. Convolutional Neural Networks (CNN) (LeCun et al., 2010) are promising methods owing to their strong capacity for capturing local invariant regularities (Kim, 2014). More recently, Wang (2018) proposes the Disconnected Recurrent Neural Network (DRNN), which utilizes RNN to extract local features for larger windows and has reported the best results on several benchmarks.
Table 1: Case1: Apple is really amazing! I am fed up to carry my clunky camera. Case2: Apple is famous around world and deserves to be called "nutritional powerhouses".
* These authors contributed equally to this work. † This work was done while the author was an intern at Baidu Inc.
Despite good interpretability and remarkable performance, current local feature extraction still has one shortcoming. As shown in Table 1, the real meaning of Apple can only be recognized correctly from an overall view rather than a narrow window. If the local extractor in charge of Apple cannot receive camera and nutritional from the very beginning, complicated and costly upper structures are required to revise the imprecise local representation and create newer high-level features, such as deep stacking (Johnson and Zhang, 2017; Conneau et al., 2016) and hybrid integration (Xiao and Cho, 2016). To a certain extent, this is inefficient and hard to train, especially when the corpus is insufficient.
To address this issue, we believe a more efficient approach is to optimize the local extraction process directly. In this paper, we propose a novel architecture named Encoder1-Encoder2¹, which contains two encoders for the same input sequence instead of the single encoder used in previous work. Concretely, Encoder1 can be any kind of neural network model designed to briefly grasp the global background, while Encoder2 should be a typical local feature driven model. The key point is that the global representations generated earlier by Encoder1 are then incorporated into the local extraction procedure of Encoder2. In this way, local extractors can perceive more long-range information on top of their natural advantages. As a result, better instance-specific local features can be captured and directly utilized for classification owing to global awareness, which means that further complicated upper-layer operations can be avoided.
We conduct experiments on eight public text classification datasets introduced by Zhang et al. (2015).
The experimental results show that our proposed architecture improves local feature driven models by a substantial margin. In fully-supervised settings, our best models achieve new state-of-the-art performance on all benchmark datasets. We further demonstrate the ability and generalization of our architecture in the semi-supervised domain.
Our contributions can be summarized as follows:
1. We propose a novel Encoder1-Encoder2 architecture, where better instance-specific local features are captured by incorporating global representations into the local extraction procedure.
2. Our architecture has great flexibility. Different combinations of Encoder1, Encoder2 and Interaction Modes are studied, and every combination improves vanilla CNN or DRNN significantly.
3. Our architecture is more robust to the window size of local extractors and the corpus scale.

Related Work
Local Feature Driven Models FastText uses bag of n-grams embeddings as text representation (Joulin et al., 2016), which has been proved effective and efficient. Qiao et al. (2018) propose a new method of learning and utilizing task specific n-grams embeddings to conquer data sparsity.
CNN (LeCun et al., 2010) are representative methods of this category. Convolution operators are applied at every window-based location to extract local features, interleaved with pooling layers for capturing invariant regularities. Since Kim (2014), CNN have been widely used in text classification. In addition to shallow structures, very deep and more complex CNN based models have also been studied to establish long-distance associations. Examples are deep character-level CNNs (Zhang et al., 2015; Conneau et al., 2016), deep pyramid CNN (Johnson and Zhang, 2017) and convolution-recurrent networks (Xiao and Cho, 2016), in which recurrent layers are placed on top of convolutional layers to learn long-term dependencies between local features.
¹ Our code will be available at https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/Research/EMNLP2019-GELE. "GELE" is the abbreviation for Global Encoder and Local Encoder, i.e., Encoder1 and Encoder2 respectively.
CNN use simple linear operations on the n-gram vectors of each window, which inspires researchers to capture higher-order non-linear local features using RNN. Convolution filters were first replaced with LSTM for query classification. Wang (2018) proposes DRNN, which exploits large window sizes equipped with GRU.
To make full use of local and global information, Zhao et al. (2018) propose a sandwich network that places a CNN between two LSTM layers, where the output of the CNN provides local semantic representations while the top LSTM supplies global structure representations. However, the global information they mainly focus on is syntactic, and it is produced by reorganizing the already obtained local features. Besides, both representations are directly used for final classification, whereas we use pre-acquired global representations to help capture better local features. To the best of our knowledge, we are the first to incorporate global representation into the extraction procedure of local features for text classification.
Other Neural Network Models Recurrent Neural Networks (RNN) are naturally good at modeling variable-length sequential data and capturing long-term dependencies (Hochreiter and Schmidhuber, 1997; Chung et al., 2014). Global features are encoded by semantically synthesizing each word in the sequence in turn, and there is no explicit small-region feature extraction procedure in this process. Lai et al. (2015) equip RNN with max-pooling to tackle the bias problem where later words are more dominant than earlier words. Tang et al. (2015) utilize LSTM to encode the semantics of sentences and their relations in document representation. Tai et al. (2015) introduce a tree-structured LSTM for sentiment classification. The attention mechanism has achieved great success in machine translation (Vaswani et al., 2017). For text classification, which has only a single input sequence, attention based models mainly focus on applying the attention mechanism on top of CNN or RNN to select the more important information (Yang et al., 2016; Er et al., 2016). Letarte et al. (2018) and Shen et al. (2018) also explore self-attention networks, which are CNN/RNN-free.

Figure 1: The Encoder1-Encoder2 architecture mainly contains three components. (1) Encoder1 serves as a global information provider. (2) Encoder2 is a local feature driven model whose output is directly fed into the classifier. (3) Mode is the interaction manner between them. S and A are abbreviations of SAME and ATTEND respectively.

Overview
In this paper, we propose a novel neural network architecture named Encoder1-Encoder2 for text classification, which is illustrated in Figure 1. The same input sequence is encoded twice, once by each encoder, but only the output of Encoder2 is fed directly to the classifier. In particular, Encoder1 serves as a pioneer that provides global information, while Encoder2 focuses on extracting better local features by incorporating that information into the local extraction procedure. Besides, two Interaction Modes are developed for more targeted absorption of global information.

Encoder1: Global Information Provider
Without loss of generality, we introduce three types of models for Encoder1 in our architecture, each of which can be an independent global information provider and they are compared in our experiments.
CNN Let $x_t$ be the d-dimensional word vector corresponding to the t-th word in a sequence of length n, and let $x_{t-h+1:t}$ denote the concatenation of the words $x_{t-h+1}, x_{t-h+2}, \ldots, x_t$ in a window of size h. A total of k filters are applied to the input sequence to generate features. Formally, filters $W_f$ are applied to window $x_{t-h+1:t}$ to compute $h_t$:
$$h_t = \mathrm{Conv}(x_{t-h+1}, x_{t-h+2}, \ldots, x_t) \quad (1)$$
With same padding, the filters are applied to n possible windows in the sequence, and the global representation can be represented as $enc^1$:
$$enc^1 = [h_1, h_2, \ldots, h_n]$$
GRU Gated recurrent units (GRU) are a gating mechanism in RNN. Two types of gates are used in GRU: the update gate decides how much new information is written into the hidden state, while the reset gate controls how much of the previous information flows in. The hidden state $h_t$ is computed iteratively based on $h_{t-1}$ and $x_t$. As a result, all previous information can be encoded. To save space, we abbreviate it as:
$$h_t = \mathrm{GRU}(x_1, x_2, \ldots, x_t)$$
The global representation produced by GRU consists of the hidden states of all time steps:
$$enc^1 = [h_1, h_2, \ldots, h_n]$$
Attention We also introduce an attention mechanism on top of GRU to enhance valuable information, following Zhou et al. (2016). A context vector $u_w$, which is randomly initialized and learned during training, measures the importance of each hidden state $h_t$ of the GRU. A normalized importance weight $\alpha_t$ is obtained through a softmax function over the time steps. The global representation produced by this attention mechanism is expressed as:
$$enc^1 = [\alpha_1 h_1, \alpha_2 h_2, \ldots, \alpha_n h_n]$$
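As a rough illustration of the CNN and attention variants of Encoder1, the following sketch computes one feature vector per time step using plain Python lists. Several details are assumptions made only for this example and are not taken from the paper: same padding is realised as h-1 zero vectors on the left, the convolution uses a ReLU non-linearity, the attention score is a plain dot product with $u_w$, and all function names are hypothetical.

```python
import math


def conv_encoder(words, filters, biases, h):
    """Apply k filters to every window x_{t-h+1:t}; returns n feature vectors of size k."""
    d = len(words[0])
    # Assumption: "same padding" means h-1 zero vectors on the left,
    # so window t ends at word t and there are exactly n windows.
    padded = [[0.0] * d for _ in range(h - 1)] + words
    enc1 = []
    for t in range(len(words)):
        window = [v for word in padded[t:t + h] for v in word]  # concatenated window
        enc1.append([
            max(0.0, sum(w * v for w, v in zip(f, window)) + b)  # assumed ReLU
            for f, b in zip(filters, biases)
        ])
    return enc1


def attention_weights(hidden, u_w):
    """Softmax-normalised importance of each hidden state (assumed dot-product score)."""
    scores = [sum(u * v for u, v in zip(u_w, h_t)) for h_t in hidden]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_encoder(hidden, u_w):
    """Scale every hidden state by its weight, keeping one vector per time step."""
    alpha = attention_weights(hidden, u_w)
    return [[a * v for v in h_t] for a, h_t in zip(alpha, hidden)]


words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]            # n = 3 words, d = 2
enc1_cnn = conv_encoder(words, filters=[[1.0] * 4], biases=[0.0], h=2)
hidden = [[1.0], [2.0], [3.0]]                           # stand-in for GRU hidden states
alpha = attention_weights(hidden, u_w=[1.0])
enc1_att = attention_encoder(hidden, u_w=[1.0])
```

In both cases the result keeps one vector per position, which is what allows a later max-over-time pooling or attention step to be applied to $enc^1$.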

Encoder2: Variant Local Extractor
A vanilla local feature extractor strictly focuses on a region of limited size. Here we propose a variant. Apart from the expected local context, the global information distilled by Encoder1 is also absorbed by the local extractor. In this way, the local features extracted by Encoder2 are aware of the global background while still preserving position-invariant local patterns. For Encoder2, we introduce two kinds of local feature driven models, i.e., CNN and DRNN. The former is good at capturing local spatial structure, while the latter excels at capturing local temporal patterns. Let $g_t$ denote the global information required for the window at position t, which is introduced in detail in Section 3.4.
CNN Here we treat each $g_t \in \mathbb{R}^d$ as a fake extra global word and convolve it together with the words in the window. Based on Equation 1, the features produced by the filters for window $x_{t-h+1:t}$ can be represented as:
$$h_t = \mathrm{Conv}(g_t, x_{t-h+1}, x_{t-h+2}, \ldots, x_t)$$
DRNN Different from CNN, DRNN utilizes RNN to extract local features for each window (Wang, 2018). To introduce global information into DRNN, the fake global word $g_t$ is placed at the head of each window, as in the CNN case. Because of the sequential nature of RNN, even for a limited window, global information is encoded into the RNN from the first step and informs the processing of the subsequent words.
Here we use GRU as the local feature extractor, and the features produced for window $x_{t-h+1:t}$ can be represented as:
$$h_t = \mathrm{GRU}(g_t, x_{t-h+1}, x_{t-h+2}, \ldots, x_t)$$
To maintain translation invariance, a max-over-time pooling layer is then applied to the CNN or DRNN layer, and the pooling result is regarded as the output of Encoder2:
$$enc^2 = \mathrm{maxpool}([h_1, h_2, \ldots, h_n])$$
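The CNN variant of Encoder2 can be sketched as follows: the global vector $g_t$ is prepended to each window as an extra word before the filters are applied, and max-over-time pooling produces the final text representation. This is a minimal sketch under assumptions that are not stated in the paper (left zero padding, ReLU activation, hypothetical function names); the DRNN variant would replace the filter application with a GRU run over the same extended window.

```python
def local_cnn_with_global(words, g, filters, biases, h):
    """words: n vectors of size d; g: n global vectors g_t of size d.
    filters: k vectors of size (h + 1) * d, covering g_t plus h window words."""
    d = len(words[0])
    padded = [[0.0] * d for _ in range(h - 1)] + words   # assumed left zero padding
    features = []
    for t in range(len(words)):
        window = [g[t]] + padded[t:t + h]                # fake global word at the head
        flat = [v for word in window for v in word]
        features.append([
            max(0.0, sum(w * v for w, v in zip(f, flat)) + b)  # assumed ReLU
            for f, b in zip(filters, biases)
        ])
    return features


def max_over_time(features):
    """Keep the largest activation of every filter across all positions."""
    return [max(column) for column in zip(*features)]


words = [[1.0], [2.0], [3.0]]            # n = 3 words, d = 1
g = [[10.0]] * 3                         # same global vector for every window
filters = [[1.0, 1.0, 1.0],              # this filter reads the global word
           [0.0, 1.0, 1.0]]              # this filter ignores it
features = local_cnn_with_global(words, g, filters, biases=[0.0, 0.0], h=2)
enc2 = max_over_time(features)
```

The second filter shows that a filter is free to ignore the global slot, in which case the extractor falls back to vanilla local behaviour.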

Interaction Modes between Encoders
Let $enc^1$ be the global representation produced by Encoder1. The information required for a window $x_{t-h+1:t}$ of size h is defined as $g_t$:
$$g_t = G(enc^1, x_{t-h+1:t})$$
where G is a function determined by the interaction mode. Two modes are devised from different points of view.
SAME Treat $enc^1$ as a "reference book" provided by Encoder1. The basic idea of the SAME mode is that every window in Encoder2 receives the same guidance regardless of its local information. For this purpose, max-over-time pooling is applied to $enc^1$ directly to extract the most important information:
$$\hat{g}_t = \mathrm{maxpool}(enc^1)$$
ATTEND The ATTEND mode utilizes global information from another perspective. Depending on the local context, the guidance from Encoder1 can be more targeted. Specifically, we use an attention mechanism. For a window $x_{t-h+1:t}$ of size h, the context vector is the average pooling of the local word embeddings, and the importance weight of each hidden state in Encoder1 is computed from this context vector through a softmax function. To maximize the benefit obtained from Encoder1, we concatenate the max-pooling result and the attention result to form $\hat{g}_t$ in ATTEND mode. Finally, to keep the dimension consistent with the words in the text, we compress $\hat{g}_t$ using an MLP and denote the result as $g_t$, which can be easily embedded into the local feature extraction in Encoder2.
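The two interaction modes can be sketched as below. This is an illustrative sketch, not the authors' implementation: it assumes that the hidden states of Encoder1 have the same dimension as the word embeddings so that a dot product can serve as the attention score, that the attention result is the weighted sum of the hidden states, and that the MLP is a single tanh layer. All function names are hypothetical.

```python
import math


def max_over_time(enc1):
    """Largest value of every dimension across all time steps."""
    return [max(column) for column in zip(*enc1)]


def same_mode(enc1):
    """SAME: every window receives the same max-pooled summary of enc1."""
    return max_over_time(enc1)


def attend_mode(enc1, window_words):
    """ATTEND: concatenate the max-pooled summary with a window-specific attention result."""
    h = len(window_words)
    context = [sum(column) / h for column in zip(*window_words)]  # average pooling
    # Assumption: dot-product score between the local context and each hidden state.
    scores = [sum(c * v for c, v in zip(context, h_i)) for h_i in enc1]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    # Assumption: the attention result is the alpha-weighted sum of hidden states.
    attended = [sum(a * h_i[j] for a, h_i in zip(alpha, enc1))
                for j in range(len(enc1[0]))]
    return max_over_time(enc1) + attended                # list concatenation


def compress(g_hat, weights, biases):
    """Assumed one-layer tanh MLP mapping g_hat back to the word dimension d."""
    return [math.tanh(sum(w * v for w, v in zip(row, g_hat)) + b)
            for row, b in zip(weights, biases)]


enc1 = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]              # n = 3 states of size 2
g_hat_same = same_mode(enc1)
g_hat_attend = attend_mode(enc1, window_words=[[1.0, 0.0], [1.0, 0.0]])
g_t = compress(g_hat_attend, weights=[[0.1] * 4, [-0.1] * 4], biases=[0.0, 0.0])
```

Note that in ATTEND mode $\hat{g}_t$ is twice as wide as in SAME mode, which is one reason a compression step back to the word dimension is convenient.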

Classification Layer
After incorporating the global information obtained from Encoder1 into the local feature extraction of Encoder2, the output vector of the latter can be regarded as the representation of the entire text. This vector is then fed into a softmax classifier to predict the probability of each category, and cross entropy is used as the loss function:
$$L = -\sum_i y_i \log \hat{y}_i$$
where $\hat{y}_i$ is the predicted probability and $y_i$ is the true probability of class i.
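For concreteness, the classification layer can be sketched as a linear projection of the Encoder2 output followed by softmax, with cross entropy computed against a one-hot label. The weight matrix and bias used here are hypothetical placeholders; the paper does not specify their shape or initialization.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(enc2, weights, biases):
    """Linear layer followed by softmax over the classes."""
    logits = [sum(w * v for w, v in zip(row, enc2)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)


def cross_entropy(y_true, y_pred):
    """-sum_i y_i * log(y_hat_i); classes with zero true probability are skipped."""
    return -sum(y * math.log(p) for y, p in zip(y_true, y_pred) if y > 0)


enc2 = [1.0, 2.0]                                   # output of Encoder2
y_hat = classify(enc2, weights=[[1.0, 0.0], [0.0, 1.0]], biases=[0.0, 0.0])
loss = cross_entropy([0.0, 1.0], y_hat)             # true class is the second one
```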

Experiments
We report experiments with proposed models in comparison with previous methods.

Experiments Settings
Datasets Publicly available datasets from Zhang et al. (2015) are used to evaluate our models. These datasets contain various domains and sizes, corresponding to sentiment analysis, news classification, question answering, and ontology extraction, which are summarized in Table 2.
Model Settings For data preprocessing, all the texts of the datasets are tokenized by NLTK's tokenizer (Loper and Bird, 2002). … Table 3, all trainable parameters including word embeddings are initialized randomly without any pre-training techniques (Mikolov et al., 2013; Peters et al., 2018; Devlin et al., 2018).

Training and Validation
For each dataset, we randomly split the full training corpus into a training set and a validation set, where the validation set has the same size as the corresponding test set. The validation set is then fixed for all models for fair comparison. The reported test accuracy is that of the model with the lowest validation error. AdaDelta (Zeiler, 2012) with ρ = 0.95 and ε = 1e-6 is chosen to optimize all the trainable parameters. Gradient norm clipping is employed to avoid the gradient explosion problem. L2 normalization is used in all models which include RNN structures. The batch size is set to 64 for Yelp P. and Yelp F. and to 128 for the other datasets. We train all the models using early stopping with a time delay of 10.

Experimental Results and Analysis
Table 4 summarizes the experimental results. We underline the best published models and bold the best records. The best models in our proposed architecture beat previous state-of-the-art models on all eight text classification benchmarks. For published models, the best results are almost all achieved by local feature driven models, including Region-emb, VDCNN and DRNN. The self-attention model SANet performs well, but does not achieve results as advantageous as in sequence-to-sequence scenarios, and neither do RNN based methods. We argue that this is because key phrases and word order play an important role in text classification. For our models, the experimental results show that local extractors enhanced with a global encoder outperform vanilla local models by a considerable margin. When CNN is chosen as the local extractor, the performance gains are particularly significant for relatively difficult tasks such as Amz. F. (+2.4%) and Yah. A. (+2.0%). Encoder1-CNN performs even better than VDCNN with 29 convolutional layers. The results are satisfying considering that the CNN used as local extractor here is a shallow model with only one layer. Moreover, the complicated VDCNN performs best among published models on the larger datasets Amz. P. (95.7%) and Amz. F. (63.0%) but not as well as expected on the smaller AG (91.3%), while our Encoder1-CNN has stable superior performance on all datasets. When DRNN is chosen as the local extractor, the bonus from the global encoder is not as large as for CNN, but is still considerable and stable. Encoder1-DRNN beats DRNN on all datasets, with the highest gain reaching 0.7%.
Table 4: … Table 5. For the compared previous models, the first block lists n-gram based models including bigram-FastText (Joulin et al., 2016) and region embedding (Qiao et al., 2018). The self-attention network SANet (Letarte et al., 2018) is reported in the second block. The RNN based models LSTM (Zhang et al., 2015) and D-LSTM (Yogatama et al., 2017) and the CNN based models char-CNN (Zhang et al., 2015) and VDCNN (Conneau et al., 2016) are listed in the third and fourth blocks respectively. The strong local feature driven models CNN (Kim, 2014) and DRNN (Wang, 2018) are chosen as base models and directly compared with our architecture in the last two blocks.
To better analyze the impact of the specific Encoder1 (global encoder) and the different Interaction Modes on architecture performance, Table 5 details the results of all Encoder1-Encoder2-Mode combinations on three datasets. We find that the local extractor benefits considerably from any of the introduced global encoders. Overall, RNN and Attention based global encoders perform comparably for the local extractor, and both often perform better than the CNN based global encoder. For example, Attention-CNN outperforms CNN-CNN by 1.0% on Yelp F. and RNN-DRNN outperforms CNN-DRNN by 0.4% on Yah. A. This is intuitive, since RNN and Attention are more suitable than CNN for capturing global information, which is critical for the local extractor. Structures that specialize in modeling long-term dependencies are therefore recommended as the global encoder.
For the two Interaction Modes, we find that ATTEND performs slightly better than SAME, by up to 0.4%, which supports the motivation for differentiated guidance. Encoder1 (global encoder) can be viewed as a "reference book" about the whole text. The two modes utilize the information from different perspectives to assist the local extractor. The SAME mode selects the most important information of the global encoder and provides the same guidance for each window in Encoder2, while the ATTEND mode consults the "reference" purposefully based on different local contexts, as if we were referring to a reference book with specific questions in mind.

Model Ablation
In addition to introducing another encoder into vanilla local feature driven models, the greatest novelty of our architecture is that the global encoding is used directly to generate local features. As a result, the local features have global awareness from the very beginning. To verify that our novel architecture makes a key contribution to the performance improvement, we carry out model ablation experiments. Without loss of generality, we use CNN as the local extractor here and validate on the Yelp F. and Yah. A. datasets. The results are illustrated in Figure 2.
Firstly, we list the results of vanilla CNN, which is regarded as the most primitive state. Secondly, an additional encoder is introduced, but the two encoders process the input independently and their output representations are then concatenated for classification. We call this "Concat", abbreviated as "C". For example, RNN-CNN-C stands for concatenating an additional RNN. Finally, we upgrade the way the introduced encoder is used to that of our proposed architecture. Here we list Mode SAME, abbreviated as "S".
We find that CNN-CNN-C loses 0.5% on Yelp P. but gains 0.4% on Yah. A. compared with vanilla CNN. CNN-CNN-C can be viewed as doubling the convolution filters, and we observe that introducing more parameters does not always improve performance. Meanwhile, RNN-CNN-C outperforms vanilla CNN by 0.7% on Yelp P. and 0.9% on Yah. A. This makes sense, since the classifier can use features from CNN and RNN simultaneously and different model structures complement each other for classification. Notably, our architecture performs best in both cases. CNN-CNN-S outperforms CNN-CNN-C by 1.3% and 0.8%, and RNN-CNN-S outperforms RNN-CNN-C by 0.6% and 1.0%, on Yelp P. and Yah. A. respectively. In fact, CNN-CNN-S does not introduce any new model structure or complicated operations, and the number of parameters is almost the same. We attribute the large improvement to our novel mechanism, in which the global representation contributes to the local extraction.
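The structural difference between the "Concat" baseline and the SAME variant can be made explicit with a small wiring sketch. The two toy encoders below are hypothetical stand-ins chosen only so that the example runs; they are not the CNN or RNN encoders evaluated in the ablation.

```python
def max_over_time(features):
    """Largest value of every feature dimension across positions."""
    return [max(column) for column in zip(*features)]


def concat_variant(words, global_encoder, local_encoder):
    """'C': both encoders read the input independently; pooled outputs are concatenated."""
    local_vec = max_over_time(local_encoder(words, None))
    global_vec = max_over_time(global_encoder(words))
    return local_vec + global_vec            # the classifier sees both representations


def same_variant(words, global_encoder, local_encoder):
    """'S': the pooled global vector is injected into local extraction itself."""
    g = max_over_time(global_encoder(words))
    return max_over_time(local_encoder(words, g))  # the classifier sees Encoder2 only


def toy_global(words):
    """Hypothetical global encoder: one feature per word (sum of its components)."""
    return [[sum(word)] for word in words]


def toy_local(words, g):
    """Hypothetical local encoder: first component of each word, shifted by g if given."""
    shift = g[0] if g is not None else 0.0
    return [[word[0] + shift] for word in words]


words = [[1.0, 2.0], [3.0, 0.5]]
concat_repr = concat_variant(words, toy_global, toy_local)  # [3.0, 3.5]
same_repr = same_variant(words, toy_global, toy_local)      # [6.5]
```

In the Concat wiring the classifier input grows with the second encoder, whereas in the SAME wiring its size is unchanged and the global signal only affects how the local features are computed.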

Effect of Window Size
As an important hyperparameter, window size determines how much information can be seen in a specific window and often requires careful tuning in traditional methods. Small window sizes may result in the loss of some critical information, whereas large windows result in an enormous parameter space, which can be difficult to train (Lai et al., 2015). In this section, we analyze the impact of different window sizes on model performance. As shown in Figure 3, both CNN and DRNN are very sensitive to window size, and the optimal window size for DRNN can be much larger than for CNN due to the sequential memory of the RNN structure. Tuning these models is often challenging. In contrast, our Encoder1-Encoder2 architecture is insensitive to this parameter and achieves stable, satisfactory performance across window sizes. We believe this is because the local extraction has been enhanced by global information and does not strictly depend on large windows to capture long-range information.

Table 6: Phrases marked with ~...~ carried wavy underlines in the original table.
Model | Text | Predicted label
CNN | "Commission backs 5bn British Energy deal "" British Energy, the nuclear generator yesterday welcomed a decision by the European commission to approve a government-backed 5bn rescue plan ." | World (✗)
ATT-CNN | "Commission backs 5bn British ~Energy deal~ "" British ~Energy~, the ~nuclear~ generator, yesterday welcomed a decision by the European commission to approve a government-backed 5bn ~rescue plan~ ." | Business (✓)
CNN | The mac and cheese sticks were amazing ... highly recommend them . Overall, for the high price you pay here, I would rather be across the casino with at least a great view of the fountains. |
ATT-CNN | The mac and cheese sticks were ~amazing~ ... highly ~recommend~ them . ~Overall~, for the ~high price~ you pay here, I ~would rather~ be across the casino with at least a ~great~ view of the fountains. | Negative (✓)

Case Study and Visualization
To investigate in detail how our architecture makes a difference, we visualize the phrases attended to by the neural model in Table 6. Qualitatively, we display the contribution of phrases in Encoder2 to classification via max-pooling. The most important phrases are highlighted in red, where the intensity of the color indicates the contribution. Meanwhile, we use wavy underlines to roughly indicate the key phrases with high attention scores in Encoder1. Detailed visualization techniques are introduced in Li et al. (2015).
The first two rows compare CNN with our Attention-CNN on an example from AG News. CNN wrongly captures the key phrases British Energy and the nuclear generator and thus misclassifies the example as World. In contrast, our Attention-CNN correctly classifies it as Business. Encoder1 first captures the global description through Energy deal, nuclear, and rescue plan. Informed by this global information, Encoder2 reduces its attention to nuclear, which implies the label World, while capturing the key phrases British Energy deal and 5bn rescue plan. Accordingly, the model makes the correct prediction, Business. For the second example, the global representations include the phrase high price and the conjunction Overall, which makes Encoder2 activate I would rather while reducing its sensitivity to highly recommend them compared with CNN.
In short, the global representations learned by Encoder1 provide a brief overall grasp of the whole text, which includes both semantic and structural information. This effectively helps Encoder2 capture better instance-specific local features and improves model performance.

Comparison to Pre-trained Models
As shown above, our paper mainly focuses on the fully-supervised domain, where all model parameters are trained from scratch. Alternatively, substantial work has shown that pre-trained models are beneficial for various NLP tasks. Typically, they first pre-train neural networks on large-scale unlabeled text corpora, and then fine-tune the models or representations on downstream tasks.
One kind of pre-trained model is word embeddings, such as word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014). More recently, by utilizing larger-scale unsupervised corpora and deeper architectures, pre-trained language models have been shown to be effective in learning common language representations and have achieved great success. Among them, OpenAI GPT (Radford et al., 2018), BERT (Devlin et al., 2018), XLNet (Yang et al., 2019) and ERNIE 2.0 (Sun et al., 2019) are the most remarkable examples.
In this section, we generalize our architecture to the semi-supervised domain by equipping it with pre-trained word embeddings, and then compare it with popular pre-trained models. Specifically, we use 300-dimensional GloVe vectors² to initialize the word embeddings in our architecture. BERT BASE³ and ERNIE 2.0 BASE⁴ with a 12-layer Transformer (Vaswani et al., 2017) are chosen for comparison. Here we report the best model for each specialized Encoder2 with SAME Mode. Results on three datasets are listed in Table 7.
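Initializing the word embeddings from pre-trained vectors can be sketched as follows. This is an illustrative sketch only: the `pretrained` dictionary stands in for vectors parsed from a GloVe file, and the uniform random fallback for out-of-vocabulary words, the `scale` value and the function name are assumptions that the paper does not specify.

```python
import random


def build_embedding_matrix(vocab, pretrained, dim, seed=0, scale=0.1):
    """Return one embedding row per vocabulary word and the number of pre-trained hits."""
    rng = random.Random(seed)
    matrix, hits = [], 0
    for word in vocab:
        vector = pretrained.get(word)
        if vector is not None and len(vector) == dim:
            matrix.append(list(vector))          # copy the pre-trained vector
            hits += 1
        else:
            # Assumed fallback: small uniform noise for words without a pre-trained vector.
            matrix.append([rng.uniform(-scale, scale) for _ in range(dim)])
    return matrix, hits


pretrained = {"apple": [0.1, 0.2, 0.3]}          # toy stand-in for 300-dimensional GloVe
vocab = ["apple", "camera"]
matrix, hits = build_embedding_matrix(vocab, pretrained, dim=3)
```

The rows remain trainable parameters afterwards; only their starting point changes compared with the randomly initialized setting.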
Overall, our architecture can be boosted considerably by utilizing pre-trained word embeddings. For example, Encoder1-DRNN-S obtains a new score of 76.2% (+1.4%) on Yah. A. and Encoder1-CNN-S reaches 94.1% (+1.6%) on AG. Vanilla local extractors also achieve better performance in most instances, as expected, while our models are still much better than them. Encoder1-CNN-S outperforms CNN by 0.9%, 1.8% and 2.0% on the three datasets respectively, and Encoder1-DRNN-S outperforms DRNN by 0.4%, 0.6% and 0.7%. This shows that our architecture generalizes well and is compatible with pre-training techniques.
It is interesting to compare with stronger pre-trained models. Although we obtain close scores on AG, BERT and ERNIE 2.0 indeed achieve more advanced results on the others, and the latter performs best on all three datasets. Despite their superb accuracy, we argue that these huge models are resource-hungry in practice. Lightweight models still have advantages under some circumstances, such as limited memory, longer texts to be processed, and requirements for faster inference.
Table 7: Semi-supervised generalization of our architecture and comparison with popular pre-trained models. Here "n" and "y" stand for initializing word embeddings randomly and with pre-trained GloVe vectors respectively.
² http://nlp.stanford.edu/projects/glove ³ https://github.com/google-research/bert ⁴ https://github.com/PaddlePaddle/ERNIE

Conclusion
In this work, we demonstrate that local feature extraction can be significantly enhanced with global information. Instead of the traditional approach of exploiting deeper and more complicated operations in upper neural layers, our work provides another, lightweight way of improving the ability of neural models. Specifically, we propose a novel architecture named Encoder1-Encoder2 with two Interaction Modes for their interaction. The architecture is highly flexible, and our best models achieve new state-of-the-art performance in the fully-supervised setting on all benchmark datasets. We also find that our architecture is insensitive to window size and enjoys better robustness. In future work, we plan to validate its effectiveness for multi-label classification. Besides, we are interested in incorporating more powerful unsupervised methods into our architecture.