Learning distributed sentence vectors with bi-directional 3D convolutions

We propose to learn distributed sentence representations using the text's visual features as input. Unlike existing methods that render the words or characters of a sentence into separate images, we further fold these images into a 3-dimensional sentence tensor. Multiple 3-dimensional convolutions with different lengths (the third dimension) are then applied to the sentence tensor, acting jointly as bi-gram, tri-gram, quad-gram, and even five-gram detectors. Similar to the Bi-LSTM, these n-gram detectors learn both forward and backward distributional semantic knowledge from the sentence tensor. That is, the proposed model uses bi-directional convolutions to learn text embeddings according to the semantic order of words. The feature maps from the two directions are concatenated for final sentence embedding learning. Our model involves only a single layer of convolution, which makes it easy and fast to train. Finally, we evaluate the sentence embeddings on several downstream Natural Language Processing (NLP) tasks, on which the proposed model performs surprisingly well.


Introduction
Mapping documents or sentences to vectors (Le and Mikolov, 2014; Pagliardini et al., 2018) is the foundation of various natural language processing (NLP) tasks, such as text classification (Kim, 2014), paraphrase detection (Socher et al., 2011), natural language inference (Bowman et al., 2015), and question answering (Zhou et al., 2015). The most straightforward approach to sentence representation uses the bag-of-words model, which represents a sentence as a bag of its constituent words, disregarding grammar and even word order but keeping multiplicity. Another similar approach, Glove (Pennington et al., 2014), takes the average of the word vectors of the constituent words of a sentence. These approaches are typically efficient to train, but they ignore the sequential characteristic of the text. To account for the word order, Skip-Thought (Kiros et al., 2015) learns sentence representations in an unsupervised way inspired by the skip-gram. It aims to predict the neighboring sentences or phrases of a given sentence. However, the training process of Skip-Thought is very slow, which motivates FastSent (Kenter et al., 2016) to speed up training by representing a sentence as the sum of its constituent word vectors. Although FastSent is faster to train than Skip-Thought, it sacrifices the order of words in a sentence, which is important in language models that rely on features such as n-grams. For example, Gupta et al. (2019) utilize bi-grams and even tri-grams to train their embedding model.
As discussed above, most existing works on sentence representation require pre-trained word vectors as the input or initialization. Sentence representation is treated as a downstream task of word representation. However, when human beings read a sentence or an article, their eyes in fact receive a series of text images, which are then passed to the brain for recognition and understanding. Hence, a natural way of word representation is to use the visual shapes of the words or characters directly as features (Shimada et al., 2016; Su and Lee, 2017; Liu et al., 2017; Sun et al., 2019; Liu and Yin, 2020). For example, Su and Lee (2017) and Shimada, Kotani, and Iyatomi (2016) treat Chinese and Japanese characters as images and apply a convolutional autoencoder that takes those images as input and outputs low-dimensional character embeddings. Liu and Yin (2020) extract both the forward and backward n-gram features from the text's pixel embedding.
We propose to learn sentence embeddings from non-pretrained word images that are folded into sentence images as input. Our model utilizes multiple bi-gram, tri-gram, quad-gram, and even five-gram embeddings of both the forward and backward orders of words. Current research in NLP tends to use deep and complex models, whose performance comes at the cost of model complexity. In contrast, the proposed model has a lightweight structure, as shown in Figure 1. In detail, we render the words or characters of a sentence into images and then fold them into a 3-dimensional sentence tensor $\mathcal{X} \in \mathbb{R}^{w \times h \times l}$, where $w \times h$ is the size of the word or character image and $l$ is the length of the sentence. Each slice $X_i \in \mathbb{R}^{w \times h}$ corresponds to a word or a character image. Furthermore, we propose to fully exploit the language feature (i.e., the word order in a sentence) with two distinctive strategies: (1) extracting multiple n-gram features with several 3-dimensional convolutional kernels of different sizes (n is the number of words covered by the kernel); (2) learning both the forward and backward semantic information from the sentence with bi-directional convolutions, as shown in Figure 1. We name the proposed model 3D-ConvLM. We choose multiple values of n to integrate multiple n-gram convolutional kernels. Taking the demo in Figure 1 as an example, we use bi-gram, tri-gram, and quad-gram information together. Moreover, this n-gram information is constructed from both the normal text order and the reverse order through bi-directional convolutions. A subsequent 1-dimensional max-over-time pooling is applied to each channel of the 2-dimensional feature map output by each n-gram model. After pooling, the feature maps from each n-gram model of the two directions are concatenated as the output feature of the convolutional layer. Finally, three fully connected (FC) layers are used for text embedding learning.
The contributions of our work are three-fold:
(1) We propose to represent a sentence or an article with a video-like 3-dimensional tensor in which each frame represents one word of the sentence or article. This provides an alternative way to approach NLP with computer vision techniques.
(2) We use a 3-dimensional convolutional kernel to learn the n-gram features from the text tensor.
(3) We propose to use bi-directional convolutions to extract semantic information on both the text's forward and backward orders.
The proposed 3D-ConvLM extracts and integrates multiple n-gram features during forward and backward convolutions, which further increases the flexibility of the input of text information. We evaluate 3D-ConvLM on text classification and sentence matching, and study the difference between traditional Chinese and simplified Chinese under the proposed framework.
Related Works

Sentence embedding
Transforming a document or a sentence into a numerical vector (i.e., embedding) according to the text's semantic meaning represents a fundamental task in downstream applications of NLP. One simple implementation of the sentence representation is to sum or average all its constituent word embeddings, such as the bag-of-words, Glove (Pennington et al., 2014), and FastSent (Kenter et al., 2016). These methods are typically efficient in training but compromise the order of words, which may cause significant information loss for text analysis.
Several works model the order of words when learning distributed sentence representations (Le and Mikolov, 2014; Kiros et al., 2015; Conneau et al., 2017; Pagliardini et al., 2018; Gupta et al., 2019; Shen et al., 2019). Le and Mikolov (2014) propose Doc2vec, which adds a paragraph vector to represent the information missing from the current context. Doc2vec is an adaptation of Word2vec (Mikolov et al., 2013). Also inspired by the Word2vec model, Skip-Thought (Kiros et al., 2015) learns sentence representations by predicting the neighboring sentences of a given sentence. Sent2Vec (Pagliardini et al., 2018) aims to strike a balance between matrix factorization and deep learning. Gupta et al. (2019) propose two modifications of Word2vec by considering higher-order word n-grams along with uni-grams during training. Shen et al. (2019) use InferSent (Conneau et al., 2017) for sentence embeddings based on word vectors learned by Glove (Pennington et al., 2014) or FastText (Joulin et al., 2017). Gupta et al. (2019) claim that training word embeddings along with higher-order n-gram embeddings helps to remove the contextual information from the uni-grams, resulting in better stand-alone word embeddings. All the aforementioned methods require pre-trained word vectors as input.
From a completely different perspective, the most natural way of representing text is to use its visual shape, which is also how humans understand text. Pre-trained word vectors are not indispensable for sentence embedding. Intuitively, when we read an article on a screen or in a book, our eyes capture the text as a series of images rather than embedding it into vectors. In other words, humans understand text through the visual information of the words, i.e., we recognize characters or words from the images captured by our eyes. Therefore, we believe that the pixel image, i.e., the character's morphological shape, provides the most straightforward way to represent characters and words. Motivated by this idea, several visual embedding methods (Shimada et al., 2016; Su and Lee, 2017; Sun et al., 2019) have been developed for Chinese and Japanese text understanding. However, it is very difficult to visually embed alphabetic languages such as English, because English words cannot be rendered as same-sized images in the way that square Chinese or Japanese characters can. In this work, we render English words into fixed-size images. By contrast, for Chinese, each character is rendered into a square image. Based on visual embedding, we integrate multiple n-gram embeddings of both the forward and backward directions into our model.

Bi-directional models
In neural language models, it is often preferable to use both the normal order and the reverse order of the text as input. The most well-known bi-directional model is the Bi-LSTM (Schuster and Paliwal, 1997), which accepts both the forward and backward information of the text as input. For example, Melamud et al. (2016) use Bi-LSTMs to learn a generic context embedding function from large corpora. Kawakami and Dyer (2015) represent words in context using Bi-LSTMs and multilingual supervision.
The forward direction processes the input sequence as it is, and the backward direction processes a reversed copy of the input sequence. Bi-LSTMs may not benefit all sequence prediction problems, but they can improve the results in the domains where they are appropriate (Graves and Schmidhuber, 2005; Melamud et al., 2016; Kawakami and Dyer, 2015).
A substantial body of research has focused on combining Bi-LSTMs with convolutional neural networks (CNNs). For example, the gated bi-directional CNN (Zeng et al., 2016) is a bi-directional network that can effectively make use of multi-scale and multi-context regions of images. It is motivated by the fact that features from different resolutions and support regions can validate the existence of one another. One example is that a local rabbit ear is helpful in recognizing the rabbit from an image. However, when the local feature is that the rabbit ear is artificially located on a girl's head, it would validate the evidence to support a rabbit image. Hence, we cannot apply the gated bi-directional CNN to NLP problems. Chiu et al. (2016) propose bi-directional LSTM-CNNs, which automatically detect word- and character-level features using a hybrid bi-directional LSTM and CNN architecture, and thus eliminate the need for most feature engineering. In contrast, 3D-ConvLM focuses on exploiting the bi-directional semantic knowledge from the text directly.

Overview
Our model makes a direct connection between computer vision and NLP. We render a sentence $S$ into a 3-dimensional tensor, where each slice of this tensor corresponds to a word (for English) or a character (for Chinese), i.e., each word from $S$ is rendered as an image $X_i \in \mathbb{R}^{w \times h}$. We then apply 3-dimensional convolutional kernels of size $w \times h \times n$ to the "text tensor", where $w$ and $h$ are respectively the width and height of the word or character images, and $n$ is the number of words or characters covered by the kernel. In other words, the 3-dimensional convolution covers $n$ words or characters at each sliding step, and thus acts as an n-gram feature detector. The 3-dimensional convolutional kernel used here is different from the kernel used in traditional image or video processing tasks. By varying the value of $n$, we obtain n-gram detectors of different sizes. We propose to extract textual features using multiple n-gram convolutions. For example, in our experiments, $n$ takes values in {2, 3, 4, 5}. Integrating multiple n-grams is easy and fast to implement under the proposed framework.
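To make the n-gram interpretation concrete, the following is a minimal pure-Python sketch of a single kernel sliding along the word axis of a sentence tensor. The function name `ngram_conv`, the toy data, and the choice of unit stride without padding are illustrative assumptions rather than details taken from our implementation.

```python
def ngram_conv(sentence, kernel):
    """Slide one w x h x n kernel along the word axis of a sentence tensor.

    sentence: list of l word images, each a list of rows of pixel values
    kernel:   list of n weight slices with the same per-slice shape
    Returns l - n + 1 responses (assumes stride 1 and no padding).
    """
    l, n = len(sentence), len(kernel)
    responses = []
    for start in range(l - n + 1):
        total = 0.0
        # The kernel covers n consecutive word images, i.e. one n-gram.
        for k in range(n):
            image, weights = sentence[start + k], kernel[k]
            for img_row, w_row in zip(image, weights):
                total += sum(p * q for p, q in zip(img_row, w_row))
        responses.append(total)
    return responses


# Toy example: 4 "words", each a 2 x 2 image filled with (word index + 1).
sentence = [[[float(i + 1)] * 2 for _ in range(2)] for i in range(4)]
# A bi-gram kernel (n = 2) whose weights are all ones.
bigram_kernel = [[[1.0] * 2 for _ in range(2)] for _ in range(2)]
features = ngram_conv(sentence, bigram_kernel)
print(features)  # one response per bi-gram position
```

With k such kernels, stacking the k response vectors gives the 2-dimensional feature map (channels by n-gram positions) produced by one n-gram detector.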
In neural language modeling, the textual information of the normal order and that of the reverse order are two different inputs. For example, the Bi-LSTM model takes both the forward and backward sequences of the text as inputs: the forward pass processes the input sequence as it is, and the backward pass processes a reversed copy of the input sequence. The integration of bi-directional information may not benefit all sequence prediction problems, but it can offer some improvement in those domains where it is appropriate (Graves and Schmidhuber, 2005). Traditional CNNs ignore the difference between the forward and backward sequential information. In this work, we propose both forward and backward convolutions to bridge this gap. As shown in Figure 1, three n-gram models (bi-gram, tri-gram, and quad-gram) are applied to the forward and backward text inputs. Hence, each n-gram detector outputs two 2-dimensional feature maps, one from the forward model and one from the backward model, in which the rows correspond to channels and the columns correspond to the n-gram features. A subsequent 1-dimensional max-over-time pooling is applied to each channel of these 2-dimensional feature maps. Here, max-over-time pooling means that the pooling is carried out along the time dimension according to the order of the sentence, which is different from the max-over-time pooling referred to in Kim et al. (2016), which takes the max pooling over different convolutional kernels. After pooling, the feature maps extracted by each n-gram detector in the two directions are concatenated as the output feature of the convolutional layer. Finally, three FC layers are used for downstream NLP tasks, such as text classification. Figure 1 illustrates the entire process of the proposed model.
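The pipeline described above (convolution over both reading orders, pooling along time, then concatenation) can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions: word images are flattened to lists, a single channel is used per n-gram size, the same kernels are shared by both directions for brevity, and the pooling window of 3 is assumed (only the stride is specified in the network description). All function names are hypothetical.

```python
def conv_along_words(frames, kernel):
    """Stride-1 convolution along the word axis, without padding.

    frames: list of flattened word images (lists of pixel values)
    kernel: list of n flattened weight slices, one per covered word
    """
    n = len(kernel)
    return [
        sum(p * q
            for frame, weights in zip(frames[start:start + n], kernel)
            for p, q in zip(frame, weights))
        for start in range(len(frames) - n + 1)
    ]


def max_pool_1d(values, window=3, stride=3):
    """Windowed max pooling along time; an incomplete last window is dropped."""
    return [max(values[i:i + window])
            for i in range(0, len(values) - window + 1, stride)]


def bidirectional_features(frames, kernels, window=3, stride=3):
    """Concatenate the pooled responses of every kernel on both reading orders."""
    features = []
    for ordered in (frames, frames[::-1]):   # forward order, then reversed copy
        for kernel in kernels:               # one kernel per n-gram size here
            response = conv_along_words(ordered, kernel)
            features.extend(max_pool_1d(response, window, stride))
    return features


# Toy sentence: 8 words, each a 1-pixel "image".
frames = [[1], [2], [3], [5], [8], [13], [21], [34]]
bigram = [[1], [-1]]          # asymmetric, so the reading order matters
trigram = [[1], [0], [-1]]
print(bidirectional_features(frames, [bigram, trigram]))
```

Because the toy kernels are asymmetric, the forward and backward halves of the concatenated vector differ, which is the effect the bi-directional design is meant to capture.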

Discussion
CNNs have been shown to achieve excellent performance on text classification and sentiment classification (Kim, 2014). In a standard CNN for text analysis, only the forward convolution is passed to the next layer. In contrast, we propose bi-directional convolutions to adapt our model to text data, as shown in Figure 1. We add two important characteristics to the CNN to make it better suited to text analysis: (1) we use bi-directional convolutions to extract features; and (2) we integrate multiple n-gram features as the input.

Network Implementation
The network architecture can be described in order as follows: 1. The specification stride = 3 for the MaxPool1d results in no overlaps in the max-over-time pooling. The kernel size of the Conv3d layer is (20, 131, 3). It corresponds to a tri-gram model, which is different from the bi-gram model illustrated in Figure 1.
The proposed model has a light-weight architecture, which contains no more than 20 * 131 * 3 * 50 + 50 * (L − 6) * 1250 * 512 * 100 * num_class parameters, where num_class is the number of categories in each dataset. 3D-ConvLM has far fewer parameters than the baseline char-CNN (Zhang et al., 2015).

Experimental setup
We render characters (in Chinese) or words (in English) into images of size $w \times h$. Chinese characters have square shapes, and thus can be represented by square images with 20 × 20 = 400 pixels; that is, w = h = 20. For English, we render each word into an image with height h = 20 and width w = 131, which accommodates most English words, up to a maximum length of 17 letters. Our model is trained using Adam (Kingma and Ba, 2015) with a learning rate of 0.00001.

Datasets for text classification
The three datasets for text classification are introduced as follows. The THUCNews dataset (Li and Sun, 2007) is generated from the historical data obtained from the subscription channel of Sina News RSS (https://rss.sina.com.cn/) from 2005 to 2011. The Toutiao news dataset collects text from the Toutiao App (https://github.com/fate233/toutiao-text-classfication-dataset), and each item contains the title and keywords of a news article. We concatenate the title and keywords as one sample. There are 382,688 items in the raw data. We filter out samples with lengths outside the range [5, 100]; the remaining 380,455 samples, i.e., 99.41% of the raw data, are used for training and testing. The Dianping dataset consists of user reviews of restaurants from an online review site (http://www.dianping.com/) (Zhang and LeCun, 2017) and contains 2,500,000 samples. Each sample is a review with a score ranging from 1 star to 5 stars. We mark a review as positive if the star value equals 4 or 5, and negative otherwise.
For the three datasets, the sample sizes for training, validation, and testing are given in Table 1.
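The length filter and the sentiment labeling rule described above can be sketched as follows. The helper names are hypothetical, and measuring length as the number of characters in the string is an assumption; the toy samples are not taken from the datasets.

```python
def within_length_range(text, min_len=5, max_len=100):
    """Keep a sample only if its length lies in [min_len, max_len].

    Length is counted in characters here, which is an assumption.
    """
    return min_len <= len(text) <= max_len


def sentiment_label(stars):
    """Map a 1-5 star rating to a binary label: 4 or 5 stars is positive."""
    if not 1 <= stars <= 5:
        raise ValueError("star rating must be between 1 and 5")
    return "positive" if stars >= 4 else "negative"


# Toy samples of length 5, 3, 100, and 101.
samples = ["short", "abc", "x" * 100, "y" * 101]
kept = [s for s in samples if within_length_range(s)]
labels = [sentiment_label(s) for s in (1, 3, 4, 5)]
print(len(kept), labels)
```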

Baselines
For comparison, we consider four baseline methods:
• The character-level convolutional neural network (char-CNN) (Zhang et al., 2015);
• CNN for text classification on top of the distributed word vectors obtained via Word2vec (Kim, 2014);
• CNN for text classification on top of the one-hot word vectors;
• FastText (Joulin et al., 2017).
We experiment with three variants of 3D-ConvLM:
• 3D-ConvLM using both the bi-directional convolution and multiple n-gram inputs, as shown in Figure 1.
• 3D-ConvLM using the bi-directional convolution and a single 3-gram detector with filter size w × h × 3.
• 3D-ConvLM using only the 3-gram detector, with two convolutional layers and filter size w × h × 3.

Results on text classification
The test accuracy of our models in comparison with the other methods is shown in Table 2. The proposed model with bi-directional convolutions and {2, 3, 4, 5}-grams achieves better text classification performance than the others. It is worth emphasizing that both the bi-directional convolution and the multiple n-gram detectors contribute to the final performance, as supported by the experimental results in the first three rows of Table 2. Comparing row 2 and row 3, we observe that the model with bi-directional convolution in row 2 achieves better accuracy than the model in row 3. The results in row 1 and row 2 of Table 2 also suggest that integrating multiple n-grams improves the performance.

Interpretation of 3-dimensional convolution
The 3-dimensional convolutional kernel acts as an n-gram detector in our model. As shown in Figure 1, the convolutional kernel operates on n frames (i.e., n words) at a time, and thus corresponds to an n-gram detector. For a sentence $S$ of length $l$, a bi-gram kernel generates a feature vector $u \in \mathbb{R}^{l-1}$, which is a continuous bi-gram feature of $S$ (in general, an n-gram kernel with unit stride and no padding yields $l - n + 1$ values). By applying $k$ different kernels, we obtain a feature map $U \in \mathbb{R}^{k \times (l-1)}$ for the bi-gram detector. At test time, we also render a test sentence $S = \{v_1, v_2, \ldots, v_l\}$ into a text video $\mathcal{X} = \{X_1, X_2, \ldots, X_l\}$. Feeding the text video $\mathcal{X}$ into the network yields a corresponding feature map $U$. Its element $U_{i,j}$ is the convolutional result between the $i$-th 3-dimensional convolutional kernel and the $j$-th n-gram. A larger value of $U_{i,j}$ indicates that the $j$-th n-gram of the input sentence $S$ is more relevant for the classification task (as selected by kernel $i$). By identifying the maximum $U_{i,j}$ over all elements of $U$, we can easily find the n-gram within the sentence $S$ that is most relevant for the classification task, where $j + 1 \le l$ and $l$ is the number of words in $S$. We visualize the weighted bi-grams according to the first layer of the network trained on the topic classification task of the THUCNews dataset. It has ten classes, as shown in Table 3. There are 10,000 testing samples over all categories. Table 3 lists the top five bi-grams associated with each category.
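The lookup of the most relevant n-gram from a feature map can be sketched as below. The function name, the toy tokens, and the numeric values of `U` are made up for illustration, and indices are 0-based in the code whereas the text uses 1-based indices.

```python
def most_relevant_ngram(tokens, feature_map, n=2):
    """Return (kernel index, n-gram, score) for the largest entry of U.

    tokens:      the l words or characters of the sentence
    feature_map: k rows (kernels), each with l - n + 1 responses
    """
    best_score, best_i, best_j = max(
        (score, i, j)
        for i, row in enumerate(feature_map)
        for j, score in enumerate(row)
    )
    # Column j of U corresponds to the n-gram starting at token j.
    return best_i, tokens[best_j:best_j + n], best_score


tokens = ["the", "fund", "manager", "resigned"]
# Hypothetical feature map from k = 2 bi-gram kernels over l - 1 = 3 positions.
U = [[0.1, 0.9, 0.2],
     [0.3, 0.4, 0.05]]
kernel_index, ngram, score = most_relevant_ngram(tokens, U)
print(kernel_index, ngram, score)
```

Collecting the top-scoring bi-grams over all test sentences of a class, and counting how often each occurs, would give a ranking of the kind reported in Table 3.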
Almost all of the top five bi-grams detected by the proposed model (bi-directional + {2, 3, 4, 5}-grams) are relevant to the corresponding topics. Taking the topic "Finance" as an example, the detected bi-gram "基金" ("fund" in English) strongly matches its semantic topic, so we mark it in blue. The third bi-gram is "记者" ("journalist" in English), which may or may not be related to finance, so we color it in orange. The fifth most relevant bi-gram for "Finance" suggested by 3D-ConvLM is "可能", which is not relevant, so we color it in red.
Because 3D-ConvLM does not require preprocessing such as word segmentation (for Chinese), some special bi-grams without any semantic meaning can be detected, for instance, "图) ", "》报", "月2", and "》新". Although these bi-grams are meaningless, they do occur in the corpus with high frequency.
Chinese characters are possibly the oldest continuously used, and one of the most complex, writing systems in the world. There are two coexisting writing systems, i.e., simplified Chinese and traditional Chinese. The two writing systems are used to write almost all Chinese dialects. As shown in Figure 2, the upper row shows characters in simplified Chinese, and the lower row shows the same characters in traditional Chinese, which clearly have more strokes. Over the years, there have been extensive debates about traditional and simplified Chinese. For example, what are the differences between traditional and simplified Chinese? And which one is more efficient?

Figure 2: The same sentence written in simplified Chinese (学而时习之，不亦说乎？) and in traditional Chinese (學而時習之，不亦說乎？).
In this section, we compare simplified Chinese and traditional Chinese under the framework of 3D-ConvLM. We render the three datasets into both simplified Chinese and traditional Chinese, and then run the text classification tasks on them 10 times. From the results in Table

Table 3: Top five bi-grams of 10 different topics from the Chinese dataset THUCNews detected by the proposed model (bi-directional + {2, 3, 4, 5}-gram). Each entry is listed in the format [bi-gram in Chinese]+(its English translation)+(frequency). We use three colors to indicate the relevance between the topic and the bi-grams: blue means strongly related, orange means possibly relevant, and red means irrelevant. Columns: Topic; top five bi-grams, based on frequency.

Sentence matching
In this section, we evaluate the relatedness and entailment relation between two sentences. Both are defined on a sentence pair $(S_A, S_B)$. The relatedness is a 5-point score that quantifies the degree of semantic relatedness between the sentences; the entailment relation between $S_A$ and $S_B$ is one of entailment, contradiction, or neutral. To adapt the proposed 3D-ConvLM to these two tasks, we modify it to take the sentences $S_A$ and $S_B$ as separate inputs. Multiple n-gram detectors are then applied to $S_A$ and $S_B$ to output two separate feature maps. The SICK dataset (Marelli et al., 2014) contains about 10,000 English sentence pairs. Each pair was annotated for relatedness (SICK-R) and entailment (SICK-E) by means of crowdsourcing. Samples in STS14 (Agirre et al., 2014), which contains 36,000 sentence pairs, are labeled in the same way as SICK-R. We run the baselines with SentEval (Conneau and Kiela, 2018), a toolkit for evaluating sentence embeddings.

Results of sentence matching
The results of sentence matching are shown in Table 5. For SICK-E and SICK-R, the proposed model achieves results comparable with FastText+BoW, while Skip-Thought and Glove+BoW are the two best performers. On the STS14 set, however, our model achieves the highest Pearson/Spearman correlations. Although the proposed model does not stand out on every task, it has a light-weight structure that is easy to train.
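For readers unfamiliar with the two correlation measures reported for the relatedness tasks, the following is a minimal pure-Python sketch of how Pearson and Spearman correlations between gold and predicted scores can be computed. It is not the evaluation code of the toolkit; the function names and the toy scores are illustrative only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (non-constant inputs)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Toy relatedness scores on the 5-point scale (made-up values).
gold = [1.0, 2.5, 3.0, 4.2, 5.0]
pred = [1.2, 2.0, 3.5, 3.9, 4.8]
print(pearson(gold, pred), spearman(gold, pred))
```

Pearson measures the linear agreement between predicted and gold scores, while Spearman depends only on their ordering, so a model that ranks pairs correctly but on a distorted scale scores higher on the latter.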

Conclusion
Visual embedding of the text has been studied extensively in recent years. CNNs can deliver a competitive performance in comparison with LSTM on text data analysis. We propose a bi-directional CNN to learn text embedding according to the semantic order of sentences. In our model, visual signals of each character can be extracted by multiple n-grams in both the normal order and reversed order. We conduct text classification and sentence matching on several datasets to evaluate the performance of our model. Within the proposed framework, we also study the difference between the simplified Chinese and traditional Chinese.