Learning Distributional Token Representations from Visual Features

In this study, we compare token representations constructed from visual features (i.e., pixels) with standard lookup-based embeddings. Our goal is to gain insight about the challenges of encoding a text representation from low-level features, e.g. from characters or pixels. We focus on Chinese, which—as a logographic language—has properties that make a representation via visual features challenging and interesting. To train and evaluate different models for the token representation, we chose the task of character-based neural machine translation (NMT) from Chinese to English. We found that a token representation computed only from visual features can achieve competitive results to lookup embeddings. However, we also show different strengths and weaknesses in the models’ performance in a part-of-speech tagging task and also a semantic similarity task. In summary, we show that it is possible to achieve a text representation only from pixels. We hope that this is a useful stepping stone for future studies that exclusively rely on visual input, or aim at exploiting visual features of written language.


Introduction
Language representation beyond the word level can be advantageous for words from the tail of the distribution, as has been shown in recent neural approaches for various tasks (Schütze, 2017;Kim et al., 2016;Wu et al., 2016;Sennrich et al., 2016). In these approaches, a neural model represents an input text based on its sequence of characters or character n-grams (instead of its words). This helps the model to handle outof-vocabulary tokens and avoids the need of text segmentation or tokenization, which remains an unsolved problem for many languages. For some applications, e.g. language processing for social media, models should be able to capture the creative use of language. However, a disadvantage is that we loose the certainty of a known vocabulary. Also, regarding computational complexity, there is a trade-off between the memory that is required for large embedding lookup tables and the additional computational cost that is required to compose representations from low-level features.
In contrast to the characters of languages based on the Latin alphabet, the Chinese written language is defined over a set of ≈ 8000 characters that already carry meaning. The characters can either appear in a traditional or simplified form, and many share visual components that can indicate a related meaning. Thus, it is reasonable to hypothesize that encoding Chinese characters directly from their visual components might improve their token representation in a neural network model.
In recent studies, Liu et al. (2017) evaluated character encodings from visual features by classifying Wikipedia titles into 12 categories and Su and Lee (2017) evaluated such encodings by measuring correlation with human similarity judgments. Both studies found that using character encodings from visual input did not outperform or were equal to lookup-based embeddings. No support for the hypothesis above is thus provided.
In this study, we aim to explore the question of whether and when visual-feature representations are beneficial or detrimental. We make the following contributions: (i) Since the evaluation tasks from prior studies did not test the capabilities of the visual features for text representation, we propose to employ the task of neural machine trans-lation (NMT) for training and evaluation. We argue that NMT requires the token representation to serve as a reliable syntactic and semantic signal. (ii) Prior work reported evaluations for one architecture to encode the visual features. In this paper, we evaluate different settings and argue for architecture choices that performed well in our experiments. (iii) We provide evidence for the possibility to compute token representations from visual features that perform on-par with lookup-based embeddings in NMT. (iv) Finally, we revisit two of the tasks from prior work that use visual features: measuring correlation with semantic similarity judgments by humans as well as joint segmentation and part-of-speech tagging. We use the best models from the NMT evaluation in these two tasks. For semantic similarity, we find that token representation from pixels are clearly beneficial for unseen characters, while a lookup embedding performs better for seen characters. For joint segmentation and part-of-speech tagging we find no clear difference between lookup embeddings and character representation from pixels.

Related Work
We start by summarizing prior work, in which token representations were obtained from bitmaps of Chinese characters. Su and Lee (2017) propose several embedding models for Chinese, including one from visual features. For the evaluation they create a dataset with Chinese word pairs in traditional Chinese. In their evaluation, they measure the correlation of the cosine similarity of embedded word pairs against human similarity judgments for those word pairs. In their experiments, they found that the representation from visual features was not competitive to an embedding lookup model, but a combination of lookup embeddings with visual features was. For their model, they train a 5-layer CNN autoencoder on bitmaps of Chinese characters. The auto-encoder character representation is fed into two GRU layers and two fully connected layers with ELU activation to encode the characters into a word. The model was trained to predict the Skip-Gram objective (Mikolov et al., 2013) or the Glove objective (Pennington et al., 2014). Liu et al. (2017) evaluate the classification of Chinese Wikipedia titles into 12 categories. They found that the representation from visual features representations was not competitive to embedding lookup models. They did find, however, that a combination of lookup-based embeddings and visual features can be beneficial. Their character encoder is a convolutional neural network (CNN). The CNN consists of 3 convolutional layers with max-pooling followed by a fully connected transformation layer with ReLU activation. The character encodings are fed as input to a recurrent neural network (RNN) that encodes the Wikipedia title. Both, character encoder and classifier were trained jointly. Costa-jussà et al. (2017) investigate the neural machine translation from Chinese to Spanish. To investigate the helpfulness of visual features they augmented lookup embeddings by concatenating the corresponding bitmap features of the characters. They showed that this approach improved the performance of both, their character-based and their word-based NMT system. They did not study the exclusive use of visual features. Shao et al. (2017) investigate joint part-ofspeech tagging and segmentation for Chinese. In contrast to the studies mentioned above, they found that the inclusion of visual features did not help. However, in their study, the CNN encoder was not pre-trained and only a small amount of training data was available, which may explain this finding. In their approach, they augmented character and n-gram embeddings with visual features. The model is a 2-layer CNN with maxpooling and a consecutive fully-connected transformation layer. They concatenated the CNNbased character embeddings to the lookup embeddings and fed them to a BI-LSTM-CRF.
Other related approaches with regards to encoding symbols from visual features are: Deng et al. (2017) proposed to translate from images to a markup language; e.g.; images of mathematical expressions to L A T E X code. While their system performed remarkably well on those regular languages, they did not discuss their approach in the context of natural language. Remotely related is optical character recognition (OCR) and multimodel machine translation. While OCR could benefit from this study, its main goal is character recognition under noisy conditions. Multi-modal machine translation augments a main task with additional features or when the translation is from a non-textual source, e.g. images, to text (Elliott et al., 2017). In contrast, our goal is to study text representation from low-level features.  (Ioffe and Szegedy, 2015), F = CNN output size (see Table 2).

Character Encoding
In this section, we first describe how a tokenized text is mapped to vector representations of its tokens. Then, we describe the character encoders that we have evaluated.

Preliminaries
Neural approaches that compute a function from text input commonly map each token t i (e.g., a word) from the input text T to a dense vector rep- Lookup-based models associate each token of a fixed vocabulary V with its own dense representation (embedding). In particular, lookupbased models use (or learn) an embedding matrix E ∈ R n×d , where n = |V |. The j-th row of E holds the embedding of the j-th token in the vocabulary. Thus D(t i ) = e T j E, where j is the token number of token t i in V and e j denotes the j-th standard basis vector.
If we represent each character from its visual features-i.e., the pixels of its bitmap image-, we compute the embedding instead of performing a lookup in an embedding matrix. In particular, D(t i ) computes a dense representation of token t i from its pixels p(t i ) via a character encoder C. Therefore, there is no fixed vocabulary of tokens any more. We have D(t i ) = C(p(t i )).

Character Encoders
As discussed in Section 2, most of the prior related work-except Costa-jussà et al. (2017)used CNNs to learn position-invariant features of the character image. In Table 2, we report the configuration of the convolutional layers that was used in our models. All studies report different numbers of layers and different numbers of CNN features, therefore, we vary the CNN feature size F as a hyper-parameter. In Table 1 we report the character encoder architectures, that roughly cover the architectures from previous work. After the CNN, Liu et al. (2017) and Shao et al. (2017) use fully connected layers (CNN+FC+ReLU), while Su and Lee (2017) directly feed the CNN features into a recurrent neural network encoder (CNN+ReLU). We also consider a similar setup as Costa-jussà et al. (2017) by including two settings with one or two fully connected layers, (FC 1L and FC 2L). In addition to previous work, we also evaluated a model architecture that computes a softmax activation over the image features (CNN+SM+FC). Our hypothesis is that the sparse activation from the softmax may act like a soft lookup (in an F × 512 embedding matrix).

Experimental Study
In this section, we describe the results of our experimental study. First, we report how the character images were created, followed by the experiments for the neural machine translation task. To expand on these results, we describe experiments and results for joint segmentation and partof-speech tagging, as well as for measuring correlation with semantic similarity judgments.

Character Images
We convert each character from the source vocabulary into a 22 × 22 binary representation of its glyph. 1 This was the lowest resolution that did not collapse nearby strokes into indistinguishable clusters.

Experiments for Machine Translation
For the machine translation experiments, we trained a NMT model to translate from Chinese to English. In the following sections, we describe the model, the data and the training settings, followed by the results.
NMT Model We used a standard sequence-tosequence model with a recurrent encoder and attention-based decoder (Luong et al., 2015). This architecture does not represent the state of the art in neural machine translation. However, due to its wide adoption in empirical research, there is broad knowledge about suitable hyper-parameters, which makes it a preferred choice for our study.
The coarse architecture of this model can be described by an encoder enc and a decoder dec. The encoder enc is a function that takes a tokenized text T in a source vocabulary as input and computes a representation that is the input for the subsequent decoder. The decoder dec then creates a sequence of tokens in the target vocabulary. The final output is dec(enc(T )).
The input of the encoder enc is first transformed to a dense vector representation, i.e., each input token is transformed by function D of Section 3.1. In the following experiments, we evaluate different choices for D.
Data For training the translation model, we used a subset of the available data from the WMT 2017 Workshop on Machine Translation (Bojar et al., 2017), 2 , namely the News Commentary v12 corpus as well as Casia2015 and Neu2017 from the CWMT Corpus. Overall, these datasets yielded 3,277,330 sentences. For development and evaluation, we used WMT 2017 dev and test data, respectively.
Training For Chinese, we use characters as input. For English, we use byte pair encoding (Sennrich et al., 2016) with ≈32,000 sym-   (Paszke et al., 2017). The recurrent neural networks in the encoder and decoder are LSTMs (Hochreiter and Schmidhuber, 1997). Table 3 summarizes the hyper-parameters of the NMT model. For training, we used Adam (Kingma and Ba, 2014) with the standard parameters of PyTorch. We used a learning rate of 10 −3 for 4 epochs, then we halved the learning rate every epoch until we reached 10 −5 . For regularization, we used dropout with probability 0.2 in both encoder and decoder RNNs. We pretrained the decoder for 20 epochs to a perplexity of 108.21 on the validation data. This did result in: faster convergence, improved fluency of translations, and less variance of evaluation results. We trained all the models for 10 epochs. We selected the models by the best batchbased approximate BLEU score on the development set. This worked slightly better than selecting them by the best perplexity. We perform translation via beam search with a beam-size of 10 and length normalization of 0.9.
We trained the NMT models with each of the character encoders of Table 1 as well as with lookup-based embeddings (EMB) as a baseline.
Results Table 4 summarizes the results of the machine translation experiments. We include the result from a WMT 2017 system to give context for the expected BLEU score in this task. This result is from a baseline NMT system for Chinese to English (Wang et al., 2017), which is most comparable to our approach (i.e., no reranking, no ensembles, no special treatment of names and numbers). However, a crucial difference is that the WMT 2017 system uses pre-segmented text, which yields a vocabulary size of 300k on the   source side, while our system is character-based.
Our results indicate that using only one fully connected transformation layer FC 1L for character encoding is possible but does not yield comparable results to the best-performing convolutional architectures. However, the FC 2L comes close to the CNN models. Note that CNN+SM+FC 4096 needs a larger number of CNN features to perform comparable to CNN+FC+ReLU 1024. In terms of BLEU score, the best character encoders perform equal to a standard lookup embedding.
To gain insight into the differences between the models, we split the test data into three buckets. Each sentence is scored with 1/n n i=1 #t i , where #t i is the training data frequency of character t i . We partition the test data into a low and high bucket with the sentences scored in the lower and upper quartile, respectively, and a mid bucket with the remaining 2nd and 3rd quartile. Table 5 Figure 1: Distributions of the top-3 softmax activation magnitudes per character.
shows the results per bucket for CNN+FC+ReLU 1024, CNN+SM+FC 4096, FC 2L and EMB. Interestingly, the CNN+FC+ReLU 1024 and EMB perform differently well in the frequency buckets. The CNN+FC+ReLU 1024 model, surprisingly, performs better than EMB for highfrequency characters, while this is inversed for the low-frequency characters.
For the CNN+SM+FC encoders, our hypothesis was that the sparse activation of the softmax could act like a soft lookup function; i.e., it selects only few rows for the subsequent fully connected layer. We measured the top-3 softmax activation magnitudes per character from a random sample of 350 sentences from the training data (a total of 11234 words). As shown in Figure 1, the activations are indeed spiked, which supports our hypothesis. However, in our evaluation, we found no advantage over CNN+FC+ReLU encoders.

Experiments for Joint Part-of-Speech Tagging and Segmentation
In the translation experiment we evaluated the token representations for a syntactic and semantic signal. In this experiment, we want to investigate the morpho-syntactic information in the representations. To evaluate this, we employ the task of joint part-of-speech tagging and segmentation.
Model We used a Linear-Chain-CRF with a CNN encoder (Strubell et al., 2017) to compute the emissions. The encoder is a three-layer CNN with kernel size 3, iteratively growing dilations, residual connections, ReLU activations, and a hidden size of 512. We do not report numbers for a BI-RNN-CRF similar to Shao et al. (2017), because our implementation did not yield their results for unigrams which is most likely caused by the different embeddings. 1-gram BI-RNN-CRF*/** 92.45 3-gram BI-RNN-CRF* 94.07 Table 6: Results for joint segmentation and POS tagging on the test set of CTB-5.0. (*) 1-gram and 3-gram BI-RNN-CRF reported by Shao et al. (2017), (**) the 1-gram BI-RNN-CRF result was reported on the development set.
Data The Chinese Treebank 5.0 (Xue et al., 2005) has joint annotations for word segmentation and part-of-speech tags. The commonly used cross validation split was reported by Jiang et al. (2008). Instead of the BIES tagging scheme (begin, inside, end, single), we used the BI scheme, which worked better for our models.
Training We use the parameters from the EMB, CNN+FC+ReLU 1024, FC2L 512 and CNN+SM+FC 4096 of the NMT task. We either adjust these parameters during training (adapt) or keep them fixed (fix). For training, we used Adagrad (Duchi et al., 2010) with an initial learning rate of 0.1, and dropout for regularization with probability 0.1 for the encoder and 0.5 for the output classifier. We trained the models for 50 epochs and selected the model with the best accuracy on the development set.

Results
Our results reported in Table 6 are averaged over two distinct runs each. For comparison, we show results reported for the 1-gram and 3-gram lookup model of Shao et al. (2017). Our models correspond to the 1-gram model, as they are character-based. We find no clear difference between lookup-based and the best character encoder models. Interestingly, CNN+SM+FC 4096 (softmax) is the weakest model. The sparseness in the in the signal apparently removes syntactic information. Comparing the adapt and the fix setting, we see that adapting the character encoder during training is -in most cases-not helpful. This is most likely due to the small amount of training data.

Experiments for Word Similarity
In our final experiment, we evaluated the representation of semantic information. The task is to compute a similarity score for pairs of words. The evaluation is the Spearman's correlation between the scores and numerical similarity judgments by humans. The words in this data set are translated into traditional Chinese. The training data from the NMT task is mostly from news and web sources, which is why our vocabulary contains many but not all Chinese characters. The characters in the modern simplified Chinese sometimes can be visually similar to their predecessors in traditional Chinese, i.e. they share visual components with a related meaning. Therefore, we can also evaluate the capability of the models to generalize to unseen characters.

Data and Experimental Setup
For the experiments we use the WordSim-240, WordSim-296 and SimLex-999 datasets provided and described by Su and Lee (2017). Due to the use of traditional Chinese we are missing at least one character in 430 out of 1536 test examples. On average, we are missing 1.1 characters per word, where each word has in average 2.13 characters. We split the data into word pairs in which all characters are completely covered (seen words), and into word pairs where at least one character is not seen (unseen words). We evaluate unseen words in two settings: either we remove the unseen characters (seen chars) or not (all chars). We did not train any model for this experiments. Most of the words in the dataset are composed of multiple characters, therefore, we average the output of the character encoder into a single wordvector. Subsequently, we compute the cosine similarity between these vectors of word pairs.

Results
The results in Table 7 show that, for words with seen characters, the lookup embeddings correlate better with human judgments of similarity than the embeddings based on character encoders. However, especially the results for unseen words in WordSim-240 show that the character encoders can generalize to all characters. Surprisingly, the FC2L model, the model without CNNs, yields the best results. In the results reported by Su and Lee (2017), the lookup embedding model SG-EMB performs much better than EMB. However, they learn a RNN-based character composition, while we average the embeddings. SG-EMB 0.59 n/a n/a 0.59 n/a n/a 0.36 n/a n/a SG-CNN 0.34 n/a n/a 0.36 n/a n/a 0.24 n/a n/a Table 7: Results for semantic similarity on the WordSim-240, WordSim-296 and SimLex-999 data. Lookup SkipGram (SG-EMB) and SkipGram char. encoder (SG-CNN) reported by Su and Lee (2017).

Conclusion
We have shown that it is possible to compute useful text representations from visual features, i.e., pixels of Chinese characters. We used neural machine translation as a framework for training and evaluating token representations. In the NMT experiment, we found that representations from visual features are competitive to lookup embeddings in a standard NMT model. In contrast to our expectations, the visual features outperform lookup embeddings on high-frequency characters, but are weaker in low-frequency characters. We conjecture that one of the reasons is that shared visual features between characters can have related meanings or related syntactic functions, but also the opposite can be true. For example, 沐 is the reflexive verb of "washing" and it contains the phonetic component 木 , which can also mean "tree". Therefore, the visual information for rare characters introduces probably as many difficulties as it can be helpful. We performed additional experiments akin to prior studies and could show that the advantage of a representation by visual features is the ability to generalize to unseen Chinese characters. Also, we could show that joint part-of-speech tagging and segmentation achieved similar results with representation from visual features.
With regard to the character encoder architecture, we find only a slight advantage of using CNNs to simply using two fully connected layers in the NMT experiment, while 2 fully connected layers outperform CNNs in the similarity experiment. Whether or not there is an advantage could only be answered by an exhaustive hyperparameter search.
Future work is to explore how we can create a model that can distinguish helpful from unhelpful visual information for rare characters.