Character-Based Neural Networks for Sentence Pair Modeling

Sentence pair modeling is critical for many NLP tasks, such as paraphrase identification, semantic textual similarity, and natural language inference. Most state-of-the-art neural models for these tasks rely on pretrained word embedding and compose sentence-level semantics in varied ways; however, few works have attempted to verify whether we really need pretrained embeddings in these tasks. In this paper, we study how effective subword-level (character and character n-gram) representations are in sentence pair modeling. Though it is well-known that subword models are effective in tasks with single sentence input, including language modeling and machine translation, they have not been systematically studied in sentence pair modeling tasks where the semantic and string similarities between texts matter. Our experiments show that subword models without any pretrained word embedding can achieve new state-of-the-art results on two social media datasets and competitive results on news data for paraphrase identification.

Most, if not all, state-of-the-art neural models for sentence pair modeling (Yin et al., 2016; Parikh et al., 2016; He and Lin, 2016; Tomar et al., 2017; Shen et al., 2017) have achieved their best performance by using pretrained word embeddings, while results without pretraining are rarely reported. In fact, we will show that, even with fixed randomized word vectors, the pairwise word interaction model (He and Lin, 2016), which is based on contextual word vector similarities, can still achieve strong performance by capturing identical words and similar surface context features. Moreover, pretrained word embeddings generally have poor coverage in the social media domain, where the out-of-vocabulary rate often exceeds 20% (Baldwin et al., 2013).
We investigated the effectiveness of subword units, such as characters and character n-grams, in place of words for vector representations in sentence pair modeling. Though it is well known that subword representations are effective for modeling out-of-vocabulary words in many NLP tasks with a single sentence input, such as machine translation (Luong et al., 2015; Costa-jussà and Fonollosa, 2016), language modeling (Ling et al., 2015; Vania and Lopez, 2017), and sequence labeling (dos Santos and Guimarães, 2015; Plank et al., 2016), they have not been systematically studied in tasks that concern pairs of sentences. Unlike in modeling individual sentences, subword representations affect not only the out-of-vocabulary words but also, more directly, the relation between two sentences, which is calculated based on vector similarities in many sentence pair modeling approaches (more details in Section 2.1). For example, while subwords may capture useful string similarities between a pair of sentences (e.g., spelling or morphological variations: sister and sista, teach and teaches), they could also introduce errors (e.g., similarly spelled words with completely different meanings: ware and war).
To better understand the role of subword embedding in sentence pair modeling, we performed experimental comparisons that vary (1) the type of subword unit, (2) the composition function, and (3) the datasets of different characteristics. We also presented experiments with language modeling as an auxiliary multi-task learning objective, showing consistent improvements. Taken together, subword and language modeling establish new state-of-the-art results on two social media datasets and competitive results on a news dataset for paraphrase identification without using any pretrained word embeddings.

Sentence Pair Modeling with Subwords
The current neural networks for sentence pair modeling (Yin et al., 2016; Parikh et al., 2016; He and Lin, 2016; Liu et al., 2016; Tomar et al., 2017; Wang et al., 2017; Shen et al., 2017, etc.) follow a more or less similar design with three main components: (a) contextualized word vectors generated via Bi-LSTM, CNN, or attention, as inputs; (b) soft or hard word alignment and interactions across sentences; and (c) an output classification layer. The models differ in implementation details but, most importantly, capture the same essential intuition through word alignment (also encoded with contextual information): the semantic relation between two sentences depends largely on the relations of aligned chunks (Agirre et al., 2016). In this paper, we used the pairwise word interaction model (He and Lin, 2016) as a representative example and starting point; it has reported robust performance across multiple sentence pair modeling tasks and the best results by neural models on social media data (Lan et al., 2017).

Pairwise Word Interaction (PWI) Model
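Let $w^a = (w^a_1, ..., w^a_m)$ and $w^b = (w^b_1, ..., w^b_n)$ be the input sentence pair consisting of m and n tokens, respectively. Each word vector $w_i \in \mathbb{R}^d$ is initialized with a pretrained d-dimensional word embedding (Pennington et al., 2014; Wieting et al., 2015, 2016), then encoded with word context and sequence order through bidirectional LSTMs:

$$\overrightarrow{h}_i = \mathrm{LSTM}^f(w_i, \overrightarrow{h}_{i-1}) \tag{1}$$
$$\overleftarrow{h}_i = \mathrm{LSTM}^b(w_i, \overleftarrow{h}_{i+1}) \tag{2}$$
$$\overleftrightarrow{h}_i = [\overrightarrow{h}_i, \overleftarrow{h}_i] \tag{3}$$
$$h^+_i = \overrightarrow{h}_i + \overleftarrow{h}_i \tag{4}$$

For all word pairs $(w^a_i, w^b_j)$ across sentences, the model directly calculates word pair interactions using cosine similarity, Euclidean distance, and dot product over the outputs of the encoding layer:

$$D(\overrightarrow{h}^a_i, \overrightarrow{h}^b_j) = [\cos(\overrightarrow{h}^a_i, \overrightarrow{h}^b_j), \ \mathrm{L2Euclid}(\overrightarrow{h}^a_i, \overrightarrow{h}^b_j), \ \mathrm{DotProduct}(\overrightarrow{h}^a_i, \overrightarrow{h}^b_j)] \tag{5}$$

The above equation also applies to the other states $\overleftarrow{h}$, $\overleftrightarrow{h}$ and $h^+$, resulting in a tensor $D^{13 \times m \times n}$ after padding one extra bias term. A "hard" attention is applied to the interaction tensor to further enforce the word alignment, by sorting the interaction values and selecting the top-ranked word pairs. A 19-layer deep CNN then aggregates the word interaction features, followed by a softmax layer that predicts the classification probabilities.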
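As a rough illustration of how the interaction tensor gets its 13 channels, the following pure-Python sketch applies the three similarity measures to four views of the encoder states and then appends a bias channel. The function names, the toy vectors, and the constant value 1.0 used for the bias channel are assumptions made for this example; the hard attention and the CNN are not shown.

```python
import math

def dot_product(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    norm = math.sqrt(dot_product(u, u)) * math.sqrt(dot_product(v, v))
    return dot_product(u, v) / norm if norm > 0 else 0.0

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def all_states(forward, backward):
    """Return the four per-token views: forward, backward, concatenation, sum."""
    concat = [f + b for f, b in zip(forward, backward)]  # list concatenation
    summed = [[x + y for x, y in zip(f, b)] for f, b in zip(forward, backward)]
    return [forward, backward, concat, summed]

def interaction_tensor(states_a, states_b):
    """Build a list of 13 channels, each an m x n matrix of interaction values."""
    m, n = len(states_a[0]), len(states_b[0])
    channels = []
    for h_a, h_b in zip(states_a, states_b):
        for measure in (cosine, euclidean, dot_product):
            channels.append([[measure(u, v) for v in h_b] for u in h_a])
    # Extra bias channel; the constant 1.0 is an assumption of this sketch.
    channels.append([[1.0] * n for _ in range(m)])
    return channels

# Toy hidden states: sentence a has m = 2 tokens, sentence b has n = 3 tokens.
fwd_a = [[1.0, 0.0], [0.0, 1.0]]
bwd_a = [[0.5, 0.5], [1.0, 0.0]]
fwd_b = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
bwd_b = [[0.0, 1.0], [0.5, 0.5], [1.0, 1.0]]

D = interaction_tensor(all_states(fwd_a, bwd_a), all_states(fwd_b, bwd_b))
print(len(D), len(D[0]), len(D[0][0]))
```

Four state views times three measures give 12 channels, and the bias channel brings the total to 13, matching the shape stated above.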

Embedding Subwords in PWI Model
Our subword models only involve modification of the input representation layer in the pairwise interaction model. Let $c_1, ..., c_k$ be the subword (character unigram, bigram and trigram) sequence of a word w. The subword embedding matrix is $C \in \mathbb{R}^{d' \times k}$, where each subword is encoded into a d'-dimensional vector. The same subwords share the same embeddings. We considered two different composition functions to assemble subword embeddings into a word embedding:

Char C2W (Ling et al., 2015) applies a Bi-LSTM to the subword sequence $c_1, ..., c_k$; the last hidden state $\overrightarrow{h}^{char}_k$ in the forward direction and the first hidden state $\overleftarrow{h}^{char}_0$ of the backward direction are then linearly combined into the word-level embedding w:

$$w = W_f \cdot \overrightarrow{h}^{char}_k + W_b \cdot \overleftarrow{h}^{char}_0 + b \tag{6}$$

where $W_f$, $W_b$ and $b$ are parameters.
Char CNN (Kim et al., 2016) applies a convolution operation between the subword sequence matrix C and a filter $F \in \mathbb{R}^{d' \times l}$ of width l to obtain a feature map $f \in \mathbb{R}^{k-l+1}$:

$$f_j = \tanh(\langle C[*, j : j+l-1], F \rangle + b) \tag{7}$$

where $\langle A, B \rangle = \mathrm{Tr}(AB^T)$ is the Frobenius inner product, b is a bias and $f_j$ is the jth element of f. We then take the max-over-time operation to select the most important element:

$$y = \max_j f_j \tag{8}$$

After applying q filters with varied widths, we obtain the array $w = [y_1, ..., y_q]$, which is followed by a one-layer highway network to generate the final word embedding.
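The Char CNN composition can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the tanh nonlinearity, the helper names, and the toy values are choices made for the example, the highway network is omitted, and each word is assumed to have at least as many subwords as the filter is wide.

```python
import math

def conv_feature_map(C, F, bias=0.0):
    """C: list of k subword vectors (each of size d'); F: list of l filter columns.

    Returns the feature map of length k - l + 1 (assumes k >= l).
    """
    k, l = len(C), len(F)
    feature_map = []
    for j in range(k - l + 1):
        # Frobenius inner product between the window C[j : j + l] and the filter.
        inner = sum(
            c * f
            for col_c, col_f in zip(C[j:j + l], F)
            for c, f in zip(col_c, col_f)
        )
        feature_map.append(math.tanh(inner + bias))
    return feature_map

def max_over_time(feature_map):
    """Keep only the strongest response of a filter across all positions."""
    return max(feature_map)

def char_cnn_word_vector(C, filters):
    """Apply q (filter, bias) pairs and collect one max-pooled value per filter."""
    return [max_over_time(conv_feature_map(C, F, b)) for F, b in filters]

# Toy word with k = 3 subwords embedded in d' = 2 dimensions.
C = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
F_width2 = [[1.0, 0.0], [0.0, 1.0]]
F_width1 = [[1.0, 1.0]]

f = conv_feature_map(C, F_width2)
w = char_cnn_word_vector(C, [(F_width2, 0.0), (F_width1, 0.0)])
print(f, w)
```

Because each filter contributes a single max-pooled value, the length of w equals the number of filters q, regardless of how many subwords the word contains.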

Auxiliary Language Modeling (LM)
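We adapted a multi-task structure, originally proposed by Rei (2017) for sequential tagging, to further improve the subword representations in sentence pair modeling. In addition to training the model for sentence pair tasks, we used a secondary language modeling objective that predicts the next word and the previous word using a softmax over the hidden states of a Bi-LSTM as follows:

$$\overrightarrow{E}_{LM} = -\sum_{t=1}^{T-1} \log P(w_{t+1} \mid \overrightarrow{m}_t)$$
$$\overleftarrow{E}_{LM} = -\sum_{t=2}^{T} \log P(w_{t-1} \mid \overleftarrow{m}_t) \tag{9}$$

where $\overrightarrow{m}_t = \tanh(\overrightarrow{W}_{hm} \overrightarrow{h}_t)$ and $\overleftarrow{m}_t = \tanh(\overleftarrow{W}_{hm} \overleftarrow{h}_t)$. The Bi-LSTM here is separate from the one in the PWI model. The language modeling objective can be combined into sentence pair modeling through a joint objective function:

$$E_{joint} = E + \gamma (\overrightarrow{E}_{LM} + \overleftarrow{E}_{LM}) \tag{10}$$

which balances the subword-based sentence pair modeling objective E and language modeling with a weighting coefficient γ.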
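A minimal sketch of how such a joint objective could be computed is given below. It assumes that the language modeling losses are summed negative log-likelihoods and that the joint loss takes the form E + γ(forward LM loss + backward LM loss); the function names, the toy logits, and the value of γ are illustrative only.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def lm_loss(step_logits, target_ids):
    """Negative log-likelihood of the target word at each time step, summed."""
    return -sum(
        math.log(softmax(logits)[target])
        for logits, target in zip(step_logits, target_ids)
    )

def joint_loss(pair_loss, forward_lm_loss, backward_lm_loss, gamma):
    """Sentence pair loss plus the gamma-weighted language modeling losses."""
    return pair_loss + gamma * (forward_lm_loss + backward_lm_loss)

# Toy example: a two-word vocabulary with uninformative (uniform) predictions.
forward = lm_loss([[0.0, 0.0], [0.0, 0.0]], [0, 1])  # predicts the next word
backward = lm_loss([[0.0, 0.0]], [1])                # predicts the previous word
total = joint_loss(pair_loss=1.0, forward_lm_loss=forward,
                   backward_lm_loss=backward, gamma=0.1)
print(forward, backward, total)
```

A small γ keeps the paraphrase objective dominant while still letting the language modeling signal shape the subword representations.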

Datasets
We performed experiments on three benchmark datasets for paraphrase identification; each contains pairs of naturally occurring sentences manually labeled as paraphrases and non-paraphrases for binary classification:

Twitter URL (Lan et al., 2017) was collected from tweets sharing the same URL with major news outlets such as @CNN. This dataset keeps a balance between formal and informal language.

PIT-2015 (Xu et al., 2014, 2015) comes from Task 1 of SemEval 2015 and was collected from tweets under the same trending topic, which contains varied topics and language styles.

MSRP (Dolan and Brockett, 2005) was derived from clustered news articles reporting the same event in formal language.

Table 1 shows vital statistics for all three datasets.

Settings
To compare models fairly without implementation variations, we reimplemented all models in a single PyTorch framework. We followed the setups of He and Lin (2016) and Lan et al. (2017) for the pairwise word interaction model, and used the 200-dimensional GloVe word vectors (Pennington et al., 2014), trained on 27 billion words from Twitter (vocabulary size of 1.2 million words), for the social media datasets, and the 300-dimensional GloVe vectors, trained on 840 billion words from Common Crawl (vocabulary size of 2.2 million words), for the MSRP dataset. For cases without pretraining, the word/subword vectors were initialized with random samples drawn uniformly from the range [-0.05, 0.05]. We used the same hyperparameters as in the C2W (Ling et al., 2015) and CNN-based (Kim et al., 2016) compositions for subword models, except that the composed word embeddings were set to 200 or 300 dimensions, matching the pretrained word embeddings, to make the experiment results more comparable. For each experiment, we reported results with 20 epochs.

Results
Table 2 shows the experiment results on three datasets. We reported the maximum F1 score at any point on the precision-recall curve (Lipton et al., 2014), following previous work.
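For readers unfamiliar with this metric, the sketch below shows one way to compute the maximum F1 over the precision-recall curve by sweeping a decision threshold down the ranked predictions. The function name and the toy scores are assumptions for illustration; this is not the evaluation script used in the experiments.

```python
def max_f1(scores, labels):
    """Maximum F1 over all thresholds, for binary labels (1 = paraphrase)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    total_positive = sum(labels)
    best = 0.0
    true_positive = 0
    for rank, (score, label) in enumerate(ranked, start=1):
        true_positive += label
        # Tied scores share a threshold, so evaluate only at the end of a tie group.
        if rank < len(ranked) and ranked[rank][0] == score:
            continue
        if true_positive == 0:
            continue
        precision = true_positive / rank
        recall = true_positive / total_positive
        best = max(best, 2 * precision * recall / (precision + recall))
    return best

# Toy predictions: higher score means "more likely a paraphrase".
scores = [0.9, 0.8, 0.7, 0.1]
labels = [1, 0, 1, 0]
best_f1 = max_f1(scores, labels)
print(round(best_f1, 3))
```

In this toy case the best threshold accepts the top three pairs, giving a precision of 2/3, a recall of 1, and an F1 of 0.8.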

Word Models
The word-level pairwise interaction models, even without pretraining (randomized) or fine-tuning (fixed), showed strong performance across all three datasets. This reflects the effective design of the BiLSTM and word interaction layers, as well as the unique character of sentence pair modeling, where n-gram overlap is a positive signal of semantic similarity. As a reference, a logistic regression baseline with simple n-gram (also in stemmed form) overlap features can also achieve good performance on the PIT-2015 and MSRP datasets.
That said, pretraining and fine-tuning word vectors mainly matter for squeezing the last bit of performance out of word-level models.

Subword Models (+LMs)
Without using any pretrained word embeddings, subword-based pairwise word interaction models can achieve very competitive results on social media datasets compared with the best word-based models (pretrained, fixed). For MSRP, with only 9% OOV words (Table 1), the subword models do not show advantages. Once the subword models are trained with multi-task language modeling (Subword+LM), the performance on all datasets is further improved, outperforming the best previously reported results by neural models (Lan et al., 2017). A qualitative analysis reveals that subwords are crucial for out-of-vocabulary words, while language modeling ensures more semantic and syntactic compatibility (Table 3).

Combining Word and Subword Representations
In addition, we experimented with combining the pretrained word embeddings and subword models with various strategies: concatenation, weighted average, adaptive models (Miyamoto and Cho, 2016) and attention models (Rei et al., 2016).
The weighted average outperformed all others but only showed slight improvement over word-based models in social media datasets; other combination strategies could even lower the performance.
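To make the two simplest strategies concrete, the sketch below contrasts concatenation with a weighted average of a word embedding and a subword-composed embedding. The single scalar weight alpha and the function names are assumptions of this illustration; the exact weighting scheme is not specified here and may differ.

```python
def concatenate(word_vec, subword_vec):
    """Concatenation: the combined vector grows to the sum of both dimensions."""
    return list(word_vec) + list(subword_vec)

def weighted_average(word_vec, subword_vec, alpha):
    """Weighted average: both inputs must have the same dimensionality."""
    if len(word_vec) != len(subword_vec):
        raise ValueError("word and subword vectors must have the same size")
    return [alpha * w + (1.0 - alpha) * s for w, s in zip(word_vec, subword_vec)]

word_vec = [1.0, 0.0]      # stands in for a pretrained word embedding
subword_vec = [0.0, 1.0]   # stands in for a subword-composed embedding

combined_concat = concatenate(word_vec, subword_vec)
combined_avg = weighted_average(word_vec, subword_vec, alpha=0.25)
print(combined_concat, combined_avg)
```

The weighted average keeps the input dimensionality of the downstream model unchanged, whereas concatenation doubles it.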

Model Ablations
In the original PWI model, He and Lin (2016) performed pattern recognition of complex semantic relationships by applying a 19-layer deep convolutional neural network (CNN) on the word pair interaction tensor (Eq. 5). However, the SemEval task on Interpretable Semantic Textual Similarity (Agirre et al., 2016) in part demonstrated that the semantic relationship between two sentences depends largely on the relations of aligned words or chunks. Since the interaction tensor in the PWI model already encodes word alignment information in the form of vector similarities, a natural question is whether a 19-layer CNN is necessary. Table 4 shows the results of our systems with and without the 19-layer CNN for aggregating the pairwise word interactions before the final softmax layer. While in most cases the 19-layer CNN helps to achieve better or comparable performance, it comes at the expense of a ∼25% increase in training time. An exception is the character-based PWI without language model, which performs well on the PIT-2015 dataset without the 19-layer CNN and comparably to logistic regression with string overlap features (Eyecioglu and Keller, 2015). A closer look into the datasets reveals that PIT-2015 has a similar level of unigram overlap as the Twitter URL corpus (Table 5; see also Lan et al., 2017), but lower character bigram overlap (indicative of spelling variations) and lower word bigram overlap (indicative of word reordering) between the pairs of sentences that are labeled as paraphrases.
The 19-layer CNN appears to be crucial for the MSRP dataset, which has the smallest training size and is skewed toward very high word overlap (Lan et al., 2017). For the two social media datasets, our subword models have improved performance compared to pretrained word models regardless of whether the 19-layer CNN is used.
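The kind of overlap statistics discussed above can be illustrated with a short sketch. Jaccard overlap, lowercasing, whitespace tokenization, and character n-grams taken within each token are all assumptions made for this example; the exact overlap measure behind Table 5 may differ.

```python
def word_ngrams(tokens, n):
    """Set of word n-grams, each represented as a tuple of tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def char_ngrams(tokens, n):
    """Set of character n-grams taken within each token (no cross-word n-grams)."""
    return {tok[i:i + n] for tok in tokens for i in range(len(tok) - n + 1)}

def jaccard(a, b):
    """Size of the intersection over size of the union; 0.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def overlap_profile(sentence_a, sentence_b):
    tokens_a = sentence_a.lower().split()
    tokens_b = sentence_b.lower().split()
    return {
        "word_unigram": jaccard(word_ngrams(tokens_a, 1), word_ngrams(tokens_b, 1)),
        "word_bigram": jaccard(word_ngrams(tokens_a, 2), word_ngrams(tokens_b, 2)),
        "char_bigram": jaccard(char_ngrams(tokens_a, 2), char_ngrams(tokens_b, 2)),
    }

profile = overlap_profile("my sister teaches", "my sista teach")
print(profile)
```

In this toy pair the word-level overlap is low because of spelling and morphological variation, while the character bigram overlap remains high, which is the kind of signal subword models can exploit.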

Conclusion
We presented a focused study on the effectiveness of subword models in sentence pair modeling and showed competitive results without using pretrained word embeddings. We also showed that subword models can benefit from multi-task learning with simple language modeling, and established new state-of-the-art results for paraphrase identification on two Twitter datasets, where out-of-vocabulary words and spelling variations are prevalent. The results shed light on future work on language-independent paraphrase identification and multilingual paraphrase acquisition, where pretrained word embeddings on large corpora are not readily available in many languages.

C-0095. The content of the information in this document does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred. The U.S. Government is authorized to reproduce and distribute reprints for government purposes notwithstanding any copyright notation hereon.

Table 1: Statistics of three benchmark datasets for paraphrase identification. The training and testing sizes are in numbers of sentence pairs. The numbers of unique in-vocabulary (INV) and out-of-vocabulary (OOV) words are calculated based on the publicly available GloVe embeddings (details in Section 3.2).

Table 2: Results in F1 scores on the Twitter-URL, PIT-2015 and MSRP datasets. The best performance figure in each dataset is denoted in bold typeface and the second best is denoted by an underline. Without using any pretrained word embeddings, the Subword+LM models achieve better or competitive performance compared to word models.

Table 3: Nearest neighbors of word vectors under cosine similarity in the Twitter-URL dataset.

Table 4: Comparison of F1 scores between the original PWI model with the 19-layer CNN for aggregation and the simplified model without the 19-layer CNN on the Twitter-URL, PIT-2015 and MSRP datasets. The number of parameters and the training time per epoch shown are based on the Twitter URL dataset and a single NVIDIA Pascal P100 GPU.

Table 5: Character and word overlap comparison.