Neural Models for Detecting Binary Semantic Textual Similarity for Algerian and MSA

We explore the extent to which neural networks can learn to identify semantically equivalent sentences from a small variable dataset using end-to-end training. We collect a new noisy non-standardised user-generated Algerian (ALG) dataset and also translate it to Modern Standard Arabic (MSA), which serves as its regularised counterpart. We compare the performance of various models on both datasets and report the best performing configurations. The results show that relatively simple models composed of 2 LSTM layers outperform by far other more sophisticated attention-based architectures, for both ALG and MSA datasets.


Introduction
Detecting Semantic Textual Similarity (STS) aims to predict a relationship between a pair of sentences based on a semantic similarity score. It is a well-established problem (Agirre et al., 2012) which deals with text comprehension and which has been framed and tackled in different ways (Beltagy et al., 2013, 2014). In this work we focus on deep learning approaches. For example, Baudis and Šedivý (2016) frame the problem as sentence-pair scoring using binary or graded scores indicating the degree to which a pair of sentences are related.
Solutions to detecting semantic similarity benefit from the recent success of neural models applied to NLP and have achieved new state-of-the-art performance (Parikh et al., 2016; Chen et al., 2017). However, so far the task has been explored only on fairly large, well-edited labelled data in English. This paper addresses a largely unexplored question: the application of neural models to detect binary STS from small labelled datasets. We take the case of the language used in Algeria (ALG), which is an under-resourced language with several linguistic challenges. ALG is a collection of local colloquial varieties with a heavy use of code-switching between different languages and language varieties including Modern Standard Arabic (MSA), non-standardised local colloquial Arabic, and other languages like French and Berber, all written in Arabic script, normally without the vowels.
ALG and MSA are two Arabic varieties which differ lexically, morphologically, syntactically, etc., and therefore represent different challenges for NLP. For instance, ALG and MSA share some morphological features, but at the same time the same morphological forms have different meanings. For example, a verb in the 1st person singular in ALG has the same form as the 1st person plural in MSA. The absence of morpho-syntactic analysers for ALG makes it challenging to analyse such texts, especially when ALG is mixed with MSA. Furthermore, this language is not documented, i.e., it does not have lexicons, standardised orthography, or written morpho-syntactic rules describing how words are formed and combined to form larger units. The nonexistence of lexicons to disambiguate the senses of a word based on its language or language variety makes resolving lexical ambiguity challenging for NLP because relying on exact word form matching is misleading.
(1) b. I spent one week at my parents' house and when I came back I found that my son made a big mess. After that my husband changed his opinion and never allowed me to stay over night (at my parents' house).
(2) b. In Mawlid we prepare Couscous for lunch, and you what will you prepare (for lunch)?
In many cases, while the same word form has several meanings depending on its context, different word forms have the same meaning. As an illustration, consider examples (1) and (2) which are user-generated texts taken from our corpus (Section 3.1.1). In (1), the same word form " " occurs three times with different meanings: "house", "made", and "changed" respectively. Whereas in (2), the different word forms " " and " " both mean "lunch".
We mention these examples to provide a basic background for a better understanding of the challenges faced while processing this kind of real-world data using the current NLP approaches and systems that are designed and trained mainly on well-edited standardised monolingual corpora. We could, for instance, distinguish the meanings of " " in (1) if we knew that the 1st occurrence is a noun and the two others are verbs. Likewise, if we had a tool to distinguish between ALG and MSA, it would be easier to detect the meaning of " " as "lunch" in ALG rather than the MSA meaning "tomorrow".
Traditional models for detecting STS cannot be applied to such data because they require existing resources and tools, such as a tokeniser, stemmer, PoS tagger, etc., to pre-process the data and extract useful features, assuming that the data is correctly spelled (standardised orthography). Thus using deep neural networks (DNNs) is promising because representations can be learned in an unsupervised way. In particular, when trained end-to-end, inputs are mapped directly to the desired outputs without the need to handcraft features. Nevertheless, this learning approach based on pattern matching requires a lot of data to learn useful patterns. Besides, there are only a few cleaned and labelled textual corpora available for some languages, and creating new ones is labour intensive.
Our contributions are as follows. (i) We introduce a newly built (small) ALG dataset for STS. (ii) We compare the performance of different DNN configurations on this dataset, namely: various combinations of Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), pre-training of embeddings, including a replication of two new state-of-the-art attention models. (iii) We test whether increasing the dataset size helps. (iv) We test whether language regularisation helps. For this purpose, we run the same experiments on a regularised and comparable MSA translation of the ALG dataset.
The paper is structured as follows. In Section 2, we briefly review some STS applications. In Section 3, we describe our experimental setup including data and models. In Section 4, we discuss the results and conclude with our future plans in Section 5.

Related Work
Diverse techniques and formalisms have been used to deal with various semantic-related tasks. Among others, machine learning has been applied to detect semantic textual relatedness such as Textual Entailment (TE) (Nielsen et al., 2009), STS (Agirre et al., 2016), Paraphrase Identification (PI) (Liang et al., 2016), etc. Earlier systems use a combination of various handcrafted features and are trained on relatively small datasets. For example, Dey et al. (2016) use Support Vector Machines with a set of lexical, syntactic, semantic and pragmatic features. As discussed earlier, these features are not available from our dataset.
These tasks have recently attracted more attention as DNNs became practical, mainly due to the availability of large labelled datasets such as the Stanford Natural Language Inference corpus (SNLI) containing 570K sentence pairs (Bowman et al., 2015), Sentences Involving Compositional Knowledge (SICK) containing about 10K sentence pairs (Marelli et al., 2014), the Microsoft Research WikiQA Corpus (WIKIQA) containing more than 23K sentence pairs (Yang et al., 2015), the Quora dataset released as part of a Kaggle competition consisting of 400K potential question duplicate pairs, and the Microsoft Research Paraphrase corpus (MSRP) consisting of more than 5K sentence pairs (Dolan and Brockett, 2005).
We follow the approach of Baudis and Šedivý (2016) who consider that several tasks dealing with detecting semantic relatedness are technically similar and can be formulated as sentence-pair scoring. They propose a generic framework for text comprehension for evaluating and comparing existing systems. Several DNN systems have been proposed. For instance, Mueller and Thyagarajan (2016) propose a siamese recurrent architecture using Manhattan LSTM (MaLSTM) for STS. They use word embeddings supplemented with synonymy information, LSTM and Manhattan distance to compose sentence representations.
Additionally, complex DNN systems with various attention mechanisms have been proposed to deal with more than one semantic similarity task at the same time. For instance, Yin et al. (2015) apply attention to represent mutual influence between the input sentence pairs. Similarly, Parikh et al. (2016) propose the Decomposable Attention Model (DecompAtten) which relies on alignment using neural attention to decompose the task of natural language inference into sub-tasks which are aggregated and used to predict the output. In the same direction, Chen et al. (2017) propose the Enhanced Sequential Inference Model (ESIM) composed of a bidirectional LSTM (BiLSTM) encoder, and a soft alignment which computes attention weights to determine the relevance between two input sentences. Then they use another BiLSTM layer to compose local inference information and aggregate the output by applying average and max pooling, and concatenating all in one vector.
All preceding models involve considerable sophistication of design and sometimes require specific dataset annotation. This is to say they are normally trained on large well-edited and labelled datasets that are available for English but are unavailable for most other languages. Unlike the previous work, we will compare the performance of two presumably best performing architectures to simpler architectures similar to MaLSTM but with different additional components on a small unedited dataset.

ALG STS data
To the best of our knowledge, there is no ready-to-use ALG data for any semantic similarity related task prior to this work. As a basis we use an extended version of the ALG unlabelled dataset (Adouane et al., 2018) which currently contains 408,832 unedited short colloquial texts (more than 6 million words) collected from online discussion forums. For the STS task we created a dataset of 3,000 sentence pairs as follows. We randomly selected 1,000 sentences from the ALG unlabelled data, including various topics and text lengths. We asked two ALG native speakers to produce for each given sentence two more sentences: one which is semantically equivalent and another which can be semantically similar but not equivalent, i.e., it could include the same words or could be about the same topic.
(3) b. No, it is not beautiful, pink is outdated.
I do not like pink, it is not fashionable.
(4) b. I offered to my mother a chocolate pie.
I like the chocolate pie that my mother baked.
In (3), the two sentences are semantically equivalent, but in (4) the two sentences are roughly about the same topic and include "chocolate pie", "mother" and "I", while some important information differs, such as who did what.
The annotators were free to use whatever words as long as the produced sentences sounded natural to them and the above instructions were respected. We provided them with two examples of the desired sentences and explained the difference. We combined all the sentences and created 3,000 unique sentence pairs.
In the second part of dataset creation, we asked three different native speakers to provide a similarity score between 0 and 5 for each sentence pair following the guidelines used in the SemEval-2016 shared task (Agirre et al., 2016). Finally, another annotator performed manual checking and majority voting of the annotations.
Because the annotators assigned scores according to their judgement, the resulting data is not balanced in terms of the number of instances per class (0-5) as shown in Table 1. The corpus contains 36,767 words, 7,074 unique words and an average sentence length of 5.19 words or 34 characters.

| Score | Interpretation | #Pairs |
|---|---|---|
| 0 | The two sentences are completely dissimilar. | 1,550 |
| 1 | The two sentences are not equivalent, but are on the same topic. | 237 |
| 2 | The two sentences are not equivalent, but share some details. | 140 |
| 3 | The two sentences are roughly equivalent, but some important information differs. | 63 |
| 4 | The two sentences are mostly equivalent, but some unimportant details differ. | 16 |
| 5 | The two sentences are completely equivalent, as they mean the same thing. | 994 |

Table 1: Annotation guidelines and the number of instances in the ALG STS dataset.
We first tried to predict the six graded similarity scores as multi-class STS, but the systems (Section 3.2) only predicted the most frequent classes, namely scores 0 and 5. This behaviour suggests that, given the size of the dataset and the number of instances for each class, the classes are not distinguishable enough. Therefore, we re-framed the task as a binary STS: either two sentences are semantically equivalent or not, rather than predicting their graded similarity (Agirre et al., 2015; Xu et al., 2015). To this end, we merged all scores which do not capture semantic equivalence (0 to 4) into a single class, and refer to them as non-equivalent. The remaining score of 5 stands on its own as completely equivalent. The resulting binary labelled data contains 994 equivalent sentence pairs and 2,006 non-equivalent sentence pairs.
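The merging step can be illustrated with a minimal sketch. The per-score counts are copied from Table 1; the function and variable names are hypothetical and chosen only for this illustration.

```python
# Number of sentence pairs per graded score, copied from Table 1.
SCORE_COUNTS = {0: 1550, 1: 237, 2: 140, 3: 63, 4: 16, 5: 994}


def to_binary_label(score: int) -> int:
    """Map a graded score (0-5) to 1 (equivalent) or 0 (non-equivalent)."""
    if score not in range(6):
        raise ValueError(f"score must be in 0..5, got {score}")
    return 1 if score == 5 else 0


# Aggregate the graded counts into the two binary classes.
binary_counts = {0: 0, 1: 0}
for score, n_pairs in SCORE_COUNTS.items():
    binary_counts[to_binary_label(score)] += n_pairs

print(binary_counts)  # {0: 2006, 1: 994}
```

The totals reproduce the 2,006 non-equivalent and 994 equivalent pairs stated above.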

MSA STS data
Contrary to ALG, MSA is a well-represented Arabic variety with standardised spelling. We use a large MSA Wikipedia corpus 2 consisting of more than 52 million tokens. We automatically removed punctuation and all words written in non-Arabic script. We refer to this corpus as MSA unlabelled data.
We also created a labelled STS corpus for MSA by commissioning another pair of ALG native speakers to faithfully translate the ALG STS dataset into MSA. They were instructed to keep the order of words and structures as close as possible to the ALG sentences without changing the meaning. We manually checked the quality of the translation, corrected some minor misspellings and checked the corresponding similarity scores (0-5). We proceeded in the same way as for ALG and created a binary MSA STS dataset including equivalent and non-equivalent sentence pairs.
2 The MSA corpus was downloaded from: http://goo.gl/d7pxZb.
Both binary and multi-class STS MSA datasets have the same number of sentence pairs as their corresponding ALG datasets. However, the MSA datasets have a smaller vocabulary, consisting of only 5,527 unique words from a total of 37,832 words. The average sentence length is 6.84 words or 33.26 characters. The difference in the vocabulary size is mainly due to misspellings and spelling variations in the ALG corpus, since ALG is a non-standardised language. Yet both ALG and MSA datasets have relatively short sentences and they are about the same topics since one is a translation of the other.

Models
All models have the same basic structure. They consist of two identical siamese networks, one for each input sentence as shown in Figure 1. The main differences between the models are in the embeddings, the sentence encoder, the distance measure, and the objective function for the final prediction.

Embeddings
We use two kinds of embedding layers. First, an embedding layer trained only on the training data, based either on characters or words, and initialised either with a uniform or a normal distribution. We refer to these embeddings as trainable, in contrast to pre-trained embeddings. Second, we pre-trained word2vec and FastText embeddings on the larger unlabelled data mentioned in Section 3.1, using the Gensim (Řehůřek and Sojka, 2010) and FastText (Bojanowski et al., 2016) libraries. For word2vec embeddings, we used a context size of 5 words, a minimum occurrence of 1 and a dimension of 300. For FastText embeddings, we used a dimension of 300, character n-grams (subwords) of 3 to 5 characters, and a context size of 5 words, and trained for 200 epochs. The goal of using pre-trained word embeddings is to test whether we can make use of the large unlabelled corpora. 3
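To make the subword range concrete, the following is a minimal sketch of how character n-grams of 3 to 5 characters can be extracted from a word, with the boundary markers `<` and `>` described by Bojanowski et al. (2016). It only illustrates the subword units and is not the FastText library itself; the function name and the English example word are assumptions made for this illustration.

```python
from typing import List


def char_ngrams(word: str, min_n: int = 3, max_n: int = 5) -> List[str]:
    """Return all character n-grams of length min_n..max_n of a word.

    The word is wrapped in '<' and '>' so that prefixes and suffixes are
    distinguished from word-internal character sequences.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


print(char_ngrams("where", 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>']
print(len(char_ngrams("where")))   # 5 trigrams + 4 four-grams + 3 five-grams = 12
```

Because spelling variants of the same word share many such n-grams, subword units are one way to relate non-standardised word forms.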

Sentence Encoders
We use either an RNN or a CNN with different configurations to encode each sentence and output a representation for each. The sentence encoders are identical for both sentences and share weights. Here are some of the encoders that we experimented with.
RNN-based encoder: a stack of standard and/or bidirectional LSTM layers with 300 units and a dropout rate of 3%.
CNN-based encoder: a stack of convolution layers with 60 filters of size 5, a ReLU activation and a dropout rate of 10%, followed by max pooling with a pool size of 3, optionally followed by global average pooling and global max pooling multiplied together.
CNN-RNN-based encoder: a combination of the RNN and CNN encoders in which we stack a number of convolution layers with 60 filters of size 5, a ReLU activation and a dropout rate of 10%, followed by max pooling with a pool size of 3 and a number of RNN layers (either standard or bidirectional LSTMs).
Attention-based encoder: roughly put, the idea of an attention mechanism is to attend to some parts of an input/output when deriving its representation (Bahdanau et al., 2014). We implement the Decomposable Attention (DecompAtten) and Enhanced Sequential Inference Model (ESIM) models, as described in Section 2.
3 The annotated data and the pre-trained embeddings are available from the 1st author.
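For the CNN-based encoder, the convolution, ReLU and max-pooling steps can be sketched in plain Python for a single filter. This is an illustration only: the padding mode ("valid") and the pooling stride (equal to the pool size) are assumptions, since they are not specified above, and the toy input and filter values are made up.

```python
from typing import List

Vector = List[float]


def conv1d_relu(seq: List[Vector], kernel: List[Vector], bias: float = 0.0) -> List[float]:
    """Slide one filter over the sequence (no padding) and apply ReLU."""
    width = len(kernel)
    out = []
    for start in range(len(seq) - width + 1):
        total = bias
        for offset in range(width):
            total += sum(x * w for x, w in zip(seq[start + offset], kernel[offset]))
        out.append(max(0.0, total))
    return out


def max_pool1d(values: List[float], pool_size: int = 3) -> List[float]:
    """Non-overlapping max pooling; a trailing incomplete window is dropped."""
    return [max(values[i:i + pool_size])
            for i in range(0, len(values) - pool_size + 1, pool_size)]


# Toy input: 11 time steps with 2-dimensional embeddings (made-up values).
sequence = [[float(t), 1.0] for t in range(11)]
kernel = [[0.0, 1.0]] * 5                       # one filter of size 5
feature_map = conv1d_relu(sequence, kernel)     # 11 - 5 + 1 = 7 positions
pooled = max_pool1d(feature_map, pool_size=3)   # 7 // 3 = 2 pooled values

print(len(feature_map), pooled)  # 7 [5.0, 5.0]
```

Under these assumptions, an input shorter than the filter size yields an empty feature map, which is worth keeping in mind for very short sentences.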

Distance
The distance component serves to compose the sentence representations. We use standard distances such as Euclidean distance, Manhattan distance, and Cosine similarity.
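The three measures can be written out as a minimal sketch over two sentence vectors. The function names and the example vectors are chosen for illustration only.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u: Sequence[float], v: Sequence[float]) -> float:
    """Sum of the absolute differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product divided by the product of the vector norms."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


u, v = [1.0, 2.0, 2.0], [2.0, 4.0, 4.0]
print(euclidean(u, v))          # 3.0
print(manhattan(u, v))          # 5.0
print(cosine_similarity(u, v))  # 1.0 (same direction)
```

Note that the first two are distances (smaller means more similar), whereas cosine is a similarity (larger means more similar).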

Dense
Instead of using a distance measure between the sentence representations, we compose the two sentence representations by multiplication (multp), subtraction (subtr), summation (sum), or concatenation (conct) as in the ESIM model. This operation is followed by a dense layer. We indicate that this layer is optional by using a dotted frame in Figure 1. When it is used, we use a sigmoid activation with a binary cross-entropy loss.
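The four composition operations, the sigmoid dense layer and the binary cross-entropy loss can be sketched as follows. This is a simplified illustration for a single output unit; the weights, vectors and function names are hypothetical, and the operation labels (multp, subtr, sum, conct) are the ones used above.

```python
import math
from typing import List


def compose(u: List[float], v: List[float], mode: str) -> List[float]:
    """Combine two sentence representations into one feature vector."""
    if mode == "multp":
        return [a * b for a, b in zip(u, v)]
    if mode == "subtr":
        return [a - b for a, b in zip(u, v)]
    if mode == "sum":
        return [a + b for a, b in zip(u, v)]
    if mode == "conct":
        return list(u) + list(v)
    raise ValueError(f"unknown composition mode: {mode}")


def dense_sigmoid(features: List[float], weights: List[float], bias: float) -> float:
    """Single-unit dense layer with a numerically stable sigmoid."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def binary_cross_entropy(y_true: int, y_prob: float, eps: float = 1e-7) -> float:
    """Loss for one example; probabilities are clipped to avoid log(0)."""
    p = min(max(y_prob, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


u, v = [1.0, 2.0], [3.0, 4.0]
print(compose(u, v, "multp"))  # [3.0, 8.0]
print(compose(u, v, "subtr"))  # [-2.0, -2.0]
print(compose(u, v, "sum"))    # [4.0, 6.0]
print(compose(u, v, "conct"))  # [1.0, 2.0, 3.0, 4.0]

prob = dense_sigmoid(compose(u, v, "subtr"), weights=[0.0, 0.0], bias=0.0)
print(prob)  # 0.5 with all-zero weights
```

Subtraction and concatenation depend on the order of the two sentences, whereas multiplication and summation do not.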
Except for the pre-trained embeddings, all models are trained end-to-end for 300 epochs using a batch size of 64 and the Adam optimiser with a learning rate of 0.001.

Results and Discussion
We randomly selected from the binary ALG STS dataset 250 sentence pairs of each class (equivalent and non-equivalent) as the test set (500 in total), 200 sentence pairs as a development set, and the remaining 2,300 sentence pairs as a training set. Note that balancing the test set is not essential. Likewise, we split the binary MSA STS data by taking the corresponding translations for each instance in the ALG dataset.
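A minimal sketch of such a split is given below. It assumes that only the test set is balanced by class and that the development set is drawn at random from the remainder, which is not stated explicitly above; the placeholder data and function name are hypothetical.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str, int]  # (sentence 1, sentence 2, binary label)


def split_dataset(pairs: List[Pair], n_test_per_class: int = 250,
                  n_dev: int = 200, seed: int = 0):
    """Return (train, dev, test) with a class-balanced test set."""
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    test, rest = [], []
    taken = {0: 0, 1: 0}
    for pair in shuffled:
        label = pair[2]
        if taken[label] < n_test_per_class:
            test.append(pair)
            taken[label] += 1
        else:
            rest.append(pair)
    return rest[n_dev:], rest[:n_dev], test


# Placeholder pairs with the class sizes of the binary ALG dataset.
pairs = ([(f"s{i}a", f"s{i}b", 1) for i in range(994)]
         + [(f"s{i}a", f"s{i}b", 0) for i in range(994, 3000)])
train, dev, test = split_dataset(pairs)
print(len(train), len(dev), len(test))  # 2300 200 500
```

Keeping the indices of each subset makes it straightforward to select the corresponding translations for the MSA split.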
The hyper-parameters reported in Section 3.2 were selected based on the reported common values in the literature for similar tasks and fine-tuned on the development set. Moreover, because of the stochastic nature of the neural models, where the results vary between training runs, we report the average performance on the test set over 10 training runs for the best performing models trained on both training and development data, following Baudis and Šedivý (2016) and Yin et al. (2015).
Table 2: Average accuracy of the models (%). Acc is accuracy with non-augmented training data and Acc-aug with the augmented training data.
In order to increase the size of the training data and to boost the instances of the minority class (equivalent sentence pairs), we duplicated equivalent sentence pairs by reversing their order so that each sentence pair appears only once in the same order. This is a standard data augmentation practice used to mitigate the limited availability of labelled training data (Yin et al., 2015; Mueller and Thyagarajan, 2016). The augmented training set contains 3,244 sentence pairs (1,488 equivalent and 1,756 non-equivalent pairs). Because there is no previous work reported for ALG on a similar task, we resort to the binary random guess, namely 50% as a baseline. We report the overall accuracy for the same models with and without the augmented training data, for both ALG and MSA separately. In Table 2, we only report the models that outperform the baseline.
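The augmentation can be sketched as follows. The reported counts are consistent with applying it to the 2,500 pairs left after removing the test set (994 - 250 = 744 equivalent and 2,006 - 250 = 1,756 non-equivalent pairs); this reading is an inference from the numbers rather than something stated explicitly, and the placeholder data is hypothetical.

```python
from typing import List, Tuple

Pair = Tuple[str, str, int]  # (sentence 1, sentence 2, binary label)


def augment_with_reversed(pairs: List[Pair]) -> List[Pair]:
    """Add a reversed copy (s2, s1) of every equivalent pair (label 1)."""
    augmented = list(pairs)
    for first, second, label in pairs:
        if label == 1:
            augmented.append((second, first, label))
    return augmented


# Placeholder pool: 744 equivalent and 1,756 non-equivalent pairs.
pool = ([(f"a{i}", f"b{i}", 1) for i in range(744)]
        + [(f"a{i}", f"b{i}", 0) for i in range(744, 2500)])
augmented = augment_with_reversed(pool)
n_equiv = sum(1 for _, _, label in augmented if label == 1)

print(len(augmented), n_equiv, len(augmented) - n_equiv)  # 3244 1488 1756
```

Reversal adds new training instances only for order-sensitive models, for example those composing the two representations by subtraction or concatenation.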

Binary STS for ALG
Non-augmented data The results show that char-RNNs composed of 2 standard LSTM layers and a trainable embedding layer with normal distribution (1) and (2) perform worse than their word-based counterparts (3) and (4). This result contradicts the conclusion that character models are better at modelling morphologically rich languages (Vylomova et al., 2017), and consequently that they are better at dealing with misspellings and capturing spelling variations. The best performance is achieved by a word-based 2-LSTM layer encoder and a trainable embedding layer (3), using multiplication as a distance, with an accuracy of 85.06%. Nevertheless, char-RNN performs better with subtraction rather than multiplication as a distance (2). Adding pre-trained embeddings word2vec (5) and FastText (6) to the word-level RNN in (4) decreases the accuracy by 2.33 and 2.05 points respectively. This effect could be caused by the noise in the ALG unlabelled data on which the embeddings were trained.
A 1-layer CNN with no pre-trained embeddings and using summation of the sentence representations as a distance (7) performs the best compared to the other options with a CNN encoder, but overall it performs quite poorly. Likewise, combining 1-CNN and 1-LSTM layers as encoder (not shown in Table 2) does not improve over using only a 1-CNN layer. The models predict all the test sentence pairs as non-equivalent. In other words, the network could not learn enough to properly distinguish between the two classes.
These results contrast with those reported by Kadlec et al. (2015), namely that CNN models perform better with little data compared to RNN models. However, it is hard to quantify what is considered to be small apart from the number of examples. In general, neural models learn useful features when they are trained on enough representative data. That is to say, it is not just a question of data size, but it is more about the complexity of the features and the functions that they should learn. In our case, we suspect that the sparsity and the noise in the data are making learning harder for CNN models.
Regarding attention-based encoders, ESIM (9) outperforms DecompAtten (8), and both perform slightly better than the baseline. The poor performance of these models with little noisy data could be related to the fact that attending to some parts of a sentence or focusing on surface form similarity is misleading, since the same word form can have different meanings and different word forms can have the same meaning, especially since the data does not contain named entities, punctuation or digits which could help alignment.
Augmented data All models benefit from the augmented data, except word-CNN (7) for which the gain is not clear. The performance of the char-  Models with subtraction as a distance benefit the most from the added data. Similar to their behaviour on non-augmented data, adding pre-trained embeddings slightly decreases the performance of the model compared to not adding them. Comparing embeddings, word2vec causes a slightly larger drop in the performance of word-RNN than FastText. Attention-based models also benefit from the added data, but the gain is larger for DecompAtten than for ESIM.
Looking at the performance of the models for each class shown in Table 3, it is clear that the RNN models are doing quite well for both classes whereas CNN and attention-based models, not included due to space limits, are too biased towards the non-equivalent class. Figures in bold are meant to highlight the gain due to pre-trained embeddings.
Error analysis of the word-RNN model (4) shows that 7 equivalent sentence pairs are misclassified as non-equivalent and 28 non-equivalent sentence pairs are misclassified as equivalent. We manually checked the errors and found that most of the non-equivalent pairs misclassified as equivalent have at least one word in common, as in example (5), but the words have a different meaning depending on their context. However, distinguishing between word senses is hard because the context is not entirely sufficient. Example (6) is an equivalent pair misclassified as non-equivalent. The common pattern among the misclassified examples is that they have no exact word overlap. This could explain why attention-based encoders, with some form of alignment, fail to generalise to new instances. Probably there is a bias to the form with one meaning when senses are not sufficiently differentiated.
(5) b. I saw a weird thing.
It is weird that I did not see it.
(6) b. I am thinking when the grant will be received.
I wonder when the grant will be paid.
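The error counts reported for the word-RNN model (4) can be turned into per-class figures with a short calculation. This sketch assumes that the 7 and 28 misclassified pairs refer to a single run on the balanced test set of 250 pairs per class; which run they come from is not stated, so the derived percentages are illustrative rather than reported results.

```python
from typing import Dict


def per_class_summary(n_per_class: int, missed_equivalent: int,
                      missed_non_equivalent: int) -> Dict[str, float]:
    """Derive accuracy and per-class recall from misclassification counts."""
    correct_eq = n_per_class - missed_equivalent
    correct_neq = n_per_class - missed_non_equivalent
    return {
        "accuracy": (correct_eq + correct_neq) / (2 * n_per_class),
        "equivalent_recall": correct_eq / n_per_class,
        "non_equivalent_recall": correct_neq / n_per_class,
    }


summary = per_class_summary(n_per_class=250, missed_equivalent=7,
                            missed_non_equivalent=28)
for name, value in summary.items():
    print(f"{name}: {value:.1%}")
# accuracy: 93.0%
# equivalent_recall: 97.2%
# non_equivalent_recall: 88.8%
```

Under this assumption, most errors are non-equivalent pairs predicted as equivalent, which matches the observation about shared words above.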

Binary STS for MSA
We now evaluate the performance of the same DNN configurations on parallel regularised MSA data using the same hyper-parameters as in Section 4.1. The results are reported in Table 2.
Non-augmented data Again, the word-RNN with multiplication (3) performs the best with an accuracy of 85.19%. The char-RNN (1) with the same settings achieves an accuracy of only 59.65%. Using subtraction, the char-RNN (2) slightly outperforms the word-RNN (4), with 69.02% and 68.90% accuracy respectively. Adding FastText (6) and word2vec (5) pre-trained embeddings causes the accuracy of the best word-RNN (4) of 68.90% to decrease slightly to 68.06% and 67.86% respectively. This could be due to the embeddings not distinguishing between the different senses of the same word, i.e., they output one vector representation for each word form. Also, the large MSA corpus on which the embeddings were trained may have a different topical distribution than the MSA STS data. As with the ALG data, CNN (7) and attention-based encoders (8-9) behave the same.
Augmented data Trained on augmented data, models with subtraction yield the best performance compared to multiplication, and word-RNN (4) outperforms char-RNN (2)   and 71.37% accuracy respectively. Unlike when using the ALG data, pre-trained embeddings slightly improve the performance of (4) with 0.37 (6) and 1.26 (5) points gain in the error reduction respectively. The positive effect of the pre-trained models could be due to the fact that more regularities are captured. Training on augmented MSA data does not yield any significant gain over training on non-augmented data for CNN (7) and attention-based models (8-9).
In Table 4 we report the performance of each model per class. Due to space limits, we do not include the CNN and attention-based models which are again struggling with the equivalent class and are biased towards the non-equivalent class. The gain from the pre-trained embedding is in bold. The models perform almost the same for both classes but slightly worse than with the ALG data.
Example (7) is a non-equivalent sentence pair misclassified as equivalent, and example (8) is an equivalent pair misclassified as non-equivalent by the word-RNN model (5).
(7) b. I also tried the cake and it was great, I discovered that my kids finished it.
I tested her many times and she was jealous and envious.
(8) b. Wish they change this presenter.
Hope they will replace this presenter.
It is hard to explain why these examples are misclassified, except that there is not enough context to discover the meaning of the words. For instance, in (8) the words in bold " " , " " are synonyms in these two sentences, and the two sentences have two more word overlaps " " and " " with the same meaning. This should help classify the two sentences as equivalent, but it is not the case, possibly because their contexts are different.

Conclusion and Future Work
We have presented a new STS dataset for ALG user-generated short texts and its MSA translation. We then described the neural network models trained end-to-end with different configurations and compared their performances on a binary STS task. The results show that relatively simple model architectures, composed of two word-based LSTM layers with subtraction as the explicit similarity measure used in the training task, suit our data better than the other more sophisticated architectures, which might require more data to achieve better performance.
We ran the same experiment on the MSA data, but the results were not really different from the ALG data. However, pre-trained embeddings performed better with MSA, probably because the language is more regular and knowing some structure ahead helps. The performance improved with more data for the minority class (equivalent sentence pairs) for both ALG and MSA. However, surprisingly, the gain of some models with ALG is greater than their gain with MSA. This is probably caused by the noisiness and the sparsity of the data, the linguistic differences between MSA and ALG, the data size, or all these factors together. Further and deeper experiments and analyses are needed for a better understanding of the results.
Overall, the results of the end-to-end training are promising and could be generalised to other related languages or language varieties with the same under-resourced settings. As future work, we want to explore ways to improve the learning capability of neural models from small noisy datasets without handcrafted features, for example by reducing the noise in the colloquial data (ALG) by normalising spelling variation.