IISCNLP at SemEval-2016 Task 2: Interpretable STS with ILP based Multiple Chunk Aligner

The interpretable semantic textual similarity (iSTS) task adds a crucial explanatory layer to pairwise sentence similarity. With the novel system presented in this paper, we address various components of this task: chunk-level semantic alignment, along with assignment of similarity type and score for aligned chunks. We propose an algorithm, iMATCH, for the alignment of multiple non-contiguous chunks based on Integer Linear Programming (ILP). Similarity type and score assignment for pairs of chunks is done using a supervised multiclass classification technique based on a Random Forest classifier. Results show that our algorithm iMATCH has low execution time and outperforms most other participating systems in terms of alignment score. Of the three datasets, we are top ranked for the answer-students dataset in terms of overall score and have the top alignment score for the headlines dataset in the gold chunks track.


Introduction and Related Work
Semantic Textual Similarity (STS) refers to measuring the degree of equivalence in the underlying semantics (meaning) of a pair of text snippets. It finds applications in information retrieval, question answering and other natural language processing tasks. Interpretable STS (iSTS) adds an explanatory layer by measuring similarity across chunks of segmented text, leading to improved interpretability. It involves aligning multiple chunks with similar meaning across sentences, along with assigning a similarity score (0-5) and a similarity type to each alignment.
The interpretable STS task was first introduced as a pilot task in the 2015 SemEval STS task. Several approaches were proposed, including NeRoSim (Banjade et al., 2015), UBC-Cubes (Agirre et al., 2015) and Exb-Thermis (Hänig et al., 2015). For the task of alignment, these submissions used approaches based on a monolingual aligner using word similarity and contextual features (Md Arafat Sultan and Summer, 2014), JACANA, which uses phrase-based semi-Markov CRFs (Yao and Durme, 2015), and the Hungarian-Munkres algorithm (Kuhn and Yaw, 1955). Other popular approaches for monolingual alignment include a two-stage logistic-regression based aligner (Md Arafat Sultan and Summer, 2015) and techniques based on edit rate computation such as (lien Maxe Anne Vilnat, 2011) and TER-Plus (Snover et al., 2009). Bodrumlu et al. (2009) used ILP for the word alignment problem. The iSTS task in 2016 introduced the problem of many-to-many chunk alignment, where multiple non-contiguous chunks of the source sentence can align with multiple non-contiguous chunks of the target sentence, which previous monolingual alignment techniques cannot handle. We propose iMATCH, a new technique for many-to-many monolingual alignment at the chunk level that can combine non-contiguous chunks based on integer linear programming (ILP). We also explore several features to define a similarity score between chunks; this score is used in the objective function of our optimization problem and in the similarity type and score classification modules. To summarize our contributions:
• We propose a novel algorithm for monolingual alignment, iMATCH, that handles many-to-many chunk level alignment, based on Integer Linear Programming.
• We propose a system for Interpretable Semantic Textual Similarity: in the gold chunks track, our system is the top performer for the answer-students dataset and our alignment score is among the best two teams for all datasets.

arXiv:1605.01194v1 [cs.CL] 4 May 2016

System for Interpretable STS
Our system comprises (1) an alignment module, iMATCH (section 2.1), (2) a type prediction module (section 2.2) and (3) a score prediction module (section 2.3). In the case of system chunks, there is an additional chunking module for segmenting input sentences into chunks. Figure 1 shows the block diagram of the proposed system.

Problem Formulation: The following is the formal definition of our problem. Consider a source sentence ($Sent_1$) with $M$ chunks and a target sentence ($Sent_2$) with $N$ chunks. Let $C_1 = \{c^1_1, \ldots, c^1_M\}$ be the chunks of $Sent_1$ and $C_2 = \{c^2_1, \ldots, c^2_N\}$ the chunks of $Sent_2$. Consider sets $\mathcal{S}_1 \subseteq PowerSet(C_1) - \phi$ and $\mathcal{S}_2 \subseteq PowerSet(C_2) - \phi$. Note that $\mathcal{S}_1$ and $\mathcal{S}_2$ are subsets of the power set (the set of all possible combinations of sentence chunks) of $C_1$ and $C_2$ respectively. Consider sets $S_1 \in \mathcal{S}_1$ and $S_2 \in \mathcal{S}_2$, each of which denotes a specific subset of chunks that are likely to be combined during alignment. Let $concat(S_1)$ denote the phrase resulting from concatenation of the chunks in $S_1$ and $concat(S_2)$ the phrase resulting from concatenation of the chunks in $S_2$. Consider a binary decision variable $Z_{S_1,S_2}$ that takes the value 1 if $concat(S_1)$ is aligned with $concat(S_2)$ and 0 otherwise. The goal of the alignment module is to determine the decision variables $Z_{S_1,S_2}$ that are non-zero. $S_1$ and $S_2$ can each contain more than one chunk (multiple alignment), and these chunks are not necessarily contiguous. Aligned chunks are further classified using the type classifier and the score classifier. The type prediction module labels a pair of aligned chunks $(concat(S_1), concat(S_2))$ with a relation type such as EQUI (equivalent) or OPPO (opposite). The score classifier module assigns a similarity score in the range 0-5 to a pair of chunks. For the system chunks track, the chunking module converts the sentences $Sent_1$, $Sent_2$ to the sentence chunks $C_1$, $C_2$.
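
To make the notation concrete, the following is a minimal Python sketch of the candidate sets and of concat. The helper names `nonempty_subsets` and `concat` and the example chunks are illustrative assumptions, not part of the released system; chunks are represented by their indices within the sentence.

```python
from itertools import combinations

def nonempty_subsets(chunks, max_size=None):
    """Return all non-empty subsets of chunk indices, i.e. PowerSet(C) minus the empty set.

    If max_size is given, only subsets with at most max_size chunks are kept.
    """
    n = len(chunks)
    limit = n if max_size is None else min(max_size, n)
    return [frozenset(combo)
            for size in range(1, limit + 1)
            for combo in combinations(range(n), size)]

def concat(chunks, subset):
    """Concatenate the chunks in `subset` in sentence order (they need not be contiguous)."""
    return " ".join(chunks[i] for i in sorted(subset))

# Hypothetical chunks for one sentence (M = 3).
C1 = ["the old man", "is sitting", "on a bench"]

print(len(nonempty_subsets(C1)))        # 7, i.e. 2^3 - 1
print(concat(C1, frozenset({0, 2})))    # the old man on a bench
```

The number of candidate subsets grows as 2^M - 1, which is why the optional `max_size` cap matters in practice.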

iMATCH: ILP based Monolingual Aligner for Multiple-Alignment at the Chunk Level
We approach the problem of multiple alignment (permitting non-contiguous chunk combinations) by formulating it as an Integer Linear Programming (ILP) optimization problem. We construct the objective function as the sum of all $Z_{S_1,S_2}$, $\forall S_1, S_2$, weighted by the similarity between $concat(S_1)$ and $concat(S_2)$, subject to constraints that ensure each chunk is aligned only a single time with any other chunk. This leads to the following optimization problem based on integer linear programming (Nemhauser and Wolsey, 1988):

$$\max_{Z} \sum_{S_1 \in \mathcal{S}_1} \sum_{S_2 \in \mathcal{S}_2} Z_{S_1,S_2} \, \alpha(S_1,S_2) \, Sim(S_1,S_2)$$

subject to

$$\sum_{S_1 \in \mathcal{S}_1 : c \in S_1} \; \sum_{S_2 \in \mathcal{S}_2} Z_{S_1,S_2} \le 1 \quad \forall c \in C_1,$$

$$\sum_{S_1 \in \mathcal{S}_1} \; \sum_{S_2 \in \mathcal{S}_2 : c \in S_2} Z_{S_1,S_2} \le 1 \quad \forall c \in C_2,$$

$$Z_{S_1,S_2} \in \{0, 1\} \quad \forall S_1 \in \mathcal{S}_1, S_2 \in \mathcal{S}_2.$$

The optimization constraints ensure that a particular chunk $c$ appears in an alignment a single time with any subset of chunks in the other sentence. Therefore, one chunk can be part of an alignment only once. We note that all possible multiple alignments are explored by this optimization problem when $\mathcal{S}_1 = PowerSet(C_1) - \phi$ and $\mathcal{S}_2 = PowerSet(C_2) - \phi$. However, this leads to a very high number of decision variables $Z_{S_1,S_2}$, which is not suitable for realistic use. Hence we consider a restricted use case in which each candidate subset contains at most two chunks:

$$\mathcal{S}_1 = \{\{c^1_i\} : 1 \le i \le M\} \cup \{\{c^1_i, c^1_j\} : 1 \le i < j \le M\},$$

$$\mathcal{S}_2 = \{\{c^2_i\} : 1 \le i \le N\} \cup \{\{c^2_i, c^2_j\} : 1 \le i < j \le N\}.$$

This leads to many-to-many alignment where at most two chunks are combined to align with at most two other chunks. For the iSTS task submission, we restrict our experiments to this setting (since it worked well for the iSTS task), but the sets $\mathcal{S}_1$ and $\mathcal{S}_2$ can be relaxed to cover combinations of 3 or more chunks. For efficiency, it should be possible to consider only a subset of chunk combinations based on adjacency information or on the existence of a dependency identified using dependency parsing techniques.

$Sim(S_1, S_2)$, the similarity score that measures the desirability of aligning $concat(S_1)$ with $concat(S_2)$, plays an important role in finding the optimal solution for the monolingual alignment task. We compute this similarity score by taking the maximum of the similarity scores obtained from a subset of the features given in Table 1: $\max(F1, F2, F3, F8, F10, F11)$. During implementation, the weighting term $\alpha(S_1, S_2)$ is set as a function of the cardinality of $S_1$ and the cardinality of $S_2$, to ensure that alignments combining fewer chunks do not get an undue advantage over multiple alignment. For instance, single alignments tend to increase the objective function value more because they yield more aligned pairs, given that similarity scores are normalized to lie between -1 and 1. This is a hyper-parameter whose value is set using a simple grid search. We solve the actual ILP optimization problem using PuLP (Mitchell et al., 2011), a Python toolkit for linear programming. Our system achieved the best alignment score for the headlines dataset in the gold chunks track.
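
The restricted optimization problem can be illustrated with a small, dependency-free Python sketch. Several parts are assumptions made only for illustration: `sim` is a toy token-overlap score rather than the maximum over the Table 1 features, the form of `alpha` is hypothetical (the paper states only that it depends on the two cardinalities and is tuned by grid search), and an exhaustive search stands in for the PuLP solver, which is practical only for very small inputs.

```python
from itertools import combinations

def candidate_subsets(n_chunks, max_size=2):
    """Subsets of chunk indices with at most `max_size` chunks (the restricted setting)."""
    return [frozenset(c)
            for size in range(1, max_size + 1)
            for c in combinations(range(n_chunks), size)]

def sim(phrase_a, phrase_b):
    """Toy similarity in [-1, 1]: rescaled Jaccard overlap of lowercased tokens (assumption)."""
    a, b = set(phrase_a.lower().split()), set(phrase_b.lower().split())
    return 2.0 * len(a & b) / len(a | b) - 1.0

def alpha(s1, s2, weight=1.0):
    """Hypothetical cardinality weighting; `weight` stands in for the tuned hyper-parameter."""
    return weight * (len(s1) + len(s2)) / 2.0

def align(chunks1, chunks2, max_size=2, weight=1.0):
    """Maximise the sum of Z * alpha * Sim with every chunk used in at most one alignment."""
    def phrase(chunks, subset):
        return " ".join(chunks[i] for i in sorted(subset))

    # One candidate per decision variable Z_{S1,S2}; only positive gains can improve the objective.
    candidates = []
    for s1 in candidate_subsets(len(chunks1), max_size):
        for s2 in candidate_subsets(len(chunks2), max_size):
            gain = alpha(s1, s2, weight) * sim(phrase(chunks1, s1), phrase(chunks2, s2))
            if gain > 0:
                candidates.append((gain, s1, s2))

    best = (0.0, [])

    def search(start, used1, used2, total, chosen):
        nonlocal best
        if total > best[0]:
            best = (total, list(chosen))
        for k in range(start, len(candidates)):
            gain, s1, s2 = candidates[k]
            if s1 & used1 or s2 & used2:  # constraint: a chunk appears in at most one alignment
                continue
            chosen.append((s1, s2))
            search(k + 1, used1 | s1, used2 | s2, total + gain, chosen)
            chosen.pop()

    search(0, frozenset(), frozenset(), 0.0, [])
    return best

# Hypothetical chunked sentence pair.
chunks1 = ["New York", "mayor", "resigns"]
chunks2 = ["mayor", "of New", "York", "resigns"]
total, alignment = align(chunks1, chunks2)
print(round(total, 3))  # 2.5
for s1, s2 in alignment:
    print(sorted(s1), "<->", sorted(s2))
```

In this toy example "New York" can only be matched well by combining the non-adjacent-in-meaning chunks "of New" and "York", which is the situation the multiple-alignment formulation is designed for. Replacing the exhaustive search with an ILP solver keeps the same candidates, gains and per-chunk constraints.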

Type Prediction Module
We use a supervised approach for multiclass classification, based on the training data of 2016 and that of previous years (for some submitted runs), to learn the similarity type between an aligned pair of chunks from various features derived from the chunk text. We train a one-vs-rest random forest classifier (Pedregosa et al., 2011) with various features mentioned in Table 1. We normalize the input phrases as a preprocessing step before extracting features for classification. The normalization step includes various heuristics to convert similar words to the same form; for example, 'USA' and 'US' were mapped to 'U.S.A'. Empirical results suggested that features F1, F2, F3, F5, F7, F8, F9 and F12, along with unigram and bigram features, give good accuracy with the decision tree classifier. Feature vector normalization is done before training and prediction. We note that our type classification module performed well for the answer-students dataset, while it did not generalize as well for the headlines and images datasets. We are exploring other features to improve performance on these datasets as future work.
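
The preprocessing and n-gram steps can be sketched as follows. This is a minimal illustration under stated assumptions: the `CANONICAL` table contains only the single mapping mentioned in the text, and the function names are hypothetical. The resulting counts would be combined with the Table 1 features before being passed to the classifier, which is not shown here.

```python
from collections import Counter

# Hypothetical normalisation table; the text only gives the 'USA'/'US' -> 'U.S.A' example.
# Keys are case-sensitive so that the pronoun "us" is left untouched.
CANONICAL = {"USA": "U.S.A", "US": "U.S.A", "U.S.A": "U.S.A"}

def normalize(phrase):
    """Map known surface variants of a token to one canonical form."""
    return " ".join(CANONICAL.get(tok, tok) for tok in phrase.split())

def ngram_features(phrase, n_values=(1, 2)):
    """Count unigrams and bigrams of the normalised, lowercased phrase."""
    tokens = normalize(phrase).lower().split()
    feats = Counter()
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            feats[" ".join(tokens[i:i + n])] += 1
    return feats

print(normalize("US troops leave"))                                  # U.S.A troops leave
print(ngram_features("USA troops") == ngram_features("US troops"))   # True
```

Normalising before feature extraction means that surface variants of the same entity produce identical n-gram features, so the classifier does not have to learn the equivalence from data.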

Score Prediction Module
Similar to the type classifier, we designed the score classifier to do multiclass classification using a one-vs-rest random forest classifier (Pedregosa et al., 2011). Each score from 1 to 5 is considered as a class. A score of '0' is assigned by default for 'not-aligned' chunks. Word normalization (US, USA and U.S.A are mapped to the string U.S.A) is performed as a preprocessing step. Features F1, F2, F3, F5, F7, F8, F9 and F12, along with unigram and bigram features (refer to Table 1), were used in training the multiclass classifier. Feature normalization was performed to improve results. Our score classifier works well on all datasets. The system achieved the highest score on the gold chunks track for the answer-students dataset and the headlines dataset, and is within 2% of the top score for all other datasets.

[Table 1 fragment, edit score feature: For the phrasal score, sum the edit score of each sentence 1 word with its closest sentence 2 word, then compute the average over the scores for the words in the source.]
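
The phrasal edit score described above can be sketched in Python as follows. The word-level `edit_score` is an assumption (one minus the Levenshtein distance normalised by the longer word), since the exact definition used in Table 1 is not recoverable from the text; the averaging over source words follows the description.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def edit_score(w1, w2):
    """Assumed word-level score: 1 minus edit distance normalised by the longer word."""
    longest = max(len(w1), len(w2))
    return 1.0 if longest == 0 else 1.0 - levenshtein(w1, w2) / longest

def phrasal_edit_score(source, target):
    """Average, over source words, of the edit score with the closest target word."""
    src, tgt = source.lower().split(), target.lower().split()
    if not src or not tgt:
        return 0.0
    return sum(max(edit_score(s, t) for t in tgt) for s in src) / len(src)

print(levenshtein("kitten", "sitting"))  # 3
print(phrasal_edit_score("the swich is closed", "the switch is closed"))
```

A misspelling such as "swich" still scores close to 1 against "switch", which is how a feature of this kind can tolerate noisy student answers without a separate spelling-correction step.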

System Chunks Track: Chunking Module
When gold chunks are not given, we perform an additional chunking step. We use two methods for chunking: (1) the OpenNLP Chunker (Apache, 2010), and (2) the stanford-core-nlp (Manning et al., 2014) API for generating parse trees, followed by the chunklink (Buchholz, 2000) tool for chunking based on the parse trees.
For chunking, we preprocess the input to remove punctuation unless the punctuation mark is space separated (and therefore constitutes an independent word). We also convert Unicode characters to ASCII characters. The output of the chunker is further post-processed to combine each single preposition phrase with the preceding phrase. We noted that the OpenNLP chunker ignored the last word of a sentence, in which case we appended the last word as a separate chunk. In the case of chunking based on the stanford-core-nlp parser, we noted that in several instances, particularly in the answer-students dataset, a conjunction such as 'and' was separated into an independent chunk; chunking could therefore be improved by combining the chunks around a conjunction. These processing heuristics are based on observations from the gold chunks data. We observe that the quality of chunking has a huge impact on the overall score in the system chunks track. As future work, we are exploring ways to improve the chunking with custom algorithms.
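
Two of the post-processing heuristics can be sketched in Python as follows. The representation of chunks as `(tag, text)` pairs, the tag names and the function names are assumptions for illustration, and "single preposition phrase" is interpreted here as a PP chunk consisting of exactly one word.

```python
def append_missing_last_word(sentence, chunks):
    """If the chunker dropped the sentence's last word, add it back as its own chunk."""
    words = sentence.split()
    covered = [w for _, text in chunks for w in text.split()]
    if words and (not covered or covered[-1] != words[-1]):
        return chunks + [("O", words[-1])]  # "O" is a placeholder tag (assumption)
    return chunks

def merge_prepositions(chunks):
    """Attach each one-word PP chunk to the chunk that precedes it."""
    merged = []
    for tag, text in chunks:
        if tag == "PP" and len(text.split()) == 1 and merged:
            prev_tag, prev_text = merged[-1]
            merged[-1] = (prev_tag, prev_text + " " + text)
        else:
            merged.append((tag, text))
    return merged

# Hypothetical chunker output in which the final word "park" was dropped.
sentence = "A dog runs in the park"
raw = [("NP", "A dog"), ("VP", "runs"), ("PP", "in"), ("NP", "the")]

fixed = merge_prepositions(append_missing_last_word(sentence, raw))
print(fixed)
# [('NP', 'A dog'), ('VP', 'runs in'), ('NP', 'the'), ('O', 'park')]
```

A PP chunk at the very start of a sentence has no preceding chunk and is left unchanged in this sketch.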

Experimental Results
In this section, we present our results in both the gold chunks and the system chunks tracks. We submitted 3 runs for each track. In the gold chunks track, [...]. In the system chunks track, for Run 3 we use the OpenNLP chunker with the training data of 2016 alone. Results of our system compared to the best performing systems in each track are listed in Tables 2-9. In both the gold and system chunks tracks, Run 2 performs best owing to more data during training. Our system performed well for the answer-students dataset owing to our edit-distance feature, which enables handling noisy data without any preprocessing for spelling correction. Our alignment score is the best or close to the best in the gold chunks track, thus validating that our novel and simple approach based on ILP can be used for high quality monolingual multiple alignment at the chunk level. Our system took only 5.2 minutes for a single-threaded execution on a Xeon 2420, 6 core system for the headlines dataset. Therefore, our technique is fast to execute. We observe that the quality of chunking has a huge impact on alignment and thereby on the final score. We are actively exploring other chunking strategies that could improve results. Code for our alignment module is available at https://github.com/lavanyats/iMATCH.git

Conclusion and Future Work
We have proposed a system for Interpretable Semantic Textual Similarity (Task 2, SemEval 2016) (Agirre et al., 2016). We introduce a novel monolingual alignment algorithm, iMATCH, for multiple alignment at the chunk level based on Integer Linear Programming (ILP), which leads to the best alignment score in several cases. Our system uses novel features to capture dataset properties. For example, we designed an edit-distance based feature for the answer-students dataset, which had a considerable number of spelling mistakes. This feature helped our system perform well on the noisy test data without any preprocessing in the form of spelling correction.
As future work, we are actively exploring features to improve our accuracy for type classification, which could help us improve our mean score. Some exploration of techniques for simultaneous alignment and chunking could significantly boost performance in the system chunks track.

Figure 1: Flow diagram of proposed iSTS system

Table 1: Feature Extraction as used in various modules of iSTS system

Table 2: Gold Chunks Images

Table 3: Gold Chunks Headlines

Table 4: Gold Chunks Answer Students

Table 5: Gold Chunks Overall

Table 6: System Chunks Images

Table 7: System Chunks Headlines

Table 8: System Chunks Answer Students

Table 9: System Chunks Overall