Classification of comment helpfulness to improve knowledge sharing among medical practitioners.

Clinical research article summaries called infoPOEMs (Patient-Oriented Evidence that Matters) are emailed by the Canadian Medical Association to family physicians, who read them and answer the online Information Assessment Method (IAM) questionnaire, which includes a free-form text field for commenting on the value or content of the infoPOEM. This article presents the results of a relevance evaluation study applied to these comments to automatically determine their helpfulness and, consequently, the interest of sharing them within the medical community. A dataset of 3,470 manually annotated comments provides a gold standard; from it we derive structural, syntactic, and semantic features drawn from the Unified Medical Language System and the IAM questionnaire. The machine learning algorithms applied show a global f-measure improvement of 9.1% over a binary-occurrence bag-of-words baseline.


Introduction
The task of opinion mining has gained importance in recent years as our world is increasingly made of posted information with crowds commenting on it. This growing volume of crowd comments has led to text analysis research aimed at understanding and clustering the opinions found in those comments (e.g. see recent articles (Mukherjee and Liu, 2012; Turney, 2002; Chen and Zimbra, 2010)) and at helping manage interactions within online communities (Huh et al., 2013). An even more recent task focuses not so much on understanding comment content, but rather on evaluating a comment's value, impact or helpfulness for the community reading it. Most research addressing this task, as shown in the Related Work section, uses comments on product information on Amazon. But the idea of evaluating comment helpfulness can be extended to other contexts such as community learning or knowledge sharing. In the present research, we look particularly at the community of medical practitioners in the context of reading and commenting on scientific article summaries called infoPOEMs® (Patient-Oriented Evidence that Matters). Within this medical community, comments about an infoPOEM made by one practitioner could be useful to other practitioners regardless of the opinion expressed. The helpfulness dimension is not necessarily correlated with the opinion dimension, whose typical values are positive, negative or neutral. For example, the comment "good article" certainly has a different helpfulness value than the comment "this is a very good article since it shows for the first time that drug X can be useful in disease Y", even though both comments are positive.
The automatic identification of helpfulness is the subject of our research. Practitioners are not interested in reading all comments, only the valuable or "helpful" ones, and automatic identification of helpfulness would provide a more efficient way for them to share knowledge.

Related Work
Assessing the helpfulness of comments made about recreational or informational items has recently been explored mainly for online product or movie reviews. Many studies use Amazon data, which is well suited for this task since it provides readily available training data. Besides the research work on Amazon data, we also mention one work on peer review in an educational context. We provide extensive details on the features selected and the results obtained by the different works, to allow us to compare our feature sets and our results to the ones mentioned here.
Within the Amazon studies, the most cited approach, by Kim and Pantel (2006), uses machine learning algorithms with text-based features to rank the usefulness of product reviews from the Amazon.com website. They use a dataset of 25,841 reviews of 1,802 products (mp3 players and digital cameras). Their gold standard ranking is based on user responses, provided on the site, to the question "Was this review helpful to you?". Their features are divided into five sets: structural, lexical, syntactic, semantic and metadata. Structural features comprise total token number, number of sentences, average sentence length, percentage of question sentences, number of exclamation sentences, and bold and line-break HTML tags. Lexical features comprise tf-idf (Term Frequency-Inverse Document Frequency) of unigrams and bigrams. Syntactic features include the percentage of open-class tokens, verb tokens, first-person verbs, adjectives and adverbs. Semantic features comprise occurrences of product features in the review, as well as occurrences of positive and negative sentiment words. Metadata features comprise the star rating and the difference between the given star rating and the product's average star rating. Using these features derived from the comments, they use an SVM-RBF algorithm to evaluate the features' correlation. Their best result used only three features (comment length, unigrams and star rating), providing a Spearman rank correlation of 0.66.
Ngo-Ye and Sinha (2012) also look at Amazon.com reviews, using 2,718 reviews of 11 books. Rather than expanding the set of features, as Kim and Pantel (2006) did, they limit themselves to the traditional bag-of-words approach. Their contribution is in dimensionality reduction using the regressional ReliefF algorithm, compared with LSA, correlation feature selection (CFS) and two other dimension reduction methods. Using both binary occurrences of words and real frequency occurrences as features, they conclude that the regressional ReliefF dimension reduction algorithm outperforms the basic bag-of-words, LSA and CFS on every count.

Zhang and Tran (2008) suggest an information entropy-based bag-of-words model to predict the helpfulness of reviews. As training data, they use 9,955 GPS and mp3 player reviews from Amazon. Contrary to Kim and Pantel (2006), who attempted correlation with a gold standard ranking, they transform the problem into a binary classification problem using a consumer vote ratio threshold of >60% to consider a review as helpful. They compare their entropy-based method with three classifiers: Naive Bayes, Decision Tree and sequential minimal optimization (Platt, 1998). The resulting performances of their approach (77.2% for helpful and 77.5% for non-helpful) beat Naive Bayes (h: 76.2% and n-h: 75.2%) and Decision Tree (h: 72.3% and n-h: 75.3%) but are on par with an occurrence-based bag-of-words using the SMO classifier (h: 76.1% and n-h: 78.0%) when considering both values of the output class.
Other research takes place in other fields, such as educational peer-review systems. This is the case with Xiong and Litman (2011), who used a feature-based machine learning approach on peer-review assessments from an introductory college history class to evaluate their usefulness. They collected 267 comments made on 16 papers, which evaluated the quality of the work (facts, clarity, argument structure and so on). While using the previously published features (Kim and Pantel, 2006) as a baseline, they introduced new features like problem localization ("Page 2 says ..."), new lexicon categories (modal verbs, negation, positive and negative words, ...) and cognitive-science constructs (praise, problem, summary, solution, ...). The baseline using structural, unigram and metadata features achieved a 0.62 Pearson correlation (0.67 with the new features). The context of this research is the closest to ours, as it targets the usefulness of comments for educational purposes.

Applicative context
The Canadian Medical Association (CMA) delivers by email clinical research article summaries called infoPOEMs (Patient-Oriented Evidence that Matters) to family physicians around the country. To transform the reading of infoPOEMs into an actual learning experience, research in education stresses the importance of having the reader (learner) reflect on the value of their reading by answering questions. While questions on the content only test short-term memory, questions on the value of the information for clinical practice can stimulate reflective learning. The impact of such practice and its validation have been researched in depth (Grad et al., 2006; Grad et al., 2008; Pluye et al., 2010a; Pluye et al., 2010b).
As part of their mandatory continuing education program, physicians can answer an online questionnaire called the Information Assessment Method, or IAM (Grad et al., 2011). It contains many questions to gauge the impact of the infoPOEM's content on the physician's knowledge and practice: "Is your practice changed and improved?", "Are you motivated to learn more?", "Are you dissatisfied?", "Is this summary relevant for at least one of your patients?". In addition to the predefined questions, physicians can add comments about their reading experience targeting the quality of the overall infoPOEM information, the research, the methodology, and so on. The examples below illustrate how physicians' comments can fluctuate in length, content and targeted issues. Comments can be related to missing information (1, 5), generic appreciation (4), critical disagreement (2), agreement with support (3), contextual information about inapplicability of the information (6), etc.

Methodology
Our research takes a similar approach to Kim and Pantel (2006) and Xiong and Litman (2011) on feature extraction and machine learning, while looking at a closed system without clear "wisdom-of-the-crowd" indicators. We evaluate the impact of features based on textual analysis of the comment itself, but also features based on a comparison between the infoPOEM and the comment, as well as features relying on external domain-specific resources. Our methodology consists of (1) circumscribing the data and developing a gold standard, (2) defining a set of features that best describe the data to be categorized, (3) experimenting with machine learning approaches for categorization and (4) performing an evaluation using the gold standard.

Dataset and gold standard
The gold standard was annotated by three medical students with different experience levels. They were asked to read anonymous comments submitted by physicians and indicate whether they found them valuable for their knowledge or practice. Each annotator was provided with a list of anonymous comments and their associated infoPOEM for reference. They could access, if needed, the full text of the infoPOEM if a comment was not clear to them. A preliminary annotation phase was done with 300 randomly selected comments annotated by the three annotators (100 each). This phase provided a better understanding of the problem and validated the annotation schema used for the main annotation task. The classification schema included three choices to annotate the helpfulness of a comment: "valuable", "non-valuable" or "I don't know". The annotators were asked to consider each comment independently and not let the reading of previous comments influence their choice.
The main annotation task was based on two batches of comments. The first, relatively small, contained 250 comments; it was given to all three reviewers and allowed us to calculate inter-annotator agreement. A larger set of comments was split into three parts so that each comment was annotated by a single reviewer. This provided a total of 3,470 comments associated with 327 randomly picked infoPOEMs. Of these comments, 1,586 (45.6%) were deemed valuable and 1,884 (54.3%) non-valuable. A dozen comments were tagged "I don't know" and removed from the dataset.
The 300 comments from the preliminary annotation step, joined with the 250 comments used for inter-annotator agreement, formed the development dataset (550 unique comments) used to define, develop, test and refine the features presented in the next section; they were not included in the dataset for the final evaluation. The other set of 3,470 comments was used as the evaluation dataset for performance assessment.
The size of the manually annotated dataset compares advantageously to the 1,000 annotated comments of Ghose and Ipeirotis (2007) and the 267 of Xiong and Litman (2011). Using the first 250 comments annotated by the three annotators, an inter-annotator agreement of 0.4846 was computed using Fleiss' kappa for multiple annotators with all three classes (valuable / non-valuable / I don't know). The inter-annotator agreement was then recalculated using only the 247 comments belonging to the two main classes (valuable / non-valuable), which provided a score of 0.5004. The remaining data shows stronger agreement on valuable comments than on non-valuable ones. The level of agreement calculated on this dataset is considered moderate according to Landis and Koch (1977) when compared to pure chance agreement, and is of the same order as in Xiong and Litman (2011). Using each annotator as the gold standard versus the others, the f-measures were 0.806 between annotators 1 and 2, 0.783 between 1 and 3, and 0.792 between 2 and 3.
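The agreement score above can be computed with a short Fleiss' kappa routine. The sketch below is illustrative only; the toy labels ("v", "n", "i") and ratings are hypothetical examples, not our annotation data:

```python
from collections import Counter

def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for multiple raters.
    ratings: one list of labels per item (one label per annotator)."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # proportion of agreeing annotator pairs for this item
        agreement_sum += (sum(k * k for k in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1))
    p_bar = agreement_sum / n_items                    # observed agreement
    total = n_items * n_raters
    p_e = sum((category_totals[c] / total) ** 2        # chance agreement
              for c in categories)
    return (p_bar - p_e) / (1 - p_e)
```

With perfect agreement on every item the function returns 1.0; values near 0.5, as observed here, fall in the "moderate" band of Landis and Koch (1977).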
The moderate inter-annotator agreement score can be explained by one or more of the following points: coding instructions were interpreted differently by each annotator; coding decisions are based on factors not present in the textual data (such as relevant prior knowledge, domain expertise or interest, personal taste or bias, and so on); decision factors were present in the text but not correctly understood by the readers; etc. While it is difficult to provide a clear and proven diagnosis of the reasons behind these scores, lower scores usually increase the difficulty of developing prediction systems. As such, the moderate agreement provides a contextualization of potential performance for this task; a near-perfect classification of comments is not the goal, as it would overfit the three annotators' classifications.

Feature definition
The purpose of defining features is to capture as well as possible the characteristics of comments that are representative of their helpfulness. Inspired by previous research, we define a set of base features focusing on standard text analysis techniques. But we apply these techniques not only to the comment's content itself, but also in a comparative setting, looking at similarities between an infoPOEM and its comments. We present these base features first. Second, we look at metadata features from the infoPOEM itself. Third, we use the actual IAM questionnaire as a source of features. Fourth, since our specific problem lies in the medical domain, we define a set of features using a medical resource, the UMLS (Unified Medical Language System). The feature extraction process was developed using GATE (Cunningham et al., 2011) with the TreeTagger part-of-speech tool (Schmid, 1994).

Base
The base set includes all features extracted using natural language processing techniques. It includes features and representations used in previous research such as Kim and Pantel (2006) and Xiong and Litman (2011), as well as new ones introduced in this article. They can be grouped into the structural, syntactic and semantic subsets.
Structural Structural features target statistical properties of the tokens and sentences contained in the comments; the total number of each was added as a separate feature. Two further features were added for tokens: the standard deviation and a three-value discretization of it, to account for the comment length being within range of the average (avg) number of tokens across all comments, above it (high) or under it (low), using ±1σ as the threshold.
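The three-value discretization can be sketched as follows; this is a minimal illustration, assuming population standard deviation (the paper does not specify which variant was used) and hypothetical label names:

```python
import statistics

def discretize_length(token_counts):
    """Three-value discretization of comment length against the corpus
    mean, using +/- 1 standard deviation as the threshold."""
    mean = statistics.mean(token_counts)
    sd = statistics.pstdev(token_counts)   # assumption: population sigma
    labels = []
    for n in token_counts:
        if n > mean + sd:
            labels.append("high")
        elif n < mean - sd:
            labels.append("low")
        else:
            labels.append("avg")
    return labels
```

For example, in a corpus of four 10-token comments and one 100-token comment, only the long comment falls more than one standard deviation above the mean.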
Syntactic Following part-of-speech tagging (attributing a syntactic role to each word), the numbers of stop words and content words were added as features; these sum to the number of tokens from the structural features. The standard deviation and its discretization (as described previously) were also added. First- and second-person pronouns (e.g. I, we, us) were added as total-count and binary-occurrence features (true if any occurrence is observed, false if none) to identify author-related comments such as accounts of personal experiences, thoughts, preferences or opinions.
Then, for each type of content word (verb, adverb, noun, adjective) found in both the comment and the corresponding infoPOEM, we added four similarity-based features: the total count of shared occurrences, the binary occurrence, the ratio between the total count and the total number of content words, and the ratio between the total count and the total number of words.
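These four similarity-based features can be sketched as below. The sketch assumes pre-tagged (word, pos) pairs with a simplified tagset; the feature names are hypothetical, not the exact names used in our pipeline:

```python
def similarity_features(comment, poem):
    """Four similarity features per content-word category, computed
    from (word, pos) pairs; the tagset here is a simplification."""
    categories = ("verb", "adverb", "noun", "adjective")
    poem_words = {w.lower() for w, p in poem if p in categories}
    n_words = len(comment)
    content = [(w.lower(), p) for w, p in comment if p in categories]
    feats = {}
    for cat in categories:
        # content words of this category shared with the infoPOEM
        shared = sum(1 for w, p in content if p == cat and w in poem_words)
        feats[cat + "_sim_nbr"] = shared
        feats[cat + "_sim_bin"] = int(shared > 0)
        feats[cat + "_sim_content_ratio"] = shared / len(content) if content else 0.0
        feats[cat + "_sim_word_ratio"] = shared / n_words if n_words else 0.0
    return feats
```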
Semantic To identify comments with strong opinions or impressions, we use specific verbs (e.g. admit, enjoy, deem, endorse, decline, concern, advise, ...) and match the infinitive form of these verbs in the comments following a part-of-speech tagging step. Negative indicators (not, never, neither, nor, can't, don't, etc.) are also annotated to target potentially critical comments. As the comments were on infoPOEMs within a scientific discipline, terminology related to the scientific method (observation, qualitative, inference, ...), the statistical domain (population, marginal variable, matched sample, ...) and measurement (unit, cm, m, mg, ug, kg, ml, ...) was added separately as features. Finally, the five standard section labels (title, clinical question, bottom line, study design, synopsis) from the infoPOEM were added as keywords to detect whether a text was commenting on a specific section of the infoPOEM.
The number of instances and the binary occurrence for each of these semantic concepts (opinion verbs, domain terminology, negative indicators and localisation indicators) were added as features.
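The count and binary-occurrence pattern for these semantic lexicons can be sketched generically; the two mini-lexicons below are hypothetical excerpts for illustration, not our full term lists:

```python
def lexicon_features(tokens, lexicons):
    """Count and binary-occurrence features for each semantic lexicon.
    lexicons maps a category name to a set of trigger terms."""
    lowered = [t.lower() for t in tokens]
    feats = {}
    for name, terms in lexicons.items():
        n = sum(1 for t in lowered if t in terms)
        feats[name + "_nbr"] = n            # number of instances
        feats[name + "_bin"] = int(n > 0)   # binary occurrence
    return feats
```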

Metadata
To each infoPOEM is associated a code called the level of evidence (LOE). This code describes the type of research protocol used in therapy, diagnosis or prognosis research using one letter and one number (1a, 1b, 1c, ..., 2a, 2b, ...). A minus sign can be appended to the code to denote studies that cannot provide conclusive answers, in cases where the confidence interval is too large or the heterogeneity of the population sample is problematic. We split this code into three parts to provide three features: the type (first character, from 1 to 5), the subtype (second character, from A to C) and the presence of the minus indicator.
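The three-way split of the LOE code can be sketched as follows; the feature names are hypothetical:

```python
def loe_features(code):
    """Split a level-of-evidence code such as '2b-' into three features."""
    code = code.strip().lower()
    minus = code.endswith("-")
    if minus:
        code = code[:-1]
    return {
        "loe_type": int(code[0]),                            # 1 to 5
        "loe_subtype": code[1] if len(code) > 1 else None,   # a to c
        "loe_minus": int(minus),
    }
```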

IAM
Each question from the IAM questionnaire was added as a feature. Most of the questions asked for a logical yes/no answer. A few questions accepted yes, no or "possibly" as an answer. Only one question, pertaining to the relevance of the information for the physician's patients, asked for an answer on three levels: "totally relevant", "partially relevant" or "not relevant". The possibility of answering some specific questions was also dependent on the answers to previous questions; i.e. questions #3 and #4 were only available if "totally relevant" or "partially relevant" was chosen at question #2. Regardless of this dependency, all questions were added as individual, stand-alone features in the dataset.

UMLS
Unlike the work with Amazon data, which relies on official product feature sources to find vocabulary representative of different products, we do not have access to such sources in this study. Instead, we extracted single words and multiword expressions from the Unified Medical Language System (UMLS), a large medical ontology hosted at the National Library of Medicine (http://umlsks.nlm.nih.gov/), to analyse the domain-specific nature of the reviews and infoPOEMs. The relevant part of this resource splits biomedical and related concepts into 13 groups and 94 types, using themes like genes and molecular sequences, anatomy, living beings, physiology, procedures, disorders, organizations and so on, with each type related to one group.
For each type and group, the number of occurrences, the binary occurrence and the similarity occurrence were added as features. The similarity occurrence indicates how many expressions found in a comment were also found in the infoPOEM related to that comment. This type of feature was added to verify whether an author was discussing domain-specific concepts from the infoPOEM. Because of the relation between groups and types, each matching expression was represented with both a type feature and its corresponding group feature. In addition, the global binary occurrence and total occurrence of UMLS expressions were added as two features to logically regroup all UMLS type and group features. Therefore, if a word was tagged as being part of 4 types and 3 groups, the global binary occurrence would be 1 and the global count would be 7.
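The counting scheme (per-type, per-group and global features) can be sketched as below. The one-entry lexicon in the test is a hypothetical excerpt; a real lexicon would be derived from the UMLS semantic network:

```python
def umls_features(comment_terms, poem_terms, lexicon):
    """UMLS occurrence features; lexicon maps a term to its semantic
    types and groups. A term with 4 types and 3 groups contributes 7
    to the global count, matching the example in the text."""
    feats = {}
    global_nbr = 0

    def bump(key, similar):
        feats["nbr_" + key] = feats.get("nbr_" + key, 0) + 1
        feats["bin_" + key] = 1
        if similar:  # term also occurs in the related infoPOEM
            feats["sim_" + key] = feats.get("sim_" + key, 0) + 1

    for term in comment_terms:
        entry = lexicon.get(term)
        if entry is None:
            continue
        similar = term in poem_terms
        for typ in entry["types"]:
            bump("typ_" + typ, similar)
            global_nbr += 1
        for grp in entry["groups"]:
            bump("grp_" + grp, similar)
            global_nbr += 1
    feats["umls_global_nbr"] = global_nbr
    feats["umls_global_bin"] = int(global_nbr > 0)
    return feats
```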

Baseline
Two baselines were created using the bag-of-words model applied to the whole set of annotated comments. The preprocessing included a tokenizer, an English stop-word filter and the Snowball English stemmer, using each stemmed token as a feature in the dataset. The bag-of-words baselines were extracted and tested using the RapidMiner tool (Mierswa et al., 2006).
The first baseline follows the best replicable results from Kim and Pantel (2006), using the token length of the comment and a unigram bag-of-words. The stemmed tokens were weighted using the tf-idf measure. The resulting dataset was processed with the SVM-RBF algorithm, which provided an f-measure of 63.6%.
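The first baseline can be sketched with scikit-learn as below. This is an illustrative re-implementation, not the setup used here (the baselines were built in RapidMiner), and it omits the Snowball stemming and the comment-length feature for brevity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

def build_tfidf_svm_baseline():
    """Unigram bag-of-words weighted by tf-idf, classified with an RBF SVM."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
        ("svm", SVC(kernel="rbf")),
    ])
```

Usage (with hypothetical toy comments and labels):

```python
clf = build_tfidf_svm_baseline()
comments = ["great detailed study of drug interactions", "thanks",
            "useful information for my patients with diabetes", "good"]
clf.fit(comments, [1, 0, 1, 0])
predictions = clf.predict(comments)
```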
The second baseline is based on the conclusion of Zhang and Tran (2008), who presented a method providing a weighted score equivalent to the SMO algorithm applied to a binary-occurrence bag-of-words. The SMO (sequential minimal optimization algorithm for training a support vector classifier) and binary bag-of-words method applied to our comment corpus yielded a 62.2% f-measure, which is significantly lower than the 77.1% f-measure averaged over the helpful and non-helpful classes on their product review dataset.

Helpfulness prediction
As the helpfulness evaluation was based on a few annotators rather than a large population as on Amazon, classification algorithms were used to predict the correct value instead of a rank correlation method. We used ten-fold cross-validation to estimate recall, precision and f-measure for each algorithm. The feature sets were then combined with one another to verify which combination gave the best results. Evaluation of the machine-learning algorithms was done using the Weka toolset (Hall et al., 2009).
To test the relative strength of the feature sets from section 4.2, four datasets were created using the following feature sets: base (B), base and IAM questionnaire (B+I), base and UMLS (B+U) and all three sets together (B+I+U). The three metadata features were included in the base feature set (B). Finally, as the UMLS set contains a large number of features, a fifth dataset (B+I+U-150R) was created using a smaller subset of features selected following the Ngo-Ye and Sinha (2012) study, using the Relief-F algorithm to select the top 150 features, excluding the output class.
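The intuition behind this selection can be sketched with a simplified Relief scoring pass (one nearest hit and one nearest miss per instance); the full regressional ReliefF of Ngo-Ye and Sinha (2012), and the Weka Relief-F used here, are more elaborate:

```python
import numpy as np

def relief_scores(X, y):
    """Simplified Relief: reward features that differ between an instance
    and its nearest miss, penalize those that differ from its nearest hit."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    value_range = X.max(axis=0) - X.min(axis=0)
    value_range[value_range == 0] = 1.0
    weights = np.zeros(X.shape[1])
    for i in range(len(X)):
        dist = np.abs(X - X[i]).sum(axis=1)
        dist[i] = np.inf                      # never pick the instance itself
        hits = np.flatnonzero(y == y[i])
        misses = np.flatnonzero(y != y[i])
        hit = hits[np.argmin(dist[hits])]
        miss = misses[np.argmin(dist[misses])]
        weights += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / value_range
    return weights / len(X)

def top_k_features(X, y, k):
    """Indices of the k highest-scoring features (k=150 in our setting)."""
    return np.argsort(relief_scores(X, y))[::-1][:k]
```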
For each of the five datasets, a single algorithm from each of the four main families was tested: BayesNet (Friedman et al., 1997) (Bayesian), Voted Perceptron (Freund and Schapire, 1998) (function), JRip (Cohen, 1995) (rule-based) and Logistic Model Trees (Landwehr et al., 2005) (decision tree). Table 1 provides an overview of each algorithm applied to each of the 5 datasets with the corresponding f-measure. Numbers in bold indicate the dataset on which each algorithm performed best. The base feature set did better than the two baselines, increasing prediction quality by 2.7% and 4.1% respectively, to 66.3%. Adding the UMLS to the base feature set (B+U) marginally increased the performance by 0.4%. The IAM feature set, when joined with the base (B+I), did better with 70.4%, a 4.1% increase. Finally, the dataset with the highest results is the Relief-F-selected subset, with a top f-measure of 71.3%, an improvement of 0.9% over the 70.4% obtained with the complete dataset (B+I+U), both attained with the voted perceptron algorithm.

Features relevance
The average absolute weight of each feature from the voted perceptron algorithm applied to feature set B+I+U provided a ranking from the most discriminative feature to the least; the first 24 for each class are shown in Table 2. The first column, for the positive class (valuable), shows that the number of tokens still tops the list, with the first five features being different forms of it: number of content, any or stop tokens, percentage of similar (sim %) content tokens, and number of sentences. The group (grp) and type (typ) UMLS features occupy almost half the list (10 out of 23), using the similarity (sim) count, binary occurrence (bin) and number of occurrences (nbr). Five questions from the IAM questionnaire are also in the list, with the top three being negative assessments from the physicians.
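This ranking step is a simple sort on the average absolute weights; a minimal sketch, with hypothetical feature names and weight values:

```python
import numpy as np

def rank_by_weight(feature_names, avg_abs_weights):
    """Order features from most to least discriminative according to the
    average absolute weight they received in the trained perceptron."""
    order = np.argsort(np.abs(np.asarray(avg_abs_weights)))[::-1]
    return [feature_names[i] for i in order]
```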
The second section shows, for the negative class (non-valuable), that standard deviations (stddev) related to token length are the three most relevant features. 14 out of 24 features are from the UMLS resource, with about two-thirds of them (9 out of 13) using the similarity count. Three questions from the IAM questionnaire are also used. The rankings for the two output classes show that while the length of comments is still a significant aspect of perceived value, features from the UMLS dataset rank high for their discriminative power. The similarity aspect is also used in half of the UMLS features (11 out of all 23 UMLS features), which indicates the preponderant usefulness of this aspect over the others, such as binary occurrence and basic count.
An interesting observation in Table 2 is the type of IAM questions which prompt positive and negative value for the medical community. The top three IAM features used for the positive class (valuable) come from negative questions: "I disagree with the content", "There is a problem with this infoPOEM", "Not enough information". The negative class exhibits the mirror pattern, as questions like "I learned something new" and "Reminded of something I already knew", which are supportive of the article, are used by the algorithm as highly discriminative features. This may suggest that comments shedding a negative view on an article are considered more relevant than supportive ones, which would support the brilliant-but-cruel hypothesis (Amabile, 1983).

Applicability
While the experimental results show good improvement over previous methods, straight-out application of the trained classification model might not be advisable in the spirit of knowledge sharing in a continuing education program. As previously shown, the moderate inter-annotator agreement might motivate a more lenient classification method to select which comments are to be shared among the physicians' community and which are to be removed. One path to explore in this context is that each algorithm can provide a confidence level expressing its certainty regarding the chosen prediction class. It is usually based on the similarity between the features of the assessed entry and those in the trained model.
These results were generated using the voted perceptron algorithm. The dataset used for training was the 300 comments from the first step, combined with the 250 comments used for the inter-annotator agreement evaluation, using a majority vote to choose the relevant class. Finally, 450 randomly picked comments from the main dataset (3,470 comments) were added to these comments to provide a 1,000-comment training dataset for the voted perceptron algorithm. The remaining 3,020 comments (3,470 minus the 450 retained for training) from the main dataset were used for evaluation purposes. Table 3 shows the results for confidence levels ranging from 75% to 95% for each individual class, giving the total number of comments classified as such, the number of errors (wrongly classified comments) and the resulting prediction ratio. For example, at a confidence level of 80%, the algorithm classifies 575 comments as non-valuable. Of these 575 comments, 65 were in fact classified by the annotator as belonging to the other class, resulting in an 88.7% success ratio, which is significantly higher than the best result from Table 1. It can be observed in this table that while the non-valuable class provides a better success rate at the 95% confidence level, both classes degrade to approximately 83% at a 70% confidence rating. The non-valuable class shows better results but classifies a smaller number of comments at the two higher confidence levels. It then drops to the same level as the valuable class for lower confidence, while progressively classifying more comments.
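The confidence-threshold evaluation can be sketched as follows; the function name, the (class, confidence, gold) triple representation and the toy predictions are illustrative assumptions, not our actual evaluation code:

```python
def confidence_report(predictions, threshold):
    """Per-class success ratio when only predictions at or above a given
    confidence are kept; predictions are (class, confidence, gold) triples."""
    report = {}
    for cls in {p[0] for p in predictions}:
        kept = [(c, conf, g) for c, conf, g in predictions
                if c == cls and conf >= threshold]
        errors = sum(1 for c, _, g in kept if c != g)
        total = len(kept)
        ratio = 100.0 * (total - errors) / total if total else 0.0
        report[cls] = (total, errors, round(ratio, 2))
    return report
```

Raising the threshold keeps fewer predictions per class but, as in Table 3, tends to raise the success ratio of those that remain.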
These new results could be used in two main scenarios for knowledge sharing among medical practitioners. The first scenario is to use an arbitrarily chosen confidence level (for example, 80%) to filter out most of the non-valuable comments from the dataset. Physicians could then browse the remaining comments, which would have a higher chance of being helpful.

Table 3: Success ratio (%), errors and total classified comments per confidence level.

Confidence | All comments       | Non-valuable       | Valuable
           | ratio  err  total  | ratio   err  total | ratio  err  total
95%        | 96.50    5    143  | 100.00    0     23 | 95.83    5    120
90%        | 94.46   19    343  |  97.56    3    123 | 92.73   16    220
85%        | 91.32   57    657  |  92.05   26    327 | 90.61   31    330
80%        | 88.10  122  1,025  |  88.70   65    575 | 87.33   57    450
75%        | 86.20  191  1,384  |  86.11  115    828 | 86.33   76    556

The second scenario is to use the confidence level as a ranking for all comments. Comments classified as valuable with a high confidence rating would be presented at the top of the list, followed by comments classified with a slightly lower confidence score, and so on. The bottom of this list would be the top confidence-scored comments for the non-valuable class, which could still be offered. This second scenario provides more flexibility in an education context: even if most readers would only look at the top of the list, curiosity-driven readers would have access to the complete listing of comments, which could be useful for topics relevant to their practice.

Discussion and Conclusion
Our applicative setting is one of information sharing in a context of continuing education for medical practitioners. This is certainly far from product reviews, but the same problem exists: many comments are made by users, and these comments are not all useful to other users. Nevertheless, because of this applicative difference, performance comparison with other publications is not straightforward, as we are not in a typical social media-based interactive setting, which means that typical data like star ratings (used in Kim and Pantel (2006)) and relationships between reviews (as in Zhang et al. (2012)) are not available. Still, it can be observed in our performance evaluation that the basic unigram bag-of-words approach did not perform as well as our more complex features.
The UMLS features did not improve the overall performance when coupled with the base or the base+IAM questionnaire feature sets, probably because the less relevant features made the data noisier. This is corroborated by the increase in performance observed when the complete dataset was reduced to the 150 most discriminative features with the Relief-F feature selection algorithm. The IAM questionnaire, which physicians are not obliged to answer completely, may suffer from the same problem as the missing star ratings for new products. Even if it is successfully used by the classification algorithms, other features should be prioritized when possible. This could lead to more stable performances that do not depend on the completeness of an external source of information.
As seen in the top negative features in Table 2, standard deviation was a useful measure for classification. The discretization of these features was also useful for both output classes. However, while length can be a good predictor, as shown in a previous study (Kim and Pantel, 2006), and allows discarding useless short comments ("very good", "thanks", etc.), it will inevitably misclassify comments like "N?" (indicating the missing population size of a study) as useless. This is an extremely difficult case.
In conclusion, this research explored a textual feature extraction process with machine-learning classification to predict the helpfulness of comments in a context of continuing education for family physicians. Our research is well anchored in previous work on the topic, and we make further contributions by introducing similarity-based features to compare comments and infoPOEMs. We also introduce the use of an external domain-specific resource to provide a measure of domain appropriateness for a comment, which plays a role in the evaluation of its helpfulness.
We showed that our method improved on two previous baselines by 7.7% and 9.1%, to a final 71.3%, with the voted perceptron algorithm applied over a dataset of 150 features selected using the Relief-F algorithm. Since the categorization is far from perfect even if it gives good results (far above chance), we also suggested two confidence-based scenarios to make the categorization applicable in a real-world knowledge sharing context among medical practitioners.