Learning Comment Controversy Prediction in Web Discussions Using Incidentally Supervised Multi-Task CNNs

Comments on web news contain controversies that manifest as inter-group agreement conflicts. Tracking such rapidly evolving controversy could ease conflict resolution or journalist-user interaction. However, this presupposes online controversy prediction that scales to diverse domains using incidental supervision signals instead of manual labeling. To more deeply interpret comment-controversy model decisions, we frame prediction as binary classification and evaluate baselines and multi-task CNNs that use an auxiliary news-genre encoder. Finally, we use ablation and interpretability methods to determine the impacts of topic, discourse and sentiment indicators, contextual vs. global word influence, as well as genre keywords vs. per-genre controversy keywords, and find that the models learn plausible controversy features using only incidentally supervised signals.


Introduction
Online discussion comments are exchanged in parallel, creating redundancy that prevents discussions from developing beyond a superficial stage of confirming previously held opinions. Instead, Mahyar et al. (2017) recently demonstrated that focusing users on controversial comments (i.e. comments that cause inter-group agreement conflicts) helps speed up inter-group consensus finding, leading to improved group decisions. However, their system (ConsensUS) uses manual controversy labels, which cannot capture rapidly evolving comment controversy at scale or over diverse domains. Hence, to fully automate comment-controversy prediction systems, we contribute the following solutions to a number of challenges. (I) We extend controversy prediction to the comment level and to German news discussions. We evaluate topic, sentiment and discourse importance (Cramer, 2011) and analyze whether models plausibly capture controversy aspects using explainability methods (see Sec. 5.3). (II) We use comment vote-agreement to create an incidentally supervised (Roth, 2017) controversy signal, as seen in Figure 1. Structural (output feature) signals like genre are predicted by a sub-encoder (see Sec. 4) rather than required as input. (III) Sentiment and discourse input feature creation works on any tokenizable language (see Sec. 3).

Related Research
Since predicting user agreement conflicts on web news comments is a special case of controversy prediction, we list in the following related works that: (a) learn to predict controversy, using (b) incidental supervision, and (c) work on online (news) discussions. Chen et al. (2016) visualized controversial words using dissimilarities in pro vs. contra argument embeddings. Garimella et al. (2017) identified controversial topics using bipartite Twitter follower-graphs, while Dori-Hacohen and Allan (2015) proposed an incidentally supervised binary classification to predict controversial topics via Wikipedia tags. Jang et al. (2016) used language modeling to predict controversial documents, based on earlier hypotheses by Cramer (2011): "that language in news discussions is a good indicator of controversy". Choi et al. (2010) focused on using sentiment polarity indicators and subtopics, i.e. topically related phrases of nouns. Vote-based learning signals have been exploited by both Pool and Nissim (2016) and Basile et al. (2017), who predict the sentiment distributions of news outlets or find controversial news pieces using Facebook-article emoticon-votes. Instead of predicting controversial topics (articles), we predict controversial comments, hence putting the focus on users (commentators) as curators of controversial content.

Incidental Supervision Signals
Controversy signal: We use comment vote-agreement ratios and news tags as incidental supervision signals (Roth, 2017) to label comments as controversial and by genre. Comments without a clear 2/3 majority of either agreeing (up) or disagreeing (down) votes are considered controversial, i.e. of conflicted agreement. The ratio is calculated as r = min(up, down)/max(up, down). Ratios below 0.5 mark a 2/3 majority, whereas ratios above 0.5 mark conflicted agreement. We reduce labeling noise via two noise margins: (a) controversial comments must have a vote ratio r > 0.6, and (b) the up-vote group and the down-vote group must each have more than 2 votes.
Article genre signal: Predicting controversy without context structure is difficult, hence we use article-genre (topic) prediction as an incidental structure signal. The data contains 15 genres, some of which are noisy mixes of others. However, to keep preprocessing general, we use genres "as-is".
Corpus: We collected comments and the above training signals for every article published by the Austrian newspaper DerStandard.at in 2015. Each article has a news genre tag and user comments, which in turn receive up and down votes. The corpus contains 813k comments, from which we extracted 8.9k controversial and 12.6k non-controversial comments after removing duplicates and short comments with fewer than five words.
Text preprocessing: Preprocessing is source-agnostic and uses no language-specific NLP. We remove noise like low-frequency words. We create special tokens for discourse (repeated punctuation) and reactionary sentiment (emoticons) by categorizing emoticons into four (non-overlapping) types using a Wikipedia emoticon list (https://en.wikipedia.org/wiki/List_of_emoticons), see Table 1. We keep stop words, as they often overlap with discourse markers (see Sec. 5). Compounds are separated with a $comp$ token. Finally, we pre-trained word2vec (Mikolov et al., 2013) embeddings on 3.35M preprocessed article and comment sentences to cover standard German and mixed (non)dialect.
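The vote-ratio labeling rule from the controversy signal paragraph can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, comments with r < 0.5 are assumed to be the non-controversial class, comments falling into the margin between 0.5 and 0.6 are assumed to be discarded, and the minimum-vote rule is assumed to apply to the controversial class only.

```python
from typing import Optional

CONTROVERSIAL_MIN_RATIO = 0.6  # noise margin (a): controversial needs r > 0.6
MAJORITY_RATIO = 0.5           # r below 0.5 marks a clear 2/3 majority
MIN_VOTES_PER_GROUP = 2        # noise margin (b): more than 2 votes per group


def vote_ratio(up: int, down: int) -> Optional[float]:
    """Return r = min(up, down) / max(up, down), or None without votes."""
    larger = max(up, down)
    if larger == 0:
        return None
    return min(up, down) / larger


def controversy_label(up: int, down: int) -> Optional[int]:
    """Return 1 (controversial), 0 (non-controversial) or None (discarded).

    Assumption: comments inside the noise margins are dropped, and the
    minimum-vote rule is only enforced for the controversial class.
    """
    r = vote_ratio(up, down)
    if r is None:
        return None
    enough_votes = up > MIN_VOTES_PER_GROUP and down > MIN_VOTES_PER_GROUP
    if r > CONTROVERSIAL_MIN_RATIO and enough_votes:
        return 1
    if r < MAJORITY_RATIO:
        return 0
    return None
```

For example, a comment with 10 up-votes and 8 down-votes has r = 0.8 and is labeled controversial, whereas 10 up-votes and 2 down-votes give r = 0.2, a clear majority.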
Single / Multi-task CNNs: We also use convolutional neural nets (CNNs), as they are widely used in text classification. Below, we describe how we modified the single-task model (ST) by Kim (2014) to create a multi-task architecture (MT). ST: A CNN that predicts comment controversy only. It uses a deeper classifier, input-token dropout and custom word2vec embeddings, and trains on (comment, controversy label) pairs via binary cross-entropy (see Controversy CNN in Figure 2). MT: This model adds a genre encoder to the ST. The encoder predicts the multi-class genre via categorical cross-entropy and softmax on genre labels. Its penultimate activation map is fed to the ST's controversy classifier to provide genre plus controversy features (see the red downward arrow entitled genre encoding in Figure 2). The two losses are combined as a weighted sum for training. Since the encoder predicts genre itself, genre features are not required as input when predicting on new data.
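To make the weighted-sum objective concrete, the following plain-Python sketch combines a binary cross-entropy term for controversy with a softmax categorical cross-entropy term for genre, for a single example. The function names and the default weights w_controversy and w_genre are assumptions for illustration; the text does not state the weights that were used.

```python
import math
from typing import Sequence


def binary_cross_entropy(p: float, y: int, eps: float = 1e-12) -> float:
    """Loss for a predicted controversy probability p and a label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def softmax(logits: Sequence[float]) -> list:
    """Numerically stable softmax over genre logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(logits: Sequence[float], label: int,
                              eps: float = 1e-12) -> float:
    """Loss for genre logits and the index of the true genre."""
    return -math.log(max(softmax(logits)[label], eps))


def multitask_loss(p_controversy: float, y_controversy: int,
                   genre_logits: Sequence[float], y_genre: int,
                   w_controversy: float = 1.0, w_genre: float = 0.5) -> float:
    """Weighted sum of both task losses (the weights here are hypothetical)."""
    return (w_controversy * binary_cross_entropy(p_controversy, y_controversy)
            + w_genre * categorical_cross_entropy(genre_logits, y_genre))
```

Setting w_genre to 0 reduces the objective to the single-task controversy loss, which is one way to see the ST model as a special case of this sketch.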
MT modifications: Since feature extraction module design is central to CNNs, we evaluate a range of different design choices. We separate extraction modules into three categories from left to right: convolution methods, activation schemes, and pooling mechanisms, as seen in the upper and middle parts of Figure 2. [...] et al., 2000). Swish: Self-attention that multiplies inputs x by their sigmoid σ(x) (Ramachandran et al., 2018). Squeeze and Excite (SE): Bottlenecked multi-layer attention that learns convolution filter importances (Hu et al., 2018). MaxPool: (LeCun et al., 1998). Max(SPool)*: Appends per-filter Standard Deviation Pooling (SPool) to MaxPool to preserve variance information. In the next section we evaluate the most successful combinations.
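Two of the modules above are simple enough to sketch in plain Python: Swish, x · σ(x), and max pooling with appended per-filter standard deviation pooling. The sketch assumes that each filter yields one activation sequence over token positions, that the standard deviations are concatenated after the maxima, and that the population standard deviation is used; none of these details are specified in the text.

```python
import math
from statistics import pstdev
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def swish(x: float) -> float:
    """Swish activation: the input gated by its own sigmoid."""
    return x * sigmoid(x)


def max_spool(feature_maps: Sequence[Sequence[float]]) -> List[float]:
    """Global max pooling with appended per-filter std-dev pooling.

    feature_maps holds one activation sequence per convolution filter.
    The result is [max_1, ..., max_F, std_1, ..., std_F].
    """
    maxima = [max(fm) for fm in feature_maps]
    deviations = [pstdev(fm) for fm in feature_maps]
    return maxima + deviations
```

In this layout, a filter that fires once strongly and a filter that fires uniformly at the same level share the same maximum but differ in their deviation entry, which is the variance information that plain max pooling discards.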

Results and Discussion
We evaluate on 8.9k controversial and 12.6k non-controversial comments that each belong to exactly one genre. We created 5 randomly sampled (stratified) folds: 4 folds for cross-validation (CV) and 1 as a holdout set. MNB, LR, FT and Conv+ReLU (ST) only predict controversy. The MT models jointly predict controversy + genre and are tested with various combinations of modifications. Finally, we investigate the models' decision semantics and feature-type importances via ablation studies.
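The fold setup can be illustrated with a small stratified split in plain Python. This is a sketch under assumptions: the function name is hypothetical, stratification is done on the controversy label only (the text does not say whether genre was also stratified), and the last fold is arbitrarily taken as the holdout set.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence


def stratified_folds(labels: Sequence[Hashable], n_folds: int = 5,
                     seed: int = 0) -> List[List[int]]:
    """Split example indices into n_folds folds with similar class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds: List[List[int]] = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)  # random sampling within each class
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)  # deal out round-robin
    return folds


# Toy data: 10 controversial (1) and 15 non-controversial (0) comments.
toy_labels = [1] * 10 + [0] * 15
all_folds = stratified_folds(toy_labels)
cv_folds, holdout_fold = all_folds[:4], all_folds[4]
```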

Baselines: MNB, LR, FT
In Table 2 we list F1, area under the ROC curve (AUC) and accuracy (Acc) controversy prediction results on the holdout test set. We see that FastText is the best baseline. Optimal hyperparameters from 4-fold CV were: word-embedding 1-3-grams with 128-dimensional w2v embeddings for FT, and TF-IDF 1+2-grams with a maximum document frequency of 100% and a minimum term frequency of 2 for MNB and LR.
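For reference, the three reported metrics can be computed as in the following plain-Python sketch. It assumes that F1 refers to the controversial (positive) class, which the text does not state, and it computes AUC as the fraction of positive-negative pairs that are ranked correctly, counting ties as one half. The function names are hypothetical.

```python
from typing import Sequence, Tuple


def f1_and_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """F1 of the positive class (label 1) and overall accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return f1, (tp + tn) / len(pairs)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive outscores a negative.

    Requires at least one example of each class.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC is quadratic in the number of examples, which is fine for illustration but slower than rank-based implementations on a test set of several thousand comments.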

ST, MT CNNs
Stopwords and punctuation are kept as they contain discourse and sentiment features (see Sec. 5.3 for details). Low-frequency words are replaced with a pre-trained unknown word token (UNK). Conv+ReLU (ST): The controversy-only CNN outperformed FT at optimal CV parameters of: 1-5-grams, global max pooling, 128 filters and a classifier width of 1k. More filters or a 4k width decreased CV and test performance. Standard dropout (Hinton et al., 2012) and Batch Normalization (Ioffe and Szegedy, 2015) decreased performance, while 20% token-dropout (Gal and Ghahramani, 2016)

Feature-type ablation
We ablated sentiment, discourse and topical features (Choi et al., 2010; Cramer, 2011). Then, we re-tuned the Conv+ReLU (MT) on the 4, now ablated, CV folds to measure test-set performance changes as follows. Sentiment: three ablations, (1) polarity words (sent ws by Waltinger (2010)), (2) repeated punctuation (punct.), and (3) emoticons (emotes), as mentioned in Sec. 3. Discourse: Removal of German discourse markers (DiMLex) (Stede and Umbach, 1998). Topic: Noun removal as in Choi et al. (2010) to represent topical indicators. Figure 3 shows the relative percentage performance drop per ablation. Thus, for controversy prediction, topic was the most important, followed by discourse markers and sentiment, with repeated punctuation and emoticons being impactful style/sentiment features. Polarity words affect prediction, but are not language-independent.
Figure 3: Relative controversy prediction performance drop in % for removal of: sentiment (blues), discourse (orange) and topic/nouns (red).
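The ablation procedure amounts to removing one feature type from every comment and comparing test performance before and after. A minimal sketch, assuming a hypothetical lexicon per feature type (for instance a set of discourse markers) and case-insensitive matching, neither of which is specified in the text:

```python
from typing import List, Sequence, Set


def ablate(tokens: Sequence[str], lexicon: Set[str]) -> List[str]:
    """Remove every token that belongs to the ablated feature type."""
    return [t for t in tokens if t.lower() not in lexicon]


def relative_drop_percent(baseline: float, ablated: float) -> float:
    """Relative performance drop in %, the quantity plotted per ablation."""
    return 100.0 * (baseline - ablated) / baseline


# Hypothetical discourse-marker lexicon and a toy tokenized comment.
discourse_markers = {"but", "because"}
comment = ["But", "the", "party", "lied", ":)"]
ablated_comment = ablate(comment, discourse_markers)
```

As a worked example with made-up scores, a baseline of 0.80 that falls to 0.76 after ablation corresponds to a relative drop of 5%.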

Per-word impacts
Inspired by explainability methods (Li et al., 2016), we also measured the controversy prediction-score change when replacing a token with a class-neutral UNK token.
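The replacement procedure can be sketched as follows. The sign convention (impact = original score minus score after replacement, so a positive value means the token pushes the prediction toward controversial) and the toy scoring function are assumptions for illustration; in practice score_fn would wrap the trained model's controversy output.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def token_impacts(tokens: Sequence[str],
                  score_fn: Callable[[Sequence[str]], float],
                  unk: str = "UNK") -> List[Tuple[str, float]]:
    """Score change per token when it is replaced by the UNK token."""
    base_score = score_fn(tokens)
    impacts = []
    for i, token in enumerate(tokens):
        replaced = list(tokens[:i]) + [unk] + list(tokens[i + 1:])
        impacts.append((token, base_score - score_fn(replaced)))
    return impacts


def toy_score(tokens: Sequence[str]) -> float:
    """Hypothetical stand-in for a model's controversy score."""
    weights: Dict[str, float] = {"dictatorship": 0.3, ";)": -0.2}
    return 0.5 + sum(weights.get(t, 0.0) for t in tokens)


example_impacts = token_impacts(["a", "dictatorship", ";)"], toy_score)
```

With this toy scorer, "dictatorship" receives a positive impact and ";)" a negative one, while "a" has no impact.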

Discourse or punctuation ($):
Because it not_a UNK country but a dictatorship is . What UNK Putin of human_rights and peace $.$ . Had you the UNK or are you vaccinated $?$ ?

Context dependent word influence:
Interestingly , if one something negative against Windows posts will one instantly_be with UNK bombarded . But 2 years were we by Microsoft marketing and by Microsoft fan_boys UNK how cool yet not Windows 8 and 8 .1 is .

In Figure 4 we colored per-token score drops (red) or increases (blue) for German-to-English word-by-word translations on test set comments.
We show examples by ablation types as described in Sec. 5.3. As before, nouns and discourse markers increase controversy, while, as expected, an (#unserious) ;) emoticon is strongly counter-indicative of controversy. Repeated punctuation, like $.$ or $?$, also impacts prediction. Finally, the model learned context-dependent controversy polarity for the word Windows, which has both strong positive and negative polarity.

Token impacts on genre and controversy
To generate keywords for controversy and genre vs. controversy-per-genre, we averaged UNK token-replacement prediction impacts over all occurrences of a token t_i and calculated its impact mean μ(impacts(t_i)) and standard deviation σ(impacts(t_i)), similar to how Horn et al. (2017) extract topic keywords.
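Aggregating the per-occurrence impacts into μ and σ per token can be sketched as below. The function name, the min_count filter and the use of the population standard deviation are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, Iterable, List, Tuple


def aggregate_impacts(observations: Iterable[Tuple[str, float]],
                      min_count: int = 1) -> Dict[str, Tuple[float, float, int]]:
    """Map each token to (impact mean, impact std-dev, occurrence count)."""
    per_token: Dict[str, List[float]] = defaultdict(list)
    for token, impact in observations:
        per_token[token].append(impact)
    return {
        token: (mean(values), pstdev(values), len(values))
        for token, values in per_token.items()
        if len(values) >= min_count  # hypothetical frequency filter
    }


# Toy (token, impact) observations collected over several comments.
toy_observations = [("Windows", 0.4), ("Windows", -0.4), (".", 0.0), (".", 0.0)]
token_stats = aggregate_impacts(toy_observations)
```

In the toy data, "Windows" has a mean impact of zero but a standard deviation of 0.4, so the mean alone would hide its context-dependent influence.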
Controversy keywords: In Table 4 we divided tokens into infrequent (top half) and common tokens (lower half). Infrequent tokens have over 10 occurrences, frequent ones at least 200.
Table 4 column headers: (a) 0 con, (b) ↑↓ con, (c) ↑ con, (d) ↓ con.
The tokens impact controversy either: (a) not at all, (b) positively or negatively, (c) by generally increasing it, or (d) by generally decreasing it. We see that standard punctuation has no impact on controversy (a), but repeated punctuation, emotes and political terms do (b). As expected, political terms generally increase controversy (c), while colloquialisms and friendly emotes lower it (d).
Genre vs. controversy-per-genre keywords: We examined mean token impacts μ(impacts) on genre classification vs. per-genre controversy in Table 3 for the four most interesting genres. The domestic politics genre is dominated by established Austrian parties or generic political terms, while right-wing, left-wing and liberal parties characterize domestic controversy. The international genre shows mostly war-related terms. Its controversy focuses on the 2015 Ukraine and Middle East conflicts. Keywords for the economy genre are general finance terms, whereas the Greek debt crisis dominates genre controversy. The panorama genre focuses on refugee-related terms, where the related right-wing issues caused controversy in 2015.

Conclusion
We proposed a fully automated, incidentally supervised, multi-task approach for comment-controversy prediction and showed that it successfully captures contextual controversy semantics despite using only minimal, language-independent preprocessing and feature creation. In the future, we aim to extend data collection to study controversy drift over time.