NLP@UIT at SemEval-2019 Task 4: The Paparazzo Hyperpartisan News Detector

This paper describes the system of NLP@UIT that participated in Task 4 of SemEval-2019. We developed a system that predicts whether an English news article follows a hyperpartisan argumentation. Paparazzo is the name of our system and also the code name of our team in Task 4 of SemEval-2019. The Paparazzo system, which uses tri-grams of words and hepta-grams of characters, officially ranked thirteenth with an accuracy of 0.747. Another system of ours, which utilizes tri-grams of words, tri-grams of characters, tri-grams of part-of-speech, syntactic dependency sub-trees, and named-entity recognition tags, achieved an accuracy of 0.787; it was proposed after the deadline of Task 4.


Introduction
Fake news has become a noteworthy term in recent years. The growing number of users and the rapid spread of information on social networks have made the automatic control of fake news more difficult. Fake news articles are typically extremely one-sided (hyperpartisan), inflammatory, emotional, and often riddled with untruths (Potthast et al., 2018). The influence of misinformation varies depending on the style it is written in. For example, sarcasm in a sports news article will have less of an impact than news written in the hyperpartisan argumentation style, which can sway voters' decisions in an election.
Hyperpartisan detection in news articles is one way to control fake news in the media and the public sphere. Kiesel et al. (2019) proposed a new task, which they name "Hyperpartisan News Detection," to decide whether a news article text follows a hyperpartisan argumentation. We approach this task as traditional text classification by extracting style features. The bag-of-words model is a common way of representing text and has been applied effectively to sentiment analysis (Pang et al., 2002). Matsumoto et al. (2005) applied text mining techniques to dependency sub-trees as features for sentiment analysis at the document level. Our results show that n-grams of words and dependency sub-tree features from the sentences of a document have a certain impact on the performance of the classifier. The details of the features in our systems and the results are described in Section 3 and Section 4.

Task Description
SemEval-2019 Task 4 consists of a single task, in which participants are required to build systems for hyperpartisan news detection. The task is to predict which category ("hyperpartisan" vs. "not hyperpartisan") an argumentation belongs to, given a news article in English (Kiesel et al., 2019). There are 645 articles in the for-ranking training set and 628 articles in the for-ranking testing set (all of them labeled through crowdsourcing on a per-article basis). In addition, the organizers of this task provided another dataset whose training/validation/testing sets contain 600,000/150,000/4,000 articles (all of them labeled according to the judgment of the publisher). The organizers use accuracy on the for-ranking testing set as the main metric to evaluate the performance of the participants' systems. All submissions and results are validated by the organizers via the evaluation service TIRA.

Data Preprocessing
Data preprocessing of the given input is an important phase for every task related to natural language processing. The input of SemEval-2019 Task 4 is an XML file containing a title and many paragraphs in the body text. Paragraph segmentation is based on the HTML <p> tag because the <p> tag defines a paragraph. While many paragraphs are wrapped in a <p> tag, some are not. Observation of some inputs from the dataset shows that paragraphs not wrapped by any HTML tag may contain "noise," such as advertisements and the browser's error messages. On the other hand, text displayed in HTML <p> tags can also contain "noise," such as notifications for redirecting a page (e.g., "Click here to..."). We did not handle the aforementioned noise in our experiments. The next step after paragraph segmentation is sentence segmentation. During this process, we used the spaCy tool (Honnibal and Montani, 2017) to extract sentences from titles, from HTML <p> tags, and from paragraphs not wrapped in any HTML tag of the input (see the diagram in Figure 1).
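The paragraph-handling step above can be sketched as follows. This is a minimal stdlib illustration over a toy article (the function name and example XML are ours, not the task's actual schema); the actual system additionally uses spaCy for sentence segmentation.

```python
from xml.etree import ElementTree

def split_paragraphs(article_xml):
    """Separate text wrapped in <p> tags from unwrapped body text.

    Unwrapped text is where most of the observed noise
    (advertisements, browser error messages) appears.
    """
    root = ElementTree.fromstring(article_xml)
    wrapped = [p.text.strip() for p in root.iter("p") if p.text and p.text.strip()]
    # Text hanging directly under the root (not inside any tag) is "unwrapped".
    unwrapped = []
    if root.text and root.text.strip():
        unwrapped.append(root.text.strip())
    for child in root:
        if child.tail and child.tail.strip():
            unwrapped.append(child.tail.strip())
    return wrapped, unwrapped

article = "<article>Loose ad text.<p>First paragraph.</p><p>Second one.</p></article>"
wrapped, unwrapped = split_paragraphs(article)
```

Both groups of paragraphs are then passed through sentence segmentation, but the distinction between them matters later, since some of our models treat the two kinds of text differently.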

N-grams of words
Before extracting n-grams of words, we break the sentences into words in three ways:
1. WS 1 : The sentence is split on spaces (including multiple spaces) into tokens.
2. WS 2 : The sentence is split on spaces (including multiple spaces) into tokens; after that, we discard tokens that are punctuation or English stopwords.
3. WS 3 : The sentence is segmented into words, and the words are then lemmatized. All of this is done with the spaCy tool (Honnibal and Montani, 2017).
After splitting/segmenting the sentence into tokens/words, we lowercase all tokens/words and extract n-grams from them. The specific values of n for the prediction models are given in Section 3.3.
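The splitting schemes and the n-gram extraction can be sketched roughly as follows. This is a simplified illustration with a tiny stopword list of our own; WS 3's spaCy-based segmentation and lemmatization is omitted.

```python
import re

STOPWORDS = {"the", "a", "an", "is", "and", "of"}  # tiny illustrative list

def tokenize(sentence, scheme="WS1"):
    """WS1: split on (multiple) spaces; WS2: additionally drop punctuation/stopwords."""
    tokens = re.split(r"\s+", sentence.strip())
    if scheme == "WS2":
        tokens = [t for t in tokens
                  if t.lower() not in STOPWORDS and not re.fullmatch(r"\W+", t)]
    return [t.lower() for t in tokens]  # all tokens are lowercased

def word_ngrams(tokens, n):
    """Contiguous n-grams of tokens, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("The senator denied the claim .", scheme="WS2")
trigrams = word_ngrams(tokens, 3)
```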

N-grams of characters
Extracting n-grams of words is effective for text classification based on word-level representation, but this approach requires reliable tokenizers to break the sentences into words. In experiments on unsolicited e-mail messages (spam) with a variety of evaluation measures, Kanaris et al. (2007) showed that n-grams of characters are more reliable for classifying texts than n-grams of words. Potthast et al. (2018) showed how a style analysis can distinguish hyperpartisan news from the mainstream, and they also used tri-grams of characters as features for the classifier in their experiments. Unfortunately, as we described in Section 3.1, the input of SemEval-2019 Task 4 contains noise that produces strange character sequences and challenges tokenizers. Therefore, we decided to use n-grams of characters as features in our system. To extract the n-grams of characters, we use the sentence with all of its tokens rejoined with a space character after the segmentation in WS 1 (described in Section 3.2.1). In our experiments, the value of n ranges from 2 to 7; the specific values for the prediction models are given in Section 3.3.

N-grams of part-of-speech

Argamon et al. (2003) found that n-grams of part-of-speech can efficiently capture syntactic information and the gender-based style of the writer. Potthast et al. (2018) used tri-grams of part-of-speech in a comparative style analysis of hyperpartisan (extremely one-sided) news and fake news. Although the efficacy of n-grams of part-of-speech for fake news was not examined in their study, we decided to experiment with n-grams of part-of-speech as features for hyperpartisan news detection. We used the spaCy tool (Honnibal and Montani, 2017) for part-of-speech tagging and extracted tri-grams of part-of-speech as features.
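The character n-gram extraction over the space-rejoined WS 1 tokens, with n ranging from 2 to 7, can be sketched as follows (function name and example are ours):

```python
def char_ngrams(text, n_min=2, n_max=7):
    """Character n-grams over the whitespace-rejoined sentence (WS1 tokens)."""
    text = " ".join(text.split())  # rejoin tokens with single spaces
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams

grams = char_ngrams("fake news", 2, 3)
```

Note that the grams cross token boundaries (e.g., "e n"), which is part of what makes character n-grams robust to tokenization noise.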

Sub-trees of dependency tree
In our experiment, dependency parsing involves extracting from a dependency tree a dependency sub-tree, which Matsumoto et al. (2005) define as "a tree obtained by removing zero or more nodes and branches from the original dependency tree." Figure 2 illustrates a dependency tree of a sentence parsed with the spaCy tool (Honnibal and Montani, 2017), and also a shortcoming of the parse: the double quotation mark on the left has neither a child node nor a parent node. This shortcoming did not affect the extraction of the sub-trees of the dependency tree, but we resolved the issue by considering each group of sub-trees as one connected component, and the dependency tree as a graph that can contain more than one connected component. As shown in Figure 3, the number of nodes in a sub-tree can range from 2 to 4, and the NetworkX tool (Hagberg et al., 2008) was used to extract all the sub-trees of the original dependency tree as one connected component for each node. All words at the nodes of sub-trees are lemmatized in our experiment. As we can see in Figure 3, some sub-trees can capture words that are not located close to each other.

Figure 2: Visualization of the dependency tree of the sentence within the brackets ("She's the one, and PER X, that caused the violence," PER Y said.). This sentence is taken from a news article of the for-ranking training set mentioned in Section 2. The person's name is replaced by PER {uppercase letter} in this example (we did not do this in our experiment).
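The sub-tree extraction can be illustrated with a small stdlib sketch over a toy lemmatized edge list (our experiments used NetworkX; this approximation, with node and function names of our own, only conveys the idea of enumerating connected node subsets of size 2 to 4):

```python
from itertools import combinations

def connected(nodes, edges):
    """Check that `nodes` form one connected component under undirected `edges`."""
    nodes = set(nodes)
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        u = stack.pop()
        seen.add(u)
        for a, b in edges:
            if a == u and b in nodes and b not in seen:
                stack.append(b)
            if b == u and a in nodes and a not in seen:
                stack.append(a)
    return seen == nodes

def subtrees(nodes, edges, size_min=2, size_max=4):
    """Enumerate connected node subsets of the dependency graph (sub-trees)."""
    return [subset for k in range(size_min, size_max + 1)
            for subset in combinations(sorted(nodes), k)
            if connected(subset, edges)]

# Toy lemmatized dependency edges: cause -> she, cause -> violence
edges = [("cause", "she"), ("cause", "violence")]
trees = subtrees({"cause", "she", "violence"}, edges)
```

The subset ("she", "violence") is rejected because the two words are not directly linked, while ("cause", "violence") is kept even though the words need not be adjacent in the sentence.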

Named-entity recognition tags
The input of SemEval-2019 Task 4 characteristically contains names of people and names of organizations. Therefore, we decided to use mentions of specific terms from named-entity recognition as features. In our experiments, a feature is represented by concatenating a mention and its named-entity recognition tag. We used the spaCy tool (Honnibal and Montani, 2017) for the named-entity recognition task.
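A minimal sketch of this feature construction (the exact concatenation format, an underscore here, is an illustrative assumption; the (mention, tag) pairs would come from spaCy's entity recognizer):

```python
def ner_features(entities):
    """Concatenate each mention with its named-entity tag to form one feature.

    `entities` is a list of (mention, tag) pairs, e.g. built from spaCy's doc.ents.
    """
    return [f"{mention.lower()}_{tag}" for mention, tag in entities]

feats = ner_features([("John Smith", "PERSON"), ("Reuters", "ORG")])
```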

Prediction Models
In this section, we describe the four models that we submitted to the organizers. In all models, we used a linear SVM (SGDClassifier from Scikit-learn (Pedregosa et al., 2011)) as the classifier, with the hinge loss function and L2 regularization. We did not run validation experiments for tuning the regularization term α of the models; we simply used the default value of α = 0.0001 from SGDClassifier in Scikit-learn (Pedregosa et al., 2011). Most importantly, we concatenated the different count vectors obtained by extracting the features described in Section 3.2 to form the input representation of each model.
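A rough Scikit-learn sketch of such a model, concatenating two count-vector views of the text before the linear SVM (the toy texts and the particular pair of feature types are illustrative; the real models use the feature combinations described per model below):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import FeatureUnion, Pipeline

# Two count-vector views, concatenated side by side by FeatureUnion.
features = FeatureUnion([
    ("word_3grams", CountVectorizer(analyzer="word", ngram_range=(3, 3))),
    ("char_7grams", CountVectorizer(analyzer="char", ngram_range=(7, 7))),
])

clf = Pipeline([
    ("features", features),
    # hinge loss + L2 penalty = linear SVM; alpha left at its 0.0001 default
    ("svm", SGDClassifier(loss="hinge", penalty="l2", alpha=0.0001,
                          random_state=0)),
])

texts = ["the senator denied the baseless claim today",
         "officials quietly confirmed the report details"] * 4
labels = [1, 0] * 4
clf.fit(texts, labels)
preds = clf.predict(texts)
```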

First model
This model uses tri-grams of words split from the text of the article (we did not segment the text into sentences) in the way described as WS 1 in Section 3.2.1. Besides, the first model uses hepta-grams of characters from the text of the article as features. We discarded the title when extracting the features for the first model, and we did not distinguish between text wrapped in an HTML <p> tag and text without one (as mentioned in Section 3.1).

Second model
We extracted bi-grams of characters from the body text regardless of whether the text was wrapped in an HTML <p> tag, and for the title, we extracted its bi-grams of words in the way described as WS 2 in Section 3.2.1. Additionally, we extracted all named-entity mentions from all sentences of the article, and we distinguished between features from the title and those from the body text.

Third model
This model has the same features as the second one, with the addition of the extracted dependency sub-trees.

Fourth model
In this model, the title, the text wrapped in an HTML <p> tag, and the text without one are distinguished. All sentences are segmented into tokens in the way described as WS 3 in Section 3.2.1. We extracted tri-grams of words, tri-grams of characters, tri-grams of part-of-speech, syntactic dependency sub-trees, and named-entity recognition tags from the text before performing the TF-IDF transformation with the Scikit-learn tool (Pedregosa et al., 2011) on the combined features, with min_df at 0.05 and max_df at 0.95.

We did not use the training dataset of 600,000 articles for training any of the models in our experiments. The results (Table 1) show a decrease in the performance of the second and third models when n-grams of words were not used as features. The accuracy of the third model, however, increased by 2% compared with the second model when the extra dependency sub-trees were used as features. The fourth model achieved the highest accuracy, up to 0.787. This accuracy, however, is still lower than that achieved via deep learning techniques, such as the convolutional neural network and pre-trained ELMo representations employed by the "Bertha von Suttner" team, who ranked first in SemEval-2019 Task 4.

Conclusion
Our major contribution to SemEval-2019 Task 4 is showing that using n-grams of words and dependency sub-trees as features has a positive impact on the performance of the classifier: in our experiments, we achieved an accuracy of 0.787 with the proposed model that uses tri-grams of words, tri-grams of characters, tri-grams of part-of-speech, syntactic dependency sub-trees, and named-entity recognition tags. Through the dependency sub-trees, that model can also capture words that are not located close to each other. The disadvantages of our models, however, are that the extraction of dependency sub-trees is a time-consuming process and that the relations between the sentences of an article are not represented.