FEUP at SemEval-2018 Task 5: An Experimental Study of a Question Answering System

We present the approach developed at the Faculty of Engineering of the University of Porto to participate in SemEval-2018 Task 5: Counting Events and Participants within Highly Ambiguous Data covering a very long tail. The work described here presents the experimental system developed to extract entities from news articles for the sake of Question Answering. We propose a supervised learning approach to enable the recognition of two different types of entities: Locations and Participants. We also discuss the use of distance-based algorithms (using Levenshtein distance and Q-grams) for the detection of documents’ closeness based on the entities extracted. For the experiments, we also used a multi-agent system that improved the performance.


Introduction
Thousands of news articles are published every day across many media outlets. Representing and reasoning over all the events in these articles is a challenging task. For instance, answering questions such as How many people died in the shootings in Philippi on 30th September, 2017?, How many people died last year in Birmingham?, or How many people were killed by John List? requires a deep understanding of many phenomena in the articles. For example, news story updates and duplicate news need to be considered when processing the answer. We can simplify the problem by identifying the relevant entities in the news and creating a structured representation to store these data.
Named Entity Recognition (NER) is a task that aims at identifying and classifying entity mentions in free text. The Message Understanding Conference (MUC) defines entities as belonging to three categories:
1. Enamex: names, such as Locations, Persons, Organizations, and others
2. Timex: temporal expressions
3. Numex: numerical elements, such as numbers and percentages.
Footnote 1: https://competitions.codalab.org/competitions/17285
In this paper, we present an experimental study to extract entities from news articles to answer questions. We make use of a supervised learning approach to deal with the recognition of two different kinds of entities: Locations (e.g. Philippi, Birmingham) and Participants (e.g. John List). We also studied the use of distance algorithms (Levenshtein and Q-grams) for near document detection based on the entities extracted.
The remainder of the paper is organized as follows. In Section 2, we describe SemEval-2018 Task 5, followed by an overview of the state of the art in Named Entity Recognition in Section 3. In Section 4, we present the state of the art in the Near Document Detection task, followed by the description of the system architecture in Section 5. In Section 6, we present the approach, followed by the experimental setup in Section 7. The results are discussed in Section 8.

Task Description
The main goal of SemEval-2018 Task 5 (Postma et al., 2018) is to answer questions based on a set of provided news articles, e.g. How many killing incidents happened in 2016 in Columbus, Mississippi? Each question has three components: an event type and two event properties. The event type is one of four: killing, injuring, fire burning, and job firing. Event properties are the characteristics associated with the event. They can include Locations (City or State), Participants (First Name, Last Name, Full Name), and Time (Day (e.g. 1/1/2015), Month (e.g. 1/2015), or Year (e.g. 2015)). There are three subtasks:
• Subtask 1 (S1): Find the single event that answers the question

Named Entity Recognition
A wide range of approaches has been developed to tackle NER. Early systems dealt with this issue by making use of handcrafted rule-based algorithms (Hearst, 1992). More recently, systems have focused on machine learning techniques: supervised (Florian et al., 2003), semi-supervised (Collins and Singer, 1999; Mikheev et al., 1999), and unsupervised learning. However, the major drawback of supervised learning is its dependence on annotated data. When training examples are unavailable, handcrafted rules remain the practical technique (Riaz, 2010).

Near Document Detection
In the large number of news articles published every day, the same information can be repeated in many different articles. The identification of similar or near-duplicate documents is applied in the detection of plagiarized documents (Hoad and Zobel, 2003), similar web pages (Henzinger, 2006), and similar news articles (Abreu et al., 2015). Identification of similar or near-duplicate pairs of documents in a large collection is a significant problem with widespread applications. Kumar and Govindarajulu (2009) present approaches used to solve this issue. For this kind of problem, three main approaches have been proposed: based on URLs, on lexicon, and, the third and most sophisticated, on semantics (Abreu et al., 2015).
In the work presented here, we are using the semantics-based approach applied to the information previously extracted from the news articles.

Architecture
The system consists of the following main components:

Creating a Structured News Representation
Figure 1 presents the architecture used to parse the news article. After converting CoNLL to plain text, journalistic patterns are removed as demonstrated in Table 5. Journalistic patterns could be relevant for the reader, but not for the entity recognition task. Examples of such patterns include:
Copyright 2017 by WJXT News4Jax - All rights reserved.
To get alerts for breaking news, grab the free NBC4 News App for iPhone or Android.
contact kimber laux at klaux@reviewjournal.com or 702-383-0283.
Contact Jessica Terrones at jterrones@reviewjournal.com or at 702-383-0381.
The output of this system is a structured news representation with a list of Event Types, Locations, Participants, and Temporal Expressions. Additionally, the following sources of information are also extracted: the news identifier, publication date, and news title. To build this representation, the following four extractions are performed:
Extract list of Event Types. We use WordNet (Fellbaum, 1998) to create a list of words that can be used to describe an event type. Our approach uses the news article title and body for the event type recognition. For each one of these elements, the English Snowball Stemmer is applied. We consider a document to have a certain event type if at least one term that describes an event type is present in the news title or body.
Extract Locations and Participants. For the Locations and Participants recognition, a supervised approach is used. The approach proposed is described in Section 6.
Extract Temporal Expressions. Our approach to finding temporal information in the news article is based on the application of regular expressions. Table 2 presents some of the regular expressions used.
Figure 1: Create a structured news representation approach
Extract Auxiliary Information. The title, publication date, and news identifier are also extracted from the news article to create the structured news representation.
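The regular-expression step for temporal expressions described above can be sketched as follows. This is only an illustration: the three patterns are hypothetical and are modelled on the Day, Month, and Year formats given in the task description (1/1/2015, 1/2015, 2015), not on the expressions listed in Table 2. The function name `find_temporal_expressions` is likewise an assumption.

```python
import re

# Hypothetical patterns (not the ones from Table 2). The lookarounds stop a
# shorter pattern from matching inside a longer date such as 1/1/2015.
DAY = re.compile(r"(?<![\d/])\d{1,2}/\d{1,2}/\d{4}(?![\d/])")
MONTH = re.compile(r"(?<![\d/])\d{1,2}/\d{4}(?![\d/])")
# Restricting years to 19xx/20xx is an assumption that avoids matching
# arbitrary four-digit numbers.
YEAR = re.compile(r"(?<![\d/])(?:19|20)\d{2}(?![\d/])")


def find_temporal_expressions(text):
    """Return (granularity, matched string) tuples found in the text."""
    found = []
    for label, pattern in (("day", DAY), ("month", MONTH), ("year", YEAR)):
        for match in pattern.finditer(text):
            found.append((label, match.group(0)))
    return found


if __name__ == "__main__":
    sample = "Shooting on 1/1/2015; trial set for 3/2016; verdict in 2017."
    print(find_temporal_expressions(sample))
```

The sketch returns the raw matched strings and does not decide whether a date such as 1/2/2015 is day-first or month-first.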
Search all the news that answer a question
When the system receives a question, an answer is retrieved based on the structured news representation. First, for each element (Event Type and Event Properties), a list of the news articles related to that element is composed. In the end, the news article or set of news articles that addresses all the elements under analysis is extracted.
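The search step above amounts to intersecting the per-element lists of news articles. A minimal sketch, assuming a hypothetical index that maps each element (for example `("location", "Columbus")`) to the set of news identifiers related to it; the index layout and the function name `search_news` are illustrative and not taken from the system:

```python
def search_news(index, question_elements):
    """Return the ids of the news articles related to every question element.

    index: dict mapping an element to a set of news ids (hypothetical layout).
    question_elements: the event type and event properties of the question.
    """
    candidate_sets = [index.get(element, set()) for element in question_elements]
    if not candidate_sets:
        return set()
    # Only articles that appear in every per-element list answer the question.
    return set.intersection(*candidate_sets)


if __name__ == "__main__":
    index = {
        ("event", "killing"): {"n1", "n2", "n3"},
        ("location", "Columbus"): {"n2", "n3", "n4"},
        ("year", "2016"): {"n3", "n5"},
    }
    question = [("event", "killing"), ("location", "Columbus"), ("year", "2016")]
    print(search_news(index, question))
```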

Near document detection
Near document detection is performed on the set of news articles that answer a question. The approach is explained in Section 6.

Counting participants
As with the previous component, this one only uses the news article or set of news articles that answer a question. For this set of news articles, we only process the information given by the news article title. For each Event Type, we manually define the variation trend (increase/decrease/stable); e.g. in a killing event, the number of deaths can increase as the number of injured decreases. We start by normalizing and removing temporal expressions from the news article title. Next, we apply the POS tagger and split the sentence into subsentences separated by commas. We then try to recognize the event type in each subsentence. When we find it, we check whether the subsentence also includes a numerical element (tagged 'CD' by the POS tagger); this element is taken as the number of participants associated with the event type. Once the number of participants associated with each news article has been extracted, we connect this information with the news article's date. Finally, we take the maximum or the minimum number of participants, depending on the trend defined for the event type.
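The counting procedure above can be illustrated with a simplified sketch. Several assumptions are made here and none of them comes from the system itself: a digit-matching regular expression stands in for the 'CD' tag of the POS tagger (so numbers written as words are missed), temporal expressions are assumed to be removed already, the "stable" trend is not handled, and the names `extract_count` and `count_participants` are hypothetical.

```python
import re


def extract_count(title, event_terms):
    """Return the first number found in a subsentence that mentions the event."""
    for subsentence in title.lower().split(","):
        tokens = re.findall(r"[a-z]+|\d+", subsentence)
        if any(token in event_terms for token in tokens):
            numbers = [int(token) for token in tokens if token.isdigit()]
            if numbers:
                return numbers[0]
    return None


def count_participants(titles, event_terms, trend):
    """Pick the maximum or minimum count over all titles, depending on the trend."""
    counts = [extract_count(title, event_terms) for title in titles]
    counts = [count for count in counts if count is not None]
    if not counts:
        return None
    # "increase" keeps the largest reported number; anything else keeps the smallest.
    return max(counts) if trend == "increase" else min(counts)


if __name__ == "__main__":
    titles = [
        "2 killed, 3 injured in shooting",
        "Police: 3 killed in downtown shooting",
    ]
    print(count_participants(titles, {"killed", "dead"}, "increase"))
```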

The Proposed Approach

NER Supervised Approach
In this subsection, we describe the implementation details of the proposed approach for recognizing Locations and Participants.

Natural Language Processing Tasks:
The data was preprocessed with two NLP tasks: part of speech tagging and stop word recognition.

Features
Supervised learning techniques require their input to be categorized. When extracting information from news documents, it is common to label each word with a set of features. These features allow the SL approach to recognize an entity in a given document. We extracted the following features:
1. CAP (Capitalized) indicates whether a word: contains no capital characters, ...
NP (Paragraph) records a numeric identifier of the paragraph in which the word appears.
Table 3 presents an example of the features computed for each word in the phrase "shooting at a west Phoenix apartment that left one man dead". For instance, the word "Phoenix" is capitalized (CAP = 1), corresponds to a noun (PT = NNP), and is not a stop-word (SWI = 0). Note that we aggregate all sequential capitalized words as one, e.g. "Salt Lake City" will be combined into a single word to be classified.
We believe that a simple association as illustrated in Table 3 is not enough to categorize a word for the Named Entity Recognition task. For this reason, we also consider the word context in the document, i.e., the current word (C), the previous word (P), and the next word (N). Here we indicate the word position following the feature abbreviation, e.g., "C CAP" indicates whether the current word is capitalized or not.
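The context window described above (previous, current, and next word) can be sketched as follows. This is a simplified illustration under stated assumptions: CAP is reduced to a binary flag on the first character, the stop-word list is a tiny made-up sample, the POS-tag feature is left out because it needs a tagger, the placeholder "X" for a missing neighbour is an assumption, and the key format `P_CAP` / `C_CAP` / `N_CAP` and all function names are hypothetical.

```python
STOP_WORDS = {"a", "at", "that", "the", "in", "of"}  # tiny illustrative list


def merge_capitalized(tokens):
    """Join runs of capitalized tokens, e.g. Salt + Lake + City -> 'Salt Lake City'."""
    merged = []
    for token in tokens:
        if merged and token[:1].isupper() and merged[-1][:1].isupper():
            merged[-1] += " " + token
        else:
            merged.append(token)
    return merged


def word_features(word):
    """Binary CAP and SWI features for a single word (simplified)."""
    return {
        "CAP": int(word[:1].isupper()),
        "SWI": int(word.lower() in STOP_WORDS),
    }


def context_features(tokens):
    """Build one feature row per token from its previous, current and next word."""
    rows = []
    for i, token in enumerate(tokens):
        row = {"word": token}
        for prefix, j in (("P", i - 1), ("C", i), ("N", i + 1)):
            if 0 <= j < len(tokens):
                features = word_features(tokens[j])
            else:
                features = {"CAP": "X", "SWI": "X"}  # no neighbour at the boundary
            for name, value in features.items():
                row[f"{prefix}_{name}"] = value
        rows.append(row)
    return rows


if __name__ == "__main__":
    tokens = merge_capitalized("shooting at a west Phoenix apartment".split())
    for row in context_features(tokens):
        print(row)
```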

Data Cleaning and Transformation
Data quality is the main challenge of information management. To guarantee data quality, two processes were executed: data cleaning and data transformation. Tables 4 and 5 present the data transformation for POS tags and stop-words.

POS tagger, all tags: http://www.nltk.org/book/ch05.html (visited in November 2017)
Stop-words have no value for SWA. To fix this, we replace an empty value with the character "X" and encode this value as demonstrated in Table 5.

Classification Algorithms
Supervised learning techniques create a model that predicts the value of a target variable based on a set of input variables. One challenge is to select the most appropriate algorithm for the task of classifying Locations and Participants. We compared the following algorithms: Support Vector Classifier (SVC), Decision Tree Classifier (Tree), Random Forest Classifier (Random), and Extra Trees Classifier (Extra). As shown in Table 6, different configurations were attempted for each algorithm. Implementations of these algorithms are provided by the Python library scikit-learn.

Near Document Detection
The answer to a question in this SemEval task consists of the following: question identifier, set of the news articles that help to answer the question, and a numerical answer.
The numerical answer to a question depends on the question task. Task 2 requires the number of unique events that correspond to a question. For this purpose, it is essential to detect similar news documents within the given set. To detect similar documents, we use the structured news representation described above. Each pair of news articles is compared based on their titles, their lists of Participants, and their lists of Locations.

Experimental Setup

NER Approach

Evaluation
The metrics used to evaluate this approach are Precision (P), Recall (R), and F1 (F). Because of the large number of experiments, and in order to analyze the obtained results correctly, we made use of a multi-agent architecture to find the best results. For this evaluation, we defined a utility function and introduced an auction mechanism to enable a form of negotiation. This mechanism is based on the English auction, where each agent can propose bids following the auction requirements. Our agents represent the different configurations of the classification algorithms, and each bid reveals their result on a specific test scenario. We expect recall to be the most important metric in this experiment, so the utility function assigns it a higher weight than the other metrics. In order to reduce the data to be analyzed, we exclude all combinations with low performance, namely all combinations where either Recall, Precision, or F1 has a mean value below 60% or a standard deviation above 15%.
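The filtering rule and the recall-weighted utility can be sketched as follows. The weights are hypothetical (the exact utility function is not reproduced in this text); they only respect the stated constraint that recall weighs more than the other metrics. The auction is reduced here to picking the highest bid, the population standard deviation is an assumed choice, and all names (`WEIGHTS`, `utility`, `keep`, `run_auction`) are illustrative.

```python
from statistics import mean, pstdev

# Hypothetical weights: recall counts more than precision and F1.
WEIGHTS = {"recall": 0.5, "precision": 0.25, "f1": 0.25}


def utility(scores):
    """Weighted sum of the three metrics for one configuration."""
    return sum(WEIGHTS[metric] * scores[metric] for metric in WEIGHTS)


def keep(fold_scores, min_mean=0.60, max_std=0.15):
    """Apply the exclusion rule: drop a configuration if any metric has a
    mean below 60% or a standard deviation above 15% across folds."""
    for metric in WEIGHTS:
        values = [fold[metric] for fold in fold_scores]
        if mean(values) < min_mean or pstdev(values) > max_std:
            return False
    return True


def run_auction(agents):
    """Each surviving agent bids its utility; the highest bid wins."""
    bids = {}
    for name, fold_scores in agents.items():
        if keep(fold_scores):
            averages = {m: mean(fold[m] for fold in fold_scores) for m in WEIGHTS}
            bids[name] = utility(averages)
    return max(bids, key=bids.get) if bids else None


if __name__ == "__main__":
    agents = {
        "tree": [{"recall": 0.9, "precision": 0.8, "f1": 0.85},
                 {"recall": 0.8, "precision": 0.8, "f1": 0.8}],
        "svc": [{"recall": 0.5, "precision": 0.9, "f1": 0.65},
                {"recall": 0.5, "precision": 0.9, "f1": 0.65}],
    }
    print(run_auction(agents))
```

Adapting the system to another scenario would then mean changing only the weights, which matches the role the utility function plays in the evaluation.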
Experiments
A supervised learning system was needed to generate a model. The classification algorithms and the scenarios (S1, S2, S3, and S4) defining the values of the features are those described in Section 6.1.
Our experiments used cross-validation with k = 7. We divided the annotated data into partitions of training data (75%) and testing data (15%).

Data Resources
The near document detection approach was studied with the dataset provided at the end of the competition. Each intended answer includes a list of similar documents identified in the given dataset and aggregated according to the corresponding question. For each answer, we created a script to aggregate all news articles in pairs. Additionally, a label was added indicating whether a pair is similar or not (pairs that are contained in the same set are similar). In total, this resulted in 61,931 pairs of news articles.
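The pair-building script can be sketched as below. This is an assumed reconstruction rather than the script itself: it supposes that, for one answer, the similar documents come as a list of groups of news identifiers and that each identifier belongs to exactly one group. The function name `build_pairs` is hypothetical.

```python
from itertools import combinations


def build_pairs(groups):
    """Pair all news ids of one answer and label pairs from the same group as similar.

    groups: list of lists of news ids; documents in the same inner list
    are considered similar (assumed input layout).
    """
    group_of = {doc: g for g, docs in enumerate(groups) for doc in docs}
    pairs = []
    for doc_a, doc_b in combinations(sorted(group_of), 2):
        pairs.append((doc_a, doc_b, group_of[doc_a] == group_of[doc_b]))
    return pairs


if __name__ == "__main__":
    for pair in build_pairs([["n1", "n2"], ["n3"]]):
        print(pair)
```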

Evaluation
We evaluate the performance of various thresholds on near document detection by applying the metrics: Precision, Recall, and Accuracy.
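As an illustration of this evaluation, the sketch below computes the three metrics for one threshold over labelled pairs. It assumes each pair carries a similarity score in [0, 1] and that a pair is predicted similar when its score reaches the threshold; both the input layout and the name `evaluate_threshold` are hypothetical.

```python
def evaluate_threshold(scored_pairs, threshold):
    """Return (precision, recall, accuracy) for one similarity threshold.

    scored_pairs: list of (similarity, is_similar) tuples (assumed layout).
    """
    tp = fp = tn = fn = 0
    for similarity, is_similar in scored_pairs:
        predicted = similarity >= threshold
        if predicted and is_similar:
            tp += 1
        elif predicted and not is_similar:
            fp += 1
        elif not predicted and is_similar:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(scored_pairs) if scored_pairs else 0.0
    return precision, recall, accuracy


if __name__ == "__main__":
    pairs = [(0.9, True), (0.8, False), (0.4, True), (0.1, False)]
    for threshold in (0.5, 0.85):
        print(threshold, evaluate_threshold(pairs, threshold))
```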

Experiments:
For each pair of news articles, we calculated the similarity between their elements: title (T), list of participants (Part), and list of locations (Loc). For the sake of comparison, we used two distance algorithms: Levenshtein (L) (Levenshtein, 1966) and Q-grams (Q) (Ullmann, 1977). We defined two scenarios (SS1, SS2) that differ in the weights assigned to the document elements.
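The SS1 and SS2 weights are not reproduced in this text. Independently of them, the two measures and a weighted combination over document elements can be sketched as follows. The normalizations to a [0, 1] similarity, the q-gram distance as an L1 difference between q-gram counts, q = 2, the example weights, and the representation of each element as a single string are all assumptions made for illustration.

```python
from collections import Counter


def levenshtein(a, b):
    """Edit distance with unit costs for insertion, deletion and substitution."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """1 minus the distance normalized by the longer string (assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def qgram_similarity(a, b, q=2):
    """1 minus the normalized L1 distance between q-gram count profiles."""
    profile_a = Counter(a[i:i + q] for i in range(len(a) - q + 1))
    profile_b = Counter(b[i:i + q] for i in range(len(b) - q + 1))
    total = sum(profile_a.values()) + sum(profile_b.values())
    if total == 0:
        return 1.0 if a == b else 0.0
    distance = sum(abs(profile_a[g] - profile_b[g]) for g in set(profile_a) | set(profile_b))
    return 1.0 - distance / total


def document_similarity(doc_a, doc_b, weights, similarity=levenshtein_similarity):
    """Weighted sum of per-element similarities (weights are hypothetical)."""
    return sum(w * similarity(doc_a[field], doc_b[field]) for field, w in weights.items())


if __name__ == "__main__":
    weights = {"title": 0.5, "participants": 0.25, "locations": 0.25}  # illustrative only
    a = {"title": "Two killed in Phoenix shooting", "participants": "john doe", "locations": "phoenix"}
    b = {"title": "2 killed in Phoenix shooting", "participants": "john doe", "locations": "phoenix"}
    print(document_similarity(a, b, weights))
    print(document_similarity(a, b, weights, similarity=qgram_similarity))
```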

NER Approach
Due to the large volume of combinations and their corresponding results, we used a multi-agent system to simplify the analysis. Tables 7 and 8 present the five best results achieved when extracting Participants and Locations, respectively. We considered the results from only two algorithms: the Decision Tree and Extra Trees classifiers. Both show that context helps the recognition task.

Conclusion and Future Work
In this work, we presented an experimental study that addresses the Question Answering challenge in SemEval-2018 Task 5. We used Named Entity Recognition approaches to identify entities such as Locations, Participants, Temporal Expressions, and Event Types. We used a structured news representation to perform the required tasks: 1. to answer questions on counting events; 2. to detect which distinct documents provide an answer to a question; and 3. to answer questions by counting event participants. The use of a multi-agent system was crucial in order to find the best performing algorithm. Our utility function allowed us to define in advance the influence of each evaluation metric on the overall evaluation. The resulting system can be applied to other scenarios by adapting the utility function according to their requirements. In the future, our system can be improved to include multiple combinations (e.g., for near document detection, we can use a different combination of elements and algorithms).
Task 2 was addressed by means of information extraction. Future work for this task could include another study of a supervised learning approach that is based on the entire information available in a news article. However, such an approach requires a corresponding annotated corpus. Our approach to Task