Frame- and Entity-Based Knowledge for Common-Sense Argumentative Reasoning

Common-sense argumentative reasoning is a challenging task that requires holistic understanding of the argumentation where external knowledge about the world is hypothesized to play a key role. We explore the idea of using event knowledge about prototypical situations from FrameNet and fact knowledge about concrete entities from Wikidata to solve the task. We find that both resources can contribute to an improvement over the non-enriched approach and point out two persisting challenges: first, integration of many annotations of the same type, and second, fusion of complementary annotations. After our explorations, we question the key role of external world knowledge with respect to the argumentative reasoning task and rather point towards a logic-based analysis of the chain of reasoning.


Introduction
Recently, Habernal et al. (2018) introduced a challenging dataset for Argument Reasoning Comprehension (ARC) used in the SemEval-2018 shared task. After reviewing the participating systems, they hypothesize that external world knowledge may be essential for ARC. We explore enriching models with event and fact knowledge on ARC to investigate this hypothesis.
Semantic tasks profit from external knowledge: language understanding requires more complex knowledge than that contained in current systems and word embeddings. For the task of semantic plausibility, Wang et al. (2018) reveal the failure of models that rely only on distributional data, whilst the injection of world knowledge helps. Glockner et al. (2018) point out the deficiency of state-of-the-art approaches for understanding entailment on the large-scale SNLI corpus (Stanford Natural Language Inference) (Bowman et al., 2015). In their study, KIM (Knowledge-based Inference Model) (Chen et al., 2018), the model incorporating external lexical information from WordNet, does not yield the expected improvements. The crucial point might be that WordNet (Miller, 1995) does not contain explicit world knowledge in the form of event- and fact-based knowledge. Previous work argues that information in WordNet overlaps with word embeddings (Zhai et al., 2016); therefore, we focus on other types of knowledge in our work.
Complementary sources of external knowledge: we experiment using the lexical-semantic resource FrameNet (FN) and the knowledge base Wikidata (WD). These resources provide information beyond the lexical relations encoded in WordNet and thus have the potential to enhance the underlying model with other kinds of external world knowledge. On the one hand, FN provides qualitative event knowledge about prototypical situations. Thus, identifying frames (FrameId) unveils the situation or action that is happening. On the other hand, WD provides fact knowledge about concrete entities. So, linking entities to a knowledge base (EntLink) disambiguates the participants. Furthermore, both resources provide meta-knowledge about how their frames or entries relate to each other.
Multiple levels of knowledge processing help: combining several kinds of annotations benefits question answering (Khashabi et al., 2018), external knowledge about synonyms enhances inference (Chen et al., 2018), and jointly modeling several tasks (e.g., frame-semantic parsing and dependency parsing) is fruitful (Peng et al., 2018). In particular, the idea of connecting event semantics and fact knowledge was confirmed by Guo et al. (2016): they jointly formalize semantic role labeling and relation classification and thereby improve upon PropBank semantic role labeling.
Outline In this paper, we investigate whether external information in terms of event-based frames (FN) and fact-based entities (WD) can contribute to holistic understanding of the argumentation in the ARC task. First, we examine the effect of both annotations separately and second, we explore whether a joint annotation benefits from the inherent complementarity of the schemata in FN and WD and eventually leads to a better annotation coverage. We enhance the baseline model provided with the ARC task in order to contrast three system configurations: '+FN', '+WD' and '+FN/WD'.
Contributions We (1) present a proof of concept for semantic enrichment for the ARC task, (2) identify the importance of advanced combinations of complementary semantic annotations and (3) question the key role of external world knowledge with respect to ARC.
Code The code for the experiments with the enriched model is available online: https://github.com/UKPLab/emnlp2018argmin-commonsense-knowledge

Our Approach: Semantic Enrichment for Argument Reasoning Comprehension (ARC)
First, we explain the ARC task together with the baseline that we will build upon (cf. Sec. 2.1). Second, we review our two external knowledge sources, FN and WD, and comment on their complementarity (cf. Sec. 2.2, 2.3, 2.4). Finally, we present our approach with preprocessing and the actual model enrichment (cf. Sec. 2.5, 2.6).

ARC Task
The ARC task (Habernal et al., 2018) is formulated as follows: given a debate title (a), claim (b) and reason (c), a system chooses the correct warrant (i) over the other (ii), see Figure 1. The warrants vary only slightly, e.g., by a single negation. The argumentation chain is sophisticated and uses logical reasoning and language understanding. In order to automatically make the correct decision, a holistic understanding of the context of both claim and reason is crucial, for which Habernal et al. (2018) recommend the inclusion of external knowledge.
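To make the task format concrete, the following minimal sketch represents one ARC instance and the accuracy measure used for evaluation. The class name, the field names and the toy item are our own illustrative assumptions; they are not the column names or contents of the released dataset.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class ArcInstance:
    """One ARC item. Field names are illustrative, not the dataset's own."""
    title: str       # (a) debate title
    claim: str       # (b) claim
    reason: str      # (c) reason
    warrant_i: str   # candidate warrant (i)
    warrant_ii: str  # candidate warrant (ii)
    gold: int        # 0 if (i) is the correct warrant, 1 if (ii)


def accuracy(instances: Sequence[ArcInstance],
             choose: Callable[[ArcInstance], int]) -> float:
    """Fraction of instances for which `choose` returns the gold warrant index."""
    if not instances:
        raise ValueError("need at least one instance")
    return sum(choose(x) == x.gold for x in instances) / len(instances)


# Invented toy item (not taken from the ARC data): the warrants differ by one negation.
TOY = ArcInstance(
    title="Should cities invest in bike lanes?",
    claim="Cities should invest in bike lanes.",
    reason="Bike lanes reduce accidents.",
    warrant_i="Reducing accidents is worth the cost.",
    warrant_ii="Reducing accidents is not worth the cost.",
    gold=0,
)
```

Any system for the task can then be viewed as a `choose` function from an instance to a warrant index, which is what the accuracy figures reported later measure.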
Baseline The baseline provided by Habernal et al. (2018) is an intra-warrant attention model that reads in Word2Vec vectors (Mikolov et al., 2013) of all words in (a-c) and adapts attention weights for the decision between (i) and (ii).
Shared task winner The shared task winner, GIST (Choi and Lee, 2018), transfers inference knowledge (SNLI, Bowman et al., 2015) to the task of ARC and benefits from similar information in both datasets.
Our approach in contrast to GIST Our approach extends the baseline model with two external knowledge schemata, FN and WD, to explore their effects. The intuition can be explained with the instance in Figure 1: FN could be helpful by disambiguating 'companies' and 'corporations' to the same frame, with meta-knowledge about how it relates to other frames, and WD could be of additional help by mapping them to entities with detailed information and examples for such institutions. We focus on utilizing the two knowledge schemata of FN and WD, and thus our interest is orthogonal to GIST. The advantage of our approach is its independence of domain and task, which becomes especially relevant in scenarios lacking large-scale support data.

FrameNet's Event Knowledge
The Berkeley FrameNet (Baker et al., 1998; Ruppenhofer et al., 2016) is an ongoing project for manually building a large lexical-semantic resource with expert annotations. It embodies the theory of frame semantics (Fillmore, 1976): frames capture units of meaning corresponding to prototypical situations. It consists of two parts, a lexicon that maps predicates to frames they can evoke, and fully annotated texts. For example, the verb buy can evoke either the frame Commerce_buy or Fall_for, depending on the context (buying goods versus buying a lie). Furthermore, the lexicon gives access to frame-specific role labels (e.g., Buyer, Goods or Deception, Victim) as applied in semantic role labeling. Finally, FN specifies high-level relations (e.g., inherit, precede) between frames, forming a hierarchy with a collection of (frame, relation, frame)-triples. We use FN 1.5 which contains ∼1K frames and ∼12K distinct predicate-frame combinations.

Wikidata's Fact Knowledge
Wikidata is a collaboratively constructed knowledge base that encodes world knowledge in the form of binary relations. It contains more than 40 million entities and 350 million relation instances. The binary relations express both semantic and ontological connections between the entities (e.g., CAPITAL (Hawaii, Honolulu); INSTANCE OF (Hawaii, location)). Wikidata includes an ontology of fine-grained classes and is interlinked with other semantic web resources. A broad community curation, similar to Wikipedia, ensures a higher data quality compared to other knowledge bases (Färber et al., 2015). Formally, Wikidata can be described as a graph W = (E, R, I), where E is a set of entities, R is a set of binary relation types and I is a collection of relation instances encoded as r(e_1, e_2), with r ∈ R and e_1, e_2 ∈ E.
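The formal description W = (E, R, I) can be sketched as a small data structure. This is a toy illustration built from the two example relations given above; the type and function names are hypothetical, and real Wikidata uses identifiers (Q- and P-numbers) rather than plain labels.

```python
from typing import List, NamedTuple, Set, Tuple


class RelationInstance(NamedTuple):
    """One relation instance r(e_1, e_2)."""
    relation: str  # r, an element of R
    subject: str   # e_1, an element of E
    obj: str       # e_2, an element of E


# Toy graph built from the two examples in the text.
E: Set[str] = {"Hawaii", "Honolulu", "location"}
R: Set[str] = {"CAPITAL", "INSTANCE OF"}
I: List[RelationInstance] = [
    RelationInstance("CAPITAL", "Hawaii", "Honolulu"),
    RelationInstance("INSTANCE OF", "Hawaii", "location"),
]


def is_well_formed(entities: Set[str], relation_types: Set[str],
                   instances: List[RelationInstance]) -> bool:
    """Check that every instance uses a known relation type and known entities."""
    return all(
        i.relation in relation_types and i.subject in entities and i.obj in entities
        for i in instances
    )


def outgoing(instances: List[RelationInstance], entity: str) -> List[Tuple[str, str]]:
    """All (relation, object) pairs that have `entity` as their subject."""
    return [(i.relation, i.obj) for i in instances if i.subject == entity]
```

For the toy graph, `outgoing(I, "Hawaii")` lists both facts about Hawaii, which is the kind of entity-level information an entity linker makes available to the model.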

Complementarity of Annotations
Work on event semantics hints at two annotation types complementing each other: additional information about participants benefits event prediction (Ahrendt and Demberg, 2016; Botschen et al., 2018) and context information about events benefits the prediction of implicit arguments and entities (Cheng and Erk, 2018). The complementarity is further affirmed by efforts on aligning WD and the FN lexicon: the best alignment approach only maps 37% of the total WD properties to frames (Mousselly-Sergieh and Gurevych, 2016). The complementarity of FN and WD annotations is the reason for also testing a model with the joint annotation '+FN/WD'.

Preprocessing -Obtaining Annotations
We use two freely available systems to obtain semantic annotations for the claim (b), the reason (c) and the alternative warrants (i, ii): the frame identifier by Botschen et al. (2018) for frame annotations and the entity linker by Sorokin and Gurevych (2018) for entity annotations. We employ pre-trained vector representations to encode information from FN and WD. We use the pre-trained frame embeddings (50-dim.) that are learned with TransE (Bordes et al., 2013) on the structure of the FN hierarchy with the collection of (frame, relation, frame)-triples (Botschen et al., 2017). We also use TransE to pre-train entity embeddings (100-dim.) on the WD graph. The annotation of the ARC data leads to more frames per sentence (6.6 on avg.) than entities per sentence (0.7 on avg.).

Figure 2: Different embeddings from layers of annotations for a sentence: words, frames, entities.
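As a rough illustration of the TransE objective used for both the frame and the entity embeddings, the sketch below scores a (head, relation, tail) triple by the distance between head + relation and tail, and applies a margin ranking loss against a corrupted triple. The choice of the L2 norm, the margin value and the function names are assumptions made for this sketch; the settings used for the pre-trained embeddings are not stated here.

```python
import math
from typing import Sequence


def transe_distance(head: Sequence[float], relation: Sequence[float],
                    tail: Sequence[float]) -> float:
    """L2 distance ||head + relation - tail||; lower means a more plausible triple."""
    if not (len(head) == len(relation) == len(tail)):
        raise ValueError("all three vectors must have the same dimensionality")
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def margin_ranking_loss(positive_distance: float, negative_distance: float,
                        margin: float = 1.0) -> float:
    """Hinge loss that pushes true triples at least `margin` closer than corrupted ones."""
    return max(0.0, margin + positive_distance - negative_distance)
```

Under this objective, a frame or entity vector is shaped by every triple it participates in, so the 50- and 100-dimensional vectors encode the relational structure of FN and WD rather than distributional context.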

Model -Enriching with Semantics
We extend the baseline model by Habernal et al. (2018) with embeddings for frames and entities (cf. Sec. 2.5 for frame embeddings and entity embeddings). The baseline model is an intra-warrant attention model that only uses conventional pretrained word embeddings as input. We apply a late fusion strategy: we obtain the annotations separately and combine them afterwards by appending the frame and entity embeddings to the word vectors on the token level. Each input sentence is processed by a bi-directional long short-term memory (LSTM) network that reads not only word embeddings, but also frame embeddings for all event mentions and entity embeddings for all entity mentions (Figure 2). The attention weights for the decision between the two warrants are then adapted based on the semantically enriched representation of claim (b) and reason (c). We optimize hyperparameters on the development set with random search. All models are trained using the Adam optimizer (Kingma and Ba, 2014) with a batch size of 16. For our evaluation, we perform ten runs and report the mean and max accuracy together with the standard deviation.
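The token-level late fusion described above can be sketched as plain vector concatenation. The frame and entity dimensionalities follow the text; the word dimensionality of 300 and the use of zero vectors for tokens without a frame or entity annotation are assumptions of this sketch, not details confirmed by the description.

```python
from typing import List, Optional, Sequence, Tuple

WORD_DIM = 300    # assumption: the Word2Vec dimensionality is not stated in the text
FRAME_DIM = 50    # frame embedding size stated in the text
ENTITY_DIM = 100  # entity embedding size stated in the text

Vector = Sequence[float]


def fuse_token(word_vec: Vector,
               frame_vec: Optional[Vector] = None,
               entity_vec: Optional[Vector] = None) -> List[float]:
    """Append frame and entity embeddings to the word vector of one token.

    Tokens without an annotation receive a zero vector (assumed padding scheme).
    """
    word = list(word_vec)
    frame = [0.0] * FRAME_DIM if frame_vec is None else list(frame_vec)
    entity = [0.0] * ENTITY_DIM if entity_vec is None else list(entity_vec)
    if (len(word), len(frame), len(entity)) != (WORD_DIM, FRAME_DIM, ENTITY_DIM):
        raise ValueError("unexpected embedding dimensionality")
    return word + frame + entity


def fuse_sentence(
    tokens: Sequence[Tuple[Vector, Optional[Vector], Optional[Vector]]]
) -> List[List[float]]:
    """Fuse every (word, frame-or-None, entity-or-None) triple of a sentence."""
    return [fuse_token(word, frame, entity) for word, frame, entity in tokens]
```

The resulting sequence of fused vectors is what a bi-directional LSTM would read in place of the plain word embeddings.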

Results
In Table 1 we report our results on the ARC task. Our extended approaches '+FN' and '+WD' for semantic enrichment with information about frames and entities increase the averaged performance by more than one percentage point over the baseline. For the best run, the advantage of '+FN' and '+WD' becomes even clearer (+2.2 pp.). On the other hand, the straightforward combination of the two external knowledge sources, '+FN/WD', does not lead to further improvements. This points to the need for advanced models that are able to fuse annotations of different types. Although positive, the results do not seem to be strong support for the hypothesis of Habernal et al. (2018) about external knowledge being beneficial for the defined task, as we observe only moderate improvements. Overall, the enriched models ('+WD', '+FN' and '+FN/WD') make mostly the same predictions as the baseline system. For instance, for '+WD' there is a 79.5% overlap of the predictions with the baseline, and for '+FN' it is 76.6%. In the following section, we try to identify the reasons why the structured knowledge in the form of FN and WD does not further improve the results.
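The two quantities discussed here, the summary over repeated runs and the prediction overlap between two systems, can be computed as in the following sketch. The function names are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which variant is reported.

```python
import statistics
from typing import Dict, Sequence


def summarize_runs(accuracies: Sequence[float]) -> Dict[str, float]:
    """Mean, max and (sample) standard deviation over the accuracies of several runs."""
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return {
        "mean": statistics.mean(accuracies),
        "max": max(accuracies),
        "std": statistics.stdev(accuracies),  # assumption: sample, not population, std
    }


def prediction_overlap(preds_a: Sequence[int], preds_b: Sequence[int]) -> float:
    """Fraction of test instances on which two systems choose the same warrant."""
    if len(preds_a) != len(preds_b) or not preds_a:
        raise ValueError("prediction lists must be non-empty and of equal length")
    return sum(a == b for a, b in zip(preds_a, preds_b)) / len(preds_a)
```

A high overlap together with a small accuracy gain indicates that the enrichment changes only a minority of decisions, which is the pattern described above.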

Error analysis
The amount of semantic information that the model can utilize depends on the number of annotations for a test instance. We analyze the performance of the enriched models by the number of annotations for '+WD' and for '+FN'. Figure 3 shows the performance of '+WD' in comparison to the baseline against the number of WD entities per test instance. As expected, there is no difference in performance for the instances without WD annotations. We can see a clear improvement for the instances with one or two entities, which indicates that some semantic knowledge helps in deciding between the two warrants. In contrast, '+WD' performs on par with the baseline for three or more annotations. The performance of the '+FN' model against the number of frame annotations is plotted in Figure 4. Whilst the difference between '+FN' and the baseline varies more, we can observe a similar tendency: once some semantic annotations are available, the enriched model outperforms the baseline, whereas with a growing number of frames the difference in performance decreases.
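The breakdown by number of annotations can be reproduced with a simple grouping step, sketched below. The record format and the cap that merges all instances with three or more annotations into one bucket are assumptions chosen to mirror the description, not the actual analysis script.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_annotation_count(records: Iterable[Tuple[int, bool]],
                                 cap: int = 3) -> Dict[int, float]:
    """Accuracy per number of annotations.

    `records` holds one (n_annotations, prediction_is_correct) pair per test
    instance; counts of `cap` or more are merged into a single bucket.
    """
    hits: Dict[int, int] = defaultdict(int)
    totals: Dict[int, int] = defaultdict(int)
    for n_annotations, is_correct in records:
        bucket = min(n_annotations, cap)
        totals[bucket] += 1
        hits[bucket] += int(is_correct)
    return {bucket: hits[bucket] / totals[bucket] for bucket in sorted(totals)}
```

Running this once for the baseline and once for an enriched model, with the same annotation counts, yields the two curves that are compared per bucket.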
Both annotation tools, the WD entity linker as well as the FN frame identifier, introduce some noise: for the entity linker, Sorokin and Gurevych (2018) report 0.73 F-score and the frame identifier has an accuracy of 0.89 (Botschen et al., 2018). We perform a manual error analysis on 50 instances of the test set to understand the effect of the noisy WD annotation. In 44% of errors, no WD annotation was available and in 52%, the annotations were (partially) incorrect. Only 4% of errors occur despite correct entities being available to the model. Notably, in 65% of the cases a correct entity annotation leads to a correct prediction. Taken together, for instances with little context and therefore only a few frame or entity annotations, the semantic enrichment helps to capture a broader context of the argumentation chain, which in turn benefits the classification. However, for instances with more context and therefore more frame or entity annotations, the benefit is diminished by the lower precision of the annotation tools. Interestingly, the effect of improved performance only for shorter sequences with fewer annotations is in line with findings of research on information retrieval (Müller et al., 2008), where the trade-off between some annotations that increase the accuracy and more annotations that can hurt the performance is known as precision-recall balance (Riedel et al., 2010; Manning et al., 2008).

Qualitative analysis
When manually inspecting the annotations of frames and entities, it becomes questionable whether these actually have the potential to contribute to a clear distinction between the two warrants. Figure 5 shows two instances of the ARC corpus with FN and WD annotations. Both annotation layers contribute useful information about the world, which is not contained in the textual input. For instance, 'companies' and 'corporations' are correctly disambiguated and linked to the same frame, and the phrase 'use of force' is mapped to the entity Q971119 for a legal concept. Nevertheless, closer inspection shows that the provided background knowledge is not sufficient to draw the distinction between the two warrants. In the first example in Figure 5, the key difference between the two warrants is the negation (and similarly in the second example). Even though our approach performs a semantic enrichment of the context, the crucial capability of performing reasoning is still missing. This means that our input representation is semantically enriched, but is not parsed into a logic-based representation.
To sum up, in the previous Section 3.1, we show that our approach helps by successfully enriching the context with semantics for shorter instances; and in this Section 3.2, we analyze why our approach is too limited to solve some key challenges of the ARC task. We conclude that the key challenge of ARC is not lexical-semantic gaps between warrants but rather different phenomena such as negation, and that this challenge is to be resolved with logical analysis rather than with the integration of world knowledge.

Conclusion
We integrate world knowledge from FrameNet and Wikidata into the task of common-sense argumentative reasoning and achieve an improvement in performance compared to the baseline approach. Based on the experiments and the manual analysis, we conclude that external world knowledge might not be enough to gain significant improvements in argumentative reasoning, and we rather point towards logical analysis.
Starting from the hypothesis of the organizers of the shared task about world knowledge being essential for the Argument Reasoning Comprehension task, we show the potential of semantic enrichment of the context for shorter instances. Our results offer a first perspective on using external resources for the Argument Reasoning Comprehension task. More broadly, our approaches '+FN' (events) and '+WD' (facts) showcase the contribution of semantic enrichment to high-level tasks requiring common-sense knowledge.
FrameNet and Wikidata are open-domain resources and our enrichment approach is task-independent. Consequently, we encourage utilizing event- and fact-knowledge in further language understanding tasks, e.g., Story Cloze (Mostafazadeh et al., 2016) or Semantic Textual Similarity (Agirre et al., 2012), using our freely available model.