Dscorer: A Fast Evaluation Metric for Discourse Representation Structure Parsing

Discourse representation structures (DRSs) are scoped semantic representations for texts of arbitrary length. Evaluating the accuracy of predicted DRSs plays a key role in developing semantic parsers and improving their performance. DRSs are typically visualized as boxes, which are not straightforward to process automatically. Counter transforms DRSs to clauses and measures clause overlap by searching for variable mappings between two DRSs. However, this metric is computationally costly (with respect to memory and CPU time) and does not scale to longer texts. We introduce Dscorer, an efficient new metric which converts box-style DRSs to graphs and then measures the overlap of n-grams. Experiments show that Dscorer computes accuracy scores that correlate with Counter's at a fraction of the time.


Introduction
Discourse Representation Theory (DRT) is a popular theory of meaning representation (Kamp, 1981; Kamp and Reyle, 2013; Asher, 1993; Asher et al., 2003) designed to account for a variety of linguistic phenomena within and across sentences. The basic meaning-carrying units in DRT are Discourse Representation Structures (DRSs). They consist of discourse referents (e.g., x1, x2) representing entities in the discourse and conditions (e.g., male.n.02(x1), Agent(e1, x1)) representing information about discourse referents. Every variable and condition is bound by a box label (e.g., b1), which indicates the box in which the variable or condition is interpreted. DRSs are constructed recursively. An example of a DRS in box-style notation is shown in Figure 1(a).
DRS parsing differs from related parsing tasks (e.g., Banarescu et al. 2013) in that it can create representations that go beyond individual sentences. Despite the large number of recently developed DRS parsing models (van Noord et al., 2018b; van Noord, 2019; Evang, 2019; Liu et al., 2019b; Fancellu et al., 2019; Le et al., 2019), the automatic evaluation of DRSs is not straightforward due to the non-standard DRS format shown in Figure 1(a). It is neither a tree (although a DRS-to-tree conversion exists; see Liu et al. 2018, 2019a for details) nor a graph. Evaluation so far has relied on COUNTER (van Noord et al., 2018a), which converts DRSs to the clauses shown in Figure 1(b). Given two DRSs with n and m (n ≥ m) variables each, COUNTER has to consider n!/(n−m)! possible variable mappings in order to find an optimal one for evaluation. The problem of finding this alignment is NP-complete, as for other metrics such as SMATCH (Cai and Knight, 2013a) for Abstract Meaning Representation. COUNTER uses a greedy hill-climbing algorithm to obtain one-to-one variable mappings, and then computes precision, recall, and F1 scores according to the overlap of clauses between the two DRSs. To mitigate search errors, the hill-climbing implementation applies several random restarts. This incurs unacceptable runtime, especially when evaluating document-level DRSs with a large number of variables.
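To see how quickly this search space grows, the mapping count n!/(n−m)! can be computed directly (an illustrative sketch, not part of COUNTER itself):

```python
from math import factorial

def num_mappings(n, m):
    """Number of one-to-one variable mappings COUNTER must consider
    for DRSs with n and m variables (n >= m): n! / (n - m)!."""
    return factorial(n) // factorial(n - m)

# Sentence-level DRSs keep the search space small...
print(num_mappings(5, 4))    # 120
# ...but document-level DRSs do not: over 10^18 candidate mappings.
print(num_mappings(20, 18))
```

Hill climbing with random restarts explores only a tiny fraction of this space, which is why COUNTER's runtime (and its susceptibility to search errors) grows so sharply with document length.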
Another problem with the current evaluation is that COUNTER only considers local clauses without taking larger window sizes into account. For example, it considers "b4 sing e2" and "b3 NOT b4" as separate semantic units. However, it would also make sense to assess "b3 NOT b4 sing e2" as a whole, without breaking it down into smaller parts. By considering higher-order chains, it is possible to observe more global differences in DRSs, which are important when assessing entire documents.
In order to address the above issues, we propose DSCORER, a highly efficient metric for the evaluation of DRS parsing on texts of arbitrary length. DSCORER converts DRSs (predicted and gold) to graphs from which it extracts n-grams, and then computes precision, recall, and F1 scores between them. The algorithm operates over n-grams in a fashion similar to BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004), metrics widely used for evaluating the output of machine translation and summarization systems. While BLEU only calculates precision with a brevity penalty (it is not straightforward to define recall given the wide range of possible translations for a given input), ROUGE is a recall-oriented metric, since summary length is typically constrained by a prespecified budget (see https://github.com/tensorflow/tensor2tensor for computing ROUGE F1). However, in DRS parsing there is a single correct semantic representation (the gold-standard reference) and no limit on the maximum size of DRSs. Our proposed metric, DSCORER, converts box-style DRSs to a graph format used for evaluation and computes F1 with high efficiency (7,000 times faster than COUNTER). We release our code, implementing the metric, at https://github.com/LeonCrashCode/DRSScorer.

DSCORER
The proposed metric converts two box-style DRSs into graphs, extracts n-grams from these graphs, and then computes precision, recall, and F1 score based on the n-gram overlap.

Graph Induction
Following the work of van Noord et al. (2018a), box-style DRSs can be converted to clauses as shown in Figure 1(b). For example, box b1 is in a contrast relationship to box b4 within box b0, which corresponds to the clause b0 CONTRAST b1 b4; variable b2 : x1 is converted to the clause b2 REF x1, and the condition b1 : t1 < "now" is converted to b1 TPR t1 "now". (The example text represented in Figure 1 is "He didn't play the piano. But she sang.")

We now explain how we convert DRSs to graphs. There are two types of clauses depending on the number of arguments: 2-argument clauses (e.g., b2 male.n.02 x1) and 3-argument ones (e.g., b1 Agent e1 x1). The two types of clauses can be formatted as node --edge--> node and node --edge--> node --edge--> node, respectively. For example, the clause "b2 male.n.02 x1" is rendered as b2 --male.n.02--> x1. Since the semantics of an edge is not inherently directional (e.g., b2 --male.n.02--> x1 assigns x1 the predicate "male.n.02", but x1 is also an agent), we make edges bidirectional (shown in red in Figure 1(c)) if they do not connect two b nodes.
Next, we rewrite the nodes, keeping their type (B for box labels, X for entities, E for events, S for states, P for propositions, and T for time) but not their indices; the resulting graph is shown in Figure 1(c). In addition to being typed, variables can be distinguished by their neighboring nodes and connecting edges. For example, the two E nodes in Figure 1(c) are different: one lies on a path showing that the Theme of the predicate play is a piano, and the other on a path showing that the Agent of the predicate sing is female. To compare two graphs, we compute the overlap between extracted paths instead of searching for the best node mapping, which saves computational resources (i.e., CPU memory and time).
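The conversion just described can be sketched as follows. This is an illustrative reimplementation, not the released code: the rendering of 3-argument clauses as node --edge--> node --edge--> node chains and the treatment of constants are our reading of the description above.

```python
# Type letters for DRS variables: b (box), x (entity), e (event),
# s (state), p (proposition), t (time).
TYPE = {"b": "B", "x": "X", "e": "E", "s": "S", "p": "P", "t": "T"}

def node_type(v):
    """Drop a variable's index, keeping its type letter; constants
    such as '"now"' are left unchanged."""
    return TYPE.get(v[0], v) if v[1:].isdigit() else v

def clauses_to_edges(clauses):
    """Convert clauses to labelled edges. A 2-argument clause
    (b2, male.n.02, x1) becomes b2 --male.n.02--> x1; a 3-argument
    clause (b1, Agent, e1, x1) becomes b1 --Agent--> e1 --Agent--> x1.
    Edges are made bidirectional unless they connect two b nodes."""
    edges = set()

    def add(src, label, dst):
        edges.add((src, label, dst))
        if not (src.startswith("b") and dst.startswith("b")):
            edges.add((dst, label, src))  # bidirectional edge

    for c in clauses:
        add(c[0], c[1], c[2])
        if len(c) == 4:                   # 3-argument clause
            add(c[2], c[1], c[3])
    return edges

edges = clauses_to_edges([("b2", "male.n.02", "x1"),
                          ("b1", "Agent", "e1", "x1")])
```

Note that a discourse relation such as b0 CONTRAST b1 b4 connects only b nodes, so its edges stay directed, as in the graph of Figure 1(c).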

Evaluation Based on n-grams
An n-gram in our case is an Euler path on a graph with n edges (an Euler path visits every edge exactly once, while nodes may be revisited). For example, a path with three edges is a 3-gram, and a single node is a 0-gram. We extract the n-grams for each node in a graph. Due to the high sparsity of the graphs typical for DRSs, the number of n-grams does not explode as the size of the graph |G| = |N| + |E| increases, where |N| and |E| are the number of nodes and edges in graph G, respectively. Given the n-grams of the predicted and gold DRS graphs, we compute precision p_k and recall r_k as:

p_k = |k-grams_pred ∩ k-grams_gold| / |k-grams_pred|    (1)
r_k = |k-grams_pred ∩ k-grams_gold| / |k-grams_gold|    (2)

where k-grams_pred and k-grams_gold are the k-grams of the predicted and gold DRS graphs, respectively, and f_k = 2 p_k r_k / (p_k + r_k), with p_0 = r_0 = f_0 = min(|N_pred|, |N_gold|) / max(|N_pred|, |N_gold|). DSCORER calculates precision, recall, and F1 as the weighted geometric mean

F = exp( sum_{k=0}^{n} w_k log F_k )

where w_k is a fixed weight for k-gram (0 ≤ k ≤ n) counts, and F ∈ {p, r, f}.
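A self-contained sketch of the scoring procedure follows. This is again an illustrative reimplementation, not the released code: the edge-disjoint path extraction and the BLEU-style geometric-mean combination with w_0 = 0.1 and w_k = (1 − w_0)/n are our reading of the equations above.

```python
from collections import Counter
from math import exp, log

TYPE = {"b": "B", "x": "X", "e": "E", "s": "S", "p": "P", "t": "T"}

def node_type(v):
    """Drop a variable's index, keeping its type letter."""
    return TYPE.get(v[0], v) if v[1:].isdigit() else v

def ngrams(edges, k):
    """All typed paths with exactly k edges that reuse no edge
    (length-k Euler paths); a 0-gram is a single typed node."""
    nodes = {v for e in edges for v in (e[0], e[2])}
    if k == 0:
        return Counter(node_type(v) for v in nodes)
    out = Counter()

    def walk(node, used, path, depth):
        if depth == k:
            out[tuple(path)] += 1
            return
        for e in edges:
            if e[0] == node and e not in used:
                walk(e[2], used | {e}, path + [e[1], node_type(e[2])], depth + 1)

    for v in nodes:
        walk(v, frozenset(), [node_type(v)], 0)
    return out

def dscorer_f1(pred, gold, n=4, w0=0.1):
    """Weighted geometric mean of f_k over k = 0..n."""
    fs = []
    for k in range(n + 1):
        if k == 0:
            a, b = len(ngrams(pred, 0)), len(ngrams(gold, 0))
            fs.append(min(a, b) / max(a, b))
            continue
        pc, gc = ngrams(pred, k), ngrams(gold, k)
        hits = sum((pc & gc).values())     # clipped k-gram overlap
        p = hits / max(sum(pc.values()), 1)
        r = hits / max(sum(gc.values()), 1)
        fs.append(2 * p * r / (p + r) if p + r else 0.0)
    ws = [w0] + [(1 - w0) / n] * n
    return exp(sum(w * log(max(f, 1e-12)) for w, f in zip(ws, fs)))
```

Because paths are compared by their typed form, no variable alignment is needed: identical graphs score 1.0 (provided paths of every order up to n exist), and the cost is linear in the number of extracted n-grams rather than factorial in the number of variables.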

Experiments
In our experiments, we investigate the correlation between DSCORER and COUNTER, and the efficiency of the two metrics. We present results on two datasets, namely the Groningen Meaning Bank (GMB) and the Parallel Meaning Bank (PMB; Abzianidze et al. 2017). We compare two published systems on the GMB: DRTS-sent, a sentence-level parser (Liu et al., 2018), and DRTS-doc, a document-level parser (Liu et al., 2019a). On the PMB, we compare seven systems: Boxer, a CCG-based parser (Bos, 2015); AMR2DRS, a rule-based parser that converts AMRs to DRSs; SIM-SPAR, which outputs the DRS in the training set most similar to the current DRS; SPAR, which outputs a fixed DRS for each sentence; seq2seq-char, a character-based sequence-to-sequence clause parser (van Noord et al., 2018b); seq2seq-word, a word-based sequence-to-sequence clause parser; and a transformer-based clause parser (Liu et al., 2019b).

Metric Settings
COUNTER uses 100 hill-climbing restarts to search for the best variable mappings on the PMB and 10 restarts on the GMB. Both DSCORER and COUNTER are computed on one CPU (2.10GHz). In DSCORER, the weight w_0 is set to 0.1 and the weights w_k (1 ≤ k ≤ n) are set to 0.9/n, where n = 4.

Analysis
We analyze the number of n-grams extracted by DSCORER; we also report the values obtained by DSCORER and COUNTER on the two datasets, their correlation, and their efficiency. Figure 2(a) shows the number of n-grams across graphs in the GMB, where the largest number of 4-grams extracted from one graph is 1.47 × 10^6. Figure 2(b) shows the number of n-grams across graphs in the PMB, where the largest number of 4-grams extracted from one graph is 2.27 × 10^3. The number of n-grams increases exponentially with n or with the size of the graph. Nevertheless, the number of 4-grams remains manageable. We set the maximum n-gram order to 4 when computing our metric (see Equations (1) and (2)), as 4-grams are detailed enough to capture differences between meaning representations whilst avoiding overly strict matching (which would render the similarity between predicted and gold DRSs unnecessarily low and not very useful).

Table 1 shows the scores assigned by DSCORER and COUNTER to the different systems. We observe similar trends for both metrics; DSCORER penalizes SPAR and SIM-SPAR more harshly, as these baselines output DRSs without any parsing algorithm. Generally speaking, the two metrics are highly correlated; across systems and datasets, Pearson's correlation coefficient r is 0.93 on 1-grams, 0.94 on 2-grams, 0.91 on 3-grams, and 0.88 on 4-grams, with 2-grams being most correlated. This is not surprising: 2-grams in DSCORER are most similar to COUNTER, which only considers predicates with at most two arguments. Figure 3 shows the 4-gram correlation between COUNTER and DSCORER. We find that most points lie around the curve y = x^3, which means that considering higher-order grams renders the two metrics less similar, but nevertheless allows us to more faithfully capture similarities or discrepancies between DRSs.

Efficiency

Table 2 shows the average runtime of COUNTER and DSCORER on a pair of DRSs. Both metrics have similar runtimes on the PMB, which mostly consists of small graphs. However, on the GMB, which consists of larger graphs with many nodes, the runtime of COUNTER explodes (more than 4 hours per graph), while DSCORER evaluates DRSs within an acceptable time frame (2.35 seconds per graph). On GMB-doc, DSCORER runs seven thousand times faster than COUNTER, showing that it is very efficient at comparing large graphs.

Case Study
We further conducted a case study in order to analyze what the two metrics measure. Figure 4 shows two different sentences in the clause-style DRS format used by COUNTER and the graph-style DRS format used by DSCORER. Note that the two sentences have entirely different meanings (distinguished by various meaning constructs in the corresponding DRSs). Using COUNTER to compare the two sentences yields an F1 of 47.06, which drops to 16.11 when employing DSCORER on 4-grams. Note that DSCORER on 1-grams obtains an F1 of 46.42, which is close to COUNTER.
COUNTER takes matching clauses into account (marked in red in Figure 4; the two sentences compared are "Tom is putting the children to bed." and "He smiled."), which might inflate the similarity between the two sentences without actually measuring their core meaning. For example, the common relation "b3 Time e1 t1" is matched to "b2 Time e1 t1" without considering what e1 and t1 are. Instead, DSCORER aims to find matches for entire paths, which take such context into account.

Related Work
The metric SEMBLEU (Song and Gildea, 2019) is most closely related to ours. It evaluates AMR graphs by calculating precision based on n-gram overlap. SEMBLEU yields scores more consistent with human evaluation than SMATCH (Cai and Knight, 2013b), an AMR metric which forms the basis of COUNTER. SEMBLEU cannot be directly used on DRS graphs due to the large number of indexed variables and the fact that the graphs are not explicitly given; moreover, our metric outputs F1 scores instead of precision only. Opitz et al. (2020) propose a set of principles for AMR-related metrics, showing the advantages and drawbacks of alignment- and BLEU-based AMR metrics. However, the efficiency of the metric is crucial for the development of document-level models of semantic parsing. Basile and Bos (2013) propose to represent DRSs via Discourse Representation Graphs (DRGs), which are acyclic and directed. However, DRGs are similar to flattened trees and cannot capture the clause-level information (e.g., b1 Agent e1 x1) required for evaluation (van Noord et al., 2018a).

Conclusions
In this work we proposed DSCORER, a DRS evaluation metric that provides an alternative to COUNTER. Our metric is significantly more efficient than COUNTER and considers high-order n-grams over DRS graphs. DSCORER allows us to speed up model selection and development by removing the bottleneck of evaluation time.