SOCCER: An Information-Sparse Discourse State Tracking Collection in the Sports Commentary Domain

In the pursuit of natural language understanding, there has been a long-standing interest in tracking state changes throughout narratives. Impressive progress has been made in modeling the state of transaction-centric dialogues and procedural texts. However, this problem has been less intensively studied in the realm of general discourse, where ground truth descriptions of states may be loosely defined and state changes are less densely distributed over utterances. This paper proposes to turn to simplified, fully observable systems that show some of these properties: sports events. We curated 2,263 soccer matches including time-stamped natural language commentary accompanied by discrete events such as a team scoring goals, switching players or being penalized with cards. We propose a new task formulation where, given paragraphs of commentary of a game at different timestamps, the system is asked to recognize the occurrence of in-game events. This domain allows for rich descriptions of state while avoiding the complexities of many other real-world settings. As an initial point of performance measurement, we include two baseline methods, one from the perspective of sentence classification with temporal dependence and one from that of a current state-of-the-art generative model, and demonstrate that even sophisticated existing methods struggle on the state tracking task when the definition of state broadens or non-event chatter becomes prevalent.


Introduction
State tracking, the task of maintaining explicit representations of user requests and agent responses, has long been a key component of dialogue systems (Williams et al., 2013; Henderson et al., 2014a,b; Kim et al., 2016). The same challenge arises during reading comprehension of procedural texts (recipes, how-to guides, etc.), where systems focus on predicting changes of object attributes at the entity level (a car window may transition from foggy to clear) (Dalvi et al., 2018; Tandon et al., 2020). However, both of these state tracking variants rely on transaction-based or turn-based data such as transactional dialogues or procedure descriptions that are information-dense. Few works have studied state tracking tasks where state changes occur infrequently while a large proportion of messages are "chatter".
As an alternative to altogether unrestricted state tracking, a task that is daunting due to the complexity of even describing ground-truth states in a discrete manner, we resort to a simpler and more self-contained setting: sports competitions. Given the stream of natural language utterances with which a commentator describes the events in a real-world setting (here a sports competition), an ideal natural language understanding system would maintain and reason over a coherent and accurate representation of the match based on how the commentator described it. This representation can, in turn, be used for downstream tasks such as inference or language generation. Sports matches provide an ideal test bed for state tracking due to their self-contained, fully observable nature and their inherent interpretability in the form of the temporal evolution of scores. However, existing sports-related commentary collections such as those described by Aull and Brown (2013) and Merullo et al. (2019) do not provide such within-match temporal information.
To this end, we collect temporally-aligned commentaries and live scores of soccer matches along with other meta information from the website goal.com and compile the dataset SOCCER. To the best of our knowledge, SOCCER is the first temporally-aligned collection of sports match commentary and state. It contains over 2,200 matches from tournaments such as the UEFA Champions League or the UK Premier League between 2016 and 2020. Across these matches, there are over 135,000 individual comments and approximately 31,000 events. A simplified example is shown in Figure 1.
To demonstrate the potential of state tracking for open-domain discourse, we use the proposed dataset to investigate to what degree state-of-the-art systems are able to track the progression of events described in the commentary. This overview includes two model classes: classification models that treat match events as different class labels, and generative language models such as GPT-2 (Radford et al., 2019) that model context and events in a causal manner. Our experiments show that neither method performs well on SOCCER and that both only slightly outperform distributional heuristics, leaving considerable room for improvement.
The novel contributions of this paper are threefold: (1) we propose a new task of tracking event occurrences via state changes, (2) we create SOCCER, a general discourse state tracking dataset that contains temporally-aligned human-composed commentary and in-game events, serving as the training and evaluation dataset for this task, and (3) we provide two intuitive baselines demonstrating the difficulty of this task and presenting exciting opportunities for future research.

Related Work
Dialogue State Tracking (DST). Current DST collections and benchmarks tend to rely on transaction-centric dialogues with predefined domain-specific ontologies and slot-value pairs. Prominent examples include the DSTC2 (Henderson et al., 2014a) and MultiWOZ datasets. Consequently, previous work focuses on picklist-based approaches (Mrkšić et al., 2017; Perez and Liu, 2017; Zhong et al., 2018) that formulate state tracking as a series of classification tasks over candidate-value lists. A major difference between SOCCER and other DST datasets lies in its information density. As dialogues in DST are usually short conversations with direct transactional objectives such as booking hotels or reserving restaurant tables, frequent state changes must be captured within a limited number of conversation turns. In sports commentary, by contrast, in-game events occur at a comparatively low frequency and a considerable proportion of commentator utterances may not be related to any changes in the game state.
State Tracking in Procedural Text. State tracking in procedural text understanding focuses on the task of tracking changes in entity attributes (Tandon et al., 2020). A variety of such tasks have been proposed, such as tracking entity presence and location in scientific processes (Dalvi et al., 2018), ingredients in cooking recipes (Bosselut et al., 2017), and character motivation and emotional reaction in simple stories (Rashkin et al., 2018). Yet, similar to DST settings, these highly specific tasks depend on small fixed ontologies covering limited ranges of entities and states. A more recent dataset (Tandon et al., 2020) turns to an open-vocabulary setting when defining entity attributes. But since the dataset is composed of how-to guides from WikiHow.com, the task still sees a high density of state changes per natural language instruction.
Information Density. The concept of information density has mainly been used in the Uniform Information Density (UID) theory (Jaeger, 2010) to measure the amount of information per unit comprising an utterance. Levy and Jaeger (2007) demonstrated that speakers tend to maximize the uniformity of information via syntactic reduction. The notion of information density in our paper, however, focuses on quantifying the frequency of event occurrences at the corpus level instead of understanding syntactic choices at the utterance level.
Sports Event Datasets and Tasks. Commentary in the sports domain has been collected to study a variety of problems such as racial bias in football game reporting (Merullo et al., 2019) and gender construction in NBA/WNBA coverage (Aull and Brown, 2013). However, these datasets do not provide any information on the temporal alignment between commentary and events. Another similar dataset, BALLGAME (Keshet et al., 2011), consists of baseball commentary with annotated events and timestamps, but it contains fewer than 20 games and the annotation is unavailable online. Some work focuses on sports-related inference of player performance metrics (Oved et al., 2019) or game outcomes (Velichkov et al., 2019), predicting full-time results based on signals from pre-game player interviews. However, no in-game sequential contexts are provided in these datasets. Most similar to our work, Bhagat (2018) collected in-game commentaries for soccer player analytics, but their approach is limited to classical machine learning methods and ignores the effect of information sparsity within the dataset.

Dataset Construction
We collect time-stamped commentary with key events of 2,263 soccer matches in total. The matches stem from four major soccer tournaments, namely the UEFA Champions League, UEFA Europa League, Premier League and Serie A, between 2016 and 2020. SOCCER consists of over 135,000 time-stamped pieces of commentary and 31,000 within-match events. This section describes our data collection and preparation process in detail.

Data Processing
Commentaries, events, team lineups, match dates and other meta-information are gathered from match-specific pages. Out of a total of 9,028 matches covered on goal.com between 2014 and 2020, we retain only those 2,434 matches that list detailed event records and commentary. Any matches missing either of the two information streams are discarded. The retained matches belong to the four major tournaments mentioned above and all took place from 2016 onward. Figure 2 shows the frequency distribution of included and overall matches across the years in which they took place. All commentaries are in English and available in text form, thus requiring no transcription. Pieces of commentary come pre-segmented and aligned to match-internal timestamps so that in-game events and commentary with the same timestamps can be linked. Comments whose temporal information is unavailable usually belong to the pre-game, intermission and post-game periods and are labeled as START, BREAK and END accordingly. The total number of commentary paragraphs within a game is the same as the number of timestamps. This number varies between matches, as timestamps during which the commentator did not provide commentary are omitted. Finally, any templated sentences following the format "team 1 score -score team 2" are removed to avoid trivial leakage of the match state. All annotation and filtering processes are performed programmatically and involve no manual effort.
Events are classified into five types: goal, assist, yellow card, red card and switch. We consider events as keys and the event-related players as the corresponding values. For example, if player B from the home team assists in scoring a goal, player B will be the value of the event assist for the home team. Hence, at each timestamp t, there are ten event-player pairs (five event types tracked for two teams). From this representation, we construct a comprehensive game state incorporating all the event-player pairs for each team as well as a cumulative score at each timestamp (See Figure 3). Special events such as penalty goals or own goals are not explicitly labeled, but can be derived from the evolution in cumulative score between neighboring timestamps. After processing, 171 games were found to have ill-formed commentary or misaligned end-game match scores compared to the goal records in the key events. These matches were eliminated from the original 2,434 games crawled with commentary, giving us a total of 2,263 games. Finally, the collected data is partitioned into distinct training (70%), validation (15%) and test (15%) sets.
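The representation described above can be illustrated with a minimal sketch. It is not the released preprocessing code: the names EVENT_TYPES, TEAMS, GameState and build_states are hypothetical, the input record layout is an assumption, and the cumulative score is assumed to be read from the live-score stream rather than recomputed from goal events.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical constants mirroring the five event types and two teams.
EVENT_TYPES = ("goal", "assist", "yellow card", "red card", "switch")
TEAMS = ("home", "guest")

Slot = Tuple[str, str]  # (event type, team)


def empty_slots() -> Dict[Slot, Optional[str]]:
    """Ten event-player pairs, all initially without a player."""
    return {(event, team): None for event in EVENT_TYPES for team in TEAMS}


@dataclass
class GameState:
    timestamp: str
    # Cumulative (home, guest) score at this timestamp.
    score: Tuple[int, int] = (0, 0)
    # None means that the event did not occur at this timestamp.
    events: Dict[Slot, Optional[str]] = field(default_factory=empty_slots)


def build_states(records: List[tuple]) -> List[GameState]:
    """Build one state per timestamp.

    Each record is assumed to be (timestamp, [(event, team, player), ...], score).
    Simplification: a second event of the same type and team within the
    same minute would overwrite the first one.
    """
    states = []
    for timestamp, events, score in records:
        state = GameState(timestamp=timestamp, score=score)
        for event, team, player in events:
            state.events[(event, team)] = player
        states.append(state)
    return states


records = [
    ("12", [], (0, 0)),
    ("13", [("goal", "home", "Player A"), ("assist", "home", "Player B")], (1, 0)),
]
states = build_states(records)
```

In this sketch, a change in score between neighboring states that is not matched by a regular goal slot of the scoring side would be the place to look for special cases such as own goals.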

State Definition and Task Proposal
For each match $m$ in the dataset $M$, there is a set of timestamps $T_m = \{t\}$ accurate to the minute. As input, we are given a stream of commentaries $C_m = \{c_t\}_{t=1}^{T_m}$, where $c_t$ represents the paragraph of commentary at time $t$. The output is a set of general match states $S_m = \{s_t\}_{t=1}^{T_m}$ such that each $s_t$ reflects the state change in the comment $c_t$ at the same timestamp. $s_t$ contains a set of events $e^{(t)}_{i,j}$, where $i$ represents the event type ($i \in \{\text{goal}, \text{assist}, \text{yellow card}, \text{red card}, \text{switch}\}$) and $j$ denotes the event actor ($j \in \{\text{home}, \text{guest}\}$).
Given the sparse distribution of $s_t$, we propose two alternative variants of the variable to assess the difficulty of state tracking at different granularity levels of state resolution.
Team Level. In this simplest notion of state, events are tracked at the team level. In other words, the variable $e^{(t)}_{\text{goal, home}}$ is set to yes if the home team indeed scored a goal in a given minute, and to no otherwise.
Player Level. At this significantly increased level of resolution, all events are additionally associated with their player agents $p \in P$, where $P$ denotes the collection of players. Concretely, the variable $e^{(t)}_{i,j}$ is mapped to either the related players' names $p$ or a none answer for each event at time $t$. To facilitate this form of state, match meta-information includes lineups that associate present players with teams.
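The relation between the two granularity levels can be sketched as a simple projection. The function names and the dictionary layout below are illustrative assumptions; only the yes/no and none answers come from the task definition.

```python
from typing import Dict, Optional, Tuple

Slot = Tuple[str, str]  # (event type, team)


def to_player_level(slots: Dict[Slot, Optional[str]]) -> Dict[Slot, str]:
    """Player-level target: the player's name, or 'none' if nothing happened."""
    return {slot: (player if player is not None else "none")
            for slot, player in slots.items()}


def to_team_level(slots: Dict[Slot, Optional[str]]) -> Dict[Slot, str]:
    """Team-level target: 'yes' if the event occurred for that team, else 'no'."""
    return {slot: ("yes" if player is not None else "no")
            for slot, player in slots.items()}


# Toy state with only two of the ten slots shown.
toy_slots = {("goal", "home"): "Player A", ("goal", "guest"): None}
team_targets = to_team_level(toy_slots)
player_targets = to_player_level(toy_slots)
```

The team-level target discards the player identity, which is why it can be treated as a closed classification problem, whereas the player-level target ranges over an open set of names.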

Analysis and Baseline Experiments
In the following, we provide descriptive statistics of the SOCCER dataset and include two model baselines for recognizing match events resulting in changes of states.

Dataset Statistics and Comparison
The SOCCER dataset covers 2,263 matches with 135,805 pieces of commentary and 31,542 in-game event records. Across all event records, each event type of each team appears approximately 3,154 times on average. There are a total of 3,507 unique player names across all event types and an average of 1,219 unique player names per event type per team. A more detailed overview of the distribution of event types and player names can be seen in Table 1.
Common state tracking datasets, whether in dialogue systems or procedural texts, are designed to capture frequent state changes in the text. In SOCCER, we study a more general setting where the corpus is much less information-dense due to an abundance of non-event related chatter. To quantify this difference, we define information density (ID) as:

$$\mathrm{ID} = \frac{\text{Total \# of state changes}}{\text{Total \# of turns/steps/timestamps}}$$

As shown in Table 2, our dataset has a considerably lower information density with more turns of information. In SOCCER, the match state is updated only every 5 timestamps, while in datasets such as MultiWOZ2.1 (Eric et al., 2019) and OpenPI (Tandon et al., 2020), there are between 1 and 4 state changes per turn or step on average.
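As a small worked example of this definition, the sketch below computes ID from per-timestamp counts of state changes. The function name and the toy counts are assumptions made for illustration and are not taken from Table 2.

```python
from typing import Sequence


def information_density(changes_per_step: Sequence[int]) -> float:
    """Total number of state changes divided by the number of steps."""
    if not changes_per_step:
        raise ValueError("at least one turn/step/timestamp is required")
    return sum(changes_per_step) / len(changes_per_step)


# Toy match: ten timestamps, two of which carry a single state change.
toy_match = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
toy_density = information_density(toy_match)  # 2 / 10 = 0.2

# Toy dialogue: every turn changes two slots.
toy_dialogue = [2, 2, 2]
dialogue_density = information_density(toy_dialogue)  # 6 / 3 = 2.0
```

A density of 0.2 corresponds to one state update every five timestamps, whereas a density above 1 means that a single turn typically changes several slots at once.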

Baseline Setup
SOCCER presents a new challenge to the state tracking community by introducing a more general corpus with an all-new state definition and a sparse information distribution. These properties render it difficult to directly apply some existing models such as TRADE used in DST tasks and ProLocal (Dalvi et al., 2018) proposed for procedural texts. Motivated by previous work on state tracking and based on the characteristics of the task, we use two baseline training and inference schemes: 1) a GRU (Cho et al., 2014) classifier with pre-trained BERT (Devlin et al., 2019) embeddings, and 2) a generative pre-trained GPT-2 (Radford et al., 2019) variant.
GRU Classifier with BERT Embeddings. The GRU model is used as a preliminary baseline to assess the difficulty level of the SOCCER dataset. Embeddings of the timestamped commentary $c_t$ are obtained from the pretrained weights of BERT (Devlin et al., 2019) and are then fed into a 1-layer GRU (Cho et al., 2014) network followed by two feed-forward layers. We only task this model with team-level state tracking, as the classification would be extremely difficult if each player name were treated as a distinct class. We map the 10 event variables $e^{(t)}_{i,j}$ as binary flags to a 10-bit scalar value in which each digit denotes the predicted value of a variable. For example, if the 0th position corresponds to the variable $e^{(t)}_{\text{goal, home}}$, then the predicted value at that position denotes whether the home team scores a goal (see Figure 4). Compared to converting the problem into ten binary classifications, this allows us to directly model the joint occurrence of events.
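The mapping from ten binary flags to a single class label can be sketched as follows. The slot ordering beyond position 0 and the choice of treating position 0 as the least significant bit are assumptions made for this illustration, as are the names SLOTS, encode_flags and decode_label.

```python
# Hypothetical slot order; only position 0 = (goal, home) follows the text.
EVENT_TYPES = ("goal", "assist", "yellow card", "red card", "switch")
TEAMS = ("home", "guest")
SLOTS = [(event, team) for event in EVENT_TYPES for team in TEAMS]
NUM_CLASSES = 2 ** len(SLOTS)  # 1024 possible joint outcomes


def encode_flags(active):
    """Pack the set of active (event, team) slots into one integer label."""
    label = 0
    for position, slot in enumerate(SLOTS):
        if slot in active:
            label |= 1 << position
    return label


def decode_label(label):
    """Recover the set of active slots from an integer label."""
    return {slot for position, slot in enumerate(SLOTS)
            if (label >> position) & 1}


# A home goal (position 0) with a home assist (position 2): bits 1 + 4.
joint_label = encode_flags({("goal", "home"), ("assist", "home")})
no_event_label = encode_flags(set())
```

Under this encoding a classifier with a single softmax over NUM_CLASSES outputs predicts all ten flags jointly, at the cost of many classes that are never or very rarely observed.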
GPT-2 Based Variant. Recent approaches to state tracking (Kim et al., 2019; Hosseini-Asl et al., 2020; Tandon et al., 2020) have shown that generative models are competitive, especially in open-vocabulary settings. Inspired by SimpleTOD (Hosseini-Asl et al., 2020) and the OpenPI baseline (Tandon et al., 2020), we cast the player-level state tracking task as a sequence generation problem, allowing us to leverage the capabilities of causal language models such as GPT-2 (Radford et al., 2019). The training sequence consists of a concatenation of the commentary, event types and player names, allowing us to model the joint probability of the whole sequence. Event names are preprocessed as tokens like goal_home to avoid being tokenized into sub-word units. Commentary and event-player pairs are encapsulated in special tokens to help the model distinguish context from labels. See Figure 4 for a schematic overview of the model training input. In training, the model takes the concatenated sequence as input to perform the next-token prediction task. At inference time, greedy decoding is used to generate state predictions due to its superior performance compared to beam search and top-k sampling (Hosseini-Asl et al., 2020).
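A possible layout of such a training sequence is sketched below. Only the goal_home token style comes from the description above. The delimiter strings, the handling of event types containing spaces, and the function names are hypothetical, and registering the new tokens with a tokenizer is deliberately left out.

```python
# Hypothetical delimiter tokens separating context from labels.
CONTEXT_START, CONTEXT_END = "<|context|>", "<|endofcontext|>"
STATE_START, STATE_END = "<|state|>", "<|endofstate|>"

EVENT_TYPES = ("goal", "assist", "yellow card", "red card", "switch")
TEAMS = ("home", "guest")


def event_token(event, team):
    """Join event type and team into one token, e.g. goal_home."""
    return f"{event.replace(' ', '_')}_{team}"


def build_training_sequence(commentary, state):
    """Concatenate commentary and event-player pairs into one string.

    `state` maps (event, team) to a player name, or None for a none answer.
    """
    pairs = " ".join(
        f"{event_token(event, team)} {player if player is not None else 'none'}"
        for (event, team), player in state.items()
    )
    return (f"{CONTEXT_START} {commentary} {CONTEXT_END} "
            f"{STATE_START} {pairs} {STATE_END}")


example_state = {(event, team): None for event in EVENT_TYPES for team in TEAMS}
example_state[("goal", "home")] = "Player A"
sequence = build_training_sequence(
    "Player A fires it into the bottom corner!", example_state
)
```

At inference time, a model trained on such sequences would be prompted with everything up to and including the state-opening token and asked to generate the remaining event-player pairs.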

Implementation Details
During preprocessing, we find that 98.1% of comments in the collection are shorter than 200 words; therefore, any outliers with a length of more than 200 words are truncated at that point. Then, the input text sequences are tokenized using byte-pair encoding (Sennrich et al., 2016) to avoid out-of-vocabulary words. The sentence embeddings processed by the GRU classifier stem from the pretrained weights of HuggingFace's BERT model (Wolf et al., 2019). The GPT-2 model (Radford et al., 2019) is also obtained from HuggingFace with pretrained weights, which are then fine-tuned on SOCCER.

Evaluation
Accuracy and recall for occurrences of all event types are used to assess the performance of both models. Due to the sparsity of event occurrences, recall is crucial to track the models' ability to extract events given the full set of types. For convenience, we refer to event types with ground truth none answers as negative cases and to all others as positive cases. Therefore, recall among event occurrences is referred to as positive recall in the tables. More specifically, in Tables 3 and 5, accuracy and positive recall are measured on all labels (positive and negative combined). In Table 4, the performance is reported on positive labels only, and detailed metrics including precision, recall and F1 scores are provided.
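One plausible reading of these two metrics is sketched below. The function name, the exact-match criterion for player names, and the toy labels are assumptions and do not reproduce the evaluation script used for the tables.

```python
def accuracy_and_positive_recall(gold, pred, negative="none"):
    """Accuracy over all labels and recall over positive gold labels only."""
    pairs = list(zip(gold, pred))
    accuracy = sum(g == p for g, p in pairs) / len(pairs)
    positives = [(g, p) for g, p in pairs if g != negative]
    # Convention chosen here: recall is 0.0 when there are no positives.
    recall = (sum(g == p for g, p in positives) / len(positives)
              if positives else 0.0)
    return accuracy, recall


# Toy example: eight negative cases and two positive cases.
gold = ["none"] * 8 + ["Player A", "Player B"]

# A predictor that always answers "none" looks accurate but finds no event.
majority_pred = ["none"] * 10
majority_scores = accuracy_and_positive_recall(gold, majority_pred)

# A predictor that finds one of the two events.
partial_pred = ["none"] * 8 + ["Player A", "none"]
partial_scores = accuracy_and_positive_recall(gold, partial_pred)
```

The toy majority predictor reaches an accuracy of 0.8 with a positive recall of 0.0, which illustrates why accuracy alone is uninformative when negative cases dominate.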

Results
This section reports the results on the test set of SOCCER. As a naïve distributional baseline, we compute the ratio of negative cases in the test set to be 0.9766.
In Table 3, both models achieve an accuracy that is approximately equal to this majority class baseline due to the heavily imbalanced distribution of event positives and negatives. While accuracy scores are very high, positive recall is much lower, indicating that many event occurrences are missed by the models. When comparing the GPT-2 model's performance on team-level and player-level event recognition, we notice that player-level recall is substantially worse than team-level recall. This result suggests that complex state tracking involving broad ranges of possible slot values is a comparatively harder task that may require more sophisticated approaches.

Results Per Event Type
In addition to these general results, we break down model performance on positive cases by event type and provide additional metrics including precision, recall and F1 scores (see Table 4). When associating the scores with the event type distribution (see Table 1), we can observe that, generally, greater numbers of available data points result in better performance. Take the event type goal as an example. According to Table 1, there are about 800 more positive cases of the event $e^{(t)}_{\text{goal, home}}$ than of $e^{(t)}_{\text{goal, guest}}$, a difference that is reflected in all the metrics in Table 4 for both models. Another interesting point to note is the performance gap between the GRU classifier and the GPT-2 model on the event type red card. The red card event is extremely rare in SOCCER, as illustrated in Table 1. Though we observe the performance of both models on red card events to be comparatively lower than on the other events, the GRU classifier is able to capture more positive cases while no occurrences are detected by GPT-2.

Results on Varying Information Densities
In Section 5.1, we have shown that a key difference between SOCCER and other state tracking datasets lies in its low information density (see Table 2 for a detailed comparison). It is conceivable that such differences in information density affect state tracking performance. To eliminate confounding effects introduced via direct comparison to other datasets, this section explores the connection between event density across pieces of commentary and model performance. We begin by discarding all but the truly event related comments in each match to obtain a subset containing 0% negative cases. This subset contains 25,934 event related comments across all matches. Then, by randomly replacing positive comments (that is, comments with event occurrences) with negative ones from the same match at a sparsity ratio r ∈ {20%, 40%, 60%, 80%}, we keep the total number of comments at the same constant count of 25,934 and keep the temporal ordering of comments intact, while effectively reducing the level of information density. Table 5 reports accuracy and positive recall for both methods and task levels when training and evaluating on non-overlapping splits of the newly constructed subsets. Note that, despite our earlier discussion of information density, Table 5 reports a converse notion, sparsity. In this setting, 0% corresponds to the highest and 80% to the lowest information density. Comparing accuracy at different event sparsity levels, we notice that scores increase as events become more sparsely distributed. This effect stems from the fact that, when we replace event related comments with non-event chatter, chance agreement improves as the number of true negatives increases. Positive recall of event occurrences, however, demonstrates an opposing trend, suggesting that the task of recognizing true state updates becomes more challenging the sparser the discourse domain is. This assumption is further supported by the difference in performance observed on SOCCER vs. existing collections such as MultiWOZ2.1 (Eric et al., 2019), where recall scores of many models lie in the mid-fifty percent range.
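The replacement procedure can be sketched for a single match as follows. The tuple layout, the function name, the fixed seed, and the fallback of using fewer replacements when a match has too few negative comments are assumptions made for illustration.

```python
import random


def sparsify_match(comments, ratio, seed=0):
    """Replace a share `ratio` of positive comments with negative ones.

    `comments` is a list of (position, text, is_positive) tuples, where
    `position` reflects the temporal order within the match.
    """
    rng = random.Random(seed)
    positives = [c for c in comments if c[2]]
    negatives = [c for c in comments if not c[2]]
    # Assumption: never replace more positives than negatives are available.
    n_replace = min(round(ratio * len(positives)), len(negatives))
    dropped = set(rng.sample(range(len(positives)), n_replace))
    kept = [c for i, c in enumerate(positives) if i not in dropped]
    fillers = rng.sample(negatives, n_replace)
    # Sorting by position keeps the temporal ordering intact.
    return sorted(kept + fillers, key=lambda c: c[0])


# Toy match: 15 comments, every third one is event related (5 positives).
toy_comments = [(i, f"comment {i}", i % 3 == 0) for i in range(15)]
dense = sparsify_match(toy_comments, ratio=0.0)
sparse = sparsify_match(toy_comments, ratio=0.4)
```

With a ratio of 0.4, two of the five positive comments in the toy match are swapped for negative ones, so the subset keeps its size of five while its information density drops.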

Conclusion
In this paper, we introduce SOCCER, the first discourse state tracking collection in the sports commentary domain. We propose two different levels of state granularity and provide two performance benchmarks for models ranging from GRU (Cho et al., 2014) for embedding temporal dependency to GPT-2 (Radford et al., 2019) for causal language modeling. The dataset shows a much lower information density than many existing resources on state tracking, making it considerably more challenging. We believe that, in conjunction with the wide vocabulary of player-level notions of state, this property makes SOCCER an exciting resource on which our community can advance discourse state tracking to a broader range of settings than have been studied previously.