Cross-lingual Semantic Representation for NLP with UCCA

This is an introductory tutorial on UCCA (Universal Conceptual Cognitive Annotation), a cross-linguistically applicable framework for semantic representation, with corpora annotated in English, German and French, and ongoing annotation in Russian and Hebrew. UCCA builds on extensive typological work and supports rapid annotation. The tutorial will provide a detailed introduction to the UCCA annotation guidelines, design philosophy, and available resources, as well as a comparison to other meaning representations. It will also survey existing parsing work, including the findings of three recent shared tasks, at SemEval and CoNLL, that addressed UCCA parsing. Finally, the tutorial will present recent applications and extensions of the scheme, demonstrating its value for natural language processing in a range of languages and domains.


Introduction
Universal Conceptual Cognitive Annotation (Abend and Rappoport, 2013), abbreviated as "UCCA", is a symbolic meaning representation (MR) that supports human annotation of text with broad coverage. While several meaning representation schemes share this goal, UCCA targets a level of semantic granularity that abstracts away from syntactic paraphrases in a typologically-motivated, cross-linguistic fashion, building on Basic Linguistic Theory (Dixon, 2010–2012), an influential framework for linguistic description. The scheme does not rely on language-specific resources, and sets a low threshold for annotator training.
UCCA has been annotated on several corpora of different genres and languages, as summarized in Table 1. Pilot studies have been conducted in additional languages. A web-based annotation system is available.
In UCCA, an analysis of a text passage is a directed acyclic graph over semantic elements called units. The principal kind of unit is a scene, which describes an action, movement or state, and is similar to FrameNet's notion of a frame. Figure 1 contains three scenes, evoked, respectively, by the verb took, the noun phrase a repair, and the possessive our. Several elements are exemplified, including participants, secondary relations, and scene linkage. The graph is anchored in the text tokens (the leaves generally correspond to one or more tokens), and relations between units are indicated by the categories assigned to the edges connecting them.
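To make this structure concrete, the following is a minimal sketch in plain Python (not the UCCA toolkit's actual API) of a passage as a DAG of units with category-labeled edges, whose leaves anchor to tokens. The unit names and the exact labeling are illustrative toy choices, not the official annotation of the Figure 1 sentence:

```python
from dataclasses import dataclass, field

# Toy sketch of a UCCA-style passage: a DAG of units, where leaves
# anchor to tokens and edges carry category labels such as
# "Process" or "Participant". Illustrative only.

@dataclass
class Unit:
    label: str                                 # human-readable unit name
    tokens: tuple = ()                         # anchored tokens (leaves)
    edges: list = field(default_factory=list)  # (category, child) pairs

    def add(self, category, child):
        self.edges.append((category, child))
        return child

# Fragment of the scene evoked by the verb "took".
root = Unit("scene:took")
root.add("Process", Unit("took", tokens=("took",)))
root.add("Participant", Unit("we", tokens=("We",)))
vehicle = root.add("Participant", Unit("our vehicle"))
vehicle.add("Elaborator", Unit("our", tokens=("our",)))
vehicle.add("Center", Unit("vehicle", tokens=("vehicle",)))

def yield_tokens(unit):
    """Collect anchored tokens by walking the DAG depth-first."""
    toks = list(unit.tokens)
    for _, child in unit.edges:
        toks.extend(yield_tokens(child))
    return toks

print(yield_tokens(root))  # tokens reachable from the root scene
```

Note that because the graph is anchored, the annotated tokens are always recoverable by traversing the units, which is what distinguishes UCCA from unanchored representations.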
The goals of this tutorial are: to describe the UCCA representation as a linguistic scheme and how it is being used computationally, especially for cross-lingual and multilingual NLP; to familiarize participants with existing UCCA parsers and equip them with the conceptual tools required for designing new parsers; and to review existing extensions and possible future directions.

[Figure 1: UCCA annotation of the sentence "We took our vehicle in for a repair to the air conditioning."]


Relevance
UCCA resources and applications are valuable for cross-lingual NLP: like Universal Dependencies (Nivre et al., 2020), UCCA's category set can in principle be applied to a wide variety of languages. It is also cross-linguistically stable, and reflects a level of semantic structure that is usually preserved in translations (Sulem et al., 2015). UCCA has been applied in NLP to text simplification (Sulem et al., 2018b; Sulem et al., 2020) and to text-to-text generation evaluation (Birch et al., 2016; Mareček et al., 2017; Choshen and Abend, 2018; Sulem et al., 2018b; Alva-Manchego et al., 2019; Xu et al., 2020). The tutorial will describe the guidelines and rationale behind UCCA, helping potential application designers understand what abstractions it makes. Significant effort has been devoted to building UCCA parsers (Arviv et al., 2020; Samuel and Straka, 2020; Dou et al., 2020), including a SemEval 2019 shared task on cross-lingual UCCA parsing (Hershcovich et al., 2019b), which had 8 participating teams, as well as the CoNLL 2019 and CoNLL 2020 shared tasks on cross-framework and cross-lingual meaning representation parsing (Oepen et al., 2020), where 12 and 4 teams, respectively, submitted parsed UCCA graphs. This tutorial will allow researchers interested in UCCA parsing, and more generally in graph parsing, to deepen their understanding of the framework and what properties make it unique. The tutorial will include a brief survey of the various approaches taken by existing parsers, and prepare attendees to work on UCCA parsing themselves.
Furthermore, UCCA parsing has been shown to benefit from multi-task learning (Caruana, 1997) with other meaning representations (Hershcovich et al., 2018), although results from the CoNLL 2019 and CoNLL 2020 shared tasks (Oepen et al., 2020) show that multi-task meaning representation parsing is difficult. The tutorial will compare and contrast UCCA and other meaning representations, and will thereby inform participants of the potential advantages and difficulties in employing multi-task learning across semantic schemes.
UCCA defines a small inventory of coarse-grained categories so as not to rely on language-specific lexical resources, and can thus in principle be applied to a great variety of languages. This distinguishes UCCA from finer-grained sentence-structural representations like FrameNet (Baker et al., 1998), the Abstract Meaning Representation (Banarescu et al., 2013), which relies on PropBank (Palmer et al., 2005), and Universal Decompositional Semantics (White et al., 2016). For example, FrameNet requires a different ontology for each new language addressed (Ohara et al., 2003; You and Liu, 2005; Borin et al., 2013; Park et al., 2014; Hayoun and Elhadad, 2016; Djemaa et al., 2016), and AMR underwent significant customization to be applicable to Chinese (Li et al., 2016). Decomp takes a different approach to multilinguality, where the parser is required to parse sentences in other languages to their corresponding English semantic forms (Zhang et al., 2018). The tutorial will address contemporary issues in the field, such as the question of how to represent semantic structure multilingually with broad coverage, which is actively being explored from many angles.
While UCCA structures and categories are intentionally coarse, the scheme has a multi-layered architecture that allows refinement via additional layers, which serve as "modules" of semantic distinctions. We will give an overview of the recently proposed extensions (to support coreference) and joint parsing experiments (Prange et al., 2019a; Prange et al., 2019b).
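As a rough illustration of this layered architecture (a toy sketch, not the UCCA toolkit's actual API), a refinement layer can be modeled as an overlay that attaches finer-grained labels to edges of the foundational layer without modifying it:

```python
# Toy sketch of UCCA's multi-layered architecture: a foundational
# layer of coarse edge categories, plus independent refinement
# layers that add finer distinctions without altering the base graph.
# Unit names and labels here are illustrative, not the official scheme.

foundational = {
    # edge id -> (parent unit, coarse category, child unit)
    "e1": ("scene:took", "Participant", "our vehicle"),
    "e2": ("scene:took", "Process", "took"),
}

# A hypothetical refinement layer ("module") mapping foundational
# edges to finer-grained subcategories, e.g. semantic roles.
role_layer = {
    "e1": "Theme",  # refine the coarse Participant label
}

def labels(edge_id):
    """Return the coarse category plus any refinements for an edge."""
    _, category, _ = foundational[edge_id]
    refined = [layer[edge_id] for layer in (role_layer,) if edge_id in layer]
    return [category] + refined

print(labels("e1"))  # ['Participant', 'Theme']
print(labels("e2"))  # ['Process']
```

The point of the overlay design is that each module can be annotated, versioned, and parsed independently of the foundational layer it refines.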

Agenda
The planned division of time is as follows:
1. Bird's eye view (45m). Design philosophy, the notion of scenes, basic explanation of categories, simple examples.
3. Data and annotation (10m). Overview of annotated data (see §1) and the annotation process and software.
Relation to other representations (15m). Comparison to other meaning representations and to UD (Hershcovich et al., 2019a).

Prerequisites
No prior knowledge of linguistics or typology is assumed; the necessary background will be provided as part of the tutorial. However, participants are expected to be familiar with basic data structures such as trees and graphs. For the parsing section, familiarity with common machine learning techniques, including supervised learning and neural networks, is assumed.

Reading list
The following are recommended reading before the tutorial, as they provide background and context for the tutorial materials:
1. Chapter 3 of Dixon (2005) contains an introduction to some basic concepts in semantics on which UCCA is based.
2. Kiperwasser and Goldberg (2016) present a transition-based parser using an architecture on which TUPA, the first UCCA parser, is based (Hershcovich et al., 2017).
3. Peng et al. (2017) performed multi-task learning for meaning representation parsing, inspiring work on cross-framework parsing for UCCA (Hershcovich et al., 2018).
4. (2017) compare and contrast several meaning representations according to various aspects.
5. Deng and Xue (2017) investigate translation divergences using a hierarchical alignment, and discuss bridging them with cross-lingual semantic representations.
6. Croft et al. (2017) list typologically-informed design criteria for Universal Dependencies (Nivre et al., 2020), which are also relevant for other structural representations in NLP.

Presenters
The presenters of this tutorial are at various career stages and are diverse in geography and gender. Omri Abend (https://www.cse.huji.ac.il/~oabend) is a Senior Lecturer (Assistant Professor) of Computer Science and Cognitive Science at the Hebrew University of Jerusalem. Research interests: computational semantics, and specifically cross-linguistically applicable semantic and grammatical representation, semantic parsing, corpus annotation and evaluation. Relevant experience: co-developer of the UCCA scheme, partner in all annotation and application efforts related to UCCA, and in some of the parsing efforts. He publishes regularly in NLP conferences (ACL, NAACL, EMNLP, etc.).
Dotan Dvir has been managing the UCCA manual annotation project at the Hebrew University of Jerusalem since 2017. She was involved in writing version 2 of the UCCA guidelines. She has in-depth knowledge of the UCCA guidelines and is experienced in instructing annotators about them. Before joining the UCCA project, she had been working as a text analyst in IBM's Project Debater (2014–2017).
Daniel Hershcovich (https://danielhers.github.io) is a Tenure-Track Assistant Professor at the University of Copenhagen, Denmark. Daniel pioneered the work on UCCA parsing, and is interested in semantic parsing and meaning representations. Daniel develops and maintains the UCCA toolkit Python codebase, has teaching experience in NLP and ML courses, and publishes in NLP conferences.
Jakob Prange (https://prange.jakob.georgetown.domains) is pursuing his Ph.D. at Georgetown University, investigating design, annotation, and parsing strategies for various meaning representations. Among other formalisms (SNACS, frame semantics, STAG, CCG), he has studied and worked with UCCA over the past two years, which recently resulted in two published proposals of novel UCCA extensions, for coreference and semantic roles. He has experience with teaching in multicultural classroom settings and presenting research at international conferences.

Nathan Schneider (http://nathan.cl) leads an interdisciplinary computational linguistics research group at Georgetown University. He has worked on the design and parsing of a range of broad-coverage representations for different aspects and granularities of meaning, including multiword expressions, supersenses, frame semantics, AMR, and UCCA (as a multiyear collaboration with the co-presenters). He has experience teaching meaning representations in classroom settings as well as in conference tutorials, notably a tutorial on AMR (Schneider et al., 2015) whose materials continue to serve as a useful introduction to the scheme, and will serve as a model for the proposed UCCA tutorial.