The Parallel Meaning Bank: Towards a Multilingual Corpus of Translations Annotated with Compositional Meaning Representations

The Parallel Meaning Bank is a corpus of translations annotated with shared, formal meaning representations comprising over 11 million words divided over four languages (English, German, Italian, and Dutch). Our approach is based on cross-lingual projection: automatically produced (and manually corrected) semantic annotations for English sentences are mapped onto their word-aligned translations, assuming that the translations are meaning-preserving. The semantic annotation consists of five main steps: (i) segmentation of the text into sentences and lexical items; (ii) syntactic parsing with Combinatory Categorial Grammar; (iii) universal semantic tagging; (iv) symbolization; and (v) compositional semantic analysis based on Discourse Representation Theory. These steps are performed using statistical models trained in a semi-supervised manner. The employed annotation models are all language-neutral. Our first results are promising.


Introduction
There is no reason to believe that the ingredients of a meaning representation for one language should differ from those for another language. Hence, a meaning-preserving translation of a sentence into another language should, arguably, have an equivalent meaning representation. Consequently, given a parallel corpus with at least one language for which one can automatically generate meaning representations with sufficient accuracy, one indirectly also produces meaning representations for the aligned sentences in the other languages. The aim of this paper is to present a method that implements this idea in practice by building a parallel corpus with shared formal meaning representations, that is, the Parallel Meaning Bank (PMB).
Recently, several semantic resources (corpora of texts annotated with meanings) have been developed to stimulate and evaluate semantic parsing. Usually, such resources are manually or semi-automatically created, and this process is expensive since it requires training of and annotation by human annotators. The AMR banks of Abstract Meaning Representations for English (Banarescu et al., 2013) or Chinese and Czech (Xue et al., 2014) sentences, for instance, are the result of manual annotation efforts. Another example is the development of the Groningen Meaning Bank (Bos et al., 2017), a corpus of English texts annotated with formal, compositional meaning representations, which took advantage of existing semantic parsing tools, combining them with human corrections.
In this paper we propose a method for producing meaning banks for several languages (English, Dutch, German and Italian) by taking advantage of translations. On the conceptual level we follow the approach of the Groningen Meaning Bank project (Basile et al., 2012), and use some of the tools developed in it. The main reason for this choice is that we are not only interested in the final meaning of a sentence, but also in how it is derived, that is, in the compositional semantics. These derivations, based on Combinatory Categorial Grammar (CCG, Steedman, 2001), give us the means to project semantic information from one sentence to its translated counterpart.
The goal of the PMB is threefold. First, it will serve as a test bed for cross-lingual compositional semantics, enabling systematic studies of the challenges arising from loose translations and different semantic granularities. The second goal is to produce data for building semantic parsers for languages other than English. This, in turn, will help with the third, long-term goal, which concerns the process of translation itself. Human translators purposely change meaning in translation to yield better translations (Langeveld, 1986). The third goal is thus to develop methods to automatically detect such shifts in meaning.

Languages and Corpora
The foundation of the PMB is a large set of raw, parallel texts. Ideally, each text has a parallel version in every language of the meaning bank, but in practice, having a version for the pivot language (here: English) and one other language is sufficient for our purposes. Another criterion for selection is that freely distributable texts are preferable over texts which are under copyright and require (paid) licensing.
Besides English we chose two other Germanic languages, Dutch and German, because they are similar to English. We also include one Romance language, Italian, in order to test whether our method works for languages which are typologically more different from English.
These corpora are divided over 100 parts in a balanced way. Initially, two of these parts, 00 and 10, are selected to be the gold standard (and thus will be manually annotated). This ensures that the gold standard represents the full range of genres.
1 https://tatoeba.org
2 http://gutenberg.org, http://etc.usf.edu/lit2go, http://gutenberg.spiegel.de

Automatic Annotation Pipeline
Our goal is first to richly annotate the English corpus, with annotations ranging from segmentation to deep semantics, and then to project these annotations to the other languages via alignment. The annotation consists of several layers, each of which is presented in detail below. Figure 1 gives an overview of the pipeline, while Figure 2 shows an annotation example.

Segmentation
Text segmentation involves word and sentence boundary detection. Multiword expressions that represent constituents are treated as single tokens. Closed compound words that have a semantically transparent structure are decomposed. For example, impossible is decomposed into im and possible, while Las Vegas and 2 pm are each analysed as a single token. In this way we aim to assign 'atomic' meanings to tokens and avoid redundant lexical semantics. Segmentation follows an IOB-annotation scheme on the level of characters, with four labels: beginning of sentence, beginning of word, inside a word, and outside a word. We use the same statistical tokenizer, Elephant (Evang et al., 2013), for all four languages, but with language-specific models.
Figure 2: Document 00/3178: Projection of the annotation from English to German. The source sentence is annotated, in this order, with semtags, symbols, CCG categories and lexical semantics. The DRS for the whole sentence is obtained compositionally from the lexical DRSs.
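The character-level labelling scheme used for segmentation can be illustrated with a small decoder. This is only a sketch: the single-letter codes S, T, I and O for the four labels, the function name decode_segmentation and the example sentence are assumptions made for illustration, and the labels are written by hand here rather than predicted by a statistical model such as Elephant.

```python
from typing import List

# One label per character. The letter codes are assumed for this sketch:
#   S = beginning of sentence, T = beginning of word (token),
#   I = inside a word,         O = outside a word.
def decode_segmentation(text: str, labels: str) -> List[List[str]]:
    """Turn character labels into a list of sentences, each a list of tokens."""
    if len(text) != len(labels):
        raise ValueError("need exactly one label per character")
    sentences: List[List[str]] = []
    for char, label in zip(text, labels):
        if label == "S":        # opens a new sentence and its first token
            sentences.append([char])
        elif label == "T":      # opens a new token in the current sentence
            if not sentences:
                sentences.append([])
            sentences[-1].append(char)
        elif label == "I":      # extends the current token
            if not sentences or not sentences[-1]:
                raise ValueError("'I' label without an open token")
            sentences[-1][-1] += char
        elif label == "O":      # character belongs to no token
            continue
        else:
            raise ValueError(f"unknown label: {label!r}")
    return sentences

text = "It is impossible at 2 pm. Go!"
# Labels are grouped per token for readability.
# "impossible" is split into "im" + "possible"; "2 pm" stays one token
# because the space inside it is labelled I rather than O.
labels = "".join([
    "SI", "O", "TI", "O", "TI", "TIIIIIII", "O",
    "TI", "O", "TIII", "T", "O", "SI", "T",
])
segmented = decode_segmentation(text, labels)
```

Because the labels attach to characters rather than to whitespace-separated words, the same scheme can both split a word (im + possible) and join a multiword expression (2 pm).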

Syntactic Analysis
We use CCG-based derivations for syntactic analysis. The transparent syntax-semantics interface of CCG makes the derivations suitable for wide-coverage compositional semantics (Bos et al., 2004). CCG is also a lexicalised theory of grammar, which makes cross-lingual projection of grammatical information from source to target sentence more convenient (see Section 4).
The version of CCG that we employ differs from standard CCG: in order to facilitate the cross-lingual projection process and retain compositionality, type-changing rules of a CCG parser are explicated by inserting (unprojected) empty elements which have their own semantics (see the token ∅ in Figure 2).
For parsing, we use EasyCCG (Lewis and Steedman, 2014), which was chosen because it is accurate, does not require part-of-speech annotation (which would require different annotation schemes for each language) and is easily adaptable to our modified grammar formalism.

Universal Semantic Tagging
To facilitate the organization of a wide-coverage semantic lexicon for cross-lingual semantic analyses, we develop a universal semantic tagset. The semantic tags (semtags, for short) are language-neutral, generalise over part-of-speech and named entity classes, and also add more specific information when needed from a semantic perspective.
Given a CCG category of a token, we specify a general schema for its lexical semantics by tagging the token with a semtag.
Currently the tagset comprises 80 different fine-grained semtags divided into 13 coarse-grained classes (Bjerva et al., 2016). We do not list all possible semtags here, but give some examples instead. For instance, the semtag NOT marks negation triggers, e.g., not, no, without and affixes, e.g., im- in impossible; the semtag POS is assigned to possibility modals, e.g., might, perhaps and can. ROL identifies roles and professions, e.g., boxer and semanticist, while CON is for concepts like table and wheel. Distinguishing roles from concepts is crucial to get accurate semantic behaviour.
We use a semantic tagger based on deep residual networks. It works directly on the words as input, and therefore requires no additional language-specific features. The first results on semantic tagging, with an accuracy of 83.6%, are reported by Bjerva et al. (2016).

Symbolization
The meaning representations that we use contain logical symbols and non-logical symbols. The latter are based on the words mentioned in the input text. We refer to the process of deriving them as symbolization. It combines lemmatization with normalization, and performs some lexical disambiguation as well. For example, male is the symbol of the pronouns he and himself, europe of the adjective European, and 14:00 of the time expression 2 pm. A symbol, together with a CCG category and a semtag, is sufficient to determine the lexical semantics of a token (see Figure 2). Some function words do not need symbols since their semantics are expressed with logical symbols, e.g., auxiliary verbs, conjunctions, and most determiners.
Notice that the employed symbols are not as radical and verbalized as the concepts in AMRs, e.g., the symbol of opinion is opinion rather than opine. First, using deep forms as symbols often makes it difficult to recover the original and semantically related forms, e.g., if opinion had the symbol opine, then it would be difficult to recover opinion and its semantic relation with idea. Second, alignment of translations does not always work well with deep forms, e.g., opinion can be translated as parere in Italian and mening in Dutch, but it is unnatural to align their symbols to opine. After all, having such alignments would make it difficult to judge good and bad translations, which is one of the goals of the PMB.
The symbolizer could either be implemented as a rule-based system with multiple modules, or as a system that learns the required transformations from examples. The advantage of the latter is that it is more robust to typos and other spelling variants without manual engineering. To evaluate the feasibility of this approach, we built a character-based sequence-to-sequence model with deep recurrent neural networks, which uses words, semtags, and additional data from existing knowledge sources, such as WordNet (Fellbaum, 1998), Wikipedia, and UNECE codes for trade, to do symbolization. We are currently investigating how the performance of the machine-learning-based symbolizer compares to that of a rule-based one incorporating the lemmatizer Morpha (Minnen et al., 2001).
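To make the rule-based option concrete, the following toy sketch reproduces only the three symbolization examples given above (he and himself, European, 2 pm). The lookup table, the regular expression and the function name symbolize are hypothetical; real lemmatization (e.g., with Morpha) and lexical disambiguation are deliberately left out, and lowercasing stands in as a crude fallback.

```python
import re

# Hand-written lookup covering only the examples from the text.
LEXICON = {
    "he": "male",
    "himself": "male",
    "european": "europe",
}

# Matches clock times such as "2 pm" or "11am" (whole hours only).
TIME_RE = re.compile(r"^(\d{1,2})\s*(am|pm)$")

def symbolize(token: str) -> str:
    """Map a token to a symbol using a lookup table and one normalization rule."""
    lowered = token.lower()
    if lowered in LEXICON:
        return LEXICON[lowered]
    match = TIME_RE.match(lowered)
    if match:
        hour = int(match.group(1)) % 12      # 12 am -> 0, 12 pm -> 0 (+12 below)
        if match.group(2) == "pm":
            hour += 12
        return f"{hour:02d}:00"
    # Fallback: no lemmatizer in this sketch, so just lowercase the token.
    return lowered

symbols = [symbolize(t) for t in ["He", "European", "2 pm", "opinion"]]
```

Every new spelling variant or typo would need another rule or table entry in a system of this kind, which is the robustness argument made above for learning the transformations from examples.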

Semantic Interpretation
Discourse Representation Theory (DRT, Kamp and Reyle, 1993) is the semantic formalism used for meaning representation in the PMB. It is a well-studied theory from a linguistic semantic viewpoint and suitable for compositional semantics. Expressions in DRT, called Discourse Representation Structures (DRSs), have a recursive structure and are usually depicted as boxes. The upper part of a DRS contains a set of referents, while the lower part lists a conjunction of atomic or compound conditions over these referents (see the example of a DRS at the bottom of Figure 2).
Boxer (Bos, 2015), a system that employs λ-calculus to construct DRSs in a compositional way, is used to derive meaning representations of the documents. However, the original version of Boxer is tailored to the English language. We have adapted Boxer to work with the universal semtags rather than English-specific part-of-speech tags. Boxer also assigns VerbNet/LIRICS thematic roles (Bonial et al., 2011) to verbs so that the lexical semantics of verbs include the corresponding thematic predicates (see came in Figure 2).
Hence, the input to Boxer is a CCG derivation in which all tokens are decorated with semtags and symbols. This information is enough for Boxer to assign a lexical DRS to each token and produce a DRS for the entire sentence in a compositional and language-neutral way (see Figure 2).
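The box structure described above can be sketched as a small recursive data type. The class names, the tuple encoding of atomic conditions, the referent names and the role label Theme are all assumptions made for illustration; this is not the representation used inside Boxer or the PMB.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

@dataclass
class Negation:
    """A compound condition: the embedded DRS is negated."""
    scope: "DRS"

# An atomic condition is a predicate name followed by its arguments,
# e.g. ("man", "x1") or ("Theme", "e1", "x1").
Condition = Union[Tuple[str, ...], Negation]

@dataclass
class DRS:
    referents: List[str] = field(default_factory=list)         # upper part of the box
    conditions: List[Condition] = field(default_factory=list)  # lower part of the box

    def merge(self, other: "DRS") -> "DRS":
        """Combine two boxes: union of referents, concatenation of conditions."""
        referents = self.referents + [
            r for r in other.referents if r not in self.referents
        ]
        return DRS(referents, self.conditions + other.conditions)

# "A man did not come", built by hand: the negated box is nested
# inside the outer box, which illustrates the recursive structure.
man = DRS(["x1"], [("man", "x1")])
not_come = DRS([], [Negation(DRS(["e1"], [("come", "e1"), ("Theme", "e1", "x1")]))])
sentence = man.merge(not_come)
```

The nested box introduces its own referent e1, which is not visible in the outer box; only x1 is.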
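The idea of building a sentence DRS from lexical DRSs by function application can be sketched with ordinary Python closures standing in for λ-terms. Everything here is a simplification under stated assumptions: DRSs are plain (referents, conditions) tuples, referent names are fixed by hand rather than generated freshly, the role label Theme is illustrative, and the lexical entries are far simpler than those Boxer actually uses.

```python
from typing import Callable, Tuple

# A DRS as a pair: (referents, conditions); a condition is a tuple of strings.
DRS = Tuple[Tuple[str, ...], Tuple[Tuple[str, ...], ...]]

def merge(a: DRS, b: DRS) -> DRS:
    """Union of referents (order-preserving) and concatenation of conditions."""
    referents = a[0] + tuple(r for r in b[0] if r not in a[0])
    return (referents, a[1] + b[1])

def noun(symbol: str) -> Callable[[str], DRS]:
    """Category N: a property, i.e. a function from a referent to a DRS."""
    return lambda x: ((), ((symbol, x),))

def indefinite(ref: str):
    """Category NP/N: introduces a referent, then applies restrictor and scope."""
    return lambda restrictor: lambda scope: merge(
        merge(((ref,), ()), restrictor(ref)), scope(ref)
    )

def intransitive_verb(symbol: str, role: str, event: str):
    """Category S\\NP: takes a noun phrase and hands it the verb's property."""
    return lambda np: np(
        lambda x: ((event,), ((symbol, event), (role, event, x)))
    )

# Forward application: "a" (NP/N) applied to "man" (N) gives an NP.
a_man = indefinite("x1")(noun("man"))
# Backward application: "came" (S\NP) applied to the NP gives the sentence DRS.
sentence_drs = intransitive_verb("come", "Theme", "e1")(a_man)
```

Note that the verb entry is built from a symbol (come) and a thematic role only, which mirrors the point that semtags, symbols and categories suffice to determine the lexical semantics without language-specific information.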

Cross-lingual Projection
The initial annotation for Dutch, German and Italian is bootstrapped via word alignments. Each non-English text is automatically word-aligned with its English counterpart, and non-English words initially receive semtags, CCG categories and symbols based on those of their English counterparts (see Figure 2).
CCG slashes are flipped as needed, and 2:1 alignments are handled through functional composition. Then, the CCG derivations and DRSs can be obtained by applying CCG's combinatory rules in such a way that the same DRS as for the English sentence results (Evang and Bos, 2016; Evang, 2016).
If the alignment is incorrect, it can be corrected manually (see Section 5). The idea behind this way of bootstrapping is to exploit the advanced state of the art in NLP for English, and to encourage parallelism between the syntactic and semantic analyses of different languages.
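The first projection step, copying token-level annotations across word alignments, can be sketched as follows. The Annotation class, the project function and the toy alignment are hypothetical; the semtags ROL and CON come from the examples in the tagging section, while the categories and the two-token "sentences" are purely illustrative. Only 1:1 alignments are handled here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass(frozen=True)
class Annotation:
    semtag: str
    category: str   # CCG category
    symbol: str

def project(
    source: List[Annotation],
    alignment: Dict[int, int],      # target token index -> source token index
    target_length: int,
) -> List[Optional[Annotation]]:
    """Give each aligned target token the annotation of its source token.

    Unaligned target tokens receive None and are left for later correction.
    """
    return [
        source[alignment[i]] if i in alignment else None
        for i in range(target_length)
    ]

# English tokens: "boxer", "table"; German tokens: "Tisch", "Boxer", "auch".
english = [
    Annotation("ROL", "N", "boxer"),
    Annotation("CON", "N", "table"),
]
alignment = {0: 1, 1: 0}            # Tisch <- table, Boxer <- boxer
german = project(english, alignment, target_length=3)
```

The third target token has no English counterpart in this toy alignment, so it stays unannotated; how such gaps are filled is outside this sketch.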
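Slash flipping can be illustrated on category strings. The sketch below flips only the outermost slash of a category and leaves the decision of when to flip to the caller; the function name flip_outer_slash is hypothetical, and the adjective example (a modifier that precedes the noun in English but follows it in Italian) is an assumption chosen for illustration, not a case taken from the PMB.

```python
def flip_outer_slash(category: str) -> str:
    """Flip the direction of the outermost slash of a CCG category string.

    Slashes are taken to be left-associative, so the outermost slash is the
    rightmost one that is not enclosed in parentheses. Atomic categories
    (no slash at depth 0) are returned unchanged.
    """
    depth = 0
    for i in range(len(category) - 1, -1, -1):
        ch = category[i]
        if ch == ")":
            depth += 1
        elif ch == "(":
            depth -= 1
        elif ch in "/\\" and depth == 0:
            flipped = "\\" if ch == "/" else "/"
            return category[:i] + flipped + category[i + 1:]
    return category

# English "red wine": the adjective looks right for its noun (N/N).
# Italian "vino rosso": the projected category must look left instead.
italian_adjective = flip_outer_slash("N/N")
```

Only the argument direction changes; the argument and result categories, and therefore the lexical semantics attached to the token, stay the same.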
To facilitate cross-lingual projection, alignment has to be done at two levels: sentences and words. Sentence alignment is initially done with a simple one-to-one heuristic, with each English sentence aligned to a non-English sentence in order, to be corrected manually. Subsequently, we automatically align words in the aligned sentences using GIZA++ (Och and Ney, 2003).
Although we use existing tools for the initial annotation of English and projection as the initial annotation of non-English documents, our aim is to train new language-neutral models. Training new models on just the automatic annotation will not yield better performance than the combination of existing tools and projection. However, we improve these models constantly by adding manual corrections to the initial automatic annotation, and retraining them. In addition, this approach lets us adapt to revisions of the annotation guidelines.
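The one-to-one sentence alignment heuristic amounts to pairing sentences by position. In this sketch the function name align_in_order is hypothetical, and padding with None when the two documents differ in sentence count is an added assumption, included so that leftover sentences are visible to whoever corrects the alignment.

```python
from itertools import zip_longest
from typing import List, Optional, Tuple

def align_in_order(
    english: List[str], other: List[str]
) -> List[Tuple[Optional[str], Optional[str]]]:
    """Pair the i-th English sentence with the i-th non-English sentence.

    If the documents have different numbers of sentences, the leftovers
    are paired with None instead of being silently dropped.
    """
    return list(zip_longest(english, other))

pairs = align_in_order(
    ["He came.", "She left."],
    ["Hij kwam."],
)
```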

Adding Bits of Wisdom
For each annotation layer, manual corrections can be applied to any of the four languages. These annotations are called Bits of Wisdom (BoWs, following Basile et al. (2012)), and they overrule the annotations of the models if they are in conflict. Based on the BoWs, we distinguish three disjoint classes of annotation layers: gold standard (manually checked), silver standard (including at least one BoW) and bronze standard (no BoWs). Table 1 shows how these classes are distributed across languages and documents.
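The overruling behaviour of BoWs and the three-way classification of annotation layers can be sketched as follows. The dictionary encoding (token index to tag), the function names and the manually_checked flag are assumptions made for illustration; the text does not specify how the PMB stores this information.

```python
from typing import Dict

def apply_bows(automatic: Dict[int, str], bows: Dict[int, str]) -> Dict[int, str]:
    """Return the layer annotation in which every BoW overrules the model output."""
    merged = dict(automatic)
    merged.update(bows)         # on conflict, the manual correction wins
    return merged

def layer_status(bows: Dict[int, str], manually_checked: bool) -> str:
    """Classify a layer as gold, silver or bronze (disjoint classes)."""
    if manually_checked:
        return "gold"
    if bows:                    # at least one BoW
        return "silver"
    return "bronze"

automatic = {0: "CON", 1: "CON"}   # model output for the semtag layer
bows = {0: "ROL"}                  # an annotator corrects token 0
final = apply_bows(automatic, bows)
status = layer_status(bows, manually_checked=False)
```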

In addition to adding BoWs in general, we also use annotations to improve the models in a more targeted way, by focusing on annotation conflicts. Annotation conflicts arise when a certain annotation layer for a document has been manually checked and marked 'gold'. When the automatic annotation of such a layer changes, e.g., after retraining a model, new annotation errors might be introduced, and these are marked as annotation conflicts. The annotation conflicts are then slated for resolution by an expert annotator. This has two main benefits. First, it concentrates human annotation efforts on difficult cases, for which the models' judgements are still in flux, so that the bits of wisdom can steer the model more effectively. Second, by forcing conflicts to be re-judged by a human, we have a chance to correct human errors and inconsistencies and, if necessary, improve the annotation guidelines.
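One possible way to detect annotation conflicts is sketched below. It assumes, as a simplification, that a conflict is any token of a gold layer whose tag differs between the gold annotation and the output of the retrained model; the function name find_conflicts and the dictionary encoding are hypothetical.

```python
from typing import Dict, List

def find_conflicts(gold: Dict[int, str], retrained: Dict[int, str]) -> List[int]:
    """Return the token indices where the retrained model disagrees with gold.

    A token missing from the model output also counts as a disagreement.
    """
    return sorted(
        index for index, tag in gold.items() if retrained.get(index) != tag
    )

gold = {0: "ROL", 1: "NOT", 2: "CON"}        # manually checked layer
retrained = {0: "CON", 1: "NOT", 2: "CON"}   # output after retraining
conflicts = find_conflicts(gold, retrained)  # tokens to show to an expert
```

Either side may be wrong at a flagged token, which is why the conflict is sent back to a human rather than resolved automatically in favour of the gold annotation.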

Conclusion
Our ultimate goal is to provide accurate, language-neutral natural language analysis tools. In the pipeline that we presented in this paper, we have laid the foundation to reach this goal. For every task in the pipeline (tokenization, parsing, semantic tagging, symbolization, semantic interpretation) we have a single component that uses a language-specific model. We proposed new language-neutral tagging schemes to reach this goal (e.g., for tokenization and semantic tagging) and adapted existing formalisms (making CCG more general by introducing lexical categories for empty elements).
Our first results for Dutch show that our method is promising (Evang and Bos, 2016), but we still need to assess how much manual effort is involved in other languages, such as German and Italian. We will also explore the idea of combining CCG parsing with Semantic Role Labelling, following Lewis et al. (2015), and whether we can derive word senses in a data-driven fashion (Kilgarriff, 1997) rather than using WordNet. Furthermore, we will assess whether our cross-lingual projection method yields accurate tools with time and annotation costs lower than would be needed when starting from scratch for a single language.
The annotated data of the PMB is now publicly accessible through a web interface. Stable releases will be made available for download periodically.

Figure 1: Annotation pipeline of the PMB. Manual corrections can be added at each annotation layer.
Dutch one and 42% an Italian one. 9% have German and Dutch, 6% have Dutch and Italian and 18% have Italian and German. 5% exist in all four languages.

Table 1: Number of gold, silver and bronze documents per layer and language, as of 13-02-2017.