Mapping the PERFECT via Translation Mining

Semantic analyses of the PERFECT often defeat their own purpose: by restricting their attention to ‘real’ perfects (like the English one), they implicitly assume the PERFECT has predefined meanings and usages. We turn the tables and focus on form, using data extracted from multilingual parallel corpora to automatically generate semantic maps (Haspelmath, 1997) of the sequence ‘HAVE/BE + past participle’ in five European languages (German, English, Spanish, French, Dutch). This technique, which we dub Translation Mining, has been applied before in the lexical domain (Wälchli and Cysouw, 2012) but we showcase its application at the level of the grammar.


Introduction
The PERFECT is a diachronically and linguistically unstable category (Lindstedt, 2000) and is subject to widespread cross-linguistic variation. We zoom in on the HAVE PERFECT, which Dahl and Velupillai (2013) trace back to a transitive possessive construction and which manifests itself mainly in Western European languages. Despite extensive literature on the PERFECT, the goal of providing a proper semantics has not been reached (Ritz, 2012).
We propose to use semantic maps (Haspelmath, 1997) for this purpose. Semantic maps are geographical layouts that graphically represent how meanings of grammatical functions are related to each other. While current formal semantic approaches to the PERFECT (e.g. Portner (2003)) are driven by sets of predefined usages exemplified by prototypical instantiations, we aim to generate semantic maps directly from data.
We believe multilingual parallel corpora are an excellent source for this. Translation equivalents provide us with form variation across languages in contexts where the meaning is stable. Parallel corpora have been frequently used in the domain of lexical semantics (e.g. Dyvik (1998)). We showcase a method (adapted from Wälchli and Cysouw (2012)) to create semantic maps directly from multilingual parallel corpora, and adapt it to the level of grammar. We focus on a set of five European languages (German, English, Spanish, French, Dutch), although the methodology can easily be adapted to include more languages.
Linguists commonly distinguish the three core PERFECT meanings in (1):
(1) a. Mary has visited Paris. (her past visit is relevant now) [experiential]
    b. Mary has moved to Paris. (she currently lives in Paris) [resultative]
    c. Mary has lived in Paris for five years (now). (she moved there five years ago) [continuative]
The resultative meaning in (1b) is thought to constitute the core of the PERFECT. However, (2) (taken from the subtitles of "Body of Proof") shows that the same meaning of a past event and a result with current relevance can be conveyed by a PAST, PERFECT or PRESENT.
(2) a. In case you hadn't noticed, we just got a confession.
Taking (1) as a starting point for cross-linguistic variation and ignoring other tense-aspect forms (as in (2)) would lead to a skewed view of variation and of the PERFECT itself. As Ritz (2012) states, the PERFECT is the 'shapeshifter' of tense-aspect categories, and adapts its meaning to fit into a given system. Our final goal is to provide a compositional semantics of the PERFECT across languages that takes the variation in (1) and (2) into account. The competing, form-based methodology that we outline in the next section constitutes the stepping stone that enables us to reach this goal.

Methodology
To construct semantic maps directly from data extracted from multilingual parallel corpora, we apply an existing method in the lexical domain (Wälchli and Cysouw, 2012) at the level of grammar. We dub our method Translation Mining. In the following paragraphs, we lay out the method in detail.

Step 1) Extraction of PERFECTs
In the first phase, we extract fragments containing verb phrases that match the 'HAVE/BE + past participle' pattern from the EuroParl corpus (Tiedemann, 2012). To do so, we modify an existing algorithm by van der Klis et al. (2015), which handles three difficulties in extracting these forms from corpora: (1) words between the auxiliary verb and the past participle, (2) lexical restrictions for BE in French, German and Dutch, and (3) a reversed order in subordinate clauses in German and Dutch.
The algorithm searches each of the five languages under investigation (German, English, Spanish, French and Dutch) for PERFECTs and then returns the aligned sentences in the other languages. This yields five-tuples of fragments containing at least one PERFECT. Note that this approach is necessary to find the triplet in (2), because only the German fragment involves a PERFECT. This scheme therefore allows competing forms within a language to enter the realm of investigation. Also, taking five languages into account creates a broader perspective on the semantics of the PERFECT than monolingual research would. 1
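The three difficulties listed above can be illustrated with a minimal sketch over part-of-speech-tagged tokens. This is not the algorithm of van der Klis et al. (2015): the `Token` type, the tag names `AUX` and `PART`, the `max_gap` window and the example sentences are all assumptions made for illustration. The lexical restriction on BE is read here as "only a closed set of verbs forms its perfect with BE".

```python
from typing import List, NamedTuple, Optional, Set, Tuple


class Token(NamedTuple):
    word: str
    lemma: str
    pos: str  # hypothetical tags: "AUX", "PART" (past participle), "X" (other)


def find_perfects(
    tokens: List[Token],
    have_lemma: str,
    be_lemma: Optional[str] = None,
    be_verbs: Optional[Set[str]] = None,
    max_gap: int = 3,
    allow_reversed: bool = False,
) -> List[Tuple[int, int]]:
    """Return (auxiliary index, participle index) pairs for HAVE/BE + participle."""
    be_verbs = be_verbs or set()
    matches = []
    for i, aux in enumerate(tokens):
        if aux.pos != "AUX" or aux.lemma not in (have_lemma, be_lemma):
            continue
        # (1) Look ahead, tolerating up to max_gap words before the participle.
        candidates = list(range(i + 1, min(len(tokens), i + 2 + max_gap)))
        # (3) Optionally look back, for participle-before-auxiliary order.
        if allow_reversed:
            candidates += list(range(i - 1, max(-1, i - 2 - max_gap), -1))
        for j in candidates:
            if tokens[j].pos != "PART":
                continue
            # (2) BE only counts as a perfect auxiliary with listed verbs.
            if aux.lemma != be_lemma or tokens[j].lemma in be_verbs:
                matches.append((i, j))
            break  # only the first participle found is considered
    return matches


# Hypothetical example sentences, tagged by hand.
english = [Token("Mary", "Mary", "X"), Token("has", "have", "AUX"),
           Token("recently", "recently", "X"), Token("moved", "move", "PART"),
           Token("to", "to", "X"), Token("Paris", "Paris", "X")]
dutch_subordinate = [Token("dat", "dat", "X"), Token("zij", "zij", "X"),
                     Token("naar", "naar", "X"), Token("Parijs", "Parijs", "X"),
                     Token("verhuisd", "verhuizen", "PART"), Token("is", "zijn", "AUX")]
dutch_passive = [Token("Het", "het", "X"), Token("boek", "boek", "X"),
                 Token("is", "zijn", "AUX"), Token("gelezen", "lezen", "PART")]

english_hits = find_perfects(english, have_lemma="have")
dutch_hits = find_perfects(dutch_subordinate, "hebben", "zijn", {"verhuizen"},
                           allow_reversed=True)
passive_hits = find_perfects(dutch_passive, "hebben", "zijn", {"verhuizen"})
```

In this sketch the English sentence matches across an intervening adverb, the Dutch subordinate clause matches only when the reversed order is allowed, and the Dutch passive is rejected because its participle is not in the (hypothetical) list of BE verbs.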

Step 2) Word-level alignment of verb phrases
After extracting fragments containing a PERFECT in step 1, we asked a single human annotator (a BSc student proficient in all languages under investigation) to mark the corresponding verb phrases in the aligned fragments. To facilitate the annotator's job, we created a web application (dubbed TimeAlign) that shows two aligned fragments (a "source" and a "translation") and lets the user mark the corresponding verb phrase in the target language. 2 The annotator can also signal when the target fragment is not a correct translation of the source, or when the verb phrase in the source is in fact not a PERFECT (see Figure 1). Fragments without a PERFECT in the source and incorrect translations are removed from the dataset. The remaining pairs are merged back into five-tuples.
Figure 1: The annotation interface used in step 2. The annotator can select (by clicking on words) a suitable translation for the marked words in the source fragment, or use the checkboxes to mark the source as not being a PERFECT or the translated fragment as an incorrect translation of the source fragment.
1 The source code of this algorithm can be found on GitHub: https://github.com/UUDigitalHumanitieslab/perfectextractor.
2 The source code of this application can be found on GitHub: https://github.com/UUDigitalHumanitieslab/timealign. The application has been built in Django, a Python web framework (https://www.djangoproject.com/).
Step 2 thus yields five-tuples of verb phrases, at least one of which (the source) is a PERFECT.
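The filtering and merging in step 2 could look roughly like the sketch below. It is not taken from TimeAlign: the `Annotation` record, its field names and the example verb phrases are hypothetical. Keeping only complete tuples anticipates the filtering described under step 4.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple

LANGUAGES = ("de", "en", "es", "fr", "nl")


class Annotation(NamedTuple):
    """One annotated source/translation pair (hypothetical record layout)."""
    source_id: str
    source_lang: str
    source_vp: str
    target_lang: str
    target_vp: str
    source_is_perfect: bool = True
    translation_ok: bool = True


def merge_into_five_tuples(annotations: List[Annotation]) -> Dict[str, Tuple[str, ...]]:
    by_source: Dict[str, Dict[str, str]] = defaultdict(dict)
    rejected = set()
    for a in annotations:
        if not a.source_is_perfect:
            rejected.add(a.source_id)  # source turned out not to be a PERFECT
            continue
        if not a.translation_ok:
            continue  # drop incorrect translations
        by_source[a.source_id][a.source_lang] = a.source_vp
        by_source[a.source_id][a.target_lang] = a.target_vp
    # Keep only sources for which all five languages are present.
    return {
        sid: tuple(phrases[lang] for lang in LANGUAGES)
        for sid, phrases in by_source.items()
        if sid not in rejected and all(lang in phrases for lang in LANGUAGES)
    }


# Hypothetical annotations for three source fragments.
annotations = [
    Annotation("s1", "de", "ist umgezogen", "en", "has moved"),
    Annotation("s1", "de", "ist umgezogen", "es", "se ha mudado"),
    Annotation("s1", "de", "ist umgezogen", "fr", "a déménagé"),
    Annotation("s1", "de", "ist umgezogen", "nl", "is verhuisd"),
    Annotation("s2", "en", "has said", "de", "sagte"),
    Annotation("s2", "en", "has said", "es", "dijo"),
    Annotation("s2", "en", "has said", "fr", "a dit", translation_ok=False),
    Annotation("s2", "en", "has said", "nl", "heeft gezegd"),
    Annotation("s3", "nl", "is gelezen", "en", "is read", source_is_perfect=False),
]
five_tuples = merge_into_five_tuples(annotations)
```

Only the first source survives here: the second lacks a usable French translation and the third was flagged as not being a PERFECT.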

Step 3) Tense attribution
In the third step, we assign tenses to the verb phrases marked in the translations (see step 2). For the tense labelling, we opted for the categories displayed in Table 1. The tenses are assigned automatically or manually, depending on the level of detail of the part-of-speech tags per language. The tense attribution for English, French and Dutch is straightforward: we used the part-of-speech tagging of the EuroParl corpus to retrieve the label. 3 However, for German and Spanish we opted for manual tense attribution, because the part-of-speech tagging of the auxiliary verbs in EuroParl was too coarse-grained.

Step 4) Dissimilarity matrix
The tense attribution process of step 3 yields five-tuples of aligned tense attributions (see Table 2 for an example outcome). We design a simple distance function over pairs of five-tuples: two five-tuples have distance 0 if all their tense attributions match; otherwise we add 1 for each language in which the attributions differ and divide the sum by 5. We use the distance function on the five-tuples to create a (dis)similarity matrix. Table 3 shows an application of the distance function and the resulting matrix.
      #1    #2    #3
#1    -     2/5   2/5
#2    2/5   -     4/5
#3    2/5   4/5   -
Table 3: Dissimilarity matrix for the example tense attributions in Table 2.
3 The source code of this algorithm is part of TimeAlign, see link above.
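A minimal sketch of this distance function follows. The three example five-tuples are hypothetical, not the attributions of Table 2; they were chosen only so that the resulting matrix has the same values as Table 3. The function names are our own for this illustration.

```python
from typing import List, Sequence


def distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Share of positions (languages) in which two tense tuples differ."""
    if len(a) != len(b):
        raise ValueError("tuples must cover the same languages")
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)


def dissimilarity_matrix(tuples: List[Sequence[str]]) -> List[List[float]]:
    """Pairwise distances between all tuples (symmetric, zero diagonal)."""
    return [[distance(a, b) for b in tuples] for a in tuples]


# Hypothetical tense attributions, in the order DE, EN, ES, FR, NL.
examples = [
    ("PERFECT", "PERFECT", "PERFECT", "PERFECT", "PERFECT"),  # #1
    ("PAST", "PAST", "PERFECT", "PERFECT", "PERFECT"),        # #2
    ("PERFECT", "PERFECT", "PRESENT", "PRESENT", "PERFECT"),  # #3
]
matrix = dissimilarity_matrix(examples)
```

Because every mismatch counts equally, the function treats, for example, a PAST/PERFECT difference and a PRESENT/PERFECT difference as the same.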
We decided to remove five-tuples from the results in which one of the translations was missing or contained a non-verbal translation. We believe including these examples in the current pilot study, with a limited dataset and only five languages in total, would have a negative effect on our analyses. We will address this issue in future research.
Figure 2: Visualization of the dissimilarity matrix via multidimensional scaling. The points are labelled using the tenses of the selected language. Users can also change the dimensions shown. Clicking on a point allows the user to inspect a single five-tuple (example shown in Figure 3).

Step 5) Visualization via multidimensional scaling
The resulting matrix from step 4 is then plotted using multidimensional scaling (MDS). 4 On top of that, we created an interactive visualization (see Figure 2). This visualization shows which space the various tenses (PERFECT and other) occupy on the map, and thus enables researchers to see how tenses interact within a language. The visualization also allows for comparison between languages, because it uses a color labeling that remains constant between languages (e.g. the German Perfekt has the same color as the English present perfect). Furthermore, filtering tenses makes it possible to focus on one specific tense or on the interaction between specific tenses. The researcher can also choose to show other dimensions of the MDS algorithm, which facilitates interpretation. Hovering over a point on the map directly shows the five-tuple the point is based on, and clicking on a point opens a new page in which the underlying data can be inspected (see Figure 3 for an example). 5 Compared to Wälchli and Cysouw (2012), our main contributions in this methodology are (1) the web application to allow for easier annotation and (2) the interactive visualization of the MDS algorithm, which allows researchers to compare PERFECT usage within and across languages more easily, as well as to interpret dimensions.
               DE    EN    ES    FR    NL
PERFECT        360   347   371   481   438
PRESENT        19    18    47    20    20
PAST           124   146   89    8     36
PAST PERFECT   4     1     3     2     18
other          5     -     2     1     -
Table 4: Generic tense counts per language.
4 We use the MDS algorithm from the scikit-learn package (Pedregosa et al., 2011), a Python package for machine learning, and visualized the results using the nvd3 package (http://nvd3.org/).
5 The source code of this visualization is part of TimeAlign, see link above.
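The MDS step can be sketched with scikit-learn as follows, using the three-by-three matrix of Table 3 as input. The parameter values (`n_components=2`, `n_init=4`, `random_state=0`) are assumptions for illustration and not necessarily the settings used in TimeAlign. The `dissimilarity="precomputed"` argument is the spelling we know from established scikit-learn releases; newer releases may rename it.

```python
import numpy as np
from sklearn.manifold import MDS

# Dissimilarity matrix from Table 3 (2/5 = 0.4, 4/5 = 0.8), with a zero diagonal.
D = np.array([
    [0.0, 0.4, 0.4],
    [0.4, 0.0, 0.8],
    [0.4, 0.8, 0.0],
])

# "precomputed" tells MDS that D already holds pairwise dissimilarities,
# rather than feature vectors from which distances must be computed.
mds = MDS(n_components=2, dissimilarity="precomputed", n_init=4, random_state=0)
coords = mds.fit_transform(D)  # one row of map coordinates per five-tuple

for label, (x, y) in zip(("#1", "#2", "#3"), coords):
    print(f"{label}: x={x:.3f}, y={y:.3f}")
```

Raising `n_components` yields additional dimensions, which corresponds to the option of showing other dimensions in the interactive map.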

Preliminary results
In this pilot study we analyzed a small part of the Q4/2000 portion of the EuroParl corpus. 6 Running the Translation Mining methodology on this corpus yielded 512 complete five-tuples in total. We first observe the descriptive statistics in Table 4 that result from mapping the language-specific tense labelling in step 3 to more generic tenses (e.g. simple past to PAST, see Table 1). As is commonly reported in the literature (see de Swart (2007) and references therein), the French passé composé takes responsibility for a wide range of PERFECT uses. In German and English one tends to use PAST for quite a few contexts where French would use the passé composé. In Spanish, the presente also competes with the PAST in this respect.
Figure 3: Detailed view of a five-tuple of fragments. The "source" fragment shows the extracted sentence from step 1 with the PERFECT marked in green. The "translations" are the aligned fragments with manually annotated verb phrases from step 2 and semi-automatically annotated tenses from step 3.
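As a small sanity check on the counts in Table 4, the sketch below sums each language column and computes the share of PERFECT forms per language. It rests on three assumptions: the columns are ordered DE, EN, ES, FR, NL, the PAST row reads 124, 146, 89, 8, 36, and a dash stands for 0.

```python
LANGUAGES = ("DE", "EN", "ES", "FR", "NL")

# Counts as read from Table 4 (assumed column order and PAST row, "-" = 0).
TABLE_4 = {
    "PERFECT":      (360, 347, 371, 481, 438),
    "PRESENT":      (19, 18, 47, 20, 20),
    "PAST":         (124, 146, 89, 8, 36),
    "PAST PERFECT": (4, 1, 3, 2, 18),
    "other":        (5, 0, 2, 1, 0),
}

# Each language column should add up to the number of complete five-tuples.
totals = {
    lang: sum(row[i] for row in TABLE_4.values())
    for i, lang in enumerate(LANGUAGES)
}

# Share of PERFECT forms per language.
perfect_share = {
    lang: TABLE_4["PERFECT"][i] / totals[lang]
    for i, lang in enumerate(LANGUAGES)
}

for lang in LANGUAGES:
    print(f"{lang}: total={totals[lang]}, PERFECT share={perfect_share[lang]:.1%}")
```

Under these assumptions every column sums to 512, matching the number of complete five-tuples, and French has the highest share of PERFECT forms.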
Moving from descriptive statistics to the MDS visualization, we look at dimensions governing the competition between languages. The German data (depicted in Figure 2) are clearest in this respect: the x-axis shows a transition from PERFECT to unmarked (aspectual perspective), and the y-axis from PRESENT to PAST (temporal orientation). However, this attribution does not carry over so easily to the other languages, even though in each language we do find clear clusters of PERFECT use.
In the visualization, we can also look at outliers to find cases where one language is different from the other languages. We can confirm e.g. that English requires a PAST with a locating time adverbial, whereas German, Dutch and French tolerate a PERFECT in this configuration. Spanish patterns with English (see Schaden (2009)) in this respect. An example of this phenomenon can be found in (3) below.
(3)
Another interesting outlier is the RECENT PAST, available for French and Spanish. This periphrastic tense signals recency and is expressed in German, English and Dutch through the use of a PERFECT combined with an additional time adverbial: gerade, just, kortgeleden respectively, see (4) below. A tentative conclusion could be that the RECENT PAST is a dimension of the PAST or of the PERFECT, but in both cases this recency requires additional marking.

Discussion
The interactive maps allowed us to reproduce earlier research (e.g. de Swart (2007), Schaden (2009)), but also to draw new conclusions on the tense/aspect role of the PERFECT across languages. Our methodology can be applied to a wide range of grammatical phenomena. There are some remaining issues though. First of all, interpreting the results of the MDS algorithm is more qualitative than quantitative. While the visualization helps researchers to form ideas on the role of the PERFECT, these intuitions will need to be supported by statistics. We are currently looking into applying Analysis of Similarities (ANOSIM, Clarke (1993)) on the (dis)similarity matrices to pair this with the MDS visualization.
A second limitation is that the EuroParl corpus contains only political dialogue, and therefore might not cover the whole range of PERFECT use. We should also check for register variation. Our plan is to repeat our methodology on the OpenSubtitles2016 corpus (Lison and Tiedemann, 2016), as well as to find (or create) a multilingual parallel corpus of literary texts.
Lastly, we think the distance function we now use might be too simplistic. It considers all tense differences to be equal, even though it is quite clear that e.g. a PRESENT is semantically more distant from a PAST PERFECT than a PERFECT. Also, there is no cross-language comparison. We plan to experiment with the distance function to fine-tune our results.