Semantically Inspired AMR Alignment for the Portuguese Language

Abstract Meaning Representation (AMR) is a graph-based semantic formalism where the nodes are concepts and the edges are relations among them. Most AMR parsing methods require an alignment between the nodes of the graph and the words of the sentence. However, this alignment is not provided by manual annotations, and the available automatic aligners focus only on the English language, performing poorly for other languages. Aiming to fill this gap, we developed an alignment method for the Portuguese language based on more semantically matched word-concept pairs. We performed both intrinsic and extrinsic evaluations and showed that our alignment approach outperforms the alignment strategies developed for English, improving AMR parsers and achieving competitive results with a parser designed for the Portuguese language.


Introduction
According to Banarescu et al. (2013), Abstract Meaning Representation (AMR) is a semantic meaning representation, which may be encoded as a rooted Directed Acyclic Graph (DAG) where the nodes are concepts and the edges are relations among them. This representation explicitly details semantic information, as depicted in Figure 1. In this figure, the live-01 node is the root of the graph and the city node introduces a named entity. Moreover, :ARGx relations are predicates from the PropBank lexicon (Kingsbury and Palmer, 2002), which encode semantic information according to each PropBank sense.
To parse a text into an AMR graph, most AMR parsers require an alignment between the words (tokens) of the sentence and the nodes of the corresponding graph (see, for instance, Flanigan et al., 2014; Wang et al., 2015; Zhou et al., 2016; Damonte et al., 2017). However, this anchoring is not provided by manual annotations. Also, the available automatic aligners focus only on the English language (Pourdamghani et al., 2014; Flanigan et al., 2014; Liu et al., 2018), and they do not perform well for other languages. For the Portuguese sentence "Não era surpresa para mim" (It was no surprise to me), for instance, the JAMR aligner (Flanigan et al., 2014) produces an alignment only between the token surpresa (surprise) and the node surpresa, as shown in Figure 2.
JAMR aligned only the span 2-3, which is the token surpresa (surprise), with node 0, which is the root of the graph; the - and eu nodes were not aligned. These wrong or missing alignments occur because the JAMR aligner adopts a string-match strategy that is focused on the English language. Besides, these issues contribute to a decrease in the performance of AMR parsers. As a result, recent AMR parsing methods have focused on alignment-free approaches (Lyu and Titov, 2018; Zhang et al., 2019a,b). However, they require a large annotated corpus, which is available only for English.
In this context, aiming to bridge this lack of resources and tools for other languages, we propose an AMR aligner for Portuguese that focuses on more semantically matched word-concept pairs. For that, we use pre-trained word embeddings and the Word Mover's Distance (WMD) function (Kusner et al., 2015) to match span tokens in the sentence with nodes in the graph. Word embeddings capture semantic information about a corpus, and WMD measures the dissimilarity between two documents even when they have no words in common. With this, it is possible to produce semantically inspired matches instead of only string matches.
To evaluate our approach, we carried out both intrinsic and extrinsic experiments on an annotated corpus for Portuguese. Our aligner produced better alignments than the alignment strategies proposed for English and improved AMR parsing for Portuguese, reaching competitive results with an AMR parser designed for that language.
The remainder of this paper is organized as follows. In Section 2, we briefly introduce the related work. Section 3 describes our proposed aligner. In Section 4, we report our experiments, evaluations, and results. Finally, Section 5 concludes the paper, indicating future research.


Related Work
Flanigan et al. (2014) developed the first AMR aligner, named JAMR. The authors created a rule-based aligner with fourteen heuristic rules to greedily align concepts in the nodes of the graph with tokens in the sentence. The alignment format is a space-separated list of spans with their graph fragments, where a descriptor specifies each node (e.g., (Gorn, 1965)): 0 for the root node, 0.0 for the first child of the root node, 0.1 for the second child of the root node, and so forth. For example, for the sentence "The boy wants to go.", JAMR generates the alignments shown in Figure 3: the spans 2-3, 4-5, and 1-2 (which refer to wants, go, and boy, respectively) are aligned with the nodes 0, 0.1, and 0.0 (the root of the graph, the second child of the root, and the first child of the root, respectively).
Pourdamghani et al. (2014) adopted an unsupervised word alignment technique with machine learning. The authors followed a syntax-based Statistical Machine Translation (SMT) approach based on the IBM word alignment model (Brown et al., 1993) to align linearized AMR graphs with English sentences. For the sentence "The boy wants to go", this approach produces alignments as shown in Figure 4, where '∼ n' specifies a link to the nth English word. As we can see, the third token (wants) was aligned with the concept want-01, the second token (boy) with the concept boy, and the fifth token (go) with the concept go-01.
Figure 4: Alignments produced for the sentence "The boy wants to go".
Liu et al. (2018) extended and improved the JAMR aligner by adding semantic resources to the rules, such as GloVe embeddings (Pennington et al., 2014) and the Morphosemantic database (Fellbaum et al., 2009). Besides, they noted that the JAMR aligner requires words to share a common prefix of at least four characters, omitting the shorter cases (such as the word actions aligned with the concept act-01). Their method improved the JAMR aligner by 4.6% f-score on the LDC2014T2 corpus. The authors also showed that their aligner improved the JAMR (Flanigan et al., 2014) and CAMR (Wang et al., 2015) parsers.

Our Aligner
In order to properly adapt AMR parsers from English to Portuguese, we developed an alignment strategy based on document similarity for the Portuguese language. Our method produces alignments in the JAMR aligner format, since most AMR parsers adopt this alignment type. To support our method, we used the pre-trained GloVe 1 embeddings of 100 dimensions for the Portuguese language (Hartmann et al., 2017) and some lexical resources. We organized our method into three phases over the input annotated sentences: preprocessing, mapping, and aligning.
In the first step, we tokenize the sentences and lemmatize each token, applying the Stanza tool (Qi et al., 2020) trained for Portuguese. Portuguese tokenization is slightly different from English. For example, some hyphenated words, such as "via-me" and "ouvi-la" (translated as "saw me" and "hear her"), should be separated at the hyphen, whereas other words, such as "segunda-feira" and "recém-casados" (translated as "Monday" and "newly married"), should not. To detail the next steps, we will use Figure 5 as an example. In the next step, we map each concept to its respective position in the graph. One can see that we mapped the contrast-01 concept to the root of the graph (0), its child responder-01 to 0.0, and its children - and person to their respective positions 0.0.0 and 0.0.1. To do this, we used the Penman tool (Goodman, 2020).
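This mapping step can be sketched as a walk over the parsed graph. The sketch below assumes the graph has already been parsed into the nested (variable, branches) tuples that the Penman tool produces, and uses the "The boy wants to go" example from Section 2 rather than the aligner's actual implementation:

```python
def concept_positions(node, address="0", positions=None):
    """Map every concept (and leaf) of an AMR tree to its position
    descriptor: 0 is the root, 0.0 its first child, and so forth."""
    if positions is None:
        positions = {}
    _var, branches = node
    child = 0
    for role, target in branches:
        if role == "/":                  # instance branch: the concept label
            positions[address] = target
        elif isinstance(target, tuple):  # nested node: recurse with a new address
            concept_positions(target, f"{address}.{child}", positions)
            child += 1
        else:                            # reentrant variable or constant leaf
            positions[f"{address}.{child}"] = target
            child += 1
    return positions

# Parsed form of "(w / want-01 :ARG0 (b / boy) :ARG1 (g / go-01 :ARG0 b))"
tree = ("w", [("/", "want-01"),
              (":ARG0", ("b", [("/", "boy")])),
              (":ARG1", ("g", [("/", "go-01"), (":ARG0", "b")]))])
print(concept_positions(tree))
# → {'0': 'want-01', '0.0': 'boy', '0.1': 'go-01', '0.1.0': 'b'}
```

The addresses follow the descriptor scheme of the JAMR alignment format, so the mapping can be consumed directly by the aligning phase.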
In the last step, we align the word tokens of the sentence with the concepts of the graph. The AMR language has two concept types: concrete and abstract (or special keyword) ones. The former are explicitly present in the sentence, while the latter are not. In Figure 5, we can see that responder-01 is a concrete concept, since it is in the sentence, while contrast-01, person, and name are abstract 2 .
To align concrete concepts, we use the Word Mover's Distance (WMD) 3 (Kusner et al., 2015) and the pre-trained GloVe embeddings of 100 dimensions. WMD is a distance function where a lower distance value indicates a higher similarity between the documents. It measures the minimum amount of distance that the embedded words of one document need to "travel" to reach the embedded words of another document.
We used this distance function to compute the distance between the embedded word tokens in the sentence and the embedded concepts in the graph, producing alignments with more semantic information than string matching. Furthermore, we empirically defined a maximum distance (threshold) of 1.5 to match a token with a concept, i.e., our strategy aligns a word with a concept only if the distance between them is less than the defined threshold. Figure 6 illustrates this strategy to align the words of the sentence with the concrete concepts of the graph, where W = {w_1, ..., w_n} is the set of words of a sentence and C = {c_1, ..., c_n} is the set of concrete concepts of the graph. Our method aligns a w_i with a c_j if and only if the WMD value between w_i and c_j is lower than 1.5 and that value is the lowest among the other distance values.
To align abstract concepts, we use some lexical resources (lists of words) to aid the alignment and obtain a higher recall. At this time, the AMR formalism has 44 abstract concepts 4 and 110 concepts that represent named entities 5 . For instance, in Figure 5, person is a concept that introduces a named entity and, as this concept is in the resource, our alignment strategy aligns it and its children with the Pedro (Peter) token, which is the span 1-2, generating the alignment 1-2|0.0.1+0.0.1.0+0.0.1.0.0, which means that the span 1-2 is aligned with the concept person (0.0.1) plus name (0.0.1.0) plus "Pedro" (0.0.1.0.0).
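A sketch of this named-entity case, using the Figure 5 positions described above (the list of named-entity concepts and the function names are illustrative, not the aligner's actual code):

```python
# Sample entries from a lexical resource of named-entity concepts
# (the real list has 110 entries).
NE_CONCEPTS = {"person", "city", "country", "organization"}

def subtree_positions(pos, concept_at):
    """Collect pos and every descendant position, i.e. the
    named-entity concept plus its :name node and literals."""
    return sorted(p for p in concept_at
                  if p == pos or p.startswith(pos + "."))

def align_named_entity(span, concept, pos, concept_at):
    """Produce a JAMR-format entry 'start-end|pos+pos+...' joining
    the whole named-entity subtree to one token span."""
    if concept not in NE_CONCEPTS:
        return None
    start, end = span
    return f"{start}-{end}|" + "+".join(subtree_positions(pos, concept_at))

# Concept-to-position mapping for the Figure 5 example
concept_at = {"0": "contrast-01", "0.0": "responder-01",
              "0.0.0": "-", "0.0.1": "person",
              "0.0.1.0": "name", "0.0.1.0.0": "Pedro"}
print(align_named_entity((1, 2), "person", "0.0.1", concept_at))
# → 1-2|0.0.1+0.0.1.0+0.0.1.0.0
```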
In addition to named entities and abstract concepts, AMR concepts encompass contrastive conjunctions and negations. To align these concepts, we also created two more lexical resources 6,7 (lists of words). As with abstract concepts and named entities, since the words Mas (But) and não (not) are in the resources, our alignment method aligns the spans 0-1 and 2-3, which are the words Mas (But) and não (not), respectively, with nodes 0 and 0.0.0, which are contrast-01 and -, respectively (see Figure 5). Our alignment tool is available at http://github.com/rafaelanchieta/amr-aligner. In what follows, we detail our experiments with the aligner and the obtained results.
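The lookup for contrastive conjunctions and negations described above can be sketched as follows (the word lists show only sample entries, and the mapping argument mirrors the Figure 5 positions; both are illustrative):

```python
# Toy lexical resources for special concepts (sample entries only).
CONTRAST_WORDS = {"mas", "porém", "contudo"}
NEGATION_WORDS = {"não", "nunca", "jamais"}

def align_special(tokens, node_positions):
    """node_positions maps a concept label to its graph position,
    e.g. {'contrast-01': '0', '-': '0.0.0'} for Figure 5.
    Returns JAMR-format span|position entries."""
    alignments = []
    for i, tok in enumerate(tokens):
        low = tok.lower()
        if low in CONTRAST_WORDS and "contrast-01" in node_positions:
            alignments.append(f"{i}-{i + 1}|{node_positions['contrast-01']}")
        elif low in NEGATION_WORDS and "-" in node_positions:
            alignments.append(f"{i}-{i + 1}|{node_positions['-']}")
    return alignments

print(align_special(["Mas", "Pedro", "não", "respondeu"],
                    {"contrast-01": "0", "-": "0.0.0"}))
# → ['0-1|0', '2-3|0.0.0']
```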

Experiments and Results
We performed two experiments, one intrinsic and the other extrinsic. In the first, we randomly chose one hundred sentences with their respective AMRs from the Little Prince corpus (Anchiêta and Pardo, 2018a) and manually aligned them. Then, we compared the manual alignment with the alignments produced by Flanigan et al. (2014). As we can see, our aligner outperformed those developed for English, which means that our alignment strategy produced alignments more consistent with those manually produced. To obtain these values, we followed the evaluation method of Flanigan et al. (2014).
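Assuming the evaluation treats gold and automatic alignments as sets of (span, node) pairs, the precision, recall, and f-score computation can be sketched as:

```python
def alignment_f1(gold, predicted):
    """Precision, recall, and f-score over alignment pairs, each
    pair being (token span, node position)."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Hypothetical alignments for "The boy wants to go."
gold = {("2-3", "0"), ("1-2", "0.0"), ("4-5", "0.1")}
pred = {("2-3", "0"), ("4-5", "0.1")}
print(alignment_f1(gold, pred))  # perfect precision, one missed pair
```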
In order to confirm the intrinsic evaluation results, we performed an extrinsic evaluation. Thus, we adapted the AMR parsers of Damonte et al. (2017) (henceforth, we refer to it as AMREager) and Wang et al. (2015) (henceforth, we refer to it as CAMR) for the Portuguese language. We chose these parsers because they use alignments, are open source, need only minor modifications for reuse with other languages, and have a good performance on small corpora.
We trained these parsers on The Little Prince corpus of the Portuguese language (Anchiêta and Pardo, 2018a), which contains 1,274, 145, and 143 sentences for training, development, and testing, respectively. To compare the results of the parsers, we used the traditional Smatch metric (Cai and Knight, 2013) and the more recently proposed SEMA metric (Anchiêta et al., 2019). SEMA is a more robust metric that considers the parent of the nodes in the graph, producing fairer results than Smatch. The results show that our aligner improved the adapted AMR parsers for Portuguese in both metrics, confirming the results of the intrinsic evaluation. Moreover, the CAMR parser achieved a competitive result (50% f-score) compared to the RBAMR parser (53% f-score over the same corpus) (Anchiêta and Pardo, 2018b), a rule-based AMR parser designed for the Portuguese language.
We also performed a fine-grained error analysis to identify the weaknesses of our aligner. For that, we used the evaluation tool of Damonte et al. (2017) to compare the CAMR parser, as it achieved the best results, with the best aligners: JAMR, TAMR, and ours. We present the results in Table 3. We can see that the CAMR parser with our aligner outperformed the other aligners in most metrics.
The models tied on the Wiki metric because the corpus has no wiki annotation. On the NER metric, the TAMR aligner performed better than the other aligners, due to its specific rules for aligning named entities and its use of the Morphosemantic database. Besides, our aligner achieved an f-score of only 0.07 for reentrancies, as it is not prepared to align them. One solution could be to model reentrancies as a tree, following Zhang et al. (2019a).
Another issue is the occurrence of hidden subjects in Portuguese, i.e., sentences where the subject 'I' is in the graph but not in the sentence (it is implicit). Our aligner also ignores this phenomenon. One solution could be to apply a preprocessing step to identify and include the hidden subjects in the sentences.
Treating these issues remains for future work. We also intend to investigate the Morphosemantic database, aiming to improve the accuracy of named entity alignment.

Conclusion
In this paper, we presented an AMR alignment method designed for the Portuguese language. It is based on pre-trained word embeddings and the Word Mover's Distance to match word tokens in the sentences with nodes in the corresponding AMR graphs. This simple approach may be adopted for other languages with few resources, aiming to provide tools for natural language understanding tasks. Furthermore, this aligner may help to build or enlarge semantic resources through promising approaches such as back-translation (Sobrevilla Cabezudo et al.). Future work includes adopting multilingual word embeddings (Lample et al., 2018) to produce alignments for other languages. More details about AMR resources and tools for the Portuguese language may be found on the OPINANDO project webpage 8 .