Translating Italian to LIS in the Rail Stations

This paper presents an ongoing project about the symbolic translation from Italian to Italian Signed Language (LIS) in the rail stations domain. We describe some technical issues in the generation side of the translation, i


Introduction
Several commercial and research projects use avatars for automatic translation into signed languages (SLs), and most of these projects investigate relatively small domains in which translation may perform quite well. Among them: post office announcements (Cox et al., 2002), weather forecasting (Verlinden et al., 2001), driver's license renewal (San-Segundo et al., 2012), and train announcements (Segouat and Braffort, 2009; Ebling, 2013). However, SLs still pose many challenges related to their specific linguistic features (e.g. no function words and articles) as well as to their specific communication channels (e.g. the characteristic use of space).
LIS4ALL is a project for the automatic translation from Italian to LIS in the Italian rail stations domain. The domain is completely specified by the Rete Ferroviaria Italiana (RFI), which produced the manual MAS (Manuale degli Annunci Sonori), which describes the details of each specific message (RFI, 2011). MAS specifies 39 classes: 13 for arriving trains, 15 for leaving trains, 11 for special situations (e.g. strikes). The classes have been designed by a group of linguists to produce concise Italian messages. Full relative clauses, coordination and complex structures are avoided. As a consequence, the Italian rail stations domain is a controlled plain language. Note that the vocal rail station messages are produced in real time by using text-to-speech, and the textual input message is produced by proprietary closed-source software that uses raw data extracted from a database. In the project we had access only to the textual messages and not to the raw data. As a consequence, LIS4ALL concerns automatic translation with NLG rather than purely NLG.
An initial study of the domain (5014 messages from 24 hours of messages produced at the Torino Porta Nuova Station) has shown that four classes account for ∼85% of total messages: these are A1: simple arrive, P1: simple leave, A2: arrive on a different rail, A3: delayed arrive. In this paper we discuss a symbolic translator designed to account for these four classes of messages.

Figure 1: The architecture of the Italian to LIS translator.

Figure 1 illustrates the current architecture, that is, a pipeline which includes five modules: (1) a regular expression parser for Italian; (2) a filler/slot based semantic interpreter; (3) a generator; (4) a spatial planner; (5) an avatar that performs the synthesis of the sequence of signs. On the basis of this architecture, our translator can be classified as a semantic transfer based system (Hutchins and Somers, 1992). Indeed, since the source language is controlled, the translation is a deterministic process whose challenges all lie in the target language, and thus in the generation process.
The LIS4ALL translator adopts the architecture designed in the ATLAS project with two essential changes. The first change is related to the analysis of the Italian sentences: the domain corpus shows that typical railway station messages have several prepositional phrases, which pose hard problems to conventional natural language parsers. As a consequence, we developed a domain-specific regular expression parser. The second change is related to the semantic representation. In ATLAS a logic semantics was adopted, which was the input of an expert system micro-planner. In contrast, in LIS4ALL we adopt a simpler non-recursive filler/slot semantics. For parsing, we have built four regular expressions corresponding to the four most frequent classes in the domain corpus. For each class, we designed a set of semantic slots that can be filled by domain lexical items (time, rail number, station names, etc.). So, the role of the semantic interpreter is just to extract the semantics of these Italian domain lexical items, and to convert them into a format that can be realized with LIS domain lexical items. The translation consists essentially of converting the filler/slot semantics produced by the semantic interpreter into a logic format that can be exploited in generation. Table 1 reports an Italian message, its translation into LIS, and the filler/slot semantics produced by the semantic interpreter.
In Section 2 we give some details about the transfer process that transforms the filler/slot semantics into hybrid logic semantics; in Section 3 we describe the implementation in a Combinatory Categorial Grammar (CCG) of a number of specific LIS linguistic phenomena. Finally, Section 4 closes the paper with some considerations.

:numero "10220"
:impresa_ferroviaria "trainitaly"
:categoria "REGIONAL"
:località_di_provenienza "cuneo"
:ampm "morning"
:ora_arrivo "05:35"
:hh "5"
:mm "35"
:tempo_ritardo "10"

Table 1: An ITA/LIS sentence of the class A3 and its filler/slot semantics. GLOSSES are used to denote LIS signs. The underlined texts correspond to variable lexical items. Rough translation: The regional train 10220 of Trenitalia arriving at 05:35am from Cuneo, will arrive with an expected delay of 10 minutes.
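The parsing and interpretation steps for class A3 can be illustrated with a minimal Python sketch. It is not the project's code: the Italian wording of the message, the regular expression, the two normalization dictionaries and the "afternoon" value are assumptions made for illustration. Only the slot names and the example values come from Table 1.

```python
import re

# Hypothetical A3 message; the exact MAS wording is an assumption.
MESSAGE = ("Il treno regionale 10220 di Trenitalia, proveniente da Cuneo, "
           "delle ore 05:35, arriverà con un ritardo previsto di 10 minuti.")

# Hypothetical regular expression for class A3 (delayed arrival).
A3_PATTERN = re.compile(
    r"Il treno (?P<categoria>\w+) (?P<numero>\d+) di (?P<impresa>\w+), "
    r"proveniente da (?P<provenienza>[\w' ]+), "
    r"delle ore (?P<hh>\d{2}):(?P<mm>\d{2}), "
    r"arriverà con un ritardo previsto di (?P<ritardo>\d+) minuti\.",
    re.IGNORECASE,
)

# Hypothetical normalization tables for domain lexical items.
COMPANIES = {"trenitalia": "trainitaly"}
CATEGORIES = {"regionale": "REGIONAL"}


def interpret_a3(message):
    """Return the filler/slot semantics of an A3 message, or None if no match."""
    match = A3_PATTERN.fullmatch(message.strip())
    if match is None:
        return None
    hh, mm = int(match["hh"]), int(match["mm"])
    company = match["impresa"].lower()
    category = match["categoria"].lower()
    return {
        "numero": match["numero"],
        "impresa_ferroviaria": COMPANIES.get(company, company),
        "categoria": CATEGORIES.get(category, category.upper()),
        "località_di_provenienza": match["provenienza"].lower(),
        # "afternoon" is an assumed value; Table 1 only shows "morning".
        "ampm": "morning" if hh < 12 else "afternoon",
        "ora_arrivo": f"{hh:02d}:{mm:02d}",
        "hh": str(hh),
        "mm": str(mm),
        "tempo_ritardo": match["ritardo"],
    }


slots = interpret_a3(MESSAGE)
```

Because the structure is flat and non-recursive, each slot can be filled independently of the others, which is what makes a regular expression sufficient for this controlled language.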

Microplanning with XML transformations
Previous work on the symbolic translation of SL in rail stations domains adopted "video templates" for the generation side (Segouat and Braffort, 2009; Ebling, 2013). In contrast, our generator is more complex and adopts the standard NLG pipeline architecture (Reiter and Dale, 2000). The generator is composed of two elements: a template-based microplanner and the OpenCCG realizer.
Following Foster and White (2004), we implemented a transformation-based microplanner that is able to exploit the filler/slot structure produced by the semantic interpreter. The main idea is to recursively rewrite the semantic elements by using a number of predefined XML templates. Four templates are used at the first stage to specify the main structures of the sentence plans, while seven templates are used at the second stage to specify the structures of a number of specific linguistic constructions, e.g. the rail number.
For the implementation of the microplanner we exploited the bidirectional nature of OpenCCG by adopting a bottom-up approach to build the XML templates. For each MAS class we chose an Italian sentence belonging to the class and we produced, with the help of a bilingual signer, the LIS translation of the sentence. In Table 1 we report the translation of a sentence belonging to the class A3 (delayed arrive).
Starting from the Italian/LIS translation of the sentence, we have followed four steps:
1. We have implemented the fragment of the grammar necessary to realize/parse the LIS sentence, i.e. to account for the linguistic phenomena contained in the sentence (see Section 3).
2. We have obtained the hybrid logic formula expressing the linguistic meaning of the sentence by parsing the sentence.
3. We have modified the XML file containing the hybrid logic formula by introducing a number of holes. Each hole, implemented as an XML attribute, corresponds to a LIS lexical item. For instance, in the XML fragment <diamond mode="SYN-NOUN-RMOD"> <nom name="n4:number"/> <prop id="delay-amount" name="10"/> </diamond>, which is the linguistic meaning of the number 10 and corresponds to the delay amount, we have introduced the hole "delay-amount". In this way, the XML processor will be able to rewrite the exact delay for all the sentences of the class A3.
4. We have designed a number of XML transformations in order to convert the filler/slot semantics produced by the interpreter to the corresponding logical formula. A single filler/slot semantic element will substitute the XML fragment corresponding to a single hole in the final logical formula.
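The hole-filling idea of steps 3 and 4 can be sketched as follows. The XML fragment and the hole name "delay-amount" are taken from the text; everything else is an assumption for illustration, including the use of Python's xml.etree.ElementTree in place of the template machinery actually used in the project, and the HOLE_TO_SLOT mapping.

```python
import xml.etree.ElementTree as ET

# XML fragment from the text; the hole is the id attribute of <prop>.
TEMPLATE = """
<diamond mode="SYN-NOUN-RMOD">
  <nom name="n4:number"/>
  <prop id="delay-amount" name="10"/>
</diamond>
"""

# Hypothetical mapping from hole names to filler/slot names.
HOLE_TO_SLOT = {"delay-amount": "tempo_ritardo"}


def fill_holes(template_xml, slots):
    """Return the template tree with every known hole rewritten from the slots."""
    root = ET.fromstring(template_xml.strip())
    for element in root.iter():
        hole = element.get("id")
        if hole in HOLE_TO_SLOT:
            # Overwrite the placeholder value with the slot filler.
            element.set("name", slots[HOLE_TO_SLOT[hole]])
    return root


filled = fill_holes(TEMPLATE, {"tempo_ritardo": "25"})
print(ET.tostring(filled, encoding="unicode"))
```

Elements without an id attribute, such as the nominal n4:number, are left untouched, so the grammar-derived part of the formula stays exactly as the parser produced it.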
In total, we have designed 11 XML transformations to account for all the filler/slot semantic elements of the four MAS classes. Note that some of these transformations are recursive. This is the case, for instance, of the train code: LIS signers realize the code as a sequence of digits rather than as a single number, as in Italian (see Table 1).
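A recursive transformation of this kind might look like the sketch below, which rewrites a train code into one nested node per digit. The mode name "NEXT-DIGIT" and the nominal naming scheme are hypothetical; the source does not show how digit sequences are encoded in the hybrid logic formula.

```python
import xml.etree.ElementTree as ET


def digits_to_xml(code, index=1):
    """Recursively rewrite a train code into nested one-digit nodes."""
    if not code.isdigit():
        raise ValueError("train code must be a non-empty string of digits")
    # "NEXT-DIGIT" is a hypothetical relation name used only for illustration.
    node = ET.Element("diamond", {"mode": "NEXT-DIGIT"})
    ET.SubElement(node, "nom", {"name": f"d{index}:digit"})
    ET.SubElement(node, "prop", {"name": code[0]})
    if len(code) > 1:
        # The remaining digits are rewritten by the same transformation.
        node.append(digits_to_xml(code[1:], index + 1))
    return node


tree = digits_to_xml("10220")
digit_sequence = [prop.get("name") for prop in tree.iter("prop")]
```

Walking the resulting tree in document order yields the digits in signing order, one lexical item per digit.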

A CCG for LIS in rail stations
We have designed a new CCG for LIS in the rail stations domain starting from the CCG for LIS devised in the ATLAS project.
SLs do not have adpositions and articles, and use pronouns and conjunctions in very specific ways (Brentani, 2010). As a consequence, a very challenging topic is the grammatical design of the modifiers. In contrast to vocal languages, where the modification of a noun by another noun is usually marked by adpositions, in SLs proximity in the sentence is the only possible indicator of the modification. Indeed, noun modifications occur often in the rail station domain: in the LIS sentence of Table 1 there are five noun ← noun modifications, including those used to indicate train code (TRENO ← NUMERO), train company (TRENO ← TRENITALIA), scheduled time (TRENO ← ORE), and delay amount (TRENO ← RITARDO). Our CCG design uses type-change for promoting a standard noun category (N) into a noun modifier category (N\N). However, this design increases the ambiguity of the grammar, since a noun could be the modifier of all previous nouns. In order to mitigate the grammar ambiguity, we have enriched the noun category with a type-change count feature. The idea is to allow the modification of a noun only if this noun has not been obtained with another type-change. Formally: In this way, the noun NUMERO cannot (1) be modified by the noun RITARDO, and in the same derivation (2) modify the noun TRENO. From another point of view, the introduction of the type-change count feature constrains the hybrid logic dependency structure to be flat.
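The effect of the type-change count feature on ambiguity can be illustrated with a simplified sketch. It is not a CCG parser: it only enumerates the dependency attachments that adjacency allows, assuming that each noun after the first modifies some noun on the right frontier of the structure built so far. The order of the glosses is illustrative.

```python
def attachments(n_nouns, flat_only=False):
    """Enumerate head assignments for a sequence of n_nouns nouns.

    Noun 0 is the head noun; every later noun modifies a noun to its left.
    Each result is a tuple whose element i-1 is the head index of noun i.
    """
    results = []

    def extend(i, frontier, heads):
        if i == n_nouns:
            results.append(tuple(heads))
            return
        # Only nouns on the right frontier are adjacent to the next noun.
        for k, head in enumerate(frontier):
            # With the count feature, a noun that is itself a modifier
            # (anything except noun 0) can no longer be modified.
            if flat_only and head != 0:
                continue
            extend(i + 1, frontier[:k + 1] + [i], heads + [head])

    extend(1, [0], [])
    return results


NOUNS = ["TRENO", "NUMERO", "TRENITALIA", "ORE", "RITARDO"]  # illustrative order
unconstrained = attachments(len(NOUNS))
constrained = attachments(len(NOUNS), flat_only=True)
print(len(unconstrained), len(constrained))
```

Under these assumptions, five nouns admit 14 attachment structures without the constraint and exactly one with it, the flat structure in which every modifier depends on TRENO.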
Another well-known problem related to modifiers is their realization order. Similar to vocal languages, SLs have strong pragmatic preferences for specific modifier orders. The symbolic and statistical nature of OpenCCG makes it possible to manage this issue with a probabilistic approach. Indeed, it is possible to associate probabilities with logically equivalent derivations by using a language model (White, 2005). In order to use this feature, we have built an "artificial" corpus of 50 LIS sentences by using the four most frequent MAS templates. By using this corpus, we have built a trigram-based language model that derives the most natural modifier sequence.
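The ranking of modifier orders with a trigram model can be sketched as follows. This is a self-contained illustration, not OpenCCG's language model interface: the three-sentence corpus is a toy stand-in for the 50-sentence artificial corpus, and add-one smoothing is an assumption chosen for brevity.

```python
import math
from collections import Counter
from itertools import permutations

# Toy corpus of gloss sequences (hypothetical).
CORPUS = [
    ["TRENO", "NUMERO", "TRENITALIA", "RITARDO"],
    ["TRENO", "NUMERO", "TRENITALIA", "ORE"],
    ["TRENO", "NUMERO", "ORE", "RITARDO"],
]


def pad(sentence):
    """Add sentence boundary markers for trigram extraction."""
    return ["<s>", "<s>"] + list(sentence) + ["</s>"]


def train_trigrams(corpus):
    """Count trigrams and their bigram contexts over the corpus."""
    trigrams, contexts, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        tokens = pad(sentence)
        vocab.update(tokens)
        for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
            trigrams[(a, b, c)] += 1
            contexts[(a, b)] += 1
    return trigrams, contexts, len(vocab)


def log_prob(sentence, model):
    """Log-probability of a gloss sequence with add-one smoothing."""
    trigrams, contexts, vocab_size = model
    tokens = pad(sentence)
    return sum(
        math.log((trigrams[(a, b, c)] + 1) / (contexts[(a, b)] + vocab_size))
        for a, b, c in zip(tokens, tokens[1:], tokens[2:])
    )


def best_order(head, modifiers, model):
    """Pick the modifier order that the language model scores highest."""
    candidates = [[head] + list(p) for p in permutations(modifiers)]
    return max(candidates, key=lambda s: log_prob(s, model))


model = train_trigrams(CORPUS)
ranked = best_order("TRENO", ["RITARDO", "NUMERO", "TRENITALIA"], model)
```

All candidate orders express the same meaning, so the language model acts only as a tie-breaker among logically equivalent realizations.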
Another grammar issue concerns the lexical semantics. OpenCCG organizes the lexical items in an ontological structure. In the implementation of the LIS grammar we have used the backbone of the DOLCE ontology (Masolo et al., 2002), i.e. the LIS lexical items (∼120 signs) have been classified under the top-level categories of DOLCE. For instance, the semantic category rail has been placed as a child of the DOLCE category non-agentive physical object.
Another specific feature of the LIS CCG lexicon concerns the lexicalization of some station names. Previous approaches to SL translation in the rail station domain propose to fingerspell the names of the secondary stations (Segouat and Braffort, 2009; Ebling, 2013), i.e. the stations that do not have a well-known name in the national Deaf community. In contrast, we propose to exploit the virtual nature of the avatar by producing a classifier sign that dynamically generates new lexical items. We distinguish two kinds of rail stations: (1) in the case of a well-known station, the avatar uses the sign adopted by the Deaf community; (2) in the case of a less-known station, the avatar realizes a classifier sign indicating a wide board, while the name of the station appears in written Italian "centered on the board" (Figure 2). Note that we had to modify the lexicalization mechanism of OpenCCG with a workaround in order to implement this feature. Indeed, OpenCCG assumes a "closed lexicon", i.e. it assumes that the lexicon is a closed set completely specified in the grammar. We have introduced a post-processing lexical substitution procedure that replaces a generic sign for less-known stations with the board sign, modified in real time with the name of the station. More details on the linguistic impact of the board sign are reported in (Geraci and Mazzei, 2013).
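The post-processing substitution can be sketched as follows. The gloss names, the placeholder STATION-GENERIC, the table of well-known stations and the tuple encoding of the board sign are all hypothetical; the source only states that a generic sign is replaced, after realization, by the board sign carrying the station name.

```python
# Hypothetical table of stations that have an established sign.
KNOWN_STATION_SIGNS = {"torino": "TORINO", "milano": "MILANO"}

# Hypothetical placeholder gloss that the closed lexicon contains.
GENERIC_STATION = "STATION-GENERIC"


def station_gloss(station):
    """Choose the gloss handed to the realizer for a station name."""
    return KNOWN_STATION_SIGNS.get(station.lower(), GENERIC_STATION)


def substitute_board_sign(glosses, station):
    """Replace the generic station gloss with a board sign showing the name."""
    return [
        ("BOARD", station) if gloss == GENERIC_STATION else gloss
        for gloss in glosses
    ]


# Usage: the realizer only ever sees glosses from the closed lexicon.
realized = ["TRENO", station_gloss("Cuneo"), "ARRIVARE"]
for_avatar = substitute_board_sign(realized, "Cuneo")
```

Keeping the substitution outside the realizer means that the grammar itself never needs an entry for each secondary station.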

Summary and future work
We have described some issues related to the generation module of the symbolic translator from Italian to LIS designed in the LIS4ALL project. The main contribution of this paper is to show that the combination of a filler/slot semantics with an XML transformation-based microplanner is adequate to generate controlled domain languages.
A prototype of the translator has been implemented in Clojure (http://clojure.org), a functional programming language that runs on the Java virtual machine. Clojure exploits the widespread use of Java by making it possible (1) to call external Java libraries efficiently, and (2) to deploy software on different machines. To implement the template-based microplanner, we have used the Enlive library, a selector-based system primarily designed for web templating. Moreover, the OpenCCG realizer is called natively from the Clojure code.
In the near future we plan to introduce in the generator the linguistic management of signing space, since previous work has shown that CCG can compactly model this linguistic feature (Wright, 2008). Finally, we intend to evaluate the quality of our translator by using both task-based human evaluation and metric-based automatic evaluation (Battaglino et al., 2015).