Antoni Oliver

2023

Training and integration of neural machine translation with MTUOC
Antoni Oliver | Sergi Alvarez
Proceedings of the 1st Workshop on Open Community-Driven Machine Translation

This paper presents the goals and main objectives of the MTUOC project. The project aims to ease the process of training neural machine translation (NMT) systems and integrating them into professional translation environments. The MTUOC project distributes a series of auxiliary tools for parallel corpus compilation and preprocessing, as well as for training NMT systems. The project also distributes a server that implements most of the communication protocols used in computer-assisted translation tools.

PE effort and neural-based automatic MT metrics: do they correlate?
Sergi Alvarez | Antoni Oliver
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

Neural machine translation (NMT) has shown overwhelmingly good results in recent times. This improvement in quality has boosted the presence of NMT in nearly all fields of translation. Most current translation industry workflows include post-editing (PE) of machine translation (MT) as part of their process. For many domains and language combinations, translators post-edit raw MT to produce the final document. However, this process can only work properly if the quality of the raw MT output can be assured. MT is usually evaluated using automatic scores, as they are much faster and cheaper than human evaluation. However, traditional automatic scores have not been good quality indicators and do not correlate with PE effort. We analyze the correlation of each of the three dimensions of PE effort (temporal, technical and cognitive) with COMET, a neural framework which has obtained outstanding results in recent MT evaluation campaigns.

The main goal of this project is to explore techniques for training NMT systems applied to Spanish, Portuguese, Catalan, Galician, Asturian, Aragonese and Aranese. These languages belong to the same Romance family, but they differ greatly in the linguistic resources available. Asturian, Aragonese and Aranese can be considered low-resource languages. These characteristics make this setting an excellent place to explore training techniques for low-resource languages: transfer learning and multilingual systems, among others. The first months of the project have been dedicated to the compilation of monolingual and parallel corpora for Asturian, Aragonese and Aranese.

2020

TermEval 2020: Using TSR Filtering Method to Improve Automatic Term Extraction
Antoni Oliver | Mercè Vàzquez
Proceedings of the 6th International Workshop on Computational Terminology

The identification of terms from domain-specific corpora using computational methods is a highly time-consuming task because terms have to be validated by specialists. In order to improve term candidate selection, we have developed the Token Slot Recognition (TSR) method, a filtering strategy based on terminological tokens which is used to rank term candidates extracted from domain-specific corpora. We have implemented this filtering strategy in TBXTools. In this paper we present the system we have used in the TermEval 2020 shared task on monolingual term extraction. We also present the evaluation results of the system for English, French and Dutch and for two corpora: corruption and heart failure. For English and French we have used a linguistic methodology based on POS patterns, and for Dutch we have used a statistical methodology based on n-gram calculation and filtering with stop-words. For all languages, the TSR filtering method has been applied. We have obtained competitive results, but there is still room for improvement of the system.

Neural Metaphor Detection with a Residual biLSTM-CRF Model
Andrés Torres Rivera | Antoni Oliver | Salvador Climent | Marta Coll-Florit
Proceedings of the Second Workshop on Figurative Language Processing

In this paper we present a novel resource-inexpensive architecture for metaphor detection based on a residual bidirectional long short-term memory network and conditional random fields. Current approaches to this task rely on deep neural networks to identify metaphorical words, using additional linguistic features or word embeddings. We evaluate our proposed approach using different model configurations that combine embeddings, part-of-speech tags, and semantically disambiguated synonym sets. This evaluation process was performed using the training and testing partitions of the VU Amsterdam Metaphor Corpus. We use this method of evaluation as a reference to compare the results with other current neural approaches for this task that implement similar neural architectures and features, and that were evaluated using this corpus. Results show that our system achieves competitive results with a simpler architecture compared to previous approaches.

PosEdiOn: Post-Editing Assessment in PythOn
Antoni Oliver | Sergi Alvarez | Toni Badia
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

Post-editing of machine translation (PEMT) is currently in widespread use in the translation industry. This is due to the increase in the demand for translation and to the significant improvements in quality achieved by neural machine translation (NMT). PEMT has been included as part of the translation workflow because it increases translators’ productivity and also reduces costs. Although effective post-editing requires MT output of sufficient quality, the usual automatic metrics do not always correlate with post-editing effort. We describe a standalone tool designed both for industry and research that has two main purposes: to collect sentence-level information from the post-editing process (e.g. post-editing time and keystrokes) and to visually present multiple evaluation scores so they can be easily interpreted by a user.

Quantitative Analysis of Post-Editing Effort Indicators for NMT
Sergi Alvarez | Antoni Oliver | Toni Badia
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

The recent improvements in machine translation (MT) have boosted the use of post-editing (PE) in the translation industry. A new machine translation paradigm, neural machine translation (NMT), is displacing its corpus-based predecessor, statistical machine translation (SMT), in the translation workflows currently implemented because it usually increases the fluency and accuracy of the MT output. However, usual automatic measurements do not always indicate the quality of the MT output and there is still no clear correlation between PE effort and productivity. We present a quantitative analysis of different PE effort indicators for two NMT systems (transformer and seq2seq) for English-Spanish in-domain medical documents. We compare both systems and study the correlation between PE time and other scores. Results show less PE effort for the transformer NMT model and a high correlation between PE time and keystrokes.

MTUOC: easy and free integration of NMT systems in professional translation environments
Antoni Oliver
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

This paper presents the MTUOC project, which aims to provide an easy integration of neural and statistical machine translation systems. Almost all the software required to train and use neural and statistical MT systems is released under free licences. However, its use is not always easy and intuitive, and medium to high specialized skills are required. The MTUOC project provides simplified scripts for preprocessing and training MT systems, and a server and client for easy use of the trained systems. The server is compatible with popular CAT tools for seamless integration. The project also distributes some free engines.

INMIGRA3: building a case for NGOs and NMT
Celia Rico | María Del Mar Sánchez Ramos | Antoni Oliver
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

INMIGRA3 is a three-year project that builds on the work of two previous initiatives: INMIGRA2-CM and CRISIS-MT. Together, they address the specific needs of NGOs in multilingual settings, with a particular interest in migratory contexts. Work on INMIGRA3 concentrates on analysing how NMT can best be put to use for translating NGO documentation.

Aligning Wikipedia with WordNet: a Review and Evaluation of Different Techniques
Antoni Oliver
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper we explore techniques for aligning Wikipedia articles with WordNet synsets, their successful alignment being our main goal. We evaluate techniques that use the definitions and sense relations in WordNet and the text and categories in Wikipedia articles. The results we present are based on two evaluation strategies: one uses a new gold and silver standard (for which the creation process is explained); the other creates wordnets in other languages and then compares them with existing wordnets for those languages found in the Open Multilingual Wordnet project. A reliable alignment between WordNet and Wikipedia is a very valuable resource for the creation of new wordnets in other languages and for the development of existing wordnets. The evaluation of alignments between WordNet and lexical resources is a difficult and time-consuming task, but the evaluation strategy using the Open Multilingual Wordnet can be used as an automated evaluation measure to assess the quality of alignments between these two resources.

ReSiPC: a Tool for Complex Searches in Parallel Corpora
Antoni Oliver | Bojana Mikelenić
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper, a tool specifically designed to allow for complex searches in large parallel corpora is presented. The formalism for the queries is very powerful as it uses standard regular expressions that allow for complex queries combining word forms, lemmata and POS-tags. As queries are performed over POS-tags, at least one of the languages in the parallel corpus should be POS-tagged. Searches can be performed in one of the languages or in both languages at the same time. The program is able to POS-tag the corpora using the Freeling analyzer through its Python API. ReSiPC is developed in Python version 3 and it is distributed under a free license (GNU GPL). The tool can be used to provide data for contrastive linguistics research and an example of use in a Spanish-Croatian parallel corpus is presented. ReSiPC is designed for queries in POS-tagged corpora, but it can be easily adapted for querying corpora containing other kinds of information.
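
The kind of query described in this abstract can be illustrated with a minimal Python sketch. It is not ReSiPC's code or query syntax: the `form/lemma/TAG` token encoding, the function name `search_parallel`, the tag strings and the toy Spanish-Croatian pairs are all assumptions made for illustration. It shows a standard regular expression combining lemma and POS-tag constraints on the tagged side, optionally combined with a second pattern on the other language.

```python
import re

# Assumed encoding (hypothetical): each token is "form/lemma/TAG", tokens are space-separated.
FIELD = r"[^/\s]+"

# A token whose tag starts with D (determiner) followed by any form of the noun lemma "casa".
DET_CASA = rf"{FIELD}/{FIELD}/D{FIELD}\s{FIELD}/casa/N{FIELD}"


def search_parallel(pairs, source_pattern=None, target_pattern=None):
    """Return indices of sentence pairs matching every pattern that was given."""
    src_re = re.compile(source_pattern) if source_pattern else None
    tgt_re = re.compile(target_pattern) if target_pattern else None
    hits = []
    for index, (source, target) in enumerate(pairs):
        if src_re and not src_re.search(source):
            continue
        if tgt_re and not tgt_re.search(target):
            continue
        hits.append(index)
    return hits


# Toy data: tagged Spanish source, untagged Croatian target (illustrative only).
pairs = [
    ("la/el/DA0FS0 casa/casa/NCFS000 es/ser/VSIP3S0 grande/grande/AQ0CS00", "kuća je velika"),
    ("un/uno/DI0MS0 perro/perro/NCMS000", "pas"),
    ("las/el/DA0FP0 casas/casa/NCFP000", "domovi"),
]

source_only = search_parallel(pairs, source_pattern=DET_CASA)
both_sides = search_parallel(pairs, source_pattern=DET_CASA, target_pattern=r"\bkuć\w+")
```

Here `source_only` selects the pairs whose Spanish side contains a determiner plus a form of "casa", while `both_sides` additionally requires a word starting with "kuć" on the Croatian side.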

2019

Does NMT make a difference when post-editing closely related languages? The case of Spanish-Catalan
Sergi Alvarez | Antoni Oliver | Toni Badia
Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks

2018

Further expansion of the Croatian WordNet
Krešimir Šojat | Matea Filko | Antoni Oliver
Proceedings of the 9th Global Wordnet Conference

In this paper a semi-automatic procedure for the expansion of the Croatian Wordnet (CroWN) is presented. An English-Croatian dictionary was used in order to translate monosemous PWN 3.0 English variants. The precision of the automatic process is low (about 30%), but the results proved valuable for the enlargement of CroWN. After manual validation, 10,884 new synset-variant pairs were added to CroWN, achieving a total of 62,075 synset-variant pairs.
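
The dictionary step described above can be sketched in a few lines of Python. This is a hedged illustration rather than the procedure's actual code: the function name, the data structures and the synset identifiers are hypothetical, and the toy entries are not taken from the paper. The idea is that an English variant belonging to exactly one synset can pass its dictionary translations to that synset without disambiguation, while polysemous variants are skipped.

```python
def translate_monosemous(variant_to_synsets, dictionary):
    """Build (synset, translation) candidate pairs from monosemous English variants only."""
    pairs = set()
    for variant, synsets in variant_to_synsets.items():
        if len(synsets) != 1:
            continue  # polysemous variant: the dictionary alone cannot pick a synset
        (synset,) = synsets
        for translation in dictionary.get(variant, []):
            pairs.add((synset, translation))
    return pairs


# Toy data with made-up synset identifiers (illustrative only).
variant_to_synsets = {
    "oxygen": ["syn-oxygen"],
    "bank": ["syn-bank-river", "syn-bank-finance"],
}
dictionary = {
    "oxygen": ["kisik"],
    "bank": ["banka", "obala"],
}

candidates = translate_monosemous(variant_to_synsets, dictionary)
```

The resulting candidate pairs would still need manual validation, as the abstract notes.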

2017

Morphological Analysis of the Dravidian Language Family
Arun Kumar | Ryan Cotterell | Lluís Padró | Antoni Oliver
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

The Dravidian languages are one of the most widely spoken language families in the world, yet there are very few annotated resources available to NLP researchers. To remedy this, we create DravMorph, a corpus annotated for morphological segmentation and part-of-speech. Additionally, we exploit novel features and higher-order models to set state-of-the-art results on these corpora on both tasks, beating techniques proposed in the literature by as much as 4 points in segmentation F1.

2016

Extending the WN-Toolkit: dealing with polysemous words in the dictionary-based strategy
Antoni Oliver
Proceedings of the 8th Global WordNet Conference (GWC)

In this paper we present an extension of the dictionary-based strategy for wordnet construction implemented in the WN-Toolkit. This strategy allows the extraction of information for polysemous English words if definitions and/or semantic relations are present in the dictionary. The WN-Toolkit is a freely available set of programs for the creation and expansion of wordnets using dictionary-based and parallel-corpus based strategies. In previous versions of the toolkit the dictionary-based strategy was only used for translating monosemous English variants. In the experiments we have used Omegawiki and Wiktionary and we present automatic evaluation results for 24 languages that have wordnets in the Open Multilingual Wordnet project. We have used these existing versions of the wordnet to perform an automatic evaluation.

2015

Joint Bayesian Morphology Learning for Dravidian Languages
Arun Kumar | Lluís Padró | Antoni Oliver
Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects

Learning Agglutinative Morphology of Indian Languages with Linguistically Motivated Adaptor Grammars
Arun Kumar | Lluís Padró | Antoni Oliver
Proceedings of the International Conference Recent Advances in Natural Language Processing

TBXTools: A Free, Fast and Flexible Tool for Automatic Terminology Extraction
Antoni Oliver | Mercè Vàzquez
Proceedings of the International Conference Recent Advances in Natural Language Processing

Enlarging the Croatian WordNet with WN-Toolkit and Cro-Deriv
Antoni Oliver | Krešimir Šojat | Matea Srebačić
Proceedings of the International Conference Recent Advances in Natural Language Processing

2014

WN-Toolkit: Automatic generation of WordNets following the expand model
Antoni Oliver
Proceedings of the Seventh Global Wordnet Conference

Automatic creation of WordNets from parallel corpora
Antoni Oliver | Salvador Climent
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we present the evaluation results for the creation of WordNets for five languages (Spanish, French, German, Italian and Portuguese) using an approach based on parallel corpora. We have used three very large parallel corpora for our experiments: DGT-TM, EMEA and ECB. The English part of each corpus is semantically tagged using Freeling and UKB. After this step, the process of WordNet creation is converted into a word alignment problem, where we want to align WordNet synsets in the English part of the corpus with lemmata in the target-language part of the corpus. The word alignment algorithm used in these experiments is a simple most frequent translation algorithm implemented in the WN-Toolkit. The precision values obtained are quite satisfactory, but the overall number of extracted synset-variant pairs is too low, leading to very poor recall values. In the conclusions, the use of more advanced word alignment algorithms, such as Giza++, Fast Align or Berkeley aligner, is suggested.
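
A most frequent translation strategy of the kind mentioned in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the WN-Toolkit implementation: the function name, the input format (one list of synset identifiers and one list of target lemmata per sentence pair), the tie-breaking rule and the toy data are all hypothetical.

```python
from collections import Counter, defaultdict


def most_frequent_translation(aligned_sentences, min_count=1):
    """Map each synset to the target lemma it co-occurs with most often."""
    cooccurrence = defaultdict(Counter)
    for synsets, lemmas in aligned_sentences:
        for synset in set(synsets):
            # Count each lemma at most once per sentence pair.
            cooccurrence[synset].update(set(lemmas))

    result = {}
    for synset, counts in cooccurrence.items():
        # Highest count wins; ties are broken alphabetically so the output is deterministic.
        lemma, count = min(counts.items(), key=lambda item: (-item[1], item[0]))
        if count >= min_count:
            result[synset] = lemma
    return result


# Toy sentence pairs: (synset IDs on the English side, lemmata on the Spanish side).
corpus = [
    (["syn-dog", "syn-house"], ["perro", "casa"]),
    (["syn-dog"], ["perro", "ladrar"]),
    (["syn-house"], ["casa", "grande"]),
]

alignment = most_frequent_translation(corpus)
```

Raising `min_count` trades recall for precision, which mirrors the precision/recall tension the abstract reports.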

2008

This paper presents the complete and consistent ontological annotation of the nominal part of WordNet. The annotation has been carried out using the semantic features defined in the EuroWordNet Top Concept Ontology and made available to the NLP community. Up to now only an initial core set of 1,024 synsets, the so-called Base Concepts, had been ontologized in this way. The work has been achieved by following a methodology based on an iterative and incremental expansion of the initial labeling through the hierarchy while setting inheritance blockage points. Since this labeling has been set on the EuroWordNet's Interlingual Index (ILI), it can also be used to populate any other wordnet linked to it through a simple porting process. This feature-annotated WordNet is intended to be useful for a large number of semantic NLP tasks and for testing componential analysis in real environments for the first time. Moreover, the quantitative analysis of the work shows that more than 40% of the nominal part of WordNet is involved in structure errors or inadequacies.

This paper presents experiments for enlarging the Croatian Morphological Lexicon by applying an automatic acquisition methodology. The basic sources of information for the system are a set of morphological rules and a raw corpus. The morphological rules have been automatically derived from the existing Croatian Morphological Lexicon, and we have used a subset of the Croatian National Corpus in our experiments. The methodology has proved to be efficient for languages that, like Croatian, have a rich and mainly concatenative morphology. This method can be applied to the creation of new resources, as well as to the enrichment of existing ones. We also present an extension of the system that uses automatic querying of the Internet to acquire those entries for which we do not have enough information in our corpus.

A Grammar and Style Checker Based on Internet Searches
Joaquim Moré | Salvador Climent | Antoni Oliver
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)