Sebastian Möller


2023

InterroLang: Exploring NLP Models and Datasets through Dialogue-based Explanations
Nils Feldhus | Qianli Wang | Tatiana Anikina | Sahil Chopra | Cennet Oguz | Sebastian Möller
Findings of the Association for Computational Linguistics: EMNLP 2023

While recently developed NLP explainability methods let us open the black box in various ways (Madsen et al., 2022), a missing ingredient in this endeavor is an interactive tool offering a conversational interface. Such a dialogue system can help users explore datasets and models with explanations in a contextualized manner, e.g. via clarification or follow-up questions, and through a natural language interface. We adapt the conversational explanation framework TalkToModel (Slack et al., 2022) to the NLP domain, add new NLP-specific operations such as free-text rationalization, and illustrate its generalizability on three NLP tasks (dialogue act classification, question answering, hate speech detection). To recognize user queries for explanations, we evaluate fine-tuned and few-shot prompting models and implement a novel adapter-based approach. We then conduct two user studies on (1) the perceived correctness and helpfulness of the dialogues, and (2) the simulatability, i.e. how objectively helpful dialogical explanations are for humans in figuring out the model’s predicted label when it’s not shown. We found rationalization and feature attribution were helpful in explaining the model behavior. Moreover, users could more reliably predict the model outcome based on an explanation dialogue rather than one-off explanations.

Linguistically Motivated Evaluation of the 2023 State-of-the-art Machine Translation: Can ChatGPT Outperform NMT?
Shushen Manakhimova | Eleftherios Avramidis | Vivien Macketanz | Ekaterina Lapshinova-Koltunski | Sergei Bagdasarov | Sebastian Möller
Proceedings of the Eighth Conference on Machine Translation

This paper offers a fine-grained analysis of the machine translation outputs in the context of the Shared Task at the 8th Conference of Machine Translation (WMT23). Building on the foundation of previous test suite efforts, our analysis includes Large Language Models and an updated test set featuring new linguistic phenomena. To our knowledge, this is the first fine-grained linguistic analysis for the GPT-4 translation outputs. Our evaluation spans German-English, English-German, and English-Russian language directions. Some of the phenomena with the lowest accuracies for German-English are idioms and resultative predicates. For English-German, these include mediopassive voice and noun formation (er). As for English-Russian, these include idioms and semantic roles. GPT-4 performs equally or comparably to the best systems in German-English and English-German but falls in the second significance cluster for English-Russian.

Challenging the State-of-the-art Machine Translation Metrics from a Linguistic Perspective
Eleftherios Avramidis | Shushen Manakhimova | Vivien Macketanz | Sebastian Möller
Proceedings of the Eighth Conference on Machine Translation

We employ a linguistically motivated challenge set in order to evaluate the state-of-the-art machine translation metrics submitted to the Metrics Shared Task of the 8th Conference for Machine Translation. The challenge set includes about 21,000 items extracted from 155 machine translation systems for three language directions, covering more than 100 linguistically-motivated phenomena organized in 14 categories. The metrics that have the best performance with regard to our linguistically motivated analysis are the Cometoid22-wmt23 (a trained metric based on distillation) for German-English and MetricX-23-c (based on a fine-tuned mT5 encoder-decoder language model) for English-German and English-Russian. Some of the most difficult phenomena are passive voice for German-English, named entities, terminology and measurement units for English-German, and focus particles, adverbial clause and stripping for English-Russian.

MultiTACRED: A Multilingual Version of the TAC Relation Extraction Dataset
Leonhard Hennig | Philippe Thomas | Sebastian Möller
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Relation extraction (RE) is a fundamental task in information extraction, whose extension to multilingual settings has been hindered by the lack of supervised resources comparable in size to large English datasets such as TACRED (Zhang et al., 2017). To address this gap, we introduce the MultiTACRED dataset, covering 12 typologically diverse languages from 9 language families, which is created by machine-translating TACRED instances and automatically projecting their entity annotations. We analyze translation and annotation projection quality, identify error categories, and experimentally evaluate fine-tuned pretrained mono- and multilingual language models in common transfer learning scenarios. Our analyses show that machine translation is a viable strategy to transfer RE instances, with native speakers judging more than 83% of the translated instances to be linguistically and semantically acceptable. We find monolingual RE model performance to be comparable to the English original for many of the target languages, and that multilingual models trained on a combination of English and target language data can outperform their monolingual counterparts. However, we also observe a variety of translation and annotation projection errors, both due to the MT systems and linguistic features of the target languages, such as pronoun-dropping, compounding and inflection, that degrade dataset quality and RE model performance.

Saliency Map Verbalization: Comparing Feature Importance Representations from Model-free and Instruction-based Methods
Nils Feldhus | Leonhard Hennig | Maximilian Dustin Nasert | Christopher Ebert | Robert Schwarzenberg | Sebastian Möller
Proceedings of the 1st Workshop on Natural Language Reasoning and Structured Explanations (NLRSE)

Saliency maps can explain a neural model’s predictions by identifying important input features. They are difficult to interpret for laypeople, especially for instances with many features. In order to make them more accessible, we formalize the underexplored task of translating saliency maps into natural language and compare methods that address two key challenges of this approach – what and how to verbalize. In both automatic and human evaluation setups, using token-level attributions from text classification tasks, we compare two novel methods (search-based and instruction-based verbalizations) against conventional feature importance representations (heatmap visualizations and extractive rationales), measuring simulatability, faithfulness, helpfulness and ease of understanding. Instructing GPT-3.5 to generate saliency map verbalizations yields plausible explanations which include associations, abstractive summarization and commonsense reasoning, achieving by far the highest human ratings, but they are not faithfully capturing numeric information and are inconsistent in their interpretation of the task. In comparison, our search-based, model-free verbalization approach efficiently completes templated verbalizations, is faithful by design, but falls short in helpfulness and simulatability. Our results suggest that saliency map verbalization makes feature attribution explanations more comprehensible and less cognitively challenging to humans than conventional representations.

2022

Perceptual Quality Dimensions of Machine-Generated Text with a Focus on Machine Translation
Vivien Macketanz | Babak Naderi | Steven Schmidt | Sebastian Möller
Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval)

The quality of machine-generated text is a complex construct consisting of various aspects and dimensions. We present a study that aims to uncover relevant perceptual quality dimensions for one type of machine-generated text, that is, Machine Translation. We conducted a crowdsourcing survey in the style of a Semantic Differential to collect attribute ratings for German MT outputs. An Exploratory Factor Analysis revealed the underlying perceptual dimensions. As a result, we extracted four factors that operate as relevant dimensions for the Quality of Experience of MT outputs: precision, complexity, grammaticality, and transparency.

Linguistically Motivated Evaluation of the 2022 State-of-the-art Machine Translation Systems for Three Language Directions
Vivien Macketanz | Shushen Manakhimova | Eleftherios Avramidis | Ekaterina Lapshinova-Koltunski | Sergei Bagdasarov | Sebastian Möller
Proceedings of the Seventh Conference on Machine Translation (WMT)

This document describes a fine-grained linguistically motivated analysis of 29 machine translation systems submitted at the Shared Task of the 7th Conference of Machine Translation (WMT22). This submission expands the test suite work of previous years by adding the language direction of English–Russian. As a result, evaluation takes place for the language directions of German–English, English–German, and English–Russian. We find that the German–English systems suffer in translating idioms, some tenses of modal verbs, and resultative predicates, the English–German ones in idioms, transitive-past progressive, and middle voice, whereas the English–Russian ones in pseudogapping and idioms.

Proceedings of the GermEval 2022 Workshop on Text Complexity Assessment of German Text
Sebastian Möller | Salar Mohtaj | Babak Naderi
Proceedings of the GermEval 2022 Workshop on Text Complexity Assessment of German Text

Overview of the GermEval 2022 Shared Task on Text Complexity Assessment of German Text
Salar Mohtaj | Babak Naderi | Sebastian Möller
Proceedings of the GermEval 2022 Workshop on Text Complexity Assessment of German Text

In this paper we present the GermEval 2022 shared task on Text Complexity Assessment of German text. Text forms an integral part of exchanging information and interacting with the world, correlating with quality and experience of life. Text complexity is one of the factors that affect a reader’s understanding of a text. The mapping of a body of text to a mathematical unit quantifying the degree of readability is the basis of complexity assessment. As readability might be influenced by representation, we only target the text complexity for readers in this task. We designed the task as text regression in which participants developed models to predict the complexity of pieces of text for a German learner in a range from 1 to 7. The shared task is organized in two phases: the development phase and the test phase. Among 24 participants who registered for the shared task, ten teams submitted their results on the test data.

MuLVE, A Multi-Language Vocabulary Evaluation Data Set
Anik Jacobsen | Salar Mohtaj | Sebastian Möller
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Vocabulary learning is vital to foreign language learning. Correct and adequate feedback is essential to successful and satisfying vocabulary training. However, many vocabulary and language evaluation systems operate on simple rules and do not account for real-life user learning data. This work introduces the Multi-Language Vocabulary Evaluation Data Set (MuLVE), a data set consisting of vocabulary cards and real-life user answers, labeled to indicate whether the user answer is correct or incorrect. The data source is user learning data from the Phase6 vocabulary trainer. The data set contains vocabulary questions in German, with English, Spanish, and French as target languages, and is available in four different variations regarding pre-processing and deduplication. We experiment with fine-tuning pre-trained BERT language models on the downstream task of vocabulary evaluation with the proposed MuLVE data set. The experiments yield accuracy and F2-scores above 95.5. The data set is available on the European Language Grid.
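
For readers unfamiliar with the F2-score reported above: it is the F-beta measure with beta = 2, which weights recall more heavily than precision. The following is a minimal sketch of that formula only; the function name `f_beta` and the counts in the usage note are illustrative assumptions, not values or code from the paper.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta from confusion counts; beta > 1 weights recall above precision.

    Uses the count form: (1 + b^2) * TP / ((1 + b^2) * TP + b^2 * FN + FP).
    Returns 0.0 when there are no positive predictions or labels at all.
    """
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0
```

With hypothetical counts where recall exceeds precision (say 6 true positives, 4 false positives, 1 false negative), `f_beta(6, 4, 1, beta=2.0)` is higher than the corresponding F1, which is why F2 suits settings where missing a correct answer is costlier than a false alarm.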

Subjective Text Complexity Assessment for German
Laura Seiffe | Fares Kallel | Sebastian Möller | Babak Naderi | Roland Roller
Proceedings of the Thirteenth Language Resources and Evaluation Conference

For different reasons, text can be difficult to read and understand for many people, especially if the text’s language is too complex. In order to provide suitable text for the target audience, it is necessary to measure its complexity. In this paper we describe subjective experiments to assess the readability of German text. We compile a new corpus of sentences provided by a German IT service provider. The sentences are annotated with the subjective complexity ratings by two groups of participants, namely experts and non-experts for that text domain. We then extract an extensive set of linguistically motivated features that are supposedly interacting with complexity perception. We show that a linear regression model with a subset of these features can be a very good predictor of text complexity.

A Linguistically Motivated Test Suite to Semi-Automatically Evaluate German–English Machine Translation Output
Vivien Macketanz | Eleftherios Avramidis | Aljoscha Burchardt | He Wang | Renlong Ai | Shushen Manakhimova | Ursula Strohriegel | Sebastian Möller | Hans Uszkoreit
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This paper presents a fine-grained test suite for the language pair German–English. The test suite is based on a number of linguistically motivated categories and phenomena, and the semi-automatic evaluation is carried out with regular expressions. We describe the creation and implementation of the test suite in detail, providing a full list of all categories and phenomena. Furthermore, we present various exemplary applications of our test suite that have been implemented in the past years, like contributions to the Conference of Machine Translation, the usage of the test suite and MT outputs for quality estimation, and the expansion of the test suite to the language pair Portuguese–English. We describe how we tracked the development of the performance of various MT systems over the years with the help of the test suite and which categories and phenomena are prone to resulting in MT errors. For the first time, we also make a large part of our test suite publicly available to the research community.
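
As a rough illustration of what regex-based semi-automatic evaluation can look like, the sketch below checks each MT output against a pattern for correct translations and a pattern for known errors, and defers to a human when neither fires. The item structure, the patterns, and the names `ITEMS`, `judge` and `accuracy_by_phenomenon` are hypothetical; they are not taken from the actual test suite.

```python
import re
from collections import defaultdict

# Hypothetical test items (not from the actual test suite): each has a
# phenomenon label, a regex a correct translation should match, and a regex
# that signals a known error.
ITEMS = [
    {"phenomenon": "idiom",
     "positive": r"\bpiece of cake\b",
     "negative": r"\bchild'?s game\b"},
    {"phenomenon": "modal",
     "positive": r"\bshould have\b",
     "negative": r"\bshould has\b"},
]

def judge(output, item):
    """Classify one MT output as 'pass', 'fail', or 'manual'."""
    if re.search(item["negative"], output, re.IGNORECASE):
        return "fail"
    if re.search(item["positive"], output, re.IGNORECASE):
        return "pass"
    return "manual"  # no rule fired: leave the decision to a human annotator

def accuracy_by_phenomenon(outputs, items=ITEMS):
    """Accuracy per phenomenon, counting only automatically decided outputs."""
    passed, decided = defaultdict(int), defaultdict(int)
    for output, item in zip(outputs, items):
        verdict = judge(output, item)
        if verdict != "manual":
            decided[item["phenomenon"]] += 1
            passed[item["phenomenon"]] += int(verdict == "pass")
    return {p: passed[p] / decided[p] for p in decided}
```

The "manual" verdict is what makes the procedure semi-automatic in this sketch: outputs that match neither pattern are set aside for annotation rather than being scored.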

Cross-lingual Approaches for the Detection of Adverse Drug Reactions in German from a Patient’s Perspective
Lisa Raithel | Philippe Thomas | Roland Roller | Oliver Sapina | Sebastian Möller | Pierre Zweigenbaum
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In this work, we present the first corpus for German Adverse Drug Reaction (ADR) detection in patient-generated content. The data consists of 4,169 binary annotated documents from a German patient forum, where users talk about health issues and get advice from medical doctors. As is common in social media data in this domain, the class labels of the corpus are very imbalanced. This and a high topic imbalance make it a very challenging dataset, since often, the same symptom can have several causes and is not always related to a medication intake. We aim to encourage further multi-lingual efforts in the domain of ADR detection and provide preliminary experiments for binary classification using different methods of zero- and few-shot learning based on a multi-lingual model. When fine-tuning XLM-RoBERTa first on English patient forum data and then on the new German data, we achieve an F1-score of 37.52 for the positive class. We make the dataset and models publicly available for the community.

TUB at WANLP22 Shared Task: Using Semantic Similarity for Propaganda Detection in Arabic
Salar Mohtaj | Sebastian Möller
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)

Propaganda and the spreading of fake news through social media have become a serious problem in recent years. In this paper we present our approach for the shared task on propaganda detection in Arabic, in which the goal is to identify propaganda techniques in Arabic social media text. We propose a semantic similarity detection model to compare text in the test set with the sentences in the train set to find the most similar instances. The label of the target text is obtained from the most similar texts in the train set. The proposed model obtained a micro F1 score of 0.494 on the test data set.
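
The label-transfer idea described above can be sketched as a nearest-neighbour lookup. The example below is a toy stand-in only: it uses bag-of-words cosine similarity in place of a semantic similarity model, and the function names, the English toy sentences and the label names are all assumptions made for illustration.

```python
import math
from collections import Counter

def embed(text):
    """Toy text representation: lower-cased bag of words (an assumption)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def predict(text, train, k=1):
    """Return the majority label among the k most similar training texts."""
    query = embed(text)
    ranked = sorted(train, key=lambda ex: cosine(query, embed(ex[0])),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training examples and labels, for illustration only.
TRAIN = [
    ("these traitors are enemies of the people", "name_calling"),
    ("the meeting starts at noon today", "no_technique"),
]
```

Calling `predict("they are traitors and enemies", TRAIN)` picks the label of the first training sentence, since it shares three tokens with the query and the second shares none. A real system would swap `embed` for a sentence encoder.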

Using Neural Machine Translation Methods for Sign Language Translation
Galina Angelova | Eleftherios Avramidis | Sebastian Möller
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

We examine methods and techniques, proven to be helpful for the text-to-text translation of spoken languages in the context of gloss-to-text translation systems, where the glosses are the written representation of the signs. We present one of the first works that include experiments on both parallel corpora of the German Sign Language (PHOENIX14T and the Public DGS Corpus). We experiment with two NMT architectures with optimization of their hyperparameters, several tokenization methods and two data augmentation techniques (back-translation and paraphrasing). Through our investigation we achieve a substantial improvement of 5.0 and 2.2 BLEU scores for the models trained on the two corpora respectively. Our RNN models outperform our Transformer models, and the segmentation method we achieve best results with is BPE, whereas back-translation and paraphrasing lead to minor but not significant improvements.

Towards Personality-Aware Chatbots
Daniel Fernau | Stefan Hillmann | Nils Feldhus | Tim Polzehl | Sebastian Möller
Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue

Chatbots are increasingly used to automate operational processes in customer service. However, most chatbots lack adaptation towards their users, which may result in an unsatisfactory experience. Since knowing and meeting personal preferences is a key factor for enhancing usability in conversational agents, in this study we analyze an adaptive conversational agent that can automatically adjust according to a user’s personality type carefully excerpted from the Myers-Briggs type indicators. An experiment including 300 crowd workers examined how typifications like extroversion/introversion and thinking/feeling can be assessed and designed for a conversational agent in a job recommender domain. Our results validate the proposed design choices, and experiments on a user-matched personality typification, following the so-called law of attraction rule, show a significant positive influence on a range of selected usability criteria such as overall satisfaction, naturalness, promoter score, trust and appropriateness of the conversation.

2021

Reliability of Human Evaluation for Text Summarization: Lessons Learned and Challenges Ahead
Neslihan Iskender | Tim Polzehl | Sebastian Möller
Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval)

Only a small portion of research papers with human evaluation for text summarization provide information about the participant demographics, task design, and experiment protocol. Additionally, many researchers use human evaluation as gold standard without questioning the reliability or investigating the factors that might affect the reliability of the human evaluation. As a result, there is a lack of best practices for reliable human summarization evaluation grounded by empirical evidence. To investigate human evaluation reliability, we conduct a series of human evaluation experiments, provide an overview of participant demographics, task design, experimental set-up and compare the results from different experiments. Based on our empirical analysis, we provide guidelines to ensure the reliability of expert and non-expert evaluations, and we determine the factors that might affect the reliability of the human evaluation.

Linguistic Evaluation for the 2021 State-of-the-art Machine Translation Systems for German to English and English to German
Vivien Macketanz | Eleftherios Avramidis | Shushen Manakhimova | Sebastian Möller
Proceedings of the Sixth Conference on Machine Translation

We are using a semi-automated test suite in order to provide a fine-grained linguistic evaluation for state-of-the-art machine translation systems. The evaluation includes 18 German to English and 18 English to German systems, submitted to the Translation Shared Task of the 2021 Conference on Machine Translation. Our submission adds to the submissions of the previous years by creating and applying a wide-range test suite for English to German as a new language pair. The fine-grained evaluation allows spotting significant differences between systems that cannot be distinguished by the direct assessment of the human evaluation campaign. We find that most of the systems achieve good accuracies in the majority of linguistic phenomena, but there are a few phenomena with lower accuracy, such as idioms, the modal pluperfect and the German resultative predicates. Two systems have significantly better test suite accuracy in macro-average in every language direction: Online-W and Facebook-AI for German to English, and VolcTrans and Online-W for English to German. The systems show a steady improvement as compared to previous years.

Towards Hybrid Human-Machine Workflow for Natural Language Generation
Neslihan Iskender | Tim Polzehl | Sebastian Möller
Proceedings of the First Workshop on Bridging Human–Computer Interaction and Natural Language Processing

In recent years, crowdsourcing has gained much attention from researchers as a way to generate data for Natural Language Generation (NLG) tools or to evaluate them. However, the quality of crowdsourced data has been questioned repeatedly because of the complexity of NLG tasks and crowd workers’ unknown skills. Moreover, crowdsourcing can also be costly and often not feasible for large-scale data generation or evaluation. To overcome these challenges and leverage the complementary strengths of humans and machine tools, we propose a hybrid human-machine workflow designed explicitly for NLG tasks with real-time quality control mechanisms under budget constraints. This hybrid methodology is a powerful tool for achieving high-quality data while preserving efficiency. By combining human and machine intelligence, the proposed workflow decides dynamically on the next step based on the data from previous steps and given constraints. Our goal is not only to provide the theoretical foundations of the hybrid workflow but also to release its implementation as open source in future work.

Thermostat: A Large Collection of NLP Model Explanations and Analysis Tools
Nils Feldhus | Robert Schwarzenberg | Sebastian Möller
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

In the language domain, as in other domains, neural explainability takes an ever more important role, with feature attribution methods at the forefront. Many such methods require considerable computational resources and expert knowledge about implementation details and parameter choices. To facilitate research, we present Thermostat, which consists of a large collection of model explanations and accompanying analysis tools. Thermostat allows easy access to over 200k explanations for the decisions of prominent state-of-the-art models spanning different NLP tasks, generated with multiple explainers. The dataset took over 10k GPU hours (> one year) to compile; compute time that the community now saves. The accompanying software tools allow users to analyse explanations instance-wise as well as cumulatively at corpus level. Users can investigate and compare models, datasets and explainers without the need to orchestrate implementation details. Thermostat is fully open source, democratizes explainability research in the language domain, circumvents redundant computations and increases comparability and replicability.

Efficient Explanations from Empirical Explainers
Robert Schwarzenberg | Nils Feldhus | Sebastian Möller
Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

Amid a discussion about Green AI in which we see explainability neglected, we explore the possibility to efficiently approximate computationally expensive explainers. To this end, we propose feature attribution modelling with Empirical Explainers. Empirical Explainers learn from data to predict the attribution maps of expensive explainers. We train and test Empirical Explainers in the language domain and find that they model their expensive counterparts surprisingly well, at a fraction of the cost. They could thus mitigate the computational burden of neural explanations significantly, in applications that tolerate an approximation error.

2020

Simulating Turn-Taking in Conversations with Delayed Transmission
Thilo Michael | Sebastian Möller
Proceedings of the 21st Annual Meeting of the Special Interest Group on Discourse and Dialogue

Conversations over the telephone require timely turn-taking cues that signal the participants when to speak and when to listen. When a two-way transmission delay is introduced into such conversations, the immediate feedback is delayed, and the interactivity of the conversation is impaired. With delayed speech on each side of the transmission, different conversation realities emerge on both ends, which alters the way the participants interact with each other. Simulating conversations can give insights on turn-taking and spoken interactions between humans but can also be used for analyzing and even predicting human behavior in conversations. In this paper, we simulate two types of conversations with distinct levels of interactivity. We then introduce three levels of two-way transmission delay between the agents and compare the resulting interaction patterns with human-to-human dialog from an empirical study. We show how the turn-taking mechanisms modeled for conversations without delay perform in scenarios with delay and identify to what extent the simulation is able to model the delayed turn-taking observed in human conversation.

Best Practices for Crowd-based Evaluation of German Summarization: Comparing Crowd, Expert and Automatic Evaluation
Neslihan Iskender | Tim Polzehl | Sebastian Möller
Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems

One of the main challenges in the development of summarization tools is summarization quality evaluation. On the one hand, the human assessment of summarization quality conducted by linguistic experts is slow, expensive, and still not a standardized procedure. On the other hand, the automatic assessment metrics are reported not to correlate highly enough with human quality ratings. As a solution, we propose crowdsourcing as a fast, scalable, and cost-effective alternative to expert evaluations to assess the intrinsic and extrinsic quality of summarization by comparing crowd ratings with expert ratings and automatic metrics such as ROUGE, BLEU, or BertScore on a German summarization data set. Our results provide a basis for best practices for crowd-based summarization evaluation regarding major influential factors such as the best annotation aggregation method, the influence of readability and reading effort on summarization evaluation, and the optimal number of crowd workers to achieve comparable results to experts, especially when determining factors such as overall quality, grammaticality, referential clarity, focus, structure & coherence, summary usefulness, and summary informativeness.

Towards a Reliable and Robust Methodology for Crowd-Based Subjective Quality Assessment of Query-Based Extractive Text Summarization
Neslihan Iskender | Tim Polzehl | Sebastian Möller
Proceedings of the Twelfth Language Resources and Evaluation Conference

The intrinsic and extrinsic quality evaluation is an essential part of the summary evaluation methodology, usually conducted in a traditional controlled laboratory environment. However, processing large text corpora using these methods proves expensive from both the organizational and the financial perspective. For the first time, and as a fast, scalable, and cost-effective alternative, we propose micro-task crowdsourcing to evaluate both the intrinsic and extrinsic quality of query-based extractive text summaries. To investigate the appropriateness of crowdsourcing for this task, we conduct intensive comparative crowdsourcing and laboratory experiments, evaluating nine extrinsic and intrinsic quality measures on 5-point MOS scales. Correlating results of crowd and laboratory ratings reveals high applicability of crowdsourcing for the factors overall quality, grammaticality, non-redundancy, referential clarity, focus, structure & coherence, summary usefulness, and summary informativeness. Further, we investigate the effect of the number of repetitions of assessments on the robustness of the mean opinion score of crowd ratings, measured against the increase of correlation coefficients between crowd and laboratory. Our results suggest that the optimal number of repetitions in crowdsourcing setups, beyond which additional repetitions no longer cause an adequate increase of overall correlation coefficients, lies between seven and nine for intrinsic and extrinsic quality factors.

An Empirical Comparison of Question Classification Methods for Question Answering Systems
Eduardo Cortes | Vinicius Woloszyn | Arne Binder | Tilo Himmelsbach | Dante Barone | Sebastian Möller
Proceedings of the Twelfth Language Resources and Evaluation Conference

Question classification is an important component of Question Answering Systems, responsible for identifying the type of answer a particular question requires. For instance, “Who is the prime minister of the United Kingdom?” demands the name of a PERSON, while “When was the queen of the United Kingdom born?” entails a DATE. This work presents an extensive review of the most recent methods for Question Classification, taking into consideration their applicability in low-resourced languages. First, we propose a manual classification of the current state-of-the-art methods in four distinct categories: low, medium, high, and very high level of dependency on external resources. Second, we apply this categorization in an empirical comparison in terms of the amount of data necessary for training and performance in different languages. In addition to complementing earlier works in this field, our study shows a performance boost for methods relying on recent language models, which surpass methods not suitable for low-resourced languages.

From Witch’s Shot to Music Making Bones - Resources for Medical Laymen to Technical Language and Vice Versa
Laura Seiffe | Oliver Marten | Michael Mikhailov | Sven Schmeier | Sebastian Möller | Roland Roller
Proceedings of the Twelfth Language Resources and Evaluation Conference

Many people share information on social media or in forums, such as the food they eat, the sports activities they do, or the events they have visited. The information we share online directly or indirectly reveals details about our lifestyle and health situation, particularly when the text input gets longer or when multiple messages can be linked to each other. This information can then be used to detect possible risk factors of diseases or adverse drug reactions of medications. However, as most people are not medical experts, the language they use tends to be descriptive rather than the precise medical terminology that medical professionals use. To detect and use this relevant information, laymen language has to be translated and/or linked to the corresponding medical concept. This work presents baseline data sources in order to address this challenge for the German language. We introduce a new dataset which annotates medical laymen and technical expressions in a patient forum, along with a set of medical synonyms and definitions, and present first baseline results on the data.

Claim extraction from text using transfer learning.
Acharya Ashish Prabhakar | Salar Mohtaj | Sebastian Möller
Proceedings of the 17th International Conference on Natural Language Processing (ICON)

Building an end-to-end fake news detection system consists of detecting claims in text and later verifying them for their authenticity. Although most recent works have focused on political claims, fake news can also be propagated in the form of religious intolerance, conspiracy theories, etc. Since there is a lack of training data specific to all these scenarios, we compiled a homogeneous and balanced dataset by combining some of the currently available data. Moreover, the paper shows how recent advancements in transfer learning can be leveraged to detect claims in general. The obtained results show that the recently developed transformers can shift the research focus from claim detection to the problem of check-worthiness of claims in domains of interest.

Fine-grained linguistic evaluation for state-of-the-art Machine Translation
Eleftherios Avramidis | Vivien Macketanz | Ursula Strohriegel | Aljoscha Burchardt | Sebastian Möller
Proceedings of the Fifth Conference on Machine Translation

This paper describes a test suite submission providing detailed statistics of linguistic performance for the state-of-the-art German-English systems of the Fifth Conference of Machine Translation (WMT20). The analysis covers 107 phenomena organized in 14 categories based on about 5,500 test items, including a manual annotation effort of 45 person hours. Two systems (Tohoku and Huoshan) appear to have significantly better test suite accuracy than the others, although the best system of WMT20 is not significantly better than the one from WMT19 in a macro-average. Additionally, we identify some linguistic phenomena where all systems suffer (such as idioms, resultative predicates and pluperfect), but we are also able to identify particular weaknesses for individual systems (such as quotation marks, lexical ambiguity and sluicing). Most of the systems of WMT19 which submitted new versions this year show improvements.

2019

Train, Sort, Explain: Learning to Diagnose Translation Models
Robert Schwarzenberg | David Harbecke | Vivien Macketanz | Eleftherios Avramidis | Sebastian Möller
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations)

Evaluating translation models is a trade-off between effort and detail. At one end of the spectrum there are automatic count-based methods such as BLEU, at the other end linguistic evaluations by humans, which arguably are more informative but also require a disproportionately high effort. To narrow the spectrum, we propose a general approach for automatically exposing systematic differences between human and machine translations to human experts. Inspired by adversarial settings, we train a neural text classifier to distinguish human from machine translations. A classifier that performs and generalizes well after training should recognize systematic differences between the two classes, which we uncover with neural explainability methods. Our proof-of-concept implementation, DiaMaT, is open source. Applied to a dataset translated by a state-of-the-art neural Transformer model, DiaMaT achieves a classification accuracy of 75% and exposes meaningful differences between humans and the Transformer, amidst the current discussion about human parity.

2012

Position Paper: Towards Standardized Metrics and Tools for Spoken and Multimodal Dialog System Evaluation
Sebastian Möller | Klaus-Peter Engelbrecht | Florian Kretzschmar | Stefan Schmidt | Benjamin Weiss
NAACL-HLT Workshop on Future directions and needs in the Spoken Dialog Community: Tools and Data (SDCTD 2012)

2009

Modeling User Satisfaction with Hidden Markov Models
Klaus-Peter Engelbrecht | Florian Gödde | Felix Hartard | Hamed Ketabdar | Sebastian Möller
Proceedings of the SIGDIAL 2009 Conference

2008

Corpus Analysis of Spoken Smart-Home Interactions with Older Users
Sebastian Möller | Florian Gödde | Maria Wolters
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper, we present the collection and analysis of a spoken dialogue corpus obtained from interactions of older and younger users with a smart-home system. Our aim is to identify the amount and the origin of linguistic differences in the way older and younger users address the system. In addition, we investigate changes in the users’ linguistic behaviour after exposure to the system. The results show that the two user groups differ in their speaking style as well as their vocabulary. In contrast to younger users, who adapt their speaking style to the expected limitations of the system, older users tend to use a speaking style that is closer to human-human communication in terms of sentence complexity and politeness. However, older users are far less easy to stereotype than younger users.

A Framework for Model-based Evaluation of Spoken Dialog Systems
Sebastian Möller | Nigel Ward
Proceedings of the 9th SIGdial Workshop on Discourse and Dialogue

2007

Pragmatic Usage of Linear Regression Models for the Prediction of User Judgments
Klaus-Peter Engelbrecht | Sebastian Möller
Proceedings of the 8th SIGdial Workshop on Discourse and Dialogue

2006

Set-up of a Unit-Selection Synthesis with a Prominent Voice
Stefan Breuer | Sven Bergmann | Ralf Dragon | Sebastian Möller
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper, we describe the set-up process and an initial evaluation of a unit-selection speech synthesizer. The synthesizer is specific in that it is intended to speak with a prominent voice. As a consequence, only very limited resources were available for setting up the unit database. These resources have been extracted from an audio book, segmented with the help of an HMM-based wrapper, and then used with the non-uniform unit-selection approach implemented in the Bonn Open Synthesis System (BOSS). In order to adapt the database to the BOSS implementation, the label files were amended by phrase boundaries, converted to XML, amended by prosodic and spectral information, and then further converted to a MySQL relational database structure. The BOSS system selects units on the basis of this information, adding individual unit costs to the concatenation costs given by MFCC and F0 distances. The paper discusses the problems which occurred during the database set-up, the invested effort, as well as the quality level which can be reached by this approach.

2005

Parameters for Quantifying the Interaction with Spoken Dialogue Telephone Services
Sebastian Möller
Proceedings of the 6th SIGdial Workshop on Discourse and Dialogue

2004

INSPIRE: Evaluation of a Smart-Home System for Infotainment Management and Device Control
Sebastian Möller | Jan Krebber | Alexander Raake | Paula Smeele | Martin Rajman | Mirek Melichar | Vincenzo Pallotta | Gianna Tsakou | Basilis Kladis | Anestis Vovos | Jettie Hoonhout | Dietmar Schuchardt | Nikos Fakotakis | Todor Ganchev | Ilyas Potamitis
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

A New ITU-T Recommendation on the Evaluation of Telephone-Based Spoken Dialogue Systems
Sebastian Möller
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

2002

Diagnostic Assessment of Telephone Transmission Impact on ASR Performance and Human-to-Human Speech Quality
Sebastian Möller | Ergina Kavallieratou
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

A new Taxonomy for the Quality of Telephone Services Based on Spoken Dialogue Systems
Sebastian Möller
Proceedings of the Third SIGdial Workshop on Discourse and Dialogue