Martin Jansche


2020

Burmese Speech Corpus, Finite-State Text Normalization and Pronunciation Grammars with an Application to Text-to-Speech
Yin May Oo | Theeraphol Wattanavekin | Chenfang Li | Pasindu De Silva | Supheakmungkol Sarin | Knot Pipatsrisawat | Martin Jansche | Oddur Kjartansson | Alexander Gutkin
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper introduces an open-source crowd-sourced multi-speaker speech corpus along with a comprehensive set of finite-state transducer (FST) grammars for performing text normalization for the Burmese (Myanmar) language. We also introduce open-source finite-state grammars for performing grapheme-to-phoneme (G2P) conversion for Burmese. These three components are necessary (but not sufficient) for building a high-quality text-to-speech (TTS) system for Burmese, a tonal Southeast Asian language from the Sino-Tibetan family which presents several linguistic challenges. We describe the corpus acquisition process and provide the details of our finite-state approach to Burmese text normalization and G2P. Our experiments involve building a multi-speaker TTS system based on long short-term memory (LSTM) recurrent neural network (RNN) models, which were previously shown to perform well for other languages in a low-resource setting. Our results indicate that the data and grammars that we are announcing are sufficient to build reasonably high-quality models comparable to other systems. We hope these resources will facilitate speech and language research on the Burmese language, which is considered by many to be low-resource due to the limited availability of free linguistic data.

Open-source Multi-speaker Speech Corpora for Building Gujarati, Kannada, Malayalam, Marathi, Tamil and Telugu Speech Synthesis Systems
Fei He | Shan-Hui Cathy Chu | Oddur Kjartansson | Clara Rivera | Anna Katanova | Alexander Gutkin | Isin Demirsahin | Cibu Johny | Martin Jansche | Supheakmungkol Sarin | Knot Pipatsrisawat
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present free, high-quality multi-speaker speech corpora for Gujarati, Kannada, Malayalam, Marathi, Tamil and Telugu, which are six of the twenty-two official languages of India and are spoken by 374 million native speakers. The datasets are primarily intended for use in text-to-speech (TTS) applications, such as constructing multilingual voices or performing speaker or language adaptation. Most of the corpora (apart from Marathi, which is a female-only database) consist of at least 2,000 recorded lines from female and male native speakers of the language. We present the methodological details behind corpora acquisition, which can be scaled to acquiring data for other languages of interest. We describe the experiments in building a multilingual text-to-speech model that is constructed by combining our corpora. Our results indicate that using these corpora results in good-quality voices, with Mean Opinion Scores (MOS) > 3.6, for all the languages tested. We believe that these resources, released with an open-source license, and the described methodology will help in the progress of speech applications for the languages described and aid corpora development for other, smaller languages of India and beyond.

2018

Building Open Javanese and Sundanese Corpora for Multilingual Text-to-Speech
Jaka Aris Eko Wibawa | Supheakmungkol Sarin | Chenfang Li | Knot Pipatsrisawat | Keshan Sodimana | Oddur Kjartansson | Alexander Gutkin | Martin Jansche | Linne Ha
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

FonBund: A Library for Combining Cross-lingual Phonological Segment Data
Alexander Gutkin | Martin Jansche | Tatiana Merkulova
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

TTS for Low Resource Languages: A Bangla Synthesizer
Alexander Gutkin | Linne Ha | Martin Jansche | Knot Pipatsrisawat | Richard Sproat
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present a text-to-speech (TTS) system designed for the dialect of Bengali spoken in Bangladesh. This work is part of an ongoing effort to address the needs of under-resourced languages. We propose a process for streamlining the bootstrapping of TTS systems for under-resourced languages. First, we use crowdsourcing to collect the data from multiple ordinary speakers, each speaker recording a small number of sentences. Second, we leverage an existing text normalization system for a related language (Hindi) to bootstrap a linguistic front-end for Bangla. Third, we employ statistical techniques to construct multi-speaker acoustic models using Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) and Hidden Markov Model (HMM) approaches. We then describe our experiments, which show that the resulting TTS voices score well in terms of their perceived quality as measured by Mean Opinion Score (MOS) evaluations.

2014

Computer-Aided Quality Assurance of an Icelandic Pronunciation Dictionary
Martin Jansche
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We propose a model-driven method for ensuring the quality of pronunciation dictionaries. The key ingredient is computing an alignment between letter strings and phoneme strings, a standard technique in pronunciation modeling. The novel aspect of our method is the use of informative, parametric alignment models that are refined iteratively as they are tested against the data. We discuss the use of alignment failures as a signal for detecting and correcting problematic dictionary entries. We illustrate this method using an existing pronunciation dictionary for Icelandic. Our method is completely general and has been applied in the construction of pronunciation dictionaries for commercially deployed speech recognition systems in several languages.

2010

A Comparison of Features for Automatic Readability Assessment
Lijun Feng | Martin Jansche | Matt Huenerfauth | Noémie Elhadad
Coling 2010: Posters

2009

Named Entity Transcription with Pair n-Gram Models
Martin Jansche | Richard Sproat
Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration (NEWS 2009)

OpenFst: An Open-Source, Weighted Finite-State Transducer Library and its Applications to Speech and Language
Michael Riley | Cyril Allauzen | Martin Jansche
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Tutorial Abstracts

2007

A Maximum Expected Utility Framework for Binary Sequence Labeling
Martin Jansche
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

2005

Treebank transfer
Martin Jansche
Proceedings of the Ninth International Workshop on Parsing Technology

Maximum Expected F-Measure Training of Logistic Regression Models
Martin Jansche
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing

2003

Parametric Models of Linguistic Count Data
Martin Jansche
Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics

2002

Information Extraction from Voicemail Transcripts
Martin Jansche | Steven Abney
Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002)

Named Entity Extraction with Conditional Markov Models and Classifiers
Martin Jansche
COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002)

2001

Re-Engineering Letter-to-Sound Rules
Martin Jansche
Second Meeting of the North American Chapter of the Association for Computational Linguistics

1998

Abductive Reasoning for Syntactic Realization
Ralf Klabunde | Martin Jansche
Natural Language Generation