Other Workshops and Events (2016)


Proceedings of the Third Workshop on Computational Linguistics and Clinical Psychology

Proceedings of the Third Workshop on Computational Linguistics and Clinical Psychology
Kristy Hollingshead | Lyle Ungar

Detecting late-life depression in Alzheimer’s disease through analysis of speech and language
Kathleen C. Fraser | Frank Rudzicz | Graeme Hirst

Towards Early Dementia Detection: Fusing Linguistic and Non-Linguistic Clinical Data
Joseph Bullard | Cecilia Ovesdotter Alm | Xumin Liu | Qi Yu | Rubén Proaño

Self-Reflective Sentiment Analysis
Benjamin Shickel | Martin Heesacker | Sherry Benton | Ashkan Ebadi | Paul Nickerson | Parisa Rashidi

Is Sentiment in Movies the Same as Sentiment in Psychotherapy? Comparisons Using a New Psychotherapy Sentiment Database
Michael Tanana | Aaron Dembe | Christina S. Soma | Zac Imel | David Atkins | Vivek Srikumar

Building a Motivational Interviewing Dataset
Verónica Pérez-Rosas | Rada Mihalcea | Kenneth Resnicow | Satinder Singh | Lawrence An

Crazy Mad Nutters: The Language of Mental Health
Jena D. Hwang | Kristy Hollingshead

The language of mental health problems in social media
George Gkotsis | Anika Oellrich | Tim Hubbard | Richard Dobson | Maria Liakata | Sumithra Velupillai | Rina Dutta

Exploring Autism Spectrum Disorders Using HLT
Julia Parish-Morris | Mark Liberman | Neville Ryant | Christopher Cieri | Leila Bateman | Emily Ferguson | Robert Schultz

Generating Clinically Relevant Texts: A Case Study on Life-Changing Events
Mayuresh Oak | Anil Behera | Titus Thomas | Cecilia Ovesdotter Alm | Emily Prud’hommeaux | Christopher Homan | Raymond Ptucha

Don’t Let Notes Be Misunderstood: A Negation Detection Method for Assessing Risk of Suicide in Mental Health Records
George Gkotsis | Sumithra Velupillai | Anika Oellrich | Harry Dean | Maria Liakata | Rina Dutta

Exploratory Analysis of Social Media Prior to a Suicide Attempt
Glen Coppersmith | Kim Ngo | Ryan Leary | Anthony Wood

CLPsych 2016 Shared Task: Triaging content in online peer-support forums
David N. Milne | Glen Pink | Ben Hachey | Rafael A. Calvo

Data61-CSIRO systems at the CLPsych 2016 Shared Task
Sunghwan Mac Kim | Yufei Wang | Stephen Wan | Cécile Paris

Predicting Post Severity in Mental Health Forums
Shervin Malmasi | Marcos Zampieri | Mark Dras

Classifying ReachOut posts with a radial basis function SVM
Chris Brew

Triaging Mental Health Forum Posts
Arman Cohan | Sydney Young | Nazli Goharian

Mental Distress Detection and Triage in Forum Posts: The LT3 CLPsych 2016 Shared Task System
Bart Desmet | Gilles Jacobs | Véronique Hoste

Text Analysis and Automatic Triage of Posts in a Mental Health Forum
Ehsaneddin Asgari | Soroush Nasiriany | Mohammad R.K. Mofrad

The UMD CLPsych 2016 Shared Task System: Text Representation for Predicting Triage of Forum Posts about Mental Health
Meir Friedenberg | Hadi Amiri | Hal Daumé III | Philip Resnik

Using Linear Classifiers for the Automatic Triage of Posts in the 2016 CLPsych Shared Task
Juri Opitz

The GW/UMD CLPsych 2016 Shared Task System
Ayah Zirikly | Varun Kumar | Philip Resnik

Semi-supervised CLPsych 2016 Shared Task System Submission
Nicolas Rey-Villamizar | Prasha Shrestha | Thamar Solorio | Farig Sadeque | Steven Bethard | Ted Pedersen

Combining Multiple Classifiers Using Global Ranking for ReachOut.com Post Triage
Chen-Kai Wang | Hong-Jie Dai | Chih-Wei Chen | Jitendra Jonnagaddala | Nai-Wen Chang

Classification of mental health forum posts
Glen Pink | Will Radford | Ben Hachey

Automatic Triage of Mental Health Online Forum Posts: CLPsych 2016 System Description
Hayda Almeida | Marc Queudot | Marie-Jean Meurs

Automatic Triage of Mental Health Forum Posts
Benjamin Shickel | Parisa Rashidi

Text-based experiments for Predicting mental health emergencies in online web forum posts
Hector-Hugo Franco-Penya | Liliana Mamani Sanchez


Proceedings of the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

Proceedings of the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis
Alexandra Balahur | Erik van der Goot | Piek Vossen | Andres Montoyo

Sentiment Analysis - What are we talking about?
Alexandra Balahur

Sentiment, Subjectivity, and Social Analysis Go To Work: An Industry View - Invited Talk
Seth Grimes

Rumor Identification and Belief Investigation on Twitter
Sardar Hamidian | Mona Diab

Modelling Valence and Arousal in Facebook posts
Daniel Preoţiuc-Pietro | H. Andrew Schwartz | Gregory Park | Johannes Eichstaedt | Margaret Kern | Lyle Ungar | Elisabeth Shulman

Purity Homophily in Social Networks - Invited Talk
Morteza Dehghani

Hit Songs’ Sentiments Harness Public Mood & Predict Stock Market
Rachel Harsley | Bhavesh Gupta | Barbara Di Eugenio | Huayi Li

Fashioning Data - A Social Media Perspective on Fast Fashion Brands
Rupak Chakraborty | Senjuti Kundu | Prakul Agarwal

Deep Learning for Sentiment Analysis - Invited Talk
Richard Socher

Sentiment Lexicon Creation using Continuous Latent Space and Neural Networks
Pedro Dias Cardoso | Anindya Roy

The Effect of Negators, Modals, and Degree Adverbs on Sentiment Composition
Svetlana Kiritchenko | Saif Mohammad

How can NLP Tasks Mutually Benefit Sentiment Analysis? A Holistic Approach to Sentiment Analysis
Lingjia Deng | Janyce Wiebe

An Unsupervised System for Visual Exploration of Twitter Conversations
Derrick Higgins | Michael Heilman | Adrianna Jelesnianska | Keith Ingersoll

Threat detection in online discussions
Aksel Wester | Lilja Øvrelid | Erik Velldal | Hugo Lewi Hammer

Classification of comment helpfulness to improve knowledge sharing among medical practitioners.
Pierre André Ménard | Caroline Barrière

Political Issue Extraction Model: A Novel Hierarchical Topic Model That Uses Tweets By Political And Non-Political Authors
Aditya Joshi | Pushpak Bhattacharyya | Mark Carman

Early text classification: a Naïve solution
Hugo Jair Escalante | Manuel Montes y Gomez | Luis Villasenor | Marcelo Luis Errecalde

Semi-supervised and unsupervised categorization of posts in Web discussion forums using part-of-speech information and minimal features
Krish Perumal | Graeme Hirst

Linguistic Understanding of Complaints and Praises in User Reviews
Guangyu Zhou | Kavita Ganesan

Reputation System: Evaluating Reputation among All Good Sellers
Vandana Jha | Savitha R | P Deepa Shenoy | Venugopal K R

Improve Sentiment Analysis of Citations with Author Modelling
Zheng Ma | Jinseok Nam | Karsten Weihe

Implicit Aspect Detection in Restaurant Reviews using Cooccurence of Words
Rrubaa Panchendrarajan | Nazick Ahamed | Brunthavan Murugaiah | Prakhash Sivakumar | Surangika Ranathunga | Akila Pemasiri

Domain Adaptation of Polarity Lexicon combining Term Frequency and Bootstrapping
Salud María Jiménez-Zafra | Maite Martin | M. Dolores Molina-Gonzalez | L. Alfonso Ureña-López

Do Enterprises Have Emotions?
Sven Buechel | Udo Hahn | Jan Goldenstein | Sebastian G. M. Händschke | Peter Walgenbach

A semantic-affective compositional approach for the affective labelling of adjective-noun and noun-noun pairs
Elisavet Palogiannidi | Elias Iosif | Polychronis Koutsakis | Alexandros Potamianos

Fracking Sarcasm using Neural Network
Aniruddha Ghosh | Tony Veale

An Hymn of an even Deeper Sentiment Analysis
Manfred Klenner

Sentiment Analysis in Twitter: A SemEval Perspective
Preslav Nakov

The Challenge of Sentiment Quantification
Fabrizio Sebastiani

A Practical Guide to Sentiment Annotation: Challenges and Solutions
Saif Mohammad

Emotions and NLP: Future Directions
Carlo Strapparava


Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications

Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications
Joel Tetreault | Jill Burstein | Claudia Leacock | Helen Yannakoudakis

The Effect of Multiple Grammatical Errors on Processing Non-Native Writing
Courtney Napoles | Aoife Cahill | Nitin Madnani

Text Readability Assessment for Second Language Learners
Menglin Xia | Ekaterina Kochmar | Ted Briscoe

Automatic Generation of Context-Based Fill-in-the-Blank Exercises Using Co-occurrence Likelihoods and Google n-grams
Jennifer Hill | Rahul Simha

Automated classification of collaborative problem solving interactions in simulated science tasks
Michael Flor | Su-Youn Yoon | Jiangang Hao | Lei Liu | Alina von Davier

Computer-assisted stylistic revision with incomplete and noisy feedback. A pilot study
Christian M. Meyer | Johann Frerik Koch

A Report on the Automatic Evaluation of Scientific Writing Shared Task
Vidas Daudaravicius | Rafael E. Banchs | Elena Volodina | Courtney Napoles

Topicality-Based Indices for Essay Scoring
Beata Beigman Klebanov | Michael Flor | Binod Gyawali

Predicting the Spelling Difficulty of Words for Language Learners
Lisa Beinborn | Torsten Zesch | Iryna Gurevych

Characterizing Text Difficulty with Word Frequencies
Xiaobin Chen | Detmar Meurers

Unsupervised Modeling of Topical Relevance in L2 Learner Text
Ronan Cummins | Helen Yannakoudakis | Ted Briscoe

UW-Stanford System Description for AESW 2016 Shared Task on Grammatical Error Detection
Dan Flickinger | Michael Goodman | Woodley Packard

Shallow Semantic Reasoning from an Incomplete Gold Standard for Learner Language
Levi King | Markus Dickinson

The NTNU-YZU System in the AESW Shared Task: Automated Evaluation of Scientific Writing Using a Convolutional Neural Network
Lung-Hao Lee | Bo-Lin Lin | Liang-Chih Yu | Yuen-Hsien Tseng

Automated scoring across different modalities
Anastassia Loukina | Aoife Cahill

Model Combination for Correcting Preposition Selection Errors
Nitin Madnani | Michael Heilman | Aoife Cahill

Pictogrammar: an AAC device based on a semantic grammar
Fernando Martínez-Santiago | Miguel Ángel García-Cumbreras | Arturo Montejo-Ráez | Manuel Carlos Díaz-Galiano

Detecting Context Dependence in Exercise Item Candidates Selected from Corpora
Ildikó Pilán

Feature-Rich Error Detection in Scientific Writing Using Logistic Regression
Madeline Remse | Mohsen Mesgar | Michael Strube

Bundled Gap Filling: A New Paradigm for Unambiguous Cloze Exercises
Michael Wojatzki | Oren Melamud | Torsten Zesch

Evaluation Dataset (DT-Grade) and Word Weighting Approach towards Constructed Short Answers Assessment in Tutorial Dialogue Context
Rajendra Banjade | Nabin Maharjan | Nobal Bikram Niraula | Dipesh Gautam | Borhan Samei | Vasile Rus

Linguistically Aware Information Retrieval: Providing Input Enrichment for Second Language Learners
Maria Chinkina | Detmar Meurers

Enhancing STEM Motivation through Personal and Communal Values: NLP for Assessment of Utility Value in Student Writing
Beata Beigman Klebanov | Jill Burstein | Judith Harackiewicz | Stacy Priniski | Matthew Mulholland

Cost-Effectiveness in Building a Low-Resource Morphological Analyzer for Learner Language
Scott Ledbetter | Markus Dickinson

Automatically Scoring Tests of Proficiency in Music Instruction
Nitin Madnani | Aoife Cahill | Brian Riordan

Combined Tree Kernel-based classifiers for Assessing Quality of Scientific Text
Liliana Mamani Sanchez | Hector-Hugo Franco-Penya

Augmenting Course Material with Open Access Textbooks
Smitha Milli | Marti A. Hearst

Exploring the Intersection of Short Answer Assessment, Authorship Attribution, and Plagiarism Detection
Björn Rudzewitz

Sentence-Level Grammatical Error Identification as Sequence-to-Sequence Correction
Allen Schmaltz | Yoon Kim | Alexander M. Rush | Stuart Shieber

Combining Off-the-shelf Grammar and Spelling Tools for the Automatic Evaluation of Scientific Writing (AESW) Shared Task 2016
René Witte | Bahar Sateli

Candidate re-ranking for SMT-based grammatical error correction
Zheng Yuan | Ted Briscoe | Mariano Felice

Spoken Text Difficulty Estimation Using Linguistic Features
Su-Youn Yoon | Yeonsuk Cho | Diane Napolitano

Automatically Extracting Topical Components for a Response-to-Text Writing Assessment
Zahra Rahimi | Diane Litman

Sentence Similarity Measures for Fine-Grained Estimation of Topical Relevance in Learner Essays
Marek Rei | Ronan Cummins

Insights from Russian second language readability classification: complexity-dependent training requirements, and feature evaluation of multiple categories
Robert Reynolds

Investigating Active Learning for Short-Answer Scoring
Andrea Horbach | Alexis Palmer



Proceedings of the Workshop on Coreference Resolution Beyond OntoNotes (CORBON 2016)

Proceedings of the Workshop on Coreference Resolution Beyond OntoNotes (CORBON 2016)
Maciej Ogrodniczuk | Vincent Ng

Sense Anaphoric Pronouns: Am I One?
Marta Recasens | Zhichao Hu | Olivia Rhinehart

Experiments on bridging across languages and genres
Yulia Grishina

Bridging Relations in Polish: Adaptation of Existing Typologies
Maciej Ogrodniczuk | Magdalena Zawisławska

Beyond Identity Coreference: Contrasting Indicators of Textual Coherence in English and German
Kerstin Kunz | Ekaterina Lapshinova-Koltunski | José Manuel Martínez

Exploring the steps of Verb Phrase Ellipsis
Zhengzhong Liu | Edgar Gonzàlez Pellicer | Daniel Gillick

Anaphoricity in Connectives: A Case Study on German
Manfred Stede | Yulia Grishina

Abstract Coreference in a Multilingual Perspective: a View on Czech and German
Anna Nedoluzhko | Ekaterina Lapshinova-Koltunski

Antecedent Prediction Without a Pipeline
Sam Wiseman | Alexander M. Rush | Stuart Shieber

Bridging Corpus for Russian in comparison with Czech
Anna Roitberg | Anna Nedoluzhko

Coreference Resolution for the Basque Language with BART
Ander Soraluze | Olatz Arregi | Xabier Arregi | Arantza Díaz de Ilarraza | Mijail Kabadjov | Massimo Poesio

Error analysis for anaphora resolution in Russian: new challenging issues for anaphora resolution task in a morphologically rich language
Svetlana Toldova | Ilya Azerkovich | Alina Ladygina | Anna Roitberg | Maria Vasilyeva

How to Handle Split Antecedents in Tamil?
Vijay Sundar Ram | Sobha Lalitha Devi

When Annotation Schemes Change Rules Help: A Configurable Approach to Coreference Resolution beyond OntoNotes
Amir Zeldes | Shuo Zhang







Proceedings of the 5th Workshop on Automated Knowledge Base Construction

Proceedings of the 5th Workshop on Automated Knowledge Base Construction
Jay Pujara | Tim Rocktäschel | Danqi Chen | Sameer Singh

Using Graphs of Classifiers to Impose Constraints on Semi-supervised Relation Extraction
Lidong Bing | William Cohen | Bhuwan Dhingra | Richard Wang

Discovering Entity Knowledge Bases on the Web
Andrew Chisholm | Will Radford | Ben Hachey

IKE - An Interactive Tool for Knowledge Extraction
Bhavana Dalvi | Sumithra Bhakthavatsalam | Chris Clark | Peter Clark | Oren Etzioni | Anthony Fader | Dirk Groeneveld

Incorporating Selectional Preferences in Multi-hop Relation Extraction
Rajarshi Das | Arvind Neelakantan | David Belanger | Andrew McCallum

Knowledge Base Population for Organization Mentions in Email
Ning Gao | Mark Dredze | Douglas Oard

Enriching Wikidata with Frame Semantics
Hatem Mousselly-Sergieh | Iryna Gurevych

Demonyms and Compound Relational Nouns in Nominal Open IE
Harinder Pal | Mausam

But What Do We Actually Know?
Simon Razniewski | Fabian Suchanek | Werner Nutt

Learning Knowledge Base Inference with Neural Theorem Provers
Tim Rocktäschel | Sebastian Riedel

The Physics of Text: Ontological Realism in Information Extraction
Stuart Russell | Ole Torp Lassen | Justin Uang | Wei Wang

Know2Look: Commonsense Knowledge for Visual Search
Sreyasi Nag Chowdhury | Niket Tandon | Gerhard Weikum

Row-less Universal Schema
Patrick Verga | Andrew McCallum

An Attentive Neural Architecture for Fine-grained Entity Type Classification
Sonse Shimaoka | Pontus Stenetorp | Kentaro Inui | Sebastian Riedel

Regularizing Relation Representations by First-order Implications
Thomas Demeester | Tim Rocktäschel | Sebastian Riedel

Applying Universal Schemas for Domain Specific Ontology Expansion
Paul Groth | Sujit Pal | Darin McBeath | Brad Allen | Ron Daniel

Design of Word Association Games using Dialog Systems for Acquisition of Word Association Knowledge
Yuichiro Machida | Daisuke Kawahara | Sadao Kurohashi | Manabu Sassano

Call for Discussion: Building a New Standard Dataset for Relation Extraction Tasks
Teresa Martin | Fiete Botschen | Ajay Nagesh | Andrew McCallum

A Comparison of Weak Supervision methods for Knowledge Base Construction
Ameet Soni | Dileep Viswanathan | Niranjan Pachaiyappan | Sriraam Natarajan

A Factorization Machine Framework for Testing Bigram Embeddings in Knowledgebase Completion
Johannes Welbl | Guillaume Bouchard | Sebastian Riedel



Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL)

Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL)
Guillaume Cabanac | Muthu Kumar Chandrasekaran | Ingo Frommholz | Kokil Jaidka | Min-Yen Kan | Philipp Mayr | Dietmar Wolfram

Bibliometrics, Information Retrieval and Natural Language Processing: Natural Synergies to Support Digital Library Research
Dietmar Wolfram

Multiple In-text Reference Aggregation Phenomenon
Marc Bertin | Iana Atanassova

Post Retraction Citations in Context
Gali Halevi | Judit Bar-Ilan

Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches
Masaki Eto

Making Sense of Massive Amounts of Scientific Publications: the Scientific Knowledge Miner Project
Francesco Ronzano | Ana Freire | Diego Saez-Trumper | Horacio Saggion

Exploring the Leading Authors and Journals in Major Topics by Citation Sentences and Topic Modeling
Ha Jin Kim | Juyoung An | Yoo Kyung Jeong | Min Song

What Papers Should I Cite from my Reading List? User Evaluation of a Manuscript Preparatory Assistive Task
Aravind Sesagiri Raamkumar | Schubert Foo | Natalie Pang

Delineating Fields Using Mathematical Jargon
Jevin West | Jason Portenoy

A Study of Reuse and Plagiarism in Speech and Natural Language Processing papers
Joseph Mariani | Gil Francopoulo | Patrick Paroubek

How do Practitioners, PhD Students and Postdocs in the Social Sciences Assess Topic-specific Recommendations?
Philipp Mayr

Overview of the CL-SciSumm 2016 Shared Task
Kokil Jaidka | Muthu Kumar Chandrasekaran | Sajal Rustagi | Min-Yen Kan

Lexical and Syntactic cues to identify Reference Scope of Citance
Peeyush Aggarwal | Richa Sharma

University of Houston at CL-SciSumm 2016: SVMs with tree kernels and Sentence Similarity
Luis Moraes | Shahryar Baki | Rakesh Verma | Daniel Lee

Identifying Referenced Text in Scientific Publications by Summarisation and Classification Techniques
Stefan Klampfl | Andi Rexha | Roman Kern

PolyU at CL-SciSumm 2016
Ziqiang Cao | Wenjie Li | Dapeng Wu

Recognizing Reference Spans and Classifying their Discourse Facets
Kun Lu | Jin Mao | Gang Li | Jian Xu

RALI System Description for CL-SciSumm 2016 Shared Task
Bruno Malenfant | Guy Lapalme

CIST System for CL-SciSumm 2016 Shared Task
Lei Li | Liyuan Mao | Yazhao Zhang | Junqi Chi | Taiwen Huang | Xiaoyue Cong | Heng Peng

NEAL: A Neurally Enhanced Approach to Linking Citation and Reference
Tadashi Nomoto

Trainable Citation-enhanced Summarization of Scientific Articles
Horacio Saggion | Ahmed AbuRa’ed | Francesco Ronzano


Proceedings of the 1st Workshop on Representation Learning for NLP

Proceedings of the 1st Workshop on Representation Learning for NLP
Phil Blunsom | Kyunghyun Cho | Shay Cohen | Edward Grefenstette | Karl Moritz Hermann | Laura Rimell | Jason Weston | Scott Wen-tau Yih

Explaining Predictions of Non-Linear Classifiers in NLP
Leila Arras | Franziska Horn | Grégoire Montavon | Klaus-Robert Müller | Wojciech Samek

Joint Learning of Sentence Embeddings for Relevance and Entailment
Petr Baudiš | Silvestr Stanko | Jan Šedivý

A Joint Model for Word Embedding and Word Morphology
Kris Cao | Marek Rei

On the Compositionality and Semantic Interpretation of English Noun Compounds
Corina Dima

Functional Distributional Semantics
Guy Emerson | Ann Copestake

Assisting Discussion Forum Users using Deep Recurrent Neural Networks
Jacob Hagstedt P Suorra | Olof Mogren

Adjusting Word Embeddings with Semantic Intensity Orders
Joo-Kyung Kim | Marie-Catherine de Marneffe | Eric Fosler-Lussier

Towards Abstraction from Extraction: Multiple Timescale Gated Recurrent Unit for Summarization
Minsoo Kim | Dennis Singh Moirangthem | Minho Lee

An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation
Jey Han Lau | Timothy Baldwin

Quantifying the Vanishing Gradient and Long Distance Dependency Problem in Recursive Neural Networks and Recursive LSTMs
Phong Le | Willem Zuidema

LSTM-Based Mixture-of-Experts for Knowledge-Aware Dialogues
Phong Le | Marc Dymetman | Jean-Michel Renders

Mapping Unseen Words to Task-Trained Embedding Spaces
Pranava Swaroop Madhyastha | Mohit Bansal | Kevin Gimpel | Karen Livescu

Multilingual Modal Sense Classification using a Convolutional Neural Network
Ana Marasović | Anette Frank

Towards cross-lingual distributed representations without parallel text trained with adversarial autoencoders
Antonio Valerio Miceli Barone

Decomposing Bilexical Dependencies into Semantic and Syntactic Vectors
Jeff Mitchell

Learning Semantic Relatedness in Community Question Answering Using Neural Models
Henry Nassif | Mitra Mohtarami | James Glass

Learning Text Similarity with Siamese Recurrent Networks
Paul Neculoiu | Maarten Versteegh | Mihai Rotaru

A Two-stage Approach for Extending Event Detection to New Types via Neural Networks
Thien Huu Nguyen | Lisheng Fu | Kyunghyun Cho | Ralph Grishman

Parameterized context windows in Random Indexing
Tobias Norlund | David Nilsson | Magnus Sahlgren

Making Sense of Word Embeddings
Maria Pelevina | Nikolay Arefiev | Chris Biemann | Alexander Panchenko

Pair Distance Distribution: A Model of Semantic Representation
Yonatan Ramni | Oded Maimon | Evgeni Khmelnitsky

Measuring Semantic Similarity of Words Using Concept Networks
Gábor Recski | Eszter Iklódi | Katalin Pajkossy | András Kornai

Using Embedding Masks for Word Categorization
Stefan Ruseti | Traian Rebedea | Stefan Trausan-Matu

Sparsifying Word Representations for Deep Unordered Sentence Modeling
Prasanna Sattigeri | Jayaraman J. Thiagarajan

Why “Blow Out”? A Structural Analysis of the Movie Dialog Dataset
Richard Searle | Megan Bingham-Walker

Learning Word Importance with the Neural Bag-of-Words Model
Imran Sheikh | Irina Illina | Dominique Fohr | Georges Linarès

A Vector Model for Type-Theoretical Semantics
Konstantin Sokolov

Towards Generalizable Sentence Embeddings
Eleni Triantafillou | Jamie Ryan Kiros | Raquel Urtasun | Richard Zemel

Domain Adaptation for Neural Networks by Parameter Augmentation
Yusuke Watanabe | Kazuma Hashimoto | Yoshimasa Tsuruoka

Neural Associative Memory for Dual-Sequence Modeling
Dirk Weissenborn


Proceedings of the 10th Linguistic Annotation Workshop held in conjunction with ACL 2016 (LAW-X 2016)

Proceedings of the 10th Linguistic Annotation Workshop held in conjunction with ACL 2016 (LAW-X 2016)
Annemarie Friedrich | Katrin Tomanek

Building a Cross-document Event-Event Relation Corpus
Yu Hong | Tongtao Zhang | Tim O’Gorman | Sharone Horowit-Hendler | Heng Ji | Martha Palmer

Annotating the Little Prince with Chinese AMRs
Bin Li | Yuan Wen | Weiguang Qu | Lijun Bu | Nianwen Xue

Converting SynTagRus Dependency Treebank into Penn Treebank Style
Alex Luu | Sophia A. Malamud | Nianwen Xue

A Discourse-Annotated Corpus of Conjoined VPs
Bonnie Webber | Rashmi Prasad | Alan Lee | Aravind Joshi

Annotating Spelling Errors in German Texts Produced by Primary School Children
Ronja Laarmann-Quante | Lukas Knichel | Stefanie Dipper | Carina Betken

Supersense tagging with inter-annotator disagreement
Héctor Martínez Alonso | Anders Johannsen | Barbara Plank

Filling in the Blanks in Understanding Discourse Adverbials: Consistency, Conflict, and Context-Dependence in a Crowdsourced Elicitation Task
Hannah Rohde | Anna Dickinson | Nathan Schneider | Christopher N. L. Clark | Annie Louis | Bonnie Webber

Comparison of Annotating Methods for Named Entity Corpora
Kanako Komiya | Masaya Suzuki | Tomoya Iwakura | Minoru Sasaki | Hiroyuki Shinnou

Different Flavors of GUM: Evaluating Genre and Sentence Type Effects on Multilayer Corpus Annotation Quality
Amir Zeldes | Dan Simonson

Addressing Annotation Complexity: The Case of Annotating Ideological Perspective in Egyptian Social Media
Heba Elfardy | Mona Diab

Evaluating Inter-Annotator Agreement on Historical Spelling Normalization
Marcel Bollmann | Stefanie Dipper | Florian Petran

A Corpus of Preposition Supersenses
Nathan Schneider | Jena D. Hwang | Vivek Srikumar | Meredith Green | Abhijit Suresh | Kathryn Conger | Tim O’Gorman | Martha Palmer

Focus Annotation of Task-based Data: Establishing the Quality of Crowd Annotation
Kordula De Kuthy | Ramon Ziai | Detmar Meurers

Part of Speech Annotation of a Turkish-German Code-Switching Corpus
Özlem Çetinoğlu | Çağrı Çöltekin

Dependency Annotation Choices: Assessing Theoretical and Practical Issues of Universal Dependencies
Kim Gerdes | Sylvain Kahane

Conversion from Paninian Karakas to Universal Dependencies for Hindi Dependency Treebank
Juhi Tandon | Himani Chaudhry | Riyaz Ahmad Bhat | Dipti Sharma

Phrase Generalization: a Corpus Study in Multi-Document Abstracts and Original News Alignments
Ariani Di-Felippo | Ani Nenkova

Generating Disambiguating Paraphrases for Structurally Ambiguous Sentences
Manjuan Duan | Ethan Hill | Michael White

Applying Universal Dependency to the Arapaho Language
Irina Wagner | Andrew Cowell | Jena D. Hwang

Annotating the discourse and dialogue structure of SMS message conversations
Nianwen Xue | Qishen Su | Sooyoung Jeong

Creating a Novel Geolocation Corpus from Historical Texts
Grant DeLozier | Ben Wing | Jason Baldridge | Scott Nesbit


Proceedings of the 12th Workshop on Multiword Expressions

Proceedings of the 12th Workshop on Multiword Expressions
Valia Kordoni | Kostadin Cholakov | Markus Egg | Stella Markantonatou | Preslav Nakov

Learning Paraphrasing for Multiword Expressions
Seid Muhie Yimam | Héctor Martínez Alonso | Martin Riedl | Chris Biemann

Exploring Long-Term Temporal Trends in the Use of Multiword Expressions
Tal Daniel | Mark Last

Lexical Variability and Compositionality: Investigating Idiomaticity with Distributional Semantic Models
Marco Silvio Giuseppe Senaldi | Gianluca E. Lebani | Alessandro Lenci

Filtering and Measuring the Intrinsic Quality of Human Compositionality Judgments
Carlos Ramisch | Silvio Cordeiro | Aline Villavicencio

Graph-based Clustering of Synonym Senses for German Particle Verbs
Moritz Wittmann | Marion Weller-Di Marco | Sabine Schulte im Walde

Accounting ngrams and multi-word terms can improve topic models
Michael Nokel | Natalia Loukachevitch

Top a Splitter: Using Distributional Semantics for Improving Compound Splitting
Patrick Ziering | Stefan Müller | Lonneke van der Plas

Using Word Embeddings for Improving Statistical Machine Translation of Phrasal Verbs
Kostadin Cholakov | Valia Kordoni

Modeling the Non-Substitutability of Multiword Expressions with Distributional Semantics and a Log-Linear Model
Meghdad Farahmand | James Henderson

Phrase Representations for Multiword Expressions
Joël Legrand | Ronan Collobert

Representing Support Verbs in FrameNet
Miriam R. L. Petruck | Michael Ellsworth

Inherently Pronominal Verbs in Czech: Description and Conversion Based on Treebank Annotation
Zdeňka Urešová | Eduard Bejček | Jan Hajič

Using collocational features to improve automated scoring of EFL texts
Yves Bestgen

A study on the production of collocations by European Portuguese learners
Ângela Costa | Luísa Coheur | Teresa Lino

Extraction and Recognition of Polish Multiword Expressions using Wikipedia and Finite-State Automata
Paweł Chrząszcz

Impact of MWE Resources on Multiword Recognition
Martin Riedl | Chris Biemann

A Word Embedding Approach to Identifying Verb-Noun Idiomatic Combinations
Waseem Gharbieh | Virendra Bhavsar | Paul Cook


Proceedings of the 7th Workshop on Cognitive Aspects of Computational Language Learning

Proceedings of the 7th Workshop on Cognitive Aspects of Computational Language Learning
Anna Korhonen | Alessandro Lenci | Brian Murphy | Thierry Poibeau | Aline Villavicencio

Automated Discourse Analysis of Narrations by Adolescents with Autistic Spectrum Disorder
Michaela Regneri | Diane King

Detection of Alzheimer’s disease based on automatic analysis of common objects descriptions
Laura Hernández-Domínguez | Edgar García-Cano | Sylvie Ratté | Gerardo Sierra-Martínez

Conversing with the elderly in Latin America: a new cohort for multimodal, multilingual longitudinal studies on aging
Laura Hernández-Domínguez | Sylvie Ratté | Boyd Davis | Charlene Pope

Leveraging Annotators’ Gaze Behaviour for Coreference Resolution
Joe Cheri | Abhijit Mishra | Pushpak Bhattacharyya

From alignment of etymological data to phylogenetic inference via population genetics
Javad Nouri | Roman Yangarber

An incremental model of syntactic bootstrapping
Christos Christodoulopoulos | Dan Roth | Cynthia Fisher

Longitudinal Studies of Variation Sets in Child-directed Speech
Mats Wirén | Kristina Nilsson Björkenstam | Gintarė Grigonytė | Elisabet Eir Cortes

Learning Phone Embeddings for Word Segmentation of Child-Directed Speech
Jianqiang Ma | Çağrı Çöltekin | Erhard Hinrichs

Generalization in Artificial Language Learning: Modelling the Propensity to Generalize
Raquel G. Alhama | Willem Zuidema

Explicit Causal Connections between the Acquisition of Linguistic Tiers: Evidence from Dynamical Systems Modeling
Daniel Spokoyny | Jeremy Irvin | Fermin Moscoso del Prado Martin

Modelling the informativeness and timing of non-verbal cues in parent-child interaction
Kristina Nilsson Björkenstam | Mats Wirén | Robert Östling


pdf (full)
bib (full)
Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

pdf bib
Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology
Micha Elsner | Sandra Kuebler

pdf bib
Mining linguistic tone patterns with symbolic representation
Shuo Zhang

pdf bib
The SIGMORPHON 2016 Shared Task—Morphological Reinflection
Ryan Cotterell | Christo Kirov | John Sylak-Glassman | David Yarowsky | Jason Eisner | Mans Hulden

pdf bib
Morphological reinflection with convolutional neural networks
Robert Östling

pdf bib
EHU at the SIGMORPHON 2016 Shared Task. A Simple Proposal: Grapheme-to-Phoneme for Inflection
Iñaki Alegria | Izaskun Etxeberria

pdf bib
Morphological Reinflection via Discriminative String Transduction
Garrett Nicolai | Bradley Hauer | Adam St Arnaud | Grzegorz Kondrak

pdf bib
Morphological reinflection with conditional random fields and unsupervised features
Ling Liu | Lingshuang Jack Mao

pdf bib
Improving Sequence to Sequence Learning for Morphological Inflection Generation: The BIU-MIT Systems for the SIGMORPHON 2016 Shared Task for Morphological Reinflection
Roee Aharoni | Yoav Goldberg | Yonatan Belinkov

pdf bib
Evaluating Sequence Alignment for Learning Inflectional Morphology
David King

pdf bib
Using longest common subsequence and character models to predict word forms
Alexey Sorokin

pdf bib
MED: The LMU System for the SIGMORPHON 2016 Shared Task on Morphological Reinflection
Katharina Kann | Hinrich Schütze

pdf bib
The Columbia University - New York University Abu Dhabi SIGMORPHON 2016 Morphological Reinflection Shared Task Submission
Dima Taji | Ramy Eskander | Nizar Habash | Owen Rambow

pdf bib
Letter Sequence Labeling for Compound Splitting
Jianqiang Ma | Verena Henrich | Erhard Hinrichs

pdf bib
Automatic Detection of Intra-Word Code-Switching
Dong Nguyen | Leonie Cornips

pdf bib
Read my points: Effect of animation type when speech-reading from EMA data
Kristy James | Martijn Wieling

pdf bib
Predicting the Direction of Derivation in English Conversion
Max Kisselew | Laura Rimell | Alexis Palmer | Sebastian Padó

pdf bib
Morphological Segmentation Can Improve Syllabification
Garrett Nicolai | Lei Yao | Grzegorz Kondrak

pdf bib
Towards a Formal Representation of Components of German Compounds
Thierry Declerck | Piroska Lendvai

pdf bib
Towards robust cross-linguistic comparisons of phonological networks
Philippa Shoemark | Sharon Goldwater | James Kirby | Rik Sarkar

pdf bib
Morphotactics as Tier-Based Strictly Local Dependencies
Alëna Aksënova | Thomas Graf | Sedigheh Moradi

pdf bib
A Multilinear Approach to the Unsupervised Learning of Morphology
Anthony Meyer | Markus Dickinson

pdf bib
Inferring Morphotactics from Interlinear Glossed Text: Combining Clustering and Precision Grammars
Olga Zamaraeva


pdf (full)
bib (full)
Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

pdf bib
Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities
Nils Reiter | Beatrice Alex | Kalliopi A. Zervanou

pdf bib
Brave New World: Uncovering Topical Dynamics in the ACL Anthology Reference Corpus Using Term Life Cycle Information
Anne-Kathrin Schumann

pdf bib
Analysis of Policy Agendas: Lessons Learned from Automatic Topic Classification of Croatian Political Texts
Mladen Karan | Jan Šnajder | Daniela Širinić | Goran Glavaš

pdf bib
Searching Four-Millenia-Old Digitized Documents: A Text Retrieval System for Egyptologists
Estíbaliz Iglesias-Franjo | Jesús Vilares

pdf bib
Old Swedish Part-of-Speech Tagging between Variation and External Knowledge
Yvonne Adesam | Gerlof Bouma

pdf bib
Code-Switching Ubique Est - Language Identification and Part-of-Speech Tagging for Historical Mixed Text
Sarah Schulz | Mareike Keller

pdf bib
Dealing with word-internal modification and spelling variation in data-driven lemmatization
Fabian Barteld | Ingrid Schröder | Heike Zinsmeister

pdf bib
You Shall Know People by the Company They Keep: Person Name Disambiguation for Social Network Construction
Mariona Coll Ardanuy | Maarten van den Bos | Caroline Sporleder

pdf bib
Deriving Players & Themes in the Regesta Imperii using SVMs and Neural Networks
Juri Opitz | Anette Frank

pdf bib
Semi-automated annotation of page-based documents within the Genre and Multimodality framework
Tuomo Hiippala

pdf bib
Nomen Omen. Enhancing the Latin Morphological Analyser Lemlat with an Onomasticon
Marco Budassi | Marco Passarotti

pdf bib
How Do Cultural Differences Impact the Quality of Sarcasm Annotation?: A Case Study of Indian Annotators and American Text
Aditya Joshi | Pushpak Bhattacharyya | Mark Carman | Jaya Saraswati | Rajita Shukla

pdf bib
Combining Phonology and Morphology for the Normalization of Historical Texts
Izaskun Etxeberria | Iñaki Alegria | Larraitz Uria | Mans Hulden

pdf bib
Towards Building a Political Protest Database to Explain Changes in the Welfare State
Çağıl Sönmez | Arzucan Özgür | Erdem Yörük

pdf bib
An Assessment of Experimental Protocols for Tracing Changes in Word Semantics Relative to Accuracy and Reliability
Johannes Hellrich | Udo Hahn

pdf bib
Universal Morphology for Old Hungarian
Eszter Simon | Veronika Vincze

pdf bib
Automatic Identification of Suicide Notes from Linguistic and Sentiment Features
Annika Marie Schoene | Nina Dethlefs

pdf bib
Towards a text analysis system for political debates
Dieu-Thu Le | Ngoc Thang Vu | Andre Blessing

pdf bib
Whodunit... and to Whom? Subjects, Objects, and Actions in Research Articles on American Labor Unions
Vilja Hulden

pdf bib
An NLP Pipeline for Coptic
Amir Zeldes | Caroline T. Schroeder

pdf bib
Automatic discovery of Latin syntactic changes
Micha Elsner | Emily Lane

pdf bib
Information-based Modeling of Diachronic Linguistic Change: from Typicality to Productivity
Stefania Degaetano-Ortlieb | Elke Teich


bib (full)
Proceedings of the First Conference on Machine Translation: Volume 1, Research Papers

pdf bib
Proceedings of the First Conference on Machine Translation: Volume 1, Research Papers
Ondřej Bojar | Christian Buck | Rajen Chatterjee | Christian Federmann | Liane Guillou | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Aurélie Névéol | Mariana Neves | Pavel Pecina | Martin Popel | Philipp Koehn | Christof Monz | Matteo Negri | Matt Post | Lucia Specia | Karin Verspoor | Jörg Tiedemann | Marco Turchi

pdf bib
Cross-language Projection of Dependency Trees with Constrained Partial Parsing for Tree-to-Tree Machine Translation
Yu Shen | Chenhui Chu | Fabien Cromieres | Sadao Kurohashi

pdf bib
Improving Pronoun Translation by Modeling Coreference Uncertainty
Ngoc Quang Luong | Andrei Popescu-Belis

pdf bib
Modeling verbal inflection for English to German SMT
Anita Ramm | Alexander Fraser

pdf bib
Modeling Selectional Preferences of Verbs and Nouns in String-to-Tree Machine Translation
Maria Nădejde | Alexandra Birch | Philipp Koehn

pdf bib
Modeling Complement Types in Phrase-Based SMT
Marion Weller-Di Marco | Alexander Fraser | Sabine Schulte im Walde

pdf bib
Alignment-Based Neural Machine Translation
Tamer Alkhouli | Gabriel Bretschner | Jan-Thorsten Peter | Mohammed Hethnawi | Andreas Guta | Hermann Ney

pdf bib
Neural Network-based Word Alignment through Score Aggregation
Joël Legrand | Michael Auli | Ronan Collobert

pdf bib
Using Factored Word Representation in Neural Network Language Models
Jan Niehues | Thanh-Le Ha | Eunah Cho | Alex Waibel

pdf bib
Linguistic Input Features Improve Neural Machine Translation
Rico Sennrich | Barry Haddow

pdf bib
A Framework for Discriminative Rule Selection in Hierarchical Moses
Fabienne Braune | Alexander Fraser | Hal Daumé III | Aleš Tamchyna

pdf bib
Fast and highly parallelizable phrase table for statistical machine translation
Nikolay Bogoychev | Hieu Hoang

pdf bib
A Comparative Study on Vocabulary Reduction for Phrase Table Smoothing
Yunsu Kim | Andreas Guta | Joern Wuebker | Hermann Ney

pdf bib
Examining the Relationship between Preordering and Word Order Freedom in Machine Translation
Joachim Daiber | Miloš Stanojević | Wilker Aziz | Khalil Sima’an


bib (full)
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

bib
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers
Ondřej Bojar | Christian Buck | Rajen Chatterjee | Christian Federmann | Liane Guillou | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Aurélie Névéol | Mariana Neves | Pavel Pecina | Martin Popel | Philipp Koehn | Christof Monz | Matteo Negri | Matt Post | Lucia Specia | Karin Verspoor | Jörg Tiedemann | Marco Turchi

pdf bib
Findings of the 2016 Conference on Machine Translation
Ondřej Bojar | Rajen Chatterjee | Christian Federmann | Yvette Graham | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Philipp Koehn | Varvara Logacheva | Christof Monz | Matteo Negri | Aurélie Névéol | Mariana Neves | Martin Popel | Matt Post | Raphael Rubino | Carolina Scarton | Lucia Specia | Marco Turchi | Karin Verspoor | Marcos Zampieri

pdf bib
Results of the WMT16 Metrics Shared Task
Ondřej Bojar | Yvette Graham | Amir Kamran | Miloš Stanojević

pdf bib
Results of the WMT16 Tuning Shared Task
Bushra Jawaid | Amir Kamran | Miloš Stanojević | Ondřej Bojar

pdf bib
LIMSI@WMT’16: Machine Translation of News
Alexandre Allauzen | Lauriane Aufrant | Franck Burlot | Ophélie Lacroix | Elena Knyazeva | Thomas Lavergne | Guillaume Wisniewski | François Yvon

pdf bib
TÜBİTAK SMT System Submission for WMT2016
Emre Bektaş | Ertuğrul Yilmaz | Coşkun Mermer | İlknur Durgar El-Kahlout

pdf bib
ParFDA for Instance Selection for Statistical Machine Translation
Ergun Biçici

pdf bib
Sheffield Systems for the English-Romanian WMT Translation Task
Frédéric Blain | Xingyi Song | Lucia Specia

pdf bib
MetaMind Neural Machine Translation System for WMT 2016
James Bradbury | Richard Socher

pdf bib
NYU-MILA Neural Machine Translation Systems for WMT’16
Junyoung Chung | Kyunghyun Cho | Yoshua Bengio

pdf bib
The JHU Machine Translation Systems for WMT 2016
Shuoyang Ding | Kevin Duh | Huda Khayrallah | Philipp Koehn | Matt Post

pdf bib
Yandex School of Data Analysis approach to English-Turkish translation at WMT16 News Translation Task
Anton Dvorkovich | Sergey Gubanov | Irina Galinskaya

pdf bib
Hybrid Morphological Segmentation for Phrase-Based Machine Translation
Stig-Arne Grönroos | Sami Virpioja | Mikko Kurimo

pdf bib
The AFRL-MITLL WMT16 News-Translation Task Systems
Jeremy Gwinnup | Tim Anderson | Grant Erdmann | Katherine Young | Michaeel Kazi | Elizabeth Salesky | Brian Thompson

pdf bib
The Karlsruhe Institute of Technology Systems for the News Translation Task in WMT 2016
Thanh-Le Ha | Eunah Cho | Jan Niehues | Mohammed Mediani | Matthias Sperber | Alexandre Allauzen | Alexander Waibel

pdf bib
The Edinburgh/LMU Hierarchical Machine Translation System for WMT 2016
Matthias Huck | Alexander Fraser | Barry Haddow

pdf bib
The AMU-UEDIN Submission to the WMT16 News Translation Task: Attention-based NMT Models as Feature Functions in Phrase-based SMT
Marcin Junczys-Dowmunt | Tomasz Dwojak | Rico Sennrich

pdf bib
NRC Russian-English Machine Translation System for WMT 2016
Chi-kiu Lo | Colin Cherry | George Foster | Darlene Stewart | Rabib Islam | Anna Kazantseva | Roland Kuhn

pdf bib
Merged bilingual trees based on Universal Dependencies in Machine Translation
David Mareček

pdf bib
PROMT Translation Systems for WMT 2016 Translation Tasks
Alexander Molchanov | Fedor Bykov

pdf bib
The QT21/HimL Combined Machine Translation System
Jan-Thorsten Peter | Tamer Alkhouli | Hermann Ney | Matthias Huck | Fabienne Braune | Alexander Fraser | Aleš Tamchyna | Ondřej Bojar | Barry Haddow | Rico Sennrich | Frédéric Blain | Lucia Specia | Jan Niehues | Alex Waibel | Alexandre Allauzen | Lauriane Aufrant | Franck Burlot | Elena Knyazeva | Thomas Lavergne | François Yvon | Mārcis Pinnis | Stella Frank

pdf bib
The RWTH Aachen University English-Romanian Machine Translation System for WMT 2016
Jan-Thorsten Peter | Tamer Alkhouli | Andreas Guta | Hermann Ney

pdf bib
Abu-MaTran at WMT 2016 Translation Task: Deep Learning, Morphological Segmentation and Tuning on Character Sequences
Víctor M. Sánchez-Cartagena | Antonio Toral

pdf bib
Edinburgh Neural Machine Translation Systems for WMT 16
Rico Sennrich | Barry Haddow | Alexandra Birch

pdf bib
The Edit Distance Transducer in Action: The University of Cambridge English-German System at WMT16
Felix Stahlberg | Eva Hasler | Bill Byrne

pdf bib
CUNI-LMU Submissions in WMT2016: Chimera Constrained and Beaten
Aleš Tamchyna | Roman Sudarikov | Ondřej Bojar | Alexander Fraser

pdf bib
Phrase-Based SMT for Finnish with More Data, Better Models and Alternative Alignment and Translation Tools
Jörg Tiedemann | Fabienne Cap | Jenna Kanerva | Filip Ginter | Sara Stymne | Robert Östling | Marion Weller-Di Marco

pdf bib
Edinburgh’s Statistical Machine Translation Systems for WMT16
Philip Williams | Rico Sennrich | Maria Nădejde | Matthias Huck | Barry Haddow | Ondřej Bojar

pdf bib
PJAIT Systems for the WMT 2016
Krzysztof Wolk | Krzysztof Marasek

pdf bib
DFKI’s system for WMT16 IT-domain task, including analysis of systematic errors
Eleftherios Avramidis | Aljoscha Burchardt | Vivien Macketanz | Ankit Srivastava

pdf bib
ILLC-UvA Adaptation System (Scorpio) at WMT’16 IT-DOMAIN Task
Hoang Cuong | Stella Frank | Khalil Sima’an

pdf bib
Data Selection for IT Texts using Paragraph Vector
Mirela-Stefania Duma | Wolfgang Menzel

pdf bib
SMT and Hybrid systems of the QTLeap project in the WMT16 IT-task
Rosa Gaudio | Gorka Labaka | Eneko Agirre | Petya Osenova | Kiril Simov | Martin Popel | Dieke Oele | Gertjan van Noord | Luís Gomes | João António Rodrigues | Steven Neale | João Silva | Andreia Querido | Nuno Rendeiro | António Branco

pdf bib
JU-USAAR: A Domain Adaptive MT System
Koushik Pahari | Alapan Kuila | Santanu Pal | Sudip Kumar Naskar | Sivaji Bandyopadhyay | Josef van Genabith

pdf bib
Dictionary-based Domain Adaptation of MT Systems without Retraining
Rudolf Rosa | Roman Sudarikov | Michal Novák | Martin Popel | Ondřej Bojar

pdf bib
English-Portuguese Biomedical Translation Task Using a Genuine Phrase-Based Statistical Machine Translation Approach
José Aires | Gabriel Lopes | Luís Gomes

pdf bib
The TALPUPC Spanish–English WMT Biomedical Task: Bilingual Embeddings and Char-based Neural Language Model Rescoring in a Phrase-based System
Marta R. Costa-jussà | Cristina España-Bonet | Pranava Madhyastha | Carlos Escolano | José A. R. Fonollosa

pdf bib
LIMSI’s Contribution to the WMT’16 Biomedical Translation Task
Julia Ive | Aurélien Max | François Yvon

pdf bib
IXA Biomedical Translation System at WMT16 Biomedical Translation Task
Olatz Perez-de-Viñaspre | Gorka Labaka

pdf bib
CobaltF: A Fluent Metric for MT Evaluation
Marina Fomicheva | Núria Bel | Lucia Specia | Iria da Cunha | Anton Malinovskiy

pdf bib
DTED: Evaluation of Machine Translation Structure Using Dependency Parsing and Tree Edit Distance
Martin McCaffery | Mark-Jan Nederhof

pdf bib
chrF deconstructed: beta parameters and n-gram weights
Maja Popović

pdf bib
CharacTer: Translation Edit Rate on Character Level
Weiyue Wang | Jan-Thorsten Peter | Hendrik Rosendahl | Hermann Ney

pdf bib
Extract Domain-specific Paraphrase from Monolingual Corpus for Automatic Evaluation of Machine Translation
Lilin Zhang | Zhen Weng | Wenyan Xiao | Jianyi Wan | Zhiming Chen | Yiming Tan | Maoxi Li | Mingwen Wang

pdf bib
Particle Swarm Optimization Submission for WMT16 Tuning Task
Viktor Kocur | Ondřej Bojar

pdf bib
Findings of the 2016 WMT Shared Task on Cross-lingual Pronoun Prediction
Liane Guillou | Christian Hardmeier | Preslav Nakov | Sara Stymne | Jörg Tiedemann | Yannick Versley | Mauro Cettolo | Bonnie Webber | Andrei Popescu-Belis

pdf bib
A Shared Task on Multimodal Machine Translation and Crosslingual Image Description
Lucia Specia | Stella Frank | Khalil Sima’an | Desmond Elliott

pdf bib
Findings of the WMT 2016 Bilingual Document Alignment Shared Task
Christian Buck | Philipp Koehn

pdf bib
Cross-lingual Pronoun Prediction with Linguistically Informed Features
Rachel Bawden

pdf bib
The Kyoto University Cross-Lingual Pronoun Translation System
Raj Dabre | Yevgeniy Puzikov | Fabien Cromieres | Sadao Kurohashi

pdf bib
Pronoun Prediction with Latent Anaphora Resolution
Christian Hardmeier

pdf bib
It-disambiguation and source-aware language models for cross-lingual pronoun prediction
Sharid Loáiciga | Liane Guillou | Christian Hardmeier

pdf bib
Pronoun Language Model and Grammatical Heuristics for Aiding Pronoun Prediction
Ngoc Quang Luong | Andrei Popescu-Belis

pdf bib
Cross-Lingual Pronoun Prediction with Deep Recurrent Neural Networks
Juhani Luotolahti | Jenna Kanerva | Filip Ginter

pdf bib
Pronoun Prediction with Linguistic Features and Example Weighing
Michal Novák

pdf bib
Feature Exploration for Cross-Lingual Pronoun Prediction
Sara Stymne

pdf bib
A Linear Baseline Classifier for Cross-Lingual Pronoun Prediction
Jörg Tiedemann

pdf bib
Cross-lingual Pronoun Prediction for English, French and German with Maximum Entropy Classification
Dominikus Wetzel

pdf bib
Does Multimodality Help Human and Machine for Translation and Image Captioning?
Ozan Caglayan | Walid Aransa | Yaxing Wang | Marc Masana | Mercedes García-Martínez | Fethi Bougares | Loïc Barrault | Joost van de Weijer

pdf bib
DCU-UvA Multimodal MT System Report
Iacer Calixto | Desmond Elliott | Stella Frank

pdf bib
Attention-based Multimodal Neural Machine Translation
Po-Yao Huang | Frederick Liu | Sz-Rung Shiang | Jean Oh | Chris Dyer

pdf bib
CUNI System for WMT16 Automatic Post-Editing and Multimodal Translation Tasks
Jindřich Libovický | Jindřich Helcl | Marek Tlustý | Ondřej Bojar | Pavel Pecina

pdf bib
WMT 2016 Multimodal Translation System Description based on Bidirectional Recurrent Neural Networks with Double-Embeddings
Sergio Rodríguez Guasch | Marta R. Costa-jussà

pdf bib
SHEF-Multimodal: Grounding Machine Translation on Images
Kashif Shah | Josiah Wang | Lucia Specia

pdf bib
DOCAL - Vicomtech’s Participation in the WMT16 Shared Task on Bilingual Document Alignment
Andoni Azpeitia | Thierry Etchegoyhen

pdf bib
Quick and Reliable Document Alignment via TF/IDF-weighted Cosine Distance
Christian Buck | Philipp Koehn

pdf bib
YODA System for WMT16 Shared Task: Bilingual Document Alignment
Aswarth Abhilash Dara | Yiu-Chang Lin

pdf bib
Bitextor’s participation in WMT’16: shared task on document alignment
Miquel Esplà-Gomis | Mikel Forcada | Sergio Ortiz-Rojas | Jorge Ferrández-Tordera

pdf bib
Bilingual Document Alignment with Latent Semantic Indexing
Ulrich Germann

pdf bib
First Steps Towards Coverage-Based Document Alignment
Luís Gomes | Gabriel Pereira Lopes

pdf bib
BAD LUC@WMT 2016: a Bilingual Document Alignment Platform Based on Lucene
Laurent Jakubina | Phillippe Langlais

pdf bib
Using Term Position Similarity and Language Modeling for Bilingual Document Alignment
Thanh C. Le | Hoa Trong Vu | Jonathan Oberländer | Ondřej Bojar

pdf bib
The ADAPT Bilingual Document Alignment system at WMT16
Pintu Lohar | Haithem Afli | Chao-Hong Liu | Andy Way

pdf bib
WMT2016: A Hybrid Approach to Bilingual Document Alignment
Sainik Mahata | Dipankar Das | Santanu Pal

pdf bib
English-French Document Alignment Based on Keywords and Statistical Translation
Marek Medveď | Miloš Jakubíček | Vojtech Kovář

pdf bib
The ILSP/ARC submission to the WMT 2016 Bilingual Document Alignment Shared Task
Vassilis Papavassiliou | Prokopis Prokopidis | Stelios Piperidis

pdf bib
Word Clustering Approach to Bilingual Document Alignment (WMT 2016 Shared Task)
Vadim Shchukin | Dmitry Khristich | Irina Galinskaya

pdf bib
The FBK Participation in the WMT 2016 Automatic Post-editing Shared Task
Rajen Chatterjee | José G. C. de Souza | Matteo Negri | Marco Turchi

pdf bib
Log-linear Combinations of Monolingual and Bilingual Neural Machine Translation Models for Automatic Post-Editing
Marcin Junczys-Dowmunt | Roman Grundkiewicz

pdf bib
USAAR: An Operation Sequential Model for Automatic Statistical Post-Editing
Santanu Pal | Marcos Zampieri | Josef van Genabith

pdf bib
Bilingual Embeddings and Word Alignments for Translation Quality Estimation
Amal Abdelsalam | Ondřej Bojar | Samhaa El-Beltagy

pdf bib
SHEF-MIME: Word-level Quality Estimation Using Imitation Learning
Daniel Beck | Andreas Vlachos | Gustavo Paetzold | Lucia Specia

pdf bib
Referential Translation Machines for Predicting Translation Performance
Ergun Biçici

pdf bib
UAlacant word-level and phrase-level machine translation quality estimation systems at WMT 2016
Miquel Esplà-Gomis | Felipe Sánchez-Martínez | Mikel Forcada

pdf bib
Recurrent Neural Network based Translation Quality Estimation
Hyun Kim | Jong-Hyeok Lee

pdf bib
YSDA Participation in the WMT’16 Quality Estimation Shared Task
Anna Kozlova | Mariya Shmatova | Anton Frolov

pdf bib
USFD’s Phrase-level Quality Estimation Systems
Varvara Logacheva | Frédéric Blain | Lucia Specia

pdf bib
Unbabel’s Participation in the WMT16 Word-Level Translation Quality Estimation Shared Task
André F. T. Martins | Ramón Astudillo | Chris Hokamp | Fabio Kepler

pdf bib
SimpleNets: Quality Estimation with Resource-Light Neural Networks
Gustavo Paetzold | Lucia Specia

pdf bib
Translation Quality Estimation using Recurrent Neural Network
Raj Nath Patel | Sasikumar M

pdf bib
The UU Submission to the Machine Translation Quality Estimation Task
Oscar Sagemo | Sara Stymne

pdf bib
Word embeddings and discourse information for Quality Estimation
Carolina Scarton | Daniel Beck | Kashif Shah | Karin Sim Smith | Lucia Specia

pdf bib
SHEF-LIUM-NN: Sentence level Quality Estimation with Neural Network Features
Kashif Shah | Fethi Bougares | Loïc Barrault | Lucia Specia

pdf bib
UGENT-LT3 SCATE Submission for WMT16 Shared Task on Quality Estimation
Arda Tezcan | Véronique Hoste | Lieve Macken



pdf (full)
bib (full)
Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP

pdf bib
Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP

pdf bib
Intrinsic Evaluation of Word Vectors Fails to Predict Extrinsic Performance
Billy Chiu | Anna Korhonen | Sampo Pyysalo

pdf bib
A critique of word similarity as a method for evaluating distributional semantic models
Miroslav Batchkarov | Thomas Kober | Jeremy Reffin | Julie Weeds | David Weir

pdf bib
Issues in evaluating semantic spaces using word analogies
Tal Linzen

pdf bib
Evaluating Word Embeddings Using a Representative Suite of Practical Tasks
Neha Nayak | Gabor Angeli | Christopher D. Manning

pdf bib
Story Cloze Evaluator: Vector Space Representation Evaluation by Predicting What Happens Next
Nasrin Mostafazadeh | Lucy Vanderwende | Wen-tau Yih | Pushmeet Kohli | James Allen

pdf bib
Problems With Evaluation of Word Embeddings Using Word Similarity Tasks
Manaal Faruqui | Yulia Tsvetkov | Pushpendre Rastogi | Chris Dyer

pdf bib
Intrinsic Evaluations of Word Embeddings: What Can We Do Better?
Anna Gladkova | Aleksandr Drozd

pdf bib
Find the word that does not belong: A Framework for an Intrinsic Evaluation of Word Vector Representations
José Camacho-Collados | Roberto Navigli

pdf bib
Capturing Discriminative Attributes in a Distributional Space: Task Proposal
Alicia Krebs | Denis Paperno

pdf bib
An Improved Crowdsourcing Based Evaluation Technique for Word Embedding Methods
Farhana Ferdousi Liza | Marek Grześ

pdf bib
Evaluation of acoustic word embeddings
Sahar Ghannay | Yannick Estève | Nathalie Camelin | Paul Deleglise

pdf bib
Evaluating Embeddings using Syntax-based Classification Tasks as a Proxy for Parser Performance
Arne Köhn

pdf bib
Evaluating vector space models using human semantic priming results
Allyson Ettinger | Tal Linzen

pdf bib
Evaluating embeddings on dictionary-based similarity
Judit Ács | András Kornai

pdf bib
Evaluating multi-sense embeddings for semantic resolution monolingually and in word translation
Gábor Borbély | Márton Makrai | Dávid Márk Nemeskey | András Kornai

pdf bib
Subsumption Preservation as a Comparative Measure for Evaluating Sense-Directed Embeddings
Ali Seyed

pdf bib
Evaluating Informal-Domain Word Representations With UrbanDictionary
Naomi Saphra

pdf bib
Thematic fit evaluation: an aspect of selectional preferences
Asad Sayeed | Clayton Greenberg | Vera Demberg

pdf bib
Improving Reliability of Word Similarity Evaluation by Redesigning Annotation Task and Performance Measure
Oded Avraham | Yoav Goldberg

pdf bib
Correlation-based Intrinsic Evaluation of Word Vector Representations
Yulia Tsvetkov | Manaal Faruqui | Chris Dyer

pdf bib
Evaluating word embeddings with fMRI and eye-tracking
Anders Søgaard

pdf bib
Defining Words with Words: Beyond the Distributional Hypothesis
Iuliana-Elena Parasca | Andreas Lukas Rauter | Jack Roper | Aleksandar Rusinov | Guillaume Bouchard | Sebastian Riedel | Pontus Stenetorp

pdf bib
A Proposal for Linguistic Similarity Datasets Based on Commonality Lists
Dmitrijs Milajevs | Sascha Griffiths

pdf bib
Probing for semantic evidence of composition by means of simple classification tasks
Allyson Ettinger | Ahmed Elgohary | Philip Resnik

pdf bib
SLEDDED: A Proposed Dataset of Event Descriptions for Evaluating Phrase Representations
Laura Rimell | Eva Maria Vecchi

pdf bib
Sentence Embedding Evaluation Using Pyramid Annotation
Tal Baumel | Raphael Cohen | Michael Elhadad


pdf (full)
bib (full)
Proceedings of the 10th Web as Corpus Workshop

pdf bib
Proceedings of the 10th Web as Corpus Workshop
Paul Cook | Stefan Evert | Roland Schäfer | Egon Stemle

pdf bib
Automatic Classification by Topic Domain for Meta Data Generation, Web Corpus Evaluation, and Corpus Comparison
Roland Schäfer | Felix Bildhauer

pdf bib
Efficient construction of metadata-enhanced web corpora
Adrien Barbaresi

pdf bib
Topically-focused Blog Corpora for Multiple Languages
Andrew Salway | Dag Elgesem | Knut Hofland | Øystein Reigem | Lubos Steskal

pdf bib
The Challenges and Joys of Analysing Ongoing Language Change in Web-based Corpora: a Case Study
Anne Krause

pdf bib
Using the Web and Social Media as Corpora for Monitoring the Spread of Neologisms. The case of ‘rapefugee’, ‘rapeugee’, and ‘rapugee’.
Quirin Würschinger | Mohammad Fazleh Elahi | Desislava Zhekova | Hans-Jörg Schmid

pdf bib
EmpiriST 2015: A Shared Task on the Automatic Linguistic Annotation of Computer-Mediated Communication and Web Corpora
Michael Beißwenger | Sabine Bartsch | Stefan Evert | Kay-Michael Würzner

pdf bib
SoMaJo: State-of-the-art tokenization for German web and social media texts
Thomas Proisl | Peter Uhrig

pdf bib
UdS-(retrain|distributional|surface): Improving POS Tagging for OOV Words in German CMC and Web Data
Jakob Prange | Andrea Horbach | Stefan Thater

pdf bib
Babler - Data Collection from the Web to Support Speech Recognition and Keyword Search
Gideon Mendels | Erica Cooper | Julia Hirschberg

pdf bib
A Global Analysis of Emoji Usage
Nikola Ljubešić | Darja Fišer

pdf bib
Genre classification for a corpus of academic webpages
Erika Dalan | Serge Sharoff

pdf bib
On Bias-free Crawling and Representative Web Corpora
Roland Schäfer

pdf bib
EmpiriST: AIPHES - Robust Tokenization and POS-Tagging for Different Genres
Steffen Remus | Gerold Hintz | Chris Biemann | Christian M. Meyer | Darina Benikova | Judith Eckle-Kohler | Margot Mieskes | Thomas Arnold

pdf bib
bot.zen @ EmpiriST 2015 - A minimally-deep learning PoS-tagger (trained for German CMC and Web data)
Egon Stemle

pdf bib
LTL-UDE @ EmpiriST 2015: Tokenization and PoS Tagging of Social Media Text
Tobias Horsmann | Torsten Zesch


pdf (full)
bib (full)
Proceedings of the Sixth Named Entity Workshop

pdf bib
Proceedings of the Sixth Named Entity Workshop
Xiangyu Duan | Rafael E. Banchs | Min Zhang | Haizhou Li | A Kumaran

pdf bib
Leveraging Entity Linking and Related Language Projection to Improve Name Transliteration
Ying Lin | Xiaoman Pan | Aliya Deri | Heng Ji | Kevin Knight

pdf bib
Multi-source named entity typing for social media
Reuth Vexler | Einat Minkov

pdf bib
Evaluating and Combining Name Entity Recognition Systems
Ridong Jiang | Rafael E. Banchs | Haizhou Li

pdf bib
German NER with a Multilingual Rule Based Information Extraction System: Analysis and Issues
Anna Druzhkina | Alexey Leontyev | Maria Stepanova

pdf bib
Spanish NER with Word Representations and Conditional Random Fields
Jenny Linet Copara Zea | Jose Eduardo Ochoa Luna | Camilo Thorne | Goran Glavaš

pdf bib
Constructing a Japanese Basic Named Entity Corpus of Various Genres
Tomoya Iwakura | Kanako Komiya | Ryuichi Tachibana

pdf bib
Linguistic Issues in the Machine Transliteration of Chinese, Japanese and Arabic Names
Jack Halpern

pdf bib
Whitepaper of NEWS 2016 Shared Task on Machine Transliteration
Xiangyu Duan | Min Zhang | Haizhou Li | Rafael Banchs | A Kumaran

pdf bib
Report of NEWS 2016 Machine Transliteration Shared Task
Xiangyu Duan | Rafael Banchs | Min Zhang | Haizhou Li | A Kumaran

pdf bib
Applying Neural Networks to English-Chinese Named Entity Transliteration
Yan Shao | Joakim Nivre

pdf bib
Target-Bidirectional Neural Models for Machine Transliteration
Andrew Finch | Lemao Liu | Xiaolin Wang | Eiichiro Sumita

pdf bib
Regulating Orthography-Phonology Relationship for English to Thai Transliteration
Binh Minh Nguyen | Hoang Gia Ngo | Nancy F. Chen

pdf bib
Moses-based official baseline for NEWS 2016
Marta R. Costa-jussà


pdf (full)
bib (full)
Proceedings of the Third Workshop on Argument Mining (ArgMining2016)

pdf bib
Proceedings of the Third Workshop on Argument Mining (ArgMining2016)
Chris Reed

pdf bib
“What Is Your Evidence?” A Study of Controversial Topics on Social Media
Aseel Addawood | Masooda Bashir

pdf bib
Summarizing Multi-Party Argumentative Conversations in Reader Comment on News
Emma Barker | Robert Gaizauskas

pdf bib
Argumentative texts and clause types
Maria Becker | Alexis Palmer | Anette Frank

pdf bib
Contextual stance classification of opinions: A step towards enthymeme reconstruction in online reviews
Pavithra Rajendran | Danushka Bollegala | Simon Parsons

pdf bib
The CASS Technique for Evaluating the Performance of Argument Mining
Rory Duthie | John Lawrence | Katarzyna Budzynska | Chris Reed

pdf bib
Extracting Case Law Sentences for Argumentation about the Meaning of Statutory Terms
Jaromír Šavelka | Kevin D. Ashley

pdf bib
Scrutable Feature Sets for Stance Classification
Angrosh Mandya | Advaith Siddharthan | Adam Wyner

pdf bib
Argumentation: Content, Structure, and Relationship with Essay Quality
Beata Beigman Klebanov | Christian Stab | Jill Burstein | Yi Song | Binod Gyawali | Iryna Gurevych

pdf bib
Neural Attention Model for Classification of Sentences that Support Promoting/Suppressing Relationship
Yuta Koreeda | Toshihiko Yanase | Kohsuke Yanai | Misa Sato | Yoshiki Niwa

pdf bib
Towards Feasible Guidelines for the Annotation of Argument Schemes
Elena Musi | Debanjan Ghosh | Smaranda Muresan

pdf bib
Identifying Argument Components through TextRank
Georgios Petasis | Vangelis Karkaletsis

pdf bib
Rhetorical structure and argumentation structure in monologue text
Andreas Peldszus | Manfred Stede

pdf bib
Recognizing the Absence of Opposing Arguments in Persuasive Essays
Christian Stab | Iryna Gurevych

pdf bib
Expert Stance Graphs for Computational Argumentation
Orith Toledo-Ronen | Roy Bar-Haim | Noam Slonim

pdf bib
Fill the Gap! Analyzing Implicit Premises between Claims from Online Debates
Filip Boltužić | Jan Šnajder

pdf bib
Summarising the points made in online political debates
Charlie Egan | Advaith Siddharthan | Adam Wyner

pdf bib
What to Do with an Airport? Mining Arguments in the German Online Participation Project Tempelhofer Feld
Matthias Liebeck | Katharina Esau | Stefan Conrad

pdf bib
Unshared task: (Dis)agreement in online debates
Maria Skeppstedt | Magnus Sahlgren | Carita Paradis | Andreas Kerren

pdf bib
Unshared Task at the 3rd Workshop on Argument Mining: Perspective Based Local Agreement and Disagreement in Online Debate
Chantal van Son | Tommaso Caselli | Antske Fokkens | Isa Maks | Roser Morante | Lora Aroyo | Piek Vossen

pdf bib
A Preliminary Study of Disputation Behavior in Online Debating Forum
Zhongyu Wei | Yandi Xia | Chen Li | Yang Liu | Zachary Stallbohm | Yi Li | Yang Jin



pdf (full)
bib (full)
Proceedings of the 15th Workshop on Biomedical Natural Language Processing

pdf bib
Proceedings of the 15th Workshop on Biomedical Natural Language Processing
Kevin Bretonnel Cohen | Dina Demner-Fushman | Sophia Ananiadou | Jun-ichi Tsujii

pdf bib
A Machine Learning Approach to Clinical Terms Normalization
José Castaño | María Laura Gambarte | Hee Joon Park | Maria del Pilar Avila Williams | David Pérez | Fernando Campos | Daniel Luna | Sonia Benítez | Hernán Berinsky | Sofía Zanetti

pdf bib
Improved Semantic Representation for Domain-Specific Entities
Mohammad Taher Pilehvar | Nigel Collier

pdf bib
Identification, characterization, and grounding of gradable terms in clinical text
Chaitanya Shivade | Marie-Catherine de Marneffe | Eric Fosler-Lussier | Albert M. Lai

pdf bib
Graph-based Semi-supervised Gene Mention Tagging
Golnar Sheikhshab | Elizabeth Starks | Aly Karsan | Anoop Sarkar | Inanc Birol

pdf bib
Feature Derivation for Exploitation of Distant Annotation via Pattern Induction against Dependency Parses
Dayne Freitag | John Niekrasz

pdf bib
Inferring Implicit Causal Relationships in Biomedical Literature
Halil Kilicoglu

pdf bib
SnapToGrid: From Statistical to Interpretable Models for Biomedical Information Extraction
Marco A. Valenzuela-Escárcega | Gus Hahn-Powell | Dane Bell | Mihai Surdeanu

pdf bib
Character based String Kernels for Bio-Entity Relation Detection
Ritambhara Singh | Yanjun Qi

pdf bib
Disambiguation of entities in MEDLINE abstracts by combining MeSH terms with knowledge
Amy Siu | Patrick Ernst | Gerhard Weikum

pdf bib
Using Distributed Representations to Disambiguate Biomedical and Clinical Concepts
Stéphan Tulkens | Simon Suster | Walter Daelemans

pdf bib
Unsupervised Document Classification with Informed Topic Models
Timothy Miller | Dmitriy Dligach | Guergana Savova

pdf bib
Vocabulary Development To Support Information Extraction of Substance Abuse from Psychiatry Notes
Sumithra Velupillai | Danielle L. Mowery | Mike Conway | John Hurdle | Brent Kious

pdf bib
Syntactic analyses and named entity recognition for PubMed and PubMed Central — up-to-the-minute
Kai Hakala | Suwisa Kaewphan | Tapio Salakoski | Filip Ginter

pdf bib
Improving Temporal Relation Extraction with Training Instance Augmentation
Chen Lin | Timothy Miller | Dmitriy Dligach | Steven Bethard | Guergana Savova

pdf bib
Using Centroids of Word Embeddings and Word Mover’s Distance for Biomedical Document Retrieval in Question Answering
Georgios-Ioannis Brokos | Prodromos Malakasiotis | Ion Androutsopoulos

pdf bib
Measuring the State of the Art of Automated Pathway Curation Using Graph Algorithms - A Case Study of the mTOR Pathway
Michael Spranger | Sucheendra Palaniappan | Samik Gosh

pdf bib
Construction of a Personal Experience Tweet Corpus for Health Surveillance
Keyuan Jiang | Ricardo Calix | Matrika Gupta

pdf bib
Modelling the Combination of Generic and Target Domain Embeddings in a Convolutional Neural Network for Sentence Classification
Nut Limsopatham | Nigel Collier

pdf bib
PubTermVariants: biomedical term variants and their use for PubMed search
Lana Yeganova | Won Kim | Sun Kim | Rezarta Islamaj Doğan | Wanli Liu | Donald C Comeau | Zhiyong Lu | W John Wilbur

pdf bib
This before That: Causal Precedence in the Biomedical Domain
Gus Hahn-Powell | Dane Bell | Marco A. Valenzuela-Escárcega | Mihai Surdeanu

pdf bib
Syntactic methods for negation detection in radiology reports in Spanish
Viviana Cotik | Vanesa Stricker | Jorge Vivaldi | Horacio Rodriguez

pdf bib
How to Train good Word Embeddings for Biomedical NLP
Billy Chiu | Gamal Crichton | Anna Korhonen | Sampo Pyysalo

pdf bib
An Information Foraging Approach to Determining the Number of Relevant Features
Brian Connolly | Benjamin Glass | John Pestian

pdf bib
Assessing the Feasibility of an Automated Suggestion System for Communicating Critical Findings from Chest Radiology Reports to Referring Physicians
Brian E. Chapman | Danielle L. Mowery | Evan Narasimhan | Neel Patel | Wendy Chapman | Marta Heilbrun

pdf bib
Building a dictionary of lexical variants for phenotype descriptors
Simon Kocbek | Tudor Groza

pdf bib
Applying deep learning on electronic health records in Swedish to predict healthcare-associated infections
Olof Jacobson | Hercules Dalianis

pdf bib
Identifying First Episodes of Psychosis in Psychiatric Patient Records using Machine Learning
Genevieve Gorrell | Sherifat Oduola | Angus Roberts | Tom Craig | Craig Morgan | Rob Stewart

pdf bib
Relation extraction from clinical texts using domain invariant convolutional neural network
Sunil Sahu | Ashish Anand | Krishnadev Oruganty | Mahanandeeshwar Gattu



pdf (full)
bib (full)
Proceedings of the 4th BioNLP Shared Task Workshop

pdf bib
Proceedings of the 4th BioNLP Shared Task Workshop
Claire Nédellec | Robert Bossy | Jin-Dong Kim

pdf bib
Overview of the Regulatory Network of Plant Seed Development (SeeDev) Task at the BioNLP Shared Task 2016.
Estelle Chaix | Bertrand Dubreucq | Abdelhak Fatihi | Dialekti Valsamou | Robert Bossy | Mouhamadou Ba | Louise Deléger | Pierre Zweigenbaum | Philippe Bessières | Loic Lepiniec | Claire Nédellec

pdf bib
Overview of the Bacteria Biotope Task at BioNLP Shared Task 2016
Louise Deléger | Robert Bossy | Estelle Chaix | Mouhamadou Ba | Arnaud Ferré | Philippe Bessières | Claire Nédellec

pdf bib
Refactoring the Genia Event Extraction Shared Task Toward a General Framework for IE-Driven KB Development
Jin-Dong Kim | Yue Wang | Nicola Colic | Seung Han Beak | Yong Hwan Kim | Min Song

pdf bib
LitWay, Discriminative Extraction for Different Bio-Events
Chen Li | Zhiqiang Rao | Xiangrong Zhang

pdf bib
VERSE: Event and Relation Extraction in the BioNLP 2016 Shared Task
Jake Lever | Steven JM Jones

pdf bib
A dictionary- and rule-based system for identification of bacteria and habitats in text
Helen V Cook | Evangelos Pafilis | Lars Juhl Jensen

pdf bib
Ontology-Based Categorization of Bacteria and Habitat Entities using Information Retrieval Techniques
Mert Tiftikci | Hakan Şahin | Berfu Büyüköz | Alper Yayıkçı | Arzucan Özgür

pdf bib
Identification of Mentions and Relations between Bacteria and Biotope from PubMed Abstracts
Cyril Grouin

pdf bib
Deep Learning with Minimal Training Data: TurkuNLP Entry in the BioNLP Shared Task 2016
Farrokh Mehryary | Jari Björne | Sampo Pyysalo | Tapio Salakoski | Filip Ginter

pdf bib
SeeDev Binary Event Extraction using SVMs and a Rich Feature Set
Nagesh C. Panyam | Gitansh Khirbat | Karin Verspoor | Trevor Cohn | Kotagiri Ramamohanarao

pdf bib
Extraction of Regulatory Events using Kernel-based Classifiers and Distant Supervision
Andre Lamurias | Miguel J. Rodrigues | Luka A. Clarke | Francisco M. Couto

pdf bib
DUTIR in BioNLP-ST 2016: Utilizing Convolutional Network and Distributed Representation to Extract Complicate Relations
Honglei Li | Jianhai Zhang | Jian Wang | Hongfei Lin | Zhihao Yang

pdf bib
Extracting Biomedical Event Using Feature Selection and Word Representation
Xinyu He | Lishuang Li | Jieqiong Zheng | Meiyue Qin




pdf (full)
bib (full)
Proceedings of the 5th Workshop on Vision and Language

pdf bib
Proceedings of the 5th Workshop on Vision and Language
Anya Belz | Erkut Erdem | Krystian Mikolajczyk | Katerina Pastra

pdf bib
Automatic Annotation of Structured Facts in Images
Mohamed Elhoseiny | Scott Cohen | Walter Chang | Brian Price | Ahmed Elgammal

pdf bib
Combining Lexical and Spatial Knowledge to Predict Spatial Relations between Objects in Images
Manuela Hürlimann | Johan Bos

pdf bib
Focused Evaluation for Image Description with Binary Forced-Choice Tasks
Micah Hodosh | Julia Hockenmaier

pdf bib
Leveraging Captions in the Wild to Improve Object Detection
Mert Kilickaya | Nazli Ikizler-Cinbis | Erkut Erdem | Aykut Erdem

pdf bib
Natural Language Descriptions of Human Activities Scenes: Corpus Generation and Analysis
Nouf Alharbi | Yoshihiko Gotoh

pdf bib
Interactively Learning Visually Grounded Word Meanings from a Human Tutor
Yanchao Yu | Arash Eshghi | Oliver Lemon

pdf bib
Pragmatic Factors in Image Description: The Case of Negations
Emiel van Miltenburg | Roser Morante | Desmond Elliott

pdf bib
Building a Bagpipe with a Bag and a Pipe: Exploring Conceptual Combination in Vision
Sandro Pezzelle | Ravi Shekhar | Raffaella Bernardi

pdf bib
Exploring Different Preposition Sets, Models and Feature Sets in Automatic Generation of Spatial Image Descriptions
Anja Belz | Adrian Muscat | Brandon Birmingham

pdf bib
Multi30K: Multilingual English-German Image Descriptions
Desmond Elliott | Stella Frank | Khalil Sima’an | Lucia Specia

pdf bib
“Look, some Green Circles!”: Learning to Quantify from Images
Ionut Sorodoc | Angeliki Lazaridou | Gemma Boleda | Aurélie Herbelot | Sandro Pezzelle | Raffaella Bernardi

pdf bib
Text2voronoi: An Image-driven Approach to Differential Diagnosis
Alexander Mehler | Tolga Uslu | Wahed Hemati

pdf bib
Detecting Visually Relevant Sentences for Fine-Grained Classification
Olivia Winn | Madhavan Kavanur Kidambi | Smaranda Muresan



pdf (full)
bib (full)
Proceedings of the 12th International Workshop on Tree Adjoining Grammars and Related Formalisms (TAG+12)

pdf bib
Proceedings of the 12th International Workshop on Tree Adjoining Grammars and Related Formalisms (TAG+12)
David Chiang | Alexander Koller

pdf bib
Coordination in Minimalist Grammars: Excorporation and Across the Board (Head) Movement
John Torr | Edward P. Stabler

pdf bib
ArabTAG: from a Handcrafted to a Semi-automatically Generated TAG
Chérifa Ben Khelil | Denys Duchier | Yannick Parmentier | Chiraz Zribi | Fériel Ben Fraj

pdf bib
Interfacing Sentential and Discourse TAG-based Grammars
Laurence Danlos | Aleksandre Maskharashvili | Sylvain Pogodalla

pdf bib
Modelling Discourse in STAG: Subordinate Conjunctions and Attributing Phrases
Timothée Bernard | Laurence Danlos

pdf bib
Argument linking in LTAG: A constraint-based implementation with XMG
Laura Kallmeyer | Timm Lichte | Rainer Osswald | Simon Petitjean

pdf bib
Verbal fields in Hungarian simple sentences and infinitival clausal complements
Kata Balogh

pdf bib
Modelling the ziji Blocking Effect and Constraining Bound Variable Derivations in MC-TAG with Delayed Locality
Dennis Ryan Storoshenko

pdf bib
Node-based Induction of Tree-Substitution Grammars
Rose Sloan

pdf bib
Revisiting Supertagging and Parsing: How to Use Supertags in Transition-Based Parsing
Wonchang Chung | Suhas Siddhesh Mhatre | Alexis Nasr | Owen Rambow | Srinivas Bangalore

pdf bib
An Alternate View on Strong Lexicalization in TAG
Aniello De Santo | Alëna Aksënova | Thomas Graf

pdf bib
Hyperedge Replacement and Nonprojective Dependency Structures
Daniel Bauer | Owen Rambow

pdf bib
Parasitic Gaps and the Heterogeneity of Dependency Formation in STAG
Dennis Ryan Storoshenko | Robert Frank



bib (full) Proceedings of the 19th Annual Conference of the European Association for Machine Translation

pdf bib
Proceedings of the 19th Annual Conference of the European Association for Machine Translation

pdf bib
Patterns of Terminological Variation in Post-editing and of Cognate Use in Machine Translation in Contrast to Human Translation
Oliver Čulo | Jean Nitzke

pdf bib
Graphonological Levenshtein Edit Distance: Application for Automated Cognate Identification
Bogdan Babych

pdf bib
Improving Phrase-Based SMT Using Cross-Granularity Embedding Similarity
Peyman Passban | Chris Hokamp | Andy Way | Qun Liu

pdf bib
Comparing Translator Acceptability of TM and SMT Outputs
Joss Moorkens | Andy Way

pdf bib
Stand-off Annotation of Web Content as a Legally Safer Alternative to Crawling for Distribution
Mikel L. Forcada | Miquel Esplà-Gomis | Juan Antonio Pérez-Ortiz

pdf bib
Combining Translation Memories and Syntax-Based SMT: Experiments with Real Industrial Data
Liangyou Li | Carla Parra Escartin | Qun Liu

pdf bib
The Trouble with Machine Translation Coherence
Karin Sim Smith | Wilker Aziz | Lucia Specia

pdf bib
Pivoting Methods and Data for Czech-Vietnamese Translation via English
Duc Tam Hoang | Ondrej Bojar

pdf bib
Detecting Grammatical Errors in Machine Translation Output Using Dependency Parsing and Treebank Querying
Arda Tezcan | Veronique Hoste | Lieve Macken

pdf bib
Potential and Limits of Using Post-edits as Reference Translations for MT Evaluation
Maja Popovic | Mihael Arčan | Arle Lommel

pdf bib
Can Text Simplification Help Machine Translation?
Sanja Štajner | Maja Popovic

pdf bib
A Portable Method for Parallel and Comparable Document Alignment
Thierry Etchegoyhen | Andoni Azpeitia

pdf bib
Semantic Textual Similarity in Quality Estimation
Hanna Bechara | Carla Parra Escartin | Constantin Orasan | Lucia Specia

pdf bib
Climbing Mont BLEU: The Strange World of Reachable High-BLEU Translations
Aaron Smith | Christian Hardmeier | Joerg Tiedemann

pdf bib
Interactive-Predictive Translation Based on Multiple Word-Segments
Miguel Domingo | Alvaro Peris | Francisco Casacuberta

pdf bib
A Contextual Language Model to Improve Machine Translation of Pronouns by Re-ranking Translation Hypotheses
Ngoc Quang Luong | Andrei Popescu-Belis

pdf bib
Predicting and Using Implicit Discourse Elements in Chinese-English Translation
David Steele | Lucia Specia

pdf bib
A Graphical Pronoun Analysis Tool for the PROTEST Pronoun Evaluation Test Suite
Christian Hardmeier | Liane Guillou

pdf bib
Measuring Cognitive Translation Effort with Activity Units
Moritz Jonas Schaeffer | Michael Carl | Isabel Lacruz | Akiko Aizawa

pdf bib
A Comparative Study of Post-editing Guidelines
Ke Hu | Patrick Cadwell

pdf bib
Dealing with Data Sparseness in SMT with Factored Models and Morphological Expansion: a Case Study on Croatian
Victor M. Sánchez-Cartagena | Nikola Ljubešić | Filip Klubička

pdf bib
Collaborative Development of a Rule-Based Machine Translator between Croatian and Serbian
Filip Klubička | Gema Ramírez-Sánchez | Nikola Ljubešić

pdf bib
Re-assessing the Impact of SMT Techniques with Human Evaluation: a Case Study on English—Croatian
Antonio Toral | Raphael Rubino | Gema Ramírez-Sánchez

pdf bib
Proceedings of the 19th Annual Conference of the EAMT: Projects/Products
European Association for Machine Translation



pdf (full)
bib (full)
Proceedings of the 2nd International Workshop on Natural Language Generation and the Semantic Web (WebNLG 2016)

pdf bib
Proceedings of the 2nd International Workshop on Natural Language Generation and the Semantic Web (WebNLG 2016)
Claire Gardent | Aldo Gangemi

pdf bib
Generating sets of related sentences from input seed features
Cristina Barros | Elena Lloret

pdf bib
A Repository of Frame Instance Lexicalizations for Generation
Valerio Basile

pdf bib
Processing Document Collections to Automatically Extract Linked Data: Semantic Storytelling Technologies for Smart Curation Workflows
Peter Bourgonje | Julian Moreno Schneider | Georg Rehm | Felix Sasaki

pdf bib
On the Robustness of Standalone Referring Expression Generation Algorithms Using RDF Data
Pablo Duboue | Martin Ariel Domínguez | Paula Estrella

pdf bib
Content Selection through Paraphrase Detection: Capturing different Semantic Realisations of the Same Idea
Elena Lloret | Claire Gardent

pdf bib
Aligning Texts and Knowledge Bases with Semantic Sentence Simplification
Yassine Mrabet | Pavlos Vougiouklis | Halil Kilicoglu | Claire Gardent | Dina Demner-Fushman | Jonathon Hare | Elena Simperl

pdf bib
Building a System for Stock News Generation in Russian
Liubov Nesterenko

pdf bib
Content selection as semantic-based ontology exploration
Laura Perez-Beltrachini | Claire Gardent | Anselme Revuz | Saptarashmi Bandyopadhyay

pdf bib
ReadME generation from an OWL ontology describing NLP tools
Driss Sadoun | Satenik Mkhitaryan | Damien Nouvel | Mathieu Valette

pdf bib
Comparing the Template-Based Approach to GF: the case of Afrikaans
Lauren Sanby | Ion Todd | Maria C. Keet

pdf bib
Generating Paraphrases from DBPedia using Deep Learning
Amin Sleimi | Claire Gardent

pdf bib
Automatic Tweet Generation From Traffic Incident Data
Khoa Tran | Fred Popowich

pdf bib
Analysing the Integration of Semantic Web Features for Document Planning across Genres
Marta Vicente | Elena Lloret



pdf (full)
bib (full)
Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue

pdf bib
Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Raquel Fernandez | Wolfgang Minker | Giuseppe Carenini | Ryuichiro Higashinaka | Ron Artstein | Alesia Gainer

pdf bib
Towards End-to-End Learning for Dialog State Tracking and Management using Deep Reinforcement Learning
Tiancheng Zhao | Maxine Eskenazi

pdf bib
Task Lineages: Dialog State Tracking for Flexible Interaction
Sungjin Lee | Amanda Stent

pdf bib
Joint Online Spoken Language Understanding and Language Modeling With Recurrent Neural Networks
Bing Liu | Ian Lane

pdf bib
Creating and Characterizing a Diverse Corpus of Sarcasm in Dialogue
Shereen Oraby | Vrindavan Harrison | Lena Reed | Ernesto Hernandez | Ellen Riloff | Marilyn Walker

pdf bib
The SENSEI Annotated Corpus: Human Summaries of Reader Comment Conversations in On-line News
Emma Barker | Monica Lestari Paramita | Ahmet Aker | Emina Kurtic | Mark Hepple | Robert Gaizauskas

pdf bib
Special Session - The Future Directions of Dialogue-Based Intelligent Personal Assistants
Yoichi Matsuyama | Alexandros Papangelis

pdf bib
Keynote - More than meets the ear: Processes that shape dialogue
Susan Brennan

pdf bib
A Wizard-of-Oz Study on A Non-Task-Oriented Dialog Systems That Reacts to User Engagement
Zhou Yu | Leah Nicolich-Henkin | Alan W Black | Alexander Rudnicky

pdf bib
Classifying Emotions in Customer Support Dialogues in Social Media
Jonathan Herzig | Guy Feigenblat | Michal Shmueli-Scheuer | David Konopnicki | Anat Rafaeli | Daniel Altman | David Spivak

pdf bib
Cultural Communication Idiosyncrasies in Human-Computer Interaction
Juliana Miehle | Koichiro Yoshino | Louisa Pragst | Stefan Ultes | Satoshi Nakamura | Wolfgang Minker

pdf bib
Using phone features to improve dialogue state tracking generalisation to unseen states
Iñigo Casanueva | Thomas Hain | Mauro Nicolao | Phil Green

pdf bib
Character Identification on Multiparty Conversation: Identifying Mentions of Characters in TV Shows
Yu-Hsin Chen | Jinho D. Choi

pdf bib
Policy Networks with Two-Stage Training for Dialogue Systems
Mehdi Fatemi | Layla El Asri | Hannes Schulz | Jing He | Kaheer Suleman

pdf bib
Language Portability for Dialogue Systems: Translating a Question-Answering System from English into Tamil
Satheesh Ravi | Ron Artstein

pdf bib
Extracting PDTB Discourse Relations from Student Essays
Kate Forbes-Riley | Fan Zhang | Diane Litman

pdf bib
Empirical comparison of dependency conversions for RST discourse trees
Katsuhiko Hayashi | Tsutomu Hirao | Masaaki Nagata

pdf bib
The Role of Discourse Units in Near-Extractive Summarization
Junyi Jessy Li | Kapil Thadani | Amanda Stent

pdf bib
Initiations and Interruptions in a Spoken Dialog System
Leah Nicolich-Henkin | Carolyn Rosé | Alan W Black

pdf bib
Analyzing Post-dialogue Comments by Speakers – How Do Humans Personalize Their Utterances in Dialogue? –
Toru Hirano | Ryuichiro Higashinaka | Yoshihiro Matsuo

pdf bib
On the Contribution of Discourse Structure on Text Complexity Assessment
Elnaz Davoodi | Leila Kosseim

pdf bib
Syntactic parsing of chat language in contact center conversation corpus
Alexis Nasr | Geraldine Damnati | Aleksandra Guerraz | Frederic Bechet

pdf bib
A Context-aware Natural Language Generator for Dialogue Systems
Ondřej Dušek | Filip Jurčíček

pdf bib
Identifying Teacher Questions Using Automatic Speech Recognition in Classrooms
Nathaniel Blanchard | Patrick Donnelly | Andrew M. Olney | Borhan Samei | Brooke Ward | Xiaoyi Sun | Sean Kelly | Martin Nystrand | Sidney K. D’Mello

pdf bib
A framework for the automatic inference of stochastic turn-taking styles
Kornel Laskowski

pdf bib
Talking with ERICA, an autonomous android
Koji Inoue | Pierrick Milhorat | Divesh Lala | Tianyu Zhao | Tatsuya Kawahara

pdf bib
Rapid Prototyping of Form-driven Dialogue Systems Using an Open-source Framework
Svetlana Stoyanchev | Pierre Lison | Srinivas Bangalore

pdf bib
LVCSR System on a Hybrid GPU-CPU Embedded Platform for Real-Time Dialog Applications
Alexei V. Ivanov | Patrick L. Lange | David Suendermann-Oeft

pdf bib
Socially-Aware Animated Intelligent Personal Assistant Agent
Yoichi Matsuyama | Arjun Bhardwaj | Ran Zhao | Oscar Romeo | Sushma Akoju | Justine Cassell

pdf bib
Selection method of an appropriate response in chat-oriented dialogue systems
Hideaki Mori | Masahiro Araki

pdf bib
Real-Time Understanding of Complex Discriminative Scene Descriptions
Ramesh Manuvinakurike | Casey Kennington | David DeVault | David Schlangen

pdf bib
Supporting Spoken Assistant Systems with a Graphical User Interface that Signals Incremental Understanding and Prediction State
Casey Kennington | David Schlangen

pdf bib
Toward incremental dialogue act segmentation in fast-paced interactive dialogue systems
Ramesh Manuvinakurike | Maike Paetzel | Cheng Qu | David Schlangen | David DeVault

pdf bib
Keynote - Modeling Human Communication Dynamics
Louis-Philippe Morency

pdf bib
On the Evaluation of Dialogue Systems with Next Utterance Classification
Ryan Lowe | Iulian Vlad Serban | Michael Noseworthy | Laurent Charlin | Joelle Pineau

pdf bib
Towards Using Conversations with Spoken Dialogue Systems in the Automated Assessment of Non-Native Speakers of English
Diane Litman | Steve Young | Mark Gales | Kate Knill | Karen Ottewell | Rogier van Dalen | David Vandyke

pdf bib
Measuring the Similarity of Sentential Arguments in Dialogue
Amita Misra | Brian Ecker | Marilyn Walker

pdf bib
Investigating Fluidity for Human-Robot Interaction with Real-time, Real-world Grounding Strategies
Julian Hough | David Schlangen

pdf bib
Do Characters Abuse More Than Words?
Yashar Mehdad | Joel Tetreault

pdf bib
Towards a dialogue system that supports rich visualizations of data
Abhinav Kumar | Jillian Aurisano | Barbara Di Eugenio | Andrew Johnson | Alberto Gonzalez | Jason Leigh

pdf bib
Analyzing the Effect of Entrainment on Dialogue Acts
Masahiro Mizukami | Koichiro Yoshino | Graham Neubig | David Traum | Satoshi Nakamura

pdf bib
Towards an Entertaining Natural Language Generation System: Linguistic Peculiarities of Japanese Fictional Characters
Chiaki Miyazaki | Toru Hirano | Ryuichiro Higashinaka | Yoshihiro Matsuo

pdf bib
Reference Resolution in Situated Dialogue with Learned Semantics
Xiaolong Li | Kristy Boyer

pdf bib
Training an adaptive dialogue policy for interactive learning of visually grounded word meanings
Yanchao Yu | Arash Eshghi | Oliver Lemon

pdf bib
Learning Fine-Grained Knowledge about Contingent Relations between Everyday Events
Elahe Rahimtoroghi | Ernesto Hernandez | Marilyn Walker

pdf bib
When do we laugh?
Ye Tian | Chiara Mazzocconi | Jonathan Ginzburg

pdf bib
Small Talk Improves User Impressions of Interview Dialogue Systems
Takahiro Kobori | Mikio Nakano | Tomoaki Nakamura

pdf bib
Automatic Recognition of Conversational Strategies in the Service of a Socially-Aware Dialog System
Ran Zhao | Tanmay Sinha | Alan Black | Justine Cassell

pdf bib
Neural Utterance Ranking Model for Conversational Dialogue Systems
Michimasa Inaba | Kenichi Takahashi

pdf bib
Strategy and Policy Learning for Non-Task-Oriented Conversational Systems
Zhou Yu | Ziyu Xu | Alan W Black | Alexander Rudnicky



pdf (full)
bib (full)
Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)

pdf bib
Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)
Dekai Wu | Pushpak Bhattacharyya

pdf bib
Compound Type Identification in Sanskrit: What Roles do the Corpus and Grammar Play?
Amrith Krishna | Pavankumar Satuluri | Shubham Sharma | Apurv Kumar | Pawan Goyal

We propose a classification framework for semantic type identification of compounds in Sanskrit. We broadly classify the compounds into four classes, namely Avyayībhāva, Tatpuruṣa, Bahuvrīhi and Dvandva. Our classification is based on the traditional classification system followed by the ancient grammar treatise Aṣṭādhyāyī, proposed by Pāṇini 25 centuries ago. We construct an elaborate feature space for our system by combining conditional rules from the grammar Aṣṭādhyāyī, semantic relations between the compound components from the lexical database Amarakoṣa, and linguistic structures learned from the data using Adaptor Grammars. Our in-depth analysis of the feature space highlights the inadequacy of Aṣṭādhyāyī, a generative grammar, in classifying the data samples. Our experimental results validate the effectiveness of using lexical databases, as suggested by Amba Kulkarni and Anil Kumar, and put forward a new research direction by introducing linguistic patterns obtained from Adaptor Grammars for effective identification of compound type. We utilise an ensemble-based approach, specifically designed for handling skewed datasets, and achieve an overall accuracy of 0.77 using random forest classifiers.

pdf bib
Comparison of Grapheme-to-Phoneme Conversion Methods on a Myanmar Pronunciation Dictionary
Ye Kyaw Thu | Win Pa Pa | Yoshinori Sagisaka | Naoto Iwahashi

Grapheme-to-Phoneme (G2P) conversion is the task of predicting the pronunciation of a word given its graphemic or written form. It is a highly important part of both automatic speech recognition (ASR) and text-to-speech (TTS) systems. In this paper, we evaluate seven G2P conversion approaches on a manually tagged Myanmar phoneme dictionary: Adaptive Regularization of Weight Vectors (AROW) based structured learning (S-AROW), Conditional Random Field (CRF), Joint-sequence models (JSM), phrase-based statistical machine translation (PBSMT), Recurrent Neural Network (RNN), Support Vector Machine (SVM) based point-wise classification, and Weighted Finite-state Transducers (WFST). The G2P bootstrapping experimental results were measured with both automatic phoneme error rate (PER) calculation and manual checking in terms of voiced/unvoiced, tone, consonant and vowel errors. The results show that the CRF, PBSMT and WFST approaches are the best-performing methods for G2P conversion for the Myanmar language.
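
The abstract above reports phoneme error rate (PER) without spelling out the computation. As a rough illustration only, PER is commonly taken to be the total edit distance between predicted and reference phoneme sequences divided by the total number of reference phonemes. The sketch below assumes that definition; the function names and the toy phoneme symbols are hypothetical and are not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences (lists of symbols)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(references, hypotheses):
    """Total edit errors divided by the total number of reference phonemes."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    return errors / total


# Toy data (hypothetical symbols, not Myanmar phonemes from the paper).
refs = [["k", "a", "t"], ["d", "o", "g"]]
hyps = [["k", "a", "t"], ["d", "o"]]
per = phoneme_error_rate(refs, hyps)  # one deletion over six reference phonemes
```

Whether the paper normalises by reference length in exactly this way is an assumption; the abstract only names the metric.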

pdf bib
Character-Aware Neural Networks for Arabic Named Entity Recognition for Social Media
Mourad Gridach

Named Entity Recognition (NER) is the task of classifying or labelling atomic elements in a text into categories such as Person, Location or Organisation. For Arabic, recognizing named entities is a challenging task because of the complexity and unique characteristics of the language. In addition, most previous work focuses on Modern Standard Arabic (MSA), whereas recognizing named entities in social media is attracting growing interest. Both Dialectal Arabic (DA) and MSA are used in social media, which poses a further challenge. Most state-of-the-art Arabic NER systems rely heavily on handcrafted features and lexicons, which are time-consuming to build. In this paper, we introduce a novel neural network architecture that automatically benefits from both character- and word-level representations by combining a bidirectional LSTM with a Conditional Random Field (CRF), eliminating the need for most feature engineering. Moreover, our model relies on unsupervised word representations learned from unannotated corpora. Experimental results demonstrate that our model achieves state-of-the-art performance on a publicly available benchmark for Arabic NER in social media, surpassing the previous system by a large margin.

pdf bib
Development of a Bengali parser by cross-lingual transfer from Hindi
Ayan Das | Agnivo Saha | Sudeshna Sarkar

In recent years there has been a lot of interest in cross-lingual parsing for developing treebanks for languages with small or no annotated treebanks. In this paper, we explore the development of a cross-lingual transfer parser from Hindi to Bengali using a Hindi parser and a Hindi-Bengali parallel corpus. A parser is trained and applied to the Hindi sentences of the parallel corpus and the parse trees are projected to construct probable parse trees of the corresponding Bengali sentences. Only about 14% of these trees are complete (transferred trees contain all the target sentence words) and they are used to construct a Bengali parser. We relax the criteria of completeness to consider well-formed trees (43% of the trees) leading to an improvement. We note that the words often do not have a one-to-one mapping in the two languages but considering sentences at the chunk-level results in better correspondence between the two languages. Based on this we present a method to use chunking as a preprocessing step and do the transfer on the chunk trees. We find that about 72% of the projected parse trees of Bengali are now well-formed. The resultant parser achieves significant improvement in both Unlabeled Attachment Score (UAS) as well as Labeled Attachment Score (LAS) over the baseline word-level transferred parser.
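
For readers unfamiliar with the two metrics named above, the following minimal sketch shows how UAS and LAS are usually computed: UAS counts tokens whose predicted head matches the gold head, and LAS additionally requires the dependency label to match. The data layout (one `(head_index, label)` pair per token) and the example labels are assumptions for illustration, not the paper's evaluation setup.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) for one or more sentences flattened into token lists.

    Each token is a (head_index, dependency_label) pair; this layout is a
    hypothetical convention chosen for the sketch.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    total = len(gold)
    uas_hits = sum(1 for (gh, _), (ph, _) in zip(gold, predicted) if gh == ph)
    las_hits = sum(1 for g, p in zip(gold, predicted) if g == p)
    return uas_hits / total, las_hits / total


# Toy three-token sentence: every head is right, one label is wrong.
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod")]
uas, las = attachment_scores(gold, pred)
```

Evaluation scripts often differ in details such as whether punctuation tokens are counted, so scores from this sketch are not directly comparable to published figures.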

pdf bib
Sinhala Short Sentence Similarity Calculation using Corpus-Based and Knowledge-Based Similarity Measures
Jcs Kadupitiya | Surangika Ranathunga | Gihan Dias

Currently, corpus-based similarity, string-based similarity, and knowledge-based similarity techniques are used to compare short phrases. However, no work has been conducted on the similarity of phrases in the Sinhala language. In this paper, we present a hybrid methodology to compute the similarity between two Sinhala sentences using a Semantic Similarity Measurement technique (corpus-based similarity measurement plus knowledge-based similarity measurement) that makes use of word order information. Since Sinhala WordNet is still under construction, we used lexical resources in performing this semantic similarity calculation. Evaluation using 4000 sentence pairs yielded an average MSE of 0.145 and a Pearson correlation factor of 0.832.
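
The two evaluation figures quoted above are standard statistics. As a hedged illustration, the sketch below computes mean squared error and the Pearson correlation coefficient between human similarity ratings and system scores. The function names and the toy numbers are hypothetical, and the paper's exact averaging procedure is not stated in the abstract.

```python
import math


def mean_squared_error(gold, predicted):
    """Average of squared differences between gold and predicted scores."""
    return sum((g - p) ** 2 for g, p in zip(gold, predicted)) / len(gold)


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Toy similarity scores on a 0-1 scale (made up for the example).
human = [0.1, 0.4, 0.8, 0.9]
system = [0.2, 0.5, 0.7, 0.9]
mse = mean_squared_error(human, system)
r = pearson_correlation(human, system)
```

A low MSE with a high correlation, as reported in the abstract, indicates that system scores are both close to and ranked consistently with the human ratings.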

pdf bib
Enriching Source for English-to-Urdu Machine Translation
Bushra Jawaid | Amir Kamran | Ondřej Bojar

This paper focuses on the generation of case markers for free word order languages that use case markers as phrasal clitics to mark the relationship between a dependent noun and its head. Generating such clitics becomes an essential task when translating from fixed word order languages, where syntactic relations are identified by the positions of the dependent nouns. To address the problem of missing markers on the source side, artificial markers are added to the source to improve alignment with its target counterparts. An increase of up to 1 BLEU point over the baseline is observed on different test sets for English-to-Urdu.

pdf bib
The IMAGACT4ALL Ontology of Animated Images: Implications for Theoretical and Machine Translation of Action Verbs from English-Indian Languages
Pitambar Behera | Sharmin Muzaffar | Atul Ku. Ojha | Girish Jha

Action verbs are among the most frequently occurring linguistic elements in any natural language, as speakers use them in every linguistic exchange. However, each language categorizes and expresses action verbs in its own inherently unique manner. One verb can refer to several interpretations of actions, and one action can be expressed by more than one verb. These inter-language and intra-language variations create ambiguity when translating action verbs from a source language to a target language. IMAGACT is a corpus-based ontological platform of action verbs translated from prototypic animated images explained in English and Italian as meta-languages. In this paper, we present the issues and challenges in translating action verbs with English as the source language and Indian languages as the target, by observing the animated images. The ten Indian languages annotated so far on the platform are Sanskrit, Hindi, Urdu, Odia (Oriya), Bengali, Manipuri, Tamil, Assamese, Magahi and Marathi. Of these, Manipuri belongs to the Sino-Tibetan family, Tamil to the Dravidian family, and the rest to the Indo-Aryan family. One of the issues is that one-word English verbs are translated into most of the Indian languages as verbs with more than one word form, as in the case of conjunct, compound and serial verbs. We further present a cross-lingual comparison of action verbs among Indian languages. In addition, we address the issues that arise when L1 native speakers disambiguate animated images using competence-based judgements, and the implications these bear for theoretical and machine translation.

pdf bib
Crowdsourcing-based Annotation of Emotions in Filipino and English Tweets
Fermin Roberto Lapitan | Riza Theresa Batista-Navarro | Eliezer Albacea

The automatic analysis of emotions conveyed in social media content, e.g., tweets, has many beneficial applications. In the Philippines, one of the most disaster-prone countries in the world, such methods could potentially enable first responders to make timely decisions despite the risk of data deluge. However, recognising emotions expressed in Philippine-generated tweets, which are mostly written in Filipino, English or a mix of both, is a non-trivial task. In order to facilitate the development of natural language processing (NLP) methods that will automate such type of analysis, we have built a corpus of tweets whose predominant emotions have been manually annotated by means of crowdsourcing. Defining measures ensuring that only high-quality annotations were retained, we have produced a gold standard corpus of 1,146 emotion-labelled Filipino and English tweets. We validate the value of this manually produced resource by demonstrating that an automatic emotion-prediction method based on the use of a publicly available word-emotion association lexicon was unable to reproduce the labels assigned via crowdsourcing. While we are planning to make a few extensions to the corpus in the near future, its current version has been made publicly available in order to foster the development of emotion analysis methods based on advanced Filipino and English NLP.
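
The lexicon-based baseline mentioned in this abstract can be pictured with a minimal sketch. The abstract does not describe the method in detail, so everything below is an assumption for illustration: the toy lexicon `TOY_LEXICON`, the whitespace tokenization, and the majority-vote rule in the hypothetical function `predict_emotion` are not taken from the paper.

```python
from collections import Counter

# Hypothetical toy lexicon (word -> associated emotions); a real
# word-emotion association lexicon would be far larger.
TOY_LEXICON = {
    "happy": {"joy"},
    "thanks": {"joy"},
    "lost": {"sadness"},
    "flood": {"fear", "sadness"},
}


def predict_emotion(tweet, lexicon=TOY_LEXICON):
    """Return the emotion with the most lexicon hits, or None if no word matches."""
    counts = Counter()
    for token in tweet.lower().split():
        # Strip simple punctuation and Twitter markers around the token.
        word = token.strip(".,!?#@")
        for emotion in lexicon.get(word, ()):
            counts[emotion] += 1
    if not counts:
        return None
    # Highest count wins; ties are broken alphabetically for determinism.
    return min(counts, key=lambda e: (-counts[e], e))


print(predict_emotion("We lost everything in the flood!"))
```

A tweet whose words are all missing from the lexicon yields `None`, which is one plausible reason such a baseline can struggle on mixed Filipino and English text.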

pdf bib
Sentiment Analysis of Tweets in Three Indian Languages
Shanta Phani | Shibamouli Lahiri | Arindam Biswas

In this paper, we describe the results of sentiment analysis on tweets in three Indian languages – Bengali, Hindi, and Tamil. We used the recently released SAIL dataset (Patra et al., 2015), and obtained state-of-the-art results in all three languages. Our features are simple, robust, scalable, and language-independent. Further, we show that these simple features provide better results than more complex and language-specific features, in two separate classification tasks. Detailed feature analysis and error analysis have been reported, along with learning curves for Hindi and Bengali.

pdf bib
Dealing with Linguistic Divergences in English-Bhojpuri Machine Translation
Pitambar Behera | Neha Mourya | Vandana Pandey

In Machine Translation, divergence is one of the major barriers and plays a deciding role in determining the efficiency of the system at hand. Translation divergences originate when there are structural discrepancies between the input and the output languages. They can be of various types depending on the issues being addressed, such as linguistic, cultural, communicative and so on. Linguistic divergences emerge when two languages owe their origin to different language families. The present study attempts to categorize different types of linguistic divergences: the lexical-semantic and the syntactic. In addition, it helps identify and resolve the divergent linguistic features between English as source language and Bhojpuri as target language. Dorr’s theoretical framework (1994, 1994a) has been followed in the classification and resolution procedure. Furthermore, so far as the methodology is concerned, we have adhered to Dorr’s Lexical Conceptual Structure for the resolution of divergences. This research will prove to be beneficial for developing efficient MT systems if the mentioned factors are incorporated considering the inherent structural constraints between source and target languages.

pdf bib
The development of a web corpus of Hindi language and corpus-based comparative studies to Japanese
Miki Nishioka | Shiro Akasegawa

In this paper, we discuss our creation of a web corpus of spoken Hindi (COSH), one of the Indo-Aryan languages spoken mainly in the Indian subcontinent. We also point out notable problems we’ve encountered in the web corpus and the special concordancer. After observing the kind of technical problems we encountered, especially regarding annotation tagged by Shiva Reddy’s tagger, we argue how they can be solved when using COSH for linguistic studies. Finally, we mention the kinds of linguistic research that we non-native speakers of Hindi can do using the corpus, especially in pragmatics and semantics, and from a comparative viewpoint to Japanese.

pdf bib
Automatic Creation of a Sentence Aligned Sinhala-Tamil Parallel Corpus
Riyafa Abdul Hameed | Nadeeshani Pathirennehelage | Anusha Ihalapathirana | Maryam Ziyad Mohamed | Surangika Ranathunga | Sanath Jayasena | Gihan Dias | Sandareka Fernando

A sentence aligned parallel corpus is an important prerequisite in statistical machine translation. However, manual creation of such a parallel corpus is time consuming, and requires experts fluent in both languages. Automatic creation of a sentence aligned parallel corpus using parallel text is the solution to this problem. In this paper, we present the first ever empirical evaluation carried out to identify the best method to automatically create a sentence aligned Sinhala-Tamil parallel corpus. Annual reports from Sri Lankan government institutions were used as the parallel text for aligning. Despite both Sinhala and Tamil being under-resourced languages, we were able to achieve an F-score value of 0.791 using a hybrid approach that makes use of a bilingual dictionary.
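
As a reading aid for the reported F-score, the sketch below computes a balanced F-score over aligned sentence pairs. The abstract does not say how the score was computed, so treating alignments as sets of index pairs and using the balanced (F1) variant are assumptions; `alignment_f_score` is a hypothetical helper, and the example pairs are invented.

```python
def alignment_f_score(predicted, gold):
    """Balanced F-score of predicted alignment pairs against gold pairs.

    Each alignment is an iterable of (source_index, target_index) pairs.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented example: three predicted pairs, four gold pairs, two in common.
predicted_pairs = [(0, 0), (1, 1), (2, 3)]
gold_pairs = [(0, 0), (1, 1), (2, 2), (3, 3)]
print(round(alignment_f_score(predicted_pairs, gold_pairs), 3))
```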

pdf bib
Clustering-based Phonetic Projection in Mismatched Crowdsourcing Channels for Low-resourced ASR
Wenda Chen | Mark Hasegawa-Johnson | Nancy Chen | Preethi Jyothi | Lav Varshney

Acquiring labeled speech for low-resource languages is a difficult task in the absence of native speakers of the language. One solution to this problem involves collecting speech transcriptions from crowd workers who are foreign or non-native speakers of a given target language. From these mismatched transcriptions, one can derive probabilistic phone transcriptions that are defined over the set of all target language phones using a noisy channel model. This paper extends prior work on deriving probabilistic transcriptions (PTs) from mismatched transcriptions by 1) modelling multilingual channels and 2) introducing a clustering-based phonetic mapping technique to improve the quality of PTs. Mismatched crowdsourcing for multilingual channels has certain properties of projection mapping, e.g., it can be interpreted as a clustering based on singular value decomposition of the segment alignments. To this end, we explore the use of distinctive feature weights, lexical tone confusions, and a two-step clustering algorithm to learn projections of phoneme segments from mismatched multilingual transcriber languages to the target language. We evaluate our techniques using mismatched transcriptions for Cantonese speech acquired from native English and Mandarin speakers. We observe a 5-9% relative reduction in phone error rate for the predicted Cantonese phone transcriptions using our proposed techniques compared with the previous PT method.
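
The "relative reduction" in phone error rate reported here is a ratio against the baseline rather than a difference in percentage points. A minimal sketch of that arithmetic follows; the function name `relative_reduction` and the example rates are hypothetical and are not figures from the paper.

```python
def relative_reduction(baseline_rate, new_rate):
    """Relative reduction of an error rate, as a fraction of the baseline."""
    if baseline_rate <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline_rate - new_rate) / baseline_rate


# Hypothetical rates: a drop from 50.0% to 46.0% phone error rate is a
# 4-point absolute reduction but an 8% relative reduction.
print(f"{relative_reduction(50.0, 46.0):.1%}")
```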

pdf bib
Improving the Morphological Analysis of Classical Sanskrit
Oliver Hellwig

The paper describes a new tagset for the morphological disambiguation of Sanskrit, and compares the accuracy of two machine learning methods (Conditional Random Fields, deep recurrent neural networks) for this task, with a special focus on how to model the lexicographic information. It reports a significant improvement over previously published results.

pdf bib
Query Translation for Cross-Language Information Retrieval using Multilingual Word Clusters
Paheli Bhattacharya | Pawan Goyal | Sudeshna Sarkar

In Cross-Language Information Retrieval, finding the appropriate translation of the source language query has always been a difficult problem to solve. We propose a technique towards solving this problem with the help of multilingual word clusters obtained from multilingual word embeddings. We use word embeddings of the languages projected to a common vector space on which a community-detection algorithm is applied to find clusters such that words that represent the same concept from different languages fall in the same group. We utilize these multilingual word clusters to perform query translation for Cross-Language Information Retrieval for three languages - English, Hindi and Bengali. We have experimented with the FIRE 2012 and Wikipedia datasets and have shown improvements over several standard methods like dictionary-based method, a transliteration-based model and Google Translate.
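
The idea of translating a query through multilingual word clusters can be sketched as a lookup: each query term is replaced by the target-language words that share its cluster. The sketch below assumes the clusters are already given; the embedding projection and community-detection steps are not shown. The cluster contents (romanized toy words), the `(language, word)` representation, and the function `translate_query` are all hypothetical.

```python
def translate_query(query_terms, source_lang, target_lang, clusters):
    """Map each source term to target-language words in the same cluster.

    `clusters` is a list of sets of (language, word) pairs.
    """
    translations = {}
    for term in query_terms:
        candidates = []
        for cluster in clusters:
            if (source_lang, term) in cluster:
                candidates.extend(
                    sorted(word for lang, word in cluster if lang == target_lang)
                )
        translations[term] = candidates
    return translations


# Hypothetical toy clusters with romanized words.
toy_clusters = [
    {("en", "water"), ("hi", "paani"), ("hi", "jal"), ("bn", "jol")},
    {("en", "river"), ("hi", "nadi"), ("bn", "nodi")},
]
print(translate_query(["water", "river", "bridge"], "en", "hi", toy_clusters))
```

Terms that fall in no cluster, such as "bridge" in this toy data, come back with an empty candidate list, so a real system would need a fallback such as transliteration or keeping the term untranslated.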

pdf bib
A study of attention-based neural machine translation model on Indian languages
Ayan Das | Pranay Yerra | Ken Kumar | Sudeshna Sarkar

Neural machine translation (NMT) models have recently been shown to be very successful in machine translation (MT). The use of LSTMs in machine translation has significantly improved the translation performance for longer sentences by being able to capture the context and long range correlations of the sentences in their hidden layers. The attention model based NMT system (Bahdanau et al., 2014) has become the state-of-the-art, performing equal to or better than other statistical MT approaches. In this paper, we study the performance of the attention-model based NMT system (Bahdanau et al., 2014) on the Indian language pair, Hindi and Bengali, and analyse the types of errors that occur when the languages are morphologically rich and large parallel training corpora are scarce. We then carry out certain post-processing heuristic steps to improve the quality of the translated statements and suggest further measures that can be carried out.

pdf bib
Comprehensive Part-Of-Speech Tag Set and SVM based POS Tagger for Sinhala
Sandareka Fernando | Surangika Ranathunga | Sanath Jayasena | Gihan Dias

This paper presents a new comprehensive multi-level Part-Of-Speech tag set and a Support Vector Machine based Part-Of-Speech tagger for the Sinhala language. The currently available tag set for Sinhala has two limitations: the unavailability of tags to represent some word classes and the lack of tags to capture inflection based grammatical variations of words. The new tag set presented in this paper overcomes both of these limitations. The accuracy of available Sinhala Part-Of-Speech taggers, which are based on Hidden Markov Models, still falls far behind the state of the art. Our Support Vector Machine based tagger achieved an overall accuracy of 84.68% with 59.86% accuracy for unknown words and 87.12% for known words, when the test set contains 10% of unknown words.

pdf bib
Align Me: A framework to generate Parallel Corpus Using OCRs and Bilingual Dictionaries
Priyam Bakliwal | Devadath V V | C V Jawahar

Multilingual language processing tasks like statistical machine translation and cross language information retrieval rely mainly on the availability of accurate parallel corpora. Manual construction of such corpora can be extremely expensive and time consuming. In this paper we present a simple yet efficient method to generate a large, reasonably accurate parallel corpus with minimal user effort. We utilize the availability of a large number of English books and their corresponding translations in other languages to build the parallel corpus. Optical character recognition (OCR) systems are used to digitize such books. We propose a robust dictionary based parallel corpus generation system for alignment of multilingual text at different levels of granularity (sentences, paragraphs, etc.). We show the performance of our proposed method on a manually aligned dataset of 300 Hindi-English sentences and 100 English-Malayalam sentences.

pdf bib
Learning Indonesian-Chinese Lexicon with Bilingual Word Embedding Models and Monolingual Signals
Xinying Qiu | Gangqin Zhu

We present research on learning an Indonesian-Chinese bilingual lexicon using monolingual word embeddings and bilingual seed lexicons to build a shared bilingual word embedding space. We make a first attempt to examine the impact of different monolingual signals for the choice of seed lexicons on the model performance. We found that although monolingual signals alone do not seem to outperform signals covering all words, the significant improvement for learning word translation of the same signal types may suggest that linguistic features possess value for further study in distinguishing the semantic margins of the shared word embedding space.

pdf bib
Creating rich online dictionaries for the Lao–French language pair, reusable for Machine Translation
Vincent Berment

In this paper, we present how we generated two rich online bilingual dictionaries (Lao-French and French-Lao) from unstructured dictionaries in Microsoft Word files. Then we briefly discuss the possible reuse of the lexical data for Machine Translation projects.


pdf (full)
bib (full)
Proceedings of the Workshop on Grammar and Lexicon: interactions and interfaces (GramLex)

pdf bib
Proceedings of the Workshop on Grammar and Lexicon: interactions and interfaces (GramLex)
Eva Hajičová | Igor Boguslavsky

pdf bib
Information structure, syntax, and pragmatics and other factors in resolving scope ambiguity
Valentina Apresjan

The paper is a corpus study of the factors involved in disambiguating potential scope ambiguity in sentences with negation and a universal quantifier, such as “I don’t want to talk to all these people”, which can alternatively mean ‘I don’t want to talk to any of these people’ and ‘I don’t want to talk to some of these people’. The relevant factors are demonstrated to be largely different from those involved in disambiguating lexical polysemy. They include the syntactic function of the constituent containing the “all” quantifier (subject, direct complement, adjunct), as well as the depth of its embedding; the status of the main predicate and the “all” constituent with respect to the information structure of the utterance (topic vs. focus, given vs. new information); and pragmatic implicatures pertaining to the situations described in the utterances.

pdf bib
Multiword Expressions at the Grammar-Lexicon Interface
Timothy Baldwin

In this talk, I will outline a range of challenges presented by multiword expressions in terms of (lexicalist) precision grammar engineering, and different strategies for accommodating those challenges, in an attempt to strike the right balance in terms of generalisation and over- and under-generation.

pdf bib
Microsyntactic Phenomena as a Computational Linguistics Issue
Leonid Iomdin

Microsyntactic linguistic units, such as syntactic idioms and non-standard syntactic constructions, are poorly represented in linguistic resources, mostly because the former are elements occupying an intermediate position between the lexicon and the grammar and the latter are too specific to be routinely tackled by general grammars. Consequently, many such units produce substantial gaps in systems intended to solve sophisticated computational linguistics tasks, such as parsing, deep semantic analysis, question answering, machine translation, or text generation. They also present obstacles for applying advanced techniques to these tasks, such as machine learning. The paper discusses an approach aimed at bridging such gaps, focusing on the development of monolingual and multilingual corpora where microsyntactic units are to be tagged.

pdf bib
Alternations: From Lexicon to Grammar And Back Again
Markéta Lopatková | Václava Kettnerová

An excellent example of a phenomenon bridging a lexicon and a grammar is provided by grammaticalized alternations (e.g., passivization, reflexivity, and reciprocity): these alternations represent productive grammatical processes which are, however, lexically determined. While grammaticalized alternations keep the lexical meaning of verbs unchanged, they are usually characterized by various changes in their morphosyntactic structure. In this contribution, we demonstrate on the example of reciprocity and its representation in the valency lexicon of Czech verbs, VALLEX, how a linguistic description of complex (and still systemic) changes characteristic of grammaticalized alternations can benefit from an integration of grammatical rules into a valency lexicon. In contrast to other types of grammaticalized alternations, reciprocity in Czech has received relatively little attention although it closely interacts with various linguistic phenomena (e.g., with light verbs, diatheses, and reflexivity).

pdf bib
Extra-Specific Multiword Expressions for Language-Endowed Intelligent Agents
Marjorie McShane | Sergei Nirenburg

Language-endowed intelligent agents benefit from leveraging lexical knowledge falling at different points along a spectrum of compositionality. This means that robust computational lexicons should include not only the compositional expectations of argument-taking words, but also non-compositional collocations (idioms), semi-compositional collocations that might be difficult for an agent to interpret (e.g., standard metaphors), and even collocations that could be compositionally analyzed but are so frequently encountered that recording their meaning increases the efficiency of interpretation. In this paper we argue that yet another type of string-to-meaning mapping can also be useful to intelligent agents: remembered semantic analyses of actual text inputs. These can be viewed as super-specific multi-word expressions whose recorded interpretations mimic a person’s memories of knowledge previously learned from language input. These differ from typical annotated corpora in two ways. First, they provide a full, context-sensitive semantic interpretation rather than select features. Second, they are formulated in the ontologically-grounded metalanguage used in a particular agent environment, meaning that the interpretations contribute to the dynamically evolving cognitive capabilities of agents configured in that environment.

pdf bib
Universal Dependencies: A Cross-Linguistic Perspective on Grammar and Lexicon
Joakim Nivre

Universal Dependencies is an initiative to develop cross-linguistically consistent grammatical annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning and parsing research from a language typology perspective. It assumes a dependency-based approach to syntax and a lexicalist approach to morphology, which together entail that the fundamental units of grammatical annotation are words. Words have properties captured by morphological annotation and enter into relations captured by syntactic annotation. Moreover, priority is given to relations between lexical content words, as opposed to grammatical function words. In this position paper, I discuss how this approach allows us to capture similarities and differences across typologically diverse languages.

pdf bib
The Development of Multimodal Lexical Resources
James Pustejovsky | Tuan Do | Gitit Kehat | Nikhil Krishnaswamy

Human communication is a multimodal activity, involving not only speech and written expressions, but intonation, images, gestures, visual clues, and the interpretation of actions through perception. In this paper, we describe the design of a multimodal lexicon that is able to accommodate the diverse modalities that present themselves in NLP applications. We have been developing a multimodal semantic representation, VoxML, that integrates the encoding of semantic, visual, gestural, and action-based features associated with linguistic expressions.

pdf bib
On the Non-canonical Valency Filling
Igor Boguslavsky

Valency slot filling is a semantic glue, which brings together the meanings of words. As regards the position of an argument in the dependency structure with respect to its predicate, there exist three types of valency filling: active (canonical), passive, and discontinuous. Of these, the first type is studied much better than the other two. As a rule, canonical actants are unambiguously marked in the syntactic structure, and each actant corresponds to a unique syntactic position. Linguistic information on which syntactic function an actant might have (subject, direct or indirect object), what its morphological form should be and which prepositions or conjunctions it requires, can be given in the lexicon in the form of government patterns, subcategorization frames, or similar data structures. We concentrate on non-canonical cases of valency filling in Russian, which are characteristic of non-verbal parts of speech, such as adverbs, adjectives, and particles, in the first place. They are more difficult to handle than canonical ones, because the position of the actant in the tree is governed by more complicated rules. A valency may be filled by expressions occupying different syntactic positions, and a syntactic position may accept expressions filling different valencies of the same word. We show how these phenomena can be processed in a semantic analyzer.

pdf bib
Improvement of VerbNet-like resources by frame typing
Laurence Danlos | Matthieu Constant | Lucie Barque

VerbeNet is a French lexicon developed by “translation” of its English counterpart, VerbNet (Kipper-Schuler, 2005), and treatment of the specificities of French syntax (Pradet et al., 2014; Danlos et al., 2016). One difficulty encountered in its development springs from the fact that the list of (potentially numerous) frames has no internal organization. This paper proposes a type system for frames that shows whether two frames are variants of a given alternation. Frame typing facilitates coherence checking of the resource in a “virtuous circle”. We present the principles underlying a program we developed and used to automatically type frames in VerbeNet. We also show that our system is portable to other languages.

pdf bib
Enriching a Valency Lexicon by Deverbative Nouns
Eva Fučíková | Jan Hajič | Zdeňka Urešová

We present an attempt to automatically identify Czech deverbative nouns using several methods that use large corpora as well as existing lexical resources. The motivation for the task is to extend a verbal valency (i.e., predicate-argument) lexicon by adding nouns that share the valency properties with the base verb, assuming their properties can be derived (even if not trivially) from the underlying verb by deterministic grammatical rules. At the same time, even in inflective languages, not all deverbatives are simply created from their underlying base verb by regular lexical derivation processes. We have thus developed hybrid techniques that use both large parallel corpora and several standard lexical resources. Thanks to the use of parallel corpora, the resulting sets also contain synonyms, which the lexical derivation rules cannot produce. For evaluation, we have manually created a small, 100-verb gold dataset, since no such dataset was initially available for Czech.

pdf bib
The Grammar of English Deverbal Compounds and their Meaning
Gianina Iordăchioaia | Lonneke van der Plas | Glorianna Jagfeld

We present an interdisciplinary study on the interaction between the interpretation of noun-noun deverbal compounds (DCs; e.g., task assignment) and the morphosyntactic properties of their deverbal heads in English. Underlying hypotheses from theoretical linguistics are tested with tools and resources from computational linguistics. We start with Grimshaw’s (1990) insight that deverbal nouns are ambiguous between argument-supporting nominal (ASN) readings, which inherit verbal arguments (e.g., the assignment of the tasks), and the less verbal and more lexicalized Result Nominal and Simple Event readings (e.g., a two-page assignment). Following Grimshaw, our hypothesis is that the former will realize object arguments in DCs, while the latter will receive a wider range of interpretations like root compounds headed by non-derived nouns (e.g., chocolate box). Evidence from a large corpus assisted by machine learning techniques confirms this hypothesis, by showing that, besides other features, the realization of internal arguments by deverbal heads outside compounds (i.e., the most distinctive ASN-property in Grimshaw 1990) is a good predictor for an object interpretation of non-heads in DCs.

pdf bib
Encoding a syntactic dictionary into a super granular unification grammar
Sylvain Kahane | François Lareau

We show how to turn a large-scale syntactic dictionary into a dependency-based unification grammar where each piece of lexical information calls a separate rule, yielding a super granular grammar. Subcategorization, raising and control verbs, auxiliaries and copula, passivization, and tough-movement are discussed. We focus on the semantics-syntax interface and offer a new perspective on syntactic structure.

pdf bib
Identification of Flexible Multiword Expressions with the Help of Dependency Structure Annotation
Ayaka Morimoto | Akifumi Yoshimoto | Akihiko Kato | Hiroyuki Shindo | Yuji Matsumoto

This paper presents our ongoing work on the compilation of an English multi-word expression (MWE) lexicon. We are especially interested in collecting flexible MWEs, in which some other components can intervene in the expression, such as “a number of” vs “a large number of”, where a modifier of “number” can be placed in the expression and inherit the original meaning. We first collect possible candidates of flexible English MWEs from the web, and annotate all of their occurrences in the Wall Street Journal portion of the OntoNotes corpus. We make use of word dependency structure information of the sentences converted from the phrase structure annotation. This process enables semi-automatic annotation of MWEs in the corpus and simultaneously produces the internal and external dependency representation of flexible MWEs.

pdf bib
A new look at possessive reflexivization: A comparative study between Czech and Russian
Anna Nedoluzhko

The paper presents a contrastive description of reflexive possessive pronouns “svůj” in Czech and “svoj” in Russian. The research concerns syntactic, semantic and pragmatic aspects. With our analysis, we shed a new light on the already investigated issue, which comes from a detailed comparison of the phenomenon of possessive reflexivization in two typologically and genetically similar languages. We show that whereas in Czech, the possessive reflexivization is mostly limited to syntactic functions and does not go beyond the grammar, in Russian it gets additional semantic meanings and moves substantially towards the lexicon. The obtained knowledge allows us to explain heretofore unclear marginal uses of reflexives in each language.

pdf bib
Modeling non-standard language
Alexandr Rosen

A specific language as used by different speakers and in different situations has a number of more or less distant varieties. Extending the notion of non-standard language to varieties that do not fit an explicitly or implicitly assumed norm or pattern, we look for methods and tools that could be applied to this domain. The needs start from the theoretical side: categories usable for the analysis of non-standard language are not readily available, and continue to methods and tools required for its detection and diagnostics. A general discussion of issues related to non-standard language is followed by two case studies. The first study presents a taxonomy of morphosyntactic categories as an attempt to analyse non-standard forms produced by non-native learners of Czech. The second study focusses on the role of a rule-based grammar and lexicon in the process of building and using a parsebank.


pdf (full)
bib (full)
Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)

pdf bib
Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)
Bo Han | Alan Ritter | Leon Derczynski | Wei Xu | Tim Baldwin

pdf bib
Processing non-canonical or noisy text: fortuitous data to the rescue
Barbara Plank

Real world data differs radically from the benchmark corpora we use in NLP, resulting in large performance drops. The reason for this problem is obvious: NLP models are trained on limited samples from canonical varieties considered standard. However, there are many dimensions, e.g., sociodemographic, language, genre, sentence type, etc. on which texts can differ from the standard. The solution is not obvious: we cannot control for all factors, and it is not clear how to best go beyond the current practice of training on homogeneous data from a single domain and language. In this talk, I review the notion of canonicity, and how it shapes our community’s approach to language. I argue for the use of fortuitous data. Fortuitous data is data out there that just waits to be harvested. It includes data which is in plain sight, but is often neglected, and more distant sources like behavioral data, which first need to be refined. They provide additional contexts and a myriad of opportunities to build more adaptive language technology, some of which I will explore in this talk.

pdf bib
From Entity Linking to Question Answering – Recent Progress on Semantic Grounding Tasks
Ming-Wei Chang

Entity linking and semantic parsing have been shown to be crucial to important applications such as question answering and document understanding. These tasks often require structured learning models, which make predictions on multiple interdependent variables. In this talk, I argue that carefully designed structured learning algorithms play a central role in entity linking and semantic parsing tasks. In particular, I will present several new structured learning models for entity linking, which jointly detect mentions and disambiguate entities as well as capture non-textual information. I will then show how to use a staged search procedure to build a state-of-the-art knowledge base question answering system. Finally, if time permits, I will discuss different supervision protocols for training semantic parsers and the value of labeling semantic parses.

pdf bib
DISAANA and D-SUMM: Large-scale Real Time NLP Systems for Analyzing Disaster Related Reports in Tweets
Kentaro Torisawa

This talk presents two NLP systems that were developed for helping disaster victims and rescue workers in the aftermath of large-scale disasters. DISAANA provides answers to questions such as “What is in short supply in Tokyo?” and displays locations related to each answer on a map. D-SUMM automatically summarizes a large number of disaster related reports concerning a specified area and helps rescue workers to understand disaster situations from a macro perspective. Both systems are publicly available as Web services. In the aftermath of the 2016 Kumamoto Earthquake (M7.0), the Japanese government actually used DISAANA to analyze the situation.

pdf bib
Private or Corporate? Predicting User Types on Twitter
Nikola Ljubešić | Darja Fišer

In this paper we present a series of experiments on discriminating between private and corporate accounts on Twitter. We define features based on Twitter metadata, morphosyntactic tags and surface forms, showing that the simple bag-of-words model achieves the single best results that can, however, be improved by building a weighted soft ensemble of classifiers based on each feature type. Investigating the time and language dependence of each feature type delivers quite unexpected results showing that features based on metadata are neither time- nor language-insensitive as the way the two user groups use the social network varies heavily through time and space.
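
One common reading of a "weighted soft ensemble" is a weighted average of the class probabilities produced by each per-feature-type classifier. The abstract does not specify the combination rule or the weights, so the sketch below is only an illustration under that assumption; the function names, probabilities, and weights are all hypothetical.

```python
def soft_ensemble(prob_vectors, weights):
    """Weighted average of per-classifier class-probability vectors."""
    if not prob_vectors or len(prob_vectors) != len(weights):
        raise ValueError("need one weight per classifier")
    total = sum(weights)
    n_classes = len(prob_vectors[0])
    return [
        sum(w * probs[c] for probs, w in zip(prob_vectors, weights)) / total
        for c in range(n_classes)
    ]


def ensemble_predict(prob_vectors, weights, labels):
    """Label with the highest weighted-average probability."""
    combined = soft_ensemble(prob_vectors, weights)
    best = max(range(len(combined)), key=combined.__getitem__)
    return labels[best]


# Hypothetical outputs for one account, as [P(private), P(corporate)],
# from bag-of-words, metadata, and morphosyntactic-tag classifiers.
probabilities = [[0.6, 0.4], [0.3, 0.7], [0.45, 0.55]]
weights = [0.5, 0.3, 0.2]
print(ensemble_predict(probabilities, weights, ["private", "corporate"]))
```

In this invented example the bag-of-words classifier alone would say "private", while the weighted combination tips to "corporate", which shows how an ensemble can overrule its single strongest member.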

pdf bib
From Noisy Questions to Minecraft Texts: Annotation Challenges in Extreme Syntax Scenario
Héctor Martínez Alonso | Djamé Seddah | Benoît Sagot

User-generated content presents many challenges for its automatic processing. While many of them do come from out-of-vocabulary effects, others spawn from different linguistic phenomena such as unusual syntax. In this work we present a French three-domain data set made up of question headlines from a cooking forum, game chat logs and associated forums from two popular online games (MINECRAFT & LEAGUE OF LEGENDS). We chose these domains because they encompass different degrees of lexical and syntactic compliance with canonical language. We conduct an automatic and manual evaluation of the difficulties of processing these domains for part-of-speech prediction, and introduce a pilot study to determine whether dependency analysis lends itself well to annotate these data. We also discuss the development cost of our data set.

pdf bib
Disaster Analysis using User-Generated Weather Report
Yasunobu Asakura | Masatsugu Hangyo | Mamoru Komachi

Information extraction from user-generated text has gained much attention with the growth of the Web. Disaster analysis using information from social media provides valuable, real-time, geolocation information for helping people caught up in these disasters. However, it is not convenient to analyze texts posted on social media because disaster keywords match any texts that contain the words. For collecting posts about a disaster from social media, we need to develop a classifier to filter posts irrelevant to disasters. Moreover, because of the nature of social media, we can take advantage of posts that come with GPS information. However, a post does not always refer to an event occurring at the place where it has been posted. Therefore, we propose a new task of classifying whether a flood disaster occurred, in addition to predicting the geolocation of events from user-generated text. We report the annotation of the flood disaster corpus and develop a classifier to demonstrate the use of this corpus for disaster analysis.

pdf bib
Veracity Computing from Lexical Cues and Perceived Certainty Trends
Uwe Reichel | Piroska Lendvai

We present a data-driven method for determining the veracity of a set of rumorous claims on social media data. Tweets from different sources pertaining to a rumor are processed on three levels: first, factuality values are assigned to each tweet based on four textual cue categories relevant for our journalism use case; these amalgamate speaker support in terms of polarity and commitment in terms of certainty and speculation. Next, the proportions of these lexical cues are utilized as predictors for tweet certainty in a generalized linear regression model. Subsequently, lexical cue proportions, predicted certainty, as well as their time course characteristics are used to compute veracity for each rumor in terms of the identity of the rumor-resolving tweet and its binary resolution value judgment. The system operates without access to extralinguistic resources. Evaluated on the data portion for which hand-labeled examples were available, it achieves .74 F1-score on identifying rumor resolving tweets and .76 F1-score on predicting if a rumor is resolved as true or false.

pdf bib
A Simple but Effective Approach to Improve Arabizi-to-English Statistical Machine Translation
Marlies van der Wees | Arianna Bisazza | Christof Monz

A major challenge for statistical machine translation (SMT) of Arabic-to-English user-generated text is the prevalence of text written in Arabizi, or Romanized Arabic. When facing such texts, a translation system trained on conventional Arabic-English data will suffer from extremely low model coverage. In addition, Arabizi is not regulated by any official standardization and therefore highly ambiguous, which prevents rule-based approaches from achieving good translation results. In this paper, we improve Arabizi-to-English machine translation by presenting a simple but effective Arabizi-to-Arabic transliteration pipeline that does not require knowledge by experts or native Arabic speakers. We incorporate this pipeline into a phrase-based SMT system, and show that translation quality after automatically transliterating Arabizi to Arabic yields results that are comparable to those achieved after human transliteration.

pdf bib
Name Variation in Community Question Answering Systems
Anietie Andy | Satoshi Sekine | Mugizi Rwebangira | Mark Dredze

Community question answering systems are forums where users can ask and answer questions in various categories. Examples are Yahoo! Answers, Quora, and Stack Overflow. A common challenge with such systems is that a significant percentage of asked questions are left unanswered. In this paper, we propose an algorithm to reduce the number of unanswered questions in Yahoo! Answers by reusing the answer to the most similar past resolved question to the unanswered question, from the site. Semantically similar questions could be worded differently, thereby making it difficult to find questions that have shared needs. For example, “Who is the best player for the Reds?” and “Who is currently the biggest star at Manchester United?” have a shared need but are worded differently; also, “Reds” and “Manchester United” are used to refer to the soccer team Manchester United football club. In this research, we focus on question categories that contain a large number of named entities and entity name variations. We show that in these categories, entity linking can be used to identify relevant past resolved questions with shared needs as a given question by disambiguating named entities and matching these questions based on the disambiguated entities, identified entities, and knowledge base information related to these entities. We evaluated our algorithm on a new dataset constructed from Yahoo! Answers. The dataset contains annotated question pairs, (Qgiven, [Qpast, Answer]). We carried out experiments on several question categories and show that an entity-based approach gives good performance when searching for similar questions in entity rich categories.

pdf bib
Whose Nickname is This? Recognizing Politicians from Their Aliases
Wei-Chung Wang | Hung-Chen Chen | Zhi-Kai Ji | Hui-I Hsiao | Yu-Shian Chiu | Lun-Wei Ku

Using aliases to refer to public figures is one way to make fun of people, to express sarcasm, or even to sidestep legal issues when expressing opinions on social media. However, linking an alias back to the real name is difficult, as it entails phonemic, graphemic, and semantic challenges. In this paper, we propose a phonemic-based approach and inject semantic information to align aliases with politicians’ Chinese formal names. The proposed approach creates an HMM model for each name to model its phonemes and takes into account document-level pairwise mutual information to capture the semantic relations to the alias. In this work we also introduce two new datasets consisting of 167 phonemic pairs and 279 mixed pairs of aliases and formal names. Experimental results show that the proposed approach models both phonemic and semantic information and outperforms previous work on both the phonemic and mixed datasets with the best top-1 accuracies of 0.78 and 0.59 respectively.

pdf bib
Towards Accurate Event Detection in Social Media: A Weakly Supervised Approach for Learning Implicit Event Indicators
Ajit Jain | Girish Kasiviswanathan | Ruihong Huang

Accurate event detection in social media is very challenging because user generated contents are extremely noisy and sparse in content. Event indicators are generally words or phrases that act as a trigger that help us understand the semantics of the context they occur in. We present a weakly supervised approach that relies on using a single strong event indicator phrase as a seed to acquire a variety of additional event cues. We propose to leverage various types of implicit event indicators, such as props, actors and precursor events, to achieve precise event detection. We experimented with civil unrest events and show that the automatically learnt event indicators are effective in identifying specific types of events.

pdf bib
Unsupervised Stemmer for Arabic Tweets
Fahad Albogamy | Allan Ramsay

Stemming is an essential processing step in a wide range of high level text processing applications such as information extraction, machine translation and sentiment analysis. It is used to reduce words to their stems. Many stemming algorithms have been developed for Modern Standard Arabic (MSA). Although Arabic tweets and MSA are closely related and share many characteristics, there are substantial differences between them in lexicon and syntax. In this paper, we introduce a light Arabic stemmer for Arabic tweets. Our results show improvements over the performance of a number of well-known stemmers for Arabic.

pdf bib
Topic Stability over Noisy Sources
Jing Su | Derek Greene | Oisín Boydell

Topic modelling techniques such as LDA have recently been applied to speech transcripts and OCR output. These corpora may contain noisy or erroneous texts which may undermine topic stability. Therefore, it is important to know how well a topic modelling algorithm will perform when applied to noisy data. In this paper we show that different types of textual noise can have diverse effects on the stability of topic models. On the other hand, topic model stability is not consistent with the same type but different levels of noise. We introduce a dictionary filtering approach to address this challenge, with the result that a topic model with the correct number of topics is always identified across different levels of noise.

pdf bib
Analysis of Twitter Data for Postmarketing Surveillance in Pharmacovigilance
Julie Pain | Jessie Levacher | Adam Quinquenel | Anja Belz

Postmarketing surveillance (PMS) has the vital aim to monitor effects of drugs after release for use by the general population, but suffers from under-reporting and limited coverage. Automatic methods for detecting drug effect reports, especially for social media, could vastly increase the scope of PMS. Very few automatic PMS methods are currently available, in particular for the messy text types encountered on Twitter. In this paper we describe first results for developing PMS methods specifically for tweets. We describe the corpus of 125,669 tweets we have created and annotated to train and test the tools. We find that generic tools perform well for tweet-level language identification and tweet-level sentiment analysis (both 0.94 F1-Score). For detection of effect mentions we are able to achieve 0.87 F1-Score, while effect-level adverse-vs.-beneficial analysis proves harder with an F1-Score of 0.64. Among other things, our results indicate that MetaMap semantic types provide a very promising basis for identifying drug effect mentions in tweets.

pdf bib
Named Entity Recognition and Hashtag Decomposition to Improve the Classification of Tweets
Billal Belainine | Alexsandro Fonseca | Fatiha Sadat

In social network services like Twitter, users are overwhelmed with a huge amount of social data, most of which is short, unstructured and highly noisy. Identifying accurate information from this huge amount of data is indeed a hard task. Classification of tweets into an organized form will help users easily access the information they require. Our first contribution relates to filtering parts of speech and preprocessing this kind of highly noisy and short data. Our second contribution concerns named entity recognition (NER) in tweets, which requires adapting existing natural language tools to the noisy and inaccurate language of tweets. Our third contribution involves segmentation of hashtags and a semantic enrichment using a combination of relations from WordNet, which helps the performance of our classification system, including disambiguation of named entities, abbreviations and acronyms. Graph theory is used to cluster the words extracted from WordNet and tweets, based on the idea of connected components. We test our automatic classification system with four categories: politics, economy, sports and the medical field. We evaluate and compare several automatic classification systems using part or all of the items described in our contributions and find that filtering by part of speech and named entity recognition dramatically increases the classification precision to 77.3%. Moreover, a classification system incorporating segmentation of hashtags and semantic enrichment by two relations from WordNet, synonymy and hyperonymy, increases classification precision up to 83.4%.

pdf bib
Exploring Word Embeddings for Unsupervised Textual User-Generated Content Normalization
Thales Felipe Costa Bertaglia | Maria das Graças Volpe Nunes

Text normalization techniques based on rules, lexicons or supervised training requiring large corpora are neither scalable nor domain-interchangeable, and this makes them unsuitable for normalizing user-generated content (UGC). Current tools available for Brazilian Portuguese make use of such techniques. In this work we propose a technique based on distributed representation of words (or word embeddings). It generates continuous numeric vectors of high dimensionality to represent words. The vectors explicitly encode many linguistic regularities and patterns, as well as syntactic and semantic word relationships. Words that share semantic similarity are represented by similar vectors. Based on these features, we present a totally unsupervised, expandable and language- and domain-independent method for learning normalization lexicons from word embeddings. Our approach obtains a high correction rate of orthographic errors and internet slang in product reviews, outperforming the currently available tools for Brazilian Portuguese.

pdf bib
How Document Pre-processing affects Keyphrase Extraction Performance
Florian Boudin | Hugo Mougard | Damien Cram

The SemEval-2010 benchmark dataset has brought renewed attention to the task of automatic keyphrase extraction. This dataset is made up of scientific articles that were automatically converted from PDF format to plain text and thus require careful preprocessing so that irrelevant spans of text do not negatively affect keyphrase extraction performance. In previous work, a wide range of document preprocessing techniques were described but their impact on the overall performance of keyphrase extraction models is still unexplored. Here, we re-assess the performance of several keyphrase extraction models and measure their robustness against increasingly sophisticated levels of document preprocessing.

pdf bib
Japanese Text Normalization with Encoder-Decoder Model
Taishi Ikeda | Hiroyuki Shindo | Yuji Matsumoto

Text normalization is the task of transforming lexical variants to their canonical forms. We model the problem of text normalization as a character-level sequence to sequence learning problem and present a neural encoder-decoder model for solving it. To train the encoder-decoder model, many sentence pairs are generally required. However, Japanese non-standard canonical pairs are scarce in the form of parallel corpora. To address this issue, we propose a method of data augmentation to increase data size by converting existing resources into synthesized non-standard forms using handcrafted rules. We conducted an experiment to demonstrate that the synthesized corpus contributes to stable training of an encoder-decoder model and improves the performance of Japanese text normalization.

pdf bib
Results of the WNUT16 Named Entity Recognition Shared Task
Benjamin Strauss | Bethany Toma | Alan Ritter | Marie-Catherine de Marneffe | Wei Xu

This paper presents the results of the Twitter Named Entity Recognition shared task associated with W-NUT 2016: a named entity tagging task with 10 teams participating. We outline the shared task, annotation process and dataset statistics, and provide a high-level overview of the participating systems for each shared task.

pdf bib
Bidirectional LSTM for Named Entity Recognition in Twitter Messages
Nut Limsopatham | Nigel Collier

In this paper, we present our approach for named entity recognition in Twitter messages that we used in our participation in the Named Entity Recognition in Twitter shared task at the COLING 2016 Workshop on Noisy User-generated text (WNUT). The main challenge that we aim to tackle in our participation is the short, noisy and colloquial nature of tweets, which makes named entity recognition in Twitter messages a challenging task. In particular, we investigate an approach for dealing with this problem by enabling bidirectional long short-term memory (LSTM) to automatically learn orthographic features without requiring feature engineering. In comparison with other systems participating in the shared task, our system achieved the most effective performance on both the ‘segmentation and categorisation’ and the ‘segmentation only’ sub-tasks.

pdf bib
Learning to recognise named entities in tweets by exploiting weakly labelled data
Kurt Junshean Espinosa | Riza Theresa Batista-Navarro | Sophia Ananiadou

Named entity recognition (NER) in social media (e.g., Twitter) is a challenging task due to the noisy nature of text. As part of our participation in the W-NUT 2016 Named Entity Recognition Shared Task, we proposed an unsupervised learning approach using deep neural networks and leverage a knowledge base (i.e., DBpedia) to bootstrap sparse entity types with weakly labelled data. To further boost the performance, we employed a more sophisticated tagging scheme and applied dropout as a regularisation technique in order to reduce overfitting. Even without hand-crafting linguistic features nor leveraging any of the W-NUT-provided gazetteers, we obtained robust performance with our approach, which ranked third amongst all shared task participants according to the official evaluation on a gold standard named entity-annotated corpus of 3,856 tweets.

pdf bib
Feature-Rich Twitter Named Entity Recognition and Classification
Utpal Kumar Sikdar | Björn Gambäck

Twitter named entity recognition is the process of identifying proper names and classifying them into some predefined labels/categories. The paper introduces a Twitter named entity system using a supervised machine learning approach, namely Conditional Random Fields. A large set of different features was developed and the system was trained using these. The Twitter named entity task can be divided into two parts: i) Named entity extraction from tweets and ii) Twitter name classification into ten different types. For Twitter named entity recognition on unseen test data, our system obtained the second highest F1 score in the shared task: 63.22%. The system performance on the classification task was worse, with an F1 measure of 40.06% on unseen test data, which was the fourth best of the ten systems participating in the shared task.

pdf bib
Learning to Search for Recognizing Named Entities in Twitter
Ioannis Partalas | Cédric Lopez | Nadia Derbas | Ruslan Kalitvianski

In this work we present our participation in the 2nd Named Entity Recognition for Twitter shared task. The task has been cast as a sequence labeling one and we employed a learning to search approach in order to tackle it. We also leveraged LOD for extracting rich contextual features for the named entities. Our submission achieved F-scores of 46.16 and 60.24 for the classification and the segmentation tasks and ranked 2nd and 3rd respectively. The post-analysis showed that LOD features substantially improved the performance of our system as they counter-balance the lack of context in tweets. The shared task gave us the opportunity to test the performance of NER systems on short and noisy textual data. The results of the participating systems show that the task is far from solved and that methods with stellar performance on normal texts need to be revised.

pdf bib
DeepNNNER: Applying BLSTM-CNNs and Extended Lexicons to Named Entity Recognition in Tweets
Fabrice Dugas | Eric Nichols

In this paper, we describe the DeepNNNER entry to The 2nd Workshop on Noisy User-generated Text (WNUT) Shared Task #2: Named Entity Recognition in Twitter. Our shared task submission adopts the bidirectional LSTM-CNN model of Chiu and Nichols (2016), as it has been shown to perform well on both newswire and Web texts. It uses word embeddings trained on large-scale Web text collections together with text normalization to cope with the diversity in Web texts, and lexicons for target named entity classes constructed from publicly-available sources. Extended evaluation comparing the effectiveness of various word embeddings, text normalization, and lexicon settings shows that our system achieves a maximum F1-score of 47.24, performance surpassing that of the shared task’s second-ranked system.

pdf bib
ASU: An Experimental Study on Applying Deep Learning in Twitter Named Entity Recognition.
Michel Naim Gerguis | Cherif Salama | M. Watheq El-Kharashi

This paper describes the ASU system submitted in the COLING W-NUT 2016 Twitter Named Entity Recognition (NER) task. We present an experimental study on applying deep learning to extracting named entities (NEs) from tweets. We built two Long Short-Term Memory (LSTM) models for the task. The first model was built to extract named entities without types while the second model was built to extract and then classify them into 10 fine-grained entity classes. In this effort, we show detailed experimentation results on the effectiveness of word embeddings, Brown clusters, part-of-speech (POS) tags, shape features, gazetteers, and local context for the tweet input vector representation to the LSTM model. Also, we present a set of experiments to better design the network parameters for the Twitter NER task. Our system was ranked fifth out of ten participants with a final F1-score of 39% for the typed classes and 55% for the non-typed ones.

pdf bib
UQAM-NTL: Named entity recognition in Twitter messages
Ngoc Tan Le | Fatma Mallek | Fatiha Sadat

This paper describes our system used in the 2nd Workshop on Noisy User-generated Text (WNUT) shared task for Named Entity Recognition (NER) in Twitter, in conjunction with Coling 2016. Our system is based on supervised machine learning by applying Conditional Random Fields (CRF) to train two classifiers for two evaluations. The first evaluation aims at predicting the 10 fine-grained types of named entities; while the second evaluation aims at predicting no type of named entities. The experimental results show that our method has significantly improved Twitter NER performance.

pdf bib
Semi-supervised Named Entity Recognition in noisy-text
Shubhanshu Mishra | Jana Diesner

Many of the existing Named Entity Recognition (NER) solutions are built based on news corpus data with proper syntax. These solutions might not lead to highly accurate results when being applied to noisy, user generated data, e.g., tweets, which can feature sloppy spelling, concept drift, and limited contextualization of terms and concepts due to length constraints. The models described in this paper are based on linear chain conditional random fields (CRFs), use the BIEOU encoding scheme, and leverage random feature dropout for up-sampling the training data. The considered features include word clusters and pre-trained distributed word representations, updated gazetteer features, and global context predictions. The latter feature allows for ingesting the meaning of new or rare tokens into the system via unsupervised learning and for alleviating the need to learn lexicon based features, which usually tend to be high dimensional. In this paper, we report on the solution [ST] we submitted to the WNUT 2016 NER shared task. We also present an improvement over our original submission [SI], which we built by using semi-supervised learning on labelled training data and pre-trained resources constructed from unlabelled tweet data. Our ST solution achieved an F1 score 1.2% higher than the baseline (35.1% F1) for the task of extracting 10 entity types. The SI resulted in an increase of 8.2% in F1 score over the baseline (7.08% over ST). Finally, the SI model’s evaluation on the test data achieved an F1 score of 47.3% (~1.15% increase over the 2nd best submitted solution). Our experimental setup and results are available as a standalone twitter NER tool at https://github.com/napsternxg/TwitterNER.
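
The BIEOU encoding scheme named in the abstract above is commonly read as Begin, Inside, End, Outside, Unit. The following is a minimal Python sketch of that reading, not the authors' implementation: the function name `encode_bieou`, the span format (start index, exclusive end index, label) and the example label `geo-loc` are all assumptions made for illustration.

```python
def encode_bieou(num_tokens, spans):
    """Turn entity spans into one BIEOU tag per token.

    spans: list of (start, end, label) with `end` exclusive.
    This span format is an assumption made for this sketch.
    """
    tags = ["O"] * num_tokens  # O = outside any entity
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"U-{label}"  # U = single-token (unit) entity
        else:
            tags[start] = f"B-{label}"  # B = first token of the entity
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"  # I = token strictly inside
            tags[end - 1] = f"E-{label}"  # E = last token of the entity
    return tags


tokens = ["New", "York", "City", "beat", "Boston"]
tags = encode_bieou(len(tokens), [(0, 3, "geo-loc"), (4, 5, "geo-loc")])
print(list(zip(tokens, tags)))
# tags == ['B-geo-loc', 'I-geo-loc', 'E-geo-loc', 'O', 'U-geo-loc']
```

Compared with plain BIO tagging, the explicit end and unit tags mark entity boundaries on both sides, which is one reason such schemes are often paired with sequence models like CRFs.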

pdf bib
Twitter Geolocation Prediction Shared Task of the 2016 Workshop on Noisy User-generated Text
Bo Han | Afshin Rahimi | Leon Derczynski | Timothy Baldwin

This paper presents the shared task for English Twitter geolocation prediction in WNUT 2016. We discuss details of task settings, data preparations and participant systems. The derived dataset and performance figures from each system provide baselines for future research in this realm.

pdf bib
CSIRO Data61 at the WNUT Geo Shared Task
Gaya Jayasinghe | Brian Jin | James Mchugh | Bella Robinson | Stephen Wan

In this paper, we describe CSIRO Data61’s participation in the Geolocation shared task at the Workshop for Noisy User-generated Text. Our approach was to use ensemble methods to capitalise on four component methods: heuristics based on metadata, a label propagation method, timezone text classifiers, and an information retrieval approach. The ensembles we explored focused on examining the role of language technologies in geolocation prediction and also in examining the use of hard voting and cascading ensemble methods. Based on the accuracy of city-level predictions, our systems were the best performing submissions at this year’s shared task. Furthermore, when estimating the latitude and longitude of a user, our median error distance was accurate to within 30 kilometers.

pdf bib
Geolocation Prediction in Twitter Using Location Indicative Words and Textual Features
Lianhua Chi | Kwan Hui Lim | Nebula Alam | Christopher J. Butler

Knowing the location of a social media user and their posts is important for various purposes, such as the recommendation of location-based items/services, and locality detection of crisis/disasters. This paper describes our submission to the shared task “Geolocation Prediction in Twitter” of the 2nd Workshop on Noisy User-generated Text. In this shared task, we propose an algorithm to predict the location of Twitter users and tweets using a multinomial Naive Bayes classifier trained on Location Indicative Words and various textual features (such as city/country names, #hashtags and @mentions). We compared our approach against various baselines based on Location Indicative Words, city/country names, #hashtags and @mentions as individual feature sets, and experimental results show that our approach outperforms these baselines in terms of classification accuracy, mean and median error distance.

pdf bib
A Simple Scalable Neural Networks based Model for Geolocation Prediction in Twitter
Yasuhide Miura | Motoki Taniguchi | Tomoki Taniguchi | Tomoko Ohkuma

This paper describes a model that we submitted to W-NUT 2016 Shared task #1: Geolocation Prediction in Twitter. Our model classifies a tweet or a user to a city using a simple neural networks structure with fully-connected layers and average pooling processes. From the findings of previous geolocation prediction approaches, we integrated various user metadata along with message texts and trained the model with them. In the test run of the task, the model achieved the accuracy of 40.91% and the median distance error of 69.50 km in message-level prediction and the accuracy of 47.55% and the median distance error of 16.13 km in user-level prediction. These results are moderate performances in terms of accuracy and best performances in terms of distance. The results show a promising extension of neural networks based models for geolocation prediction where recent advances in neural networks can be added to enhance our current simple model.


pdf (full)
bib (full)
Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH)

pdf bib
Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH)
Erhard Hinrichs | Marie Hinrichs | Thorsten Trippel

pdf bib
Flexible and Reliable Text Analytics in the Digital Humanities – Some Methodological Considerations
Jonas Kuhn

The availability of Language Technology Resources and Tools generates a considerable methodological potential in the Digital Humanities: aspects of research questions from the Humanities and Social Sciences can be addressed on text collections in ways that were unavailable to traditional approaches. I start this talk by sketching some sample scenarios of Digital Humanities projects which involve various Humanities and Social Science disciplines, noting that the potential for a meaningful contribution to higher-level questions is highest when the employed language technological models are carefully tailored both (a) to characteristics of the given target corpus, and (b) to relevant analytical subtasks feeding the discipline-specific research questions. Keeping up a multidisciplinary perspective, I then point out a recurrent dilemma in Digital Humanities projects that follow the conventional set-up of collaboration: to build high-quality computational models for the data, fixed analytical targets should be specified as early as possible – but to be able to respond to Humanities questions as they evolve over the course of analysis, the analytical machinery should be kept maximally flexible. To reach both, I argue for a novel collaborative culture that rests on a more interleaved, continuous dialogue. (Re-)Specification of analytical targets should be an ongoing process in which the Humanities Scholars and Social Scientists play a role that is as important as the Computational Scientists’ role. A promising approach lies in the identification of re-occurring types of analytical subtasks, beyond linguistic standard tasks, which can form building blocks for text analysis across disciplines, and for which corpus-based characterizations (viz. annotations) can be collected, compared and revised. On such grounds, computational modeling is more directly tied to the evolving research questions, and hence the seemingly opposing needs of reliable target specifications vs. “malleable” frameworks of analysis can be reconciled. Experimental work following this approach is under way in the Center for Reflected Text Analytics (CRETA) in Stuttgart.

pdf bib
Finding Rising and Falling Words
Erik Tjong Kim Sang

We examine two different methods for finding rising words (among which neologisms) and falling words (among which archaisms) in decades of magazine texts (millions of words) and in years of tweets (billions of words): one based on correlation coefficients of relative frequencies and time, and one based on comparing initial and final word frequencies of time intervals. We find that smoothing frequency scores improves the precision scores of both methods and that the correlation coefficients perform better on magazine text but worse on tweets. Since the two ranking methods find different words they can be used side by side to study the behavior of words over time.
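
The two ranking ideas in the abstract above can be illustrated with a small Python sketch. It is a simplified reading under stated assumptions, not the author's code: the function names, the toy counts and the additive smoothing constant are hypothetical, and the abstract does not specify which smoothing the paper uses.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series shows no trend
    return cov / sqrt(var_x * var_y)


def correlation_score(counts, totals):
    """Method 1: correlate relative frequency with the time index."""
    rel = [c / t for c, t in zip(counts, totals)]
    return pearson(list(range(len(rel))), rel)


def endpoint_score(counts, totals, smoothing=1.0):
    """Method 2: ratio of final to initial relative frequency.

    The additive smoothing is an assumption of this sketch; it only
    keeps the ratio defined when the initial count is zero.
    """
    first = (counts[0] + smoothing) / (totals[0] + smoothing)
    last = (counts[-1] + smoothing) / (totals[-1] + smoothing)
    return last / first


# Hypothetical per-period counts for two words and corpus sizes.
totals = [1000, 1000, 1000, 1000]
rising = [1, 5, 20, 60]
falling = [40, 20, 10, 2]

print(correlation_score(rising, totals), endpoint_score(rising, totals))
print(correlation_score(falling, totals), endpoint_score(falling, totals))
```

In this sketch a rising word gets a positive correlation and an endpoint ratio above 1, and a falling word gets a negative correlation and a ratio below 1. Ranking a vocabulary by each score separately gives two word lists that can then be compared.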

pdf bib
A Dataset for Multimodal Question Answering in the Cultural Heritage Domain
Shurong Sheng | Luc Van Gool | Marie-Francine Moens

Multimodal question answering in the cultural heritage domain allows visitors to ask questions in a more natural way and thus provides better user experiences with cultural objects while visiting a museum, landmark or any other historical site. In this paper, we introduce the construction of a gold standard dataset that will aid research on multimodal question answering in the cultural heritage domain. The dataset, which will soon be released to the public, contains multimodal content including images of typical artworks from the fascinating old-Egyptian Amarna period, related image-containing documents of the artworks and over 800 multimodal queries integrating visual and textual questions. The multimodal questions and related documents are all in English. The multimodal questions are linked to relevant paragraphs in the related documents that contain the answer to the multimodal query.

pdf bib
Extracting Social Networks from Literary Text with Word Embedding Tools
Gerhard Wohlgenannt | Ekaterina Chernyak | Dmitry Ilvovsky

In this paper a social network is extracted from a literary text. The social network shows how frequently the characters interact and how similar their social behavior is. Two types of similarity measures are used: the first applies co-occurrence statistics, while the second exploits cosine similarity on different types of word embedding vectors. The results are evaluated by a paid micro-task crowdsourcing survey. The experiments suggest that specific types of word embeddings like word2vec are well-suited for the task at hand and the specific circumstances of literary fiction text.

pdf bib
Exploration of register-dependent lexical semantics using word embeddings
Andrey Kutuzov | Elizaveta Kuzmenko | Anna Marakasova

We present an approach to detect differences in lexical semantics across English language registers, using word embedding models from the distributional semantics paradigm. Models trained on register-specific subcorpora of the BNC corpus are employed to compare lists of nearest associates for particular words and draw conclusions about their semantic shifts depending on the register in which they are used. The models are evaluated on the task of register classification with the help of the deep inverse regression approach. Additionally, we present a demo web service featuring most of the described models and allowing users to explore word meanings in different English registers and to detect register affiliation for arbitrary texts. The code for the service can be easily adapted to any set of underlying models.

pdf bib
Original-Transcribed Text Alignment for Manyosyu Written by Old Japanese Language
Teruaki Oka | Tomoaki Kono

We are constructing an annotated diachronic corpus of the Japanese language. In part of this work, we construct a corpus of Manyosyu, which is an old Japanese poetry anthology. In this paper, we describe how to align the transcribed text and its original text semiautomatically to be able to cross-reference them in our Manyosyu corpus. Although we align the original characters to the transcribed words manually, we preliminarily align the transcribed and original characters by using an unsupervised automatic alignment technique of statistical machine translation to alleviate the work. We found that automatic alignment achieves an F1-measure of 0.83; thus, each poem has 1–2 alignment errors. However, finding these errors and modifying them are less work-intensive and more efficient than fully manual annotation. The alignment probabilities can be utilized in this modification. Moreover, we found that we can locate the uncertain transcriptions in our corpus and compare them to other transcriptions, by using the alignment probabilities.

pdf bib
Shamela: A Large-Scale Historical Arabic Corpus
Yonatan Belinkov | Alexander Magidow | Maxim Romanov | Avi Shmidman | Moshe Koppel

Arabic is a widely-spoken language with a rich and long history spanning more than fourteen centuries. Yet existing Arabic corpora largely focus on the modern period or lack sufficient diachronic information. We develop a large-scale, historical corpus of Arabic of about 1 billion words from diverse periods of time. We clean this corpus, process it with a morphological analyzer, and enhance it by detecting parallel passages and automatically dating undated texts. We demonstrate its utility with selected case-studies in which we show its application to the digital humanities.

pdf bib
Feelings from the Past—Adapting Affective Lexicons for Historical Emotion Analysis
Sven Buechel | Johannes Hellrich | Udo Hahn

We here describe a novel methodology for measuring affective language in historical text by expanding an affective lexicon and jointly adapting it to prior language stages. We automatically construct a lexicon for word-emotion association of 18th and 19th century German which is then validated against expert ratings. Subsequently, this resource is used to identify distinct emotional patterns and trace long-term emotional trends in different genres of writing spanning several centuries.

pdf bib
Automatic parsing as an efficient pre-annotation tool for historical texts
Hanne Martine Eckhoff | Aleksandrs Berdičevskis

Historical treebanks tend to be manually annotated, which is not surprising, since state-of-the-art parsers are not accurate enough to ensure high-quality annotation for historical texts. We test whether automatic parsing can be an efficient pre-annotation tool for Old East Slavic texts. We use the TOROT treebank from the PROIEL treebank family. We convert the PROIEL format to the CONLL format and use MaltParser to create syntactic pre-annotation. Using the most conservative evaluation method, which takes into account PROIEL-specific features, MaltParser by itself yields 0.845 unlabelled attachment score, 0.779 labelled attachment score and 0.741 secondary dependency accuracy (note, though, that the test set comes from a relatively simple genre and contains rather short sentences). Experiments with human annotators show that preparsing, if limited to sentences where no changes to word or sentence boundaries are required, increases their annotation rate. For experienced annotators, the speed gain varies from 5.80% to 16.57%, for inexperienced annotators from 14.61% to 32.17% (using conservative estimates). There are no strong reliable differences in the annotation accuracy, which means that there is no reason to suspect that using preparsing might lower the final annotation quality.
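For readers unfamiliar with the attachment scores reported in this abstract, the following sketch shows the standard token-level definitions: unlabelled attachment score counts tokens whose predicted head is correct, and labelled attachment score additionally requires the correct relation label. It does not implement the PROIEL-specific conservative evaluation or secondary dependency accuracy mentioned above, and the function name and relation labels are hypothetical.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) for one parsed text.

    gold, predicted: lists of (head_index, relation_label) tuples, one per token.
    """
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equally long")
    total = len(gold)
    # Unlabelled: only the head index has to match.
    head_ok = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # Labelled: head index and relation label both have to match.
    both_ok = sum(g == p for g, p in zip(gold, predicted))
    return head_ok / total, both_ok / total
```

For example, with four tokens of which three have the correct head and two also have the correct label, the function returns 0.75 and 0.5.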

pdf bib
A Visual Representation of Wittgenstein’s Tractatus Logico-Philosophicus
Anca Bucur | Sergiu Nisioi

In this paper we will discuss a method for data visualization together with its potential usefulness in digital humanities and philosophy of language. We compiled a multilingual parallel corpus from different versions of Wittgenstein’s Tractatus Logico-Philosophicus, including the original in German and translations into English, Spanish, French, and Russian. Using this corpus, we compute a similarity measure between propositions and render a visual network of relations for different languages.

pdf bib
A Web-based Tool for the Integrated Annotation of Semantic and Syntactic Structures
Richard Eckart de Castilho | Éva Mújdricza-Maydt | Seid Muhie Yimam | Silvana Hartmann | Iryna Gurevych | Anette Frank | Chris Biemann

We introduce the third major release of WebAnno, a generic web-based annotation tool for distributed teams. New features in this release focus on semantic annotation tasks (e.g. semantic role labelling or event annotation) and allow the tight integration of semantic annotations with syntactic annotations. In particular, we introduce the concept of slot features, a novel constraint mechanism that allows modelling the interaction between semantic and syntactic annotations, as well as a new annotation user interface. The new features were developed and used in an annotation project for semantic roles on German texts. The paper briefly introduces this project and reports on experiences performing annotations with the new tool. In a comparative evaluation, our tool reaches significant speedups over WebAnno 2 for a semantic annotation task.

pdf bib
Challenges and Solutions for Latin Named Entity Recognition
Alexander Erdmann | Christopher Brown | Brian Joseph | Mark Janse | Petra Ajaka | Micha Elsner | Marie-Catherine de Marneffe

Although spanning thousands of years and genres as diverse as liturgy, historiography, lyric and other forms of prose and poetry, the body of Latin texts is still relatively sparse compared to English. Data sparsity in Latin presents a number of challenges for traditional Named Entity Recognition techniques. Solving such challenges and enabling reliable Named Entity Recognition in Latin texts can facilitate many down-stream applications, from machine translation to digital historiography, enabling Classicists, historians, and archaeologists for instance, to track the relationships of historical persons, places, and groups on a large scale. This paper presents the first annotated corpus for evaluating Named Entity Recognition in Latin, as well as a fully supervised model that achieves over 90% F-score on a held-out test set, significantly outperforming a competitive baseline. We also present a novel active learning strategy that predicts how many and which sentences need to be annotated for named entities in order to attain a specified degree of accuracy when recognizing named entities automatically in a given text. This maximizes the productivity of annotators while simultaneously controlling quality.

pdf bib
Geographical Visualization of Search Results in Historical Corpora
Florian Petran

We present ANNISVis, a webapp for comparative visualization of geographical distribution of linguistic data, as well as a sample deployment for a corpus of Middle High German texts. Unlike existing geographical visualization solutions, which work with pre-existing data sets, or are bound to specific corpora, ANNISVis allows the user to formulate multiple ad-hoc queries and visualizes them on a map, and it can be configured for any corpus that can be imported into ANNIS. This enables explorative queries of the quantitative aspects of a corpus with geographical features. The tool will be made available to download in open source.

pdf bib
Implementation of a Workflow Management System for Non-Expert Users
Bart Jongejan

In the Danish CLARIN-DK infrastructure, chaining language technology (LT) tools into a workflow is easy even for a non-expert user, because she only needs to specify the input and the desired output of the workflow. With this information and the registered input and output profiles of the available tools, the CLARIN-DK workflow management system (WMS) computes combinations of tools that will give the desired result. This advanced functionality was originally not envisaged, but came within reach by writing the WMS partly in Java and partly in a programming language for symbolic computation, Bracmat. Handling LT tool profiles, including the computation of workflows, is easier with Bracmat’s language constructs for tree pattern matching and tree construction than with the language constructs offered by mainstream programming languages.

pdf bib
Integrating Optical Character Recognition and Machine Translation of Historical Documents
Haithem Afli | Andy Way

Machine Translation (MT) plays a critical role in expanding capacity in the translation industry. However, many valuable documents, including digital documents, are encoded in formats that are not accessible for machine processing (e.g., historical or legal documents). Such documents must be passed through a process of Optical Character Recognition (OCR) to render the text suitable for MT. No matter how good the OCR is, this process introduces recognition errors, which often render MT ineffective. In this paper, we propose a new OCR-to-MT framework based on adding a new OCR error correction module to enhance the overall quality of translation. Experimentation shows that our new correction system, based on the combination of language modeling and translation methods, outperforms the baseline system with a relative improvement of nearly 30%.

pdf bib
Language technology tools and resources for the analysis of multimodal communication
László Hunyadi | Tamás Váradi | István Szekrényes

In this paper we describe how the complexity of human communication can be analysed with the help of language technology. We present the HuComTech corpus, a multimodal corpus of 50 hours of videotaped interviews with a rich annotation of about 2 million items annotated on 33 levels. The corpus serves as a general resource for a wide range of research addressing natural conversation between humans in their full complexity. It can particularly benefit digital humanities researchers working in the fields of pragmatics, conversational analysis and discourse analysis. We will present a number of tools and automated methods that can help such enquiries. In particular, we will highlight the tool Theme, which is designed to uncover hidden temporal patterns (called T-patterns) in human interaction, and will show how it can be applied to the study of multimodal communication.

pdf bib
Large-scale Analysis of Spoken Free-verse Poetry
Timo Baumann | Burkhard Meyer-Sickendiek

Most modern and post-modern poems have developed a post-metrical idea of lyrical prosody that employs rhythmical features of everyday language and prose instead of a strict adherence to rhyme and metrical schemes. This development is subsumed under the term free verse prosody. We present our methodology for the large-scale analysis of modern and post-modern poetry in both their written form and as spoken aloud by the author. We employ language processing tools to align text and speech, to generate a null-model of how the poem would be spoken by a naïve reader, and to extract contrastive prosodic features used by the poet. On these, we intend to build our model of free verse prosody, which will help to understand, differentiate and relate the different styles of free verse poetry. We plan to use our processing scheme on large amounts of data to iteratively build models of styles, to validate and guide manual style annotation, to identify further rhythmical categories, and ultimately to broaden our understanding of free verse poetry. In this paper, we report on a proof-of-concept of our methodology using smaller amounts of poems and a limited set of features. We find that our methodology helps to extract differentiating features in the authors’ speech that can be explained by philological insight. Thus, our automatic method helps to guide the literary analysis and this in turn helps to improve our computational models.

pdf bib
PAT workbench: Annotation and Evaluation of Text and Pictures in Multimodal Instructions
Ielka van der Sluis | Lennart Kloppenburg | Gisela Redeker

This paper presents a tool to investigate the design of multimodal instructions (MIs), i.e., instructions that contain both text and pictures. The benefit of including pictures in information presentation has been established, but the characteristics of those pictures and of their textual counterparts and the relation(s) between them have not been researched in a systematic manner. We present the PAT Workbench, a tool to store, annotate and retrieve MIs based on a validated coding scheme with currently 42 categories that describe instructions in terms of textual features, pictorial elements, and relations between text and pictures. We describe how the PAT Workbench facilitates collaborative annotation and inter-annotator agreement calculation. Future work on the tool includes expanding its functionality and usability by (i) making the MI annotation scheme dynamic for adding relevant features based on empirical evaluations of the MIs, (ii) implementing algorithms for automatic tagging of MI features, and (iii) implementing automatic MI evaluation algorithms based on results obtained via e.g. crowdsourced assessments of MIs.

pdf bib
Semantic Indexing of Multilingual Corpora and its Application on the History Domain
Alessandro Raganato | Jose Camacho-Collados | Antonio Raganato | Yunseo Joung

The increasing number of multilingual text collections available in different domains makes their automatic processing essential for the development of a given field. However, standard processing techniques based on statistical clues and keyword searches have clear limitations. Instead, we propose a knowledge-based processing pipeline which overcomes most of the limitations of these techniques. This, in turn, enables direct comparison across texts in different languages without the need for translation. In this paper we show the potential of this approach for semantically indexing multilingual text collections in the history domain. In our experiments we used a version of the Bible translated into four different languages, evaluating the precision of our semantic indexing pipeline and showing its reliability on the cross-lingual text retrieval task.

pdf bib
Tagging Ingush - Language Technology For Low-Resource Languages Using Resources From Linguistic Field Work
Jörg Tiedemann | Johanna Nichols | Ronald Sprouse

This paper presents on-going work on creating NLP tools for under-resourced languages from very sparse training data coming from linguistic field work. In this work, we focus on Ingush, a Nakh-Daghestanian language spoken by about 300,000 people in the Russian republics Ingushetia and Chechnya. We present work on morphosyntactic taggers trained on transcribed and linguistically analyzed recordings and dependency parsers using English glosses to project annotation for creating synthetic treebanks. Our preliminary results are promising, supporting the goal of bootstrapping efficient NLP tools with limited or no task-specific annotated data resources available.

pdf bib
The MultiTal NLP tool infrastructure
Driss Sadoun | Satenik Mkhitaryan | Damien Nouvel | Mathieu Valette

This paper gives an overview of the MultiTal project, which aims to create a research infrastructure that ensures long-term distribution of NLP tools descriptions. The goal is to make NLP tools more accessible and usable to end-users of different disciplines. The infrastructure is built on a meta-data scheme modelling and standardising multilingual NLP tools documentation. The model is conceptualised using an OWL ontology. The formal representation of the ontology allows us to automatically generate organised and structured documentation in different languages for each represented tool.

pdf bib
Tools and Instruments for Building and Querying Diachronic Computational Lexica
Fahad Khan | Andrea Bellandi | Monica Monachini

This article describes work on enabling the addition of temporal information to senses of words in linguistic linked open data lexica based on the lemonDia model. Our contribution in this article is twofold. On the one hand, we demonstrate how lemonDia enables the querying of diachronic lexical datasets using OWL-oriented Semantic Web based technologies. On the other hand, we present a preliminary version of an interactive interface intended to help users in creating lexical datasets that model meaning change over time.

pdf bib
Tracking Words in Chinese Poetry of Tang and Song Dynasties with the China Biographical Database
Chao-Lin Liu | Kuo-Feng Luo

Large-scale comparisons between the poetry of the Tang and Song dynasties shed light on how words and expressions were used and shared among the poets. That some words were used only in the Tang poetry and some only in the Song poetry could lead to interesting research in linguistics. That the most frequent colors are different in the Tang and Song poetry provides a trace of the changing social circumstances in the dynasties. Results of the current work link to research topics of lexicography, semantics, and social transitions. We discuss our findings and present our algorithms for efficient comparisons among the poems, which are crucial for completing billions of comparisons within acceptable time.

pdf bib
Using TEI for textbook research
Lena-Luise Stahn | Steffen Hennicke | Ernesto William De Luca

The following paper describes the first steps in the development of an ontology for the textbook research discipline. The aim of the project WorldViews is to establish a digital edition focussing on views of the world depicted in textbooks. For this purpose, an initial TEI profile has been formalised and tested as a use case to enable the semantic encoding of the resource ‘textbook’. This profile shall provide a basic data model describing major facets of the textbook’s structure relevant to historians.

pdf bib
Web services and data mining: combining linguistic tools for Polish with an analytical platform
Maciej Ogrodniczuk

In this paper we present a new combination of existing language tools for Polish with a popular data mining platform intended to help researchers from digital humanities perform computational analyses without any programming. The toolset includes RapidMiner Studio, a software solution offering graphical setup of integrated analytical processes and Multiservice, a Web service offering access to several state-of-the-art linguistic tools for Polish. The setting is verified in a simple task of counting frequencies of unknown words in a small corpus.

up

pdf (full)
bib (full)
Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC)

pdf bib
Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC)
Dominique Brunato | Felice Dell’Orletta | Giulia Venturi | Thomas François | Philippe Blache

pdf bib
Could Machine Learning Shed Light on Natural Language Complexity?
Maria Dolores Jiménez-López | Leonor Becerra-Bonache

In this paper, we propose to use a subfield of machine learning, grammatical inference, to measure linguistic complexity from a developmental point of view. We focus on relative complexity by considering a child learner in the process of first language acquisition. The relevance of grammatical inference models for measuring linguistic complexity from a developmental point of view is based on the fact that algorithms proposed in this area can be considered computational models for studying first language acquisition. Even though it would be possible to use different techniques from the field of machine learning as computational models for dealing with linguistic complexity (since in any model we have algorithms that can learn from data), we claim that grammatical inference models offer some advantages over other tools.

pdf bib
Towards a Distributional Model of Semantic Complexity
Emmanuele Chersoni | Philippe Blache | Alessandro Lenci

In this paper, we introduce for the first time a Distributional Model for computing semantic complexity, inspired by the general principles of the Memory, Unification and Control framework (Hagoort, 2013; Hagoort, 2016). We argue that sentence comprehension is an incremental process driven by the goal of constructing a coherent representation of the event represented by the sentence. The composition cost of a sentence depends on the semantic coherence of the event being constructed and on the activation degree of the linguistic constructions. We also report the results of a first evaluation of the model on the Bicknell dataset (Bicknell et al., 2010).

pdf bib
CoCoGen - Complexity Contour Generator: Automatic Assessment of Linguistic Complexity Using a Sliding-Window Technique
Ströbel Marcus | Elma Kerz | Daniel Wiechmann | Stella Neumann

We present a novel approach to the automatic assessment of text complexity based on a sliding-window technique that tracks the distribution of complexity within a text. Such distribution is captured by what we term “complexity contours” derived from a series of measurements for a given linguistic complexity measure. This approach is implemented in an automatic computational tool, CoCoGen – Complexity Contour Generator, which in its current version supports 32 indices of linguistic complexity. The goal of the paper is twofold: (1) to introduce the design of our computational tool based on a sliding-window technique and (2) to showcase this approach in the area of second language (L2) learning, i.e. more specifically, in the area of L2 writing.
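The sliding-window idea behind a complexity contour can be sketched as follows. The sketch assumes the window moves over sentences one step at a time and uses mean sentence length in words as a stand-in complexity measure; the window unit, step size, and function names are assumptions, and none of the tool's 32 indices is reproduced here.

```python
def complexity_contour(sentences, measure, window=3):
    """Apply `measure` to every run of `window` consecutive sentences.

    Returns one score per window position; the resulting series is the contour.
    A text shorter than the window yields an empty contour.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    last_start = len(sentences) - window
    return [measure(sentences[i:i + window]) for i in range(last_start + 1)]


def mean_sentence_length(window_sentences):
    """Stand-in complexity measure: average number of whitespace-separated words."""
    lengths = [len(s.split()) for s in window_sentences]
    return sum(lengths) / len(lengths)
```

Plotting the returned list against window position shows where in a text the chosen measure rises or falls, which a single text-level average would hide.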

pdf bib
Addressing surprisal deficiencies in reading time models
Marten van Schijndel | William Schuler

This study demonstrates a weakness in how n-gram and PCFG surprisal are used to predict reading times in eye-tracking data. In particular, the information conveyed by words skipped during saccades is not usually included in the surprisal measures. This study shows that correcting the surprisal calculation improves n-gram surprisal and that upcoming n-grams affect reading times, replicating previous findings of how lexical frequencies affect reading times. In contrast, the predictivity of PCFG surprisal does not benefit from the surprisal correction despite the fact that lexical sequences skipped by saccades are processed by readers, as demonstrated by the corrected n-gram measure. These results raise questions about the formulation of information-theoretic measures of syntactic processing such as PCFG surprisal and entropy reduction when applied to reading times.
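One plausible way to include skipped words in a surprisal measure is sketched below: the surprisal of each skipped word is carried forward and added to the next fixated word. This accumulation scheme is an assumption made for illustration, since the abstract does not spell out the correction, and the function names and input format are hypothetical.

```python
from math import log2


def surprisal(prob):
    """Surprisal in bits of a word with conditional probability `prob`."""
    return -log2(prob)


def fixation_surprisal(word_probs, fixated):
    """Compare uncorrected and accumulated surprisal at each fixated word.

    word_probs: conditional probability of each word given its context.
    fixated: one boolean per word, False if the word was skipped.
    Returns a list of (word_index, uncorrected, corrected) tuples.
    """
    rows = []
    pending = 0.0  # surprisal of words skipped since the last fixation
    for index, (prob, is_fixated) in enumerate(zip(word_probs, fixated)):
        bits = surprisal(prob)
        if is_fixated:
            rows.append((index, bits, pending + bits))
            pending = 0.0
        else:
            pending += bits
    return rows
```

Under this scheme the uncorrected value ignores skipped material entirely, while the corrected value grows with the amount of information passed over during the saccade.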

pdf bib
Towards grounding computational linguistic approaches to readability: Modeling reader-text interaction for easy and difficult texts
Sowmya Vajjala | Detmar Meurers | Alexander Eitel | Katharina Scheiter

Computational approaches to readability assessment are generally built and evaluated using gold standard corpora labeled by publishers or teachers rather than being grounded in observations about human performance. Considering that both the reading process and the outcome can be observed, there is an empirical wealth that could be used to ground computational analysis of text readability. This will also support explicit readability models connecting text complexity and the reader’s language proficiency to the reading process and outcomes. This paper takes a step in this direction by reporting on an experiment to study how the relation between text complexity and reader’s language proficiency affects the reading process and performance outcomes of readers after reading. We modeled the reading process using three eye tracking variables: fixation count, average fixation count, and second pass reading duration. Our models for these variables explained 78.9%, 74% and 67.4% of the variance, respectively. Performance outcome was modeled through recall and comprehension questions, and these models explained 58.9% and 27.6% of the variance, respectively. While the online models give us a better understanding of the cognitive correlates of reading with text complexity and language proficiency, modeling of the offline measures can be particularly relevant for incorporating user aspects into readability models.

pdf bib
Memory access during incremental sentence processing causes reading time latency
Cory Shain | Marten van Schijndel | Richard Futrell | Edward Gibson | William Schuler

Studies on the role of memory as a predictor of reading time latencies (1) differ in their predictions about when memory effects should occur in processing and (2) have had mixed results, with strong positive effects emerging from isolated constructed stimuli and weak or even negative effects emerging from naturally-occurring stimuli. Our study addresses these concerns by comparing several implementations of prominent sentence processing theories on an exploratory corpus and evaluating the most successful of these on a confirmatory corpus, using a new self-paced reading corpus of seemingly natural narratives constructed to contain an unusually high proportion of memory-intensive constructions. We show highly significant and complementary broad-coverage latency effects both for predictors based on the Dependency Locality Theory and for predictors based on a left-corner parsing model of sentence processing. Our results indicate that memory access during sentence processing does take time, but suggest that stimuli requiring many memory access events may be necessary in order to observe the effect.

pdf bib
Reducing lexical complexity as a tool to increase text accessibility for children with dyslexia
Núria Gala | Johannes Ziegler

Lexical complexity plays a central role in readability, particularly for dyslexic children and poor readers because of their slow and laborious decoding and word recognition skills. Although some features to aid readability may be common to most languages (e.g., the majority of ‘easy’ words are of low frequency), we believe that lexical complexity is mainly language-specific. In this paper, we define lexical complexity for French and we present a pilot study on the effects of text simplification in dyslexic children. The participants were asked to read out loud original and manually simplified versions of a standardized French text corpus and to answer comprehension questions after reading each text. The analysis of the results shows that the simplifications performed were beneficial in terms of reading speed and they reduced the number of reading errors (mainly lexical ones) without a loss in comprehension. Although the number of participants in this study was rather small (N=10), the results are promising and contribute to the development of applications in computational linguistics.

pdf bib
Syntactic and Lexical Complexity in Italian Noncanonical Structures
Rodolfo Delmonte

In this paper we will be dealing with different levels of complexity in the processing of Italian, a Romance language inheriting many properties from Latin which make it an almost free word order language. The paper is concerned with syntactic complexity as measurable on the basis of the cognitive parser that incrementally builds up a syntactic representation to be used by the semantic component. The underlying theory will be LFG, and parsing preferences will be used to justify one choice both from a principled and a processing point of view. LFG is a transformationless theory in which there is no deep structure separate from surface syntactic structure. This is partially in accordance with constructional theories in which noncanonical structures containing non-argument functions FOCUS/TOPIC are treated as multifunctional constituents. Complexity is computed on a processing basis following suggestions made by Blache and demonstrated by Kluender and Chesi.

pdf bib
Real Multi-Sense or Pseudo Multi-Sense: An Approach to Improve Word Representation
Haoyue Shi | Caihua Li | Junfeng Hu

Previous research has shown that learning multiple representations for polysemous words can improve the performance of word embeddings on many tasks. However, this leads to another problem. Several vectors of a word may actually point to the same meaning, namely pseudo multi-sense. In this paper, we introduce the concept of pseudo multi-sense, and then propose an algorithm to detect such cases. Taking the detected pseudo multi-sense cases into consideration, we try to refine the existing word embeddings to eliminate the influence of pseudo multi-sense. Moreover, we apply our algorithm to previously released multi-sense word embeddings and test it on artificial word similarity tasks and the analogy task. The results of the experiments show that diminishing pseudo multi-sense can improve the quality of word representations. Thus, our method is actually an efficient way to reduce linguistic complexity.

pdf bib
A Preliminary Study of Statistically Predictive Syntactic Complexity Features and Manual Simplifications in Basque
Itziar Gonzalez-Dios | María Jesús Aranzabe | Arantza Díaz de Ilarraza

In this paper, we present a comparative analysis of statistically predictive syntactic features of complexity and the treatment of these features by humans when simplifying texts. To that end, we have used a list of the five most statistically predictive features obtained automatically and the Corpus of Basque Simplified Texts (CBST) to analyse how the syntactic phenomena in these features have been manually simplified. Our aim is to go beyond the descriptions of operations found in the corpus and relate the multidisciplinary findings to understand text complexity from different points of view. We also present some issues that can be important when analysing linguistic complexity.

pdf bib
Dynamic pause assessment of keystroke logged data for the detection of complexity in translation and monolingual text production
Arndt Heilmann | Stella Neumann

Pause analysis of keystroke-logged translations is a hallmark of process-based translation studies. However, an exact definition of what a cognitively effortful pause during the translation process is has not been found yet (Saldanha and O’Brien, 2013). This paper investigates the design of a keystroke- and subject-dependent identification system of cognitive effort to track complexity in translation with keystroke logging (cf. also Dragsted, 2005; Couto-Vale, in preparation). It is an elastic measure that takes into account idiosyncratic pause duration of translators as well as further confounds such as bi-gram frequency, letter frequency and some motor tasks involved in writing. The method is compared to a common static threshold of 1000 ms in an analysis of cognitive effort during the translation of grammatical functions from English to German. Additionally, the results are triangulated with eye tracking data for further validation. The findings show that at least for smaller sets of data a dynamic pause assessment may lead to more accurate results than a generic static pause threshold of similar duration.

pdf bib
Implicit readability ranking using the latent variable of a Bayesian Probit model
Johan Falkenjack | Arne Jönsson

Data-driven approaches to readability analysis for languages other than English have been plagued by a scarcity of suitable corpora. Often, relevant corpora consist only of easy-to-read texts with no rank information or empirical readability scores, making only binary approaches, such as classification, applicable. We propose a Bayesian latent variable approach to get the most out of these kinds of corpora. In this paper we present results on using such a model for readability ranking. The model is evaluated on a preliminary corpus of ranked student texts with encouraging results. We also assess the model by showing that it performs readability classification on par with a state-of-the-art classifier while at the same time being transparent enough to allow more sophisticated interpretations.

pdf bib
CTAP: A Web-Based Tool Supporting Automatic Complexity Analysis
Xiaobin Chen | Detmar Meurers

Informed by research on readability and language acquisition, computational linguists have developed sophisticated tools for the analysis of linguistic complexity. While some tools are starting to become accessible on the web, there still is a disconnect between the features that can in principle be identified based on state-of-the-art computational linguistic analysis, and the analyses a second language acquisition researcher, teacher, or textbook writer can readily obtain and visualize for their own collection of texts. This short paper presents a web-based tool development that aims to meet this challenge. The Common Text Analysis Platform (CTAP) is designed to support fully configurable linguistic feature extraction for a wide range of complexity analyses. It features a user-friendly interface, modularized and reusable analysis component integration, and flexible corpus and feature management. Building on the Unstructured Information Management framework (UIMA), CTAP readily supports integration of state-of-the-art NLP and complexity feature extraction maintaining modularization and reusability. CTAP thereby aims at providing a common platform for complexity analysis, encouraging research collaboration and sharing of feature extraction components—to jointly advance the state-of-the-art in complexity analysis in a form that readily supports real-life use by ordinary users.

pdf bib
Coursebook Texts as a Helping Hand for Classifying Linguistic Complexity in Language Learners’ Writings
Ildikó Pilán | David Alfter | Elena Volodina

We bring together knowledge from two different types of language learning data, texts learners read and texts they write, to improve linguistic complexity classification in the latter. Linguistic complexity in the foreign and second language learning context can be expressed in terms of proficiency levels. We show that incorporating features capturing lexical complexity information from reading passages can significantly boost the machine-learning-based classification of learner-written texts into proficiency levels. With an F1 score of .8, our system rivals state-of-the-art results reported for other languages for this task. Finally, we present a freely available web-based tool for proficiency level classification and lexical complexity visualization for both learner writings and reading texts.

pdf bib
Using Ambiguity Detection to Streamline Linguistic Annotation
Wajdi Zaghouani | Abdelati Hawwari | Sawsan Alqahtani | Houda Bouamor | Mahmoud Ghoneim | Mona Diab | Kemal Oflazer

Arabic writing is typically underspecified for short vowels and other markups, referred to as diacritics. In addition to the lexical ambiguity exhibited in most languages, the lack of diacritics in written Arabic adds another layer of ambiguity which is an artifact of the orthography. In this paper, we present the details of three experimental annotation conditions designed to study the impact of automatic ambiguity detection on annotation speed and quality in a large-scale annotation project.

pdf bib
Morphological Complexity Influences Verb-Object Order in Swedish Sign Language
Johannes Bjerva | Carl Börstell

Computational linguistic approaches to sign languages could benefit from investigating how complexity influences structure. We investigate whether morphological complexity has an effect on the order of Verb (V) and Object (O) in Swedish Sign Language (SSL), on the basis of elicited data from five Deaf signers. We find a significant difference in the distribution of the orderings OV vs. VO, based on an analysis of morphological weight. While morphologically heavy verbs exhibit a general preference for OV, humanness seems to affect the ordering in the opposite direction, with [+human] Objects pushing towards a preference for VO.

pdf bib
A Comparison Between Morphological Complexity Measures: Typological Data vs. Language Corpora
Christian Bentz | Tatyana Ruzsics | Alexander Koplenig | Tanja Samardžić

Language complexity is an intriguing phenomenon argued to play an important role in both language learning and processing. The need to compare languages with regard to their complexity resulted in a multitude of approaches and methods, ranging from accounts targeting specific structural features to global quantification of variation more generally. In this paper, we investigate the degree to which morphological complexity measures are mutually correlated in a sample of more than 500 languages of 101 language families. We use human expert judgements from the World Atlas of Language Structures (WALS), and compare them to four quantitative measures automatically calculated from language corpora. These consist of three previously defined corpus-derived measures, which are all monolingual, and one new measure based on automatic word-alignment across pairs of languages. We find strong correlations between all the measures, illustrating that both expert judgements and automated approaches converge to similar complexity ratings, and can be used interchangeably.

pdf bib
Similarity-Based Alignment of Monolingual Corpora for Text Simplification Purposes
Sarah Albertsson | Evelina Rennes | Arne Jönsson

Comparable or parallel corpora are beneficial for many NLP tasks. The automatic collection of corpora enables large-scale resources, even for less-resourced languages, which in turn can be useful for deducing rules and patterns for text rewriting algorithms, a subtask of automatic text simplification. We present two methods for the alignment of Swedish easy-to-read text segments to text segments from a reference corpus. The first method (M1) was originally developed for the task of text reuse detection, measuring sentence similarity by a modified version of a TF-IDF vector space model. A second method (M2), also accounting for part-of-speech tags, was developed, and the methods were compared. For evaluation, a crowdsourcing platform was built for human judgement data collection, and preliminary results showed that cosine similarity relates better to human ranks than the Dice coefficient. We also saw a tendency that including syntactic context in the TF-IDF vector space model is beneficial for this kind of paraphrase alignment task.
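As an illustration of the two similarity measures compared in this abstract, the following is a minimal sketch, not the authors' M1 or M2 method. The function names, the smoothed IDF formula and the toy sentences are all assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per tokenised document."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF is an assumption here; the paper uses its own modified variant.
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def dice(a, b):
    """Dice coefficient over the token sets of two segments."""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0

# Hypothetical toy segments (English stand-ins for Swedish text).
easy = "the cat sat".split()
ref_same = "the cat sat".split()
ref_other = "a dog ran".split()
vecs = tfidf_vectors([easy, ref_same, ref_other])
```

Cosine similarity weights shared tokens by their TF-IDF values, whereas the Dice coefficient only counts set overlap, which is one plausible reason the two can rank candidate alignments differently.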

pdf bib
Automatic Construction of Large Readability Corpora
Jorge Alberto Wagner Filho | Rodrigo Wilkens | Aline Villavicencio

This work presents a framework for the automatic construction of large Web corpora classified by readability level. We compare different Machine Learning classifiers for the task of readability assessment focusing on Portuguese and English texts, analysing the impact of variables such as the feature inventory on the resulting corpus. In a comparison between shallow and deeper features, the former already produce F-measures of over 0.75 for Portuguese texts, but adding further features yields even better results in most cases. For English, shallow features also perform well, as do classic readability formulas. Comparing different classifiers for the task, logistic regression obtained, in general, the best results, but with considerable differences between the results for two classes and those for three classes, especially regarding the intermediary class. Given the large scale of the resulting corpus, for evaluation we adopt the agreement between different classifiers as an indication of readability assessment certainty. As a result of this work, a large corpus for Brazilian Portuguese was built, including 1.7 million documents and about 1.6 billion tokens, already parsed and annotated with 134 different textual attributes, along with the agreement among the various classifiers.

pdf bib
Testing the Processing Hypothesis of word order variation using a probabilistic language model
Jelke Bloem

This work investigates the application of a measure of surprisal to modeling a grammatical variation phenomenon between near-synonymous constructions. We investigate a particular variation phenomenon, word order variation in Dutch two-verb clusters, where it has been established that word order choice is affected by processing cost. Several multifactorial corpus studies of Dutch verb clusters have used other measures of processing complexity to show that this factor affects word order choice. This previous work allows us to compare the surprisal measure, which is based on constraint satisfaction theories of language modeling, to those previously used measures, which are more directly linked to empirical observations of processing complexity. Our results show that surprisal does not predict the word order choice by itself, but is a significant predictor when used in a measure of uniform information density (UID). This lends support to the view that human language processing is facilitated not so much by predictable sequences of words but more by sequences of words in which information is spread evenly.
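The two quantities named in this abstract can be sketched as follows. Surprisal is the standard negative log probability of a word in context; the UID score below uses the variance of per-word surprisal, which is one common operationalization and an assumption here, not necessarily the measure used in the paper. The function names are hypothetical.

```python
import math

def surprisal(prob):
    """Surprisal in bits of an event with conditional probability `prob`."""
    return -math.log2(prob)

def uid_variance(word_probs):
    """Variance of per-word surprisal; lower values mean a more even spread.

    `word_probs` holds the conditional probability of each word in a
    sequence, as supplied by some language model (assumed, not shown).
    """
    values = [surprisal(p) for p in word_probs]
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)
```

Under this sketch, two candidate word orders with the same total surprisal can still differ in UID score, because the score rewards orders that distribute information evenly across words.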

pdf bib
Temporal Lobes as Combinatory Engines for both Form and Meaning
Jixing Li | Jonathan Brennan | Adam Mahar | John Hale

The relative contributions of meaning and form to sentence processing remain an outstanding issue across the language sciences. We examine this issue by formalizing four incremental complexity metrics and comparing them against freely-available ROI timecourses. Syntax-related metrics based on top-down parsing and structural dependency-distance turn out to significantly improve a regression model, compared to a simpler model that formalizes only conceptual combination using a distributional vector-space model. This confirms the view of the anterior temporal lobes as combinatory engines that deal in both form (see e.g., Brennan et al., 2012; Mazoyer, 1993) and meaning (see e.g., Patterson et al., 2007). This same characterization applies to a posterior temporal region in roughly “Wernicke’s Area.”

pdf bib
Automatic Speech Recognition Errors as a Predictor of L2 Listening Difficulties
Maryam Sadat Mirzaei | Kourosh Meshgi | Tatsuya Kawahara

This paper investigates the use of automatic speech recognition (ASR) errors as indicators of second language (L2) learners’ listening difficulties and, in doing so, strives to overcome the shortcomings of the Partial and Synchronized Caption (PSC) system. PSC is a system that generates a partial caption including difficult words detected based on high speech rate, low frequency, and specificity. To improve the choice of words in this system, and to explore a better method for detecting speech challenges, ASR errors were investigated as a model of the L2 listener, hypothesizing that some of these errors are similar to those of language learners when transcribing the videos. To investigate this hypothesis, ASR errors in the transcription of several TED talks were analyzed and compared with PSC’s selected words. Both the overlapping and mismatching cases were analyzed to investigate possible improvements for the PSC system. Those ASR errors that were not detected by PSC as cases of learners’ difficulties were further analyzed and classified into four categories: homophones, minimal pairs, breached boundaries and negatives. These errors were embedded into the baseline PSC to make the enhanced version and were evaluated in an experiment with L2 learners. The results indicated that the enhanced version, which encompasses the ASR errors, addresses most of the L2 learners’ difficulties and better assists them in comprehending challenging video segments as compared with the baseline.

pdf bib
Quantifying sentence complexity based on eye-tracking measures
Abhinav Deep Singh | Poojan Mehta | Samar Husain | Rajkumar Rajakrishnan

Eye-tracking reading times have been attested to reflect cognitive processes underlying sentence comprehension. However, the use of reading times in NLP applications is an underexplored area of research. In this initial work we build an automatic system to assess sentence complexity using automatically predicted eye-tracking reading time measures and demonstrate the efficacy of these reading times for a well known NLP task, namely, readability assessment. We use a machine learning model and a set of features known to be significant predictors of reading times in order to learn per-word reading times from a corpus of English text having reading times of human readers. Subsequently, we use the model to predict reading times for novel text in the context of the aforementioned task. A model based only on reading times gave competitive results compared to the systems that use extensive syntactic features to compute linguistic complexity. Our work, to the best of our knowledge, is the first study to show that automatically predicted reading times can successfully model the difficulty of a text and can be deployed in practical text processing applications.

pdf bib
Upper Bound of Entropy Rate Revisited —A New Extrapolation of Compressed Large-Scale Corpora—
Ryosuke Takahira | Kumiko Tanaka-Ishii | Łukasz Dębowski

The article presents results of entropy rate estimation for six human languages by using large, state-of-the-art corpora of up to 7.8 gigabytes. To obtain the estimates for data length tending to infinity, we use an extrapolation function given by an ansatz. Whereas some ansatzes of this kind were proposed in previous research papers, here we introduce a stretched exponential extrapolation function that has a smaller error of fit. In this way, we uncover a possibility that the entropy rates of human languages are positive but 20% smaller than previously reported.

pdf bib
Learning pressures reduce morphological complexity: Linking corpus, computational and experimental evidence
Christian Bentz | Aleksandrs Berdicevskis

The morphological complexity of languages differs widely and changes over time. Pathways of change are often driven by the interplay of multiple competing factors, and are hard to disentangle. We here focus on a paradigmatic scenario of language change: the reduction of morphological complexity from Latin towards the Romance languages. To establish a causal explanation for this phenomenon, we employ three lines of evidence: 1) analyses of parallel corpora to measure the complexity of words in actual language production, 2) applications of NLP tools to further tease apart the contribution of inflectional morphology to word complexity, and 3) experimental data from artificial language learning, which illustrate the learning pressures at play when morphology simplifies. These three lines of evidence converge to show that pressures associated with imperfect language learning are good candidates to causally explain the reduction in morphological complexity in the Latin-to-Romance scenario. More generally, we argue that combining corpus, computational and experimental evidence is the way forward in historical linguistics and linguistic typology.

up

pdf (full)
bib (full)
Proceedings of the Clinical Natural Language Processing Workshop (ClinicalNLP)

pdf bib
Proceedings of the Clinical Natural Language Processing Workshop (ClinicalNLP)
Anna Rumshisky | Kirk Roberts | Steven Bethard | Tristan Naumann

pdf bib
The impact of simple feature engineering in multilingual medical NER
Rebecka Weegar | Arantza Casillas | Arantza Diaz de Ilarraza | Maite Oronoz | Alicia Pérez | Koldo Gojenola

The goal of this paper is to examine the impact of simple feature engineering mechanisms before applying more sophisticated techniques to the task of medical NER. Sometimes papers using scientifically sound techniques present raw baselines that could be improved by adding simple and cheap features. This work focuses on entity recognition for the clinical domain for three languages: English, Swedish and Spanish. The task is tackled using simple features, starting from the window size, capitalization, prefixes, and moving to POS and semantic tags. This work demonstrates that a simple initial step of feature engineering can improve the baseline results significantly. Hence, the contributions of this paper are: first, a short list of guidelines well supported with experimental results on three languages and, second, a detailed description of the relevance of these features for medical NER.

pdf bib
Bidirectional LSTM-CRF for Clinical Concept Extraction
Raghavendra Chalapathy | Ehsan Zare Borzeshi | Massimo Piccardi

Automated extraction of concepts from patient clinical records is an essential facilitator of clinical research. For this reason, the 2010 i2b2/VA Natural Language Processing Challenges for Clinical Records introduced a concept extraction task aimed at identifying and classifying concepts into predefined categories (i.e., treatments, tests and problems). State-of-the-art concept extraction approaches heavily rely on handcrafted features and domain-specific resources which are hard to collect and define. For this reason, this paper proposes an alternative, streamlined approach: a recurrent neural network (the bidirectional LSTM with CRF decoding) initialized with general-purpose, off-the-shelf word embeddings. The experimental results achieved on the 2010 i2b2/VA reference corpora using the proposed framework outperform all recent methods and rank close to the best submission from the original 2010 i2b2/VA challenge.

pdf bib
MedNLPDoc: Japanese Shared Task for Clinical NLP
Eiji Aramaki | Yoshinobu Kano | Tomoko Ohkuma | Mizuki Morita

Due to the recent replacement of physical documents with electronic medical records (EMR), the importance of information processing in medical fields has increased. We have been organizing the MedNLP task series in NTCIR-10 and 11. These workshops were the first shared tasks that attempted to evaluate technologies that retrieve important information from medical reports written in Japanese. In this report, we describe the NTCIR-12 MedNLPDoc task, which is designed for more advanced and practical use in the medical fields. This task is framed as a multi-label task over a patient record. This report presents the results of the shared task, and discusses and illustrates remaining issues in the medical natural language processing field.

pdf bib
Feature-Augmented Neural Networks for Patient Note De-identification
Ji Young Lee | Franck Dernoncourt | Özlem Uzuner | Peter Szolovits

Patient notes contain a wealth of information of potentially great interest to medical investigators. However, to protect patients’ privacy, Protected Health Information (PHI) must be removed from the patient notes before they can be legally released, a process known as patient note de-identification. The main objective for a de-identification system is to have the highest possible recall. Recently, the first neural-network-based de-identification system has been proposed, yielding state-of-the-art results. Unlike other systems, it does not rely on human-engineered features, which allows it to be quickly deployed, but does not leverage knowledge from human experts or from electronic health records (EHRs). In this work, we explore a method to incorporate human-engineered features as well as features derived from EHRs into a neural-network-based de-identification system. Our results show that the addition of features, especially the EHR-derived features, further improves the state-of-the-art in patient note de-identification, including for some of the most sensitive PHI types such as patient names. Since in a real-life setting patient notes typically come with EHRs, we recommend that developers of de-identification systems leverage the information EHRs contain.

pdf bib
Semi-supervised Clustering of Medical Text
Pracheta Sahoo | Asif Ekbal | Sriparna Saha | Diego Mollá | Kaushik Nandan

Semi-supervised clustering is an attractive alternative to traditional (unsupervised) clustering in targeted applications. By using the information of a small annotated dataset, semi-supervised clustering can produce clusters that are customized to the application domain. In this paper, we present a semi-supervised clustering technique based on a multi-objective evolutionary algorithm (NSGA-II-clus). We apply this technique to the task of clustering medical publications for Evidence Based Medicine (EBM) and observe an improvement of the results against unsupervised and other semi-supervised clustering techniques.

pdf bib
Deep Learning Architecture for Patient Data De-identification in Clinical Records
Shweta Yadav | Asif Ekbal | Sriparna Saha | Pushpak Bhattacharyya

Rapid growth in Electronic Medical Records (EMR) has led to an expansion of data in the clinical domain. The majority of the available health care information is sealed in the form of narrative documents, which form a rich source of clinical information. Text mining of such clinical records has gained huge attention in various medical applications like treatment and decision making. However, medical records contain patients’ Private Health Information (PHI), which can reveal the identities of the patients. In order to retain the privacy of patients, it is mandatory to remove all PHI before the records are made publicly available. The aim is to de-identify or encrypt the PHI from the patient medical records. In this paper, we propose an algorithm based on a deep learning architecture to solve this problem. We perform de-identification of seven PHI terms from the clinical records. Experiments on benchmark datasets show that our proposed approach achieves encouraging performance, which is better than the baseline model developed with Conditional Random Field.

pdf bib
Neural Clinical Paraphrase Generation with Attention
Sadid A. Hasan | Bo Liu | Joey Liu | Ashequl Qadir | Kathy Lee | Vivek Datla | Aaditya Prakash | Oladimeji Farri

Paraphrase generation is important in various applications such as search, summarization, and question answering due to its ability to generate textual alternatives while keeping the overall meaning intact. Clinical paraphrase generation is especially vital in building patient-centric clinical decision support (CDS) applications where users are able to understand complex clinical jargon via easily comprehensible alternative paraphrases. This paper presents Neural Clinical Paraphrase Generation (NCPG), a novel approach that casts the task as a monolingual neural machine translation (NMT) problem. We propose an end-to-end neural network built on an attention-based bidirectional Recurrent Neural Network (RNN) architecture with an encoder-decoder framework to perform the task. Conventional bilingual NMT models mostly rely on word-level modeling and are often limited by out-of-vocabulary (OOV) issues. In contrast, we represent the source and target paraphrase pairs as character sequences to address this limitation. To the best of our knowledge, this is the first work that uses attention-based RNNs for clinical paraphrase generation and also proposes end-to-end character-level modeling for this task. Extensive experiments on a large curated clinical paraphrase corpus show that the attention-based NCPG models achieve improvements of up to 5.2 BLEU points and 0.5 METEOR points over a strong non-attention-based baseline for word-level modeling, whereas further gains of up to 6.1 BLEU points and 1.3 METEOR points are obtained by the character-level NCPG models over their word-level counterparts. Overall, our models demonstrate comparable performance relative to the state-of-the-art phrase-based non-neural models.

pdf bib
Assessing the Corpus Size vs. Similarity Trade-off for Word Embeddings in Clinical NLP
Kirk Roberts

The proliferation of deep learning methods in natural language processing (NLP) and the large amounts of data they often require stand in stark contrast to the relatively data-poor clinical NLP domain. In particular, large text corpora are necessary to build high-quality word embeddings, yet often large corpora that are suitably representative of the target clinical data are unavailable. This forces a choice between building embeddings from small clinical corpora and less representative, larger corpora. This paper explores this trade-off, as well as intermediate compromise solutions. Two standard clinical NLP tasks (the i2b2 2010 concept and assertion tasks) are evaluated with commonly used deep learning models (recurrent neural networks and convolutional neural networks) using a set of six corpora ranging from the target i2b2 data to large open-domain datasets. While combinations of corpora are generally found to work best, the single-best corpus is generally task-dependent.

pdf bib
Inference of ICD Codes from Japanese Medical Records by Searching Disease Names
Masahito Sakishita | Yoshinobu Kano

The importance of utilizing medical information is increasing as electronic health records (EHRs) become widely used. We aim to assign international standardized disease codes, ICD-10, to Japanese textual information in EHRs so that users can reuse the information accurately. In this paper, we propose methods to automatically extract diagnoses and to assign ICD codes to Japanese medical records. Due to the lack of available training data, we chose to employ rule-based methods rather than machine learning. We carefully observed the characteristics of medical records and wrote rules by hand to make the methods effective. We applied our system to the NTCIR-12 MedNLPDoc shared task data, where participants are required to assign ICD-10 codes of possible diagnoses in given EHRs. In this shared task, our system achieved the highest F-measure score among all participants under the strictest evaluation criteria. Through comparison with other approaches, we show that our approach could be a useful milestone for the future development of Japanese medical record processing.

pdf bib
A fine-grained corpus annotation schema of German nephrology records
Roland Roller | Hans Uszkoreit | Feiyu Xu | Laura Seiffe | Michael Mikhailov | Oliver Staeck | Klemens Budde | Fabian Halleck | Danilo Schmidt

In this work we present a fine-grained annotation schema to detect named entities in German clinical data of chronically ill patients with kidney diseases. The annotation schema is driven by the needs of our clinical partners and the linguistic aspects of the German language. In order to generate annotations within a short period, the work also presents a semi-automatic annotation which uses additional sources of knowledge, such as UMLS, to pre-annotate concepts in advance. The presented schema will be used to apply novel techniques from natural language processing and machine learning to support doctors treating their patients by improved information access from unstructured German texts.

pdf bib
Detecting Japanese Patients with Alzheimer’s Disease based on Word Category Frequencies
Daisaku Shibata | Shoko Wakamiya | Ayae Kinoshita | Eiji Aramaki

In recent years, detecting Alzheimer’s disease (AD) in early stages based on natural language processing (NLP) has drawn much attention. To date, vocabulary size, grammatical complexity, and fluency have been studied using NLP metrics. However, the content analysis of AD narratives is still out of reach for NLP. This study investigates features of the words that AD patients use in their spoken language. We recruited 18 examinees aged 53–90 years (mean: 76.89) and divided them into two groups based on MMSE scores. The AD group comprised 9 examinees with scores of 21 or lower. The healthy control group comprised 9 examinees with scores of 22 or higher. Word categories from Linguistic Inquiry and Word Count (LIWC) were used to categorize the words that the examinees used. The word frequency was found from observation. Significant differences were confirmed for the usage of impersonal pronouns in the AD group. This result demonstrated the basic feasibility of the proposed NLP-based detection approach.

pdf bib
Prediction of Key Patient Outcome from Sentence and Word of Medical Text Records
Takanori Yamashita | Yoshifumi Wakata | Hidehisa Soejima | Naoki Nakashima | Sachio Hirokawa

The number of unstructured medical records kept in hospital information systems is increasing. The conditions of patients are formulated as outcomes in a clinical pathway. A variance of an outcome describes a deviation from standards of care, such as a patient’s bad condition. The present paper applies text mining to extract feature words and phrases of the variance from admission records. We report on the cases of the variances of “pain control” and “no neuropathy worsening” in cerebral infarction.

pdf bib
Unsupervised Abbreviation Detection in Clinical Narratives
Markus Kreuzthaler | Michel Oleynik | Alexander Avian | Stefan Schulz

Clinical narratives in electronic health record systems are a rich resource of patient-based information. They constitute an ongoing challenge for natural language processing, due to their high compactness and abundance of short forms. German medical texts exhibit numerous ad-hoc abbreviations that terminate with a period character. The disambiguation of period characters is therefore an important task for sentence and abbreviation detection. This task is addressed by a combination of co-occurrence information of word types with trailing period characters, a large domain dictionary, and a simple rule engine, thus merging statistical and dictionary-based disambiguation strategies. An F-measure of 0.95 could be reached by using the unsupervised approach presented in this paper. The results are promising for a domain-independent abbreviation detection strategy, because our approach avoids retraining of models or use case specific feature engineering efforts required for supervised machine learning approaches.

pdf bib
Automated Anonymization as Spelling Variant Detection
Steven Kester Yuwono | Hwee Tou Ng | Kee Yuan Ngiam

The issue of privacy has always been a concern when clinical texts are used for research purposes. Personal health information (PHI) (such as name and identification number) needs to be removed so that patients cannot be identified. Manual anonymization is not feasible due to the large number of clinical texts to be anonymized. In this paper, we tackle the task of anonymizing clinical texts written in sentence fragments and which frequently contain symbols, abbreviations, and misspelled words. Our clinical texts therefore differ from those in the i2b2 shared tasks which are in prose form with complete sentences. Our clinical texts are also part of a structured database which contains patient name and identification number in structured fields. As such, we formulate our anonymization task as spelling variant detection, exploiting patients’ personal information in the structured fields to detect their spelling variants in clinical texts. We successfully anonymized clinical texts consisting of more than 200 million words, using minimum edit distance and regular expression patterns.
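The formulation in this abstract, matching tokens in free text against a patient name from a structured field by minimum edit distance, can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' system: the function names, the distance threshold of 1, the letters-only token pattern and the `<NAME>` placeholder are all hypothetical.

```python
import re

def edit_distance(a, b):
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def anonymize(text, name, max_dist=1, placeholder="<NAME>"):
    """Replace alphabetic tokens within `max_dist` edits of `name`."""
    target = name.lower()

    def replace(match):
        token = match.group(0)
        close = edit_distance(token.lower(), target) <= max_dist
        return placeholder if close else token

    return re.sub(r"[A-Za-z]+", replace, text)
```

For example, `anonymize("pt Jhn c/o pain", "John")` replaces the misspelled `Jhn` and leaves the other tokens alone. A fixed threshold would over-match for very short names, so a real system would likely scale it with name length.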

up

pdf (full)
bib (full)
Proceedings of the Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media (PEOPLES)

pdf bib
Proceedings of the Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media (PEOPLES)
Malvina Nissim | Viviana Patti | Barbara Plank

pdf bib
Zooming in on Gender Differences in Social Media
Aparna Garimella | Rada Mihalcea

Men are from Mars and women are from Venus - or so the genre of relationship literature would have us believe. But there is some truth in this idea, and researchers in fields as diverse as psychology, sociology, and linguistics have explored ways to better understand the differences between genders. In this paper, we take another look at the problem of gender discrimination and attempt to move beyond the typical surface-level text classification approach, by (1) identifying semantic and psycholinguistic word classes that reflect systematic differences between men and women and (2) finding differences between genders in the ways they use the same words. We describe several experiments and report results on a large collection of blogs authored by men and women.

pdf bib
The Effect of Gender and Age Differences on the Recognition of Emotions from Facial Expressions
Daniela Schneevogt | Patrizia Paggio

Recent studies have demonstrated gender and cultural differences in the recognition of emotions in facial expressions. However, most studies were conducted on American subjects. In this paper, we explore the generalizability of several findings to a non-American culture in the form of Danish subjects. We conduct an emotion recognition task followed by two stereotype questionnaires with different genders and age groups. While recent findings (Krems et al., 2015) suggest that women are biased to see anger in neutral facial expressions posed by females, in our sample both genders assign higher ratings of anger to all emotions expressed by females. Furthermore, we demonstrate an effect of gender on the fear-surprise confusion observed by Tomkins and McCarter (1964); females overpredict fear, while males overpredict surprise.

pdf bib
A Recurrent and Compositional Model for Personality Trait Recognition from Short Texts
Fei Liu | Julien Perez | Scott Nowson

Many methods have been used to recognise author personality traits from text, typically combining linguistic feature engineering with shallow learning models, e.g. linear regression or Support Vector Machines. This work uses deep-learning-based models and atomic features of text, the characters, to build hierarchical, vectorial word and sentence representations for trait inference. This method, applied to a corpus of tweets, shows state-of-the-art performance across five traits compared with prior work. The results, supported by preliminary visualisation work, are encouraging for the ability to detect complex human traits.

pdf bib
Distant supervision for emotion detection using Facebook reactions
Chris Pool | Malvina Nissim

We exploit the Facebook reaction feature in a distantly supervised fashion to train a support vector machine classifier for emotion detection, using several feature combinations and combining different Facebook pages. We test our models on existing benchmarks for emotion detection and show that, by employing only information that is derived completely automatically, and thus without relying on any handcrafted lexicon as is usually done, we can achieve competitive results. The results also show that there is considerable room for improvement, especially by gearing the collection of Facebook pages to the target domain.

pdf bib
A graphical framework to detect and categorize diverse opinions from online news
Ankan Mullick | Pawan Goyal | Niloy Ganguly

This paper proposes a graphical framework to extract opinionated sentences which highlight different contexts within a given news article by introducing the concept of diversity in a graphical model for opinion detection. We conduct extensive evaluations and find that the proposed modification leads to impressive improvement in performance and makes the final results of the model much more usable. The proposed method (OP-D) not only performs much better than the other techniques used for opinion detection as well as for introducing diversity, but is also able to select opinions from different categories (Asher et al. 2009). By developing a classification model which categorizes the identified sentences into various opinion categories, we find that OP-D is able to push opinions from different categories uniformly among the top opinions.

pdf bib
Active learning for detection of stance components
Maria Skeppstedt | Magnus Sahlgren | Carita Paradis | Andreas Kerren

Automatic detection of five language components, which are all relevant for expressing opinions and for stance taking, was studied: positive sentiment, negative sentiment, speculation, contrast and condition. A resource-aware approach was taken, which included manual annotation of 500 training samples and the use of limited lexical resources. Active learning was compared to random selection of training data, as well as to a lexicon-based method. Active learning was successful for the categories speculation, contrast and condition, but not for the two sentiment categories, for which results achieved when using active learning were similar to those achieved when applying a random selection of training data. This difference is likely due to a larger variation in how sentiment is expressed than in how speakers express the other three categories. This larger variation was also shown by the lower recall results achieved by the lexicon-based approach for sentiment than for the categories speculation, contrast and condition.

pdf bib
Detecting Opinion Polarities using Kernel Methods
Rasoul Kaljahi | Jennifer Foster

We investigate the application of kernel methods to representing both structural and lexical knowledge for predicting polarity of opinions in consumer product review. We introduce any-gram kernels which model lexical information in a significantly faster way than the traditional n-gram features, while capturing all possible orders of n-grams n in a sequence without the need to explicitly present a pre-specified set of such orders. We also present an effective format to represent constituency and dependency structure together with aspect terms and sentiment polarity scores. Furthermore, we modify the traditional tree kernel function to compute the similarity based on word embedding vectors instead of exact string match and present experiments using the new models.

pdf bib
Effects of Semantic Relatedness between Setups and Punchlines in Twitter Hashtag Games
Andrew Cattle | Xiaojuan Ma

This paper explores humour recognition for Twitter-based hashtag games. Given their popularity, frequency, and relatively formulaic nature, these games make a good target for computational humour research and can leverage Twitter likes and retweets as humour judgments. In this work, we use pair-wise relative humour judgments to examine several measures of semantic relatedness between setups and punchlines on a hashtag game corpus we collected and annotated. Results show that perplexity, Normalized Google Distance, and free-word association-based features are all useful in identifying “funnier” hashtag game responses. In fact, we provide empirical evidence that funnier punchlines tend to be more obscure, although more obscure punchlines are not necessarily rated funnier. Furthermore, the asymmetric nature of free-word association features allows us to see that while punchlines should be harder to predict given a setup, they should also be relatively easy to understand in context.
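
One of the relatedness measures named in this abstract, Normalized Google Distance, has a standard closed form (Cilibrasi and Vitányi) that can be sketched as below. The counts passed in are hypothetical inputs: how the authors obtained term frequencies and what index size they used is not stated here, so the function simply takes them as arguments.

```python
import math


def normalized_google_distance(fx: float, fy: float, fxy: float, n: float) -> float:
    """Normalized Google Distance between two terms x and y.

    fx, fy : number of documents (or pages) containing x, and containing y
    fxy    : number of documents containing both x and y
    n      : total number of documents indexed (must exceed fx and fy)
    """
    if fx <= 0 or fy <= 0:
        raise ValueError("each term must occur at least once")
    if n <= max(fx, fy):
        raise ValueError("n must be larger than both term counts")
    if fxy <= 0:
        # Terms never co-occur: treat the distance as infinite.
        return math.inf
    log_x, log_y, log_xy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_x, log_y) - log_xy) / (math.log(n) - min(log_x, log_y))


if __name__ == "__main__":
    # Hypothetical counts for a setup word and a punchline word.
    print(normalized_google_distance(1000, 100, 10, 10**6))
```

Lower values indicate terms that tend to appear together, and higher values indicate more distant pairs.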

pdf bib
Generating Sentiment Lexicons for German Twitter
Uladzimir Sidarenka | Manfred Stede

Despite substantial progress made in developing new sentiment lexicon generation (SLG) methods for English, the task of transferring these approaches to other languages and domains in a sound way still remains open. In this paper, we contribute to the solution of this problem by systematically comparing semi-automatic translations of common English polarity lists with the results of the original automatic SLG algorithms, which were applied directly to German data. We evaluate these lexicons on a corpus of 7,992 manually annotated tweets. In addition to that, we also collate the results of dictionary- and corpus-based SLG methods in order to find out which of these paradigms is better suited for the inherently noisy domain of social media. Our experiments show that semi-automatic translations notably outperform automatic systems (reaching a macro-averaged F1-score of 0.589), and that dictionary-based techniques produce much better polarity lists as compared to corpus-based approaches (whose best F1-scores run up to 0.479 and 0.419 respectively) even for the non-standard Twitter genre.

pdf bib
Innovative Semi-Automatic Methodology to Annotate Emotional Corpora
Lea Canales | Carlo Strapparava | Ester Boldrini | Patricio Martínez-Barco

Detecting depression or personality traits, tutoring and student behaviour systems, and identifying cases of cyber-bullying are a few of the wide range of applications in which the automatic detection of emotion is a crucial element. Emotion detection has the potential for high impact by benefiting business, society, politics and education. Given this context, the main objective of our research is to contribute to the resolution of one of the most important challenges in the textual emotion detection task: the annotation of emotional corpora. We tackle this by proposing a new semi-automatic methodology. Our methodology consists of two main phases: (1) an automatic process to pre-annotate the unlabelled sentences with a reduced number of emotional categories; and (2) a manual refinement process in which human annotators determine which is the predominant emotion among the emotional categories selected in phase 1. In this paper, we present and evaluate the pre-annotation process to analyse the feasibility and the benefits of the proposed methodology. The results obtained are promising: they show a substantial improvement in annotation time and cost and confirm the usefulness of our pre-annotation process for improving the annotation task.

pdf bib
Personality Estimation from Japanese Text
Koichi Kamijo | Tetsuya Nasukawa | Hideya Kitamura

We created a model to estimate personality traits from authors’ text written in Japanese and measured its performance by conducting surveys and analyzing the Twitter data of 1,630 users. We used the Big Five personality traits for personality trait estimation. Our approach is a combination of category- and Word2Vec-based approaches. For the category-based element, we added several unique Japanese categories along with the ones regularly used in the English model, and for the Word2Vec-based element, we used a model called GloVe. We found that some of the newly added categories have a stronger correlation with personality traits than other categories do and that the combination of the category- and Word2Vec-based approaches improves the accuracy of the personality trait estimation compared with the case of using just one of them.

pdf bib
Predicting Brexit: Classifying Agreement is Better than Sentiment and Pollsters
Fabio Celli | Evgeny Stepanov | Massimo Poesio | Giuseppe Riccardi

On June 23rd 2016, the UK held the referendum which ratified the exit from the EU. While most of the traditional pollsters failed to forecast the final vote, there were online systems that hit the result with high accuracy using opinion mining techniques and big data. Starting one month before, we collected and monitored millions of posts about the referendum from social media conversations, and exploited Natural Language Processing techniques to predict the referendum outcome. In this paper we discuss the methods used by traditional pollsters and compare them to the predictions based on different opinion mining techniques. We find that opinion mining based on agreement/disagreement classification works better than opinion mining based on polarity classification in the forecast of the referendum outcome.

pdf bib
Sarcasm Detection : Building a Contextual Hierarchy
Taradheesh Bali | Navjyoti Singh

The conundrum of understanding and classifying sarcasm has been dealt with by the traditional theorists as an analysis of a sarcastic utterance and the ironic situation that surrounds it. The problem with such an approach is that it is too narrow, as it is unable to sufficiently utilize the two indispensable agents in making such an utterance, viz. the speaker and the listener. It undermines the necessary context required to comprehend a sarcastic utterance. In this paper, we propose a novel approach towards understanding sarcasm in terms of the existing knowledge hierarchy between the two participants, which forms the basis of the context that both agents share. The difference in relationship between the speaker of the sarcastic utterance and the disparate audience found on social media, such as Twitter, is also captured. We then apply our model to a corpus of tweets to achieve significant results and consequently shed light on the subjective nature of context, which is contingent on the relation between the speaker and the listener.

pdf bib
Social and linguistic behavior and its correlation to trait empathy
Marina Litvak | Jahna Otterbacher | Chee Siang Ang | David Atkins

A growing body of research exploits social media behaviors to gauge psychological characteristics, though trait empathy has received little attention. Because of its intimate link to the ability to relate to others, our research aims to predict participants’ levels of empathy, given their textual and friending behaviors on Facebook. Using Poisson regression, we compared the variance explained in Davis’ Interpersonal Reactivity Index (IRI) scores on four constructs (empathic concern, personal distress, fantasy, perspective taking), by two classes of variables: 1) post content and 2) linguistic style. Our study lays the groundwork for a greater understanding of empathy’s role in facilitating interactions on social media.

pdf bib
The Challenges of Multi-dimensional Sentiment Analysis Across Languages
Emily Öhman | Timo Honkela | Jörg Tiedemann

This paper outlines a pilot study on multi-dimensional and multilingual sentiment analysis of social media content. We use parallel corpora of movie subtitles as a proxy for colloquial language in social media channels and a multilingual emotion lexicon for fine-grained sentiment analyses. Parallel data sets make it possible to study the preservation of sentiments and emotions in translation and our assessment reveals that the lexical approach shows great inter-language agreement. However, our manual evaluation also suggests that the use of purely lexical methods is limited and further studies are necessary to pinpoint the cross-lingual differences and to develop better sentiment classifiers.

pdf bib
The Social Mood of News: Self-reported Annotations to Design Automatic Mood Detection Systems
Firoj Alam | Fabio Celli | Evgeny A. Stepanov | Arindam Ghosh | Giuseppe Riccardi

In this paper, we address the issue of automatic prediction of readers’ mood from newspaper articles and comments. As online newspapers are becoming more and more similar to social media platforms, users can provide affective feedback, such as mood and emotion. We have exploited the self-reported annotation of mood categories obtained from the metadata of the Italian online newspaper corriere.it to design and evaluate a system for predicting five different mood categories from news articles and comments: indignation, disappointment, worry, satisfaction, and amusement. The outcome of our experiments shows that overall, bag-of-word-ngrams perform better compared to all other feature sets; however, stylometric features perform better for the mood score prediction of articles. Our study shows that self-reported annotations can be used to design automatic mood prediction systems.

pdf bib
Microblog Emotion Classification by Computing Similarity in Text, Time, and Space
Anja Summa | Bernd Resch | Michael Strube

Most work in NLP analysing microblogs focuses on textual content thus neglecting temporal and spatial information. We present a new interdisciplinary method for emotion classification that combines linguistic, temporal, and spatial information into a single metric. We create a graph of labeled and unlabeled tweets that encodes the relations between neighboring tweets with respect to their emotion labels. Graph-based semi-supervised learning labels all tweets with an emotion.

pdf bib
A domain-agnostic approach for opinion prediction on speech
Pedro Bispo Santos | Lisa Beinborn | Iryna Gurevych

We explore a domain-agnostic approach for analyzing speech with the goal of opinion prediction. We represent the speech signal by mel-frequency cepstral coefficients and apply long short-term memory neural networks to automatically learn temporal regularities in speech. In contrast to previous work, our approach does not require complex feature engineering and works without textual transcripts. As a consequence, it can easily be applied to various speech analysis tasks for different languages, and the results show that it can nevertheless be competitive with the state of the art in opinion prediction. In a detailed error analysis for opinion mining we find that our approach performs well in identifying speaker-specific characteristics, but should be combined with additional information if subtle differences in the linguistic content need to be identified.

pdf bib
Can We Make Computers Laugh at Talks?
Chong Min Lee | Su-Youn Yoon | Lei Chen

Considering the importance of public speaking skills, a system that predicts where audiences laugh in a talk can be helpful to a person preparing a talk. We investigated whether a state-of-the-art humor recognition system can be used to detect sentences that induce laughter in talks. In this study, we used TED talks and the laughter in the talks as data. Our results showed that the state-of-the-art system needs to be improved in order to be used in a practical application. In addition, our analysis showed that classifying humorous sentences in talks is very challenging due to the close distance between humorous and non-humorous sentences.

pdf bib
Towards Automatically Classifying Depressive Symptoms from Twitter Data for Population Health
Danielle L. Mowery | Albert Park | Craig Bryan | Mike Conway

Major depressive disorder, a debilitating and burdensome disease experienced by individuals worldwide, can be defined by several depressive symptoms (e.g., anhedonia (inability to feel pleasure), depressed mood, difficulty concentrating, etc.). Individuals often discuss their experiences with depression symptoms on public social media platforms like Twitter, providing a potentially useful data source for monitoring population-level mental health risk factors. In a step towards developing an automated method to estimate the prevalence of symptoms associated with major depressive disorder over time in the United States using Twitter, we developed classifiers for discerning whether a Twitter tweet represents no evidence of depression or evidence of depression. If there was evidence of depression, we then classified whether the tweet contained a depressive symptom and if so, which of three subtypes: depressed mood, disturbed sleep, or fatigue or loss of energy. We observed that the most accurate classifiers could predict classes with high-to-moderate F1-score performances for no evidence of depression (85), evidence of depression (52), and depressive symptoms (49). We report moderate F1-scores for depressive symptoms ranging from 75 (fatigue or loss of energy) to 43 (disturbed sleep) to 35 (depressed mood). Our work demonstrates baseline approaches for automatically encoding Twitter data with granular depressive symptoms associated with major depressive disorder.

pdf (full)
bib (full)
Proceedings of the Open Knowledge Base and Question Answering Workshop (OKBQA 2016)

pdf bib
Proceedings of the Open Knowledge Base and Question Answering Workshop (OKBQA 2016)
Key-Sun Choi | Christina Unger | Piek Vossen | Jin-Dong Kim | Noriko Kando | Axel-Cyrille Ngonga Ngomo

pdf bib
Using Wikipedia and Semantic Resources to Find Answer Types and Appropriate Answer Candidate Sets in Question Answering
Po-Chun Chen | Meng-Jie Zhuang | Chuan-Jie Lin

This paper proposes a new idea that uses Wikipedia categories as answer types and defines candidate sets inside Wikipedia. The focus of a given question is searched in the hierarchy of Wikipedia main pages. Our searching strategy combines head-noun matching and synonym matching provided in semantic resources. The set of answer candidates is determined by the entry hierarchy in Wikipedia and the hyponymy hierarchy in WordNet. The experimental results show that the approach can find smaller candidate sets yet achieve better performance, especially for the ARTIFACT and ORGANIZATION types, where the performance is better than state-of-the-art Chinese factoid QA systems.

pdf bib
Large-Scale Acquisition of Commonsense Knowledge via a Quiz Game on a Dialogue System
Naoki Otani | Daisuke Kawahara | Sadao Kurohashi | Nobuhiro Kaji | Manabu Sassano

Commonsense knowledge is essential for fully understanding language in many situations. We acquire large-scale commonsense knowledge from humans using a game with a purpose (GWAP) developed on a smartphone spoken dialogue system. We transform the manual knowledge acquisition process into an enjoyable quiz game and have collected over 150,000 unique commonsense facts by gathering the data of more than 70,000 players over eight months. In this paper, we present a simple method for maintaining the quality of acquired knowledge and an empirical analysis of the knowledge acquisition process. To the best of our knowledge, this is the first work to collect large-scale knowledge via a GWAP on a widely-used spoken dialogue system.

pdf bib
A Hierarchical Neural Network for Information Extraction of Product Attribute and Condition Sentences
Yukinori Homma | Kugatsu Sadamitsu | Kyosuke Nishida | Ryuichiro Higashinaka | Hisako Asano | Yoshihiro Matsuo

This paper describes a hierarchical neural network we propose for sentence classification to extract product information from product documents. The network classifies each sentence in a document into attribute and condition classes on the basis of word sequences and sentence sequences in the document. Experimental results showed the method using the proposed network significantly outperformed baseline methods by taking semantic representation of word and sentence sequential data into account. We also evaluated the network with two different product domains (insurance and tourism domains) and found that it was effective for both the domains.

pdf bib
Combining Lexical and Semantic-based Features for Answer Sentence Selection
Jing Shi | Jiaming Xu | Yiqun Yao | Suncong Zheng | Bo Xu

Question answering has always been an attractive and challenging task in the natural language processing area. There are some open-domain question answering systems, such as IBM Watson, which take unstructured text data as input and operate, in some ways, with a humanlike thinking process and a mode of artificial intelligence. At the conference on Natural Language Processing and Chinese Computing (NLPCC) 2016, the China Computer Federation hosted a shared task evaluation on Open Domain Question Answering. We achieved 2nd place in the document-based subtask. In this paper, we present our solution, which consists of feature engineering in lexical and semantic aspects and model training methods. As the result of the evaluation shows, our solution provides a valuable and brief model which could be used in modelling question answering or sentence semantic relevance. We hope our solution will contribute to this vast and significant task with some heuristic thinking.

pdf bib
An Entity-Based approach to Answering Recurrent and Non-Recurrent Questions with Past Answers
Anietie Andy | Mugizi Rwebangira | Satoshi Sekine

Community question answering (CQA) systems such as Yahoo! Answers allow registered users to ask and answer questions in various question categories. However, a significant percentage of asked questions in Yahoo! Answers are unanswered. In this paper, we propose to reduce this percentage by reusing answers to past resolved questions from the site. Specifically, we propose to satisfy unanswered questions in entity-rich categories by searching for and reusing the best answers to past resolved questions with shared needs. For unanswered questions that do not have a past resolved question with a shared need, we propose to use the best answer to a past resolved question with similar needs. Our experiments on a Yahoo! Answers dataset show that our approach retrieves most of the past resolved questions that have shared and similar needs to unanswered questions.

pdf bib
Answer Presentation in Question Answering over Linked Data using Typed Dependency Subtree Patterns
Rivindu Perera | Parma Nand

In an era where highly accurate Question Answering (QA) systems are being built using complex Natural Language Processing (NLP) and Information Retrieval (IR) algorithms, presenting the acquired answer to the user akin to a human answer is also crucial. In this paper we present an answer presentation strategy by embedding the answer in a sentence which is developed by incorporating the linguistic structure of the source question extracted through typed dependency parsing. The evaluation using human participants proved that the methodology is human-competitive and can result in linguistically correct sentences for more than 70% of the test dataset acquired from the QALD question dataset.

pdf bib
BioMedLAT Corpus: Annotation of the Lexical Answer Type for Biomedical Questions
Mariana Neves | Milena Kraus

Question answering (QA) systems need to provide exact answers for the questions that are posed to the system. However, this can only be achieved through a precise processing of the question. During this procedure, one important step is the detection of the expected type of answer that the system should provide by extracting the headword of the questions and identifying its semantic type. We have annotated the headword and assigned UMLS semantic types to 643 factoid/list questions from the BioASQ training data. We present statistics on the corpus and a preliminary evaluation in baseline experiments. We also discuss the challenges on both the manual annotation and the automatic detection of the headwords and the semantic types. We believe that this is a valuable resource for both training and evaluation of biomedical QA systems. The corpus is available at: https://github.com/mariananeves/BioMedLAT.

pdf bib
Double Topic Shifts in Open Domain Conversations: Natural Language Interface for a Wikipedia-based Robot Application
Kristiina Jokinen | Graham Wilcock

The paper describes topic shifting in dialogues with a robot that provides information from Wikipedia. The work focuses on a double topical construction of dialogue coherence which refers to discourse coherence on two levels: the evolution of dialogue topics via the interaction between the user and the robot system, and the creation of discourse topics via the content of the Wikipedia article itself. The user selects topics that are of interest to her, and the system builds a list of potential topics, anticipated to be the next topic, by the links in the article and by the keywords extracted from the article. The described system deals with Wikipedia articles, but could easily be adapted to other digital information providing systems.

pdf bib
Filling a Knowledge Graph with a Crowd
GyuHyeon Choi | Sangha Nam | Dongho Choi | Key-Sun Choi

pdf bib
Pairing Wikipedia Articles Across Languages
Marcus Klang | Pierre Nugues

Wikipedia has become a reference knowledge source for scores of NLP applications. One of its invaluable features lies in its multilingual nature, where articles on the same entity or concept can have from one to more than 200 different versions. The interlinking of language versions in Wikipedia has undergone a major renewal with the advent of Wikidata, a unified scheme to identify entities and their properties using unique numbers. However, as the interlinking is still manually carried out by thousands of editors across the globe, errors may creep into the assignment of entities. In this paper, we describe an optimization technique to automatically match language versions of articles, and hence entities, that is based only on bags of words and anchors. We created a dataset of all the articles on persons we extracted from Wikipedia in six languages: English, French, German, Russian, Spanish, and Swedish. We report a correct match of at least 94.3% on each pair.

pdf bib
SRDF: Extracting Lexical Knowledge Graph for Preserving Sentence Meaning
Sangha Nam | GyuHyeon Choi | Younggyun Hahm | Key-Sun Choi

In this paper, we present an open information extraction system called SRDF that generates lexical knowledge graphs from unstructured texts. In the semantic web, knowledge is expressed in the RDF triple form, but natural language text consists of multiple relations between arguments. For this reason, we combine open information extraction with reification for full text extraction to preserve the meaning of sentences in our knowledge graph. Our knowledge graph is also designed to be adaptable to many existing semantic web applications. At the end of this paper, we introduce the result of the experiment and a Korean template generation module developed using SRDF.

pdf bib
QAF: Frame Semantics-based Question Interpretation
Younggyun Hahm | Sangha Nam | Key-Sun Choi

Natural language questions are interpreted as a sequence of patterns to be matched with instances of patterns in a knowledge base (KB) for answering. A natural language (NL) question answering (QA) system utilizes meaningful patterns matching the syntactic/lexical features between the NL questions and the KB. In most KBs, there are only binary relations in triple form to represent the relation between two entities, or between an entity and a value, using a domain-specific ontology. However, the binary relation representation is not enough to cover complex information in questions, and the ontology vocabulary sometimes does not cover the lexical meaning in questions. Complex meaning needs a knowledge representation to link the binary relation-type triples in the KB. In this paper, we propose a frame semantics-based semantic parsing approach as KB-independent question pre-processing. We propose requirements for question interpretation from the KBQA perspective, and a query form representation based on our proposed format QAF (Question Answering with the Frame Semantics), which is intended to cover the requirements. In QAF, frame semantics serves as a model to represent complex information in questions and to disambiguate the lexical meaning in questions to match the ontology vocabulary. Our system takes a question as input and outputs a QAF query through a process which assigns semantic information in the question to its corresponding frame semantic structure using semantic parsing rules.

pdf bib
Answering Yes-No Questions by Penalty Scoring in History Subjects of University Entrance Examinations
Yoshinobu Kano

Answering yes–no questions is more difficult than simply retrieving ranked search results. To answer yes–no questions, especially when the correct answer is no, one must find an objectionable keyword that makes the question’s answer no. Existing systems, such as factoid-based ones, cannot answer yes–no questions very well because of insufficient handling of such objectionable keywords. We suggest an algorithm that answers yes–no questions by assigning an importance to objectionable keywords. Concretely speaking, we suggest a penalized scoring method that finds parts of documents that include such objectionable keywords and lowers their scores. We check a keyword distribution for each part of a document such as a paragraph, calculating the keyword density as a basic score. Then we use an objectionable keyword penalty when a keyword does not appear in a target part but appears in other parts of the document. Our algorithm is robust for open domain problems because it requires no training. Our F1 score was 4.45 points better than the best score of the NTCIR-10 RITE2 shared task, and we also obtained the best score in the 2014 mock university examination challenge of the Todai Robot project.
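
The two scoring steps named in this abstract (a keyword-density base score per document part, then a penalty for keywords that are absent from the target part but present elsewhere in the document) can be sketched as follows. This is an illustration under stated assumptions rather than the author's algorithm: whitespace tokenisation, the penalty weight, and the way the penalty is normalised by the number of keywords are all hypothetical choices.

```python
from typing import List


def score_parts(parts: List[str], keywords: List[str], penalty: float = 1.0) -> List[float]:
    """Score each part of a document against lowercase question keywords.

    Base score: keyword density (keyword occurrences / tokens in the part).
    Penalty: for every keyword missing from this part but found in another
    part of the same document, subtract penalty / len(keywords).
    """
    tokenised = [part.lower().split() for part in parts]
    scores = []
    for i, tokens in enumerate(tokenised):
        present = {k for k in keywords if k in tokens}
        density = sum(tokens.count(k) for k in keywords) / max(len(tokens), 1)
        # Keywords that occur somewhere else in the document.
        elsewhere = set()
        for j, other in enumerate(tokenised):
            if j != i:
                elsewhere.update(k for k in keywords if k in other)
        missing_here = elsewhere - present
        scores.append(density - penalty * len(missing_here) / max(len(keywords), 1))
    return scores


if __name__ == "__main__":
    doc = ["nobunaga built azuchi castle", "hideyoshi built osaka castle"]
    print(score_parts(doc, ["nobunaga", "osaka", "castle"]))
```

In this toy document the statement's keywords are split across two paragraphs, so both parts are penalised. A part that contained all three keywords would keep its full density score.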

pdf bib
Dedicated Workflow Management for OKBQA Framework
Jiseong Kim | GyuHyeon Choi | Key-Sun Choi

Nowadays, question answering (QA) systems are used in various areas such as quiz shows, personal assistants, home devices, and so on. The OKBQA framework supports developing a QA system in an intuitive and collaborative way. To support collaborative development, the framework should be equipped with functions such as flexible system configuration, debugging support, and an intuitive user interface, while accommodating different development groups from different domains. This paper presents the OKBQA controller, a dedicated workflow manager for the OKBQA framework, to boost collaborative development of a QA system.

up

pdf (full)
bib (full)
Proceedings of the Sixth Workshop on Hybrid Approaches to Translation (HyTra6)

pdf bib
Proceedings of the Sixth Workshop on Hybrid Approaches to Translation (HyTra6)
Patrik Lambert | Bogdan Babych | Kurt Eberle | Rafael E. Banchs | Reinhard Rapp | Marta R. Costa-jussà

pdf bib
Combining fast_align with Hierarchical Sub-sentential Alignment for Better Word Alignments
Hao Wang | Yves Lepage

fast_align is a simple and fast word alignment tool which is widely used in state-of-the-art machine translation systems. It yields comparable results in the end-to-end translation experiments of various language pairs. However, fast_align does not perform as well as GIZA++ when applied to language pairs with distinct word orders, like English and Japanese. In this paper, given the lexical translation table output by fast_align, we propose to realign words using the hierarchical sub-sentential alignment approach. Experimental results show that simple additional processing improves the performance of word alignment, which is measured by counting alignment matches in comparison with fast_align. We also report the result of final machine translation in both English-Japanese and Japanese-English. We show our best system provided significant improvements over the baseline as measured by BLEU and RIBES.

pdf bib
Neural Network Language Models for Candidate Scoring in Hybrid Multi-System Machine Translation
Matīss Rikters

This paper presents the comparison of how using different neural network based language modeling tools for selecting the best candidate fragments affects the final output translation quality in a hybrid multi-system machine translation setup. Experiments were conducted by comparing perplexity and BLEU scores on common test cases using the same training data set. A 12-gram statistical language model was selected as a baseline to oppose three neural network based models of different characteristics. The models were integrated in a hybrid system that depends on the perplexity score of a sentence fragment to produce the best fitting translations. The results show a correlation between language model perplexity and BLEU scores as well as overall improvements in BLEU.
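
Perplexity-based candidate selection of the kind this abstract describes can be illustrated with a small sketch. It swaps in an add-one-smoothed bigram model for the neural language models compared in the paper, purely to keep the example self-contained; the class `BigramLM` and the function `select_candidate` are hypothetical names, not part of the described system.

```python
import math
from collections import Counter

class BigramLM:
    """Add-one-smoothed bigram model; a simple stand-in for a neural LM."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for s in sentences:
            toks = ["<s>"] + s.split() + ["</s>"]
            self.unigrams.update(toks)
            self.bigrams.update(zip(toks, toks[1:]))
        self.vocab_size = len(self.unigrams)

    def perplexity(self, fragment):
        """Per-token perplexity of a whitespace-tokenized fragment."""
        toks = ["<s>"] + fragment.split() + ["</s>"]
        log_prob = 0.0
        for prev, cur in zip(toks, toks[1:]):
            p = (self.bigrams[(prev, cur)] + 1) / (self.unigrams[prev] + self.vocab_size)
            log_prob += math.log(p)
        return math.exp(-log_prob / (len(toks) - 1))

def select_candidate(lm, candidates):
    """Pick the candidate fragment with the lowest perplexity under the LM."""
    return min(candidates, key=lm.perplexity)

lm = BigramLM(["the cat sat on the mat", "the dog sat on the rug"])
candidates = ["the cat sat on the mat", "mat the on sat cat the"]
best = select_candidate(lm, candidates)
```

Any model that exposes a perplexity score could replace `BigramLM` here without changing the selection step.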

pdf bib
Image-Image Search for Comparable Corpora Construction
Yu Hong | Liang Yao | Mengyi Liu | Tongtao Zhang | Wenxuan Zhou | Jianmin Yao | Heng Ji

We present a novel method of comparable corpora construction. Unlike the traditional methods which heavily rely on linguistic features, our method only takes image similarity into consideration. We use an image-image search engine to obtain similar images, together with the captions in source language and target language. On this basis, we utilize captions of similar images to construct sentence-level bilingual corpora. Experiments on 10,371 target captions show that our method achieves a precision of 0.85 in the top search results.

pdf bib
Predicting Translation Equivalents in Linked WordNets
Krasimir Angelov | Gleb Lobanov

We present an algorithm for predicting translation equivalents between two languages, based on the corresponding WordNets. The assumption is that all synsets of one of the languages are linked to the corresponding synsets in the other language. In theory, given the exact sense of a word in a context it must be possible to translate it as any of the words in the linked synset. In practice, however, this does not work well since automatic and accurate sense disambiguation is difficult. Instead it is possible to define a more robust translation relation between the lexemes of the two languages. As far as we know the Finnish WordNet is the only one that includes that relation. Our algorithm can be used to predict the relation for other languages as well. This is useful for instance in hybrid machine translation systems which are usually more dependent on high-quality translation dictionaries.

pdf bib
Modifications of Machine Translation Evaluation Metrics by Using Word Embeddings
Haozhou Wang | Paola Merlo

Traditional machine translation evaluation metrics such as BLEU and WER have been widely used, but these metrics have poor correlations with human judgements because they badly represent word similarity and impose strict identity matching. In this paper, we propose some modifications to the traditional measures based on word embeddings for these two metrics. The evaluation results show that our modifications significantly improve their correlation with human judgements.
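
One way to relax strict identity matching with word embeddings, in the spirit of this abstract, is to let the substitution cost in WER shrink as two words become more similar. The abstract does not specify the exact modification, so the sketch below is an assumption throughout: `soft_wer`, the cost `1 - cosine`, and the toy two-dimensional vectors are illustrative only.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def soft_wer(hyp, ref, emb):
    """Word error rate whose substitution cost is 1 - cosine similarity
    (an assumed cost; out-of-vocabulary words fall back to cost 1)."""
    def sub_cost(a, b):
        if a == b:
            return 0.0
        if a in emb and b in emb:
            return 1.0 - max(0.0, cosine(emb[a], emb[b]))
        return 1.0

    h, r = hyp.split(), ref.split()
    # dist[i][j]: cheapest cost of editing the first i hypothesis words
    # into the first j reference words.
    dist = [[0.0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(1, len(h) + 1):
        dist[i][0] = float(i)
    for j in range(1, len(r) + 1):
        dist[0][j] = float(j)
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            dist[i][j] = min(
                dist[i - 1][j] + 1.0,  # drop a hypothesis word
                dist[i][j - 1] + 1.0,  # add a missing reference word
                dist[i - 1][j - 1] + sub_cost(h[i - 1], r[j - 1]),
            )
    return dist[len(h)][len(r)] / max(len(r), 1)

# Toy vectors, invented for the example.
emb = {"big": [1.0, 0.0], "large": [0.8, 0.6]}
near = soft_wer("a big house", "a large house", emb)
far = soft_wer("a xyz house", "a large house", emb)
```

With these toy vectors, substituting "big" for "large" costs far less than substituting an unknown word, whereas standard WER would charge both a full error.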

pdf bib
Verb sense disambiguation in Machine Translation
Roman Sudarikov | Ondřej Dušek | Martin Holub | Ondřej Bojar | Vincent Kríž

We describe experiments in Machine Translation using word sense disambiguation (WSD) information. This work focuses on WSD in verbs, based on two different approaches – verbal patterns based on corpus pattern analysis and verbal word senses from valency frames. We evaluate several options of using verb senses in the source-language sentences as an additional factor for the Moses statistical machine translation system. Our results show a statistically significant translation quality improvement in terms of the BLEU metric for the valency frames approach, but in manual evaluation, both WSD methods bring improvements.

pdf bib
Improving word alignment for low resource languages using English monolingual SRL
Meriem Beloucif | Markus Saers | Dekai Wu

We introduce a new statistical machine translation approach specifically geared to learning translation from low resource languages, that exploits monolingual English semantic parsing to bias inversion transduction grammar (ITG) induction. We show that in contrast to conventional statistical machine translation (SMT) training methods, which rely heavily on phrase memorization, our approach focuses on learning bilingual correlations that help translating low resource languages, by using the output language semantic structure to further narrow down ITG constraints. This approach is motivated by previous research which has shown that injecting a semantic frame based objective function while training SMT models improves the translation quality. We show that including a monolingual semantic objective function during the learning of the translation model leads towards a semantically driven alignment which is more efficient than simply tuning loglinear mixture weights against a semantic frame based evaluation metric in the final stage of statistical machine translation training. We test our approach with three different language pairs and demonstrate that our model biases the learning towards more semantically correct alignments. Both GIZA++ and ITG based techniques fail to capture meaningful bilingual constituents, which are required when trying to learn translation models for low resource languages. In contrast, our proposed model not only improves translation by injecting a monolingual objective function to learn bilingual correlations during early training of the translation model, but also helps to learn more meaningful correlations with a relatively small data set, leading to a better alignment compared to either conventional ITG or traditional GIZA++ based approaches.

pdf bib
Using Bilingual Segments in Generating Word-to-word Translations
Kavitha Mahesh | Gabriel Pereira Lopes | Luís Gomes

We argue that bilingual lexicons automatically extracted from parallel corpora, whose entries have since been validated by linguists and classified as correct or incorrect, should constitute a specific parallel corpus. In this paper, we propose to use word-to-word translations to learn morph-units (comprising bilingual stems and suffixes) from those bilingual lexicons for two language pairs L1-L2 and L1-L3 to induce a bilingual lexicon for the language pair L2-L3, apart from also learning morph-units for this other language pair. The applicability of bilingual morph-units in L1-L2 and L1-L3 is examined from the perspective of pivot-based lexicon induction for language pair L2-L3 with L1 as bridge. While the lexicon is derived by transitivity, the correspondences are identified based on previously learnt bilingual stems and suffixes rather than surface translation forms. The induced pairs are validated using a binary classifier trained on morphological and similarity-based features using an existing, automatically acquired, manually validated bilingual translation lexicon for language pair L2-L3. In this paper, we discuss the use of English (EN)-French (FR) and English (EN)-Portuguese (PT) lexicons of word-to-word translations in generating word-to-word translations for the language pair FR-PT with EN as pivot language. Generated translations are first filtered using an SVM-based FR-PT classifier and then manually validated.
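
The transitivity step behind pivot-based lexicon induction can be sketched in a few lines. This is deliberately the plain surface-form version: the approach in the abstract matches on learnt bilingual stems and suffixes instead and then filters with a classifier, neither of which is reproduced here. The function name `induce_by_pivot` and the toy lexicons are assumptions made for the example.

```python
from collections import defaultdict

def induce_by_pivot(l1_l2, l1_l3):
    """Induce L2-L3 pairs from L1-L2 and L1-L3 pairs that share an L1 word.
    Surface-form transitivity only; no morph-units and no classifier."""
    by_pivot = defaultdict(set)
    for pivot, target in l1_l3:
        by_pivot[pivot].add(target)
    induced = set()
    for pivot, source in l1_l2:
        for target in by_pivot.get(pivot, ()):
            induced.add((source, target))
    return induced

# Toy EN-FR and EN-PT lexicons, invented for illustration.
en_fr = [("house", "maison"), ("bank", "banque"), ("bank", "rive")]
en_pt = [("house", "casa"), ("bank", "banco")]
fr_pt = induce_by_pivot(en_fr, en_pt)
```

The ambiguous pivot "bank" yields the doubtful pair ("rive", "banco") alongside the correct ones, which is the kind of noise a validation step such as the classifier mentioned in the abstract is meant to remove.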

up

pdf (full)
bib (full)
Proceedings of the 3rd Workshop on Asian Translation (WAT2016)

pdf bib
Proceedings of the 3rd Workshop on Asian Translation (WAT2016)
Toshiaki Nakazawa | Hideya Mino | Chenchen Ding | Isao Goto | Graham Neubig | Sadao Kurohashi | Ir. Hammam Riza | Pushpak Bhattacharyya

pdf bib
Overview of the 3rd Workshop on Asian Translation
Toshiaki Nakazawa | Chenchen Ding | Hideya Mino | Isao Goto | Graham Neubig | Sadao Kurohashi

This paper presents the results of the shared tasks from the 3rd workshop on Asian translation (WAT2016) including J ↔ E, J ↔ C scientific paper translation subtasks, C ↔ J, K ↔ J, E ↔ J patent translation subtasks, I ↔ E newswire subtasks and H ↔ E, H ↔ J mixed domain subtasks. For the WAT2016, 15 institutions participated in the shared tasks. About 500 translation results have been submitted to the automatic evaluation server, and selected submissions were manually evaluated.

pdf bib
Translation of Patent Sentences with a Large Vocabulary of Technical Terms Using Neural Machine Translation
Zi Long | Takehito Utsuro | Tomoharu Mitsuhashi | Mikio Yamamoto

Neural machine translation (NMT), a new approach to machine translation, has achieved promising results comparable to those of traditional approaches such as statistical machine translation (SMT). Despite its recent success, NMT cannot handle a larger vocabulary because training complexity and decoding complexity proportionally increase with the number of target words. This problem becomes even more serious when translating patent documents, which contain many technical terms that are observed infrequently. In NMTs, words that are out of vocabulary are represented by a single unknown token. In this paper, we propose a method that enables NMT to translate patent sentences comprising a large vocabulary of technical terms. We train an NMT system on bilingual data wherein technical terms are replaced with technical term tokens; this allows it to translate most of the source sentences except technical terms. Further, we use it as a decoder to translate source sentences with technical term tokens and replace the tokens with technical term translations using SMT. We also use it to rerank the 1,000-best SMT translations on the basis of the average of the SMT score and that of the NMT rescoring of the translated sentences with technical term tokens. Our experiments on Japanese-Chinese patent sentences show that the proposed NMT system achieves a substantial improvement of up to 3.1 BLEU points and 2.3 RIBES points over traditional SMT systems and an improvement of approximately 0.6 BLEU points and 0.8 RIBES points over an equivalent NMT system without our proposed technique.

pdf bib
Japanese-English Machine Translation of Recipe Texts
Takayuki Sato | Jun Harashima | Mamoru Komachi

Concomitant with the globalization of food culture, demand for the recipes of specialty dishes has been increasing. The recent growth in recipe sharing websites and food blogs has resulted in numerous recipe texts being available for diverse foods in various languages. However, little work has been done on machine translation of recipe texts. In this paper, we address the task of translating recipes and investigate the advantages and disadvantages of traditional phrase-based statistical machine translation and more recent neural machine translation. Specifically, we translate Japanese recipes into English, analyze errors in the translated recipes, and discuss available room for improvements.

pdf bib
IIT Bombay’s English-Indonesian submission at WAT: Integrating Neural Language Models with SMT
Sandhya Singh | Anoop Kunchukuttan | Pushpak Bhattacharyya

This paper describes IIT Bombay’s submission to the WAT 2016 shared task for the English–Indonesian language pair. The results reported here are for both directions of the language pair. Among the various approaches we experimented with, the Operation Sequence Model (OSM) and the Neural Language Model were submitted for WAT. The OSM approach integrates the translation and reordering processes, resulting in relatively improved translation. Similarly, the neural experiment integrates a Neural Language Model with Statistical Machine Translation (SMT) as a feature for translation. The Neural Probabilistic Language Model (NPLM) gave relatively high BLEU points for the Indonesian-to-English translation system, while the Neural Network Joint Model (NNJM) performed better for the English-to-Indonesian direction. The results indicate an improvement over the baseline phrase-based SMT of 0.61 BLEU points for the English-Indonesian system and 0.55 BLEU points for the Indonesian-English translation system.

pdf bib
Domain Adaptation and Attention-Based Unknown Word Replacement in Chinese-to-Japanese Neural Machine Translation
Kazuma Hashimoto | Akiko Eriguchi | Yoshimasa Tsuruoka

This paper describes our UT-KAY system that participated in the Workshop on Asian Translation 2016. Based on an Attention-based Neural Machine Translation (ANMT) model, we build our system by incorporating a domain adaptation method for multiple domains and an attention-based unknown word replacement method. In experiments, we verify that the attention-based unknown word replacement method is effective in improving translation scores in Chinese-to-Japanese machine translation. We further show results of manual analysis on the replaced unknown words.

pdf bib
Global Pre-ordering for Improving Sublanguage Translation
Masaru Fuji | Masao Utiyama | Eiichiro Sumita | Yuji Matsumoto

When translating formal documents, capturing the sentence structure specific to the sublanguage is extremely necessary to obtain high-quality translations. This paper proposes a novel global reordering method with particular focus on long-distance reordering for capturing the global sentence structure of a sublanguage. The proposed method learns global reordering models from a non-annotated parallel corpus and works in conjunction with conventional syntactic reordering. Experimental results on the patent abstract sublanguage show substantial gains of more than 25 points in the RIBES metric and comparable BLEU scores both for Japanese-to-English and English-to-Japanese translations.

pdf bib
Neural Reordering Model Considering Phrase Translation and Word Alignment for Phrase-based Translation
Shin Kanouchi | Katsuhito Sudoh | Mamoru Komachi

This paper presents an improved lexicalized reordering model for phrase-based statistical machine translation using a deep neural network. Lexicalized reordering suffers from reordering ambiguity, data sparseness and noise in the phrase table. The previous neural reordering model successfully addresses the first and second problems but fails to address the third one. Therefore, we propose new features using phrase translation and word alignment to construct phrase vectors to handle inherently noisy phrase translation pairs. The experimental results show that our proposed method improves the accuracy of phrase reordering. We confirm that the proposed method works well with phrase pairs including NULL alignments.

pdf bib
System Description of bjtu_nlp Neural Machine Translation System
Shaotong Li | JinAn Xu | Yufeng Chen | Yujie Zhang

This paper presents our machine translation system developed for the WAT2016 evaluation tasks of ja-en, ja-zh, en-ja, zh-ja, JPCja-en, JPCja-zh, JPCen-ja, JPCzh-ja. We build our system on the encoder–decoder framework by integrating a recurrent neural network (RNN) and gated recurrent units (GRU), and we also adopt an attention mechanism to solve the problem of information loss. Additionally, we propose a simple translation-specific approach to resolve the unknown word translation problem. Experimental results show that our system performs better than the baseline statistical machine translation (SMT) systems in each task. Moreover, they show that our proposed approach to unknown word translation effectively improves translation results.

pdf bib
Translation systems and experimental results of the EHR group for WAT2016 tasks
Terumasa Ehara

System architecture, experimental settings and experimental results of the EHR group for the WAT2016 tasks are described. We participate in six tasks: en-ja, zh-ja, JPCzh-ja, JPCko-ja, HINDENen-hi and HINDENhi-ja. Although the basic architecture of our systems is PBSMT with reordering, several additional techniques are applied. In particular, the system for the HINDENhi-ja task with pivoting by English uses the reordering technique. Because Hindi and Japanese are both OV type languages and English is a VO type language, we can apply the reordering technique to the pivot language. The reordering technique improves the BLEU score from 7.47 to 7.66 for the sentence level pivoting of this task.

pdf bib
Lexicons and Minimum Risk Training for Neural Machine Translation: NAIST-CMU at WAT2016
Graham Neubig