2017 Shared Task on Native Language Identification

Event Notification Type: Call for Participation
Date: Friday, 8 September 2017
City: Copenhagen
Country: Denmark
Contact: 2017 NLI Shared Task Organizers
Submission Deadline: Friday, 23 June 2017

(apologies for cross posting)

* Website: https://sites.google.com/site/nlisharedtask/home

DESCRIPTION

We are excited to organize a new shared task on Native Language Identification (NLI), which will take place at the BEA12 Workshop, co-located with EMNLP in Copenhagen, on September 8, 2017.

NLI is the task of identifying the native language (L1) of a writer based solely on a sample of their writing or speech. The task is typically framed as a classification problem where the set of L1s is known a priori. Most work has focused on identifying the native language of writers learning English as a second language. Two previous shared tasks on NLI have been organized, in which the task was to identify the native language of non-native speakers of English based on essays and spoken responses they provided during a standardized assessment of academic English proficiency. The first shared task was based on the essays only and was also held with the BEA workshop in 2013. It was very successful, with 29 teams competing, making it one of the largest shared tasks that year. Three years later, the Computational Paralinguistics Challenge at Interspeech 2016 hosted a sub-challenge on identifying the native language based solely on the spoken responses.
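To make the closed-set classification framing concrete, here is a minimal sketch in Python. It is not a baseline from the organizers: the character-trigram naive Bayes model, the class name NgramNaiveBayes, the labels L1_A and L1_B, and the training strings are all hypothetical placeholders chosen only to show that the label set is fixed before any text is seen.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NgramNaiveBayes:
    """Toy multinomial naive Bayes over character n-grams (illustration only)."""

    def __init__(self, labels, n=3):
        self.labels = list(labels)  # the set of L1s is known a priori
        self.n = n
        self.counts = {label: Counter() for label in self.labels}
        self.docs = Counter()
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.counts[label].update(grams)
            self.docs[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        total_docs = sum(self.docs.values())
        vocab_size = max(len(self.vocab), 1)

        def score(label):
            # Add-one smoothing on both the class prior and the n-gram likelihood.
            prior = math.log((self.docs[label] + 1) / (total_docs + len(self.labels)))
            denom = sum(self.counts[label].values()) + vocab_size
            return prior + sum(
                math.log((self.counts[label][g] + 1) / denom) for g in grams
            )

        return max(self.labels, key=score)


# Placeholder strings, NOT task data: they only exercise the code path.
clf = NgramNaiveBayes(labels=["L1_A", "L1_B"])
clf.fit(["aaa bbb aaa bbb", "xxx yyy xxx yyy"], ["L1_A", "L1_B"])
pred = clf.predict("aaa bbb")
print(pred)
```

Because the candidate labels are passed in at construction time, the model can only ever answer with one of the known L1s, which is the sense in which the task is a classification problem rather than open-ended language guessing.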

This year's shared task combines the inputs from the two previous tasks. There will be three tracks: NLI on the essay only, NLI on the spoken response only (based on a transcription of the response, not the audio), and NLI using both responses from a test taker. This distinction will make for a more challenging shared task while building on the methods and results from the previous two shared tasks. We promise this shared task will be fun for you and your colleagues, as well as your whole family.

DATA

Educational Testing Service (ETS) is releasing 13,200 English essays and orthographic transcriptions of 13,200 spoken responses from the TOEFL iBT® assessment for the 2017 NLI Shared Task, with the goal of helping researchers advance the state of the art in the field of NLI. In addition to the orthographic transcriptions of the spoken responses, i-vectors generated from the audio files will be released as a baseline comparison for the speech-based NLI task (although the audio files themselves are not included in this data set). The data set contains test responses from 13,200 test takers (one essay and one spoken response transcription per test taker) and includes 11 native languages (L1s) with 1,200 test takers per L1. The 11 native languages covered by the corpus are: Arabic, Chinese, French, German, Hindi, Italian, Japanese, Korean, Spanish, Telugu, and Turkish. The essays typically range in length from approximately 300 to 400 words, and the transcribed spoken responses typically contain approximately 100 words. Responses from 11,000 test takers in this set will be used as training data for the NLI Shared Task, 1,100 for development, and the remaining 1,100 will be released later as test data.
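The figures above can be cross-checked with a few lines of Python. The totals and the list of L1s come from the description; the per-L1 breakdown of each partition assumes that every partition is balanced across the 11 L1s, which the announcement does not state explicitly.

```python
L1S = [
    "Arabic", "Chinese", "French", "German", "Hindi", "Italian",
    "Japanese", "Korean", "Spanish", "Telugu", "Turkish",
]
TEST_TAKERS_PER_L1 = 1_200
SPLITS = {"train": 11_000, "dev": 1_100, "test": 1_100}

# 11 L1s x 1,200 test takers each should match the stated corpus size.
total_test_takers = len(L1S) * TEST_TAKERS_PER_L1
assert total_test_takers == 13_200

# The three partitions should account for every test taker exactly once.
assert sum(SPLITS.values()) == total_test_takers

# ASSUMPTION: each partition is balanced across L1s (not stated in the call).
per_l1 = {name: size // len(L1S) for name, size in SPLITS.items()}
for name, size in SPLITS.items():
    print(f"{name}: {size} test takers, {per_l1[name]} per L1 if balanced")
```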

EVALUATION

The shared task will be composed of three sub-tasks:

Main Task: 11-way classification using all available data sources (this is the first and main task)
Text Task: 11-way classification using solely the essays
Speech Task: 11-way classification using solely the transcripts and/or i-vectors
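One way to keep a system honest about which inputs each sub-task may use is to encode the three sub-tasks as a small lookup table. The sketch below is only an illustration: the field names essay, transcript, ivector, and l1, the track keys, and the helper inputs_for_track are hypothetical and do not reflect the format of the released data.

```python
# Hypothetical mapping from sub-task to the inputs it is allowed to use.
TRACK_INPUTS = {
    "main": ("essay", "transcript", "ivector"),  # all available data sources
    "text": ("essay",),                          # essays only
    "speech": ("transcript", "ivector"),         # transcripts and/or i-vectors
}


def inputs_for_track(record, track):
    """Return only the fields of `record` that the given sub-task may use."""
    allowed = TRACK_INPUTS[track]
    return {name: record[name] for name in allowed if name in record}


# Placeholder record; the values are made up and the layout is an assumption.
record = {
    "essay": "placeholder essay text",
    "transcript": "placeholder transcript text",
    "ivector": [0.0, 0.0, 0.0],
    "l1": "placeholder label",
}

for track in TRACK_INPUTS:
    print(track, sorted(inputs_for_track(record, track)))
```

Filtering the inputs in one place makes it harder to leak essay features into a speech-only submission by accident, and the label field is never passed through as a feature.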

REGISTRATION

Please register for the shared task via the following link:

https://docs.google.com/forms/d/e/1FAIpQLSdPjJLJxDJ8h1pUKI7yCDEUkW7saBne...

Next, in order to obtain the training and test data for the task, all participants must sign and return the data usage agreement form found here:

https://sites.google.com/site/nlisharedtask/data

IMPORTANT DATES

Mar 27 - Training Data Release (Phase 1: Text)
Mid April - Training Data Release (Phase 2: Speech Transcripts and i-vectors)
Jun 19 - Test Data Release
Jun 26 - Results Notification
Jul 05 - Draft System Description Papers Due
Jul 14 - Camera Ready Papers Due
Sep 08 - BEA12 Workshop

ORGANIZERS

Aoife Cahill (Educational Testing Service)
Keelan Evanini (Educational Testing Service)
Shervin Malmasi (Harvard Medical School)
Joel Tetreault (Grammarly)

Contact email: nli.sharedtask@gmail.com