Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages

Abbreviated Title: 
SPMRL-SANCL 2014
Call for Papers
Submission Deadline: 
2 May 2014
Event Dates: 
23 Aug 2014 - 29 Aug 2014
Location: 
Co-Located with Coling 2014
City: 
Dublin
Country: 
Ireland

Special track on the Syntactic Analysis of Non-Canonical Language
=================================================================

ENDORSED BY SIGPARSE

The SANCL special track will be part of the Joint Workshop on
Statistical Parsing of Morphologically Rich Languages and Syntactic
Analysis of Non-Canonical Languages - SPMRL-SANCL 2014

Co-located with COLING 2014, August 23 - 29 in Dublin, Ireland

Submission Deadline: May 02, 2014

Main workshop: http://www.spmrl.org/spmrl-sancl2014.html

SANCL Special Track: http://www.spmrl.org/sancl-posters2014.html

SANCL Poster submissions
========================
In addition to regular paper submissions, we solicit poster submissions
addressing the syntactic analysis of frequent phenomena of non-canonical
languages which are difficult to annotate and parse using conventional
annotation schemes. Cases in point are the representation of verbless
utterances in a dependency scheme, the pros and cons of different
representations of disfluencies for statistical parsing, and the analysis
of complex hashtags which incorporate and merge different syntactic
arguments into one token.

Poster submissions should focus on one or more of the topics listed
below. They should either be submitted as a short paper (up to 7
single-column pages + references, to be included in the proceedings
and presented as a poster at the workshop) or be submitted as an
abstract (max. 500 words excluding examples/references, to be presented
as a poster at the workshop). Abstract submissions should sketch an
analysis for a given problem while short paper submissions should also
present at least preliminary experimental results showing the
feasibility of the approach.

Topics for poster submissions:

Unit of analysis
================
For canonical, written text, the relevant unit for syntactic analysis
is defined by sentence boundaries. In CMC (computer-mediated
communication), on the other hand, sentence boundaries are not always
marked in a systematic way, and for spoken language we cannot resort
to sentence boundaries at all. Decisions concerning the relevant unit of
analysis will influence corpus-linguistic research (e.g. measures like
sentence length, syntactic complexity) as well as parsing results. On
the token level, it is also not clear what should be used as the unit of
analysis. In spoken language as well as in conceptually spoken registers
like CMC, multiple tokens are often merged into one new token, as in (1),
or long compound words are split into separate units. It is not yet clear
whether it is preferable to address these issues during preprocessing,
e.g. by tokenizing and normalising the text, or whether this would result
in a "lossy translation", as argued by Owoputi et al. 2013, which should
be avoided.

(1) @Hii_ImFruiity nuin much at all juss chillin waddup w yu ?
-- Owoputi et al. 2013: OCT27 data set

We ask for contributions on the optimal unit of analysis for non-canonical
languages that are not already segmented into sentence-like units
(e.g. spoken language, tweets, historical data), and for contributions
on best practices for tokenizing spoken language and CMC.

Elliptical structures and missing elements
==========================================
Non-canonical languages often include sentences where syntactic arguments
are not expressed at the surface level. This raises the question of how
we can provide a meaningful analysis for these structures, especially
in a dependency grammar framework. One way to deal with this problem is
to insert missing predicates as dummy verbs into the tree, which makes
a dependency analysis of these structures possible (e.g. Seeker &
Kuhn 2012; Dipper, Lüdeling & Reznicek 2013; see the NoSta-D annotation
guidelines). The question remains whether this approach is feasible
for automatic processing, especially for the highly underspecified and
ambiguous input often provided by NCLs, or whether a constituency-based
analysis offers more elegant means to analyse elliptical structures.

We ask for contributions discussing the optimal representation for
elliptical structures.

(2) Doesn't change the result though. -- From DCU's Football Treebank

Hashtags & friends
==================
Newly emerging text types from social media have triggered new,
creative means of communication which help users overcome the
limitations of expressing themselves in a written medium. Twitter hashtags
are one case in point: they not only allow users to add a semantic tag
to their tweet, but also to comment, to supply contextual information,
to express irony, sarcasm, or personal feelings, or to evaluate. Formally,
they are not bound to one particular part of speech but can include
whole phrases or sentences, which implies that the common practice of
tagging them with the label HASHTAG does not do them justice. This is
even more so the case for hashtags encoding one or more arguments of the
predicate, as in (3). Hashtags provide a rich source of information
which has already been exploited in sentiment analysis and opinion
mining (e.g. Mohammad et al. 2013, Kunneman et al 2013; also see
http://www.newyorker.com/online/blogs/susanorlean/2010/06/hash.html for
an overview of the different functions of hashtags). We are interested in
approaches towards a syntactic analysis of hashtags (and related phenomena
such as complex inflective constructions in German CMC (Schlobinski
2001)) which allow us to make better use of the information encoded in
hashtags. What are the new challenges in analysing these phenomena? What
can be learned from research on similar phenomena, e.g. on multiword
expressions (MWEs)?

(3) #itsnothebeer I don't like but the taste -- From Twitter

Disfluencies
============
Disfluencies (e.g. fillers, repairs) are a common phenomenon in spoken
language and also occur in written, but conceptually spoken language
such as CMC.

(4) He uh graduated from medical school this year and uh, I mean he's
in uh, ... Soho in New York.
-- SBC046, Du Bois et al. 2000: Santa Barbara corpus of spoken
American English

There are different ways of representing disfluencies. In the Switchboard
corpus, fillers are included in the tree, and for repairs, both the
repair and the reparandum are attached to the same node. In the German
Verbmobil treebank, fillers have been removed, and so-called speech
errors and repetitions are not integrated into the tree but are instead
attached to the root node. The different representations are expected
to have an impact on statistical parsing as well as on the usefulness
of the resources for linguistic research.

We ask for contributions discussing the best way of representing
disfluencies in the syntax tree.

Code mixing
===========
In informal spoken language as well as in CMC, a considerable amount
of the data involves code mixing. This poses a major challenge for
automatic processing, all the more so as there is no agreed-upon
theoretical distinction between loanwords and foreign words. Should we
annotate foreign language material using the same annotation scheme as for
the target language, especially in cases where the grammatical differences
between the languages involved do not easily allow us to do so?

(5) es tut mir so leid vallah ich wollte kommen ama unuttum
it does me so harm my God I wanted come but forget-pst-1-sg
"I am so sorry, really, I wanted to come but I forgot"
-- From Twitter

We ask for contributions discussing best practices for the syntactic
analysis of code mixing.

For more examples and information, please visit:
http://www.spmrl.org/sancl-posters2014.html

SANCL Special Track Organizers

Ozlem Cetinoglu (IMS, Germany)
Ines Rehbein (Potsdam University, Germany)
Djamé Seddah (Université Paris Sorbonne & Inria's Alpage project)
Joel Tetreault (Yahoo! Labs, US)