Building A Case-based Semantic English-Chinese Parallel Treebank

Huaxing Shi, Tiejun Zhao, Keh-Yih Su


Abstract
We construct a case-based English-to-Chinese semantic constituent parallel Treebank for a Statistical Machine Translation (SMT) task by labelling each node of the Deep Syntactic Tree (DST) with our refined semantic cases. Because subtree span-crossing is harmful in tree-based SMT, we adopt the DST to alleviate this problem. We also tailor an existing case set to represent bilingual shallow semantic relations more precisely. This Treebank is part of a semantic corpus building project, which aims to build a bilingual semantic corpus annotated with syntactic structure, semantic cases, and word senses. The data in our Treebank comes from the news domain of the Datum corpus; 4,000 sentence pairs were selected to cover as much lexical variety and as many part-of-speech (POS) n-gram patterns as possible. This paper presents the construction of this case Treebank and tests how well the DST structure alleviates subtree span-crossing. Our preliminary analysis shows that the compatibility between Chinese and English trees can be significantly increased by transforming the parse tree into the DST. Furthermore, the human agreement rate in annotation is found to be acceptable (90% on English nodes, 75% on Chinese nodes).
Anthology ID:
L16-1466
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
2918–2924
URL:
https://aclanthology.org/L16-1466
Cite (ACL):
Huaxing Shi, Tiejun Zhao, and Keh-Yih Su. 2016. Building A Case-based Semantic English-Chinese Parallel Treebank. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 2918–2924, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
Building A Case-based Semantic English-Chinese Parallel Treebank (Shi et al., LREC 2016)
PDF:
https://aclanthology.org/L16-1466.pdf
Data
FrameNet