Universal Dependencies, Release v2.4

We are very happy to announce the tenth release of annotated treebanks
in Universal Dependencies, v2.4, available at
http://universaldependencies.org/.

Universal Dependencies is a project that seeks to develop
cross-linguistically consistent treebank annotation for many languages
with the goal of facilitating multilingual parser development,
cross-lingual learning, and parsing research from a language typology
perspective (Nivre et al., 2016). The annotation scheme is based on
(universal) Stanford dependencies (de Marneffe et al., 2006, 2008,
2014), Google universal part-of-speech tags (Petrov et al., 2012), and
the Interset interlingua for morphosyntactic tagsets (Zeman, 2008). The
general philosophy is to provide a universal inventory of categories and
guidelines to facilitate consistent annotation of similar constructions
across languages, while allowing language-specific extensions when
necessary.

The 146 treebanks in v2.4 are annotated according to version 2 of the UD
guidelines and represent the following 83 languages: Afrikaans,
Akkadian, Amharic, Ancient Greek, Arabic, Armenian, Assyrian, Bambara,
Basque, Belarusian, Breton, Bulgarian, Buryat, Cantonese, Catalan,
Chinese, Classical Chinese, Coptic, Croatian, Czech, Danish, Dutch,
English, Erzya, Estonian, Faroese, Finnish, French, Galician, German,
Gothic, Greek, Hebrew, Hindi, Hindi English, Hungarian, Indonesian,
Irish, Italian, Japanese, Karelian, Kazakh, Komi Zyrian, Korean,
Kurmanji, Latin, Latvian, Lithuanian, Maltese, Marathi, Mbya Guarani,
Naija, North Sami, Norwegian, Old Church Slavonic, Old French, Old
Russian, Persian, Polish, Portuguese, Romanian, Russian, Sanskrit,
Serbian, Slovak, Slovenian, Spanish, Swedish, Swedish Sign Language,
Tagalog, Tamil, Telugu, Thai, Turkish, Ukrainian, Upper Sorbian, Urdu,
Uyghur, Vietnamese, Warlpiri, Welsh, Wolof and Yoruba. The 83 languages
belong to 20 families: Afro-Asiatic, Austro-Asiatic, Austronesian,
Basque, Code switching, Creole, Dravidian, Indo-European, Japanese,
Korean, Mande, Mongolic, Niger-Congo, Pama-Nyungan, Sign Language,
Sino-Tibetan, Tai-Kadai, Tupian, Turkic and Uralic. Depending on the
language, the treebanks range in size from fewer than 1,000 tokens to
almost 3 million tokens. We expect the next release to be available in
November 2019.

The size of the following 31 treebanks changed significantly since the
last release:
Armenian ArmTDP         :  22788 →   36549
Assyrian AS             :      0 →     453
Belarusian HSE          :   8106 →   13325
Cantonese HK            :   6264 →   13918
Chinese HK              :   8701 →    9874
Classical Chinese Kyoto :      0 →   55026
Coptic Scriptorium      :  22057 →   25756
English GUM             :  80176 →   97697
Estonian EWT            :      0 →   27286
French FQB              :      0 →   24135
German HDT              :      0 → 3055010
German LIT              :      0 →   40456
Italian VIT             :      0 →  279839
Karelian KKPP           :      0 →    3094
Komi Zyrian IKDP        :   1058 →    1287
Latvian LVTB            : 152706 →  208965
Lithuanian ALKSNIS      :      0 →   37396
Mbya Guarani Dooley     :      0 →   11771
Mbya Guarani Thomas     :      0 →    1318
Norwegian NynorskLIA    :  13608 →   55410
Old Russian RNC         :      0 →   14472
Old Russian TOROT       :      0 →  149780
Polish PDB              :  83571 →  351406 (name in previous releases: Polish SZ)
Polish PUD              :      0 →   18389
Romanian Nonstandard    : 195055 →  241714
Russian Taiga           :  20766 →   38555
Serbian SET             :  86754 →   97673
Turkish GB              :      0 →   16879
Welsh CCG               :      0 →   10662
Wolof WTB               :      0 →   44258
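
For readers who want to work with these figures, the following is a minimal Python sketch, not part of the release itself. It assumes each row has the form "name : old → new" as in the list above, and it treats a previous size of 0 as marking a treebank that is new in this release. The helper names (parse_rows, summarize) and the five sample rows are chosen for illustration only; the remaining rows can be pasted in the same way.

```python
import re

# A few rows copied from the list above; the other rows follow the same pattern.
ROWS = """\
Armenian ArmTDP : 22788 → 36549
Assyrian AS : 0 → 453
Classical Chinese Kyoto: 0 → 55026
German HDT : 0 → 3055010
Latvian LVTB : 152706 → 208965
"""

# Hypothetical row pattern: treebank name, colon, old size, arrow, new size.
ROW_RE = re.compile(r"^(?P<name>.+?)\s*:\s*(?P<old>\d+)\s*→\s*(?P<new>\d+)")


def parse_rows(text):
    """Return (name, old_size, new_size) tuples for each well-formed row."""
    parsed = []
    for line in text.splitlines():
        match = ROW_RE.match(line.strip())
        if match:
            parsed.append((match["name"], int(match["old"]), int(match["new"])))
    return parsed


def summarize(rows):
    """Split rows into newly added treebanks and growth of existing ones."""
    # Assumption: an old size of 0 means the treebank did not exist before.
    added = [name for name, old, _ in rows if old == 0]
    growth = {name: new - old for name, old, new in rows if old > 0}
    return added, growth


rows = parse_rows(ROWS)
new_treebanks, growth = summarize(rows)
print("new in this release:", new_treebanks)
print("growth of existing treebanks:", growth)
```

The regular expression tolerates a missing space before the colon and any amount of padding around the numbers, so it reads the rows whether or not the columns are aligned.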

Joakim Nivre, Mitchell Abrams, Željko Agić, Lars Ahrenberg, Gabrielė
Aleksandravičiūtė, Lene Antonsen, Katya Aplonova, Maria Jesus Aranzabe,
Gashaw Arutie, Masayuki Asahara, Luma Ateyah, Mohammed Attia, Aitziber
Atutxa, Liesbeth Augustinus, Elena Badmaeva, Miguel Ballesteros, Esha
Banerjee, Sebastian Bank, Verginica Barbu Mititelu, Victoria Basmov,
John Bauer, Sandra Bellato, Kepa Bengoetxea, Yevgeni Berzak, Irshad
Ahmad Bhat, Riyaz Ahmad Bhat, Erica Biagetti, Eckhard Bick, Agnė
Bielinskienė, Rogier Blokland, Victoria Bobicev, Loïc Boizou, Emanuel
Borges Völker, Carl Börstell, Cristina Bosco, Gosse Bouma, Sam Bowman,
Adriane Boyd, Kristina Brokaitė, Aljoscha Burchardt, Marie Candito,
Bernard Caron, Gauthier Caron, Gülşen Cebiroğlu Eryiğit, Flavio
Massimiliano Cecchini, Giuseppe G. A. Celano, Slavomír Čéplö, Savas
Cetin, Fabricio Chalub, Jinho Choi, Yongseok Cho, Jayeol Chun, Silvie
Cinková, Aurélie Collomb, Çağrı Çöltekin, Miriam Connor, Marine Courtin,
Elizabeth Davidson, Marie-Catherine de Marneffe, Valeria de Paiva,
Arantza Diaz de Ilarraza, Carly Dickerson, Bamba Dione, Peter Dirix,
Kaja Dobrovoljc, Timothy Dozat, Kira Droganova, Puneet Dwivedi, Hanne
Eckhoff, Marhaba Eli, Ali Elkahky, Binyam Ephrem, Tomaž Erjavec, Aline
Etienne, Richárd Farkas, Hector Fernandez Alcalde, Jennifer Foster,
Cláudia Freitas, Kazunori Fujita, Katarína Gajdošová, Daniel Galbraith,
Marcos Garcia, Moa Gärdenfors, Sebastian Garza, Kim Gerdes, Filip
Ginter, Iakes Goenaga, Koldo Gojenola, Memduh Gökırmak, Yoav Goldberg,
Xavier Gómez Guinovart, Berta González Saavedra, Matias Grioni, Normunds
Grūzītis, Bruno Guillaume, Céline Guillot-Barbance, Nizar Habash, Jan
Hajič, Jan Hajič jr., Linh Hà Mỹ, Na-Rae Han, Kim Harris, Dag Haug,
Johannes Heinecke, Felix Hennig, Barbora Hladká, Jaroslava Hlaváčová,
Florinel Hociung, Petter Hohle, Jena Hwang, Takumi Ikeda, Radu Ion,
Elena Irimia, Ọlájídé Ishola, Tomáš Jelínek, Anders Johannsen, Fredrik
Jørgensen, Hüner Kaşıkara, Andre Kaasen, Sylvain Kahane, Hiroshi
Kanayama, Jenna Kanerva, Boris Katz, Tolga Kayadelen, Jessica Kenney,
Václava Kettnerová, Jesse Kirchner, Arne Köhn, Kamil Kopacewicz, Natalia
Kotsyba, Jolanta Kovalevskaitė, Simon Krek, Sookyoung Kwak, Veronika
Laippala, Lorenzo Lambertino, Lucia Lam, Tatiana Lando, Septina Dian
Larasati, Alexei Lavrentiev, John Lee, Phương Lê Hồng, Alessandro Lenci,
Saran Lertpradit, Herman Leung, Cheuk Ying Li, Josie Li, Keying Li,
KyungTae Lim, Yuan Li, Nikola Ljubešić, Olga Loginova, Olga
Lyashevskaya, Teresa Lynn, Vivien Macketanz, Aibek Makazhanov, Michael
Mandl, Christopher Manning, Ruli Manurung, Cătălina Mărănduc, David
Mareček, Katrin Marheinecke, Héctor Martínez Alonso, André Martins, Jan
Mašek, Yuji Matsumoto, Ryan McDonald, Gustavo Mendonça, Niko Miekka,
Margarita Misirpashayeva, Anna Missilä, Cătălin Mititelu, Yusuke Miyao,
Simonetta Montemagni, Amir More, Laura Moreno Romero, Keiko Sophie Mori,
Tomohiko Morioka, Shinsuke Mori, Shigeki Moro, Bjartur Mortensen, Bohdan
Moskalevskyi, Kadri Muischnek, Yugo Murawaki, Kaili Müürisep, Pinkey
Nainwani, Juan Ignacio Navarro Horñiacek, Anna Nedoluzhko, Gunta
Nešpore-Bērzkalne, Lương Nguyễn Thị, Huyền Nguyễn Thị Minh, Yoshihiro
Nikaido, Vitaly Nikolaev, Rattima Nitisaroj, Hanna Nurmi, Stina Ojala,
Adédayọ̀ Olúòkun, Mai Omura, Petya Osenova, Robert Östling, Lilja
Øvrelid, Niko Partanen, Elena Pascual, Marco Passarotti, Agnieszka
Patejuk, Guilherme Paulino-Passos, Angelika Peljak-Łapińska, Siyao Peng,
Cenel-Augusto Perez, Guy Perrier, Daria Petrova, Slav Petrov, Jussi
Piitulainen, Tommi A Pirinen, Emily Pitler, Barbara Plank, Thierry
Poibeau, Martin Popel, Lauma Pretkalniņa, Sophie Prévost, Prokopis
Prokopidis, Adam Przepiórkowski, Tiina Puolakainen, Sampo Pyysalo,
Andriela Rääbis, Alexandre Rademaker, Loganathan Ramasamy, Taraka Rama,
Carlos Ramisch, Vinit Ravishankar, Livy Real, Siva Reddy, Georg Rehm,
Michael Rießler, Erika Rimkutė, Larissa Rinaldi, Laura Rituma, Luisa
Rocha, Mykhailo Romanenko, Rudolf Rosa, Davide Rovati, Valentin Roșca,
Olga Rudina, Jack Rueter, Shoval Sadde, Benoît Sagot, Shadi Saleh,
Alessio Salomoni, Tanja Samardžić, Stephanie Samson, Manuela
Sanguinetti, Abigail Walsh, Sarah McGuinness, Dage Särg, Baiba Saulīte,
Yanin Sawanakunanon, Nathan Schneider, Sebastian Schuster, Djamé Seddah,
Wolfgang Seeker, Mojgan Seraji, Mo Shen, Atsuko Shimada, Hiroyuki
Shirasu, Muh Shohibussirri, Dmitry Sichinava, Natalia Silveira, Maria
Simi, Radu Simionescu, Katalin Simkó, Mária Šimková, Kiril Simov, Aaron
Smith, Isabela Soares-Bastos, Carolyn Spadine, Antonio Stella, Milan
Straka, Jana Strnadová, Alane Suhr, Umut Sulubacak, Shingo Suzuki, Zsolt
Szántó, Dima Taji, Yuta Takahashi, Fabio Tamburini, Takaaki Tanaka,
Isabelle Tellier, Guillaume Thomas, Liisi Torga, Trond Trosterud, Anna
Trukhina, Reut Tsarfaty, Francis Tyers, Sumire Uematsu, Zdeňka Urešová,
Larraitz Uria, Hans Uszkoreit, Sowmya Vajjala, Daniel van Niekerk,
Gertjan van Noord, Viktor Varga, Eric Villemonte de la Clergerie,
Veronika Vincze, Lars Wallin, Jing Xian Wang, Jonathan North Washington,
Maximilan Wendt, Seyi Williams, Mats Wirén, Christian Wittern, Tsegay
Woldemariam, Tak-sum Wong, Alina Wróblewska, Mary Yako, Naoki Yamazaki,
Chunxiao Yan, Koichi Yasuoka, Marat M. Yavrumyan, Zhuoran Yu, Zdeněk
Žabokrtský, Amir Zeldes, Daniel Zeman, Manying Zhang, Hanzhi Zhu

References

Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D.
Manning. 2006. Generating typed dependency parses from phrase structure
parses. In Proceedings of LREC.

Marie-Catherine de Marneffe and Christopher D. Manning. 2008. The
Stanford typed dependencies representation. In COLING Workshop on
Cross-framework and Cross-domain Parser Evaluation.

Marie-Catherine de Marneffe, Timothy Dozat, Natalia Silveira, Katri
Haverinen, Filip Ginter, Joakim Nivre, and Christopher Manning. 2014.
Universal Stanford Dependencies: A cross-linguistic typology. In
Proceedings of LREC.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg,
Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo
Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016.
Universal Dependencies v1: A Multilingual Treebank Collection. In
Proceedings of LREC.

Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal
part-of-speech tagset. In Proceedings of LREC.

Daniel Zeman. 2008. Reusable Tagset Conversion Using Tagset Drivers. In
Proceedings of LREC.