VICTOR: a Dataset for Brazilian Legal Documents Classification

Pedro Henrique Luz de Araujo, Teófilo Emídio de Campos, Fabricio Ataides Braz, Nilton Correia da Silva


Abstract
This paper describes VICTOR, a novel dataset built from digitized legal documents of Brazil’s Supreme Court. It comprises more than 45 thousand appeals, which include roughly 692 thousand documents (about 4.6 million pages). The dataset contains labeled text data and supports two types of tasks: document type classification and theme assignment, a multilabel problem. We present baseline results using bag-of-words models, convolutional neural networks, recurrent neural networks and boosting algorithms. We also experiment with linear-chain Conditional Random Fields to leverage the sequential nature of the lawsuits, which we find leads to improvements in document type classification. Finally, we compare two theme classification approaches: one that uses domain knowledge to filter out the less informative document pages, and the default one that uses all pages. Contrary to the Court experts’ expectations, we find that using all available data is the better method. We make the dataset available in three versions of different sizes and contents to encourage exploration of better models and techniques.
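To illustrate why the page order within a lawsuit can help document type classification, here is a minimal sketch of Viterbi decoding over per-page scores, the inference step used by linear-chain models. It is not the authors' implementation: the label names, the per-page probabilities, and the transition probabilities are all made-up assumptions, and no parameters are learned here.

```python
import math

def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions: one dict per page, mapping label -> log-score.
    transitions: dict mapping (previous_label, label) -> log-score.
    """
    # best[i][lab] is the score of the best path ending in `lab` at page i.
    best = [{lab: emissions[0][lab] for lab in labels}]
    back = []  # back[i][lab] is the best previous label for `lab` at page i + 1
    for scores in emissions[1:]:
        prev = best[-1]
        cur, ptr = {}, {}
        for lab in labels:
            p = max(labels, key=lambda q: prev[q] + transitions[(q, lab)])
            cur[lab] = prev[p] + transitions[(p, lab)] + scores[lab]
            ptr[lab] = p
        best.append(cur)
        back.append(ptr)
    # Follow the back-pointers from the best final label.
    last = max(labels, key=lambda lab: best[-1][lab])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Hypothetical label set and page-level classifier probabilities (assumed values).
labels = ["petition", "other"]
page_probs = [
    {"petition": 0.9, "other": 0.1},
    {"petition": 0.8, "other": 0.2},
    {"petition": 0.4, "other": 0.6},  # ambiguous page
    {"petition": 0.9, "other": 0.1},
]
emissions = [{lab: math.log(p) for lab, p in page.items()} for page in page_probs]

# Assumed transition model: consecutive pages usually share a document type.
transitions = {
    (a, b): math.log(0.9 if a == b else 0.1) for a in labels for b in labels
}

# Classifying each page on its own versus decoding the whole sequence.
independent = [max(labels, key=lambda lab: page[lab]) for page in page_probs]
smoothed = viterbi(emissions, transitions, labels)
print(independent)
print(smoothed)
```

In this toy setup the third page is labeled "other" when classified in isolation, but sequence decoding assigns it the same label as its neighbors, because switching labels twice costs more than accepting the slightly less likely page-level label.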
Anthology ID:
2020.lrec-1.181
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
1449–1458
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.181
Cite (ACL):
Pedro Henrique Luz de Araujo, Teófilo Emídio de Campos, Fabricio Ataides Braz, and Nilton Correia da Silva. 2020. VICTOR: a Dataset for Brazilian Legal Documents Classification. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 1449–1458, Marseille, France. European Language Resources Association.
Cite (Informal):
VICTOR: a Dataset for Brazilian Legal Documents Classification (Luz de Araujo et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.181.pdf