The arText prototype: An automatic system for writing specialized texts

This article describes an automatic system for writing specialized texts in Spanish. The arText prototype is a free online text editor that includes different types of linguistic information. It is designed for a variety of end users and domains, including specialists and university students working in the fields of medicine and tourism, and laypersons writing to the public administration. ArText provides guidance on how to structure a text, prompts users to include all necessary contents in each section, and detects lexical and discourse problems in the text.


Introduction
In the field of Natural Language Processing (NLP), various types of linguistic information, including phonological, morphological, lexical, syntactic, semantic and discourse-related features, can be used to develop applications. To date, tools for writing texts have often been designed for general subject areas and have included information on orthographic, grammatical and/or lexical aspects of the writing process. NLP researchers have tended not to study systems for structuring and writing specialized texts, although a few researchers have bucked this trend: Kinnunen et al. (2012) developed a system to identify and correct writing problems in English in several domains; the system of Aluisio et al. (2001) helps non-native speakers write scientific publications in English; the Writing Pal (Dai et al., 2011) and Estilector (http://www.estilector.com/index.php) systems help improve academic writing in English and Spanish, respectively; and LanguageTool (https://www.languagetool.org/) is an open source proofreading program for non-specialized texts in several languages. To our knowledge, none of the systems that are currently available have considered the specific characteristics of textual genres in specialized domains, such as medicine, tourism and the public administration.
Writing specialized texts is more challenging than writing general texts (Cabré, 1999). Textual, lexical and discourse features are an essential component of textual genres, such as medical research papers, travel blog posts, or claims submitted to the public administration. Against this backdrop, this article aims to present a prototype for an automatic system that provides assistance in writing specialized texts in Spanish. The arText system includes textual, lexical and discourse-related information, and is useful for different end users. It provides guidance on how to structure a text, prompts users to include all necessary contents in each section, and detects lexical and discourse problems in the text.
Da Cunha et al. (in press) determined the most frequent textual genres that pose the greatest writing challenges for three groups: specialists and university students in medicine and tourism, and laypersons writing to the public administration. ArText was designed to help these users write the 15 textual genres included in Table 1.
Section 2 describes the characteristics of the system and its modules. Section 3 explains how the system was evaluated, while Section 4 presents conclusions and future lines of research.

Description of the System
ArText is a free online text editor that anyone can use, with no registration required. The system was developed in a Linux environment using an Apache server and a MySQL database. A variety of resources were utilized in the back end (Bash, Perl, and PHP, with the Laravel framework) and front end (HTML, CSS, and JavaScript, with AJAX and jQuery); Google Analytics is integrated into the site to measure traffic. Documents can be exported in four formats: PDF, TXT, HTML and ARTEXT. Previously saved documents can be uploaded using the ARTEXT format, and the website includes a detailed user manual and a contact section for comments, questions and suggestions.
ArText can be accessed at http://sistema-artext.com/ and has been optimized for the Google Chrome browser. To use arText, click on "Start using arText" and pick one of the 15 textual genres mentioned above. This brings you to the text editor, where you can start writing with the help of the three modules integrated into arText: Structure, Contents and Phraseology; Format and Spellchecking; and Lexical and Discourse-based Recommendations.

Module 1. Structure, Contents and Phraseology
The left-hand column helps users structure and draft documents. Its interactive template includes typical sections, contents and phraseology for each textual genre. This information was extracted from da Cunha and Montané (2016), a corpus-based analysis following van Dijk's (1989) textual approach. Specifically, users can insert:
- Typical document sections
- Typical contents found in each section
- Phraseology related to each of these contents
The text editor displays the sections which typically appear in a given textual genre. For example, the template for a "claim" to be submitted to the public administration includes the following sections:
- Header
- Addressee
- Introductory clause
- Supporting details
- Request
- Closing
A drop-down menu in the left-hand column provides sample texts for each section, including section titles, where appropriate. For example, the "Supporting details" section includes two different contents:
- Grounds for the claim
- Attachments
When users click on a specific content, arText displays a list of sample phrases that can be incorporated into the final text. For example, "Attachments" includes the following phrases:
- Attached please find [document name].
- The following supporting documents are attached: [list of documents].
Users can click on a stock phrase to include it in the text.
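The section, content and phrase hierarchy described above can be pictured as a nested mapping. The following Python sketch is purely illustrative and does not reflect arText's actual implementation (the system's back end is written in Bash, Perl and PHP); the names CLAIM_TEMPLATE and phrases_for are hypothetical, and entries are left empty wherever this article does not list their contents or phrases.

```python
# Illustrative template for the "claim" genre. Only the section names, the two
# contents of "Supporting details" and the two "Attachments" phrases are taken
# from the article; empty entries mean "not listed in the article".
CLAIM_TEMPLATE = {
    "Header": {},
    "Addressee": {},
    "Introductory clause": {},
    "Supporting details": {
        "Grounds for the claim": [],
        "Attachments": [
            "Attached please find [document name].",
            "The following supporting documents are attached: [list of documents].",
        ],
    },
    "Request": {},
    "Closing": {},
}


def phrases_for(template, section, content):
    """Return the stock phrases for one content of one section (empty if none)."""
    return template.get(section, {}).get(content, [])


if __name__ == "__main__":
    # Sections are listed in the order in which they appear in the template.
    for section in CLAIM_TEMPLATE:
        print(section)
    for phrase in phrases_for(CLAIM_TEMPLATE, "Supporting details", "Attachments"):
        print("  ", phrase)
```

Because Python dictionaries preserve insertion order, iterating over the template yields the sections in the order a user would see them in the left-hand column.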

Module 2. Format and Spellchecking
The toolbar at the top of the screen includes an open source spellchecker (WebSpellChecker Ltd.) and various formatting options, e.g. to change font or font size; insert bullet points, images, tables and links; cut, copy and paste; print; and search. Since online storage is not provided, the user's manual includes instructions for uploading an image to Google Drive and inserting it into a document produced using arText.

Module 3. Lexical and Discourse-based Recommendations
By clicking on the review button in the right-hand column, users can see a series of lexical and discourse-related recommendations for improving their texts. These recommendations are derived from da Cunha and Montané (2016). This module includes 11 main recommendations, which are displayed in the right-hand column when appropriate. A subset of these recommendations is assigned to each textual genre, and all recommendations are adapted to the linguistic characteristics of each genre (da Cunha and Montané, 2016). Recommendations cover the following 11 topics:
1. Spelling out acronyms
2. Using acronyms systematically
3. Providing definitions
4. Using the passive voice
5. Using the 1st person systematically
6. Using subjectivity indicators
7. Repeating words
8. Using long sentences
9. Segmenting long sentences
10. Considering alternative discourse markers
11. Varying discourse markers
By clicking on a given recommendation, users can see a more detailed explanation and suggestions. In some cases, arText also highlights phrases or content in the text editor. For example, one lexical recommendation, "Spelling out acronyms," highlights acronyms that are not spelled out when they first appear in the text (for instance, arText would highlight "COPD" if "chronic obstructive pulmonary disease" did not appear next to this acronym the first time the term was used). Some recommendations also actively engage users in the revision process. For example, the recommendation "Repeating words" shows a list of repeated words; when users click on a word in the right-hand column, all occurrences of this word in the text are highlighted. This recommendation is not displayed for highly specialized textual genres (e.g. research articles and abstracts), since lexical variation is usually avoided in these types of texts (Cabré, 1999).
Some recommendations focus on the discourse level. For example, "Segmenting long sentences" highlights one or more long sentences; users can click to see suggestions for splitting them into shorter sentences. In this case, arText proposes these discourse segments in the right-hand column. The number of words used to determine long sentences differs for each textual genre, following da Cunha and Montané (2016).
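The idea of a genre-dependent word threshold for long sentences can be sketched as follows. This is a minimal Python illustration, not arText's implementation: the threshold values, the genre keys and the function names are all assumptions made for the example (the actual thresholds come from da Cunha and Montané (2016) and are not given in this article), and the sentence splitter is deliberately naive.

```python
import re

# HYPOTHETICAL word thresholds per genre, chosen only for illustration.
LONG_SENTENCE_THRESHOLDS = {
    "abstract": 40,
    "informative_article": 30,
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_long_sentences(text, genre):
    """Return the sentences whose word count exceeds the genre's threshold."""
    limit = LONG_SENTENCE_THRESHOLDS[genre]
    return [s for s in split_sentences(text) if len(s.split()) > limit]


if __name__ == "__main__":
    sample = "Short one. " + " ".join(["word"] * 35) + "."
    # The 35-word sentence exceeds the assumed limit for informative articles
    # but not the assumed limit for abstracts.
    print(len(find_long_sentences(sample, "informative_article")))
    print(len(find_long_sentences(sample, "abstract")))
```

Keeping the thresholds in a per-genre table mirrors the point made above: the same sentence may be flagged in one genre and accepted in another.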
Another discourse-level recommendation is "Varying discourse markers." In this case, arText displays a list of discourse markers repeated in the text. When users click on one of these markers, all of its occurrences are highlighted, and a list of alternative discourse markers used to express the same relationship (e.g. Cause, Restatement, Contrast or Condition) is displayed in the right-hand column. For instance, for the discourse marker "that is," used to express Restatement, arText suggests the alternatives "in other words," "that is to say," "i.e." and "to put it another way."
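A rough Python sketch of this behaviour is given below. It is an illustration under stated assumptions rather than arText's code: the marker inventory is a tiny English sample, of which only the Restatement group is taken from the example above (the Contrast group is an assumption added for illustration), and markers are matched as plain strings, so a sequence such as "that is" is counted even where it does not act as a discourse marker.

```python
import re
from collections import Counter

MARKERS_BY_RELATION = {
    # Taken from the Restatement example in the article.
    "Restatement": ["that is", "in other words", "that is to say", "i.e.",
                    "to put it another way"],
    # ASSUMED group, added only to show a second relation.
    "Contrast": ["however", "nevertheless", "on the other hand"],
}

# Try longer markers first so "that is to say" is not counted as "that is".
_ALL_MARKERS = sorted(
    (m for group in MARKERS_BY_RELATION.values() for m in group),
    key=len,
    reverse=True,
)
_PATTERN = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(m) for m in _ALL_MARKERS) + r")(?!\w)",
    re.IGNORECASE,
)


def repeated_markers(text, min_count=2):
    """Map each marker used at least min_count times to same-relation alternatives."""
    counts = Counter(m.group(0).lower() for m in _PATTERN.finditer(text))
    suggestions = {}
    for marker, n in counts.items():
        if n >= min_count:
            relation = next(r for r, group in MARKERS_BY_RELATION.items()
                            if marker in group)
            suggestions[marker] = [m for m in MARKERS_BY_RELATION[relation]
                                   if m != marker]
    return suggestions


if __name__ == "__main__":
    sample = ("The fee is refundable, that is, you get it back. "
              "The decision is final, that is, no appeal is possible.")
    print(repeated_markers(sample))
```

Grouping markers by relation is what makes the suggestions meaningful: an alternative is only offered if it expresses the same relationship as the repeated marker.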

Evaluation
Real and ad hoc texts were used to test arText's algorithms and linguistic rules and improve the system. Subsequently, the prototype was launched and data-driven and user-driven evaluations were conducted.
The data-driven evaluation was based on a test corpus with 24 texts corresponding to one textual genre from each domain; the corpus comprised eight medical abstracts, eight tourism-related informative articles and eight applications to the public administration. The linguistic characteristics of these texts were manually annotated, and the manual annotation and arText results were compared. Precision and recall were measured for a series of recommendations; the results are presented in Table 2. Recommendation 7 did not apply in the medical subcorpus; in the tourism and public administration subcorpora, 91.67% and 94.70% of detected words, respectively, were repeated in the text. For Recommendation 8, 100% of highlighted sentences in the medicine and tourism subcorpora were long sentences according to the thresholds for abstracts and informative articles; no long sentences appeared in the public administration subcorpus, so this recommendation could not be tested for this genre. No cases of Recommendation 11 were found in the medical and public administration subcorpora; in the tourism subcorpus, 100% of detected discourse markers were repeated in the text, and adequate alternatives were proposed.
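For reference, comparing system output against manual annotation in terms of precision and recall can be sketched as below. This Python snippet is a generic illustration and not the evaluation script used for arText; the function name is hypothetical and the items in the usage example are toy data, not results from the test corpus.

```python
def precision_recall(detected, gold):
    """Compare system detections with manual annotations.

    precision = correct detections / all detections
    recall    = correct detections / all manually annotated items
    A value of None means the measure is undefined (nothing detected,
    or nothing annotated).
    """
    detected, gold = set(detected), set(gold)
    true_positives = len(detected & gold)
    precision = true_positives / len(detected) if detected else None
    recall = true_positives / len(gold) if gold else None
    return precision, recall


if __name__ == "__main__":
    # Toy data only: three detections, two of them confirmed by the annotators,
    # out of four annotated items.
    system_output = {"item_a", "item_b", "item_c"}
    manual_annotation = {"item_a", "item_b", "item_d", "item_e"}
    print(precision_recall(system_output, manual_annotation))
```

Returning None for an undefined measure corresponds to the situation described above in which a recommendation could not be tested because no relevant cases appeared in a subcorpus.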
The user-driven evaluation aimed to determine how useful arText is. A survey designed using Google Forms focused on accessibility, the usefulness of the three modules and general issues. Three doctors, three tourism professionals, and 25 laypersons completed the survey; all laypersons were between 30 and 50 years old and had both higher education experience and internet skills. In general, respondents found arText to be user-friendly and useful; 100% of them would recommend the system to other people. Respondents found the module on structure to be the most useful, while the approach to uploading images was considered the system's greatest weakness.

Conclusions and Future Work
This paper describes a prototype of an automatic system to assist users in writing specialized texts. The online arText editor helps users draft texts for 15 textual genres in three specialized domains: medicine, tourism and the public administration. It lays out the structure of the document, suggests appropriate contents and stock phrases for each section, and detects typical lexical and discourse problems. To our knowledge, this is the first tool that considers lexical, textual and discourse features for specific textual genres. Moreover, the arText project is based on the idea that academic research can be shared with and used constructively by the general public.
In the future, the results of the data-driven evaluation will be utilized to improve arText's algorithms. A second user-driven evaluation will include a broader population (e.g. students). Finally, arText may be adapted to other textual genres, specialized domains and languages.