Sociolinguistic Corpus of WhatsApp Chats in Spanish among College Students

This work presents the Sociolinguistic Corpus of WhatsApp Chats in Spanish among College Students, a corpus of raw data for general use. Its purpose is to offer data for the study of language and interactions via Instant Messaging (IM) among undergraduates. The paper gives an overview of both the corpus's content and its demographic metadata. Furthermore, it presents the research currently being conducted with it, namely on parenthetical expressions, orality traits, and code-switching. This work also includes a brief outline of similar corpora and recent studies in the field of IM.


Introduction
As digital communication technologies grow and spread, computer-mediated communication (CMC) (Baron, 1984), which includes IM, changes and becomes a very distinct sort of interaction. According to Álvarez (2011), a new discourse level emerges through such interaction, one that makes the distinction between writing and speaking less and less clear. This discourse style has previously been called both spoken writing (Blanco Rodríguez, 2002) and oralized text (Yus Ramos, 2010).
In order to study such a particular register, it is necessary to gather a robust corpus. The Sociolinguistic Corpus of WhatsApp Chats in Spanish for College Speech Analysis intends to be a resource that allows researchers to explore and characterize conversations held by college students with their peers, or with other kinds of participants, via the IM application known as WhatsApp (hereafter WA). This corpus is limited to undergraduates studying at Ciudad Universitaria (commonly known as C.U.), the main campus of the National Autonomous University of Mexico (UNAM). Undergraduates were chosen because, in Mexico, 94.1% of the population with an undergraduate degree or a higher educational level uses the Internet for communication purposes, mainly via IM, and generally accesses it on a smartphone. Furthermore, most IM users are 12 to 34 years old, which is the age group the majority of college students belong to (INEGI, 2016).

Similar Corpora
Prior to the collection of WA corpora, other databases were created to allow the study of CMC. Examples are the NPS Internet Chatroom Conversations Corpus (Forsyth et al., 2010), an annotated corpus of interactions in English in diverse chatrooms, and the Dortmunder Chat Corpus (Beisswenger, 2013), a robust, annotated corpus in German divided into 4 subcorpora based on the topic of the chats (free time, learning contexts, consultations, and media). In addition to these corpora, it is worth mentioning the NUS SMS Corpus (Chen and Kan, 2013), which comprises 71,000 messages in both English and Chinese. Even though SMS is not an internet-mediated means of communication, it can be compared to interactions via WA.
Although the study of WA chats is a relatively novel research field, there are several corpora that specialize mostly in them. One of the most important projects is the one conducted by researchers of the Universities of Zurich, Bern, Neuchâtel and Leipzig. The main aim of the What's up, Switzerland? corpus (Stark et al., 2014-) is to characterize WA chats and compare them to SMS. It has 617 chats written by 1,538 participants. Since just 945 of them consented to have their chats used, the total number of messages available for linguistic research is 763,650, comprising 5,543,692 tokens. Only 426 participants shared further demographic information (Überwasser and Stark, 2017). Given that Switzerland is a multilingual country, 46% of the corpus is in German, 34% in French, 14% in Italian, 3% in Romansh and 3% in English. The sociodemographic information saved as metadata comprises age, gender, educational level, and place of residence divided into 9 regions. So far, the publications derived from this project focus not only on the different levels of language, but also on the role of complementary items in conversation, such as images, acronyms, emojis, emoticons, and the combination or modification of characters.
Verheijen and Stoop (2016) compiled a corpus of posts and WA chats in Dutch, which is part of the SoNaR project (STEVIN Nederlandstalig Referentiecorpus). The corpus has 332,657 words in 15 chats donated by 34 informants. Its metadata encompasses the informants' name, birth place and date, age, gender, educational level, and the place from which the chats were sent. This corpus was used as one of the bases for a study in which WA and other written forms were compared (Verheijen, 2017).
Hilte et al. (2017) compiled a corpus of chats between Flemish teenagers aged 13-20 taken from Facebook Messenger, WA, and iMessage, with the purpose of identifying the impact of social variables (namely age, gender and education) on teenagers' non-standard use of language in CMC.
In addition to these, an ongoing project is MoCoDa2, conducted by Beisswenger et al. (2017), which is a continuation of the preceding corpus MoCoDa and has put together 2,198 interactions with 19,161 user posts.
Nevertheless, none of these authors defined what they conceive of as a chat. To avoid any misconception, in the making of this corpus we consider a chat to be an exchange between two users regardless of length or date. That is, it does not matter when the conversation started; what matters is the wholeness of the txt file.
Although there are, indeed, corpora of WA chats in Spanish, they are project-related rather than for general use. Besides, they are not as robust as the aforementioned ones. Said corpora are presented in the following section.

Research on WhatsApp Chats
Because of their peculiarities, virtual interactions through diverse platforms like WA, WeChat, Facebook Messenger, and so forth have drawn the attention of linguists. Previous studies conducted using similar corpora vary in the topics they approach. Some of the aspects of language that can be studied with sociolinguistic corpora like ours are discourse units and phenomena such as turns and turn-taking, speech acts, and interactions. Another phenomenon that has proved to be interesting is code-switching in IM (Nurhamidah, 2017; Zaehres, 2016; Zagoricnik, 2014). As Al-Emran and Al-Qaysi (2013) have stated, "WhatsApp is found to be the most social networking App used for code-switching by both students and educators", which is why authors like Elsayed (2014) have focused on such a population.

Sociolinguistic Variables in the Corpus
Considering this is a sociolinguistic corpus, several sociodemographic variables were defined as metadata and divided into two groups:
(a) Balance axes, which are the two variables that help to keep the balance and representativeness of the corpus:
• Sex: male or female. We chose sex over gender because it is the sociodemographic variable used by UNAM in its statistics.
Our goal was to collect at least 1% of the campus's population, maintaining the same proportion of men and women as in each faculty.
(b) Post-stratification criteria, whose relevance will depend on the type of study conducted with this corpus.
Overall, our corpus has 12 sociolinguistic variables that contribute to a large degree to the characterization and study of language in IM among youngsters. Furthermore, this allows our data to become a subcorpus of a much larger one in the future.

Data Collection
After establishing the sociodemographic metadata to be collected along with WA chats, the team proceeded to gather the data. In order to ease the data processing, we collected chats with two participants only. All chats were donated as text files sent directly from the donors' devices, while metadata was collected manually. At the initial stage, the chats were collected using the directed sampling method. A team approached random students on campus, explaining the project to them and inviting them to collaborate by donating one or more WA chats. Those who consented to share their chats (the donors) sent them via email to an institutional address and were then asked to answer a survey so the team could gather their own and their interlocutor's sociodemographic information. After that, the information provided was entered into a spreadsheet along with a code that made it possible to link it to the corresponding text file. It is worth mentioning that the same metadata was collected with both methods.

Data Processing
The data was processed in two stages. First, by means of a Python script, the collected data was saved into a spreadsheet. In the same stage, it was organized in JSON format and sent to the database as a document file. Second, a program anonymized the users by changing their names in every chat to USER1 and USER2 and by deleting sensitive information such as names, addresses, emails, phone numbers, bank accounts, and so forth.
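The two stages described above can be illustrated with a minimal sketch. It is not the project's actual script: the export line layout, the regular expressions, the placeholder strings `<EMAIL>` and `<PHONE>`, and the function names `parse_chat` and `anonymize` are all assumptions made for illustration, and the sketch masks sensitive spans rather than deleting them.

```python
import json
import re

# Assumed line layout of an exported chat: "12/03/18, 10:15 - Ana: Hola".
# Real exports differ by device and locale, so these patterns are illustrative.
STAMP_RE = re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4},? \d{1,2}:\d{2} - ")
SENDER_RE = re.compile(r"^([^:]+): (.*)$")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")


def parse_chat(raw_text):
    """Split an exported chat into {"sender", "text"} dictionaries."""
    messages = []
    current = None  # message that continuation lines belong to
    for line in raw_text.splitlines():
        stamp = STAMP_RE.match(line)
        if stamp is None:
            # No timestamp: treat the line as part of a multi-line message.
            if current is not None:
                current["text"] += "\n" + line
            continue
        sender = SENDER_RE.match(line[stamp.end():])
        if sender is None:
            current = None  # app-generated notice without a "Name: " prefix
            continue
        current = {"sender": sender.group(1), "text": sender.group(2)}
        messages.append(current)
    return messages


def anonymize(messages):
    """Relabel senders as USER1, USER2, ... and mask a few sensitive patterns."""
    labels = {}
    for message in messages:
        labels.setdefault(message["sender"], f"USER{len(labels) + 1}")
    cleaned = []
    for message in messages:
        text = EMAIL_RE.sub("<EMAIL>", message["text"])
        text = PHONE_RE.sub("<PHONE>", text)
        # Replace participants' names wherever they occur inside messages.
        for name, label in labels.items():
            text = re.sub(r"\b" + re.escape(name) + r"\b", label, text)
        cleaned.append({"sender": labels[message["sender"]], "text": text})
    return cleaned


raw = (
    "12/03/18, 10:15 - Messages to this chat are now secured\n"
    "12/03/18, 10:16 - Ana: Hola Luis, escríbeme a ana@example.com\n"
    "12/03/18, 10:17 - Luis: Va\n"
    "mi cel es 55 1234 5678"
)
document = anonymize(parse_chat(raw))
print(json.dumps(document, ensure_ascii=False, indent=2))
```

Pattern-based masking of this kind only catches the formats it was written for, so names of third parties, addresses, or bank accounts would still need dedicated rules or manual review.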
Currently, queries can be done with both Python scripts and MongoDB. These tools permit filtering results by metadata, as well as selecting relevant sociological variables and determining their ranges. In the future, we will develop an interface for accessing and querying the database.
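Filtering by metadata with value ranges can be sketched in plain Python as follows. The record layout and the field names `code`, `sex`, `age`, and `faculty` are hypothetical, as are the sample values; the corpus's actual schema is not described here.

```python
def matches(metadata, criteria):
    """Return True when every criterion holds for one chat's metadata.

    A criterion is either an exact value or a (low, high) inclusive range.
    """
    for field, wanted in criteria.items():
        value = metadata.get(field)
        if isinstance(wanted, tuple):
            low, high = wanted
            if value is None or not low <= value <= high:
                return False
        elif value != wanted:
            return False
    return True


def select_chats(chats, criteria):
    """Keep only the chats whose metadata satisfies all criteria."""
    return [chat for chat in chats if matches(chat["metadata"], criteria)]


# Hypothetical records; field names and values are invented for illustration.
chats = [
    {"code": "C001", "metadata": {"sex": "female", "age": 19, "faculty": "Sciences"}},
    {"code": "C002", "metadata": {"sex": "male", "age": 21, "faculty": "Law"}},
    {"code": "C003", "metadata": {"sex": "female", "age": 27, "faculty": "Law"}},
]
selected = select_chats(chats, {"sex": "female", "age": (18, 22)})
print([chat["code"] for chat in selected])
```

Under the same hypothetical schema, a comparable MongoDB filter document would be `{"metadata.sex": "female", "metadata.age": {"$gte": 18, "$lte": 22}}`.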

The Corpus
Although the corpus is still being processed, it has reached a mature stage that allows us to offer a general panorama of its content and demographics. The following figures represent the state of the corpus as of March 2018. Should any changes be made, the final figures will be presented in future publications.

Content
Currently, we have 835 chats with 1,325 informants. After deleting dates, user names and all messages generated automatically by the app, we obtained 66,465 messages, 756,066 tokens and 45,497 types available for linguistic research.
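How message, token, and type counts relate to one another can be shown with a short sketch. The tokenization rule below (runs of word characters, with every other non-space character as its own token, lowercased) is an assumption; the tokenizer actually used for the figures above is not specified, and the input is assumed to be already stripped of dates, user names, and app-generated messages.

```python
import re

# Assumed tokenization: a run of word characters is one token, and any other
# non-space character (punctuation mark, emoji code point) is its own token.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def corpus_counts(messages):
    """Count messages, tokens (running words) and types (distinct tokens)."""
    tokens = [
        token.lower()
        for message in messages
        for token in TOKEN_RE.findall(message)
    ]
    return {
        "messages": len(messages),
        "tokens": len(tokens),
        "types": len(set(tokens)),
    }


sample = ["Hola hola", "¿Qué onda?"]
print(corpus_counts(sample))
```

Here "Hola" and "hola" count as two tokens but a single type because of lowercasing; a case-sensitive count would yield more types.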
Although the vast majority of our informants are Mexican native Spanish speakers, texts in other languages were found as well. Most of the messages in a language other than Spanish were written in English; however, there are also texts in French, Japanese, Italian, German, Korean, Greek and Chinese.
We were also able to pinpoint the lexical words most frequently used by the informants. Students seem to be keen on using the ones displayed in Table 1.
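A frequency list of lexical words can be obtained by counting word forms after discarding function words. The sketch below is only an illustration: the short `FUNCTION_WORDS` list and the sample messages are invented, and the criteria actually used to separate lexical from function words in the corpus are not stated here.

```python
import re
from collections import Counter

# Tiny illustrative stop list; a real study would need a fuller inventory.
FUNCTION_WORDS = {
    "a", "de", "el", "en", "es", "la", "lo", "los", "me",
    "no", "que", "se", "si", "te", "y", "ya",
}
WORD_RE = re.compile(r"[^\W\d_]+")  # runs of letters, accented ones included


def top_lexical_words(messages, n=10):
    """Return the n most frequent words that are not in the stop list."""
    counts = Counter(
        word
        for message in messages
        for word in WORD_RE.findall(message.lower())
        if word not in FUNCTION_WORDS
    )
    return counts.most_common(n)


sample = ["No manches, ya vi la tarea", "La tarea es de que no manches", "tarea"]
print(top_lexical_words(sample, n=2))
```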
As previously mentioned, communication via IM shares several features with oral communication. However, since it lacks physical copresence, it is necessary to develop some compensation strategies, which is why emojis and emoticons are so widespread. The most frequent of these icons found in the corpus are shown in Ta

Demographics
As stated above, our corpus was built with the collaboration of 1,325 informants (51% women and 49% men) between the ages of 14 and 60, born in 23 of the 32 states in Mexico. The wide age range is due to the fact that some donors shared chats held not with peers but with family members, coworkers, or friends. Of all informants, 84.9% are undergraduates studying at C.U. Of these students, 51.2% are women and 48.8% are men. Henceforth, all figures refer only to undergraduate informants. Of the undergraduates in the corpus, 80.7% were born in Mexico City, while 11.7% were born in Estado de México, the biggest state surrounding the capital. The rest were born in 20 other states, particularly Hidalgo, Guerrero and Michoacán.
Our corpus also has informants born in Chile (2), Colombia (1), and the United States (1), as well as 3 who did not report their birthplace. As to residence, 77.4% of our informants live in Mexico City, while 19.4% live in Estado de México. The remaining 3.2% did not state their postal code.
As to sexual orientation, 88.9% of students in the corpus declared themselves heterosexual, 5.5% bisexual and 5.4% homosexual. Just 0.2% chose not to share such information.
Although the purpose of this corpus is to collect data from Mexican native Spanish speakers, some informants donated chats with people from other countries: Chile, Colombia, Costa Rica, Italy, Lebanon, and the United States, to name a few. All of these conversations were conducted mostly in Spanish. As a second language, informants claimed to speak Arabic, Bulgarian, Chinese, English, French, German, modern Greek, Italian, Japanese, Korean, Nahuatl, Portuguese, Russian, or Swedish.
The students who donated their chats and their interlocutors belong to different faculties. The following table presents both the faculty roster at C.U. and the number of informants by sex.

[Table: number of informants by faculty and sex. Recovered column headers: Faculty, Male.]

Current Research
At the time of writing, there are three lines of research in the study of our corpus. One of them is parenthetical expressions, which can work as repairs, instructions for interpretation, onomatopoetic expressions, surrogate prosodic cues to indicate how an utterance should be read, or surrogate proxemic cues such as emotes, that is, sentences that indicate imaginary actions taking place at the moment of texting (Christopherson, 2010). There is also the quantitative approach to code-switching from a sociolinguistic perspective, followed by a qualitative study of its forms and functions (Elsayed, 2014).

Conclusion and Future Work
We presented a corpus that will make it possible to study language usage by college students via an Instant Messaging application. Its metadata will allow not only research on purely linguistic phenomena, but also the establishment of correlations between these and sociodemographic variables. Some of the phenomena that can be studied in interactions such as the ones via IM are phonic traits, parenthetical expressions, code-switching, turn-taking, speech acts, linguistic variation, and usage of emojis and emoticons.
The processing of data is still a work in progress. As a next step, we plan to perform an evaluation of the anonymization process.
The objective of this corpus is to be used by both scholars and students in our group for research on the aforementioned phenomena and others, and it is our intention to make it available to others upon request, for academic purposes only.