Generating Clinically Relevant Texts: A Case Study on Life-Changing Events

The need to protect privacy poses unique challenges to behavioral research. For instance, researchers often cannot use examples drawn directly from such data to explain or illustrate key findings. In this research, we use data-driven models to synthesize realistic-looking data, focusing on discourse produced by social-media participants announcing life-changing events. We comparatively explore the performance of distinct techniques for generating synthetic linguistic data across different linguistic units and topics. Our approach offers utility not only for reporting on qualitative behavioral research on such data, where directly quoting a participant’s content can unintentionally reveal sensitive information about the participant, but also for clinical computational system developers, for whom access to realistic synthetic data may be sufficient for the software development process. Accordingly, the work also has implications for computational linguistics at large.


Introduction
Behavioral research using personal data, such as that from social media or clinical studies, must continually balance insights gained with respect for privacy. Ethical and legal demands also come into play. Deidentification involves removing information such as named entities, address-specific information and social security numbers. However, naive approaches are often prone to privacy attacks. Such de-identified data will often still contain information that, when combined with other data from different resources, can point to the individual who generated it. For example, if a de-identified dataset contains detailed demographic information, it could then be possible to extract a small list of people matching this information and to identify a specific person using other, publicly available data.
One approach that strikes a good balance is to synthesize realistic-looking data with the same statistical properties as actual data. Our contribution is to compare different techniques for synthesizing behavioral data. Specifically, we explore this problem in a case study with social media texts that involve social media participants making announcements about life-changing events, which are personal in nature and which also may affect, positively or negatively, a person's well-being.
Two immediate applications to clinical research motivate this approach: reporting qualitative results that involve textual data, and resolving data access issues in software development. Neither readers of scientific reports nor software developers need access to the original data as long as realistic-looking synthetic data is available.

Related Work
In the clinical setting, data privacy is important. Anonymization aims to ensure that data is untraceable to an original user, whereas de-identification may allow the data to be traced back to a user with third-party information. Szarvas et al. (2007) developed a model for anonymizing personal health information (PHI) from discharge records. The model identifies PHIs in several steps and labels all entities which can be tagged from the text structure. It then queries for additional PHI phrases in the text with help from tagged PHI entities. Bayardo and Agrawal (2005) present improved k-anonymity methods and provide efficient algorithms for data dimensionality reduction. However, even if information such as names of people or providers or quasi-identifiers (QIs) is removed, there are still ways to compare the de-identified data with other records having these identifiers.
Figure 1: Top level view of the proposed anonymization system. Data is fed to a model which here is a character-based Long-Short Term Memory (LSTM). The LSTM generates new tweets based on the input data.
In contrast to traditional anonymization and de-identification methods, generation of synthetic data can handle various aspects of hiding individuals, by aggregating and severing data from individual users, yet maintaining the statistical properties of the data used to train generation models. For this paper we explore several forms of data generation, using social media (Twitter) data about life-changing events as a case study. Twitter data has previously been used for studying important life-changing events (De Choudhury et al., 2013; Li et al., 2014). Other studies present methods for anonymizing Twitter datasets. Terrovitis et al. (2008) model social media as an undirected, unlabeled graph which retains the privacy of social media users. Daubert et al. (2014) discuss the different methods for anonymization of Twitter data. However, there is a lack of work that addresses synthetic data creation using machine generation models.
This paper compares traditional statistical language models and Long Short Term Memory (LSTM) models to learn models from a training set of Twitter data to generate synthetic tweets. LSTMs are recurrent neural networks designed to learn both long and short term temporal sequences. These networks were introduced by Hochreiter and Schmidhuber (1997), with several improvements over the years, the most common of which include individual gating elements (Graves and Schmidhuber, 2005). LSTMs have been shown to perform at state-of-the-art levels for many tasks, including handwriting recognition and generation, language modeling, and machine translation (Greff et al., 2015).

Data
Twitter is a microblogging platform used by people to post about their lives. If harnessed properly, tweets can be used for analysis and research of behavioral patterns as well as in studying health information.
We collected tweets using Twitter's streaming API along with customized query strings. These queries targeted the life-changing events of birth, death, marriage, and divorce. The tweet collection process suggested that users were more likely to share joyful news about marriage and birth, and less likely to share difficult news about death and divorce. Tweets on divorce were particularly scarce, so this event was ignored as the study continued.

Table 1: Birth patterns
birth of baby/brother/son/daughter/brother/sister
parents of baby/son/daughter/boy/girl/angel
arrival of baby/brother/son/daughter/sister/angel
just gave birth to baby/son/daughter/boy/girl
weigh/weighing #Number lbs/pounds
its a boy/girl
pregnant/c-section

Table 2: Marriage patterns
I'm/we are getting/sister/brother/mother married
friend/uncle/aunt is getting married
I/we/sister/brother/friend/uncle/aunt got married

Table 3: Death patterns
RIP mom/mama/dad/father/grandmother/brother/
RIP grandpa/grandfather/sister/friend
he/mom/mama/dad/father passed away
grandfather/grandpa/grandma passed away
brother/sister/friend passed away

The pool of tweets came from a collection of tweets from a mid-sized city in the US North East in 2013 as well as streaming tweets irrespective of location from early 2016. Roughly 18 million tweets were collected, including tweets for the three aforementioned categories of birth, death, and marriage. Only the text of the tweets was utilized for this study.
After inspecting the data, we formulated a set of lexical keywords, phrases and regular expressions to collect tweets by category. These reflected topical patterns, such as announcements of marriage or birth in the family, the weight of the newborn baby or whether it is a girl or a boy, or the passing of a friend or family member. Table 1 shows the patterns used to extract tweets about birth. Similarly, Table 2 shows the patterns for marriage, and Table 3 for death. We attempted to remove tweets about celebrities, TV shows, news stories, and jokes. After filtering, we selected and hand-annotated a set of 2000 tweets for each category. For comparison's sake we also randomly chose 2000 (unlabeled) tweets from the data, and call this the general category. Note that any tweet could be present in this category, including those from the first three categories.
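The pattern-based filtering described above can be illustrated with a small sketch. The regular expressions below are hypothetical and only loosely modeled on a few of the birth patterns in Table 1; the exact expressions used in the study are not given, and the optional determiner (e.g. "their") is an assumption added so that natural phrasings match.

```python
import re

# Hypothetical regexes loosely based on some birth patterns from Table 1.
# The optional determiner group is an assumption, not part of the original patterns.
RELATIVE = r"(?:baby|brother|son|daughter|sister)"
BIRTH_PATTERNS = [
    r"\bbirth of (?:(?:a|my|our|their|your|his|her) )?" + RELATIVE + r"\b",
    r"\bjust gave birth to (?:(?:a|my|our) )?(?:baby|son|daughter|boy|girl)\b",
    r"\bweigh(?:ing)? \d+ (?:lbs|pounds)\b",
    r"\bit'?s a (?:boy|girl)\b",
]
BIRTH_RE = re.compile("|".join(BIRTH_PATTERNS), re.IGNORECASE)


def matches_birth(tweet: str) -> bool:
    """Return True if the tweet text matches any of the sketched birth patterns."""
    return BIRTH_RE.search(tweet) is not None
```

A filter of this kind only selects candidate tweets; as noted above, the selected tweets were still filtered and hand-annotated.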
We replaced Twitter usernames with the token @USER, while URL links, retweets, and emoticons were replaced with the keywords URL, RT, and EMOT, respectively. We removed the pound signs from hashtags to make the text look more like general written language and to reduce the dictionary size of the word-based language models.
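A minimal sketch of this normalization step follows. The regular expressions are assumptions chosen for illustration (in particular, the emoticon pattern covers only a few common ASCII emoticons, and the retweet rule only normalizes a leading "rt" marker); the study's actual rules are not specified.

```python
import re

# All patterns below are illustrative assumptions, not the study's actual rules.
URL_RE = re.compile(r"https?://\S+")
USER_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
EMOT_RE = re.compile(r"(?<!\w)[:;=][-']?[)(DPp](?!\w)")
RT_RE = re.compile(r"^\s*rt\b", re.IGNORECASE)


def normalize(tweet: str) -> str:
    """Replace URLs, usernames, emoticons and retweet markers with keywords."""
    text = URL_RE.sub("URL", tweet)      # URLs first, since they may contain @ or #
    text = USER_RE.sub("@USER", text)    # usernames become a single token
    text = HASHTAG_RE.sub(r"\1", text)   # drop the pound sign, keep the word
    text = EMOT_RE.sub("EMOT", text)     # a few common ASCII emoticons
    text = RT_RE.sub("RT", text)         # normalize a leading retweet marker
    return text
```

URLs are replaced before usernames and hashtags so that characters inside a link are not misread as mentions or tags.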
For the character-based models, we performed the following further steps. We separated each character in the input data by a space and replaced the usual space characters with <space>. We considered the tags introduced in the earlier pre-processing phase (e.g., @USER) to be unique characters. On output, we replaced all space characters with the null string and replaced the space tag <space> with the space character.
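The character-level encoding and its inverse can be sketched as follows. The function names and the tag list are assumptions for illustration, and this sketch only recognizes a tag when it stands alone as a space-delimited word (so "@USER:" would be split into characters).

```python
# Tag names follow the pre-processing keywords; function names are hypothetical.
TAGS = ("@USER", "URL", "RT", "EMOT")
SPACE_TAG = "<space>"


def encode_chars(text: str) -> str:
    """Separate characters with spaces, keeping tags and <space> as single items."""
    tokens = []
    for index, word in enumerate(text.split(" ")):
        if index > 0:
            tokens.append(SPACE_TAG)   # original space character
        if word in TAGS:
            tokens.append(word)        # tag treated as one unique character
        else:
            tokens.extend(word)        # one token per character
    return " ".join(tokens)


def decode_chars(encoded: str) -> str:
    """Invert encode_chars: drop separator spaces, then restore real spaces."""
    return encoded.replace(" ", "").replace(SPACE_TAG, " ")
```

For instance, `encode_chars("RT @USER hi")` yields `RT <space> @USER <space> h i`, and `decode_chars` maps that string back to the original text.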
Tables 4 through 6 show samples of collected tweets.

Long-Short Term Memory
Recurrent neural networks (RNN) are popular models that have shown great potential in many natural language processing (NLP) tasks. LSTMs (Hochreiter and Schmidhuber, 1997; Graves and Schmidhuber, 2005) are a specific subset of RNNs that have been modified to be especially good at conditioning on both long and short term temporal sequences. LSTMs modify the standard design of neural networks in several ways: they eliminate the strict requirement that neurons only connect to other neurons in succeeding layers (adding recurrence), convert the standard neuron into a more complex memory cell, and add non-linear gating units which serve to govern the information flowing out of and recursively flowing back into the cell (Greff et al., 2015). The memory cell differentiates itself from a simple neuron by including the ability to remember its state over time; this coupled with gating units gives the LSTM the ability to recognize important long-term dependencies while simultaneously forgetting unimportant collocations. The LSTM we use here, as implemented by Karpathy (2015), modifies the original architecture by removing peephole connections. The intuitive understanding of the components in an LSTM memory block can be summarized as:
1. Input node: Also known as input modulation gate or new memory gate, takes the input and the past hidden state to summarize the new input in light of the past context from h_{t-1}.
2. Input gate: Also known as write gate, takes the input and the past hidden state to determine the importance of the current input as it affects the cell.
3. Forget gate: Also known as reset gate, takes the input and the past hidden state and gives the provision for the hidden layer to discard or forget the historical data.
4. Output gate: Takes the input and the past hidden state and determines what parts of the cell output c_t need to be present in the new hidden state h_t for the next timestep.
5. Memory cell: Takes advice from the forget gate and the gated input node to determine the usefulness of the previous memory c_{t-1} to produce the new memory c_t.
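One common way to write the five components listed above is the standard peephole-free LSTM formulation shown below. The weight matrices W and U, the biases b, and the symbol names g, i, f, o are notation assumed here for illustration; they do not appear in the original text, and the implementation used in the study may differ in detail.

```latex
\begin{align}
g_t &= \tanh\left(W_g x_t + U_g h_{t-1} + b_g\right) && \text{(input node)} \\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) && \text{(input gate)} \\
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) && \text{(forget gate)} \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) && \text{(output gate)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t && \text{(memory cell)} \\
h_t &= o_t \odot \tanh\left(c_t\right) && \text{(new hidden state)}
\end{align}
```

Here \sigma is the logistic sigmoid and \odot denotes element-wise multiplication.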
The functionality above describes only how a single LSTM memory block works, analogous to a single neuron in a regular neural network. To create an LSTM which learns, hundreds of these blocks are combined in a single layer (analogous to hundreds of nodes in a hidden layer), with the hidden output (h_t, c_t) of one block feeding into the input of another. Further complexity (and learning power) is added by including multiple layers of LSTM memory blocks. The final output of the LSTM memory blocks (or the input from one layer to the next) is provided by calculating y_t = W_y f(h_t), where W_y is an output weight matrix to learn and f(·) is an activation function which can vary depending on use case.
The input, x_t, to an LSTM memory block differs depending on implementation and use case. When using LSTMs for NLP, the input can be word or character-based. The LSTM used in this research (Karpathy, 2015) takes as input a vector representing an individual data item (character/word) and predicts the most probable data item given the current data item and the LSTM's previous states. Training, therefore, is done by taking an example sequence of data items, predicting the next data item using the current weights, calculating the difference between what was predicted and what should have been predicted, and backpropagating this difference to update the weights. All LSTM models were trained for 500 epochs with a sequence length of 50, where the sequence length is the length of time the LSTM cell is unrolled per iteration. Two LSTM layers were used to train the model on the input data. Each LSTM layer had 512 hidden nodes. Language generation can be performed after training, in which the LSTM is given either a starting sequence of data items (or it calculates the most probable sequence to start with), and then generates new data items based on its own predictions in previous time steps.
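The generation procedure described above can be sketched independently of any particular network. In the sketch below, `next_distribution` is a hypothetical stand-in for a trained model: it maps the items generated so far to a probability distribution over the next item. Sampling from that distribution (rather than always taking the most probable item) is an assumption made here.

```python
import random


def generate(next_distribution, seed_items, max_items, end_item, rng):
    """Extend seed_items one item at a time using the model's own predictions."""
    items = list(seed_items)
    while len(items) < max_items:
        dist = next_distribution(items)            # dict: item -> probability
        choices, weights = zip(*dist.items())
        nxt = rng.choices(choices, weights=weights, k=1)[0]
        if nxt == end_item:                        # stop at the end marker
            break
        items.append(nxt)
    return items


def toy_model(items):
    """Toy stand-in for a trained model: each item has one certain successor."""
    table = {"h": {"i": 1.0}, "i": {"!": 1.0}, "!": {"<eos>": 1.0}}
    return table[items[-1]]


sample = generate(toy_model, ["h"], 10, "<eos>", random.Random(0))
```

With the toy model every step is deterministic, so `sample` is `["h", "i", "!"]`; a trained character- or word-based model would return a full distribution over its vocabulary at each step.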

Standard N-gram Language Models
In order to demonstrate the particular utility of LSTMs for generating realistic tweets, the output of our character- and word-based LSTM methods was compared to that of standard n-gram backoff language models. Such models are widely used to model the probability of word sequences for many NLP applications, including machine translation, automatic speech recognition, and part-of-speech tagging. The SRI Language Modeling Toolkit (SRILM) was used to build 4-gram word- and character-based language models (Stolcke, 2002). Using these models, we then generated synthetic tweets using the OpenGRM Ngram library (Roark et al., 2012).
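To make the idea of n-gram generation concrete, the following is a simplified sketch that does not use SRILM or OpenGRM. It counts n-grams directly and, when a context was never seen in training, falls back to a shorter context; this crude fallback is an assumption made for brevity and is not the smoothed backoff estimation those toolkits perform. All names are hypothetical.

```python
import random
from collections import Counter, defaultdict

BOS, EOS = "<s>", "</s>"


def train_ngram_counts(sequences, order=4):
    """Count next-item frequencies for every context of length 0..order-1."""
    counts = defaultdict(Counter)
    for seq in sequences:
        padded = [BOS] * (order - 1) + list(seq) + [EOS]
        for i in range(order - 1, len(padded)):
            for n in range(order):
                context = tuple(padded[i - n:i])
                counts[context][padded[i]] += 1
    return counts


def sample_sequence(counts, order=4, max_len=140, rng=None):
    """Sample items until EOS or max_len, shortening unseen contexts."""
    rng = rng or random.Random()
    history = [BOS] * (order - 1)
    output = []
    while len(output) < max_len:
        for n in range(order - 1, -1, -1):
            context = tuple(history[len(history) - n:]) if n else ()
            if context in counts:      # longest context seen in training
                break
        items, weights = zip(*counts[context].items())
        nxt = rng.choices(items, weights=weights, k=1)[0]
        if nxt == EOS:
            break
        output.append(nxt)
        history.append(nxt)
    return output
```

Passing strings as sequences gives a character-based model, while passing lists of words (for example `tweet.split()`) gives a word-based one.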

Experimental Design
For each event category, we divided the dataset of 2000 tweets into 1800 training and 200 testing instances. We used the machine translation quality metric BLEU (Papineni et al., 2002) to measure the similarity between machine-generated tweets and the held-out test sets. For each model, we generated ten sets of 200 tweets. We calculated BLEU scores (without the brevity penalty) using the full 200-tweet test set as the reference for each candidate tweet and report the average of the BLEU scores of all ten sets of tweets generated by a given model.
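The scoring scheme can be sketched as below. This is one plausible reading of the description, not the evaluation script used in the study: n-gram matches are pooled over all candidates in a set, each candidate n-gram count is clipped by its maximum count in any single reference, the four n-gram precisions are combined by a geometric mean, and no brevity penalty or smoothing is applied.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_no_bp(candidates, references, max_n=4):
    """BLEU over a set of candidates, all sharing one reference set, without brevity penalty."""
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        # Maximum count of each n-gram in any single reference, used for clipping.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        matched = total = 0
        for cand in candidates:
            cand_counts = ngrams(cand, n)
            total += sum(cand_counts.values())
            matched += sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        if matched == 0:               # unsmoothed: any zero precision gives zero
            return 0.0
        log_precision_sum += math.log(matched / total)
    return math.exp(log_precision_sum / max_n)
```

Under this reading, the reported figure for a model would be the mean of `bleu_no_bp` over its ten generated sets, each scored against the same 200 held-out tweets.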
To gain further insight into the effectiveness of the machine-generated data, we asked human annotators to evaluate the generated tweets. We selected 800 tweets by randomly sampling: 400 human-generated tweets (100 from each category), and 400 machine-generated tweets. The 400 machine-generated tweets consisted of 25 tweets for each combination of model (LM-char, LM-word, LSTM-char, LSTM-word) and category (birth, marriage, death, general). For each tweet, the annotators indicated if they thought the tweet was generated by a human or machine, and they rated the quality of the tweet on the basis of syntax and semantics. Also, they indicated which topic category they thought the tweet belonged to.

Results
BLEU, a measure of n-gram precision widely used to evaluate machine translation output, was used to objectively evaluate the similarity between the human-generated tweets and the synthetic tweets produced by our models. Table 7 shows the BLEU scores for each combination of topic, model, and linguistic unit. The character-based LSTM models and the word-based LM models both perform very strongly, with each reporting the highest BLEU score in two of the four topics. We further note that the character-based LSTM always outperforms the word-based LSTM. Although it might be surprising that a character-based model would produce higher values for a word n-gram precision metric such as BLEU, we suspect this is due to the fact that the large feature space of the word-based model in combination with the relatively small number of training tweets (roughly 1800) is not optimal for learning an LSTM model.

Human evaluation
A randomized set of 800 tweets, both real and synthetic, from all four topic categories was submitted to a panel of annotators (co-authors). Each annotator was asked to decide whether the tweet was real (i.e., produced by a human) or synthetic (i.e., generated by one of the LSTM or n-gram language models). Each tweet was also rated in terms of its syntax and semantics on a five-point Likert scale. In addition, the annotators were asked to select the intended topic category (birth, death, marriage, or general) of the tweet. Figure 5 shows the ability of human annotators to accurately identify a tweet's topic. In general, the annotators were able to identify the topic of the human tweets, with the weakest performance in the general category. Identifying the intended topic of the synthetic tweets was more challenging for the annotators, but accuracy was quite high in all topics other than general. We note that the general category was not filtered to remove tweets that could have belonged to the other topics, which could explain this discrepancy. Figures 3 and 4 show the distribution of each annotator's syntax and semantics scores for each model. These boxplots show that there was significant variance in the annotators' evaluation of the syntactic and semantic quality of the tweets. We note, however, that the models yielding the highest BLEU scores, char-LSTM and word-LM, tended to receive more favorable scores for syntactic and semantic quality. The character-based LM model, whose BLEU scores were significantly lower than those of the other models, consistently received the most unsatisfactory evaluation of syntactic and semantic quality from all four annotators. It also seems that the LSTM models produce output that is more consistent in its semantic and syntactic quality, with smaller annotator-to-annotator variance than the LM models.
With regard to Figure 5, Annotators 1 and 2 rated 283 randomly selected tweets, while Annotators 3 and 4 rated all 800 tweets; with regard to Figures 3 and 4 and Table 8, all annotators rated 283 tweets. Annotators 1 and 2 have an academic background in linguistics, while the other two annotators do not have prior linguistic training, perhaps explaining why Annotators 1 and 2 generally were better able to identify the topic category. Annotators 1 and 2 tended to have similar distributions of semantic and syntactic quality scores across models, which again is likely related to their previous training in linguistics and linguistic annotation. Annotator 4 may have been less forgiving about non-standard language use in the human-composed tweets, while Annotator 3 was more tolerant of the syntax and semantics of machine-generated tweets.
Table 8 shows the percentage of instances in which a human annotator marked a synthetic tweet as human generated. Table 9 shows some of the tweets that were generated by language models but were identified by all four annotators as human generated. A few example tweets that were correctly identified by all four annotators as synthetic tweets are displayed in Table 10.

Tables 9 and 10: Sample synthetic tweets, each followed by the model that generated it
Congrats to @USER and her husband on the birth of their son Welcome to the Cyclone family, Eally Kinglan URL URL (Char LSTM Generated)
@USER congratulations on birth of your son,20 days,ago,URL (Word LM Generated)
@USER @USER @USER,looks like we're getting hitched in June URL (Word LM Generated)
Im getting married in 17 days death (Char LSTM Generated)
RT @USER rip grandma 2 8 16 (Word LM Generated)
RT @USER The new part prigials give birth to bely son Junt and I'm delined a hape proud (Char LSTM Generated)
I'm so sorry for your loss and world harry gotting to my funeral it was without URL (Word LM Generated)
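The kind of figure reported in Table 8 amounts to a per-model tally, sketched below. The data layout (pairs of model name and a boolean judgment) is an assumption made for illustration; the study's annotation format is not described.

```python
from collections import defaultdict


def percent_marked_human(judgments):
    """Percentage of synthetic tweets judged human, per generating model.

    judgments: iterable of (model_name, judged_human) pairs, one per
    annotator decision on a synthetic tweet (hypothetical layout).
    """
    totals = defaultdict(int)
    fooled = defaultdict(int)
    for model, judged_human in judgments:
        totals[model] += 1
        if judged_human:
            fooled[model] += 1
    return {model: 100.0 * fooled[model] / totals[model] for model in totals}
```

For example, two judgments for a model with one marked as human would give that model a value of 50.0.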

Conclusion
We have discussed generating synthetic data in the context of readers of scientific reports or software developers. In addition, one potential clinical application might be to apply this to patient transcripts so that they could be shown to other patients suffering from similar problems, e.g., for anonymized virtual group therapy. Such an approach might be especially useful in rural and developing regions, where clinical resources are sparse. Anonymization of data in research is often necessary to protect patient or user identity. This research explores data-driven models to generate realistic-looking discourse with the same statistical properties as a training corpus. Specifically, this research explores the synthetic generation of tweets, contrasting LM and LSTM models, character-based and word-based linguistic units, and the topic categories of birth, death, and marriage. Based on the results from objective BLEU scores and subjective human evaluation, the word-based LM and character-based LSTM models performed well, deceiving annotators 41 and 43 percent of the time on average into thinking a synthetic tweet was human generated. This research shows promising evidence that the synthetic generation of user data may be preferred to existing techniques of naive anonymization, which can potentially lead to user identification through a combination of demographic data mining and ancillary metadata.