Unsupervised Learning of Style-sensitive Word Vectors

This paper presents the first study aimed at capturing stylistic similarity between words in an unsupervised manner. We propose extending the continuous bag of words (CBOW) embedding model (Mikolov et al., 2013b) to learn style-sensitive word vectors using a wider context window, under the assumption that the style of all the words in an utterance is consistent. In addition, we introduce a novel task of predicting lexical stylistic similarity and create a benchmark dataset for this task. Our experiment with this dataset supports our assumption and demonstrates that the proposed extensions contribute to the acquisition of style-sensitive word embeddings.


Introduction
Analyzing and generating natural language texts requires capturing two important aspects of language: what is said and how it is said. In the literature, much more attention has been paid to what is said. Recently, however, capturing how it is said, such as stylistic variations, has also proven useful for natural language processing tasks such as classification, analysis, and generation (Pavlick and Tetreault, 2016; Wang et al., 2017). This paper studies the stylistic variations of words in the context of word representation learning. The lack of subjective or objective definitions is a major difficulty in studying style (Xu, 2017). Previous studies have attempted to define a selected aspect of the notion of style (e.g., politeness) (Mairesse and Walker, 2007; Pavlick and Nenkova, 2015; Flekova et al., 2016; Preotiuc-Pietro et al., 2016; Sennrich et al., 2016); however, it is not straightforward to create strict guidelines for identifying the stylistic profile of a given text. This difficulty hampers both the systematic evaluation of style-sensitive word representations and the learning of such representations in a supervised manner. In addition, another line of research aims at controlling style-sensitive utterance generation without defining the style dimensions (Li et al., 2016; Akama et al., 2017); however, this line of research considers style to be something associated with a given specific character, i.e., a persona, and does not aim to capture the stylistic variation space.
The contributions of this paper are three-fold. (1) We propose a novel architecture that acquires style-sensitive word vectors (Figure 1) in an unsupervised manner. (2) We construct a novel dataset for style, which consists of pairs of style-sensitive words, with each pair scored according to its stylistic similarity. (3) We demonstrate that our word vectors successfully capture the stylistic similarity between two words. In addition, our training script and dataset are available at https://jqk09a.github.io/style-sensitive-word-vectors/.

Style-sensitive Word Vector
The key idea is to extend the continuous bag of words (CBOW) (Mikolov et al., 2013a) by distinguishing nearby contexts and wider contexts under the assumption that a style persists throughout every single utterance in a dialog. We elaborate on this idea in this section.

Notation
Let $w_t$ denote the target word (token) in the corpora and $U_t = \{w_1, \dots, w_{t-1}, w_t, w_{t+1}, \dots, w_{|U_t|}\}$ denote the utterance (word sequence) including $w_t$. Here, $w_{t+d}$ or $w_{t-d} \in U_t$ is a context word of $w_t$ (e.g., $w_{t+1}$ is the context word next to $w_t$), where $d \in \mathbb{N}_{>0}$ is the distance between the context word and the target word $w_t$.
For each word (token) $w$, the boldface $\mathbf{v}_w$ and $\tilde{\mathbf{v}}_w$ denote the vector of $w$ and the vector predicting the word $w$, respectively. Let $\mathcal{V}$ denote the vocabulary.

Baseline Model (CBOW-NEAR-CTX)
First, we give an overview of CBOW, which is our baseline model. CBOW predicts the target word $w_t$ given nearby context words in a window of width $\delta$:
$$C^{\mathrm{near}}_{w_t} := \{ w_{t \pm d} \in U_t \mid 1 \le d \le \delta \}. \quad (1)$$
The set $C^{\mathrm{near}}_{w_t}$ contains at most $2\delta$ words in total, namely up to $\delta$ words to the left and up to $\delta$ words to the right of the target word. Specifically, we train the word vectors $\tilde{\mathbf{v}}_{w_t}$ and $\mathbf{v}_c$ ($c \in C^{\mathrm{near}}_{w_t}$) by maximizing the following prediction probability:
$$P(w_t \mid C^{\mathrm{near}}_{w_t}) \propto \exp\Big( \tilde{\mathbf{v}}_{w_t} \cdot \sum_{c \in C^{\mathrm{near}}_{w_t}} \mathbf{v}_c \Big). \quad (2)$$
CBOW captures both semantic and syntactic word similarity through training with nearby context words. We refer to this form of CBOW as CBOW-NEAR-CTX. Note that, in the implementation of Mikolov et al. (2013b), the window width $\delta$ is sampled from a uniform distribution; in this work, however, we fix $\delta$ for simplicity, and we turn off the random resizing of $\delta$ throughout our experiments.
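As a rough illustration of the CBOW prediction step described above, the following Python sketch scores every vocabulary word against a given context and normalizes the scores with a softmax. The function name `predict_proba`, the toy vectors, and the choice to sum (rather than average) the context vectors are assumptions made for illustration; the sketch also computes a full softmax and is not the authors' training code.

```python
import math

def predict_proba(context, in_vecs, out_vecs):
    """Softmax over dot products between each word's output vector
    and the sum of the input vectors of the context words."""
    dim = len(next(iter(in_vecs.values())))
    # Sum the input vectors v_c of all context words (an assumption; averaging is also common).
    hidden = [sum(in_vecs[c][k] for c in context) for k in range(dim)]
    scores = {w: sum(a * b for a, b in zip(v, hidden)) for w, v in out_vecs.items()}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {w: math.exp(s - top) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

# Toy example with hypothetical two-dimensional vectors.
in_vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
out_vecs = {"x": [1.0, 1.0], "y": [-1.0, -1.0]}
probs = predict_proba(["a", "b"], in_vecs, out_vecs)
```

In this toy example the word "x" receives the higher probability because its output vector points in the same direction as the summed context vector.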

Learning Style with Utterance-size Context Window (CBOW-ALL-CTX)
CBOW is designed to learn the semantic and syntactic aspects of words from their nearby context (Mikolov et al., 2013b). However, an interesting open question is where the stylistic aspects of words can be captured. To address this question, we start with the assumption that a style persists throughout each single utterance in a dialog, that is, the stylistic profile of a word in an utterance must be consistent with that of the other words in the same utterance. Based on this assumption, we propose extending CBOW to use all the words in an utterance as context, instead of only the nearby words. Namely, we expand the context window from a fixed width to the entire utterance:
$$C^{\mathrm{all}}_{w_t} := \{ w_{t \pm d} \in U_t \mid 1 \le d \}. \quad (3)$$
This training strategy is expected to yield word vectors that are more sensitive to style than to other aspects. We refer to this version as CBOW-ALL-CTX.

Learning the Style and Syntactic/Semantic Separately
To learn the stylistic aspect more exclusively, we further extended the learning strategy.

Distant-context Model (CBOW-DIST-CTX)
First, recall that using the nearby context is effective for learning word vectors that capture semantic and syntactic similarities. However, this means that using the nearby context can lead the word vectors to capture aspects other than style. Therefore, as the first extension, we propose excluding the nearby context $C^{\mathrm{near}}_{w_t}$ from the full context $C^{\mathrm{all}}_{w_t}$. In other words, we use the distant context words only:
$$C^{\mathrm{dist}}_{w_t} := C^{\mathrm{all}}_{w_t} \setminus C^{\mathrm{near}}_{w_t} = \{ w_{t \pm d} \in U_t \mid \delta < d \}. \quad (4)$$
We expect that training with this type of context will lead to word vectors containing only the style-sensitive information. We refer to this method as CBOW-DIST-CTX.
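The three kinds of context used by CBOW-NEAR-CTX, CBOW-ALL-CTX, and CBOW-DIST-CTX can be illustrated with a short Python sketch. The function name `context_sets` and the representation of an utterance as a list of tokens are assumptions made for this example.

```python
def context_sets(utterance, t, delta):
    """Return the near, all, and distant context words of utterance[t].

    near: words at distance 1..delta from the target
    all:  every other word in the utterance
    dist: words at distance greater than delta (all minus near)
    """
    near, everything, dist = [], [], []
    for i, word in enumerate(utterance):
        d = abs(i - t)
        if d == 0:
            continue  # the target word itself is not part of its context
        everything.append(word)
        if d <= delta:
            near.append(word)
        else:
            dist.append(word)
    return {"near": near, "all": everything, "dist": dist}

# Toy utterance of six tokens; the target is "c" at index 2.
sets = context_sets(["a", "b", "c", "d", "e", "f"], t=2, delta=1)
```

With a window width of 1, the near context of "c" is its two neighbors, and the distant context is the remainder of the utterance.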

Separate Subspace Model (CBOW-SEP-CTX)
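As the second extension to distill off aspects other than style, we use both nearby and all contexts ($C^{\mathrm{near}}_{w_t}$ and $C^{\mathrm{all}}_{w_t}$). As Figure 2 shows, both the vectors $\mathbf{v}_w$ and $\tilde{\mathbf{v}}_w$ of each word $w \in \mathcal{V}$ are divided into two vectors:
$$\mathbf{v}_w = \mathbf{x}_w \oplus \mathbf{y}_w, \qquad \tilde{\mathbf{v}}_w = \tilde{\mathbf{x}}_w \oplus \tilde{\mathbf{y}}_w, \quad (5)$$
where $\oplus$ denotes vector concatenation. Vectors $\mathbf{x}_w$ and $\tilde{\mathbf{x}}_w$ are the style-sensitive parts of $\mathbf{v}_w$ and $\tilde{\mathbf{v}}_w$, respectively. Vectors $\mathbf{y}_w$ and $\tilde{\mathbf{y}}_w$ are the syntactic/semantic-sensitive parts of $\mathbf{v}_w$ and $\tilde{\mathbf{v}}_w$, respectively. For training, when the context words are near the target word ($C^{\mathrm{near}}_{w_t}$), we update both the style-sensitive vectors ($\tilde{\mathbf{x}}_{w_t}$, $\mathbf{x}_c$) and the syntactic/semantic-sensitive vectors ($\tilde{\mathbf{y}}_{w_t}$, $\mathbf{y}_c$), i.e., $\tilde{\mathbf{v}}_{w_t}$ and $\mathbf{v}_c$. Conversely, when the context words are far from the target word ($C^{\mathrm{dist}}_{w_t}$), we update only the style-sensitive vectors ($\tilde{\mathbf{x}}_{w_t}$, $\mathbf{x}_c$). Formally, the prediction probabilities are calculated as follows:
$$P_1(w_t \mid C^{\mathrm{near}}_{w_t}) \propto \exp\Big( \tilde{\mathbf{v}}_{w_t} \cdot \sum_{c \in C^{\mathrm{near}}_{w_t}} \mathbf{v}_c \Big), \quad (6)$$
$$P_2(w_t \mid C^{\mathrm{dist}}_{w_t}) \propto \exp\Big( \tilde{\mathbf{x}}_{w_t} \cdot \sum_{c \in C^{\mathrm{dist}}_{w_t}} \mathbf{x}_c \Big). \quad (7)$$
During training, the two prediction probabilities (loss functions) are computed alternately, and the word vectors are updated accordingly. We refer to this method, which uses the two kinds of context separately, as CBOW-SEP-CTX.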
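To make the subspace separation concrete, the sketch below computes the unnormalized score of a target word for the two kinds of context. The function name `sep_ctx_score`, the convention that the first `style_dim` components form the style-sensitive part, and the summation over context vectors are all assumptions made for illustration; the sketch omits normalization and the gradient updates.

```python
def sep_ctx_score(target_out, context_in, style_dim, near):
    """Unnormalized log-score of a target word given context word vectors.

    target_out: output vector of the target word (style part followed by the rest)
    context_in: list of input vectors of the context words, same layout
    near=True  -> use the full vectors (style and syntactic/semantic parts)
    near=False -> use only the first style_dim components (style part)
    """
    if not near:
        # Distant context: only the style-sensitive subvectors take part,
        # so an update derived from this score leaves the remaining part untouched.
        target_out = target_out[:style_dim]
        context_in = [c[:style_dim] for c in context_in]
    summed = [sum(column) for column in zip(*context_in)]
    return sum(a * b for a, b in zip(target_out, summed))

# Hypothetical four-dimensional vectors: two style dimensions, two others.
target = [1.0, 2.0, 3.0, 4.0]
contexts = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
near_score = sep_ctx_score(target, contexts, style_dim=2, near=True)
dist_score = sep_ctx_score(target, contexts, style_dim=2, near=False)
```

Because the distant-context score depends only on the style-sensitive components, the syntactic/semantic components receive training signal from the near context alone.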

Experiments
We investigated which word vectors capture the stylistic, syntactic, and semantic similarities.

Settings
Training and Test Corpus. We collected Japanese fictional stories from the Web to construct the dataset. The dataset contains approximately 30M utterances of fictional characters. We split the data 99%-1% for training and testing. In Japanese, the function words at the end of a sentence often exhibit style (e.g., desu+wa, desu+ze 1); therefore, we used an existing lexicon of multi-word functional expressions (Miyazaki et al., 2015). Overall, the vocabulary size $|\mathcal{V}|$ was 100K.
Hyperparameters. We chose the dimensions of both the style-sensitive and the syntactic/semantic-sensitive vectors to be 300, and the dimensions of the baseline CBOWs were 300. The learning rate was adjusted individually for each part in $\{\mathbf{x}_w, \mathbf{y}_w, \tilde{\mathbf{x}}_w, \tilde{\mathbf{y}}_w\}$ such that "the product of the learning rate and the expectation of the number of updates" was a fixed constant. We ran the optimizer with its default settings from the implementation of Mikolov et al. (2013a). The training stopped after 10 epochs. We fixed the nearby window width to $\delta = 5$.
1 These words mean the verb "be" in English.

Data Construction
To verify that our models capture stylistic similarity, we evaluated our style-sensitive vector $\mathbf{x}_{w_t}$ against other word vectors on a novel artificial task of matching human stylistic similarity judgments. For this evaluation, we constructed a novel dataset with human judgments on the stylistic similarity between word pairs by performing the following two steps. First, we collected only style-sensitive words from the test corpus, because some words are strongly associated with stylistic aspects (Kinsui, 2003; Teshigawara and Kinsui, 2011) and, therefore, annotating random words for stylistic similarity is inefficient. We asked crowdsourced workers to select style-sensitive words in utterances. Specifically, we provided workers with a word-segmented utterance and asked them to pick the words that they expected to be altered in different situational contexts (e.g., characters, moods, purposes, and the background cultures of the speaker and listener). Then, we randomly sampled 1,000 word pairs from the selected words and asked 15 workers to rate each pair on a five-point scale (from −2: "The style of the pair is different" to +2: "The style of the pair is similar"), inspired by syntactic/semantic similarity datasets (Finkelstein et al., 2002; Gerz et al., 2016). Finally, we kept only the word pairs with clear worker agreement, i.e., those that more than 10 annotators rated with the same sign; the resulting set thus consists of random pairs of style-sensitive words with high annotator agreement. Consequently, we obtained 399 word pairs with similarity scores. To our knowledge, this is the first study to create an evaluation dataset for measuring lexical stylistic similarity.
In the task of selecting style-sensitive words, the pairwise inter-annotator agreement was moderate (Cohen's kappa $\kappa$ is 0.51). In the rating task, the pairwise inter-annotator agreement for two classes ({−2, −1} or {+1, +2}) was fair (Cohen's kappa $\kappa$ is 0.23). These statistics suggest that, at least in Japanese, native speakers share a sense of the style-sensitivity of words and of the stylistic similarity between style-sensitive words.
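The agreement filter in the final step can be sketched in a few lines of Python. The function name `has_clear_agreement` is hypothetical, and the treatment of a rating of 0 as having no sign is an assumption, since the text does not say how neutral ratings were counted.

```python
def has_clear_agreement(ratings, min_same_sign=11):
    """Return True if at least min_same_sign ratings share the same sign.

    ratings: integer ratings in {-2, -1, 0, +1, +2}, one per annotator.
    The default of 11 encodes "more than 10 annotators"; ratings of 0
    count toward neither sign (an assumption).
    """
    positive = sum(1 for r in ratings if r > 0)
    negative = sum(1 for r in ratings if r < 0)
    return max(positive, negative) >= min_same_sign

# Fifteen hypothetical ratings for one word pair.
kept = has_clear_agreement([2] * 11 + [-1] * 4)
dropped = has_clear_agreement([1] * 10 + [-1] * 5)
```

A pair with 11 positive ratings out of 15 is kept, whereas a pair with only 10 ratings of the same sign is dropped.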
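For reference, Cohen's kappa for a single pair of annotators can be computed as in the sketch below. The function name `cohens_kappa` is illustrative, and how the pairwise values were aggregated over annotator pairs (for example, by averaging) is not stated in the text, so the sketch covers one pair only.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators who labeled the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    # Undefined (division by zero) when both annotators always use one identical label.
    return (observed - expected) / (1 - expected)

# Two hypothetical annotators marking four words as style-sensitive (1) or not (0).
kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

Here the annotators agree on three of four items, while chance agreement is 0.5, which gives a kappa of 0.5.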

Stylistic Sensitivity
We used this evaluation dataset to compute the Spearman rank correlation ($\rho_{\mathrm{style}}$) between the cosine similarity scores of the learned word vectors, $\cos(\mathbf{v}_w, \mathbf{v}_{w'})$, and the human judgments. Table 1 shows the results on its left side. First, our proposed model, CBOW-ALL-CTX, outperformed the baseline CBOW-NEAR-CTX. Furthermore, CBOW-DIST-CTX and the $\mathbf{x}$ of CBOW-SEP-CTX demonstrated better correlations with the stylistic similarity judgments ($\rho_{\mathrm{style}} = 56.1$ and $51.3$, respectively). Even though the $\mathbf{x}$ of CBOW-SEP-CTX was trained with the same context window as CBOW-ALL-CTX, its style-sensitivity was boosted by introducing joint training with the near context. CBOW-DIST-CTX, which uses only the distant context, slightly outperforms CBOW-SEP-CTX. These results indicate the effectiveness of training with a wider context window.
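The evaluation metric can be sketched in plain Python as follows. The helper names `cosine`, `average_ranks`, and `spearman` are illustrative, and the use of average ranks for ties is an assumption; a library routine such as SciPy's `spearmanr` would normally be used instead.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors.
    Undefined (division by zero) if either input is constant."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical model similarities and human scores for three word pairs.
rho = spearman([0.9, 0.1, 0.5], [2.0, -1.5, 0.3])
```

In practice, `xs` would hold the cosine similarity of each word pair under a given model and `ys` the corresponding human scores.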

Syntactic and Semantic Evaluation
We further investigated the properties of each model using the following criteria: (1) the model's ability to capture the syntactic aspect was assessed through a part-of-speech (POS) prediction task, and (2) the model's ability to capture the semantic aspect was assessed by calculating the correlation with human judgments of semantic similarity.

Syntactic Sensitivity
First, we tested each model's ability to capture syntactic similarity by checking whether the POS of each word was the same as the POS of its neighboring words in the vector space. Specifically, we calculated SYNTAXACC@N, defined as follows:
$$\mathrm{SYNTAXACC@}N := \frac{1}{|\mathcal{V}|\,N} \sum_{w \in \mathcal{V}} \sum_{w' \in \mathcal{N}(w)} \mathbb{I}[\mathrm{POS}(w) = \mathrm{POS}(w')],$$
where $\mathbb{I}[\text{condition}] = 1$ if the condition is true and $\mathbb{I}[\text{condition}] = 0$ otherwise, the function $\mathrm{POS}(w)$ returns the actual POS tag of the word $w$, and $\mathcal{N}(w)$ denotes the set of the $N$ words $\{w'\}$ most similar to $w$ w.r.t. $\cos(\mathbf{v}_w, \mathbf{v}_{w'})$ in each vector space. Table 1 shows SYNTAXACC@N with $N = 5$ and $10$. For both values of $N$, CBOW-NEAR-CTX, CBOW-ALL-CTX, and the $\mathbf{y}$ (the syntactic/semantic part) of CBOW-SEP-CTX achieved similarly good results. Interestingly, even though the $\mathbf{x}$ of CBOW-SEP-CTX used the same context as CBOW-ALL-CTX, the syntactic sensitivity of $\mathbf{x}$ was suppressed. We speculate that the syntactic sensitivity was distilled off by the other part of the CBOW-SEP-CTX vector, i.e., $\mathbf{y}$, which was learned using only the near context and captured more syntactic information. In the next section, we analyze the different characteristics of $\mathbf{x}$ and $\mathbf{y}$ in CBOW-SEP-CTX.
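A brute-force Python sketch of SYNTAXACC@N follows. The function names, the dictionary-based inputs, and the exclusion of a word from its own neighbor set are assumptions made for illustration; the exhaustive neighbor search is only practical for small vocabularies.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def syntax_acc_at_n(vectors, pos_tags, n):
    """Fraction of top-n cosine neighbors that share the query word's POS tag,
    averaged over all words in the vocabulary."""
    words = list(vectors)
    hits = 0
    for w in words:
        # Rank every other word by cosine similarity to w and keep the top n.
        neighbors = sorted(
            (u for u in words if u != w),
            key=lambda u: cosine(vectors[w], vectors[u]),
            reverse=True,
        )[:n]
        hits += sum(pos_tags[u] == pos_tags[w] for u in neighbors)
    return hits / (len(words) * n)

# Hypothetical toy vocabulary: two verbs and two nouns.
vectors = {"run": [1.0, 0.0], "walk": [0.9, 0.1], "cat": [0.0, 1.0], "dog": [0.1, 0.9]}
pos_tags = {"run": "VERB", "walk": "VERB", "cat": "NOUN", "dog": "NOUN"}
acc_at_1 = syntax_acc_at_n(vectors, pos_tags, n=1)
acc_at_3 = syntax_acc_at_n(vectors, pos_tags, n=3)
```

In the toy vocabulary, each word's nearest neighbor has the same POS, so the score is 1 for N = 1; with N = 3 every other word is a neighbor and only one of the three shares the POS.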

Semantic and Topical Sensitivities
To test the model's ability to capture semantic similarity, we also measured correlations with the Japanese Word Similarity Dataset (JWSD) (Sakaizawa and Komachi, 2018), which consists of 4,000 Japanese word pairs annotated with semantic similarity scores by human workers. For each model, Table 1 shows the Spearman rank correlation ($\rho_{\mathrm{sem}}$) between the cosine similarity $\cos(\mathbf{v}_w, \mathbf{v}_{w'})$ and the human judgments on JWSD. CBOW-DIST-CTX has the lowest score ($\rho_{\mathrm{sem}} = 15.9$); surprisingly, however, the stylistic vector $\mathbf{x}_{w_t}$ has the highest score ($\rho_{\mathrm{sem}} = 28.9$), even though both vectors have a high $\rho_{\mathrm{style}}$. This result indicates that the proposed stylistic vector $\mathbf{x}_{w_t}$ captures not only stylistic similarity but also semantic similarity, contrary to our expectations (ideally, we want the stylistic vector to capture only the stylistic similarity). We speculate that this is because not only the style but also the topic is often consistent within a single utterance. For example, "サンタ (Santa Claus)" and "トナカイ (reindeer)" are topically related words that tend to appear in the same utterance. Therefore, the stylistic vectors $\{\mathbf{x}_w\}$, which use all the context words in an utterance, also capture topic relatedness. In addition, JWSD contains topic-related word pairs as well as synonym pairs; therefore, word vectors that capture topic similarity obtain a higher $\rho_{\mathrm{sem}}$. We will discuss this point in

Analysis of Trained Word Vectors
Finally, to further understand what types of features our CBOW-SEP-CTX model acquired, we show some words with their four most similar words in Table 2. For English readers, we also report a result for English, which additionally serves as an example of the performance of our model on another language. The left side of Table 2 shows the results for the stylistic vector $\mathbf{x}$. We found that the Japanese word "拙者 (I; classical)" is similar to "ござる (be; classical)" or words containing it (the second row of Table 2). This result looks reasonable, because words such as "拙者 (I; classical)" and "ござる (be; classical)" are typically used by Japanese samurai or ninja. We can see that the vectors captured the similarity of these words, which are stylistically consistent despite their syntactic and semantic differences. Conversely, the right side of the table (for the syntactic/semantic vector $\mathbf{y}$) shows that the word "拙者 (I; classical)" is similar to personal pronouns (e.g., "僕 (I; male, childish)"). We further confirmed that the top 15 similar words are also personal pronouns (they are not shown due to space limitations). These results indicate that the proposed CBOW-SEP-CTX model jointly learns two different types of lexical similarities, i.e., the stylistic and the syntactic/semantic similarities, in different parts of the vectors. However, our stylistic vector also captured topic similarity, as with "サンタ (Santa Claus)" and "トナカイ (reindeer)" (the fourth row of Table 2). Therefore, there is still room for improvement in capturing stylistic similarity.

Conclusions and Future Work
This paper presented the unsupervised learning of style-sensitive word vectors, which extends CBOW by distinguishing nearby contexts and wider contexts. We created a novel dataset for style, in which the stylistic similarity between word pairs was scored by humans. Our experiment demonstrated that our method leads word vectors to distinguish the stylistic aspect from other semantic or syntactic aspects. We also found that our training tends to conflate some styles with topics. A future direction is to address this issue by introducing further contexts, such as document-level or dialog-level context windows, in which the topics are often consistent but the styles are not.