The Users Who Say ‘Ni’: Audience Identification in Chinese-language Restaurant Reviews

We give an algorithm for disambiguating generic versus referential uses of second-person pronouns in restaurant reviews in Chinese. Reviews in this domain use the ‘you’ pronoun 你 either generically or to refer to shopkeepers, to readers, or to the writer in reported conversation. We first show that linguistic features of the local context (drawn from prior literature) help in disambiguation. We then show that document-level features (n-grams and document-level embeddings), not previously used in the referentiality literature, actually give the largest gain in performance, and suggest this is because pronouns in this domain exhibit ‘one-sense-per-discourse’. Our work highlights an important case of discourse effects on pronoun use, and may suggest practical implications for audience extraction and other sentiment tasks in online reviews.


Introduction and Task Description
Detecting whether a given entity is referential is an important question in computational discourse processing. Linguistic features in the local context of a given mention have been successfully used for determining whether a second-person pronoun (you) in dialogue is referential (Gupta et al., 2007b; Purver et al., 2009). The related task of anaphoricity detection is an important subtask of coreference resolution (Ng and Cardie, 2002; Ng, 2004; Luo, 2007; Zhou and Kong, 2009; Recasens et al., 2013).
In this paper we consider the task of audience identification in review texts, using restaurant reviews written in Chinese. Our task is to disambiguate a mention of the Chinese second-person pronoun 你 (ni, "you") into the following four labels that we found to occur commonly in reviews:

Generic
饮品只有雪碧和可乐，而且要点才拿给你
For drinks they only have Sprite and Coke, and you have to order before they'll give them to you.

Referential - Shop
这么好的服务下次还来你家哦
With such good service, I'll definitely come back to your shop next time!

Referential - Reader
不信你们去试你们会终身遗憾！
Go and try it if you don't believe me; your whole body will feel regret!

Referential - Writer / Self
店员说"你们就只要钵钵鸡？"
The shop employee said, "You only want the stone-bowl chicken?"

We aim to gain insight into the linguistics of narrative by distinguishing the types of discourse contexts in which different referential senses are found. Restaurant reviews provide an important new test case, and resolving who a reviewer wants to address could have important implications for coreference resolution or sentiment analysis of reviews, as well as downstream tasks like information extraction.

Related Work
A number of closely related earlier papers have focused on disambiguating 'you' in English. Gupta et al. (2007b) annotated the Switchboard corpus of telephone dialogue, showing that features based on specific lexical patterns, adjacent parts-of-speech, punctuation, and dialog acts are sufficient to achieve performance of 84.39% at the binary generic/referential prediction task. Gupta et al. (2007a) show that similar features generalize to addressee prediction for multi-party interactions significantly better than a simple baseline. Other work combines discourse features with acoustic and visual information for four-way interactions to resolve participant reference, and in the same setting Purver et al. (2009) employ cascaded classifiers that first establish referentiality and then attempt to resolve the referent. They show that utterance-level lexical features help, suggesting that different uses of 'you' are associated with distinct vocabularies. Reiter and Frank (2010) investigate the more general question of identifying genericity for noun phrases, showing the usefulness of linguistic features such as syntactic dependency relations. Similar local structural cues like phrase-structure positioning, head word identity, and distance to surrounding clauses have been used as features in machine learning approaches for anaphoricity detection as one stage in coreference resolution (Kong and Zhou, 2010; Zhou and Kong, 2011; Kong and Ng, 2013).
Prior work has also shown improvements in performance in the dialogue domain from incorporating features having to do with acoustic prosody, gaze, and head movements (Jovanović et al., 2006; Takemae and Ozawa, 2006; Gupta et al., 2007b). Of course, in the review domain we have no access to such information; as we'll see, however, we can exploit other unique properties of reviews to make up for this lack.

Data
We scraped reviews from dianping.com, a Chinese-language restaurant review site, from the ten cities with the most reviews. We randomly sampled 750 restaurants within each city and randomly sampled reviews of those restaurants.
We scraped 346,381 reviews, including all associated metadata (city, restaurant category, and cost) for each restaurant, as well as the provided ratings (service, taste, ambience, and overall stars) for each review. Of these reviews only 6,704 (less than 2%) have the second-person pronominal character ni, highlighting another particular interest of this task: explicit second-person pronominals are quite rare in Chinese, at least in this genre, making the reviews in which they appear linguistically marked.
Summary statistics for this dataset are given in Table 1. We release all our data and annotations at nlp.stanford.edu/robvoigt/nis.

Preprocessing
We apply the Stanford CRF Word Segmenter (Tseng et al., 2005) to segment the text of each review into words, and use simple heuristics based on whitespace and punctuation to extract sentences or sentence fragments. The Stanford Parser (Klein and Manning, 2003; Levy and Manning, 2003) is then run on each extracted sentence or fragment containing a ni to produce a dependency graph and set of part-of-speech (POS) tags for later use in feature extraction.

Annotation
We hand-annotated 701 examples of ni tokens (including both singular and plural cases), placing each into one of seven categories: generic, writer-referential, reader-referential, shop-referential, idiomatic, non-"you", and other. The idiomatic and non-"you" cases commonly consist of set phrases such as 你好 (nihao, "hello") or 迷你 (mini, "mini") and are therefore relatively trivial to filter; the "other" class is both rare and varied, including cases such as direct reference to prior review-writers.
We therefore only consider the generic and large-class referential cases, leaving us with 636 examples for our task; the distribution of annotated nis is shown in Table 2.
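The filtering of set phrases described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name and the toy sentence are hypothetical, and only the two phrases named in the text are included in the filter list, which in practice would presumably be longer.

```python
# Set phrases that contain the character 你 but are not pronoun uses.
# Only these two examples come from the text; a fuller list is assumed to exist.
NON_PRONOUN_WORDS = {"你好", "迷你"}


def candidate_ni_tokens(segmented_words):
    """Return (index, word) pairs for words containing 你 that are not set phrases."""
    return [
        (i, word)
        for i, word in enumerate(segmented_words)
        if "你" in word and word not in NON_PRONOUN_WORDS
    ]


# Toy word-segmented sentence (hypothetical, for illustration only).
words = ["你好", "老板", "，", "你们", "的", "迷你", "蛋糕", "很", "好吃"]
kept = candidate_ni_tokens(words)
```

Here only the plural pronoun 你们 survives the filter; the greeting and the "mini" compound are dropped before annotation-style labels would be assigned.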
The approximately half-and-half split between generic and referential tokens is surprisingly similar to that found by studies on English dialogue like Gupta et al. (2007b), in spite of the large divergence in language and genre.
We also found an unexpected word-sense property of second-person pronouns in this genre: of the 122 annotated reviews which contain more than one ni, 83.6% use ni with the same sense in each occurrence in the review, recalling the one-sense-per-discourse hypothesis of Gale et al. (1992). Finding that this discourse property, normally predicated of word sense in common nouns, occurs in pronouns suggests the use of features of the entire discourse in this task.
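A statistic of this kind can be computed with a few lines of code. The sketch below assumes annotations are available as (review id, label) pairs, one per ni token; that input format, the function name, and the toy data are all assumptions made for illustration and do not reflect the format of the released annotations.

```python
from collections import defaultdict


def sense_consistency(annotations):
    """Return (n_multi, fraction) where n_multi is the number of reviews with
    more than one ni token and fraction is the share of those reviews in
    which every ni token carries the same label."""
    labels_by_review = defaultdict(list)
    for review_id, label in annotations:
        labels_by_review[review_id].append(label)

    multi = [labels for labels in labels_by_review.values() if len(labels) > 1]
    if not multi:
        return 0, 0.0
    consistent = sum(1 for labels in multi if len(set(labels)) == 1)
    return len(multi), consistent / len(multi)


# Toy data with hypothetical review ids and labels.
toy = [
    ("r1", "generic"), ("r1", "generic"),
    ("r2", "shop"), ("r2", "reader"),
    ("r3", "generic"),
    ("r4", "shop"), ("r4", "shop"), ("r4", "shop"),
]
n_multi, fraction = sense_consistency(toy)
```

In the toy data three reviews contain more than one ni, and two of them use a single sense throughout.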

Features
We consider two primary types of features: "local" and "discourse".

Local Features
"Local" features model textual and linguistic properties of the immediate context of a given ni mention, and were drawn from the large literature on referentiality, anaphoricity, and singleton detection:
Word Identity This feature simply encodes the word-segmented identity of the word in which the current ni token is found, capturing cases such as the second-person plural 你们 (nimen, "you [plural]").
Adjacent POS Tags Following Gupta et al. (2007b), we include POS tag features for the single words immediately following and preceding the ni token.
Dependencies We include binary features for the presence or absence of lexicalized dependency relations in which the given ni participates. As an example, for the phrase 你要推销菜 ("if you want to sell dishes"), we extract a feature for NSUBJ(推销, 你), meaning that you is the subject of the verb sell.
Lexical Context This feature set fires binary features for the presence or absence of words in the vocabulary within a three-word window on either side of the given ni token.
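The four local feature sets above could be extracted roughly as follows. This is a sketch under stated assumptions: the input format (parallel word and POS lists plus dependency triples of relation, head index, dependent index), the feature-name strings, the boundary sentinels, and the POS tags in the example are illustrative choices, not details given in the paper.

```python
def local_features(words, pos_tags, dependencies, ni_index, window=3):
    """Build a sparse dict of binary local features for the ni token at ni_index.

    dependencies is a list of (relation, head_index, dependent_index) triples.
    """
    feats = {}

    # Word identity: the segmented word containing ni (e.g. 你 vs. 你们).
    feats["word=" + words[ni_index]] = 1

    # Adjacent POS tags, with sentinel values at the sentence boundaries.
    prev_pos = pos_tags[ni_index - 1] if ni_index > 0 else "<S>"
    next_pos = pos_tags[ni_index + 1] if ni_index + 1 < len(words) else "</S>"
    feats["prev_pos=" + prev_pos] = 1
    feats["next_pos=" + next_pos] = 1

    # Dependencies: lexicalized relations in which the ni token participates.
    for rel, head, dep in dependencies:
        if ni_index in (head, dep):
            feats["dep=%s(%s, %s)" % (rel.upper(), words[head], words[dep])] = 1

    # Lexical context: words within `window` positions on either side.
    lo = max(0, ni_index - window)
    hi = min(len(words), ni_index + window + 1)
    for j in range(lo, hi):
        if j != ni_index:
            feats["ctx=" + words[j]] = 1
    return feats


# The example phrase from the text; the POS tags here are illustrative.
words = ["你", "要", "推销", "菜"]
pos_tags = ["PN", "VV", "VV", "NN"]
dependencies = [("nsubj", 2, 0), ("dobj", 2, 3)]
feats = local_features(words, pos_tags, dependencies, ni_index=0)
```

A sparse dictionary is used because each example activates only a handful of features out of a large vocabulary; such dictionaries can be handed to a vectorizer before classification.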

Discourse Features
The "discourse" category considers features that characterize the entire review. The intuition is that the classic one-sense-per-discourse property is likely to hold within a given review, so features of the review's entire text should be relevant for prediction. This is a novel contribution of this work: we propose that in certain contexts (such as reviews), referentiality resolution can be interpreted in part as a text classification task.
Review N-grams These are binary features for the presence or absence of n-grams in the entire text of the review. We found that using a larger n than 1 caused overfitting on our relatively small dataset and reduced performance; therefore, results are reported using unigram features.
Review Vector Embedding To see if we can induce higher-level representations of the review text than simply binary n-gram features, we also train a document-level distributed vector representation (Le and Mikolov, 2014) on the entire corpus of reviews using the "doc2vec" implementation in GENSIM (Řehůřek and Sojka, 2010), and include 200 vector features per review: a 100-dimensional embedding learned on the entire document, as well as a 100-dimensional average embedding calculated by averaging the vectors for each word in the document. In experiments we found using both the document and the average vectors combined resulted in higher performance than either alone, so we report results in this setting.
Metadata In addition to discourse features, we also included features that encode the category, city, and estimated cost for each restaurant, as well as the service, taste, environment, and overall star rank ratings associated with a given review on a 5-point scale.
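The two review-level representations can be illustrated with plain Python. This sketch does not call GENSIM; it assumes the document vector and the per-word vectors have already been trained elsewhere and are available as lists of floats. The function names and the two-dimensional toy vectors are hypothetical (the paper uses 100 dimensions for each half).

```python
def unigram_features(review_words):
    """Binary presence features for every distinct unigram in the review."""
    return {"uni=" + word: 1 for word in set(review_words)}


def combined_embedding(doc_vector, word_vectors):
    """Concatenate a document vector with the mean of the review's word vectors."""
    if not word_vectors:
        raise ValueError("need at least one word vector")
    dim = len(word_vectors[0])
    mean = [sum(vec[k] for vec in word_vectors) / len(word_vectors)
            for k in range(dim)]
    return list(doc_vector) + mean


# Toy inputs: a repeated word yields a single binary feature.
unigrams = unigram_features(["服务", "很", "好", "很", "好"])

# Toy 2-dimensional vectors standing in for the 100-dimensional ones.
doc_vector = [0.5, -1.0]
word_vectors = [[1.0, 2.0], [3.0, 4.0]]
review_vector = combined_embedding(doc_vector, word_vectors)
```

With 100-dimensional inputs the concatenation yields the 200 vector features per review described above.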

Experiments
We tested the effectiveness of these features at predicting genericity and reference for each ni token with multinomial logistic regression, as implemented in SCIKIT-LEARN (Pedregosa et al., 2011). We used two classification settings: a binary prediction of whether a given ni is referential or not, and a four-way prediction including distinctions between the three annotated referential targets. The results for each task are shown in Table 3.
Table 3: Average ten-fold cross-validation classification accuracy for different feature sets on two tasks. "Local" refers to all feature sets described in Section 4.1. BINARY distinguishes generic and referential ni, FOUR-WAY distinguishes between generic and three referential senses.
In each case, we compare the performance of all local and discourse features, as well as several relevant subsets. One question we aim to address is whether our discourse-level n-gram and embedding features contribute similar information, so we test them both separately and together. We compare our results to a baseline of choosing the most common class for either task. We train and test models with ten-fold cross-validation. In each fold, we use 80% of the data for training, 10% for development, and 10% for testing. For each feature set, we set the L2 regularization strength as a hyperparameter based on average cross-validation accuracy on the development data in each fold. All reported results are average cross-validation accuracy at that regularization strength on the test set in each fold.
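The evaluation protocol can be sketched in outline. The text does not say how development and test portions are assigned within each fold, so the rotation used here (part k for testing, the next part for development) is an assumption, as are the function names and the toy accuracy values. The classifier itself is omitted.

```python
import random
from collections import Counter


def ten_fold_splits(n_examples, n_folds=10, seed=0):
    """Yield (train, dev, test) index lists, roughly 80/10/10 for ten folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    parts = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        dev_k = (k + 1) % n_folds  # assumed rotation, not stated in the text
        train = [i for j, part in enumerate(parts)
                 if j not in (k, dev_k) for i in part]
        yield train, parts[dev_k], parts[k]


def select_strength(dev_accuracies):
    """Pick the regularization strength with the highest mean dev accuracy.

    dev_accuracies maps each candidate strength to its per-fold dev accuracies.
    """
    return max(dev_accuracies,
               key=lambda c: sum(dev_accuracies[c]) / len(dev_accuracies[c]))


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(1 for y in test_labels if y == majority) / len(test_labels)


splits = list(ten_fold_splits(636))
# Hypothetical dev accuracies for two candidate strengths over two folds.
best = select_strength({0.1: [0.70, 0.80], 1.0: [0.90, 0.80]})
baseline = majority_baseline_accuracy(["g", "g", "r"], ["g", "r", "g", "g"])
```

Selecting one strength from the averaged development accuracy, rather than one per fold, matches the description of reporting test accuracy "at that regularization strength".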

Meta-analysis
To better understand the effectiveness of each feature set for this task, we perform a full ablation study by training a classifier on all 127 ($2^7 - 1$, ignoring the empty set) possible combinations of our 7 feature sets, and run a linear regression predicting the classification score from the feature sets used. This allows us to obtain estimates of the effect size and statistical significance for each set of features with reference to all the others. These results are shown in Table 4.
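The enumeration behind this ablation can be sketched as follows. The feature-set identifiers are names chosen here for the seven sets described earlier, and the helper functions are hypothetical. Training the 127 classifiers and fitting the regression are omitted; the sketch only builds the binary indicator rows that would serve as regression inputs.

```python
from itertools import combinations

# Identifiers chosen for illustration; they correspond to the seven feature sets.
FEATURE_SETS = [
    "word_identity", "adjacent_pos", "dependencies", "lexical_context",
    "review_ngrams", "review_embedding", "metadata",
]


def all_nonempty_subsets(items):
    """Yield every non-empty combination of items, smallest first."""
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield combo


def indicator_row(subset, items):
    """One regression input row: 1 if the feature set is used, else 0."""
    return [1 if name in subset else 0 for name in items]


subsets = list(all_nonempty_subsets(FEATURE_SETS))
design = [indicator_row(subset, FEATURE_SETS) for subset in subsets]
```

Each row of `design` would be paired with the accuracy of the classifier trained on that combination, so that each regression coefficient estimates the contribution of one feature set while controlling for the others.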
[...] impact classification performance. Simple word identity features alone already provide surprising performance: the classifier learns that the singular ni is more likely to be generic while the plural 你们 often refers to people affiliated with the shop.
While local features alone achieve respectable performance (78.44% for binary genericity detection and 72.19% for four-way classification), we show that in the review context significant gains can be made from using a combination of local and discourse-level features, exploiting discourse-level indicators of referentiality and the fact that a one-sense-per-discourse assumption tends to hold with regard to the use of ni.
Analysis of learned feature weights in our highest-performing model also provides some interesting social insights. Reviews with a high overall star rank were more likely to use generic ni, and reviewers who thought highly of the restaurant's service as indicated by their quality-of-service rating were more likely to use reader-directed referential ni.
Reviews with shop-directed referential ni were likely to use emotive sentence-final particles like 啊 (a), exclamation points, and question marks, just as question marks were among the strongest indicators of referential uses in the English "you"s in Gupta et al. (2007b). We also found that other pronouns like 我 (wo, "I") and 我们 (women, "we"), as well as words of temporal sequencing such as 第一 (diyi, "the first"), 又 (you, "again"), and 次 (ci, "[one] time"), receive high weights for referential classes.
Combined with the observation that reviews containing ni simply tend to be much longer than those without (see Table 1), these results suggest a link to the narrative work of Jurafsky et al. (2014), who characterize negative reviews as narrative expositions of an individual bad experience.
For example, consider the following review containing a referential ni:
"The food and quantity was fine. The ambience need not be mentioned. But in spite of having been a bit late for lunch, we wouldn't have imagined you'd first turn off the lights, and then turn off the air conditioner. I'd like to ask: saving money on electricity like this, do you mean to imply that there's no need for us to pay for our meal?"
While the immediate context suggests a referential interpretation (想问下你们省电了, literally "want to ask you [plural], saving electricity"), it is only when this mention is connected to elements of the entire discourse (the sequence of events, the first-person pronouns) that it becomes completely clear first that the mention is referential and second that it refers to the shop owner.
Furthermore, we found that when combined with local features, features derived from distributed representations of each document perform at least as well for this task as document-level n-grams, but at a much lower dimensionality. This suggests that these embeddings do successfully encode the information necessary to reproduce document-level distinctions in discourse types, such as between the personal narratives that often surround referential uses of ni and the abstract descriptions of generic uses.
Our meta-analysis shows that more linguistically motivated local features such as POS tags and dependency relations are substantially overshadowed in effectiveness by lexical and discourse features, although this may be due in part to reduced performance of these automatic taggers on the more colloquial language in online reviews.
Finally, this work challenges prior claims that spoken language is "more complex" than other genres with regard to referentiality. On the contrary: whereas in a spoken discourse the potential addressees are by default the participants, web texts such as the reviews studied here have no such default, and may include complex, creative, and domain-specific deictic reference that can be important for computational systems to address.