SubjQA: A Dataset for Subjectivity and Review Comprehension

Subjectivity is the expression of internal opinions or beliefs which cannot be objectively observed or verified, and has been shown to be important for sentiment analysis and word-sense disambiguation. Furthermore, subjectivity is an important aspect of user-generated data. In spite of this, subjectivity has not been investigated in contexts where such data is widespread, such as in question answering (QA). We therefore investigate the relationship between subjectivity and QA, while developing a new dataset. We compare and contrast with analyses from previous work, and verify that findings regarding subjectivity still hold when using recently developed NLP architectures. We find that subjectivity is also an important feature in the case of QA, albeit with more intricate interactions between subjectivity and QA performance. For instance, a subjective question may or may not be associated with a subjective answer. We release an English QA dataset (SubjQA) based on customer reviews, containing subjectivity annotations for questions and answer spans across 6 distinct domains.


Introduction
Subjectivity is ubiquitous in our use of language (Banfield, 1982; Quirk et al., 1985; Wiebe et al., 1999; Benamara et al., 2017), and is therefore an important aspect to consider in Natural Language Processing (NLP). For example, subjectivity can be associated with different senses of the same word. BOILING is objective in the context of hot water, but subjective in the context of a person boiling with anger (Wiebe and Mihalcea, 2006). The same applies to sentences in discourse contexts (Pang and Lee, 2004). While early work has shown subjectivity to be an important feature for low-level tasks such as word-sense disambiguation and sentiment analysis, subjectivity in NLP has not been explored in many contexts where it is prevalent.
* JB and NB contributed equally to this work.
In recent years, there has been renewed interest in areas of NLP for which subjectivity is important, and a specific topic of interest is question answering (QA). This includes work on aspect extraction (Poria et al., 2016), opinion mining (Sun et al., 2017) and community question answering (Gupta et al., 2019). Many of these QA systems are based on representation learning architectures. However, it is unclear whether findings of previous work on subjectivity still apply to such architectures, including transformer-based language models (Devlin et al., 2018; Radford et al., 2019).
The interactions between QA and subjectivity are even more relevant today as users' natural search criteria have become more subjective. Their questions can often be answered by online customer reviews, which tend to be highly subjective as well. Although QA over customer reviews has gained traction recently with the availability of new datasets and architectures (Gupta et al., 2019; Grail and Perez, 2018; Fan et al., 2019; Xu et al., 2019b), these are agnostic with respect to how subjectivity is expressed in the questions and the reviews. Furthermore, the datasets are either too small (< 2000 questions) or have target-specific question types (e.g., yes-no). Consequently, most QA systems are only trained to find answers from factual data, such as Wikipedia articles and News (Rajpurkar et al., 2018; Reddy et al., 2019; Joshi et al., 2017; Trischler et al., 2017).
In this work, we investigate the relation between subjectivity and question answering (QA) in the context of customer reviews. As no such QA dataset exists, we construct a new dataset, SUBJQA. In order to capture subjectivity, our data collection method builds on recent developments in opinion extraction and matrix factorization, instead of relying on the linguistic similarity between the questions and the reviews (Gupta et al., 2019). SUBJQA includes over 10,000 English examples spanning 6 domains that cover both products and services. A large proportion of both the questions and the answers in SUBJQA are subjective: 73% of the questions and 74% of the answers. Experiments show that existing QA systems trained to find factual answers struggle to understand subjective questions and reviews. For instance, fine-tuning BERT (Devlin et al., 2018), a state-of-the-art QA model, yields 92.9% F1 on SQuAD (Rajpurkar et al., 2016), but only achieves an average score of 74.1% F1 across the different domains of SUBJQA.
We develop a subjectivity-aware QA model by extending an existing model in a multi-task learning paradigm. The model is trained to predict the subjectivity label and answer span simultaneously, and does not require subjectivity labels at test time.
Our QA model achieves 76.3% F1 on average across the different domains of SUBJQA.

Contributions
• We release a challenging QA dataset with subjectivity labels for questions and answers, spanning 6 domains;
• We investigate the relationship between subjectivity and a modern NLP task;
• We develop a subjectivity-aware QA model;
• We verify the findings of previous work on subjectivity, using recent NLP architectures.

Subjectivity
Written text, as an expression of language, contains information on several linguistic levels, many of which have been thoroughly explored in NLP. For instance, both the semantic content of text and the (surface) forms of words and sentences, as expressed through syntax and morphology, have been at the core of the field for decades. However, another level of information can be found when trying to observe or encode the so-called private states of the writer (Quirk et al., 1985). Examples of private states include the opinions and beliefs of a writer, and can concretely be said to not be available for verification or objective observation. It is this type of state which is referred to as subjectivity (Banfield, 1982; Banea et al., 2011).
Figure 1: Our data collection pipeline
Whereas subjectivity has been investigated in isolation, it can be argued that subjectivity is only meaningful given sufficient context. In spite of this, most previous work has focused on annotating words (Heise, 2001), word senses (Durkin and Manning, 1989;Wiebe and Mihalcea, 2006), or sentences (Pang and Lee, 2004), with the notable exception of Wiebe et al. (2005), who investigate subjectivity in phrases in the context of a text or conversation. The absence of work investigating broader contexts can perhaps be attributed to the relatively recent emergence of models in NLP which allow for contexts to be incorporated efficiently, e.g. via architectures based on transformers (Vaswani et al., 2017).
As subjectivity relies heavily on context, and we have access to methods which can encode such context, what then of access to data which encodes subjectivity? We argue that in order to fully investigate research questions dealing with subjectivity in contexts, a large-scale dataset is needed. We choose to frame this as a QA dataset, as it not only offers the potential to investigate interactions in a single contiguous document, but also allows interactions between contexts, where parts may be subjective and other parts may be objective. Concretely, one might seek to investigate the interactions between an objective question and a subjective answer.

Data Collection
We found two limitations of existing datasets and collection strategies that motivated us to create a new QA dataset to understand subjectivity in QA.
First, data collection methods (Gupta et al., 2019; Xu et al., 2019b) often rely on the linguistic similarity between the questions and the reviews (e.g. information retrieval). However, subjective questions may not always use the same words/phrases as the review. Consider the examples below. The answer span 'vegan dishes' is semantically similar to the question Q1. The answer to the more subjective question Q2 has little linguistic similarity to the question.
Example 1
Q1: Is the restaurant vegan friendly?
Review: ...many vegan dishes on its menu.
Q2: Does the restaurant have a romantic vibe?
Review: Amazing selection of wines, perfect for a date night.
Secondly, existing review-based datasets are small and not very diverse in terms of question topics and types (Xu et al., 2019a; Gupta et al., 2019). We therefore consider reviews about both products and services from 6 different domains, namely TripAdvisor, Restaurants, Movies, Books, Electronics and Grocery. We use the data of Wang et al. (2010) for TripAdvisor, and Yelp 2 data for Restaurants. We use the subsets for which an open-source opinion extractor was available. For the remaining domains, we use the data of McAuley and Yang (2016), which contains reviews from product pages of Amazon.com spanning multiple categories. We target categories that had more opinion expressions than others, as determined by an opinion extractor.
Figure 1 depicts our data collection pipeline, which builds upon recent developments in opinion extraction and matrix factorization. An opinion extractor is crucial for identifying subjective or opinionated expressions, which IR-based methods cannot do. Matrix factorization, in turn, helps identify which of these expressions are related based on their co-occurrence in the review corpora, instead of their linguistic similarities. To the best of our knowledge, we are the first to explore such a method to construct a challenging subjective QA dataset.
Given a review corpus, we extract opinions about various aspects of the items being reviewed (Opinion Extraction). Consider the following review snippets and extractions.
Example 2
Review: ..character development was quite impressive.
e1: ‹'impressive', 'character development'›
In the next (Neighborhood Model Construction) step, we characterize the items being reviewed and their subjective extractions using latent features between two items. In particular, we use matrix factorization techniques (Riedel et al., 2013) to construct a neighborhood model N via a set of weights w_{e,e'}, where each corresponds to a directed association strength between extractions e and e'. For instance, e1 and e2 in Example 2 could have a similarity score 0.93. This neighborhood model forms the core of data collection. We select a subset of extractions from N as topics (Topic Selection) and ask crowd workers to translate them to natural language questions (Question Generation). For each topic, a subset of its neighbors from N and reviews which mention them are selected (Review Selection). In this manner, question-review pairs are generated based on the neighborhood model.
2 https://www.yelp.com/dataset
Finally, we present each question-review pair to crowdworkers who highlight an answer span in the review. Additionally, they provide subjectivity scores for both the questions and the answer span.

Opinion Extraction
An opinion extractor processes all reviews and finds extractions ‹X,Y› where X represents an opinion expressed on aspect Y. Table 1 shows sample extractions from different domains. We use OpineDB, a state-of-the-art opinion extractor, for restaurants and hotels. For other domains where OpineDB was not available, we use the syntactic extraction patterns of Abbasi Moghaddam (2013).

Neighborhood Model Construction
We rely on matrix factorization to learn dense representations for items and extractions, and identify similar extractions. As depicted in Figure 2, we organize the extractions into a matrix M where each row i corresponds to an item being reviewed and each column corresponds to an extraction. For each extraction, we find its neighbors based on the cosine similarity of their embeddings. 3
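As a minimal sketch of the neighbor lookup described above, the following computes cosine similarity between extraction embeddings and keeps the top-k neighbors of each extraction. The function names and the toy embedding values are hypothetical and chosen only for illustration; the actual embeddings come from the factorization step.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_neighbors(embeddings, k=10):
    """Map each extraction to its k most similar other extractions."""
    neighbors = {}
    for name, vec in embeddings.items():
        scored = [(other, cosine(vec, other_vec))
                  for other, other_vec in embeddings.items() if other != name]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        neighbors[name] = scored[:k]
    return neighbors

# Toy embeddings (hypothetical values, not taken from the dataset).
toy = {
    ("impressive", "character development"): [0.9, 0.1, 0.3],
    ("good", "writing"): [0.8, 0.2, 0.4],
    ("fast", "shipping"): [0.1, 0.9, 0.0],
}
model = top_neighbors(toy, k=2)
```

Here `model` maps each extraction to a ranked list of `(neighbor, similarity)` pairs, which is one plausible in-memory form of the neighborhood model N.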

Topic and Review Selection
We next identify a subset of extractions to be used as topics for the questions. In order to maximize the diversity and difficulty in the dataset, we use the following criteria, developed iteratively based on manual inspection followed by user experiments.
1. Cosine Similarity: We prune neighbors of an extraction which have low cosine similarity (< 0.8). Irrelevant neighbors can lead to noisy topic-review pairs which would be marked non-answerable by the annotators.
2. Semantic Similarity: We prune neighbors that are linguistically similar (> 0.975 similarity 4 ) as they yield easy topic-review pairs.
3. Diversity: To promote diversity in topics and reviews, we select extractions which have many (> 5) neighbors.
4. Frequency: To ensure selected topics are also popular, we select a topic if: a) its frequency is higher than the median frequency of all extractions, and b) it has at least one neighbor that is more frequent than the topic itself.
We pair each topic with reviews that mention one of its neighbors. The key benefit of a factorization-based method is that it is not only based on linguistic similarity, and forces a QA system to understand subjectivity in questions and reviews.
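The four criteria above can be sketched as a filter over the neighborhood model. This is an illustrative sketch only: the data layout, the function names and the toy values are hypothetical, and we assume the diversity count (criterion 3) is taken after pruning, which the text does not specify.

```python
from statistics import median

def prune_neighbors(neighbors, min_cosine=0.8, max_semantic=0.975):
    """Apply criteria 1 and 2 to a list of (extraction, cosine_sim, semantic_sim)."""
    return [n for n in neighbors
            if n[1] >= min_cosine and n[2] <= max_semantic]

def select_topics(neighborhood, frequency, min_neighbors=5):
    """Return {topic: kept_neighbors} for extractions passing criteria 3 and 4."""
    med = median(frequency.values())
    topics = {}
    for extraction, neighbors in neighborhood.items():
        kept = prune_neighbors(neighbors)
        # Criterion 3 (diversity): more than `min_neighbors` neighbors.
        # Assumption: neighbors are counted after pruning.
        if len(kept) <= min_neighbors:
            continue
        # Criterion 4a: the topic is more frequent than the median extraction.
        if frequency[extraction] <= med:
            continue
        # Criterion 4b: at least one neighbor is more frequent than the topic.
        if not any(frequency[n[0]] > frequency[extraction] for n in kept):
            continue
        topics[extraction] = kept
    return topics

# Toy neighborhood with hypothetical extractions and scores.
neighborhood = {
    "good writing": [("n%d" % i, 0.9, 0.5) for i in range(6)]
                    + [("weak", 0.4, 0.3), ("great writing", 0.95, 0.99)],
    "n0": [("good writing", 0.9, 0.5)],
}
frequency = {"good writing": 50, "n0": 80, "n1": 10, "n2": 10, "n3": 10,
             "n4": 10, "n5": 10, "weak": 5, "great writing": 5}
topics = select_topics(neighborhood, frequency)
```

In the toy data, "weak" is dropped for low cosine similarity and "great writing" for being a near-paraphrase, so the only surviving topic keeps six neighbors.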
3 Details about hyper-parameters are included in the Appendix.
4 Using GloVe embeddings provided by spaCy.

Question Generation
Each selected topic is presented to a human annotator together with a review that mentions that topic. We ask the annotator to write a question about the topic that can be answered by the review. For example, ‹'good', 'writing'› could be translated to "Is the writing any good?" or "How is the writing?".

Answer-Span and Subjectivity Labeling
Lastly, we present each question and its corresponding review to human annotators (crowdworkers), who provide a subjectivity score for the question on a 1 to 5 scale based on whether it seeks an opinion (e.g., "How good is this book?") or factual information (e.g., "Is this a hard-cover?"). Additionally, we ask them to highlight the shortest answer span in the review or mark the question as unanswerable. They also provide subjectivity scores for the answer spans. We provide details of our neighborhood model construction and crowdsourcing experiments in the Appendix.

Dataset Analysis
In this section, we analyze the questions and answers to understand the properties of our SUBJQA dataset. We present the dataset statistics in Section 4.1. We then analyze the diversity and difficulty of the questions. We also discuss the distributions of subjectivity and answerability in our dataset. Additionally, we manually inspect 100 randomly chosen questions from the development set in Section 4.3 to understand the challenges posed by subjectivity of the questions and/or the answers.

Difficulty and Diversity of Questions
As can be seen in Table 3, reviews in different domains tend to vary in length. Answer spans tend to be 6-7 tokens long, compared to 2-3 tokens in SQuAD. Furthermore, the average linguistic similarity of the questions and the answer spans was low: 0.7705, computed based on word2vec. These characteristics of SUBJQA contribute to making it an interesting and challenging QA dataset. Table 4 shows the number of distinct questions and topics in each domain. On average we collected 1500 questions covering 225 aspects. We also automatically categorize the boolean questions based on a lexicon of question prefixes. Unlike other review-based QA datasets (Gupta et al., 2019), SUBJQA contains more diverse questions, the majority of which are not yes/no questions. The questions are also linguistically varied, as indicated by the trigram prefixes of the questions (Figure 3). Most of the frequent trigram prefixes in SUBJQA (e.g., how is the, how was the, how do you) are almost missing in SQuAD and Gupta et al. (2019). The diversity of questions in SUBJQA demonstrates challenges unique to the dataset.
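The prefix analysis above can be illustrated with a short sketch that counts n-token question prefixes and flags boolean questions from a prefix lexicon. The lexicon below is a hypothetical stand-in, as are the function names and the example questions; the lexicon actually used for the dataset is not listed here.

```python
from collections import Counter

def prefix_distribution(questions, n=3):
    """Fraction of questions starting with each lowercased n-token prefix."""
    counts = Counter()
    for q in questions:
        tokens = q.lower().rstrip("?").split()
        if len(tokens) >= n:
            counts[" ".join(tokens[:n])] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {prefix: c / total for prefix, c in counts.items()}

# Hypothetical lexicon of yes/no question prefixes (an assumption).
BOOLEAN_PREFIXES = ("is", "are", "was", "were", "does", "do", "did",
                    "can", "has", "have")

def is_boolean(question):
    """True if the first token of the question is in the boolean lexicon."""
    tokens = question.lower().split()
    return bool(tokens) and tokens[0] in BOOLEAN_PREFIXES

questions = [
    "How is the writing?",
    "How is the food?",
    "Is the writing any good?",
    "How was the staff?",
]
trigrams = prefix_distribution(questions, n=3)
unigrams = prefix_distribution(questions, n=1)
```

Calling `prefix_distribution` with n set to 1, 2 and 3 gives the three rings of a prefix plot such as the one in Figure 3.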

Data Quality Assessment
We randomly sample 100 answerable questions to manually categorize them according to their reasoning types. Table 5 shows the distribution of the reasoning types and representative examples. As expected, since a large fraction of the questions are subjective, they cannot be simply answered using a keyword-search over the reviews or by paraphrasing the input question. Answering such questions requires a much deeper understanding of the reviews. Since the labels are crowdsourced, a small fraction of the answer spans are noisy.
Figure 3: The distribution of prefixes of questions. The outermost ring shows unigram prefixes (e.g., 57.9% questions start with how). The middle and innermost rings correspond to bigrams and trigrams, respectively.
We also categorized the answers based on answer-types. We observed that 64% of the answer spans were independent clauses (e.g., the staff was very helpful and friendly), 25% were noun phrases (e.g., great bed) and 11% were incomplete clauses/spans (e.g., so much action). This supports our argument that often subjective questions cannot be answered simply by an adjective or noun phrase.

Answerability and Subjectivity
The dataset construction relies on a neighborhood model generated automatically using factorization. It captures co-occurrence signals instead of linguistic signals. Consequently, the dataset generated is not guaranteed to only contain answerable questions. As expected, about 65% of the questions in the dataset are answerable from the reviews (see Table 7). However, unlike Gupta et al. (2019), we do not predict answerability using a classifier. The answerability labels are provided by the crowdworkers instead, and are therefore more reliable. Table 7 shows the subjectivity distribution in questions and answer spans across different domains. A vast majority of the questions we collected are subjective, which is not surprising since we selected topics from opinion extractions. A large fraction of the subjective questions (∼70%) were also answerable from their reviews.
We also compare the subjectivity of questions with the subjectivity of answers. As can be seen in Table 6, the subjectivity of an answer is strongly correlated with the subjectivity of the question. Subjective questions often have answers that are also subjective. Similarly, factual questions, with few exceptions, have factual answers. This indicates that a QA system must understand how subjectivity is expressed in a question to correctly find its answer. Most domains have 75% subjective questions on average. However, the BERT-QA model fine-tuned on each domain achieves 80% F1 on subjective questions in movies and books, but only achieves 67-73% F1 on subjective questions in grocery and electronics. Future QA systems for user-generated content, such as for customer support, should therefore model subjectivity explicitly.

Subjectivity Modeling
We now turn to experiments on subjectivity, first investigating claims made by previous work, and whether they still hold when using recently developed architectures, before investigating how to model subjectivity in QA.
Pang and Lee (2004) have shown that subjectivity is an important feature for sentiment analysis. Sorting sentences by their estimated subjectivity scores, and only using the top n such sentences, allows for a more efficient and better-performing sentiment analysis system than when considering both subjective and objective sentences equally. We first investigate whether the same findings hold true when subjectivity is estimated using transformer-based architectures. Our setup is based on a pre-trained BERT-based uncased model. Following the approach of Devlin et al. (2018), we take the final hidden state corresponding to the special [CLS] token of an input sequence as its representation. We then predict the subjectivity of the sentence by passing its representation through a feed-forward neural network, optimized with SGD. We compare this with using subjectivity scores of TEXTBLOB, a sentiment lexicon-based method, as a baseline. We consider sentences with a high TextBlob subjectivity score (> 0.5) as subjective.

Subjectivity in Sentiment Analysis
We evaluate the methods on subjectivity data from Pang and Lee (2004) and the subjectivity labels made available in our dataset (SUBJQA). Unsurprisingly, a contextually-aware classifier vastly outperforms a word-based classifier, highlighting the importance of context in subjectivity analysis (see Table 8). Furthermore, predicting subjectivity in SUBJQA is more challenging than in IMDB, because SUBJQA spans multiple domains.
We further investigate if our subjectivity classifier helps with the sentiment analysis task. We implement a sentiment analysis classifier which takes the special [CLS] token of an input sequence as the representation. We train this classifier by replicating conditions described in Pang and Lee (2004). As shown in Figure 4, restricting the sentiment classifier to the N most subjective sentences, as ranked by the contextually-aware subjectivity classifier, improves performance on sentiment analysis, outperforming both a baseline that uses all sentences and one that uses the N most objective sentences.
[Table caption fragment: ... (Pang and Lee, 2004) and our dataset (SUBJQA).]
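The sentence-selection step in this experiment can be sketched as follows. The subjectivity scores are assumed to come from some classifier (BERT-based or TextBlob); the scores and sentences below are made up for illustration, and the function name is hypothetical.

```python
def select_sentences(sentences, scores, n, most_subjective=True):
    """Keep the n most (or least) subjective sentences, in document order."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: scores[i],
                    reverse=most_subjective)
    keep = sorted(ranked[:n])  # restore the original sentence order
    return [sentences[i] for i in keep]

review = [
    "The film runs 120 minutes.",
    "I loved every second of it.",
    "It was shot in Paris.",
    "The ending felt rushed and hollow.",
]
scores = [0.1, 0.9, 0.2, 0.8]  # hypothetical subjectivity estimates

subjective_input = select_sentences(review, scores, n=2)
objective_input = select_sentences(review, scores, n=2, most_subjective=False)
```

The sentiment classifier would then be trained and evaluated on `subjective_input` rather than on the full review, with `objective_input` serving as the contrasting condition.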

Subjectivity-Aware QA Model
Given the importance of subjectivity in other NLP tasks, we investigate whether it is also an important feature for QA using SUBJQA. We approach this by implementing a subjectivity-aware QA model, as an extension of one of our baseline models in a multi-task learning (MTL) paradigm (Caruana, 1997). One advantage of using MTL is that we do not need to have access to subjectivity labels at test time, as would be the case if we required subjectivity labels as a feature for each answer span. We base our model on FastQA (Weissenborn et al., 2017). Each input paragraph is encoded with a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) over a sequence of word embeddings and contextual features (X). This encoding, H, is passed through a hidden layer and a non-linearity: [...] We extend this implementation by adding two hidden layers of task-specific parameters (W_n) associated with a second learning objective: [...] In training, we randomly sample between the two tasks (QA and subjectivity classification).

Baselines
We use four pre-trained models to investigate how their performance on SUBJQA compares with that on a factual dataset, SQuAD (Rajpurkar et al., 2016), created using Wikipedia. Specifically, we evaluate BiDaF (Seo et al., 2017), FastQA (Weissenborn et al., 2017), JackQA (Weissenborn et al., 2018) and BERT (Devlin et al., 2018), all pre-trained on SQuAD. Additionally, we fine-tune the models on each domain in SUBJQA. Figure 5 shows the F1 scores of the pre-trained models. We report the exact match scores in Appendix A.1. Pre-trained models achieve F1 scores [...]
Figure 6 shows the absolute gains in F1 scores of models fine-tuned on specific domains, over the pre-trained model. After fine-tuning on each domain, the best model achieves an average F1 of 74.1% across the different domains, with a minimum of 63.3% and a maximum of 80.5% on any given domain. While fine-tuning significantly boosts the F1 scores in each domain, they are still lower than the F1 scores on the SQuAD dataset. We argue that this is because the models are agnostic about subjective expressions in questions and reviews. To validate our hypothesis, we compare the gain in F1 scores of the BERT model on subjective questions and factual questions. We find that the difference in F1 gains is as high as 23.4% between factual and subjective questions. F1 gains differ by as much as 23.0% for factual vs. subjective answers.
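For reference, the F1 and exact match scores discussed here can be computed per example as in the sketch below. We assume the usual SQuAD-style span metrics (lowercasing, stripping punctuation and articles, then token overlap); the text does not spell out the exact evaluation script, so this is an illustration rather than the evaluation code used for the reported numbers.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def exact_match(prediction, gold):
    """1.0 if the normalized spans are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction, gold):
    """Token-level F1 between a predicted and a gold answer span."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        # Both empty counts as a match (e.g. unanswerable questions).
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

gold = "the staff was very helpful and friendly"
pred = "The staff was very helpful"
score = f1_score(pred, gold)
em = exact_match(pred, gold)
```

A prediction that covers only part of a long subjective answer span, as in this example, gets partial F1 credit but no exact match credit, which is one reason the two metrics can diverge on longer spans.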

Subjectivity-Aware Modeling
After fine-tuning over each domain in the MTL setting, the subjectivity-aware model achieves an average F1 of 76.3% across the different domains, with a minimum of 58.8% and a maximum of 82.0% on any given domain. Results from the subjectivity-aware model are shown in Table 9. Under both the F1 and the exact match metrics, incorporating subjectivity in the model as an auxiliary task boosts performance across all domains. Although there are also gains for subjective questions and answers, it is noteworthy that the highest gains can be found for factual questions and answers. This can be explained by the fact that existing techniques are already tuned for factual questions. Our MTL extension helps in identifying factual questions, which further improves the results. However, even if subjective questions are identified, the system is still not tuned to adequately deal with this input.

Related Work
We are witnessing an exponential rise in user-generated content. Much of this data contains subjective information ranging from personal experiences to opinions about specific aspects of a product. This information is useful for supporting decision making in product purchases. However, subjectivity has largely been studied in the context of sentiment analysis (Hu and Liu, 2004) and opinion mining (Blair-Goldensohn et al., 2008), with a focus on text polarity. There is renewed interest in incorporating subjective opinion data into a general data management system (Kobren et al., 2019) and providing an interface for querying subjective data. These systems employ trained components for extracting opinion data, labeling it and even responding to user questions.
In this work, we revisit subjectivity in the context of review QA. McAuley and Yang (2016) and Yu et al. (2012) also use review data, as they leverage question types and aspects to answer questions. However, no prior work has modeled subjectivity explicitly using end-to-end architectures.
Furthermore, none of the existing review-based QA datasets are targeted at understanding subjectivity. This can be attributed to how these datasets are constructed. Large-scale QA datasets, such as SQuAD (Rajpurkar et al., 2016), NewsQA (Trischler et al., 2017) and CoQA (Reddy et al., 2019), are based on factual data. We are the first to attempt to create a review-based QA dataset for the purpose of understanding subjectivity.

A Appendices
A.1 Additional Experimental Results
Figure 7 shows the exact match scores achieved by the pre-trained out-of-the-box models on various domains in SUBJQA. Figure 8 shows the exact match scores of the models fine-tuned on each domain in SUBJQA.

A.2 Neighborhood Model Construction
For constructing the matrix for factorization, we focus on frequently reviewed items and frequent extractions. In particular, we consider items which have more than 10,000 reviews and extractions that were expressed in more than 5000 reviews. Once the matrix is constructed, we factorize it using a non-negative factorization method, with 20 as the dimension of the extraction embedding vector.
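To make the factorization step concrete, the sketch below factorizes a small non-negative item-by-extraction count matrix with the classic multiplicative-update rules for non-negative matrix factorization. This is an assumption for illustration: the specific solver used is not stated above, the toy matrix is invented, and the toy uses 2 latent dimensions rather than 20.

```python
import random

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frobenius_error(m, w, h):
    """Squared reconstruction error of m by the product w h."""
    approx = matmul(w, h)
    return sum((m[i][j] - approx[i][j]) ** 2
               for i in range(len(m)) for j in range(len(m[0])))

def nmf(m, k, iterations=200, seed=0, eps=1e-9):
    """Factorize non-negative m (items x extractions) as w h, with w, h >= 0."""
    rng = random.Random(seed)
    rows, cols = len(m), len(m[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(rows)]
    h = [[rng.random() + 0.1 for _ in range(cols)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T M) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, m), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(cols)]
             for a in range(k)]
        # W <- W * (M H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(m, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(rows)]
    return w, h

# Toy counts: 4 items (rows) x 5 extractions (columns), hypothetical values.
M = [[5, 3, 0, 1, 0],
     [4, 0, 0, 1, 1],
     [1, 1, 0, 5, 4],
     [0, 1, 5, 4, 3]]
W, H = nmf(M, k=2)
# Column j of H is the embedding of extraction j.
extraction_embeddings = [[H[a][j] for a in range(len(H))]
                         for j in range(len(M[0]))]
```

Under this layout, the rows of `W` would play the role of item representations and the columns of `H` the extraction embeddings whose cosine similarities define the neighborhood model.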
In the next step, we construct the neighborhood model by finding top-10 neighbors for each extraction based on cosine similarity of the extraction and the neighbor. We further select topics from the extractions, and prune the neighbors based on the criteria we described earlier.
A.3 Crowdsourcing Details
Figure 9 illustrates the instructions that were shown to the crowdworkers for the question generation task. Figure 10 shows the interface for the answer-span collection and subjectivity labeling tasks. The workers assign subjectivity scores (1-5) to each question and the selected answer span. They can also indicate if a question cannot be answered from the given review.