Mining fine-grained opinions on closed captions of YouTube videos with an attention-RNN

Video reviews are the natural evolution of written product reviews. In this paper we target this phenomenon and introduce the first dataset created from closed captions of YouTube product review videos, as well as a new attention-RNN model for aspect extraction and for joint aspect extraction and sentiment classification. Our model provides state-of-the-art performance on aspect extraction on the SemEval ABSA corpus without requiring hand-crafted features, and outperforms the baseline on the joint task. On our dataset, the attention-RNN model outperforms the baseline for both tasks, but we observe important performance drops for all models in comparison to SemEval. These results, as well as further experiments on domain adaptation for aspect extraction, suggest that the differences between speech and written text, which have been discussed extensively in the literature, also extend to the domain of product reviews, where they are relevant for fine-grained opinion mining.


Introduction
On-line video has become indispensable to people's daily lives: traffic statistics showed that by 2010 it accounted for 56.6% of total global consumer traffic (Siersdorfer et al., 2010). Studies support the notion that on-line reviews can have a strong influence on the decision-making of potential Internet buyers (Chevalier and Mayzlin, 2006), making them a major factor for both consumers and marketers (Hu et al., 2008).
Video reviews are the natural evolution of written product reviews. In fact, people are increasingly turning to platforms such as YouTube to help them shop, looking for product reviews (Lawson, 2015). YouTube unboxing videos have become a growing phenomenon (Lawson, 2015; Insights, 2014). In 2015 alone, people in the U.S. watched 60M hours of them on YouTube, totaling 1.1B views. The same year, views of product review videos increased by 40% compared to 2014, and more than 1 million channels related to product reviews were counted (Baysinger, 2015). Despite all of this, the most widely used approaches in opinion mining focus only on tweets or on written product reviews available on websites like Amazon. In this paper we therefore present the first opinion mining study focusing on video product reviews. We take the fine-grained approach, which aims to detect the subjective expressions in text and to characterize their sentiment orientation, and analyze the closed captions of video product reviews extracted from YouTube. Fine-grained opinion mining is important for a variety of NLP problems, including opinion-oriented question answering and opinion summarization, and has been studied extensively in recent years. In practical terms, this approach defines the tasks of aspect extraction (AE), sentiment classification (SC) and a joint setting (AESC).
While AE and AESC have often been tackled as sequence labeling problems, where the sentence is a stream of tokens to be labeled using IOB labels or collapsed, sentiment-bearing IOB labels (Zhang et al., 2015) respectively, SC can be regarded as a semantic composition problem, where the obtained sentence representation is used to predict the sentiment.
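To make the labeling schemes concrete, the sketch below (our own illustration, not part of any of the cited systems; tag names are illustrative) converts plain IOB aspect tags plus per-aspect polarities into the collapsed, sentiment-bearing IOB labels used for AESC:

```python
def collapse_iob(iob_tags, sentiments):
    """Collapse plain IOB aspect tags with per-aspect sentiments into
    sentiment-bearing IOB labels, e.g. B-ASPECT + POS -> B-POS.
    `sentiments` holds one polarity per aspect, in order of appearance."""
    collapsed, aspect_idx = [], -1
    for tag in iob_tags:
        if tag == "O":
            collapsed.append("O")
        else:
            prefix = tag.split("-")[0]      # "B" or "I"
            if prefix == "B":
                aspect_idx += 1             # a new aspect mention starts
            collapsed.append(f"{prefix}-{sentiments[aspect_idx]}")
    return collapsed

# "The battery life is great": one positive aspect ("battery life")
tags = ["O", "B-ASPECT", "I-ASPECT", "O", "O"]
print(collapse_iob(tags, ["POS"]))  # ['O', 'B-POS', 'I-POS', 'O', 'O']
```

With this scheme, a single sequence labeler can address both tasks at once, at the cost of a larger tag vocabulary.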
Accounting for the evident differences between speech and written text, which have led linguists to consider them as different domains (Biber, 1991) exhibiting different syntactic (O'Donnell, 1974) and distributional properties, we created the first annotated dataset based on closed captions of YouTube product review videos, which we named the Youtubean dataset.
Motivated by the success of attention-based approaches in multiple NLP problems, such as machine translation (Bahdanau et al., 2015), parsing (Vinyals et al., 2015), slot-filling (Liu and Lane, 2016) and others (Luong et al., 2015), we also introduce an attention-augmented RNN model for AE and AESC. Compared to previous work, the attentional component makes our model especially suitable for AESC, since it directly addresses the compositional nature of the sentiment classification task by allowing the model to represent the input sentence as a convex combination of word representations. This is confirmed by our results on the SemEval ABSA dataset (Pontiki et al., 2014): our model offers state-of-the-art performance for AESC while also performing on par with the state-of-the-art for aspect extraction, without the need for manually-crafted features.
We also show that our attention-RNN model outperforms the baseline for both AE and AESC on our dataset. However, we observed that, compared to the SemEval corpora, all the tested models decreased their performance on it. As indicated by a descriptive analysis of our corpus, and by additional experiments using domain adaptation techniques for AE, which did not offer considerable gains, our results support the existence of the aforementioned differences between speech and written text in the context of product reviews, and their importance for fine-grained opinion mining. Our code and data are available for download at github.com/epochx/opinatt.

Related Work
Our work is related to aspect extraction using deep learning, a task that is often tackled as a sequence labeling problem. In particular, our work is related to Irsoy and Cardie (2014), who pioneered the field by using multi-layered RNNs on a subset of the MPQA 1.2 dataset (Wiebe et al., 2005). Later, Liu et al. (2015) successfully adapted the architectures of Mesnil et al. (2013), experimenting on the SemEval 2014 dataset (Pontiki et al., 2014). Compared to these, our model is novel in that it introduces the use of attention for AE. In this sense, our work is also related to Liu and Lane (2016), who introduced an attention RNN for slot-filling in Natural Language Understanding. We also find related work on the usage of RNNs for open-domain targeted sentiment (Mitchell et al., 2013), where Zhang et al. (2015) experimented with neural CRF models using various RNN architectures on a dataset of informal language from Twitter. In our case, the domain is different since we focus on product reviews.
Regarding target-based sentiment analysis, we find several ad-hoc models that account for the sentence structure and the position of the aspect in it, such as Tang et al. (2016b) and Tang et al. (2016a), who use attention-augmented RNNs for the task. However, these models require the location of the aspect to be known in advance and are therefore only useful in pipeline models. Our work is similar to these since it also makes use of an attentional component to model compositionality in sentiment classification, but we model aspect extraction and sentiment classification as a joint task instead of using a pipeline approach.
AESC has also often been tackled as a sequence labeling problem, mainly using CRFs (Mitchell et al., 2013). To model the problem in this fashion, collapsed, sentiment-bearing IOB labels (Zhang et al., 2015) are used. Pipeline models (i.e. task-independent model ensembles) have also been extensively studied by the same authors. We also find Xu et al. (2014), who performed AESC by modeling the linking relation between aspects and the sentiment-bearing phrases.
When it comes to the video review domain, we find related work on YouTube mining, mainly focused on exploiting user comments. For example, Wu et al. (2014) exploited crowdsourced textual data from time-synced commented videos, proposing a temporal topic model based on LDA. However, Schultes et al. (2013) showed that comments with references to video content represent only 2% to 4% of comments on YouTube. Therefore, we think this kind of analysis might be limited. Tahara et al. (2010) introduced a similar approach for Nico Nico, using time-indexed social annotations to search for desirable scenes inside videos.
On the other hand, Severyn et al. (2014) proposed a systematic approach to mine user comments that relies on tree kernel models. Additionally, Krishna et al. (2013) performed sentiment analysis on YouTube comments related to popular topics using machine learning techniques, showing that the trends in users' sentiments are well correlated with the corresponding real-world events. Siersdorfer et al. (2010) presented an analysis of dependencies between comments and comment ratings, showing that community feedback in combination with term features in comments can be used to automatically determine the community acceptance of comments. Finally, we find some papers that have successfully used closed caption mining for video activity recognition (Gupta and Mooney, 2010) and scene segmentation (Gupta and Mooney, 2009). Similar work has been done using closed captions to classify movies by genre (Brezeale and Cook, 2006) and to summarize video programs (Brezeale and Cook, 2006).

Dataset
On YouTube, video authors can provide their own closed captions, or captions can be generated automatically by the platform. In both cases, these captions can be interpreted as a time-indexed transcript of the speech in the video. To minimize the amount of noise in the data, we utilized the user-provided closed captions of seven of the most popular reviews of the Samsung Galaxy S5 and created an annotated dataset for fine-grained opinion mining. We obtained, cleaned and processed the data, and annotated the aspects following the guidelines of Pontiki et al. (2014) using brat (Stenetorp et al., 2012). We divided the annotation process into two steps.
First, two different annotators tagged aspects independently, obtaining an exact inter-annotator agreement of 0.705 F1-score. This value rose to 0.823 when allowing for partial matches, which we defined as any overlap between the annotated terms. Discrepancies were discussed until a final set of annotations was agreed upon.
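The two agreement measures can be made concrete with a small sketch (our own illustration; the span representation and function names are assumptions, not the annotation tooling actually used):

```python
def span_f1(spans_a, spans_b, partial=False):
    """Inter-annotator F1 over (start, end) token spans. Exact mode
    requires identical spans; partial mode counts any overlap as a match."""
    def overlaps(x, y):
        # half-open intervals [start, end) overlap iff each starts before
        # the other ends
        return x[0] < y[1] and y[0] < x[1]
    match = (lambda x, y: overlaps(x, y)) if partial else (lambda x, y: x == y)
    tp_a = sum(any(match(a, b) for b in spans_b) for a in spans_a)
    tp_b = sum(any(match(b, a) for a in spans_a) for b in spans_b)
    p = tp_a / len(spans_a) if spans_a else 0.0
    r = tp_b / len(spans_b) if spans_b else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

a = [(0, 2), (5, 7)]   # annotator A: token spans
b = [(0, 2), (6, 8)]   # annotator B: second span only overlaps A's
print(span_f1(a, b), span_f1(a, b, partial=True))  # 0.5 1.0
```

Under partial matching, the overlapping second spans count as a hit, which is why the partial agreement is always at least as high as the exact one.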
With these annotations fixed, we asked the same annotators to tag the sentiment of each extracted aspect. On this task, the annotators obtained an average agreement of 0.942 F1-score. This time, discrepancies were discussed with a third person who acted as an arbiter, until an agreement was reached. Both aspect extraction and sentiment classification inter-annotator agreements are comparable to the values obtained in similar tasks (Jimenez-Zafra et al., 2015; Wiebe et al., 2005).

Table 2: Descriptive corpora comparison.

Table 1 provides some key information about the source video reviews we used to build our dataset, which we named the Youtubean dataset. Table 2 compares it to the SemEval Laptops and Restaurants corpora, regarded as the de facto datasets for written review mining. Several differences can be observed. A major distinction lies in the mean sentence and aspect lengths, both of which are considerably longer in Youtubean. We also analyzed sentence syntactic complexity in terms of constituency tree depth, observing that our sentence trees are deeper on average. Furthermore, Youtubean exhibits both longer and more frequent aspect mentions.

Proposed Model
Our proposed model is a two-pass bidirectional RNN architecture that includes an attentional component. Formally, given an embedded input sequence x = [x_1, ..., x_n] with one-hot encoded labels y = [y_1, ..., y_n], we define the first pass as follows:

x̃_i = [x_{i-d}; ...; x_i; ...; x_{i+d}]
h→_i = σ(W→ x̃_i + V→ h→_{i-1})
h←_i = σ(W← x̃_i + V← h←_{i+1})
h_i = [h→_i ; h←_i]

where σ denotes the sigmoid nonlinearity, h→_i and h←_i are the forward and backward hidden states of the RNN, which are concatenated into h_i, and x̃_i is a context window of ordered word embedding vectors around position i, with a total size of 2d + 1. This context window is intended to improve the model's ability to capture short-term temporal dependencies (Mesnil et al., 2013). The second pass goes through the hidden states h_i and performs sequence labeling token by token. We use the attentional decoder of Vinyals et al. (2015).
u_{ij} = v^T tanh(W_1 h_j + W_2 h_i)
α_{ij} = softmax_j(u_{ij})
t_i = Σ_j α_{ij} h_j
ŷ_i = softmax(FFN(h_i, t_i, y_{i-1}))

where ŷ_i is a probability distribution over the label vocabulary for input i. As shown, this is obtained using both the corresponding aligned input h_i and t_i, the attention-weighted combination of all hidden states, i.e. using a global attention scheme (Luong et al., 2015). While generating the output ŷ_i, we explicitly model the dependency on the previous label by adding y_{i-1} to the computation. These components are combined using a feed-forward neural network (FFN), whose output dimension is the size of the tag label vocabulary for AE or AESC. The attention state is initialized with h_n, so that the model does not bypass the attentional component. As a loss function we use the minibatch average cross-entropy.

The addition of the attentional component to our model is motivated by two factors. First, in contrast to Mesnil et al. (2013), who directly use a window of previous hidden states for AE, the attentional component allows us to access contextual information in a more natural and selective way. Second, for AESC, the attention directly models sentiment compositionality.
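A minimal NumPy sketch of one decoding step may help clarify the computation. All weight names and shapes below are illustrative assumptions, not the authors' actual implementation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_step(H, h_i, y_prev, params):
    """One decoding step of a global-attention tagger (hypothetical shapes).
    H: (n, d) all encoder hidden states; h_i: (d,) aligned state;
    y_prev: (k,) one-hot previous label; params: dict of weights."""
    W1, W2, v = params["W1"], params["W2"], params["v"]
    # alignment score of every encoder state against the current position
    scores = np.tanh(H @ W1 + h_i @ W2) @ v          # (n,)
    alpha = softmax(scores)                          # attention weights
    t_i = alpha @ H                                  # convex combination, (d,)
    # feed-forward combination of aligned state, context and previous label
    z = np.concatenate([h_i, t_i, y_prev]) @ params["Wo"]
    return softmax(z), alpha                         # label distribution, weights

rng = np.random.default_rng(0)
n, d, k = 6, 4, 3                                    # toy sizes
params = {"W1": rng.normal(size=(d, d)), "W2": rng.normal(size=(d, d)),
          "v": rng.normal(size=(d,)), "Wo": rng.normal(size=(2 * d + k, k))}
y_hat, alpha = attention_step(rng.normal(size=(n, d)),
                              rng.normal(size=(d,)), np.eye(k)[0], params)
```

Because the context vector t_i is a convex combination of the hidden states (the attention weights sum to one), the model can softly select which parts of the sentence inform each label.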

Experimental setup
For our experiments, in addition to Youtubean, we also worked with the SemEval ABSA 2014 Laptops and Restaurants corpora (Pontiki et al., 2014), which can be regarded as the de facto datasets for fine-grained review mining. For AE we use the train/test splits provided for Phase B. For AESC, since the test data does not have sentiment labels, we worked only with the training data. Since Youtubean is smaller than the SemEval corpora, we used 5-fold cross validation on it to make results more robust. For each fold, we used 10% of the development data as a validation set, and we compare results using two-sided t-tests.
For evaluation based on F1-score, we used the CoNLL conlleval script. For joint aspect extraction and sentiment classification, we only considered positive (+), negative (−) and neutral (0) as sentiment classes, mapping the additional conflict class to neutral. To gain insight into the output of the models for AESC, we decoupled the collapsed IOB tags using simple heuristics to recover the plain aspect extraction F1-score as well as classification performance for each sentiment class; however, we used the joint-tagging conlleval F1-score to evaluate the models.
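A minimal sketch of such a decoupling heuristic (our own illustration; the paper does not specify its exact rules, and here the sentiment of each aspect is simply taken from its B- tag):

```python
def decouple(collapsed_tags):
    """Split collapsed sentiment-bearing IOB tags back into plain aspect
    IOB tags plus one sentiment label per aspect mention."""
    plain, sentiments = [], []
    for tag in collapsed_tags:
        if tag == "O":
            plain.append("O")
        else:
            prefix, sent = tag.split("-")   # e.g. "B-POS" -> ("B", "POS")
            plain.append(f"{prefix}-ASPECT")
            if prefix == "B":
                # one sentiment per mention, read off the B- tag
                sentiments.append(sent)
    return plain, sentiments

print(decouple(["O", "B-POS", "I-POS", "O", "B-NEG"]))
# (['O', 'B-ASPECT', 'I-ASPECT', 'O', 'B-ASPECT'], ['POS', 'NEG'])
```

The recovered plain tags can then be scored with conlleval as in AE, while the per-aspect sentiments support a per-class breakdown.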
As a baseline, we implemented the RNN architectures of Liu et al. (2015), which are the state-of-the-art in fine-grained aspect extraction. We experimented with Jordan-style RNNs (JRNN), Elman-style RNNs (RNN), LSTMs and the bidirectional versions of the last two. We followed Irsoy and Cardie (2014) to merge the forward and backward hidden states, setting y_t = σ(U→ h→_t + U← h←_t), where U→ and U← are output matrices for the forward and backward hidden states h→_t and h←_t, respectively. This gives the models more flexibility to capture complex relations in a sentence, making them able to learn how to weight future and past information.
For both our attention-RNN model and the baseline RNNs, we experimented with Senna embeddings (Collobert et al., 2011), GoogleNews embeddings (Mikolov et al., 2013) and WikiDeps embeddings (Levy and Goldberg, 2014). The usefulness of pre-trained embeddings for the baseline RNNs was already shown by Liu et al. (2015). To test this hypothesis for our own model, we also used randomly initialized embeddings of sizes 50 and 300 for comparison.
To make our results more transparent, we explicitly experimented with two different pre-processing pipelines. We used Senna (Collobert et al., 2011), which provides both a POS-tagger and a chunker, and CoreNLP (Manning et al., 2014). The latter lacks a chunker, so we combined it with the CoNLL chunklink script (http://ilk.uvt.nl/team/sabine/homepage/software.html). Following Liu et al. (2015), we also experimented with adding the same 14 linguistic binary features they used, which are based on POS tags and chunk IOB tags. These are concatenated to the hidden layer of the RNN before the final output non-linearity.
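To illustrate how such per-token binary features can be built, here is a hypothetical sketch (it does not reproduce the exact 14 features of Liu et al. (2015); the tag sets and function are our own assumptions):

```python
def binary_features(pos_tag, chunk_tag, pos_set, chunk_set):
    """Build a per-token binary feature vector from a POS tag and a chunk
    IOB tag: one indicator per POS tag of interest, plus one indicator per
    chunk IOB prefix. The result is concatenated to the RNN hidden layer."""
    return ([1 if pos_tag == p else 0 for p in pos_set] +
            [1 if chunk_tag.startswith(c) else 0 for c in chunk_set])

# toy tag sets; a noun at the start of a noun phrase
feats = binary_features("NN", "B-NP", ["NN", "VB", "JJ"], ["B", "I", "O"])
print(feats)  # [1, 0, 0, 1, 0, 0]
```

Keeping the features binary means they can be concatenated directly to the hidden layer without any rescaling.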
To train our baseline models, we set a learning rate of 0.01 with decay and used early stopping on the validation set. We set a fixed window size of 1 for bidirectional and 3 for unidirectional models, and always train the word embeddings. Exploratory experiments showed that most models stop learning after a few epochs (3 or 4), so we only trained for a maximum of 5 epochs.
For our attention-RNN model (ARNN), we only report results using LSTM cells, which outperformed all other cells we tried in preliminary experiments. We explored different hyper-parameter configurations, including context window sizes of 1, 3 and 5, hidden state sizes of 100, 200 and 300, and dropout keep probabilities of 0.5 and 0.8. We also experimented with concatenating the RNN hidden states after the first pass with the binary features used by Liu et al. (2015). Finally, we also experimented with unidirectional versions of the RNNs. For training, we used mini-batch stochastic gradient descent with a mini-batch size of 16 and padded sequences to a maximum length of 200 tokens. We used exponential learning-rate decay with ratio 0.9 and early stopping on the validation set when there was no improvement in the F1-score after 1000 training steps.

Aspect Extraction (AE)
Laptops

Table 3 summarizes our best baseline results on the Laptops dataset. For contrast, we include the best F1-scores obtained by Liu et al. (2015) (cf. F1* columns). We observed that the CoreNLP pipeline outperformed the Senna pipeline, with an average absolute gain of 2.105% (significant at p = 1.29 × 10^-5), and binary features proved useful, offering an average absolute gain of 1.538% (p = 1.29 × 10^-5). Finally, note that the best configurations always use SennaEmbeddings, which significantly outperformed the other embeddings in each case.

Table 3: Results of our implemented baseline RNN models on the Laptops dataset.

Table 4 summarizes the best results of our ARNN model on the Laptops dataset, where we obtained a maximum F1-score of 74.74. Again, the CoreNLP pipeline significantly outperformed Senna, with an average absolute gain of 1.39 F1-score (p = 3.4 × 10^-33). Bidirectionality provided an average absolute gain of 1.15 F1-score (p = 4.61 × 10^-20).
SennaEmbeddings and GoogleNews provided statistically equivalent results (p = 0.65), and both were significantly superior to WikiDeps, with p-values of 9.54 × 10^-17 and 2.6 × 10^-13 respectively. Pre-trained embeddings outperformed random embeddings on average, comparing across same-sized cases. Linguistic binary features did not contribute significantly to the performance.

Restaurants

Regarding the linguistic features, we found that they contributed to increasing performance, with an average absolute gain of 1.083% (p = 1.65 × 10^-6). This is consistent with previous findings by Liu et al. (2015). The Senna pipeline outperformed CoreNLP, with an average absolute gain of 1.161% (p = 1.02 × 10^-8). Embeddings caused statistically significant differences, with WikiDeps outperforming both other embeddings on average.

For the ARNN, context windows proved beneficial, as confirmed by the significantly different average F1-scores of 76.55, 77.59 and 77.28 for window sizes 1, 3 and 5 respectively. We also observed significant performance differences using SennaEmbeddings, which outperformed all others with an average F1-score of 77.94. GoogleNews and WikiDeps exhibited average F1-scores of 76.93 and 76.55, which are statistically different (p = 4.08 × 10^-6); although they also outperformed random embeddings for d = 300, they performed statistically worse than random embeddings for d = 50. Linguistic binary features did not contribute significantly to the performance.

Table 7 summarizes our results for the baseline RNNs on Youtubean. Again, we observed that adding linguistic features had a positive effect on performance, with an average absolute gain of 1.30% (p = 0.01). SennaEmbeddings and WikiDeps provided better performance than GoogleNews, with average F1-scores of 49.11, 49.64 and 45.37 respectively; the first two values were statistically indistinguishable. We could not observe significant differences in performance between the two pipelines.
Table 7: Results of our implemented baseline RNN models on AE for the Youtubean dataset.

Youtubean
To further study the relation between written and video product reviews for aspect extraction, a task that has been broadly studied by our community, we complemented our RNN baselines with two classic domain adaptation methods. Despite their simplicity, such methods are surprisingly difficult to beat (Daume III and Marcu, 2006). Both techniques use one of the SemEval corpora as a source (SRC) dataset for transfer learning, with Youtubean as the target (TGT).
Our first domain adaptation technique was WEIGHTED, a method that trains a model on the union of the SRC and TGT datasets, reweighting the examples from SRC (Daume III and Marcu, 2006). We did so by multiplying the input embedding matrix by a given weight w, which we set to 0.2 based on the corpus size ratio. For training, we used 10-fold cross validation, adding all the examples of the SRC dataset to the training part of each fold. Since these models took longer to train, we only experimented with the Senna pipeline. We omitted our bidirectional architectures given their poor performance, and always included the linguistic features, which generally improved the F1-score of our in-domain models.

As Table 8 shows, using the Laptops dataset as SRC gives the best results in each case. Using this corpus led to an average absolute improvement over Restaurants of 3.79% (p = 7.76 × 10^-11). When it comes to embeddings, GoogleNews provided the best average performance, with a 53.44 F1-score. However, this value was statistically indistinguishable (p < 0.08) from WikiDeps, with an average 52.8 F1-score.
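A minimal sketch of the WEIGHTED idea, assuming each SRC example is down-weighted by simply scaling its token embeddings by w (function and variable names are our own, not the authors' code):

```python
import numpy as np

def weighted_union(src_X, tgt_X, w=0.2):
    """WEIGHTED-style domain adaptation sketch: train on SRC ∪ TGT, but
    dampen SRC examples by scaling their input embeddings by w.
    w = 0.2 mirrors the corpus size ratio mentioned in the text."""
    src_scaled = [w * np.asarray(x, dtype=float) for x in src_X]  # damped source
    tgt = [np.asarray(x, dtype=float) for x in tgt_X]             # target as-is
    return src_scaled + tgt

# one toy SRC embedding and one toy TGT embedding
combined = weighted_union([[1.0, 2.0]], [[3.0, 4.0]], w=0.2)
```

The combined list is then used as a single training set, so the model sees all the source data but with a deliberately weaker signal.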
Our second domain adaptation method was PRED, which uses the output of a SRC-trained classifier as a feature in the TGT model. Concretely, we first trained a model using all the examples in SRC. We then fed that model with all the TGT examples, adding its outputs as additional features to the TGT dataset, thus creating a new, augmented version of it. Since these features are IOB tags, we concatenated them with the linguistic features. We trained our models on the augmented TGT dataset, choosing the best-performing settings from our previous experiments on AE.

Table 9: Results for the PRED technique.

Table 9 summarizes our results for PRED. We found that using Senna as the pre-processor provided better results on average, with a 0.89% absolute gain, significant at p = 0.01. The Restaurants dataset provided better results than Laptops on average, with an absolute gain of 3.23%, significant at p = 8.78 × 10^-6.
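The PRED augmentation step can be sketched as follows (a simplified illustration; the model interface and names are assumptions):

```python
def augment_with_pred(tgt_sentences, src_model):
    """PRED-style augmentation sketch: run a SRC-trained tagger over the
    TGT sentences and attach its IOB predictions as an extra categorical
    feature per token. `src_model` is any callable: tokens -> IOB tags."""
    augmented = []
    for sentence in tgt_sentences:
        pred_tags = src_model(sentence)          # e.g. ["O", "B-ASPECT", ...]
        augmented.append(list(zip(sentence, pred_tags)))
    return augmented

# toy SRC "model": tags the word "battery" as an aspect
toy_model = lambda s: ["B-ASPECT" if w == "battery" else "O" for w in s]
out = augment_with_pred([["great", "battery"]], toy_model)
print(out)  # [[('great', 'O'), ('battery', 'B-ASPECT')]]
```

The TGT model is then trained on these token/prediction pairs, treating the source model's tags as just another input feature.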
Finally, Table 10 shows the best results for our ARNN on the Youtubean dataset. For this case, we omitted random embeddings and binary features, as previous experiments showed they did not contribute to increasing the performance.

Joint aspect extraction and sentiment classification (AESC)
In our experiments for this task, we based our parameter settings on the results for AE, so we only used bidirectional ARNN models and skipped binary features and random embeddings. On Youtubean, as Table 13 shows, we see important performance drops compared to SemEval. In particular, the baseline models seem unable to correctly classify negative aspects. For this dataset, we found that Senna provides better results than CoreNLP, with an average absolute gain of 3.94 F1-score, significant at p = 2.5 × 10^-4. Embeddings did not provide statistically significant differences. Similarly, binary features did not contribute significantly to the performance either.

Discussion
Results for aspect extraction showed that our implemented RNN baselines perform similarly to the original models of Liu et al. (2015), although we were unable to replicate their exact numbers. Despite that, our attention-RNN provides results that are better than our implementation and comparable to the original values for both the Laptops and Restaurants datasets. Moreover, we achieved these results without adding the linguistic features, which did not yield significant performance differences in our experiments. We think the variable sentence representation introduced by the attentional component is able to model some of the semantics encoded in these binary features.

For aspect extraction on our dataset, our model again performs better than the baseline without the need for manually-crafted features. However, simple domain adaptation techniques applied to the baseline RNNs obtained the best results, adding a maximum of 3.56 F1-score over the baseline. We think this shows that video reviews and written reviews share some regularities, which could be exploited further to obtain better results. In this sense, it would be interesting to apply these domain adaptation techniques to our attention-RNN model and compare the results. However, the regularities between these domains seem to be limited, given that our gains were small and that no source domain consistently delivered better performance.
Regarding AESC, as shown by our decoupled results, all models slightly decreased their aspect extraction performance compared with the results for AE. This seems reasonable given the additional challenge of performing both tasks at the same time.
When it comes to sentiment classification, our attention-RNN outperforms the baseline RNNs by a solid margin. However, all models tend to perform poorly on the negative (−) class. We believe this may be related to the imbalanced nature of the datasets, or to the additional compositional challenges that negation involves, which seem to be critical in our dataset. Compared to the baseline RNNs, which in some cases seemed essentially unable to detect negative sentiment, our attention-RNN model offers increased, though still limited, capabilities to deal with the negative class.
For AESC, we also observed that SennaEmbeddings did not always provide top performance, being outperformed by other embeddings, even though they were previously shown to offer the best performance for aspect extraction in all cases. We think this is related to the nature of the embeddings: SennaEmbeddings were designed for the tasks in Collobert et al. (2011), which do not include sentiment, while the other embeddings can be regarded as general-purpose.

Conclusions
In this paper we presented the first fine-grained opinion mining study focusing on product video reviews. We introduced the first annotated dataset for the domain, Youtubean, and tackled aspect extraction and AESC with a novel attention-RNN model. Our model offered state-of-the-art performance for AESC and results comparable to a strong RNN baseline for aspect extraction. Our descriptive corpus analysis, as well as the performance obtained by all the models on our dataset, suggests that the differences between speech and written text, discussed extensively in the literature, also extend to the domain of product reviews, where they are relevant for fine-grained opinion mining. These findings pose relevant research challenges and concrete paths for future research.
For future work, we plan to increase the size of our dataset and to include reviews extracted from different product categories. By doing this, we intend to make our results more robust and to further study the differences between written and video reviews, ultimately deriving new ways to overcome them. Finally, we also want to exploit additional data from YouTube, such as the audio, the video or specific frames extracted from it, and user comments, to improve our results.