Computer Assisted Annotation of Tension Development in TED Talks through Crowdsourcing

We propose a method of machine-assisted annotation for identifying tension development, that is, annotating whether the tension is increasing, decreasing, or staying unchanged. We use a neural-network-based prediction model whose predictions are given to the annotators as initial values for the options they are asked to choose from. Presenting such initial values turns the annotation task into an evaluation task in which the annotators inspect whether the predicted results are correct. To demonstrate the effectiveness of our method, we performed the annotation task in both in-house and crowdsourced environments. For the crowdsourced environment, we compared the annotation results with and without our method of machine-assisted annotation. We find that the results with our method show higher agreement with the gold standard than those without, though our method had little effect on reducing the annotation time. Our code for the experiment is made publicly available.


Introduction
Researchers in natural language processing have recently been paying more attention to crowdsourcing for its effectiveness in linguistic annotation. The recent development of crowdsourcing platforms such as Amazon Mechanical Turk (AMT) has greatly reduced the time and effort required for an annotation project. Many researchers have proposed methods to assist workers in crowdsourced annotation (Yuen et al. (2011); Poesio et al. (2013); Guillaume et al. (2016); Madge et al. (2019); Yang et al. (2019)). In particular, Guillaume et al. (2016) designed a game-based platform for the annotation of dependency relations in French text, with a prediction model embedded in the platform. Yang et al. (2019) proposed predicting the difficulty of an annotation unit in order to allocate relatively easy units to crowdsourcing workers and the rest to expert annotators.
In this paper, we present a machine-assisted method for effective annotation of tension development. Tension is a means of keeping the attention of the reader or audience, and has been studied mainly in the field of storytelling (Zillmann (1980); Klimmt et al. (2009); Niehaus and Young (2014)). Tension also plays a critical role in discourse development (Lehne and Koelsch, 2015). We annotate the tension development in TED Talks, that is, whether the tension is increasing, decreasing, or staying unchanged. We also introduce a Self-Assessment Manikin (SAM), an intuitive diagram that helps annotators understand the guidelines for tension annotation. Our method uses a prediction model for tension development and provides the annotators with the model's predictions as initial values. The predictions are based on the audio and the subtitles of the given video clip and on the annotator's previous annotation results.
We validate our method through an experiment on crowdsourced annotations. The annotations with our method show higher agreement with the gold standard, which we constructed manually by annotating independently of the crowdsourced annotations, than those without our method. However, contrary to our initial expectation that our method would also reduce the annotation time, we find that it hardly did so.
The contributions of this paper are as follows. (1) We propose a new annotation scheme using the Self-Assessment Manikin (SAM) to annotate tension development on multimodal data. (2) To the best of our knowledge, our method is the first to utilize a prediction model to assist the annotation of tension development. We show experimentally that our method is effective at gathering high-quality data and provide a detailed analysis of the annotation results. (3) We make the related data and the code publicly available.
Figure 1: Overview of the annotation process
Related work
Computer-Assisted Annotation
Ringger et al. (2008) suggested a machine-assisted method for part-of-speech (POS) tagging. They provided model predictions to the annotators so that the annotators could focus only on incorrect predictions. There has been a line of research on effective visualization and user-interface improvements that can help a linguistic annotation process (Stenetorp et al. (2012); Yimam et al. (2013)). Guillaume et al. (2016) provided a game-based platform for the annotation of dependency relations in French text and used a prediction model as a part of the platform in the training phase for the annotators before the main data gathering. For the selection of the target data to annotate, active learning has been employed to collect only the training data on which the model does not perform well, in order to maximize the performance of the model with as small a dataset as possible (Wang et al. (2017); Duong et al. (2018)). Schulz et al. (2019) showed that providing automatically generated annotation results can accelerate the annotation process and enhance the annotation quality without incurring a significant bias.
For visual object detection, Yao et al. (2012) presented an annotation platform that contains a prediction model for the location of the given object. In their platform, the model presents the predicted location to the annotators, and the annotators modify the location if it is incorrect. They also predicted the time that an annotator may take for the modification and presented the annotation units with the shortest expected time to the annotators, in order to minimize the total cost of their annotation project. Su et al. (2012) presented a quantification test that can identify the annotators who do not fully understand the annotation guidelines. They also presented a rule-based feedback system that can warn untrained annotators before they continue the annotation.

Emotion, suspense, and tension
Tension is a psychological concept that is related to emotion and suspense. Tension has been studied along with suspense in literature, movies, and games (Brewer and Lichtenstein (1982); Zillmann (1980); Klimmt et al. (2009)). Lehne and Koelsch (2015) proposed a general psychological model for tension without any further restriction on its domain, defining the magnitude of tension as the interval between positive and negative expectations of the outcome.
In the field of computer science, there has been a line of research that models the mental state of the reader in order to create an intense story (Niehaus and Young (2014); O'Neill and Riedl (2014)). Li et al. (2018) designed a scheme for story structures that considers dramatic tension changes and the narrative structure suggested by Helm and MacNeish (1967), and annotated the story structure of short stories and personal anecdotes. For the analysis of emotion, Cowie and Sawey (2011) annotated the intensity of laughter and the degree of positive emotion in videos of babies. Metallinou and Narayanan (2013) annotated activation, valence, and dominance, on the assumption that the three attributes represent the state of emotion in video. Antony et al. (2014) annotated changes in

Data
We used TED Talks as the dataset for tracking tension development. TED Talks are conference presentations that convey ideas on various topics in a few minutes, and their videos have been used for emotion analysis and the assessment of engagement, exploiting the highly reliable English subtitles that are precisely synchronized to the video (Neumann and Vu (2019); Haider et al. (2017)). We chose TED Talks for the annotation of tension development for two specific reasons: (1) Due to the nature of public lectures, many utterances raise the tension to keep the attention of the audience.
(2) The applause or laughter of the audience, which may be highly related to tension development, is also recorded in the video.
From the archives of TED Talks, we randomly selected 20 videos whose running time is in the range of 10-20 minutes. We divided each of the 20 videos into a set of small video clips, where the division was based on the subtitles so that each clip corresponds to one sentence. To this end, the English subtitles were split into sentences. We obtained a dataset containing 3,597 video clips with a total duration of 301 minutes. Each sentence that corresponds to a video clip consists of 14 words on

Method
Our method uses a neural-network-based prediction model and provides its predictions to the annotators as the initial values for the options that the annotator is asked to fill out. In this way, the annotation task, originally one of choosing the correct label for a given video clip, is transformed into an evaluation task of judging whether the model's prediction is correct. Figure 2 shows the architecture of our model. The model predicts the label for each video clip sequentially and utilizes three features: subtitles, audio, and the labels previously chosen for the preceding video clips. The audio of a video clip was encoded into a vector using a CNN. We used the pyAudioAnalysis software (Giannakopoulos, 2015) to extract 34 features such as MFCC at a rate of 30 frames/sec, and the features were passed to the CNN. The CNN consists of three 1D convolutional layers, each followed by 1D max-pooling with a ReLU activation function. The lecture's subtitles were encoded into a vector using a pre-trained uncased BERT-base model (Devlin et al., 2019). The previously chosen k labels were encoded into a vector using an RNN. The three vectors for the three features were concatenated into a single vector, which was then passed to the output layer, a fully connected layer.
Figure 1 gives an overview of our annotation of tension development. First, as shown in Table 1, 10 TED Talks videos were divided into group A and group B. In-house annotation was performed on group A, and the results, which we call data A, were used for training the prediction model. Then, group B was annotated through Amazon Mechanical Turk (AMT), a crowdsourcing platform. For group B, the crowdsourced annotation was conducted in two phases. First, every video in group B was annotated via AMT using our method (data B). Second, independently of the first phase, every video in group B was annotated via AMT without using our method (data B′).
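The sequential use of previously chosen labels in the prediction model can be illustrated with a minimal sketch. It is not the released implementation: `annotate_sequentially`, `toy_predict`, and `accept_default` are hypothetical names, the real model (CNN, BERT, and RNN encoders) is replaced by a trivial stand-in, and the history length of five follows the crowdsourcing setup described later in the paper.

```python
from collections import deque
from typing import Callable, List, Sequence

LABELS = ("down", "similar", "up")


def annotate_sequentially(
    clips: Sequence[str],
    predict: Callable[[str, List[str]], str],
    confirm: Callable[[str, str], str],
    k: int = 5,
) -> List[str]:
    """Label clips in order; each prediction sees at most the last k chosen labels."""
    history: deque = deque(maxlen=k)  # the oldest label drops out automatically
    chosen: List[str] = []
    for clip in clips:
        default = predict(clip, list(history))  # prediction shown as the initial value
        label = confirm(clip, default)          # annotator keeps or overrides it
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
        history.append(label)                   # the chosen label, not the prediction
        chosen.append(label)
    return chosen


def toy_predict(clip: str, history: List[str]) -> str:
    """Hypothetical stand-in for the model: questions raise tension,
    otherwise the most recent label is repeated."""
    if clip.endswith("?"):
        return "up"
    return history[-1] if history else "similar"


def accept_default(clip: str, default: str) -> str:
    """A simulated annotator who always agrees with the prediction."""
    return default


if __name__ == "__main__":
    subtitles = ["Hello.", "Why?", "Because."]
    print(annotate_sequentially(subtitles, toy_predict, accept_default))
```

Note that the history is built from the labels the annotator actually chose, so an override on one clip influences the predictions for the following clips.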
For each video, the annotators watched the video clips in their original order and annotated each clip with one of the three labels: up, down, and similar. Up indicates that the tension is increasing, and down indicates that it is decreasing. Similar indicates that the tension is not changing. As it is disruptive for the annotator to repeatedly click on the video to play and pause it, we built an annotation tool to prevent such disruption (Figure 3).
Due to copyright restrictions, we could not post the TED Talks videos directly online. Instead, we provided the annotators, or crowdsourcing workers, with the videos on TED's official YouTube channel via an embedded player, controlled through the APIs provided by the YouTube player. When the annotator enters a shortcut key to move to the next video clip or presses the play button of the video clip, the clip is played. When the clip reaches its end, an input window for annotation is displayed. The annotator can then annotate the clip and proceed to the next clip. We also provided the subtitles explicitly to the annotators.

Annotation Scheme
The tension development within each video clip was annotated with one of three values (up, down, or similar). We defined each of the three labels based on the specific circumstances in Ta  thinking, surprised, annoyed, and confused, correspond to up. If a video clip can be described by one of the five circumstances, we defined the video clip to have the label up. Similarly, three circumstances, namely relieved, funny, and boring, correspond to the label down. If a video clip is judged to be neither up nor down, we defined it as having the label similar. Note that the definitions of the labels are designed specifically for the domain of public lectures. For example, ridiculing someone in everyday life may increase the tension, but in lectures it is often intended to help the audience relax and listen comfortably (Meyer, 2000). Therefore, we set it as a circumstance for down.
To help the annotators follow the annotation guidelines intuitively, and for cases where the annotators forget the details of the guidelines (that is, the specification of the circumstances), we provided Self-Assessment Manikins (SAMs) to the annotators, as shown in Table 2. Providing SAMs to annotators has been acknowledged to be an effective method for emotion-related annotation tasks (Bradley and Lang (1994); Yadati et al. (2013); Boccignone et al. (2017)).

In-house Annotation
The in-house annotation method was used to annotate 1,736 video clips (group A). A total of five annotators participated, and three annotators anno-   (2013)). For in-house annotations, we obtained the agreements shown in Table 5. Pearson's correlation and Cronbach's α were measured on the cumulative sum of the scores.
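As an illustration of how Cronbach's α could be computed on cumulative score sums, the following is a minimal sketch using the standard formula α = K/(K-1) · (1 - Σ item variances / variance of totals), with annotators as items and clips as observations. The numeric scoring of the labels (+1 for up, 0 for similar, -1 for down) is an assumption made for this example, and `cumulative` and `cronbach_alpha` are hypothetical helper names.

```python
from itertools import accumulate
from statistics import variance
from typing import List, Sequence


def cumulative(scores: Sequence[int]) -> List[int]:
    """Running sum of per-clip scores, i.e. a tension curve over the talk."""
    return list(accumulate(scores))


def cronbach_alpha(series: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for K annotators, each giving one value per clip.

    The function divides by the variance of the per-clip totals, so it
    fails if all totals are identical.
    """
    k = len(series)
    if k < 2:
        raise ValueError("at least two annotators are required")
    item_variance = sum(variance(s) for s in series)
    totals = [sum(values) for values in zip(*series)]
    return k / (k - 1) * (1 - item_variance / variance(totals))


if __name__ == "__main__":
    # Assumed scoring: up = +1, similar = 0, down = -1 (three annotators, four clips).
    annotators = [
        [1, 1, -1, 0],
        [1, 0, -1, 0],
        [1, 1, 0, -1],
    ]
    curves = [cumulative(a) for a in annotators]
    print(round(cronbach_alpha(curves), 3))
```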

Crowdsourcing Annotation
Of the data collected via in-house annotation of the group A videos, 70% were used as the training set and 30% as the test set to train and evaluate the model (Table 3). To set the ground truth from the annotations of three people on the same video clip, we used majority voting among the down, similar, and up labels. If each label was selected once, the label similar was set as the ground truth. In the crowdsourced annotation, the videos in group B were annotated with and without machine assistance, by three annotators each (Figure 5). Video clips annotated without machine assistance were annotated using the same interface as the one used for the in-house annotation. During machine-assisted annotation, the values predicted by the model were presented along with their probabilities, and the label with the highest probability was given to the annotator as the default value. The trained model provided predicted values in real time using the subtitles, the sound of the video clips, and the tension values that the user had annotated for the previous five video clips. Annotators were instructed to refer to the automatic prediction value: "Please note that the value of the predicted tension is automatically given as the default value. If your judgment is different, change the value according to your judgment. If the default value matches your judgment, you may move on to the next video clip." We used the Amazon Mechanical Turk (AMT) service for crowdsourcing, providing workers with the annotation guidelines and the URL of the web-based annotation tool. Each worker was allowed to participate in annotating several different videos. Workers with more than 50 approved HITs and a HIT approval rate above 95% were allowed to join. There were a total of 47 annotators.
Figure 6: Confusion matrix of the prediction model on the test set
Figure 6 shows the confusion matrix of the prediction model on the test set.
The performance (F1 score) for the down label (0.64) was higher than that for up (0.44). As a result of the annotation, 11,166 annotation values were obtained for 10 videos with 1,861 video clips (group B). For machine-assisted annotations, the distribution of down, similar, and up was 895 (16.0%), 2,862 (51.3%), and 1,826 (32.7%), respectively. For annotations without machine assistance, the distribution was 977 (17.5%), 2,372 (42.4%), and 2,232 (39.9%). Table 5 shows the agreement among the annotation results. In-house annotations were higher in all three metrics than the crowdsourced annotations without machine assistance. Among the crowdsourced annotations, the machine-assisted annotations showed higher levels of agreement than the non-assisted annotations of the control group.
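The ground-truth rule (majority voting, with similar as the fallback when all three labels differ) and the choice of the default value (the label with the highest predicted probability), both described in the crowdsourcing setup above, can be sketched as follows. This is a minimal illustration that assumes exactly three votes per clip; `ground_truth` and `default_label` are hypothetical names, and the probabilities in the usage example are made up.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def ground_truth(votes: Sequence[str]) -> str:
    """Majority vote over three labels; a three-way split falls back to 'similar'."""
    if len(votes) != 3:
        raise ValueError("this sketch assumes exactly three annotators per clip")
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else "similar"


def default_label(probabilities: Dict[str, float]) -> Tuple[str, float]:
    """Return the most probable label and its probability, to pre-fill the form."""
    label = max(probabilities, key=probabilities.get)
    return label, probabilities[label]


if __name__ == "__main__":
    print(ground_truth(["up", "down", "up"]))
    print(ground_truth(["up", "down", "similar"]))
    # Made-up model output for one clip.
    print(default_label({"down": 0.05, "similar": 0.15, "up": 0.80}))
```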

Analysis of Annotations
We analyzed whether the improvement in agreement was merely an artifact of the bias introduced by the predicted labels. For this analysis, the annotations were compared to gold labels. The gold labels were set from the annotations made by one of the authors, without machine assistance, for 4 videos selected from group B. Figure 7 shows an example of the cumulative sum of the tension score for the gold labels, the machine-assisted annotations, and the annotations of the control group. Averaged over the 4 videos, the correlation of the machine-assisted annotations with the gold labels was 0.861, higher than the control group's mean correlation of 0.466. With machine assistance, the annotation values were more in line with the trend of the gold labels.
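The comparison against gold labels on the cumulative sum of the tension score can be sketched as below. This is an illustration under stated assumptions, not the analysis code of the paper: the scoring of up, similar, and down as +1, 0, and -1 is assumed, `tension_curve` and `pearson` are hypothetical names, and a single label sequence is compared to the gold sequence without any aggregation over annotators.

```python
from itertools import accumulate
from math import sqrt
from typing import List, Sequence

# Assumed numeric scoring of the three labels.
SCORE = {"down": -1, "similar": 0, "up": 1}


def tension_curve(labels: Sequence[str]) -> List[int]:
    """Cumulative sum of label scores over the clips of one video."""
    return list(accumulate(SCORE[label] for label in labels))


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient of two equally long sequences.

    The function divides by the standard deviations, so it fails for a
    constant sequence.
    """
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


if __name__ == "__main__":
    gold = ["up", "up", "similar", "down"]
    annotated = ["up", "similar", "similar", "down"]
    print(pearson(tension_curve(gold), tension_curve(annotated)))
```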
The mean correlation between the machine predictions themselves and the gold labels was 0.867. This means that machine-assisted annotators could achieve results closer to the gold labels than the control group simply by accepting all the predicted values. However, machine-assisted annotators changed 26.5% of the labels presented as default values by the model (Figure 8). The change ratios of the predicted values for down, similar, and up are 17.7%, 28.8%, and 24.3%, respectively. This produced a difference between machine predictions and machine-assisted annotations, as illustrated in Figure 7. The average of the probabilities (as shown in Figure 5) presented with the labels set as default values by the prediction model was 90.4%. When the user changed the default value, the average of the probabilities was 87.0%. When the user did not change the default value, the average was 91.6%.
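A change ratio of the kind reported above can be computed from pairs of default and final labels, as in the following minimal sketch. It assumes that the per-label ratios are grouped by the default (predicted) label; `change_ratios` is a hypothetical name and the pairs in the usage example are made up.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def change_ratios(pairs: Sequence[Tuple[str, str]]) -> Tuple[float, Dict[str, float]]:
    """Return the overall and per-default-label share of overridden defaults.

    Each pair is (default label shown to the annotator, label finally chosen).
    """
    if not pairs:
        raise ValueError("no annotations given")
    shown = Counter(default for default, _ in pairs)
    changed = Counter(default for default, final in pairs if default != final)
    overall = sum(changed.values()) / len(pairs)
    per_label = {label: changed[label] / shown[label] for label in shown}
    return overall, per_label


if __name__ == "__main__":
    # Made-up (default, final) pairs for four clips.
    example = [("up", "up"), ("up", "down"), ("similar", "similar"), ("down", "down")]
    print(change_ratios(example))
```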
The selection times in Table 5 represent the time taken to select the tension label, measured from the end of the video clip's playback. For machine-assisted annotations, if the annotator did not change the default value, the time between the end of the current video clip and the start of the next video clip was taken as the selection time. With machine assistance, we expected the annotation time to be reduced, because the step of selecting a label disappears whenever the model's prediction and the annotator's judgment coincide. However, there was no significant difference compared to the control group.

Conclusion
In this paper, we introduced a method for machine-assisted annotation of tension development. Our method utilizes a prediction model to provide the predicted result to the annotators, so that the annotation task is turned into an evaluation task of inspecting whether the model's prediction is correct. We find that our method enhances the agreement of the crowdsourced annotations with the gold standard annotation in a small trial of 3 annotators. We also find that our method does not particularly affect the time taken for the annotation.
We proposed a new annotation scheme using the Self-Assessment Manikin (SAM) to annotate tension development. Converting the annotation task into a verification task via machine assistance makes the results more closely aligned with the gold standard than those of the control group.