YouMakeup: A Large-Scale Domain-Specific Multimodal Dataset for Fine-Grained Semantic Comprehension

Multimodal semantic comprehension, including tasks such as visual question answering and caption generation, has attracted increasing research interest in recent years. However, due to limited data, fine-grained semantic comprehension, which requires capturing the semantic details of multimodal content, has not been well investigated. In this work, we introduce “YouMakeup”, a large-scale multimodal instructional video dataset to support fine-grained semantic comprehension research in a specific domain. YouMakeup contains 2,800 videos from YouTube, spanning more than 420 hours in total. Each video is annotated with a sequence of natural language descriptions for instructional steps, grounded in temporal video ranges and spatial facial areas. The annotated steps in a video involve subtle differences in actions, products and regions, which require fine-grained understanding and reasoning both temporally and spatially. To evaluate models’ ability for fine-grained comprehension, we further propose two groups of tasks, generation tasks and visual question answering tasks, covering different aspects. We also establish a baseline for step caption generation for future comparison. The dataset will be publicly available at https://github.com/AIM3-RUC/YouMakeup to support research in fine-grained semantic comprehension.


Introduction
Videos, which naturally contain rich multimodal semantic information, have been one of the main sources for knowledge acquisition. In recent years, video semantic comprehension has attracted much research attention, with a number of datasets and tasks being proposed, such as activity recognition (Xu et al., 2017), dense video captioning (Krishna et al., 2017a) and visual question answering (Lei et al., 2018, 2019). However, most works are limited to capturing coarse semantic information, such as action recognition in broad categories. Fine-grained comprehension has not been fully explored, especially for discriminating actions with subtle differences or understanding the temporal relations of actions in a certain activity.
Instructional videos, which contain a series of steps to accomplish certain tasks, are suitable sources for investigating fine-grained semantic comprehension and reasoning. As shown in Table 1, a number of instructional video datasets have been released. However, current datasets are either too small in scale or too coarsely annotated to support fine-grained analysis. For example, the datasets collected in (Alayrac et al., 2016; Rohrbach et al., 2012; Stein and McKenna, 2013; Kuehne et al., 2014; Das et al., 2013) only contain hundreds or fewer videos and actions. Although the COIN dataset (Tang et al., 2019) is large in scale, it aims to cover a wide range of action categories instead of distinguishing actions with subtle differences, and it also lacks fine-grained step annotations. The YouCook2 dataset (Zhou et al., 2018) contains a fairly large number of videos and temporally grounded sentences in the cooking domain. However, since different cooking steps exhibit apparent visual variation in actions, food and kitchen utensils, it might not require fine-grained reasoning over temporal and spatial dimensions to identify different steps.
To overcome these limitations, we collect a new instructional video dataset named "YouMakeup" in the specific makeup domain for fine-grained multimodal semantic comprehension. The advantages of opting for the makeup domain are threefold. Firstly, makeup instructional videos are more fine-grained in nature because different steps share the same facial background but contain at least one subtle but critical difference in action, tool or facial area. Therefore, it requires fine-grained discrimination within temporal and spatial context. Secondly, there are abundant makeup instructional videos on the Internet with manual commentary or scripts, which makes them easy to collect and annotate. Last but not least, makeup video analysis is of great value, since it can facilitate both editing and searching for cosmetic companies and users. The collected YouMakeup dataset consists of 2,800 makeup videos crawled from YouTube, spanning more than 420 hours. As shown in Figure 1, we manually annotate a sequence of natural language sentences to describe the instructional steps of each video, and each step is grounded both in a temporal video segment and in spatial facial areas in fine-grained detail. There are 30,626 steps in total, with 10.9 steps on average per video, indicating the complexity of makeup activities.
[Table 1: Comparison of instructional video datasets. Columns: Domain, Source, # of steps, Step Annotation (Type, T.G, S.G).]
To comprehensively evaluate fine-grained analysis, we propose two groups of potential semantic comprehension tasks on YouMakeup: Generation tasks and Question Answering (QA) tasks. The Generation tasks include temporal step segmentation, step caption generation and spatial area grounding, which together reflect overall semantic comprehension performance. To further measure video semantic reasoning ability, we design four QA tasks for detailed evaluation from four aspects, as illustrated in Figure 8. The Facial Image Ordering task aims to track subtle changes in facial appearance after each step, which requires reasoning about the influence of actions on object states. The Step Ordering task is to sort step descriptions according to their temporal order in the video, requiring temporal action reasoning and visual semantic matching. The Time Range Selection task requires precise temporal localization of a specific step in the video, forcing models to distinguish fine-grained differences between makeup steps. The Theme Inference task aims to select the best theme for the video, which demands high-level summarization of the video content.
The main contributions of this work are threefold: 1) We introduce a large-scale fine-grained instructional video dataset, "YouMakeup", in the makeup domain to support research on fine-grained multimodal semantic comprehension. To the best of our knowledge, it is the largest instructional video dataset in a specific domain with fine-grained temporally and spatially grounded annotations. 2) We propose two groups of tasks to evaluate fine-grained video comprehension abilities, including generation tasks and four QA tasks, which require fine-grained semantic understanding and reasoning in different aspects and at different levels. 3) We propose a baseline framework for the step caption generation task to demonstrate that fine-grained analysis and long temporal dependencies are essential for multimodal semantic comprehension.

Instructional Video Datasets
Existing instructional video datasets can be divided into two groups according to their domain diversity, as summarized in Table 1. The first group aims to cover diverse activities from different domains. The dataset of (Alayrac et al., 2016) contains 5 tasks such as "Making a coffee" and "Changing car tire". COIN (Tang et al., 2019) is a large-scale dataset which contains videos of 180 different tasks in 12 domains related to daily life. These datasets are constructed to improve models' generalization ability rather than to support fine-grained semantic comprehension. The other group focuses on a specific domain, such as furniture assembly or cooking. The datasets of (Rohrbach et al., 2012; Stein and McKenna, 2013; Kuehne et al., 2014) contain videos of simple cooking activities. YouCook (Das et al., 2013) consists of 88 videos with long text summaries. These datasets are limited in both the number of videos and the number of actions. YouCook2 (Zhou et al., 2018) is relatively large, with 2,000 videos spanning 176 hours. Though cooking events are semantically rich, containing various foods, kitchen utensils and actions (Nishimura et al., 2019; Hahn et al., 2018), such variety makes it hard to measure the fine-grained comprehension ability of models. For example, the objects in "Sprinkle salt and pepper to the taste" and "Place the bacon at the top" are so different that it might not be necessary to understand the full details of the actions to distinguish the two steps.
The strengths of our YouMakeup dataset compared with previous works lie in two aspects: (1) It is large in scale, with 420 hours of video in total. To the best of our knowledge, it is the largest instructional dataset in a specific domain with rich fine-grained annotations. (2) Facial makeup is naturally suitable for fine-grained comprehension, because all activities occur on the local facial area with subtle differences.

Video Comprehension Tasks
A wide range of tasks have been proposed for semantic comprehension of videos, such as action detection, dense video captioning and video question answering. The general video captioning task requires generating a single sentence for the whole video, which cannot describe the video content in detail, especially for long videos. In the dense video captioning task, the model therefore needs to detect meaningful events in the video and generate a sentence to describe each one. Compared to dense captioning tasks that focus on activities with very different actions, such as the ActivityNet challenge (Krishna et al., 2017a), makeup instructional videos are more fine-grained and contain actions with subtle differences, posing more challenges for semantic comprehension.
Question answering is another effective way to evaluate semantic understanding. Apart from image-based QA datasets such as (Malinowski and Fritz, 2014; Antol et al., 2015; Ren et al., 2015a; Johnson et al., 2017), several video-based datasets have been released to explore spatial and temporal inference over video content. However, they mainly focus on comprehension within short video clips that contain simple activities and interactions, such as (Jang et al., 2017; Kim et al., 2016; Tapaswi et al., 2016; Maharaj et al., 2017). The TVQA dataset (Lei et al., 2018, 2019) is constructed from complex TV shows, but each question is associated with a short clip of up to 90 seconds and focuses more on the joint understanding of visual and speech content. In comparison, our proposed four QA tasks on the YouMakeup dataset are used to evaluate video semantic reasoning abilities from different aspects, such as spatial and temporal understanding for long videos, causality reasoning of actions and

Data Collection
Our goal is to build a large-scale multimodal instructional video dataset in the makeup domain to support fine-grained semantic comprehension research. We start by collecting a list of famous cosmetic brands, such as Chanel and Mac, and beauty bloggers with more than tens of thousands of followers. These companies and bloggers are authoritative and professional in the makeup domain, with many people learning makeup skills from them, and the videos on their official channels are of high quality. Based on the list, we search their official channels on YouTube and crawl makeup instructional videos together with the available metadata, such as video id, duration, title, tags and the English subtitles automatically generated by YouTube. We process the subtitles into complete sentences aligned with video time stamps.
Since the specific names of cosmetic brands are not important for understanding makeup procedures, we filter them out of the raw titles and subtitles of the videos. We first create an initial list of cosmetic brands and then refine it gradually by checking word frequencies and finding similar words via a word embedding model (Tomas Mikolov, 2013). Finally, we use the refined list to remove cosmetic brands from the raw texts. We also create lists of facial areas and cosmetic products through a similar process.
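The final removal step can be illustrated with a minimal sketch. The brand list, the helper name `remove_brands`, and the whole-word, case-insensitive matching rule are assumptions made for illustration only; the text above does not specify how the removal is implemented.

```python
import re

# Hypothetical brand list; the actual list was refined iteratively by the authors.
BRANDS = ["Chanel", "Mac"]

def remove_brands(text, brands=BRANDS):
    """Remove whole-word, case-insensitive brand mentions and tidy whitespace."""
    if not brands:
        return text
    # Longest names first so that multi-word brands win over their prefixes.
    ordered = sorted(brands, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(b) for b in ordered) + r")\b",
        re.IGNORECASE,
    )
    cleaned = pattern.sub("", text)
    # Collapse the extra spaces left behind by the removal.
    return re.sub(r"\s+", " ", cleaned).strip()

print(remove_brands("Apply the Chanel lipstick and MAC powder"))
# prints "Apply the lipstick and powder"
```

Whole-word matching is chosen here so that ordinary words that merely start with a brand name are left untouched.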

Step Annotation
We build an annotation system for step annotation. Figure 2 shows its interface. The annotation page displays the video and its English subtitles to assist the annotation. The products and facial areas mentioned in the subtitles are highlighted in red to help annotators focus on the relevant information. Annotators are asked to segment the video into a series of steps, which includes labeling the start and end time of each step, selecting the related facial areas from the given facial area list, and writing a caption that describes the step according to the video and subtitles. We recruit female college students with more than two years of makeup experience as annotators. Before starting the annotation, each annotator is asked to annotate a test video to verify their capability. During annotation, each video is annotated by one person and reviewed by another to ensure annotation quality.
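The information recorded for each step (start time, end time, facial areas and caption) could be represented as in the sketch below. The class name `StepAnnotation`, its field names and the validation rules are hypothetical; the released annotation format may differ.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class StepAnnotation:
    """One annotated step; field names are illustrative, not the released schema."""
    start: float      # start time in seconds
    end: float        # end time in seconds
    areas: List[str]  # facial areas selected from the given list
    caption: str      # natural language description of the step

    def __post_init__(self):
        if self.end <= self.start:
            raise ValueError("a step must end after it starts")
        if not self.areas:
            raise ValueError("a step needs at least one facial area")

    @property
    def duration(self) -> float:
        return self.end - self.start

def is_temporally_ordered(steps: List[StepAnnotation]) -> bool:
    """Check that the steps are listed in non-decreasing order of start time."""
    return all(a.start <= b.start for a, b in zip(steps, steps[1:]))
```

A simple consistency check like `is_temporally_ordered` is the kind of rule a reviewer pass could apply, although the review procedure used by the authors is manual.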

Facial Image Annotation
We extract two groups of facial images from the video. The first group is used to understand the effect of makeup activities on facial appearance, which supports our proposed QA task described in Section 4.2. We extract images at the beginning and the end of each step to capture the facial appearance before and after the step. To select images containing faces, we extract the images in the 40 frames around each time stamp and filter them with a pretrained Multi-Task CNN (MTCNN) for face detection. We then manually filter out unsuitable images, such as side-face images and pure product images.
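As a rough illustration of the frame window, the sketch below lists the candidate frame indices around a time stamp. It assumes a window centred on the time stamp and clipped to the video bounds; the function name `candidate_frames` and its parameters are hypothetical, and the face detector itself is not shown.

```python
def candidate_frames(timestamp_s, fps, total_frames, window=40):
    """Return the frame indices in a window centred on a time stamp.

    Assumes the window is split evenly before and after the time stamp and
    is clipped at the start and end of the video.
    """
    center = int(round(timestamp_s * fps))
    half = window // 2
    start = max(0, center - half)
    end = min(total_frames, center + half)
    return list(range(start, end))

# Frames around the 10-second mark of a 25 fps video.
frames = candidate_frames(10.0, 25.0, total_frames=10000)
print(frames[0], frames[-1], len(frames))  # prints "230 269 40"
```

Near the start or end of a video the window is shorter than 40 frames under this clipping rule.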
The second group is used to localize all the facial areas referred to in each step for spatial grounding. We extract key frames within the annotated segment of each step by comparing the similarity between frames, followed by manual filtering of redundant frames. Then we automatically detect facial landmarks in each facial image and align the image regions with the corresponding facial areas annotated for the step in Section 3.1.1. Finally, we ask annotators to adjust the bounding boxes of these facial areas on the images to obtain the final grounded facial areas for each step.

Theme Annotation
Inferring the theme of a video is a basic ability to understand the video content. Therefore, we annotate the theme for each makeup video. The original title of the video usually summarizes the content or highlights specific features in the video, which can be treated as the theme of the video. As mentioned above, the specific cosmetic brand names are removed in the title for generalization. We further ask annotators to refine the title with the help of related meta information to make the theme more accurate.

Dataset Statistics
The final YouMakeup dataset contains 2,800 videos, spanning 420 hours and 50 minutes, with rich annotations. The videos can mainly be divided into three categories, as shown in Figure 3: 1) makeup for special occasions, such as school days or wedding days; 2) makeup tips for specific facial areas or cosmetic products, such as eye makeup or wearing red lipstick; 3) makeup transformations, such as celebrity transformations. The first and third types usually create full looks, while the second focuses on a specific step or facial area. We split the dataset into training, testing and validation sets with a ratio of 70%:20%:10% and set up all the tasks on this split.
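For concreteness, a 70%:20%:10% split of 2,800 videos yields 1,960 training, 560 testing and 280 validation videos. The sketch below shows one way to produce such a split; the seeded random shuffle, the function name `split_videos` and the video ids are assumptions, since the assignment procedure is not described here.

```python
import random

def split_videos(video_ids, seed=0):
    """Split ids into train/test/val by 70%:20%:10% after a seeded shuffle."""
    ids = sorted(video_ids)
    random.Random(seed).shuffle(ids)
    # Integer arithmetic avoids floating-point rounding at the boundaries.
    n_train = len(ids) * 70 // 100
    n_test = len(ids) * 20 // 100
    train = ids[:n_train]
    test = ids[n_train:n_train + n_test]
    val = ids[n_train + n_test:]
    return train, test, val

# Hypothetical ids standing in for the 2,800 YouTube video ids.
train, test, val = split_videos([f"video_{i:04d}" for i in range(2800)])
print(len(train), len(test), len(val))  # prints "1960 560 280"
```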

Video Length
Different from previous datasets containing videos of similar duration (Lei et al., 2018; Tang et al., 2019), videos in the YouMakeup dataset are of various lengths, which reflects the complexity of real-world situations. As shown in Figure 4(a), the length of the videos varies from 15 seconds to 1 hour, with 9 minutes on average. The large diversity in length results from the diverse video categories and styles. For example, tutorials from companies are usually short and come straight to the point, while those from beauty bloggers are more complex because they may show makeup skills or share opinions about products in detail.

Makeup Steps
There are 30,626 annotated steps in total, with 10.9 steps per video on average in YouMakeup. Figure 4(b) shows the distribution of the number of steps. Compared with instructional video datasets in general domains, YouMakeup is more complex, with more steps on average, and therefore brings more challenges for semantic understanding. Each step is associated with at least one and up to seven facial areas. The frequency of grounded facial areas is presented in Figure 5. All these areas are close to each other on the face, and some of them overlap. For example, the brow and the brow bone are closely adjacent to each other, and the under-eye area overlaps with the cheeks. Therefore, fine-grained understanding is required to distinguish these areas. More than 1,500 unique words occur in the step captions. Figure 6(a) shows the word cloud of the 100 most frequent words, excluding stopwords. The most frequent words include actions, products, facial areas and tools, as summarized in Table 2. These four categories of words can be combined in various ways, generating a large number of different fine-grained makeup activities. For example, Figure 7 illustrates the fine-grained activities for the action "apply". The graph shows that the number of activities is large. Though the activities are similar, they are distinct in actions, products, facial areas or other aspects. Therefore, fine-grained comprehension is required to tell such subtle differences apart.
Table 2: Most frequent words in step captions, by category.
Actions: apply, use, blend, highlight, draw, curler, fix, clean, pat, shape, emphasize, press, sweep, correct, bake, tap, contour
Products: eyeshadow, concealer, powder, pencil, foundation, lipstick, eyeliner, blush, primer, shadow, highlighter, contour, bronzer, gel, cream, ...

Semantic Comprehension Tasks
To explore fine-grained semantic comprehension in different aspects and at different levels, we propose two groups of tasks, namely generation and question answering, which comprise 5 detailed tasks designed according to the instructional and multimodal characteristics of YouMakeup.

Task I: Step Generation
The step generation task includes identifying the temporal boundaries of steps, grounding the facial areas and generating a natural language description for each step, as shown in Figure 1. It is a classical task in instructional video analysis for evaluating a model's ability to capture the temporal flow of steps and grasp procedural knowledge. Due to the similarity between makeup activities, such as similar motions, similar product appearances and adjacent facial areas, this task calls for fine-grained comprehension of the video content. The task is closely related to dense video captioning (Krishna et al., 2017a). However, it needs to integrate makeup procedural knowledge to generate fine-grained step segmentations and step descriptions instead of independent event descriptions.

Task II: Facial Image Ordering
The facial image ordering task is to sort a series of shuffled facial images into the right order according to the ordered step captions, as shown in Figure 8(a). Instructional videos present steps for accomplishing a certain task, and tracking the changes of objects is crucial for procedure comprehension. The effect of makeup is a series of fine-grained changes in facial appearance. Some steps bring apparent changes; for example, "Apply red lipsticks on the lips" turns the lip color from nude to bold red. However, some changes can be subtle and difficult to identify, such as those of "Apply foundation on the face with brush", which may only slightly change the skin tone. In most cases, the result of each step is complex, as it depends not only on the current step but also on the prior state of the facial appearance. Tracking changes on the face is therefore an effective way to evaluate models' fine-grained comprehension ability. To form a question, we randomly choose five facial images of different steps from a video. We take their original order as the ground truth and three other random orderings as the remaining candidate answers. We finally generate 12,000 questions, including 8,400 for training, 1,200 for validation, and 2,400 for testing.
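The question construction described above can be sketched as follows. The function name `make_ordering_question`, the use of step indices in place of images, and the rule that the three distractor orderings must be distinct are assumptions for illustration.

```python
import random

def make_ordering_question(num_steps, rng, n_items=5, n_choices=4):
    """Build one ordering question from a video with `num_steps` steps.

    Steps are represented by their indices; the ascending order of the
    sampled indices is the ground truth, and the remaining options are
    distinct random permutations of the same indices.
    """
    if num_steps < n_items:
        raise ValueError("video has too few steps for one question")
    picked = sorted(rng.sample(range(num_steps), n_items))
    truth = tuple(picked)
    options = {truth}
    while len(options) < n_choices:
        candidate = picked[:]
        rng.shuffle(candidate)
        options.add(tuple(candidate))
    # Sort before shuffling so the result depends only on the rng state.
    options = sorted(options)
    rng.shuffle(options)
    return {"options": options, "answer": options.index(truth)}

question = make_ordering_question(num_steps=11, rng=random.Random(0))
print(question["options"][question["answer"]])
```

Because the options are kept in a set while sampling, the ground-truth ordering can never appear twice among the four choices.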

Task III: Step Ordering
The step ordering task evaluates the capability of sorting a series of step captions into the right order according to the video. For human beings, comprehension is the prerequisite of ordering. The step ordering task aims at developing a model's comprehension ability in a multimodal scenario, which calls for joint understanding across different modalities. Models need to align natural language step captions with the video content to solve the problem, as shown in Figure 8(b). This requires fine-grained understanding of both the text and the video content for temporal action reasoning and visual semantic matching.
We select videos with more than four steps to generate questions. For each question, we provide five captions of different steps from the video and four different order choices as the candidate answers, including the original one as the ground truth. We finally collect 12,000 questions for this task, including 8,400 for training, 1,200 for validation, and 2,400 for testing.

Task IV: Time Range Selection
In the time range selection task, models are required to accurately localize a specific step in the video. We set the task in QA form, where the video, the step caption and four candidate answers are provided. As in the case shown in Figure 8(c), given the caption, the model needs to select the most accurate time range from the four candidate answers after reviewing the whole video. To accomplish this task, models need to jointly understand the step caption and the video, and distinguish the fine-grained differences between the makeup activities in the video clips to find the best answer.
To form a question, we first select one step from the video, provide its step caption as the question and set its time range as the ground truth. Then we choose three other steps from the same video whose facial areas overlap with those of the question step, and set their time ranges as the other candidate answers. We finally collect 12,000 questions for this task, including 8,400 for training, 1,200 for validation and 2,400 for testing.
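A minimal sketch of this procedure is given below. The dictionary keys, the function name `make_time_range_question`, the choice to skip a step when fewer than three overlapping steps exist, and the example steps and times are all assumptions made for illustration.

```python
import random

def make_time_range_question(steps, query_idx, rng, n_distractors=3):
    """Build a time range question for steps[query_idx], or return None.

    Distractors are other steps of the same video that share at least one
    facial area with the query step.
    """
    query = steps[query_idx]
    pool = [
        s for i, s in enumerate(steps)
        if i != query_idx and set(s["areas"]) & set(query["areas"])
    ]
    if len(pool) < n_distractors:
        return None  # not enough overlapping steps in this video
    truth = (query["start"], query["end"])
    options = [truth] + [(d["start"], d["end"]) for d in rng.sample(pool, n_distractors)]
    rng.shuffle(options)
    return {"question": query["caption"], "options": options, "answer": options.index(truth)}

# Made-up steps for one video; times are in seconds.
steps = [
    {"caption": "Apply primer on the face", "areas": ["face"], "start": 10, "end": 30},
    {"caption": "Apply foundation on the face with brush", "areas": ["face"], "start": 30, "end": 75},
    {"caption": "Apply concealer under the eyes", "areas": ["under-eye", "face"], "start": 75, "end": 110},
    {"caption": "Apply powder on the face", "areas": ["face"], "start": 110, "end": 140},
    {"caption": "Apply red lipsticks on the lips", "areas": ["lips"], "start": 140, "end": 160},
]
question = make_time_range_question(steps, query_idx=1, rng=random.Random(0))
print(question["question"], question["options"][question["answer"]])
```

Choosing distractors that share a facial area with the query step is what makes the wrong answers hard to rule out from the region alone.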

Task V: Theme Inference
The theme inference task requires inferring the theme of a video. People can compress complicated content and extract its key information. Several tasks have been set up to help models develop such an ability, such as the video summarization task, which summarizes a video in several sentences. Theme inference is another way to evaluate this kind of ability. Since a video can be summarized with a focus on multiple aspects, it is difficult to treat theme inference as a generation task. We therefore set it in QA form, based on the annotation in Section 3.1.3.
As shown in Figure 8(d), we provide four candidate answers, including the ground truth. To generate the candidate answers, we build a candidate answer set from the titles and tags of all videos. Then we use FastText, Doc2Vec (Le and Mikolov, 2014) and Word2Vec (Tomas Mikolov, 2013) together for theme feature representation and search for the top 10 nearest themes for each video. The annotators then select from these themes to form the candidate answers. Given a makeup video, the model needs to select the most suitable theme from the four choices.
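The nearest-theme search can be sketched with a plain cosine-similarity ranking. This is only an illustration: it assumes each theme has already been mapped to a single feature vector and that cosine similarity is the distance measure, neither of which is stated above. The function names and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def nearest_themes(query, embeddings, k=10):
    """Return the k themes most similar to `query`, excluding the query itself."""
    q = embeddings[query]
    scored = [(cosine(q, vec), theme) for theme, vec in embeddings.items() if theme != query]
    # Highest similarity first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [theme for _, theme in scored[:k]]

# Toy 2-d vectors standing in for real theme features.
embeddings = {
    "smokey eye": [1.0, 0.0],
    "smokey night look": [0.9, 0.1],
    "natural daily makeup": [0.0, 1.0],
}
print(nearest_themes("smokey eye", embeddings, k=2))
# prints "['smokey night look', 'natural daily makeup']"
```

Retrieving near neighbours before manual selection keeps the distractor themes close to the true one, which is what makes the choice non-trivial.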
Different from the video classification task which divides videos into fixed pre-defined categories, the candidate answers in the theme inference task are natural language sentences different from each other. As shown in Figure 6(b), the themes involve diverse aspects, such as "smokey" and "natural" for makeup style, "night" and "halloween" for occasions, and "pink" and "red" for color tone. Thus, the theme inference task requires a fine-grained comprehensive understanding of the video content in various aspects.

Experiment
We build a step caption generation system to provide a baseline for our YouMakeup dataset and to demonstrate the necessity of fine-grained spatial and temporal understanding for solving the task. The system uses the ground-truth step segmentation in order to evaluate the captioning ability alone, and is evaluated with the standard captioning metrics, including BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014), ROUGE (Flick, 2004), CIDEr (Vedantam et al., 2015) and SPICE (Anderson et al., 2016).
Implementation Details. Our step caption generation model is based on the encoder-decoder captioning framework, which is widely used in video caption generation (Vinyals et al., 2015; Chen et al., 2019). The encoder converts the video clip into a fixed-dimensional vector, and the decoder then generates word sequences conditioned on the encoded vector. Since both spatial and temporal reasoning are important for the task, we propose two types of encoders: 1) Spatial encoder: Faster R-CNN (Ren et al., 2015b) pretrained on the VisualGenome dataset (Krishna et al., 2017b) is used to extract object features from a single frame. We select one frame for each video clip and detect at most 36 objects in the frame. Mean pooling is applied to the extracted object features to generate the video-level representation. 2) Temporal encoder: ResNet-152 (He et al., 2016) pretrained on the ImageNet dataset (Deng et al., 2009) is used to extract features for each frame. We extract global frame-level features every 16 frames and apply mean pooling over the temporal dimension to generate the global video-level representation. We employ an LSTM as our decoder, which contains 1 hidden layer with 512 hidden units. The Adam optimizer is used to train our model with a batch size of 128 and a learning rate of 0.0001. We train for at most 100 epochs and select the best model according to the captioning performance on the validation set.
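The pooling used by the two encoders can be illustrated with the sketch below, which operates on feature vectors that are assumed to have been extracted already. The function names are hypothetical, plain Python lists stand in for real feature tensors, and reading "every 16 frames" as keeping one frame out of every 16 is an assumption.

```python
def mean_pool(vectors):
    """Average a non-empty list of equal-length feature vectors."""
    if not vectors:
        raise ValueError("need at least one feature vector")
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def spatial_feature(object_features, max_objects=36):
    """Clip-level feature from the objects detected in one selected frame."""
    return mean_pool(object_features[:max_objects])

def temporal_feature(frame_features, stride=16):
    """Clip-level feature from one frame feature every `stride` frames."""
    return mean_pool(frame_features[::stride])

# Toy 2-d features for a 48-frame clip: frames 0, 16 and 32 are kept.
frame_features = [[float(i), 1.0] for i in range(48)]
print(temporal_feature(frame_features))  # prints "[16.0, 1.0]"
```

In both cases the result is a single fixed-dimensional vector per clip, which is what the decoder is conditioned on.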
Results. Table 3 presents the step captioning performance with ground-truth step segmentation on the testing set of YouMakeup. We can see that the combination of spatial and temporal features achieves the best performance, demonstrating that spatial and temporal information are complementary for generating step captions. The captioning model based on the spatial feature alone is inferior to that based on the temporal feature, because the spatial encoder uses only one frame per clip and thus does not utilize all the clip information. The overall CIDEr performance of the baseline step caption generation model is relatively higher than that of baselines on other video captioning datasets (Krishna et al., 2017a) due to the fine-grained characteristics of the proposed dataset. In the YouMakeup dataset, the styles of the step captions are similar, which makes it easy for the caption generation system to achieve high scores, since the evaluation metrics are not aware of the importance of different aspects of the caption, such as detailed tools and actions.
Although the generation system achieves high evaluation scores, we find that it fails to capture the fine-grained details in the makeup instructional videos. Figure 9 shows the captions generated by the baseline system using both temporal and spatial features for the first 6 steps of a specific video. According to the captions of steps 1 and 3, the system traces out the procedure roughly but lacks details such as the makeup tools and the related facial areas. The other three step captions indicate the system's weak ability in fine-grained video comprehension and confirm the fine-grained characteristics of YouMakeup. In step 2, the system mistakes the color correction palette for eyeshadow due to their similar appearance. However, eyeshadow is applied around the eyes, while the color correction palette is used on the face for color correcting. From step 4 to step 6, the system confuses the procedures of applying foundation and concealer because they are similar in both appearance and usage. To generate accurate step descriptions, the system needs to associate products with their usage and with the facial areas they are applied to, in order to grasp the subtle differences between makeup activities.

Conclusion
In this paper, we introduce a new large-scale instructional video dataset named YouMakeup for fine-grained semantic comprehension. The YouMakeup dataset contains 2,800 makeup instructional videos spanning more than 420 hours in total. Given the characteristics of makeup instructional videos and the rich annotations of temporal boundaries, grounded facial areas and natural language step descriptions, our dataset is more suitable for supporting fine-grained video comprehension research than previous datasets. We further design a generation task and four question answering tasks to thoroughly evaluate fine-grained semantic comprehension ability from different aspects and at different levels. A baseline system for step caption generation also demonstrates the necessity of fine-grained spatial and temporal information. In future work, we plan to explore the proposed tasks thoroughly. We will make the dataset publicly accessible to support research in fine-grained semantic comprehension.