Conversational Image Editing: Incremental Intent Identification in a New Dialogue Task

We present “conversational image editing”, a novel real-world application domain combining dialogue, visual information, and the use of computer vision. We discuss the importance of dialogue incrementality in this task, and build various models for incremental intent identification based on deep learning and traditional classification algorithms. We show how our model based on convolutional neural networks outperforms models based on random forests, long short term memory networks, and conditional random fields. By training embeddings on image-related dialogue corpora, we outperform pre-trained out-of-the-box embeddings on intent identification tasks. Our experiments also provide evidence that incremental intent processing may be more efficient for the user and could save time in accomplishing tasks.


Introduction
The development of digital photography has led to the advancement of digital image editing, where professionals as well as hobbyists use software tools such as Adobe Photoshop, Microsoft Photos, and so forth, to change and improve certain characteristics (brightness, contrast, etc.) of an image.
Image editing is a hard task for several reasons: (1) The task requires a sense of artistic creativity. (2) The task is time-consuming, and requires patience and experimenting with various features before settling on the final image edit. (3) Sometimes users know at an abstract level what changes they want but are unaware of the image editing steps and parameters that will result in the desired image. For example, a person's face in a photo may look flushed, but the user may not know that adjusting the saturation and the temperature settings to some specific values will change the photo to match their expectations. (4) Users are not sure what changes to perform on a given image. (5) Users are not fully aware of the features and the functionality that are supported by the given image editing tool.
Users can often benefit from conversing with experts to edit images. This can be seen in action in web services such as the Reddit Photoshop Request forum and Zhopped. These web services include two types of users: expert editors who know how to edit the photographs, and novice users who post their photographs and request changes to be made. If the editor needs further clarification regarding the requested change, they post their query and wait for a response from the user. The conversational exchanges also happen through edit feedback where the editor interprets the user request and posts the edited photographs. The user can reply with further requests for changes until they are fully satisfied. Due to this message-forum-like setup, users do not have the freedom to request changes in real time (at the same time as the changes are actually being performed), and hence often end up with edited images that do not fully match their requests. Furthermore, the editors are often unable to provide suggestions that could make the photograph better fit the user's narrative for image editing.
In this setup the users can benefit greatly from conversing with an expert image editor in real time who can understand the requests, perform the editing, and provide feedback or suggestions as the editing is being performed. Our ultimate goal is to build a dialogue system with such capabilities.
Conversational image editing is a task particularly well suited for incremental dialogue processing. It requires a lot of fine-grained changes (e.g., changing brightness to a specific value), which often cannot be just narrated with a command. In order to perform such fine-grained changes to the user's liking, it is necessary that the editor understands the user utterances incrementally (word-by-word) and in real time, instead of waiting until the user has finished their utterance. For example, if the user wants to increase the brightness, they could utter "more, more, more" until the desired change has been achieved. The changes should occur as soon as the user has uttered "more" and continue happening while the user keeps saying "more, more".
In this paper, our contributions are as follows: (1) We introduce "conversational image editing", a novel dialogue application that combines natural language dialogue with visual information and computer vision. Ultimately a dialogue system that can perform image editing should be able to understand what part of the image the user is referring to, e.g., when the user says "remove the tree".
(2) We provide a new annotation scheme for incremental dialogue intentions. (3) We perform intent identification experiments, and show that a convolutional neural network model outperforms other state-of-the-art models based on deep learning and traditional classification algorithms. Furthermore, embeddings trained on image-related corpora lead to better performance than generic out-of-the-box embeddings. (4) We calculate the impact of varying confidence thresholds (above which the classifier's prediction is considered) on classification accuracy and savings in terms of number of words. Our analysis provides evidence that incremental intent processing may be more efficient for the user and save time in accomplishing tasks. To the best of our knowledge this is the first time in the literature that the impact of incremental intent understanding on savings in terms of number of words (or time) is explicitly measured. DeVault et al. (2011) measured the stability of natural language understanding results as a function of time but did not explicitly measure savings in terms of number of words or time.

Related Work
Combining computer vision and language is a topic that has recently drawn much attention. Some approaches assume that there are manual annotations available for mapping words or phrases to image regions or features, while other approaches employ computer vision techniques. Research is facilitated by publicly available data sets such as MS COCO (Lin et al., 2014) and Visual Genome (Krishna et al., 2017). Typically image and language corpora consist of digital photographs paired with crowdsourced captions, and sometimes mappings of words and captions to specific parts of an image. Yao et al. (2010) is an example of a work relying on manual input. They developed a semiautomatic method for parsing images from the Internet to build visual knowledge representation graphs. On the other hand, the following works did not rely on manual annotations. Feng and Lapata (2013) generated captions from news articles and their corresponding images. Mitchell et al. (2012) and Kulkarni et al. (2013) built systems for understanding and generating image descriptions.
Due to space constraints, below we focus on work that combines computer vision or visual references (enabled through manual annotations) and language in the context of a dialogue task, which is most relevant to our work. Antol et al. (2015) introduced the "visual question answering" task. Here the goal is to provide a natural language answer, given an image and a natural language question about the image. Convolutional neural networks (CNNs) were employed for encoding the images (Krizhevsky et al., 2012). This was later modeled as a dialogue-based question-answering task in Das et al. (2017). These works used images from the MS COCO data set. de Vries et al. (2017) introduced "GuessWhat?!", a two-player game where the goal is to find an unknown object in a rich image scene by asking a series of questions. They used images from MS COCO and CNNs for image recognition. Paetzel et al. (2015) built an incremental dialogue system called "Eve", which could guess the correct image, out of a set of possible candidates, based on descriptions given by a human. The system was shown to perform nearly as well as humans. Then in the same domain, Manuvinakurike et al. (2017) used reinforcement learning to learn an incremental dialogue policy, which outperformed the high performance baseline policy of Paetzel et al. (2015) in offline simulations based on real user data. Each image was associated with certain descriptions and the game worked for a specific data set of images without actually using computer vision. Manuvinakurike et al. (2016a) developed a model for incremental understanding of the described scenes among a set of complex configurations of geometric shapes. Kennington and Schlangen (2015) learned perceptually grounded word meanings for incremental reference resolution in the same domain of geometric shape descriptions, using visual features.
Huang et al. (2016) built a data set of sequential images with corresponding descriptions that could potentially be used for the task of visual storytelling. Mostafazadeh et al. (2016) introduced the task of "visual question generation" where the system generates natural language questions when given an image, and then Mostafazadeh et al. (2017) extended this work to natural language question and response generation in the context of image-grounded conversations.
Some recent work has started investigating the potential of building dialogue systems that can help users efficiently explore data through visualizations (Kumar et al., 2017).
The problem of intent recognition or dialogue act detection has been extensively studied. Below we focus on recent work on dialogue act detection that employs deep learning. People have used recurrent neural networks (RNNs) including long short term memory networks (LSTMs), and CNNs (Kalchbrenner and Blunsom, 2013; Li and Wu, 2016; Khanpour et al., 2016; Shen and Lee, 2016; Ji et al., 2016; Tran et al., 2017). The works that are most similar to ours are by Lee and Dernoncourt (2016) and Ortega and Vu (2017) who compared LSTMs and CNNs on the same data sets. However, neither Lee and Dernoncourt (2016) nor Ortega and Vu (2017) experimented with incremental dialogue act detection as we do.

Data
We use a Wizard of Oz setup to collect a dialogue corpus in our image edit domain. The Wizard-user conversational session is set up over Skype and the conversation recorded on the Wizard's system. The screen share feature is enabled on the Wizard's screen so that the user can see in real time the changes requested. There are no time constraints, and the Wizard and the user can talk freely until the user is happy with the changes performed. Users may have varying levels of image editing expertise and knowledge of the image editing tool used during the interaction (Adobe Lightroom).
Each user is given 4-6 images and time to think of ways to edit them to make them look better. The conversation typically begins with the step called image location. The user describes the image in a unique manner so that it can be located in the library of photos by the Wizard. If the descriptions are not clear the Wizard can ask clarification questions. Once the image is located, the user conveys to the Wizard the changes they desire. The user and the Wizard have a conversation until the user is happy with the final outcome. In order to capture all the changes that the user wants to achieve in spoken language, the image editing tool is controlled only by the Wizard. Figure 4 in the Appendix shows the Adobe Lightroom interface as seen by the user and the Wizard. Note that users were not explicitly told that they would interact with another human and could not see who they interacted with because the Wizard and the user were in different locations. However, the naturalness of the conversation made it obvious that they were conversing with another human.
The photographs chosen for the study are sampled from the Visual Genome data set (Krishna et al., 2017). For the dialogue to be reflective of a real-world scenario the images sampled should be representative of the images regularly edited by the users. We sampled 200 Photoshop requests from the Reddit Photoshop Request forum and Zhopped, and found that the images in those posts fell into eight high-level categories: animals, city scenes, food, nature/landscapes, indoor scenes, people, sports, and vehicles. Figure 1 shows a sample conversation between the user and the Wizard, and Figure 1a shows the annotation of the dialogue acts for the user utterances. Table 1 shows the statistics of the data. Details of the semantics of the conversation are discussed in Section 4. Each dialogue session lasts between 2 and 30 min (7 min on average). The dialogues were transcribed via crowdsourcing (Amazon Mechanical Turk). We intend to publicly release the data.

Dialogue Semantics
The data collected were annotated with dialogue acts. User utterances were segmented at the word level into utterance segments. An utterance is defined as a portion of speech preceded and/or followed by a silence interval greater than 300 msec. Each utterance segment was then assigned a dialogue act. The annotations were performed by two expert annotators. The inter-annotator agreement was measured by having our two annotators annotate the same dialogue session of 20 min, and kappa was found to be 0.81 which indicates high agreement. Below we describe briefly our dialogue act scheme.
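The silence-based utterance definition above can be illustrated with a minimal sketch. It assumes that word-level timestamps are available as (word, start, end) triples in seconds; this input format, the function name, and the toy timings are hypothetical and are not part of our annotation pipeline.

```python
from typing import List, Tuple

# (word, start_time_sec, end_time_sec); this timing format is an assumption.
TimedWord = Tuple[str, float, float]

SILENCE_THRESHOLD_SEC = 0.3  # 300 msec, as in the utterance definition


def split_into_utterances(words: List[TimedWord],
                          gap: float = SILENCE_THRESHOLD_SEC) -> List[List[str]]:
    """Start a new utterance whenever the silence before a word exceeds `gap`."""
    utterances: List[List[str]] = []
    previous_end = None
    for word, start, end in words:
        if previous_end is None or start - previous_end > gap:
            utterances.append([])
        utterances[-1].append(word)
        previous_end = end
    return utterances


# Toy timings (invented): a 0.6 s pause separates the two utterances.
timed = [("more", 0.0, 0.3), ("saturation", 0.35, 0.9),
         ("that's", 1.5, 1.7), ("good", 1.75, 2.0)]
utterances = split_into_utterances(timed)
```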
Image Edit Requests: The most common dialogue acts used by the user are called "Image Edit Requests (IERs)". These are user requests concerning the changes to be made to the images. IERs are further categorized into 4 groups: IER-New (IER-N), IER-Update (IER-U), IER-Revert (IER-R), and IER-Compare (IER-C). IER-N requests refer to utterances that are concerned with new image edit requests different from the previously requested edits. These requested changes are either abstract ("it's flushed out, can you fix it?") or exact ("change the saturation to 20%"). The Wizard interprets these requests and performs the changes. IER-U labels are used for utterances that request updates to the previously mentioned IER-Ns. These include the addition of more details ("change it to 50%") to the IER-N ("change the saturation"), issuing corrections to the IER ("can you reduce the value again?"), modifiers (more, less), etc. If the users are completely unhappy with the change they can revert the change made (IER-R). The IER-R act is used if the user reverts the complete changes performed, compared to only changing the values. For example, if the user is modifying the saturation of the image and across multiple turns changes the value of saturation from 20% to 30% and back to 20%, the user's action is labeled as IER-U. If the user wants all the saturation changes to be undone, the user's action is labeled as IER-R. Users may also want to compare the changes made across different steps ("can we compare this to the previous update?"), and this action is labeled as IER-C.
Comments: Once the changes are performed the user is typically happy with the change and issues a comment that they like the edit (COM-L), or they are unhappy and issue a comment that they dislike the edit (COM-D). In some cases the users are neutral and neither like nor dislike the edit. Typically such utterances are comments on the images and are labeled as COM-I.

Requests & Responses: The user may ask the Wizard to provide suggestions on the IERs. These are labeled as "Request" acts. "Yes" and "no" responses uttered in response to the Wizard's suggestions are labeled as RS-Y or RS-N.
Suggestions: This is the most commonly used Wizard dialogue act after "Acknowledgments". When the user does not know what edits to perform, the Wizard issues suggestion utterances with the intention of providing the user with ideas about the changes that could be performed. The Wizard provides new suggestions (S-N), e.g., "do you want to change the sharpness on this image?". The Wizard could also provide update suggestions for the current request under consideration (S-U), e.g., "sharpness of about 50% was better".
Other user actions are labeled as questions about the features supported by the image editing tool, clarifications, greetings, and discourse markers. In total there are 26 dialogue act labels, including the dialogue act "Other (O)" which covers all of the cases that do not belong in the other categories. In this work we are interested in the task of understanding the user utterances only, and in particular, in classifying user utterances into one of 10 labels: IER-N, IER-U, IER-R, IER-C, RS-Y, RS-N, COM-L, COM-D, COM-I, and O.
An agent will eventually be developed to replace the Wizard, which means that the agent will need to interpret the user utterances. The task of understanding the user utterance happens in two steps. The first step is to identify the dialogue acts. The second step is to understand the user image edit requests IER-N and IER-U at a fine-grained level. For example, when the user says "make the tree brighter to 100", it is important to understand the exact user's intent and to translate this into an action that the image editing tool can perform. For this reason we use action-entities tuples <action, attribute, location/object, value>. The user utterances are mapped to dialogue acts and then to a pre-defined set of image action-entities tuples which are translated into image editing actions. For more information on our annotation framework for mapping IERs to actionable commands see Manuvinakurike et al. (2018). It is beyond the scope of this work to perform the image editing and we intend to pursue this in future work. Table 2 shows an example of the process of understanding the image edit requests. Table 3 shows example utterances for some of the most frequently occurring dialogue acts in the corpus. In these examples it can be seen that, with the exception of utterance 3, all the other dialogue acts can be identified with some degree of certainty without waiting for the user to complete the utterance. Also, Figure 5 in the Appendix shows example IERs. One of the motivations for our work is to identify the right dialogue act at the earliest time. Not only is this more efficient but also more natural. The human Wizard can begin to take action even before the utterance completion, e.g., in utterance 1 the Wizard clicks the "vignette" feature in the tool before the user has finished uttering their request.
Another goal is to measure potential savings in time gained through incremental processing, i.e., how much we save in terms of number of words when we identify the dialogue act earlier rather than waiting until the full completion of the utterance, without sacrificing performance.
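The action-entities tuple <action, attribute, location/object, value> described in this section can be pictured as a small record type. The sketch below is illustrative only: the class name, the field types, and the action name "adjust" are assumptions, and the mapping for the example utterance is written by hand rather than produced by a model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ImageEditAction:
    """<action, attribute, location/object, value>; field types are assumptions."""
    action: str
    attribute: Optional[str] = None
    location_or_object: Optional[str] = None
    value: Optional[float] = None


# Hypothetical hand-written mapping for "make the tree brighter to 100".
example = ImageEditAction(action="adjust", attribute="brightness",
                          location_or_object="tree", value=100)
```

Fields default to None because abstract requests may leave the attribute, the target, or the value unspecified.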

Model Design
For our experiments we use a training set sampled randomly from 90% of the users (116 dialogues for training, 13 dialogues for testing). We use word embedding features whose construction is described in Section 6.1. There are several reasons for using word embeddings as features, e.g., unseen words have a meaningful representation and the embeddings provide dimensionality reduction. Figure 6 shows a visualization of the utterance embeddings using t-SNE (Maaten and Hinton, 2008).
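A user-level split of the kind described above could be sketched as follows. This assumes that all dialogues of a given user go to the same side of the split; the function name, the seed, and the toy user and dialogue identifiers are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def split_by_user(dialogues_by_user: Dict[str, List[str]],
                  train_fraction: float = 0.9,
                  seed: int = 0) -> Tuple[List[str], List[str]]:
    """Assign all dialogues of a user to either the training or the test set."""
    users = sorted(dialogues_by_user)
    random.Random(seed).shuffle(users)
    n_train = round(train_fraction * len(users))
    train = [d for u in users[:n_train] for d in dialogues_by_user[u]]
    test = [d for u in users[n_train:] for d in dialogues_by_user[u]]
    return train, test


# Toy data: 10 hypothetical users with 2 dialogue ids each.
toy = {f"user{i}": [f"user{i}-d{j}" for j in range(2)] for i in range(10)}
train_ids, test_ids = split_by_user(toy)
```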

Constructing Word Embeddings
We convert the words into vector representations to train our deep learning models (and a variation of the random forests). We use out-of-the-box word vectors available in the form of GloVe embeddings (Pennington et al., 2014) (trained with Wikipedia data), or we employ fastText (Bojanowski et al., 2017) to construct embeddings using the data from the Visual Genome image region description phrases, the dialogue training set collected during this experiment, and other data related to image editing that we have collected (image edit requests out of a dialogue context). From now on these embeddings trained with fastText will be referred to as "trained embeddings".
As we can see in Table 4, for models E (LSTMs) and I (CNNs) we use word embeddings trained with fastText on the aforementioned data sets. The Vanilla LSTM (model D) does not use GloVe or trained embeddings, i.e., there is no dimensionality reduction. Model H (CNN) uses GloVe embeddings. The vectors used in this work (both GloVe and trained embeddings) have a dimension of 50. For trained embeddings, the vectors were constructed using skipgrams over 50 epochs with a learning rate of 0.5.
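The hyperparameters stated above for the trained embeddings can be written down as a minimal sketch using the fastText Python bindings. The corpus path, the function name, and the one-phrase-per-line file layout are assumptions; all settings not mentioned in the text are left at the library defaults.

```python
# Hyperparameters stated in the text; everything else uses library defaults.
FASTTEXT_PARAMS = {"model": "skipgram", "dim": 50, "epoch": 50, "lr": 0.5}


def train_embeddings(corpus_path: str):
    """Train skip-gram vectors on a plain-text corpus (assumed one phrase per line)."""
    import fasttext  # third-party package, imported lazily
    return fasttext.train_unsupervised(corpus_path, **FASTTEXT_PARAMS)
```

Given a trained model, `model.get_word_vector("saturation")` would return the 50-dimensional vector for a word.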
Recent advancements in creating a vector representation for a sentence were also evaluated. We used the Sent2Vec (Pagliardini et al., 2018) toolkit to get a vector representation of the sentence and then used these vectors as features for models G and J. Note that LSTMs are sequential models where every word needs a vector representation and thus we could not use Sent2Vec.

Model Construction
We use WEKA (Hall et al., 2009) and TensorFlow (Abadi et al., 2016), the latter for the LSTM and CNN models. The models B, C, D, and F in Table 4 use bag-of-words features. The CNN has 2 layers, with the first layer containing 512 filters and the second layer 256 filters. Both layers have a kernel size of 10 and use ReLU activation. The layers are separated by a max pooling layer with a pool size of 10. The final layer is a dense softmax layer. We use the Adam optimizer with the categorical cross entropy loss function. The LSTM cell is made up of 2 hidden layers. We use dropout with keep_prob = 0.1. We put the logits from the last time step through the softmax to get the prediction. We use the same optimizer and loss function as for the CNN since they were found to be the best performing.
Table 4 shows the dialogue act classification accuracy for all models on our test set. Here we assume that we have the correct utterance segmentation for both the training and the test data. Note that because of the "Other" dialogue act all words in a sentence will belong to a segment and a dialogue act category. We hypothesize that the poor performance of the sequential models (CRF and LSTM) is due to the lack of adequate training data to capture large context dependencies.
Table 5 shows the savings in terms of overall number of words and average number of words saved per sentence, for each dialogue act in the corpus. Figure 2 shows the confidence curves for predicting the dialogue act with the progression of every word. From this figure it is clear that after listening to the word "photo" the classifier is confident enough that the user is issuing the IER-N command. Here the notion of incrementality is to predict the right dialogue act as early as possible and evaluate the savings in terms of the number of words. While from this example it is clear that the correct dialogue act can be identified before the user completes the utterance, it is not clear when to commit to a dialogue act.
The trade-off involved in committing early is often not clear. Table 5 shows the maximum savings that can be achieved in an ideal scenario where an oracle (an entity informing if the prediction is correct or wrong as soon as the prediction is made) identifies the earliest point of predicting the correct dialogue act.

Incrementality
The method used for calculating the savings is shown in Table 6. In this example for the utterance "I think that's good enough", we feed the classifier the utterances one word at a time and get the classifier confidence. The class label with the highest score is obtained. Here the oracle tells us that we could predict the correct class COM-L as soon as "I think that's good" was uttered and thus the word savings would be 1 word.
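The oracle-based savings computation can be sketched as follows. This is one reading of the procedure, assuming that the oracle stops at the first utterance prefix whose predicted label equals the gold label; the toy classifier is a hypothetical stand-in for the trained models.

```python
from typing import Callable, List


def oracle_word_savings(words: List[str], gold: str,
                        classify: Callable[[List[str]], str]) -> int:
    """Words saved if an oracle stops at the first prefix classified as `gold`."""
    for n in range(1, len(words) + 1):
        if classify(words[:n]) == gold:
            return len(words) - n
    return 0  # the correct label is never predicted: no savings


# Stand-in classifier (hypothetical): predicts COM-L once "good" has been heard.
def toy_classifier(prefix: List[str]) -> str:
    return "COM-L" if "good" in prefix else "O"


utterance = "I think that's good enough".split()
saved = oracle_word_savings(utterance, "COM-L", toy_classifier)
```

With this stand-in, the prediction becomes correct at "I think that's good", so `saved` is 1, matching the example of Table 6.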
However, in real-world scenarios the oracle is not present. We use several confidence thresholds and measure the accuracy and the savings achieved in predicting the dialogue act without the oracle. For the predictions in the test set we get the accuracy for each of the thresholds. Then if the predictions are correct, we calculate the savings. Thus Figure 3 shows the word savings for each confidence threshold when the predictions are correct for that threshold. So in the example of Table 6, for a confidence threshold value of 0.4, we extract the class label assigned for the utterance once the max confidence score exceeds 0.4. In this case once the word "good" was uttered by the user the confidence score assigned (0.5) was higher than the threshold value of 0.4 and we take the predicted class as COM-L. The word savings in this case is 1 word and our prediction is correct. But for a confidence threshold value of 0.2, our prediction would be the tag O which would be wrong and there would be no time savings. Figure 3 shows that as the confidence threshold values increase the accuracy of the predictions rises but the savings decrease.
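The threshold-based procedure above can be sketched as follows. Only the score of 0.5 at the word "good" comes from the example; the remaining confidence values are invented for illustration, and falling back to the final prediction when the threshold is never exceeded is an assumption.

```python
from typing import List, Tuple

# One (top label, confidence) pair per word heard so far.
Step = Tuple[str, float]


def commit_at_threshold(steps: List[Step], threshold: float) -> Tuple[str, int]:
    """Return (committed label, words remaining) at the first score above threshold."""
    for n, (label, confidence) in enumerate(steps, start=1):
        if confidence > threshold:
            return label, len(steps) - n
    # Assumption: if the threshold is never exceeded, use the final prediction.
    return steps[-1][0], 0


def evaluate(steps: List[Step], gold: str, threshold: float) -> Tuple[bool, int]:
    """Savings are counted only when the committed label is correct."""
    label, remaining = commit_at_threshold(steps, threshold)
    correct = label == gold
    return correct, remaining if correct else 0


# "I think that's good enough": only the 0.5 at "good" is from the text.
steps = [("O", 0.25), ("O", 0.30), ("O", 0.35), ("COM-L", 0.50), ("COM-L", 0.70)]
high = evaluate(steps, "COM-L", 0.4)
low = evaluate(steps, "COM-L", 0.2)
```

With these scores, a threshold of 0.4 commits to COM-L at "good" and saves 1 word, whereas a threshold of 0.2 commits to O at the first word, which is wrong and yields no savings.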
Researchers have used simulations (Paetzel et al., 2015) or a reinforcement learning policy (Manuvinakurike et al., 2017) to learn the right points of interrupting the user which are dependent on the language understanding confidence scores.
Here we do not focus on learning such policies. Instead, our work is a precursor to learning an incremental system dialogue policy.

Conclusion
We presented "conversational image editing", a novel real-world application domain, which combines dialogue, visual information, and the use of computer vision. We discussed why this is a domain particularly well suited for incremental dialogue processing. We built models for incremental intent identification based on deep learning and traditional classification algorithms. We calculated the impact of varying confidence thresholds (above which the classifier's prediction is considered) on classification accuracy and savings in terms of number of words. Our experiments provided evidence that incremental intent processing could be more efficient for the user and save time in accomplishing tasks.

Figure 4: The interface as seen by the user and the Wizard. We use Adobe Lightroom as the image editing program.

Tag     User Edit Requests
IER-N   I want to um add more focus on the boat
IER-N   can you make the water uh nicer color
IER-N   uh can we crop out uh little bit off the bottom
IER-N   is there a way to add more clarity
IER-N   can we adjust the shadows
IER-U   more [saturation]
IER-U   can we get rid of the hints of green in it
IER-U   bluer
IER-U   little bit more from the left [crop]
IER-R   can you unfocus it
IER-C   can you show me before and after

Figure 5: Example user edit requests. Only two bounding boxes are labeled in the image for better reading. The actual images have more extensive object labels.

Figure 6: Visualization of the sentence embeddings of the user utterances used for training. The t-SNE visualizations after half-way through the utterances are shown. The utterances that have the same dialogue acts can be seen grouping together. This shows that the complete utterance is not always needed to identify the correct dialogue act.