A Neural, Interactive-predictive System for Multimodal Sequence to Sequence Tasks

We present a demonstration of a neural interactive-predictive system for tackling multimodal sequence to sequence tasks. The system generates text predictions for different sequence to sequence tasks: machine translation, image captioning and video captioning. These predictions are revised by a human agent, who introduces corrections in the form of characters. The system reacts to each correction by providing alternative hypotheses that comply with the feedback provided by the user. The final objective is to reduce the human effort required during this correction process. The system is implemented following a client-server architecture. To access the system, we developed a website, which communicates with the neural model, hosted on a local server. From this website, the different tasks can be tackled following the interactive-predictive framework. We open-source all the code developed for building this system. The demonstration is hosted at http://casmacat.prhlt.upv.es/interactive-seq2seq.


Introduction
The sequence to sequence problem involves the transduction of an input sequence x into an output sequence ŷ (Graves, 2012). In recent years, many tasks have been tackled under this perspective using neural networks with extraordinary results: neural machine translation (NMT; Sutskever et al., 2014; Bahdanau et al., 2015), speech recognition and translation (Chan et al., 2016; Niehues et al., 2018), image and video captioning (Xu et al., 2015; Yao et al., 2015), among others.
These systems are usually based on the statistical formalization of pattern recognition (e.g. Bishop, 2006). Following this probabilistic framework, the objective is to find the most likely output sequence ŷ, given an input sequence x, according to a model Θ:

$$\hat{y} = \operatorname*{arg\,max}_{y} \; p(y \mid x; \Theta) \tag{1}$$

In recent years, Θ has frequently been implemented as a deep neural network, trained in an end-to-end way. These neural systems have consistently outperformed other alternatives in the aforementioned problems. However, despite these impressive advances, the systems are not perfect and still make errors (Koehn and Knowles, 2017).
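As an illustration of how the arg max in Eq. (1) is typically approximated in practice, the following minimal sketch runs a beam search over a toy model. The model table, the function name `beam_search` and its parameters are assumptions made for this example only; a real system would condition on the input x and on the full output history.

```python
import math

# Hypothetical toy model: next-token probabilities that depend only on the
# previous token. It stands in for p(y | x; Theta) and is NOT a trained model.
TOY_MODEL = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.5, "cat": 0.3, "</s>": 0.2},
    "the": {"dog": 0.9, "</s>": 0.1},
    "dog": {"</s>": 1.0},
    "cat": {"</s>": 1.0},
}


def beam_search(model, beam_size=2, max_len=5):
    """Approximate the arg max of Eq. (1) by keeping the best partial outputs."""
    beams = [(0.0, ["<s>"])]  # (log-probability, tokens generated so far)
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, tokens in beams:
            for token, prob in model[tokens[-1]].items():
                candidates.append((logp + math.log(prob), tokens + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates[:beam_size]:
            # Hypotheses that emitted the end symbol leave the beam.
            (finished if cand[1][-1] == "</s>" else beams).append(cand)
        if not beams:
            break
    best_logp, best_tokens = max(finished or beams, key=lambda c: c[0])
    words = [t for t in best_tokens if t not in ("<s>", "</s>")]
    return words, math.exp(best_logp)


tokens, prob = beam_search(TOY_MODEL)
print(" ".join(tokens), round(prob, 2))
```

In this toy setting a purely greedy decoder would commit to "a" first and end up with a less probable sentence, whereas the beam keeps "the" alive long enough to find the better output.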
In several scenarios, and especially in machine translation, fully-automatic systems are usually used to provide initial predictions for the input objects. These predictions are later revised by a human expert, who corrects the errors made by the system. This is known as post-editing and, in some scenarios, it increases the productivity with respect to performing the task from scratch (Alabau et al., 2016; Arenas, 2008; Hu and Cadwell, 2016).

Interactive-predictive pattern recognition
As an alternative to the static, decoupled post-editing, other strategies have been proposed, aiming to improve the productivity of the correction phase. Among them, interactive-predictive pattern recognition (Foster et al., 1997) is particularly interesting. Under this framework, the static correction stage shifts to an iterative human-computer collaboration process.
The user interacts with the system by means of a feedback signal f. The system then suggests an alternative hypothesis ỹ, compatible with the feedback. The inclusion of the feedback into the general pattern recognition framework rewrites Eq. (1), introducing a restriction on the search space:

$$\tilde{y} = \operatorname*{arg\,max}_{y \text{ compatible with } f} \; p(y \mid x, f; \Theta) \tag{2}$$

The most paradigmatic application of the interactive-predictive pattern recognition framework is machine translation. The addition of interactive protocols to foster the productivity of translation environments has been studied for a long time, for phrase-based models (Alabau et al., 2013, 2016; Barrachina et al., 2009; Federico et al., 2014; Green et al., 2014) and also for NMT systems (Knowles and Koehn, 2016; Peris et al., 2017; Peris and Casacuberta, 2019; Wuebker et al., 2016).

Figure 1: System architecture. The client, a website, presents the user with several input objects (images, videos or texts) and a prediction. The user then introduces a feedback signal to correct this prediction. Once introduced, the feedback signal is sent to the server, together with the input object, to generate an alternative hypothesis that takes the user corrections into account.
The system we are presenting in this work is an extended version of Peris and Casacuberta (2019), who presented an NMT system that accepted prefix feedback: the user corrected the first wrong character of the sentence. The system then reacted to the feedback by providing an alternative suffix. This protocol can be implemented as a constrained beam search. Moreover, the system can be retrained incrementally, as soon as a corrected sample is validated, following an online learning scenario.
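To make the prefix-constrained search more concrete, here is a minimal sketch of a beam search that discards every hypothesis incompatible with a character-level prefix. It is a toy illustration under stated assumptions, not the NMT-Keras implementation: the model table, `compatible` and `constrained_beam_search` are hypothetical names, and the toy model conditions only on the previous word.

```python
import math

BOS, EOS = "<s>", "</s>"

# Hypothetical toy captioning model (previous word -> next-word probabilities).
TOY_MODEL = {
    BOS: {"A": 1.0},
    "A": {"group": 0.6, "football": 0.4},
    "group": {"of": 1.0},
    "of": {"players": 1.0},
    "players": {EOS: 1.0},
    "football": {"player": 0.7, EOS: 0.3},
    "player": {EOS: 1.0},
}


def text_of(tokens):
    """Detokenize a hypothesis, dropping the sentence boundary symbols."""
    return " ".join(t for t in tokens if t not in (BOS, EOS))


def compatible(tokens, prefix):
    """True if the hypothesis already covers, or can still grow into, prefix."""
    text = text_of(tokens)
    if text.startswith(prefix):
        return True  # the whole validated prefix is already reproduced
    if tokens[-1] == EOS:
        return False  # finished before covering the prefix
    # Still shorter than the prefix: it must match so far and end exactly
    # at a word boundary of the prefix.
    return prefix.startswith(text) and (text == "" or prefix[len(text)] == " ")


def constrained_beam_search(model, prefix="", beam_size=3, max_len=10):
    """Beam search restricted to hypotheses compatible with the user prefix."""
    beams, finished = [(0.0, [BOS])], []
    for _ in range(max_len):
        candidates = []
        for logp, tokens in beams:
            for token, prob in model[tokens[-1]].items():
                extended = tokens + [token]
                if compatible(extended, prefix):  # the search-space restriction
                    candidates.append((logp + math.log(prob), extended))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates[:beam_size]:
            (finished if cand[1][-1] == EOS else beams).append(cand)
        if not beams:
            break
    if not finished:
        return None  # no in-vocabulary hypothesis matches the prefix
    return text_of(max(finished, key=lambda c: c[0])[1])


print(constrained_beam_search(TOY_MODEL))                # unconstrained
print(constrained_beam_search(TOY_MODEL, prefix="A f"))  # after typing "f"
```

With an empty prefix the search returns the most probable sentence; once the user types "f" after "A", every hypothesis starting with "A group" is pruned and the best compatible suffix is produced instead. This word-level toy returns `None` when the prefix contains a word outside its vocabulary, a case that a real system has to handle, for example with subword or character-level units.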
We generalize this interactive-predictive NMT system to cope with alternative input modalities, namely images and videos. The system can be accessed through a client-server interface. We developed a client website that accesses our servers, on which the interactive-predictive systems are deployed. A live demo of the system can be accessed at http://casmacat.prhlt.upv.es/interactive-seq2seq.
In the following sections, we describe the main architecture, features and usage of our interactive-predictive system. We also describe the frontend of our demonstration website and present an example of an interactive session.

System description
The core of our system is a neural sequence to sequence model, developed with NMT-Keras (Peris and Casacuberta, 2018). This library is built upon Keras (Chollet et al., 2015) and works with the Theano (Theano Development Team, 2016) and TensorFlow (Abadi et al., 2016) backends. The system is deployed as a Python-based HTTP server that waits for requests. The user interactions are introduced through a (client) HTML website. The website is hosted on an Nginx server that manages the interactions using JavaScript and communicates with the Python server using the PHP curl tool. All code is open-source and publicly available.
NMT-Keras extends the (already extensive) Keras functionalities, providing a flexible, easy-to-use framework upon which to build neural models. Among the features brought by NMT-Keras, some are particularly useful for sequence-to-sequence tasks: extended recurrent neural networks, with embedded attention mechanisms and conditional LSTM/GRU units (Sennrich et al., 2017); multi-head attention layers, positional encodings and position-wise feed-forward networks for building Transformer models (Vaswani et al., 2017); and a modular handler for processing different data modalities, including text, images, videos or categorical labels.

Figure 2: Frontend of the client website. When the "Transcript!" button is clicked, an initial hypothesis for the input object (in this case, an image) appears in the right area. The user then introduces corrections to this text. The system reacts to each correction, producing alternative hypotheses, always compliant with the user feedback. Once a correct caption of the image is reached, the user clicks on the "Accept translation" button, validating the hypothesis.

Within this framework, we built our neural systems, which are leveraged via our interactive client-server application. The neural systems are deployed on a server, waiting for requests. When the client asks for a prediction, they react, generate the prediction and deliver it back to the client.
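The request-response cycle of such a server can be sketched with the Python standard library alone. Everything below is an assumption made for illustration: the JSON field names (`source`, `prefix`, `hypothesis`), the canned `predict` stub and the handler class are hypothetical and do not reproduce the actual server code or the NMT-Keras API.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

CANNED_CAPTION = "A football player in a red uniform is wearing a helmet."


def predict(source, prefix=""):
    """Hypothetical stand-in for the neural model; NOT the NMT-Keras API."""
    # A real system would run a prefix-constrained search over `source`.
    return CANNED_CAPTION if CANNED_CAPTION.startswith(prefix) else prefix


def handle_request(raw_body):
    """Map a raw JSON request body to an (HTTP status, response dict) pair."""
    try:
        request = json.loads(raw_body)
        hypothesis = predict(request["source"], request.get("prefix", ""))
    except (ValueError, KeyError, TypeError, AttributeError) as err:
        return 400, {"error": "bad request: %s" % err}
    return 200, {"hypothesis": hypothesis}


class PredictionHandler(BaseHTTPRequestHandler):
    """Waits for POST requests and answers each one with a hypothesis."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, body = handle_request(self.rfile.read(length))
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def make_server(host="127.0.0.1", port=8000):
    """Build the server; call .serve_forever() on the result to start it."""
    return HTTPServer((host, port), PredictionHandler)
```

Keeping the request logic in `handle_request`, separate from the socket handling, makes it possible to exercise the protocol without opening a port. A client would POST a body such as `{"source": "image.jpg", "prefix": "A f"}` and read the `hypothesis` field of the reply.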

Usage of the interactive system
Our interactive-predictive system works as follows: initially, an input object is presented to the user in the client website. The user requests an automatic prediction for it. Next, the client communicates with the server via PHP. The server queries the neural system, which produces an initial hypothesis applying Eq. (1). The hypothesis is then sent back to the client website.
Next, the interactive-predictive process starts: the user looks for the first error in this hypothesis and introduces a correction with the keyboard (writing one or more characters). When the user stops typing the correction, the system reacts, sending to the server a request containing the input object and the user feedback (the sequence of characters that form the correct prefix). Then, the neural model applies Eq. (2) and produces an alternative hypothesis that completes the correct prefix. This is implemented as a constrained beam search, as described in Peris et al. (2017) and Peris and Casacuberta (2019). This iteration of the process is illustrated in Fig. 1.
This protocol is repeated until the user finds the hypothesis given by the system satisfactory. Then, it is validated. As soon as the sentence is validated, the system can be incrementally updated with this sample, following an online learning setup (Peris and Casacuberta, 2019). Hence, in future interactions, the system will be progressively updated, adapting to a given domain or to the user preferences. These adaptive systems have been shown to be effective for reducing the human effort spent in the process (Karimova et al., 2018).

0  System: A group of football players in red uniforms.
1  User:   A [f] group of football players in red uniforms.
   System: A football player in a red uniform is holding a football.
2  User:   A football player in a red uniform is [w] holding a football.
   System: A football player in a red uniform is wearing a football.
3  User:   A football player in a red uniform is wearing a [h] football.
   System: A football player in a red uniform is wearing a helmet.
4  User:   A football player in a red uniform is wearing a helmet.

Figure 3: Interactive-predictive session for correcting the caption generated in Fig. 2. At each iteration, the user introduces a character correction (in brackets). The system modifies its hypothesis, taking this feedback into account: it keeps the correct prefix and generates a compatible suffix.
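The correction loop described in this section can be simulated with a few lines of code, which is also a common way of estimating the number of corrections a user would need. The sketch below is a hypothetical illustration: `toy_predict` merely picks the first caption from a fixed list that starts with the validated prefix (the sentences are those of Fig. 3), standing in for the constrained search of the real model, and the simulated user always types the single correct character at the first error.

```python
# Candidate captions in the toy system's order of preference (taken from Fig. 3).
CANDIDATES = [
    "A group of football players in red uniforms.",
    "A football player in a red uniform is holding a football.",
    "A football player in a red uniform is wearing a football.",
    "A football player in a red uniform is wearing a helmet.",
]


def toy_predict(prefix):
    """Hypothetical system: best-ranked caption that completes the prefix."""
    for caption in CANDIDATES:
        if caption.startswith(prefix):
            return caption
    return prefix  # nothing fits: return the validated prefix unchanged


def first_error(hypothesis, reference):
    """Position of the first wrong character, or None if both strings match."""
    for i, (hyp_char, ref_char) in enumerate(zip(hypothesis, reference)):
        if hyp_char != ref_char:
            return i
    if len(hypothesis) != len(reference):
        return min(len(hypothesis), len(reference))
    return None


def simulate_session(predict, reference, max_iterations=100):
    """Simulate a user who fixes the first wrong character at each iteration."""
    hypothesis = predict("")
    log = [("System", hypothesis)]
    corrections = 0
    for _ in range(max_iterations):  # guard against a non-converging system
        position = first_error(hypothesis, reference)
        if position is None:
            break  # the user accepts the hypothesis
        corrections += 1
        prefix = reference[: position + 1]  # validated prefix plus the fix
        log.append(("User", prefix))
        hypothesis = predict(prefix)
        log.append(("System", hypothesis))
    return corrections, log


corrections, log = simulate_session(toy_predict, CANDIDATES[-1])
for speaker, text in log:
    print(speaker + ":", text)
print("corrections:", corrections)
```

Run on the sentences of Fig. 3, the simulation needs one typed character per iteration until the reference caption is reached, mirroring the session shown in the figure.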

System showcase
To show the interactive-predictive protocol described in the previous sections, we developed a website which hosts a demonstration of the system. Our demonstration system handles three different problems, involving three different data modalities: text-to-text (NMT), image-to-text (image captioning) and video-to-text (video captioning). For tackling these tasks, we use a similar model: a neural encoder-decoder, based on recurrent neural networks with attention (Bahdanau et al., 2015; Xu et al., 2015; Yao et al., 2015). Our framework also supports Transformer-like architectures (Vaswani et al., 2017). The NMT task concerns the translation of texts from a medical domain. The system is similar to the one used by Peris and Casacuberta (2019), and was trained on the UFAL corpus (Bojar et al., 2017). The image and video captioning systems were trained on the Flickr8k (Hodosh et al., 2010) and MSVD (Chen and Dolan, 2011) datasets, respectively. The images were encoded using an Inception convolutional neural network (Szegedy et al., 2016) trained on the ILSVRC dataset (Russakovsky et al., 2015). The decoder receives the representation preceding the fully-connected network. In the case of the video captioning system, we applied a 3D convolutional neural network (Tran et al., 2015) to obtain time-aware features.
Finally, as mentioned in previous sections, the systems can be retrained after the validation of each sample. In our demonstration, the systems are updated via gradient descent, but with a learning rate of 0, so the weights remain unchanged; this prevents a degradation of the model due to accidental misuse.
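The effect of the zero learning rate can be checked on a toy example. The sketch below applies one plain gradient-descent step to a two-weight linear model with a squared-error loss; the model, the loss and the numbers are assumptions chosen only to make the arithmetic visible, and have nothing to do with the actual networks or optimizer settings of the demonstration.

```python
def squared_error_gradient(weights, features, target):
    """Gradient of 0.5 * (w . x - target)^2 with respect to the weights."""
    error = sum(w * x for w, x in zip(weights, features)) - target
    return [error * x for x in features]


def sgd_step(weights, gradients, learning_rate):
    """One gradient-descent update: w <- w - learning_rate * gradient."""
    return [w - learning_rate * g for w, g in zip(weights, gradients)]


weights = [0.5, -1.0]
features, target = [1.0, 2.0], 1.0  # one validated (toy) sample
gradients = squared_error_gradient(weights, features, target)

frozen = sgd_step(weights, gradients, learning_rate=0.0)   # demo setting
adapted = sgd_step(weights, gradients, learning_rate=0.1)  # online learning

print("gradients:", gradients)
print("learning rate 0.0:", frozen)
print("learning rate 0.1:", adapted)
```

With a learning rate of 0 the update pipeline still runs end to end (the gradient is computed and the step is applied), but the weights come out identical, so no user can degrade the hosted model; any positive rate moves the weights towards the validated sample.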

Example: image captioning
We show and analyze an image captioning example. The NMT and video captioning tasks are similar. Fig. 2 shows the demo website for the image captioning task. In the left part of the screen, the input object is shown, in this case, an image. When the user clicks on the "Transcript!" button, the system generates a caption of the image, displaying it in an editable area on the right part of the screen. The user can then introduce the desired corrections to this hypothesis. As a correction is introduced, the system reacts, providing an alternative caption that always respects the feedback given by the user.
As can be seen in Fig. 2, the caption generated by the system has some errors. Fig. 3 shows the interactive-predictive captioning session for obtaining a correct sample. With three interactions, the system was able to obtain a correct caption for the image.
It is particularly interesting to observe that the system correctly accounts for the singular/plural agreement of the phrase "in red uniform(s)", depending on the subject ("A football player" or "A group of football players").

Conclusions and future work
We presented a demonstration of an interactive-predictive neural system for multimodal sequence to sequence tasks. We described its client-server architecture and developed a website to ease the usage of the system.
As future work, we would like to improve the frontend of our website. Inspecting the attributes of black-box neural models is a relevant research topic, and it is under active development (e.g. Zeiler and Fergus, 2014; Ancona et al., 2017). Visualizing these relevant attributes would help to understand the model predictions and behavior.
Moreover, a more sophisticated frontend would allow us to implement interesting features, such as mapping the attention weights onto the input sequence, or more complex interaction protocols, such as touch-based interaction (Marie and Max, 2015) or segment-based interaction (Peris et al., 2017). We intend to offer the different functionalities of the toolkit as REST services, to improve the reusability of the code. We also plan to release the library in a Docker container in order to ease the deployment of future applications.