Interactive Language Acquisition with One-shot Visual Concept Learning through a Conversational Game

Building intelligent agents that can communicate with and learn from humans in natural language is of great value. Supervised language learning is limited in that it mainly captures the statistics of the training data; it adapts poorly to new scenarios and cannot flexibly acquire new knowledge without inefficient retraining or catastrophic forgetting. We highlight the perspective that conversational interaction serves as a natural interface both for language learning and for novel knowledge acquisition, and we propose a joint imitation and reinforcement approach for grounded language learning through an interactive conversational game. The agent trained with this approach is able to actively acquire information by asking questions about novel objects and to use the just-learned knowledge in subsequent conversations in a one-shot fashion. Comparisons with other methods verify the effectiveness of the proposed approach.


Introduction
Language is one of the most natural forms of communication for humans and is typically viewed as fundamental to human intelligence; it is therefore crucial for an intelligent agent to be able to use language to communicate with humans as well. While supervised training with deep neural networks has led to encouraging progress in language learning, it mainly captures the statistics of the training data, adapts poorly to new scenarios, and cannot flexibly acquire new knowledge without inefficient retraining or catastrophic forgetting. Moreover, supervised training of deep neural network models needs a large number of training samples, while many interesting applications require rapid learning from a small amount of data, which poses an even greater challenge to the supervised setting.
In contrast, humans learn in a way very different from the supervised setting (Skinner, 1957; Kuhl, 2004). First, humans act upon the world and learn from the consequences of their actions (Skinner, 1957; Kuhl, 2004; Petursdottir and Mellor, 2016). For mechanical actions such as movement, the consequences mainly follow geometrical and mechanical principles; for language, humans act by speaking, and the consequence is typically a response in the form of verbal and other behavioral feedback (e.g., nodding) from the conversation partner (i.e., the teacher). These types of feedback typically contain informative signals on how to improve language skills in subsequent conversations and play an important role in humans' language acquisition process (Kuhl, 2004; Petursdottir and Mellor, 2016). Second, humans have shown a celebrated ability to learn new concepts from a small amount of data (Borovsky et al., 2003). From even just one example, children seem to be able to make inferences and draw plausible boundaries between concepts, demonstrating the ability of one-shot learning (Lake et al., 2011).
The language acquisition process and the one-shot learning ability of human beings are both impressive as manifestations of human intelligence, and they are inspiring for designing novel settings and algorithms for computational language learning. In this paper, we leverage conversation as both an interactive environment for language learning (Skinner, 1957) and a natural interface for acquiring new knowledge (Baker et al., 2002). We propose an approach for interactive language acquisition with one-shot concept learning ability. The proposed approach allows an agent to learn grounded language from scratch, acquire the trans-

Related Work
Supervised Language Learning. Deep neural network-based language learning has seen great success in many applications, including machine translation (Cho et al., 2014b), dialogue generation (Wen et al., 2015; Serban et al., 2016), image captioning and visual question answering (? Antol et al., 2015). For training, a large amount of labeled data is needed, requiring significant effort to collect. Moreover, this setting essentially captures the statistics of the training data and does not respect the interactive nature of language learning, rendering it less flexible for acquiring new knowledge without retraining or forgetting (Stent and Bangalore, 2014).
Reinforcement Learning for Sequences. Some recent studies used reinforcement learning (RL) to tune the performance of a pre-trained language model according to certain metrics (Ranzato et al., 2016; Bahdanau et al., 2017; Li et al., 2016; Yu et al., 2017). Our work is also related to RL in a natural language action space (He et al., 2016) and shares a similar motivation with Weston (2016) and Li et al. (2017), which explored language learning through pure textual dialogues. However, in these works (He et al., 2016; Weston, 2016; Li et al., 2017), a set of candidate sequences is provided and the action is to select one from the set. Our main focus is rather on learning language from scratch: the agent has to learn to generate a sequence action rather than simply select one from a provided candidate set.

Communication and Emergence of Language.
Recent studies have examined learning to communicate (Foerster et al., 2016; Sukhbaatar et al., 2016) and to invent language (Lazaridou et al., 2017; Mordatch and Abbeel, 2018). The emerged language needs to be interpreted by humans via post-processing (Mordatch and Abbeel, 2018). We, however, aim to achieve language learning from the dual perspectives of understanding and generation, and the speaking action of the agent is readily understandable without any post-processing. Some studies on language learning have used a guesser-responder setting in which the guesser tries to achieve the final goal (e.g., classification) by collecting additional information through asking the responder questions (Strub et al., 2017; Das et al., 2017). These works try to optimize the question being asked to help the guesser achieve the final goal, while we focus on transferable speaking and one-shot ability.
One-shot Learning and Active Learning. One-shot learning has been investigated in some recent works (Lake et al., 2011; Santoro et al., 2016; Woodward and Finn, 2016). The memory-augmented network (Santoro et al., 2016) stores visual representations mixed with ground truth class labels in an external memory for one-shot learning. A class label is always provided following the presentation of an image; thus the agent receives information from the teacher in a passive way. Woodward and Finn (2016) present efforts toward active learning, using a vanilla recurrent neural network (RNN) without an external memory. Both lines of study focus on image classification only, meaning the class label is directly provided for memorization. In contrast, we target language and one-shot learning via conversational interaction, and the learner has to learn to extract important information from the teacher's sentences for memorization.

The Conversational Game
We construct a conversational game inspired by experiments on language development in infants from cognitive science (Waxman, 2004). The game is implemented with the XWORLD simulator (Yu et al., 2018; Zhang et al., 2017) and is publicly available online. It provides an environment for the agent to learn language and develop the one-shot learning ability. One-shot learning here means that during test sessions, no further training happens to the agent, and it has to answer the teacher's questions correctly about novel images of never-before-seen classes after being taught only once by the teacher, as illustrated in Figure 1. To succeed in this game, the agent has to learn to 1) speak by generating sentences, 2) extract and memorize useful information with only one exposure and use it in subsequent conversations, and 3) behave adaptively according to context and its own knowledge (e.g., asking questions about unknown objects and answering questions about something known), all achieved through interacting with the teacher. This makes our game distinct from other seemingly relevant games, in which the agent cannot speak (Wang et al., 2016) or "speaks" by selecting a candidate from a provided set (He et al., 2016; Weston, 2016; Li et al., 2017) rather than generating sentences by itself, and from games that mainly focus on slow learning (Das et al., 2017; Strub et al., 2017) and fall short on one-shot learning.
Figure 1: Interactive language and one-shot concept learning. Within a session $S_l$, the teacher may ask questions, answer the learner's questions, make statements, or say nothing. The teacher also provides reward feedback based on the learner's responses as (dis-)encouragement. The learner alternates between interpreting the teacher's sentences and generating a response through the interpreter and the speaker. Left: Initially, the learner can barely say anything meaningful. Middle: Later it can produce meaningful responses for interaction. Right: After training, when confronted with an image of cherry, which is a novel class that the learner never saw during training, the learner can ask a question about it ("what is it") and generate a correct statement ("this is cherry") for another instance of cherry after being taught only once.
In this game, sessions ($S_l$) are randomly instantiated during interaction. Testing sessions are constructed with a separate dataset whose concepts never appear during training, in order to evaluate the language and one-shot learning ability. Within a session, the teacher randomly selects an object and interacts with the learner about the object by randomly 1) posing a question (e.g., "what is this"), 2) saying nothing (i.e., "") or 3) making a statement (e.g., "this is monkey"). When the teacher asks a question or says nothing, i) if the learner raises a question, the teacher will provide a statement about the object asked (e.g., "it is frog") with a question-asking reward (+0.1); ii) if the learner says nothing, the teacher will still provide an answer (e.g., "this is elephant") but with an incorrect-reply reward (−1) to discourage the learner from remaining silent; iii) for all other incorrect responses from the learner, the teacher will provide an incorrect-reply reward and move on to the next random object for interaction. When the teacher generates a statement, the learner will receive no reward if it generates a correct statement; otherwise an incorrect-reply reward will be given. The session ends if the learner answers the teacher's question correctly or generates a correct statement when the teacher says nothing (receiving a correct-answer reward of +1), or when the maximum number of steps is reached. The sentence from the teacher at each time step is generated using a context-free grammar as shown in Table 1.
A success is reached if the learner behaves correctly during the whole session: asking questions about novel objects, generating answers when asked, and making statements when the teacher says nothing about objects that have been taught within the session. Otherwise it is a failure.
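The reward rules described above can be summarized in a small sketch. This is only an illustration of the rules as stated in the text, not the XWORLD implementation: the function name `teacher_feedback`, the string labels for the acts, and the boolean `learner_correct` flag are all hypothetical.

```python
from typing import Tuple

# Reward values quoted in the text.
R_QUESTION = 0.1    # learner raises a question
R_CORRECT = 1.0     # correct answer or statement; the session ends
R_INCORRECT = -1.0  # silence or any other incorrect response


def teacher_feedback(teacher_act: str, learner_act: str,
                     learner_correct: bool = False) -> Tuple[float, bool]:
    """Return (reward, session_ends) for one exchange.

    teacher_act, learner_act: 'question', 'silence' or 'statement'
    (hypothetical labels). learner_correct says whether the learner's
    statement names the object correctly.
    """
    if teacher_act in ("question", "silence"):
        if learner_act == "question":
            return R_QUESTION, False       # teacher then states the answer
        if learner_act == "statement" and learner_correct:
            return R_CORRECT, True         # session ends on success
        return R_INCORRECT, False          # silence or wrong reply
    # The teacher made a statement: a correct statement earns no reward,
    # anything else is penalized.
    if learner_act == "statement" and learner_correct:
        return 0.0, False
    return R_INCORRECT, False


print(teacher_feedback("question", "question"))
print(teacher_feedback("silence", "statement", learner_correct=True))
```

The maximum-step termination is left to the caller, since it depends on the session loop rather than on a single exchange.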

Interactive Language Acquisition via Joint Imitation and Reinforcement
Motivation. The goal is to learn to converse and develop the one-shot learning ability by conversing with a teacher and improving from the teacher's feedback. We propose to use a joint imitation and reinforcement approach to achieve this goal. Imitation helps the agent to develop the basic ability to generate sensible sentences. As learning is done by observing the teacher's behaviors during conversation, the agent essentially imitates the teacher from a third-person perspective (Stadie et al., 2017) rather than imitating an expert agent who is conversing with the teacher (Das et al., 2017; Strub et al., 2017). During conversations, the agent perceives sentences and images without any explicit labeling of ground truth answers, and it has to learn to make sense of raw perceptions, extract useful information, and save it for later use when generating an answer to the teacher's question. While it is tempting to purely imitate the teacher, an agent trained this way only develops echoic behavior (Skinner, 1957), i.e., mimicry.
Reinforce leverages confirmative feedback from the teacher for learning to converse adaptively beyond mimicry by adjusting the action policy. It enables the learner to use the acquired speaking ability and adapt it according to reward feedback. This is analogous to some views on babies' language-learning process, in which babies use the acquired speaking skills by trial and error with parents and improve according to the consequences of speaking actions (Skinner, 1957; Petursdottir and Mellor, 2016). The facts that babies do not fully develop speaking capabilities without the ability to hear (Houston and Miyamoto, 2011), and that it is hard to hold a meaningful conversation with a trained parrot, signify the importance of both imitation and reinforcement in language learning.
Formulation. The agent's response can be modeled as a sample from a probability distribution over the possible sequences. Specifically, for one session, given the visual input $v^t$ and conversation history $H^t = \{w^1, a^1, \cdots, w^t\}$, the agent's response $a^t$ can be generated by sampling from a distribution of the speaking action, $a^t \sim p^S_\theta(a \mid H^t, v^t)$. The agent interacts with the teacher by outputting the utterance $a^t$ and receives feedback from the teacher in the next step, with $w^{t+1}$ a sentence as verbal feedback and $r^{t+1}$ the reward feedback (positive values as encouragement and negative values as discouragement, according to $a^t$, as described in Section 3). Central to the goal is learning $p^S_\theta(\cdot)$. We formulate the problem as the minimization of a cost function:

$$\mathcal{L}_\theta = \underbrace{\mathbb{E}_{\mathcal{W}}\Big[-\sum_t \log p^I_\theta(w^t \mid \cdot)\Big]}_{\text{Imitation } \mathcal{L}^I_\theta} + \underbrace{\mathbb{E}_{p^S_\theta}\Big[-\sum_t [\gamma]^{t-1} \cdot r^t\Big]}_{\text{Reinforce } \mathcal{L}^R_\theta} \quad (1)$$

where $\mathbb{E}_{\mathcal{W}}(\cdot)$ is the expectation over all the sentences $\mathcal{W}$ from the teacher, $\gamma$ is a reward discount factor, and $[\gamma]^t$ denotes the exponentiation over $\gamma$.
While the imitation term learns directly the predictive distribution $p^I_\theta(w^t \mid H^{t-1}, a^t)$, it contributes to $p^S_\theta(\cdot)$ through parameter sharing between them.
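As a rough numerical illustration of the two terms in the cost (an imitation term over the teacher's words and a discounted-reward term), the following sketch evaluates each for a single toy session. It assumes that the first reward of a session is undiscounted and that the imitation term is a plain negative log-likelihood; the helper names `imitation_cost` and `reinforce_cost` are hypothetical.

```python
import math
from typing import Sequence


def imitation_cost(word_probs: Sequence[float]) -> float:
    """Negative log-likelihood of the teacher's words under the model.

    word_probs holds the probability the model assigned to each word
    the teacher actually said.
    """
    return -sum(math.log(p) for p in word_probs)


def reinforce_cost(rewards: Sequence[float], gamma: float = 0.99) -> float:
    """Negative discounted sum of rewards for one session.

    Assumes the k-th reward (0-based) is weighted by gamma ** k, i.e.
    the first reward is undiscounted.
    """
    return -sum((gamma ** k) * r for k, r in enumerate(rewards))


# Toy session: the learner asks a question (+0.1), then answers correctly (+1).
session_rewards = [0.1, 1.0]
total = imitation_cost([0.5, 0.9]) + reinforce_cost(session_rewards)
print(round(reinforce_cost(session_rewards), 4), round(total, 4))
```

Minimizing the second term is the same as maximizing the discounted reward, which is why both terms can be summed into one cost.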
Architecture. The learner comprises four major components: external memory, interpreter, speaker, and controller, as shown in Figure 2. External memory is flexible for storing and retrieving information (Graves et al., 2014; Santoro et al., 2016), making it a natural component of our network for one-shot learning. The interpreter is responsible for interpreting the teacher's sentences, extracting information from the perceived signals, and saving it to the external memory. The speaker is in charge of generating sentence responses with reading access to the external memory. The response could be a question asking for information or a statement answering a teacher's question, leveraging the information stored in the external memory. The controller modulates the behavior of the speaker to generate responses according to context (e.g., the learner's knowledge status).
At time step $t$, the interpreter uses an interpreter-RNN to encode the input sentence $w^t$ from the teacher as well as historical conversational information into a state vector $h^t_I$. $h^t_I$ is then passed through a residue-structured network, which is an identity mapping augmented with a learnable controller $f(\cdot)$ implemented with fully connected layers, to produce the control vector $\mathbf{c}^t$. Finally, $\mathbf{c}^t$ is used as the initial state of the speaker-RNN for generating the response $a^t$. The final state $h^t_{last}$ of the speaker-RNN will be used as the initial state of the interpreter-RNN at the next time step.

Imitation with Memory Augmented Neural Network for Echoic Behavior
The teacher's way of speaking provides a source for the agent to imitate. For example, the syntax for composing a sentence is a useful skill the agent can learn from the teacher's sentences, which could benefit both the interpreter and the speaker. The probability of the teacher's next sentence is factorized over its words:

$$p^I_\theta(w^t \mid h^{t-1}_{last}, v^t) = \prod_i p^I_\theta(w^t_i \mid w^t_{1:i-1}, h^{t-1}_{last}, v^t)$$

where $h^{t-1}_{last}$ is the last state of the RNN at time step $t-1$ as the summarization of $\{H^{t-1}, a^{t-1}\}$ (cf. Figure 2), and $i$ indexes words within a sentence.
It is natural to model the probability of the $i$-th word in the $t$-th sentence with an RNN, where the sentences up to $t$ and the words up to $i$ within the $t$-th sentence are captured by a fixed-length state vector $h^t_i = \mathrm{RNN}(h^t_{i-1}, w^t_i)$. To incorporate knowledge learned and stored in the external memory, the generation of the next word is adaptively based on i) the predictive distribution of the next word from the state of the RNN, which captures the syntactic structure of sentences, and ii) the information from the external memory, which represents the previously learned knowledge, via a fusion gate $g$:

$$p^I_\theta(w^t_{i+1} \mid h^t_i, v^t) = (1 - g) \cdot p_h + g \cdot p_r \quad (2)$$

where $p_h = \mathrm{softmax}(E^T f_{MLP}(h^t_i))$ and $p_r = \mathrm{softmax}(E^T r)$. $E \in \mathbb{R}^{d \times k}$ is the word embedding table, with $d$ the embedding dimension and $k$ the vocabulary size. $r$ is a vector read out from the external memory using a visual key, as detailed in the next section. $f_{MLP}(\cdot)$ is a Multi-Layer Perceptron (MLP) for bridging the semantic gap between the RNN state space and the word embedding space. The fusion gate $g$ is computed as $g = f(h^t_i, c)$, where $c$ is the confidence score $c = \max(E^T r)$, and a well-learned concept should have a large score by design (Appendix A.2).
Multimodal Associative Memory. We use a multimodal memory for storing visual ($v$) and sentence ($s$) features, one memory per modality, while preserving the correspondence between them (Baddeley, 1992). Information organization is more structured than in the single-modality memory used in Santoro et al. (2016), and cross-modality retrieval is straightforward under this design. A visual encoder implemented as a convolutional neural network followed by fully connected layers is used to encode the visual image $v$ into a visual key $k_v$, and then the corresponding sentence feature can be retrieved from the memory as:

$$r \leftarrow \mathrm{READ}(k_v, M_v, M_s) \quad (3)$$

$M_v$ and $M_s$ are memories for the visual and sentence modalities with the same number of slots (columns). Memory read is implemented as $r = M_s \alpha$, with $\alpha$ a soft reading weight obtained through the visual modality by calculating the cosine similarities between $k_v$ and the slots of $M_v$. Memory write is similar to the Neural Turing Machine (Graves et al., 2014), but with a content importance gate $g_{mem}$ to adaptively control whether the content $c$ should be written into memory:

$$M_m \leftarrow \mathrm{WRITE}(M_m, g_{mem} \cdot c_m), \quad m \in \{v, s\}$$

For the visual modality, $c_v \triangleq k_v$. For the sentence modality, $c_s$ has to be selectively extracted from the sentence generated by the teacher. We use an attention mechanism to achieve this by $c_s = W\eta$, where $W$ denotes the matrix whose columns are the embedding vectors of all the words in the sentence. $\eta$ is a normalized attention vector representing the relative importance of each word in the sentence, as measured by the cosine similarity between the sentence representation vector and each word's context vector, computed using a bidirectional RNN. The scalar-valued content importance gate $g_{mem}$ is computed as a function of the sentence from the teacher, meaning that the importance of the content to be written into memory depends on the content itself (cf. Appendix A.3 for more details). The memory write is achieved with an erase and an add operation:

$$\tilde{M}_m = M_m - M_m \odot (g_{mem} \cdot \mathbf{1} \cdot \beta^T), \qquad M_m = \tilde{M}_m + g_{mem} \cdot c_m \cdot \beta^T$$

where $\odot$ denotes the Hadamard product, $\mathbf{1}$ is an all-ones vector, and the write location $\beta$ is determined with a Least Recently Used Access mechanism (Santoro et al., 2016).
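The read and write operations of the multimodal memory can be sketched in plain Python. This is a minimal illustration under stated assumptions: memories are stored as lists of slot vectors, the soft reading weight is obtained by a softmax over scaled cosine similarities (the `sharpness` factor and the softmax normalization are assumptions, since the text only says the weight comes from cosine similarities), and the write location `beta` is supplied by the caller rather than computed by a least-recently-used rule.

```python
import math
from typing import List, Sequence

Memory = List[List[float]]  # one inner list per slot (a column of M)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def read_weights(k_v: Sequence[float], M_v: Memory,
                 sharpness: float = 10.0) -> List[float]:
    """Soft reading weight alpha from similarities between k_v and M_v slots."""
    scores = [sharpness * cosine(k_v, slot) for slot in M_v]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def read(k_v: Sequence[float], M_v: Memory, M_s: Memory,
         sharpness: float = 10.0) -> List[float]:
    """r = M_s alpha: address by the visual key, read the sentence memory."""
    alpha = read_weights(k_v, M_v, sharpness)
    dim = len(M_s[0])
    return [sum(a * slot[j] for a, slot in zip(alpha, M_s)) for j in range(dim)]


def write(M: Memory, content: Sequence[float], g_mem: float,
          beta: Sequence[float]) -> Memory:
    """Gated erase-then-add at the soft location beta."""
    return [[x - g_mem * b * x + g_mem * b * c for x, c in zip(slot, content)]
            for slot, b in zip(M, beta)]


# Toy usage: two slots, 2-d visual keys, 3-d sentence features.
M_v = [[0.0, 0.0], [0.0, 0.0]]
M_s = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
k_v, c_s, beta = [1.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0]
M_v = write(M_v, k_v, g_mem=1.0, beta=beta)   # store the visual key
M_s = write(M_s, c_s, g_mem=1.0, beta=beta)   # store the paired sentence feature
r = read(k_v, M_v, M_s)                       # retrieve by the same visual key
print([round(x, 3) for x in r])
```

Because both memories are written at the same location, a visual key that matches a stored slot retrieves the sentence feature stored alongside it, and a gate value of zero leaves the memory untouched.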

Context-adaptive Behavior Shaping through Reinforcement Learning
Imitation fosters the basic language ability for generating echoic behavior (Skinner, 1957), but it is not enough for conversing adaptively with the teacher according to context and the knowledge state of the learner. Thus we leverage reward feedback to shape the behavior of the agent by optimizing the policy using RL. The agent's response $a^t$ is generated by the speaker, which can be modeled as a sample from a distribution over all possible sequences, given the conversation history $H^t$ and the visual input $v^t$: $a^t \sim p^S_\theta(a \mid H^t, v^t)$. As $H^t$ can be encoded by the interpreter-RNN as $h^t_I$, the action policy can be represented as $p^S_\theta(a \mid h^t_I, v^t)$. To leverage the language skill that is learned via imitation through the interpreter, we can generate the sentence by implementing the speaker with an RNN, sharing parameters with the interpreter-RNN, but with a conditional signal modulated by a controller network (Figure 2):

$$a^t \sim p^S_\theta(a \mid h^t_I, v^t) = p^I_\theta(a \mid h^t_I + f(h^t_I, c), v^t) \quad (4)$$

The reason for using a controller $f(\cdot)$ for modulation is that the basic language model only offers the learner the echoic ability to generate a sentence, but not necessarily the adaptive behavior according to context (e.g., asking questions when facing novel objects and providing an answer for a previously learned object according to its own knowledge state). Without any additional module or learning signals, the agent's behaviors would be the same as those of the teacher because of parameter sharing; thus, it is difficult for the agent to learn to speak in an adaptive manner.
To learn from the consequences of speaking actions, the policy $p^S_\theta(\cdot)$ is adjusted by maximizing the expected future reward as represented by $\mathcal{L}^R_\theta$. As a non-differentiable sampling operation is involved in Eqn. (4), the policy gradient theorem (Sutton and Barto, 1998) is used to derive the gradient for updating $p^S_\theta(\cdot)$ in the reinforce module:

$$\nabla_\theta \mathcal{L}^R_\theta = -\mathbb{E}_{p^S_\theta}\Big[\sum_t \nabla_\theta \log p^S_\theta(a^t \mid h^t_I, v^t) \cdot A^t\Big]$$

where $A^t = r^{t+1} + \gamma V(h^{t+1}_I, c^{t+1}) - V(h^t_I, c^t)$ is the advantage (Sutton and Barto, 1998) estimated using a value network $V(\cdot)$, which takes the interpreter state and the confidence score as input. The imitation module contributes by implementing $\mathcal{L}^I_\theta$ with a cross-entropy loss (Ranzato et al., 2016) and minimizing it with respect to the parameters in $p^I_\theta(\cdot)$, which are shared with $p^S_\theta(\cdot)$. The training signal from imitation takes the shortcut connection without going through the controller. More details on $f(\cdot)$ and $V(\cdot)$ are provided in Appendix A.2.
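To make the role of the advantage concrete, the sketch below computes one-step advantages from a reward sequence and value estimates, and forms the usual surrogate loss whose gradient is a policy-gradient estimate. It assumes the standard one-step temporal-difference form of the advantage and a value of zero after the final step; the helper names are hypothetical, and the value estimates are supplied as plain numbers rather than produced by a network.

```python
from typing import List, Sequence


def td_advantages(rewards: Sequence[float], values: Sequence[float],
                  gamma: float = 0.99) -> List[float]:
    """One-step advantages: reward + gamma * next value - current value.

    rewards[t] is the reward received after the action at step t and
    values[t] is the value estimate at step t. The value after the last
    step is taken to be zero (an assumption for finite sessions).
    """
    advantages = []
    for t, reward in enumerate(rewards):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        advantages.append(reward + gamma * next_value - values[t])
    return advantages


def policy_gradient_loss(log_probs: Sequence[float],
                         advantages: Sequence[float]) -> float:
    """Surrogate loss -sum_t log p(a^t) * A^t, with advantages held constant."""
    return -sum(lp * adv for lp, adv in zip(log_probs, advantages))


# Toy session: ask a question (+0.1), then answer correctly (+1).
adv = td_advantages([0.1, 1.0], values=[0.5, 0.8], gamma=0.5)
loss = policy_gradient_loss([-1.0, -2.0], adv)
print([round(a, 3) for a in adv], round(loss, 3))
```

Actions followed by a better-than-expected outcome get a positive advantage, so minimizing the surrogate raises their log-probability.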

Experiments
We conduct experiments with comparison to baseline approaches. We first experiment with a word-level task in which the teacher and the learner communicate a single word each time. We then investigate the impact of image variations on concept learning. We further perform an evaluation on the more challenging sentence-level task, in which the teacher and the agent communicate in the form of sentences with varying lengths.
Setup. To evaluate the performance in learning a transferable ability, rather than the ability of fitting a particular dataset, we use an Animal dataset for training and test the trained models on a Fruit dataset (Figure 1). More details on the datasets are provided in Appendix A.1. Each session consists of two randomly sampled classes, and the maximum number of interaction steps is six.
Baselines. The following methods are compared:
• Reinforce: a baseline model with the same network structure as the proposed model and trained using RL only, i.e., minimizing $\mathcal{L}^R_\theta$;
• Imitation: a recurrent encoder-decoder (Serban et al., 2016) model with the same structure as ours and trained via imitation (minimizing $\mathcal{L}^I_\theta$);
• Imitation+Gaussian-RL: a joint imitation and reinforcement method using a Gaussian policy (Duan et al., 2016) in the latent space of the control vector $\mathbf{c}^t$ (Zhang et al., 2017). The policy is changed by modifying the control vector $\mathbf{c}^t$ that the action policy depends upon.
Training Details. The training algorithm is implemented with the deep learning platform PaddlePaddle. The whole network is trained from scratch in an end-to-end fashion. The network is randomly initialized without any pre-training and is trained with decayed Adagrad (Duchi et al., 2011). We use a batch size of 16, a learning rate of $1 \times 10^{-5}$ and a weight decay rate of $1.6 \times 10^{-3}$. We also exploit experience replay (Wang et al., 2017; Yu et al., 2018). The reward discount factor $\gamma$ is 0.99, the word embedding dimension $d$ is 1024 and the dictionary size $k$ is 80.
Figure 5: Test success rate and reward for the word-level task on the Fruit dataset under different test image variation ratios for models trained on the Animal dataset with a variation ratio of 0.5 (solid lines) and without variation (dashed lines).
used in both training and testing for Imitation and Imitation+Gaussian-RL baselines.

Word-Level Task
In this experiment, we focus on a word-level task, which offers an opportunity to analyze and understand the underlying behavior of different algorithms while being free from distracting factors. Note that although the teacher speaks a single word each time, the learner still has to learn to generate a full sentence ended with an end-of-sentence symbol.
Figure 3 shows the evolution curves of the rewards during training for different approaches. Reinforce makes very little progress, mainly due to the difficulty of exploration in the large space of sequence actions. Imitation obtains higher rewards than Reinforce during training, as it can avoid some penalty by generating sensible sentences such as questions. Imitation+Gaussian-RL gets higher rewards than both Imitation and Reinforce, indicating that the RL component reshapes the action policy toward higher rewards. However, as the Gaussian policy optimizes the action policy indirectly in a latent feature space, it is less efficient for exploration and learning. Proposed achieves the highest final reward during training.
We train the models using the Animal dataset and evaluate them on the Fruit dataset; Figure 4 summarizes the success rate and average reward over 1K testing sessions. As can be observed, Reinforce achieves the lowest success rate (0.0%) and reward (−6.0) due to its inherent inefficiency in learning. Imitation performs better than Reinforce in terms of both its success rate (28.6%) and reward value (−2.7). Imitation+Gaussian-RL achieves a higher reward (−1.2) during testing, but its success rate (32.1%) is similar to that of Imitation, mainly due to the rigorous criteria for success. Proposed reaches the highest success rate (97.4%) and average reward (+1.1), outperforming all baseline methods by a large margin.
From this experiment, it is clear that imitation with a proper usage of reinforcement is crucial for achieving adaptive behaviors (e.g., asking questions about novel objects and generating answers or statements about learned objects proactively).

Learning with Image Variations
To evaluate the impact of within-class image variations on one-shot concept learning, we train models with and without image variations, and during testing compare their performance under different image variation ratios (the chance of a novel image instance being present within a session), as shown in Figure 5. The performance of the model trained without image variations drops significantly as the variation ratio increases. We also evaluate the performance of models trained under a variation ratio of 0.5. Figure 5 shows that although there is also a performance drop, which is expected, the performance degrades more gradually, indicating the importance of image variation for learning one-shot concepts. Figure 6 visualizes sampled training and testing images represented by their corresponding features extracted using the visual encoder trained without and with image variations. Clusters of visually similar concepts emerge in the feature space when the encoder is trained with image variations, indicating that a more discriminative visual encoder was obtained for learning generalizable concepts.

Sentence-Level Task
We further evaluate the model on sentence-level tasks. The teacher's sentences are generated using the grammar shown in Table 1 and have a number of variations, with sentence lengths ranging from one to five. Example sentences from the teacher are presented in Appendix A.1. This task is more challenging than the word-level task in two ways: i) information processing is more difficult, as the learner has to learn to extract useful information which could appear at different locations of the sentence; ii) sentence generation is also more difficult than in the word-level task, and the learner has to adaptively fuse information from the RNN and the external memory to generate a complete sentence.
A comparison of the different approaches in terms of their success rates and average rewards on the novel test set is shown in Figure 8. As can be observed from the figure, Proposed again outperforms all other compared methods in terms of both success rate (82.8%) and average reward (+0.8), demonstrating its effectiveness even for the more complex sentence-level task.
We also visualize the information extraction and the adaptive sentence composing process of the proposed approach when applied to a test set. As shown in Figure 7, the agent learns to extract useful information from the teacher's sentence and to use the content importance gate to control what content is written into the external memory. Concretely, sentences containing object names have a larger $g_{mem}$ value, and the word corresponding to the object name has a larger value in the attention vector $\eta$ compared to other words in the sentence. The combined effect of $\eta$ and $g_{mem}$ suggests that words corresponding to object names have higher likelihoods of being written into the external memory. The agent also successfully learns to use the external memory for storing the information extracted from the teacher's sentence, to fuse it adaptively with the signal from the RNN (capturing the syntactic structure), and to generate a complete sentence with the new concept included. The value of the fusion gate $g$ is small when generating words like "what," "i," "can," and "see," meaning it mainly relies on the signal from the RNN for generation (cf. Eqn. (2) and Figure 7). In contrast, when generating object names (e.g., "banana" and "cucumber"), the fusion gate $g$ has a large value, meaning that there is more emphasis on the signal from the external memory. This experiment shows that the proposed approach is applicable to the more complex sentence-level task for language learning and one-shot learning. More interestingly, it learns an interpretable operational process, which can be easily understood. More results, including example dialogues from different approaches, are presented in Appendix A.4.

Discussion
We have presented an approach for grounded language acquisition with one-shot visual concept learning in this work. This is achieved by purely interacting with a teacher and learning from feedback arising naturally during interaction through joint imitation and reinforcement learning, with a memory-augmented neural network. Experimental results show that the proposed approach is effective for language acquisition with one-shot visual concept learning across several different settings compared with several baseline approaches.
In the current work, we have designed and used a computer game (a synthetic task with synthetic language) for training the agent. This is mainly because, to the best of our knowledge, there is no existing dataset that is adequate for studying the interactive language learning and one-shot learning problem addressed here. Although our current design is an artificial game, there is a reasonable amount of variation both within and across sessions, e.g., the object classes to be learned within a session, the presentation order of the selected classes, and the sentence patterns and image instances to be used. All these factors contribute to the increased complexity of the learning task, making it non-trivial and already very challenging for existing approaches, as shown by the experimental results. While offering flexibility in training, one downside of using a synthetic task is its limited amount of variation compared with real-world scenarios with natural languages. Although it might be non-trivial to extend the proposed approach to real natural language directly, we regard this work as an initial step towards this ultimate ambitious goal, and our game might shed some light on designing more advanced games or performing real-world data collection. We plan to investigate the generalization and application of the proposed approach to more realistic environments with more diverse tasks in future work.

A.1 Datasets and Example Sentences
The Animal dataset contains 40 animal classes with 408 images in total, with about 10 images per class on average. The Fruit dataset contains 16 classes and 48 images in total, with 3 images per class. The object classes and images are summarized in Table 2 and Figure 9. Example sentences from the teacher in different cases (questioning, answering, and saying nothing) are presented in Table 3.

A.2.3 Fusion Gate
The fusion gate $g$ is implemented as two FC layers with ReLU activations and a third FC layer with a sigmoid activation. The output dimensions of the three layers are 50, 10 and 1 respectively.
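The layer stack described above can be sketched in NumPy as follows. This is a minimal illustration only: the input size `IN_DIM`, the random initialization, and the helper names `init_fc` and `fusion_gate` are assumptions that are not specified in this appendix.

```python
import numpy as np


def init_fc(rng, in_dim, out_dim):
    """Return (weights, bias) for one fully connected layer (hypothetical init)."""
    scale = np.sqrt(2.0 / in_dim)
    return rng.normal(0.0, scale, size=(in_dim, out_dim)), np.zeros(out_dim)


def fusion_gate(x, layers):
    """Two ReLU FC layers, then one sigmoid FC layer; returns an array of shape (1,)."""
    (w1, b1), (w2, b2), (w3, b3) = layers
    h = np.maximum(x @ w1 + b1, 0.0)     # FC + ReLU, output dimension 50
    h = np.maximum(h @ w2 + b2, 0.0)     # FC + ReLU, output dimension 10
    logit = h @ w3 + b3                  # FC, output dimension 1
    return 1.0 / (1.0 + np.exp(-logit))  # sigmoid keeps the gate in (0, 1)


rng = np.random.default_rng(0)
IN_DIM = 64  # assumed input size; not given in the text
dims = [IN_DIM, 50, 10, 1]
layers = [init_fc(rng, dims[i], dims[i + 1]) for i in range(3)]
g = fusion_gate(rng.normal(size=IN_DIM), layers)
```

Because the last activation is a sigmoid, `g` can be used directly as a mixing coefficient between two sources.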

A.2.4 Controller
The controller $f(\cdot)$ together with the identity mapping forms a residual-structured network. $f(\cdot)$ is implemented as two FC layers with ReLU activations and a third FC layer with a linear activation, all having an output dimension of 1024.
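A minimal NumPy sketch of this residual structure is given below. It assumes that the controller input is also 1024-dimensional, so that adding the identity mapping is well defined. The initialization and the names `controller` and `residual_controller` are illustrative.

```python
import numpy as np

DIM = 1024  # output dimension of every controller layer, as stated above


def controller(h, params):
    """f(.): two ReLU FC layers followed by a linear FC layer."""
    (w1, b1), (w2, b2), (w3, b3) = params
    z = np.maximum(h @ w1 + b1, 0.0)
    z = np.maximum(z @ w2 + b2, 0.0)
    return z @ w3 + b3


def residual_controller(h, params):
    """Identity mapping plus f(.), i.e. the residual structure."""
    return h + controller(h, params)


rng = np.random.default_rng(0)
# Assumed: the input state is 1024-dimensional so that the sum is defined.
params = [(rng.normal(0.0, 0.01, size=(DIM, DIM)), np.zeros(DIM)) for _ in range(3)]
h_last = rng.normal(size=DIM)                 # stand-in for the interpreter-RNN's last state
h_init = residual_controller(h_last, params)  # would initialize the speaker-RNN
```

With all weights at zero the block reduces to the identity, so the speaker would start from the interpreter's state unchanged. The learned $f(\cdot)$ only has to model the correction.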

A.2.5 Value Network
The value network is introduced to estimate the expected accumulated future reward. It takes the state vector of the interpreter-RNN $h_I$ and the confidence $c$ as input. It is implemented as two FC layers with ReLU activations and output dimensions of 512 and 204 respectively. The third layer is another FC layer with a linear activation and an output dimension of 1. It is trained by minimizing a cost as in (Sutton and Barto, 1998), where $V'(\cdot)$ denotes a target version of the value network, whose parameters remain fixed until copied from $V(\cdot)$ periodically (Mnih et al., 2013).
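The sketch below illustrates one common way to train such a value network with a periodically synchronized target copy. It assumes a squared one-step temporal-difference error with a discount factor `gamma`. The exact cost, the state size `H_DIM`, and all function names are assumptions made for illustration.

```python
import numpy as np


def init_value_net(rng, in_dim):
    """FC layers with output dimensions 512, 204 and 1, as listed above."""
    dims = [in_dim, 512, 204, 1]
    return [
        (rng.normal(0.0, np.sqrt(2.0 / dims[i]), size=(dims[i], dims[i + 1])),
         np.zeros(dims[i + 1]))
        for i in range(3)
    ]


def value(params, h_i, c):
    """V(h_I, c): ReLU, ReLU, then a linear layer producing a scalar."""
    x = np.concatenate([h_i, [c]])
    (w1, b1), (w2, b2), (w3, b3) = params
    z = np.maximum(x @ w1 + b1, 0.0)
    z = np.maximum(z @ w2 + b2, 0.0)
    return float((z @ w3 + b3)[0])


def td_cost(params, target_params, h_t, c_t, reward, h_next, c_next, gamma=0.99):
    """Squared one-step TD error; the bootstrap target uses the frozen copy."""
    target = reward + gamma * value(target_params, h_next, c_next)
    return (value(params, h_t, c_t) - target) ** 2


def sync_target(params):
    """Copy the online parameters into a fresh target network."""
    return [(w.copy(), b.copy()) for w, b in params]


rng = np.random.default_rng(0)
H_DIM = 1024  # assumed size of the interpreter-RNN state
params = init_value_net(rng, H_DIM + 1)
target_params = sync_target(params)
h_t, h_next = rng.normal(size=H_DIM), rng.normal(size=H_DIM)
cost = td_cost(params, target_params, h_t, 0.2, 1.0, h_next, 0.9)
```

Keeping `target_params` fixed between calls to `sync_target` prevents the regression target from shifting with every update of the online network.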

A.2.6 Confidence Score
The confidence score $c$ is defined as follows:

$$c = \max\left(E^{\top} r\right),$$

where $E \in \mathbb{R}^{d \times k}$ is the word embedding table, with $d$ the embedding dimension and $k$ the vocabulary size. $r \in \mathbb{R}^{d}$ is the vector read out from the sentence modality of the external memory as:

$$r = \sum_{j} \alpha_j M_s[j],$$

where $M_s$ denotes the sentence modality of the memory and $\alpha$ is a soft reading weight obtained through the visual modality by calculating the cosine similarities between $k_v$ and the slots of $M_v$. The content stored in the memory is extracted from the teacher's sentence $\{w_1, w_2, \cdots, w_n\}$ as:

$$c_s = \sum_{i} \eta_i w_i,$$

where $w_i \in \mathbb{R}^{d}$ denotes the embedding vector extracted from the word embedding table $E$ for the word $w_i$. Therefore, for a well-learned concept with effective $\eta$ for information extraction and effective $\alpha$ for information retrieval, $r$ should be an embedding vector mainly corresponding to the label word associated with the visual image. Consequently, the value of $c$ should be large, and the maximum is reached at the location where that label word resides in the embedding table. For a completely novel concept, as the memory contains no information about it, the reading attention $\alpha$ will not be focused, and thus $r$ would be an average of a set of existing word embedding vectors in the external memory, leading to a small $c$ value.
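The behavior described above (large $c$ for a learned concept, small $c$ for a novel one) can be checked on a toy example. The sketch assumes that $c$ is the maximum entry of $E^{\top} r$ and that $\alpha$ is a softmax over scaled cosine similarities. The `sharpness` factor, the one-hot embeddings, and the function names are illustrative assumptions.

```python
import numpy as np


def soft_read_weights(k_v, M_v, sharpness=10.0):
    """Softmax over cosine similarities between the key and each memory slot."""
    key = k_v / (np.linalg.norm(k_v) + 1e-8)
    slots = M_v / (np.linalg.norm(M_v, axis=1, keepdims=True) + 1e-8)
    scores = sharpness * (slots @ key)
    e = np.exp(scores - scores.max())
    return e / e.sum()


def confidence(k_v, M_v, M_s, E):
    """c = max(E^T r), with r the alpha-weighted read-out of the sentence memory."""
    alpha = soft_read_weights(k_v, M_v)
    r = alpha @ M_s
    return float(np.max(E.T @ r))


# Toy setup: d = k = 4 with one-hot embeddings, and three filled memory slots.
E = np.eye(4)        # embedding table, shape (d, k)
M_v = np.eye(4)[:3]  # visual keys of three stored concepts
M_s = E.T[:3]        # each slot stores the embedding of its label word
c_known = confidence(M_v[1], M_v, M_s, E)        # key matches slot 1
c_novel = confidence(np.eye(4)[3], M_v, M_s, E)  # key matches no slot
```

In this toy setting the matching key yields a confidence close to 1, whereas the unmatched key spreads $\alpha$ uniformly over the three slots and yields 1/3.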

A.3.1 Content Extraction
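We use an attention scheme to extract useful information from a sentence to be written into memory. Given a sentence $w = \{w_1, w_2, \cdots, w_n\}$, we first compute a sentence summary vector $s$ and per-word state vectors with forward and backward passes over the sentence. The context vector is the concatenation of the word embedding vector and the state vectors of both forward and backward passes:

$$\bar{w}_i = \left[w_i;\ \overrightarrow{h}_i;\ \overleftarrow{h}_i\right].$$

The word-level attention $\eta = [\eta_1, \eta_2, \cdots, \eta_i, \cdots]$ is computed as the cosine similarity between the transformed sentence summary vector $s$ and each context vector $\bar{w}_i$:

$$\eta_i = \cos\left(f_{\mathrm{MLP}}^{1}(s),\ f_{\mathrm{MLP}}^{2}(\bar{w}_i)\right).$$

Both MLPs contain two FC layers with output dimensions of 1024, with a linear and a Tanh activation for the two layers respectively. The content $c_s$ to be written into the memory is computed as:

$$c_s = \sum_{i} \eta_i w_i.$$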
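The attention-based extraction can be sketched as below. The forward and backward state vectors are random stand-ins rather than outputs of a trained RNN. The form of the summary vector, the toy sizes `n`, `d` and `h`, the omission of biases, and all function names are assumptions for illustration.

```python
import numpy as np


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def mlp(x, w1, w2):
    """Two FC layers: linear activation, then Tanh (biases omitted for brevity)."""
    return np.tanh((x @ w1) @ w2)


def extract_content(word_embs, contexts, summary, mlp_s, mlp_w):
    """Return (c_s, eta): attention-weighted sum of word embeddings and the weights."""
    query = mlp(summary, *mlp_s)
    eta = np.array([cosine(query, mlp(ctx, *mlp_w)) for ctx in contexts])
    return eta @ word_embs, eta


rng = np.random.default_rng(0)
n, d, h, out = 4, 16, 32, 1024  # n, d and h are assumed toy sizes
word_embs = rng.normal(size=(n, d))
fwd, bwd = rng.normal(size=(n, h)), rng.normal(size=(n, h))  # stand-ins for RNN states
contexts = np.concatenate([word_embs, fwd, bwd], axis=1)     # one context vector per word
summary = np.concatenate([fwd[-1], bwd[0]])                  # assumed form of s


def init(in_dim):
    """Random weights for one two-layer MLP with 1024-dimensional outputs."""
    return (rng.normal(0.0, in_dim ** -0.5, size=(in_dim, out)),
            rng.normal(0.0, out ** -0.5, size=(out, out)))


c_s, eta = extract_content(word_embs, contexts, summary, init(2 * h), init(d + 2 * h))
```

Since `c_s` is a weighted mixture of word embeddings, it lives in the same space as the embedding table, which is what allows the read-out to be compared against that table later.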

A.3.2 Importance Gate
The content importance gate is computed as $g_{\mathrm{mem}} = \sigma(f_{\mathrm{MLP}}(s))$, meaning that the importance of the content to be written into the memory depends on the sentence from the teacher. The MLP contains two FC layers with ReLU activations and output dimensions of 50 and 30 respectively, followed by another FC layer with a linear activation and an output dimension of 20. The output layer is an FC layer with an output dimension of 1 and a sigmoid activation $\sigma$.
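A minimal sketch of the gate follows. The layer widths are those listed above. The input size `S_DIM`, the content size `D`, the initialization, and the final step that scales the content by the gate are assumptions for illustration.

```python
import numpy as np


def importance_gate(s, params):
    """g_mem = sigmoid(f_MLP(s)) with layer widths 50, 30, 20 and 1."""
    (w1, b1), (w2, b2), (w3, b3), (w4, b4) = params
    z = np.maximum(s @ w1 + b1, 0.0)  # FC + ReLU, 50 units
    z = np.maximum(z @ w2 + b2, 0.0)  # FC + ReLU, 30 units
    z = z @ w3 + b3                   # FC + linear, 20 units
    logit = (z @ w4 + b4)[0]          # output FC, 1 unit
    return float(1.0 / (1.0 + np.exp(-logit)))


rng = np.random.default_rng(0)
S_DIM, D = 64, 16  # assumed sizes of the sentence summary and the content vector
dims = [S_DIM, 50, 30, 20, 1]
params = [(rng.normal(0.0, dims[i] ** -0.5, size=(dims[i], dims[i + 1])),
           np.zeros(dims[i + 1])) for i in range(4)]
g_mem = importance_gate(rng.normal(size=S_DIM), params)
c_s = rng.normal(size=D)  # stand-in for the extracted content
weighted = g_mem * c_s    # assumed use: scale the content before writing
```

A gate near 0 suppresses the write, which would be appropriate when the teacher's sentence is a question or silence, and a gate near 1 writes the content almost unchanged.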

A.4 Example Dialogues on Novel Data
We train models on the Animal dataset and perform the evaluation on the Fruit dataset.

Figure 2: Network structure. (a) Illustration of the overall architecture. At each time step, the learner uses the interpreter module to encode the teacher's sentence. The visual perception is also encoded and used as a key to retrieve information from the external memory. The last state of the interpreter-RNN is passed through a controller. The controller's output is added to its input and used as the initial state of the speaker-RNN. The interpreter-RNN updates the external memory with importance-weighted information (importance illustrated with transparency) extracted from the perception input. 'Mix' denotes a mixture of word embedding vectors. (b) The structures of the interpreter-RNN (top) and the speaker-RNN (bottom). The interpreter-RNN and speaker-RNN share parameters.

Figure 3: Evolution of reward during training for the word-level task without image variations.

Figure 6: Visualization of the CNN features with t-SNE. Ten classes randomly sampled from (a-b) the Animal dataset and (c-d) the Fruit dataset, with features extracted using the visual encoder trained without (a, c) and with (b, d) image variations on the Animal dataset.

Figure 7:
see pineapple
T: i can see pumpkin  L: pumpkin
T:  L: grape
SUCCESS

T: what  L: tell what you observe
T: i can see avocado  L: avocado
T: this is coconut  L: coconut
T: what is there  L: this is avocado
SUCCESS

T: it is coconut  L: i can see coconut
T: this is pumpkin  L: i can see pumpkin
T: what is it  L: i can see pumpkin
T: what do you see  L: i can see pumpkin
FAILURE

Table 1: Grammar for the teacher's sentences.
The visual image size is 32 × 32, the maximum length of a generated sentence is 6, and the memory size is 10. Test performance for the word-level task without image variations. Models are trained on the Animal dataset and tested on the Fruit dataset.
during training for exploration while a max operation is used during testing both for Proposed and for the Reinforce baseline. The max operation is
3 https://github.com/PaddlePaddle/Paddle

Table 2: Object classes for two datasets.

Table 3: Example sentences from the teacher.

Example dialogues of different approaches are shown in Table 4. It is observed that Reinforce arrives at a policy in which the learner keeps silent. Both Imitation and Imitation+Gaussian-RL can generate sensible sentences, but cannot speak adaptively according to context. Proposed can speak adaptively according to context, asking for information about novel classes and answering questions after being taught only once by the teacher.

Table 4: Example dialogues from different approaches.