AI Sensing for Robotics using Deep Learning based Visual and Language Modeling

An artificial intelligence (AI) system should be capable of processing sensory inputs to extract both task-specific and general information about its environment. However, most existing algorithms extract only task-specific information. This work presents an approach to processing visual sensory data using a convolutional neural network (CNN). It recognizes and represents the physical and semantic nature of the surroundings in both human-readable and machine-processable formats. The approach uses an image captioning model to capture the semantics of the input image and a modular design to generate a probability distribution over semantic topics. It gives an autonomous system the ability to process visual information in a human-like way and generates insights that are hard to obtain with conventional algorithms. A model and a data collection method are proposed.


Introduction
In a world where visible light facilitates information sharing, living creatures have developed organs for sensing light to understand their surroundings. In an autonomous system, this information is captured in the IR, UV, and visible spectrum using sophisticated sensors and is processed using complex algorithms. At the Consumer Electronics Show (CES) 2020, Samsung presented Ballie, a personalized robot intended to be aware of its surroundings and to control IoT devices around the home to improve the environment. With companies aiming to launch smart home robots capable of following voice commands, there is a need for a system that can automatically understand the semantics of the environment and take appropriate decisions on its own. Recent work in scene understanding involved the construction of a knowledge graph for visual semantic understanding (Jiang et al., 2019). The authors used an ontology graph in combination with visual captioning to describe the scene. Another approach, for functional scene understanding, was introduced using semantic segmentation (Wald et al., 2018). These scene understanding approaches specialize a system for certain tasks and working environments, while failing to generalize across different types of situations and to capture human emotions.
It is more efficient to make decisions based on a structured description of the scene than to work on raw pixel information. Fig. 1 shows scenarios in which a human can easily interpret the meaning of the scene. It is easy to tell from Fig. 1b that firefighters are trying to put out a fire in a building. The same holds for Fig. 1a and 1b, where a human can understand and explain the scene easily through a language representation.
In this work we propose an AI sensing system that can semantically interpret the environmental conditions, objects, relations, and activities from the visual feed. These interpretations are converted into text for human understanding and into a probability distribution for the control system to process and act on. The main intention of this work is to have a neural network-based sensor processing unit capable of extracting semantic context while deployed on low-powered compute hardware.
This paper is organized as follows:
• Section 2 explains the modules of the proposed approach.
• Section 3 discusses the dataset and the considerations to make while implementing this approach.

Proposed Approach
In this work, a modular approach is proposed to represent the semantic content of the outside world through a vision sensor. A detailed flow diagram of the proposed method is shown in Fig. 2. It consists of three sub-modules, namely the CNN feature extractor, the language module, and the environment context probability detector module. Together, the visual, language, and context detection modules assist the control unit in making decisions based on non-task-specific environment details.

CNN Feature Extractor
This module processes the visual feed and converts each incoming frame into a feature tensor (f), which is used to generate a semantic understanding of the surroundings. The feature tensor f encodes the information present in the incoming frame. A CNN-based feature extractor (Xu et al., 2015) trained for the image classification task on the ImageNet dataset (Deng et al., 2009) is used. A variety of pre-trained CNN architectures are available for use as feature extractors. Architectures such as MobileNet (Sandler et al., 2018), ResNet (He et al., 2016), InceptionNet (Szegedy et al., 2015), and DenseNet (Huang et al., 2017) each have their own benefits and drawbacks. A specific architecture can be chosen based on the deployment hardware, the expected response time, and the nature of the environment.

Language Module
In this module, the information in the feature tensor f is extracted and represented in a human-interpretable language (l). This is achieved using a Long Short-Term Memory (LSTM) unit (Sak et al., 2014), a deep neural network (DNN) for generating sequential output (Xu et al., 2015). A combination of a soft-attention mechanism (Xu et al., 2015) and an LSTM is used to describe the contents extracted from the frame (Vinodababu, 2018). This is a recursive step in which execution halts when the end token <end> is predicted or the maximum sentence length is reached.
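The generated caption can be written compactly as below. This is an assumed form built only from the symbols defined in the surrounding text; the caption length n is introduced here for illustration.

```latex
l = (w_1, w_2, \ldots, w_n), \qquad w_t \in R^k
```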
Here R^k is the vector of tokenized words in the vocabulary and l is the generated word sequence. A byproduct of having a language representation is the explainability of actions.
Caption generation happens recursively: to sample a word w from R^k, the module goes through the following process (see Fig. 3).
At time step t:
• The attention mechanism computes the mask m_t for the feature tensor f using f and the hidden state H_{t−1}.
• f weighted by m_t, combined with the previously detected word w_{t−1}, is passed to the LSTM along with the hidden state H_{t−1} and cell state C_{t−1} from the previous step.
• The LSTM outputs a probability distribution over the words in the vocabulary R^k.
This process is carried out until the end token <end> is predicted or the maximum caption length is reached. The effectiveness of this module depends on generating a dense caption for the scene.
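The decoding loop described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the trained model: dot-product scoring stands in for the learned attention network, the word with the highest probability is picked greedily, and `toy_lstm_step`, `VOCAB`, and `SCRIPT` are hypothetical stand-ins for a trained LSTM cell and its vocabulary.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(f, h_prev):
    """Compute the soft-attention mask m_t over the regions of f and the weighted sum of f.

    Assumption: a dot product between each region and H_{t-1} replaces the learned scorer.
    """
    scores = [sum(a * b for a, b in zip(region, h_prev)) for region in f]
    m_t = softmax(scores)
    dim = len(f[0])
    context = [sum(m * region[j] for m, region in zip(m_t, f)) for j in range(dim)]
    return m_t, context


def generate_caption(f, lstm_step, vocab, h0, c0, max_len=20):
    """Greedy decoding: stop when <end> is predicted or max_len words are emitted."""
    h, c = h0, c0
    prev_word = "<start>"
    caption = []
    for _ in range(max_len):
        _, context = attend(f, h)
        probs, h, c = lstm_step(context, prev_word, h, c)
        word = vocab[max(range(len(vocab)), key=probs.__getitem__)]
        if word == "<end>":
            break
        caption.append(word)
        prev_word = word
    return caption


# Hypothetical stand-in for a trained LSTM cell: it follows a fixed script
# and returns the hidden and cell states unchanged.
VOCAB = ["<start>", "<end>", "a", "dog", "runs"]
SCRIPT = {"<start>": "a", "a": "dog", "dog": "runs", "runs": "<end>"}


def toy_lstm_step(context, prev_word, h, c):
    target = SCRIPT[prev_word]
    probs = [0.9 if w == target else 0.025 for w in VOCAB]
    return probs, h, c


features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # three image regions, two channels
caption = generate_caption(features, toy_lstm_step, VOCAB, h0=[0.1, 0.2], c0=[0.0, 0.0])
print(" ".join(caption))
```

In a real system `lstm_step` would wrap the trained LSTM cell and word embedding; beam search could replace the greedy choice without changing the stopping conditions.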

Environment Context Detector
The verbal representation from the language module is used to generate probability distributions over various groups of semantic context. The module consists of one or more neural networks operating in parallel, each performing prediction over a different context. Fig. 4 shows the block diagram of this module, in which different fully connected networks (FCNs) are used for prediction. The caption is tokenized and vectorized to act as input, and a fully connected network converts it into a probability distribution. Here GloVe embedding (Pennington et al., 2014) is used to vectorize the sentence. The activation of the output layer can be either softmax or sigmoid, depending on the nature of the data. The context topics should be decided based on the workspace and the preference of the robotics designer.
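The output of the module can be summarized as follows. This is an assumed reconstruction that uses only the symbols defined in the text.

```latex
E = \{c_1, c_2, \ldots, c_d\}
```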
Here c_i is the prediction vector of the i-th context net, E is the collection of the c vectors, and d is the desired number of context nets. The generated probability distribution is sent to the control system, which takes the final decision on whether to react. The proposed solution serves as an add-on to the existing control system.
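A minimal sketch of the parallel context nets is given below. It assumes each net is a single linear layer applied to the average of the caption's word vectors; the tiny `EMB` table is a hypothetical stand-in for GloVe vectors, and the weights, class meanings, and topic names in `NETS` are made up for illustration.

```python
import math


def softmax(z):
    """Softmax for mutually exclusive classes."""
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid_vec(z):
    """Element-wise sigmoid for independent (multi-label) outputs."""
    return [1.0 / (1.0 + math.exp(-v)) for v in z]


def embed(caption, embeddings, dim):
    """Tokenize the caption and average its word vectors; unknown words are skipped."""
    tokens = caption.lower().split()
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(v[j] for v in vecs) / len(vecs) for j in range(dim)]


def context_net(x, weights, bias, activation):
    """One fully connected layer followed by the chosen output activation."""
    z = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]
    return activation(z)


def detect_context(caption, embeddings, dim, nets):
    """Run every context net on the same caption vector and collect the c_i into E."""
    x = embed(caption, embeddings, dim)
    return {name: context_net(x, *params) for name, params in nets.items()}


# Hypothetical 2-d embeddings and two context nets (d = 2).
EMB = {"fire": [1.0, 0.0], "building": [0.5, 0.5], "park": [0.0, 1.0]}
NETS = {
    "situation": ([[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0], softmax),  # [emergency, calm]
    "human_present": ([[1.0, 1.0]], [-0.5], sigmoid_vec),
}

E = detect_context("Fire in the building", EMB, 2, NETS)
for topic, c_i in E.items():
    print(topic, [round(p, 3) for p in c_i])
```

Softmax suits topics whose classes exclude one another, while sigmoid suits topics where several labels can hold at once, which matches the choice of output activation described above.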

Dataset and Considerations
The CNN feature extractor is a model pre-trained on the ImageNet dataset (Deng et al., 2009) to classify 1000 object classes. The language module is trained on the COCO image captioning dataset (Lin et al., 2014), which consists of images and captions in the target language. A BLEU-1 score of 70.7 is achieved for the language module.
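For readers unfamiliar with the metric, BLEU-1 is the clipped unigram precision of a candidate caption against its references, multiplied by a brevity penalty. The sketch below computes it for a single sentence on a 0 to 1 scale; it assumes whitespace tokenization and is not the evaluation script used for the reported score, which is normally computed over a whole corpus and shown on a 0 to 100 scale.

```python
import math
from collections import Counter


def bleu1(candidate, references):
    """Sentence-level BLEU-1: clipped unigram precision times the brevity penalty."""
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    if not cand:
        return 0.0

    # For each word, the largest count seen in any single reference.
    max_ref = Counter()
    for ref in refs:
        for word, n in Counter(ref).items():
            max_ref[word] = max(max_ref[word], n)

    # Clip each candidate word count by its maximum reference count.
    clipped = sum(min(n, max_ref[word]) for word, n in Counter(cand).items())
    precision = clipped / len(cand)

    # Brevity penalty uses the reference length closest to the candidate length
    # (the shorter one on ties).
    ref_len = min((len(r) for r in refs), key=lambda n: (abs(n - len(cand)), n))
    penalty = 1.0 if len(cand) >= ref_len else math.exp(1.0 - ref_len / len(cand))
    return penalty * precision


score = bleu1("the the the cat", ["the cat sat on the mat"])
print(round(score, 4))
```

In this example only two of the three occurrences of "the" are credited, because the reference contains the word twice, and the short candidate is further penalized for brevity.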
The dataset for the environment context module is similar to a text sentiment classification dataset. The input is a sentence and the labels are one-hot vectors of the target class. A dataset is created from a portion of the COCO captions, where the semantic context topics are environment, situation, mood, presence of human, and objects in the scene, as shown in Fig. 4.
There are several practical considerations to take into account when adopting this method. A few of them are:
• On-board compute capability to carry out DNN calculations.
• Robot deployment environment and its nature.
• The actual intention and task of the robot.
• How the control system should react to the generated probability distribution.
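The sentence-plus-one-hot-label format described for the environment context dataset could look like the sketch below. The class names under each topic are hypothetical, since the source lists only the topics, and the "objects in the scene" topic is left out because it would likely need multi-label targets rather than a one-hot vector.

```python
# Hypothetical class lists for four of the context topics named in the text.
CONTEXT_CLASSES = {
    "environment": ["indoor", "outdoor"],
    "situation": ["normal", "emergency"],
    "mood": ["calm", "tense"],
    "human_present": ["no", "yes"],
}


def one_hot(label, classes):
    """Return a one-hot vector marking the position of label in classes."""
    if label not in classes:
        raise ValueError(f"unknown label {label!r}; expected one of {classes}")
    return [1 if c == label else 0 for c in classes]


def make_sample(caption, labels):
    """Pair a caption with one one-hot target vector per context topic."""
    return {
        "caption": caption,
        "targets": {
            topic: one_hot(labels[topic], classes)
            for topic, classes in CONTEXT_CLASSES.items()
        },
    }


sample = make_sample(
    "firefighters spray water on a burning building",
    {
        "environment": "outdoor",
        "situation": "emergency",
        "mood": "tense",
        "human_present": "yes",
    },
)
print(sample["targets"])
```

Each topic's target vector would train the corresponding context net, so adding or removing a topic only changes the dictionary of class lists.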

Conclusion
The main objective of this work is to use neural networks to understand and represent the physical environment around the system. This work serves as an add-on to the existing control system by providing an additional set of inputs that capture semantic meaning. An image captioning-based approach is used to obtain the semantic content of the surroundings, which is then represented as a probability distribution.