Vision and Language Learning: From Image Captioning and Visual Question Answering towards Embodied Agents

Peter Anderson

Abstract

Each time we ask for an object, describe a scene, or follow directions, we are converting information between visual and linguistic representations. People do this with ease, typically without even noticing. Intelligent systems that perform useful tasks in unstructured situations and interact with people will also require this ability. In this talk, we will focus on the joint modeling of visual and linguistic information using deep neural networks. We will cover some recent advances in automatic image captioning, visual question answering (VQA), and vision-and-language navigation (VLN).

The material will be drawn from the following two papers to be presented at CVPR 2018:

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments