In this talk, I will show how we can design modular agents for visual navigation that can perform tasks specified by natural language instructions, explore efficiently, plan over long horizons, and build and utilize 3D semantic maps, all while generalizing across domains and tasks. Specifically, I will first introduce a novel framework that builds and utilizes 3D semantic maps to learn both action and perception in a completely self-supervised manner. I will show that this framework closes the action-perception loop: the agent improves the object detection and instance segmentation performance of a pretrained perception model by moving around in training environments, and the improved perception model in turn improves performance on object goal navigation tasks. In the second part of the talk, I will introduce a method for grounding pretrained text-only language models to the visual domain, enabling them to process arbitrarily interleaved image-and-text inputs and to generate free-form text interleaved with retrieved images. I will show that the model achieves strong zero-shot performance on grounded tasks such as contextual image retrieval and multimodal dialogue, and I will showcase its interactive abilities.