Sim-To-Real Transfer of Visual Grounding for Human-Aided Ambiguity Resolution

Research output: Working paper › Preprint › Academic


Abstract

Service robots should be able to interact naturally with non-expert human users, not only to help them in various tasks, but also to receive guidance in order to resolve ambiguities that might be present in the instruction. We consider the task of visual grounding, where the agent segments an object from a crowded scene given a natural language description. Modern holistic approaches to visual grounding usually ignore language structure and struggle to cover generic domains, therefore relying heavily on large datasets. Additionally, their transfer performance in RGB-D datasets suffers due to high visual discrepancy between the benchmark and the target domains. Modular approaches marry learning with domain modeling and exploit the compositional nature of language to decouple visual representation from language parsing, but either rely on external parsers or are trained in an end-to-end fashion due to the lack of strong supervision. In this work, we seek to tackle these limitations by introducing a fully decoupled modular framework for compositional visual grounding of entities, attributes and spatial relations. We exploit rich scene graph annotations generated in a synthetic domain and train each module independently in simulation. Our approach is evaluated both in simulation and in two real RGB-D scene datasets. Experimental results show that the decoupled nature of our framework allows for easy integration with domain adaptation approaches for Sim-To-Real visual recognition, offering a data-efficient, robust, and interpretable solution to visual grounding in robotic applications.
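To make the idea of decoupled, compositional grounding concrete, the following is a minimal illustrative sketch (not taken from the paper): a hypothetical pipeline in which separate entity, attribute, and spatial-relation modules each score object proposals for a query such as "the red mug to the left of the laptop", and their scores are combined. All class, function, and field names here are assumptions for illustration only.

```python
# Hypothetical sketch of compositional grounding with decoupled modules.
# Not the paper's implementation; names and scoring rules are illustrative.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ObjectProposal:
    label: str                      # predicted category, e.g. "mug"
    color: str                      # predicted attribute, e.g. "red"
    centroid: Tuple[float, float, float]  # (x, y, z) position in the scene


def entity_score(obj: ObjectProposal, category: str) -> float:
    """Entity module: match the object's category to the query noun."""
    return 1.0 if obj.label == category else 0.0


def attribute_score(obj: ObjectProposal, attribute: str) -> float:
    """Attribute module: match a visual attribute such as color."""
    return 1.0 if obj.color == attribute else 0.0


def relation_score(obj: ObjectProposal, anchor: ObjectProposal, relation: str) -> float:
    """Spatial-relation module: a crude 'left of' test on x coordinates."""
    if relation == "left of":
        return 1.0 if obj.centroid[0] < anchor.centroid[0] else 0.0
    return 0.0


def ground(objects: List[ObjectProposal], anchor: ObjectProposal) -> ObjectProposal:
    """Combine independently computed module scores to pick the referent."""
    return max(
        objects,
        key=lambda o: entity_score(o, "mug")
        * attribute_score(o, "red")
        * relation_score(o, anchor, "left of"),
    )
```

Because each module is evaluated separately, a module trained in simulation could, in principle, be swapped for a domain-adapted counterpart without retraining the rest of the pipeline, which is the flavor of decoupling the abstract describes.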
Original language: English
Publisher: arXiv
Number of pages: 18
DOIs
Publication status: Submitted - 10-Jul-2022

