Sim-To-Real Transfer of Visual Grounding for Human-Aided Ambiguity Resolution

Research output: Contribution to conference › Paper › Academic



Service robots should be able to interact naturally with non-expert human users, not only to help
them in various tasks, but also to receive guidance in order to resolve ambiguities that might be
present in the instruction. We consider the task of visual grounding, where the agent segments an
object from a crowded scene given a natural language description. Modern holistic approaches to
visual grounding usually ignore language structure and struggle to cover generic domains, and therefore
rely heavily on large datasets. Additionally, their transfer performance on RGB-D datasets suffers
due to high visual discrepancy between the benchmark and the target domains. Modular approaches
marry learning with domain modeling and exploit the compositional nature of language to decouple visual representation from language parsing, but either rely on external parsers or are trained in
an end-to-end fashion due to the lack of strong supervision. In this work, we seek to tackle these
limitations by introducing a fully decoupled modular framework for compositional visual grounding
of entities, attributes and spatial relations. We exploit rich scene graph annotations generated in a
synthetic domain and train each module independently in simulation. Our approach is evaluated
both in simulation and in two real RGB-D scene datasets. Experimental results show that the decoupled nature of our framework allows for easy integration with domain adaptation approaches for
Sim-To-Real visual recognition, offering a data-efficient, robust, and interpretable solution to visual
grounding in robotic applications.
Original language: English
Number of pages: 18
Publication status: Submitted - 2022
Event: 1st Conference on Lifelong Learning Agents - Montreal, Canada
Duration: 18-Aug-2022 – 24-Aug-2022


Conference: 1st Conference on Lifelong Learning Agents
