Enhancing Fine-Grained 3D Object Recognition Using Hybrid Multi-Modal Vision Transformer-CNN Models

Songsong Xiong*, Georgios Tziafas, Hamidreza Kasaei

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

7 Citations (Scopus)
25 Downloads (Pure)

Abstract

Robots operating in human-centered environments, such as retail stores, restaurants, and households, are often required to distinguish between similar objects in different contexts with a high degree of accuracy. However, fine-grained object recognition remains a challenge in robotics due to the high intra-category and low inter-category dissimilarities. In addition, the limited number of fine-grained 3D datasets poses a significant problem in addressing this issue effectively. In this paper, we propose a hybrid multi-modal Vision Transformer (ViT) and Convolutional Neural Networks (CNN) approach to improve the performance of fine-grained visual classification (FGVC). To address the shortage of FGVC 3D datasets, we generated two synthetic datasets. The first dataset consists of 20 categories related to restaurants with a total of 100 instances, while the second dataset contains 120 shoe instances. Our approach was evaluated on both datasets, and the results indicate that our hybrid multi-modal model outperforms both CNN-only and ViT-only baselines, achieving a recognition accuracy of 94.50% and 93.51% on the restaurant and shoe datasets, respectively. Additionally, we have made our FGVC RGB-D datasets available to the research community to enable further experimentation and advancement. Furthermore, we integrated our proposed method with a robot framework and demonstrated its potential as a fine-grained perception tool in both simulated and real-world robotic scenarios.

Original language: English
Title of host publication: 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2023
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 5751-5757
Number of pages: 7
ISBN (Electronic): 9781665491907
DOIs
Publication status: Published - 13-Dec-2023
Event: 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2023 - Detroit, United States
Duration: 1-Oct-2023 to 5-Oct-2023

Publication series

Name: IEEE International Conference on Intelligent Robots and Systems
ISSN (Print): 2153-0858
ISSN (Electronic): 2153-0866

Conference

Conference: 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2023
Country/Territory: United States
City: Detroit
Period: 01/10/2023 to 05/10/2023
