GCMA6D: Graph Convolution and Cross-Modality Attention Fusion for 6D Pose Estimation

GCMA6D is an RGB-D 6D pose estimation network for object recognition and localization in complex scenes. It targets cases where existing pose pipelines struggle: occlusion, low texture, cluttered backgrounds, and weak local geometry.
The work appears in the Proceedings of the Australasian Conference on Robotics and Automation (ACRA 2025) and connects to practical problems in 3D sensing, computational geometry, and industrial machine vision.
Problem setting
6D pose estimation is difficult for occluded or low-texture objects. GCMA6D addresses this with an RGB-D network that extracts local geometry through a 3DGCN point-cloud branch, enhances image features with Large Kernel Attention, and fuses RGB and geometric features through Cross-Modality Attention plus Squeeze-and-Excitation reweighting. Experiments on LineMOD and YCB-Video show improved accuracy over DenseFusion-style baselines.
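To make the fusion step concrete, here is a minimal NumPy sketch of cross-attention between per-point RGB and geometric features, followed by Squeeze-and-Excitation channel reweighting. It is an illustration under stated assumptions, not the paper's implementation: the function names, feature sizes, random weights, residual additions, and concatenation order are all hypothetical, and the 3DGCN and Large Kernel Attention branches are replaced by random feature matrices.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modality_attention(query_feats, context_feats, w_q, w_k, w_v):
    """Scaled dot-product attention where one modality queries the other.

    query_feats, context_feats: (N, C) per-point features.
    Returns the attended features (N, D) and the attention map (N, N).
    """
    q = query_feats @ w_q
    k = context_feats @ w_k
    v = context_feats @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)  # each row sums to 1
    return attn @ v, attn

def squeeze_excitation(feats, w1, w2):
    """Channel reweighting: global average (squeeze), small MLP (excitation)."""
    squeezed = feats.mean(axis=0)                   # (C,)
    hidden = np.maximum(squeezed @ w1, 0.0)         # ReLU bottleneck
    gates = 1.0 / (1.0 + np.exp(-(hidden @ w2)))    # sigmoid gates in (0, 1)
    return feats * gates, gates

def fuse(rgb_feats, geo_feats, params):
    # Assumed wiring: each modality attends to the other, with a residual add.
    rgb_att, _ = cross_modality_attention(rgb_feats, geo_feats, *params["rgb_from_geo"])
    geo_att, _ = cross_modality_attention(geo_feats, rgb_feats, *params["geo_from_rgb"])
    fused = np.concatenate([rgb_feats + rgb_att, geo_feats + geo_att], axis=1)
    return squeeze_excitation(fused, params["se_w1"], params["se_w2"])

# Toy setup: 8 points, 16 channels per modality, SE reduction ratio 4.
rng = np.random.default_rng(0)
N, C, r = 8, 16, 4

def make_projections():
    return [rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3)]

params = {
    "rgb_from_geo": make_projections(),
    "geo_from_rgb": make_projections(),
    "se_w1": rng.standard_normal((2 * C, 2 * C // r)) / np.sqrt(2 * C),
    "se_w2": rng.standard_normal((2 * C // r, 2 * C)) / np.sqrt(2 * C // r),
}
rgb = rng.standard_normal((N, C))   # stand-in for image-branch features
geo = rng.standard_normal((N, C))   # stand-in for point-cloud-branch features

fused, gates = fuse(rgb, geo, params)
print(fused.shape, gates.shape)  # (8, 32) (32,)
```

In this sketch the attention map lets each point's features in one modality borrow context from the other, and the sigmoid gates then scale each fused channel. How the paper actually orders or combines these operations may differ.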
The visual notes below pair the paper’s original figures with a concise reading of the method, experimental setup, and reported results.
Method and visual evidence
The method takes an RGB-D observation of the scene: the point cloud feeds the 3DGCN geometry branch, the image feeds the attention-enhanced RGB branch, and the fused features are used to estimate the object's 6D pose under occlusion, low texture, and clutter.
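A point-cloud branch on RGB-D input needs 3D points, which are usually obtained by back-projecting the depth map through a pinhole camera model. The sketch below shows that standard step; the function name, the intrinsics `fx`, `fy`, `cx`, `cy`, and the optional object mask are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy, mask=None):
    """Back-project a depth map (H, W) into camera-frame points (M, 3).

    Pixels with zero depth are treated as missing. An optional boolean
    mask (H, W) restricts the output to the object's pixels.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column and row grids
    valid = depth > 0
    if mask is not None:
        valid &= mask.astype(bool)
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=1)

# Toy 2x2 depth map with one missing pixel (hypothetical values, in metres).
depth = np.array([[1.0, 1.0],
                  [0.0, 2.0]])
points = depth_to_point_cloud(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
print(points.shape)  # (3, 3)
```

Real datasets supply calibrated intrinsics and a depth scale factor; both would replace the toy values used here.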
The figures below, extracted from the paper, show the method overview, the representation and setup, and the experimental results.

Method overview. This image is extracted from an embedded PDF image object on page 3, then recomposed for web display.

Representation and setup. This image is extracted from an embedded PDF image object on page 3, then recomposed for web display.

Experimental evidence. This image is extracted from an embedded PDF image object on page 4, then recomposed for web display.

Result comparison. This image is extracted from an embedded PDF image object on page 4, then recomposed for web display.

Additional visual result. This image is extracted from an embedded PDF image object on page 10, then recomposed for web display.
Results and impact
The paper evaluates GCMA6D on LineMOD and YCB-Video and reports improved accuracy over DenseFusion-style baselines. The figures above accompany the paper’s quantitative and qualitative analysis so that the workflow is easier to inspect from the homepage.
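This summary does not say which accuracy metric the paper reports. As an illustration only, the sketch below computes ADD and ADD-S, two metrics commonly reported on LineMOD and YCB-Video, for a predicted pose against a ground-truth pose. The helper names, the toy model points, and the 10%-of-diameter threshold are assumptions based on common practice, not on the paper.

```python
import numpy as np

def transform(points, R, t):
    # Apply a rigid 6D pose (rotation R, translation t) to model points (N, 3).
    return points @ R.T + t

def add_metric(points, R_gt, t_gt, R_pred, t_pred):
    # ADD: mean distance between corresponding transformed model points.
    diff = transform(points, R_gt, t_gt) - transform(points, R_pred, t_pred)
    return np.linalg.norm(diff, axis=1).mean()

def add_s_metric(points, R_gt, t_gt, R_pred, t_pred):
    # ADD-S: mean distance to the closest point, used for symmetric objects.
    gt = transform(points, R_gt, t_gt)
    pred = transform(points, R_pred, t_pred)
    dists = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=2)
    return dists.min(axis=1).mean()

def is_correct(error, diameter, fraction=0.1):
    # Common convention: a pose counts as correct if the error is below
    # a fraction of the object diameter.
    return error < fraction * diameter

def rot_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Toy model: four corners of a square, symmetric under 90-degree turns about z.
model = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
                  [-1.0, 0.0, 0.0], [0.0, -1.0, 0.0]])
R_gt, t_gt = np.eye(3), np.zeros(3)
R_pred, t_pred = rot_z(np.pi / 2), np.zeros(3)

add = add_metric(model, R_gt, t_gt, R_pred, t_pred)
add_s = add_s_metric(model, R_gt, t_gt, R_pred, t_pred)
print(is_correct(add, diameter=2.0), is_correct(add_s, diameter=2.0))
```

In the toy case the predicted pose is off by a quarter turn, which ADD penalizes but ADD-S does not because the rotated square coincides with the original. This is why symmetric objects are usually scored with ADD-S.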
Source handling
I extracted 32 candidate image objects from paper.pdf and generated the compressed WebP figures used on this page. The local PDF was also optimized from 2,145,667 bytes to 2,141,399 bytes.