CoMAE: Single Model Hybrid Pre-training on Small-Scale RGB-D Datasets

Authors: Jiange Yang, Sheng Guo, Gangshan Wu, Limin Wang

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on SUN RGB-D and NYUDv2 datasets demonstrate the effectiveness of our CoMAE for RGB and depth representation learning.
Researcher Affiliation | Collaboration | Jiange Yang (1), Sheng Guo (2), Gangshan Wu (1), Limin Wang (1)*; (1) State Key Laboratory for Novel Software Technology, Nanjing University, China; (2) MYbank, Ant Group, China
Pseudocode | No | The paper describes the methodology using text and equations but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | No | Code will be released at https://github.com/MCG-NJU/CoMAE.
Open Datasets | Yes | SUN RGB-D Dataset is the most popular RGB-D scene recognition dataset. It contains RGB-D images from NYU depth v2, Berkeley B3DO (Janoch et al. 2013), and SUN3D (Xiao, Owens, and Torralba 2013)... NYU Depth Dataset V2 (NYUDv2) contains 1,449 well labeled RGB-D images for scene recognition.
Dataset Splits | Yes | Following the official setting in (Song, Lichtenberg, and Xiao 2015), we only use the images from 19 major scene categories, containing 4,845 / 4,659 train / test images. (SUN RGB-D) ... 795 images are for training and 654 are for testing. (NYUDv2)
Hardware Specification | Yes | Our proposed CoMAE is implemented on the Pytorch toolbox, and we pre-train and fine-tune our models on eight TITAN Xp GPUs using AdamW (Loshchilov and Hutter 2019) optimizer with a weight decay 0.05. (An optimizer sketch follows the table.)
Software Dependencies | No | The paper mentions implementation using 'Pytorch toolbox' but does not specify a version number or other software dependencies with their versions.
Experiment Setup | Yes | In the hybrid pre-training stage, we separately train cross-modal patch-level contrastive learning for 75 epochs and the multi-modal masked autoencoder for 1200 epochs. The base learning rate is set to 1.0 × 10^-3 for pre-training and 5.0 × 10^-4 for fine-tuning, and the batch size is set to 256 for pre-training and 768 for fine-tuning; the effective learning rate follows the linear scaling rule in (Goyal et al. 2017): lr = base_lr × batch_size / 256. Warm-up and layer-decay strategies are used to adjust the learning rate. (Learning-rate schedule sketches follow the table.)
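
The quoted hardware and experiment-setup rows determine the optimizer configuration. Below is a minimal PyTorch sketch of how that configuration could be instantiated: the base learning rate, batch size, weight decay, and the linear scaling rule come from the quoted setup, while the model is a placeholder standing in for the CoMAE encoder, which is not reproduced here.

    import torch

    # Values quoted from the paper's setup (pre-training stage).
    BASE_LR = 1.0e-3      # base learning rate for pre-training
    BATCH_SIZE = 256      # pre-training batch size
    WEIGHT_DECAY = 0.05   # AdamW weight decay

    # Linear scaling rule (Goyal et al. 2017): lr = base_lr * batch_size / 256.
    effective_lr = BASE_LR * BATCH_SIZE / 256

    # Placeholder module standing in for the CoMAE encoder (not defined in this report).
    model = torch.nn.Linear(768, 768)

    # AdamW optimizer with weight decay 0.05, as stated in the paper.
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=effective_lr,
        weight_decay=WEIGHT_DECAY,
    )

With batch size 256 the effective learning rate equals the base rate; for the fine-tuning batch size of 768 the same rule gives 3x the fine-tuning base rate.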
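The paper mentions warm-up and layer-decay learning-rate strategies but does not spell them out. The sketch below shows one common realization of such schedules for ViT-style encoders (linear warm-up followed by cosine decay, and layer-wise learning-rate decay as popularized in BEiT/MAE-style fine-tuning); the warm-up length, schedule shape, and the 0.65 decay rate are illustrative assumptions, not values reported in the paper.

    import math

    def lr_at_step(step, total_steps, warmup_steps, peak_lr):
        # Linear warm-up to peak_lr, then cosine decay to zero (assumed schedule shape).
        if step < warmup_steps:
            return peak_lr * step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

    def layer_lr_scale(layer_id, num_layers, decay_rate=0.65):
        # Layer-wise lr decay: earlier transformer blocks get smaller learning rates.
        # decay_rate=0.65 is a common choice in MAE-style fine-tuning, not from this paper.
        return decay_rate ** (num_layers - layer_id)

    # Example: per-layer scale factors for a 12-block encoder
    # (index 0 for the patch embedding; illustrative only).
    scales = [layer_lr_scale(i, num_layers=12) for i in range(13)]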