CoMAE: Single Model Hybrid Pre-training on Small-Scale RGB-D Datasets
Authors: Jiange Yang, Sheng Guo, Gangshan Wu, Limin Wang
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on SUN RGB-D and NYUDv2 datasets demonstrate the effectiveness of our CoMAE for RGB and depth representation learning. |
| Researcher Affiliation | Collaboration | Jiange Yang1, Sheng Guo2, Gangshan Wu1, Limin Wang1* 1 State Key Laboratory for Novel Software Technology, Nanjing University, China 2 MYbank, Ant Group, China |
| Pseudocode | No | The paper describes the methodology using text and equations but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | Code will be released at https://github.com/MCG-NJU/CoMAE. |
| Open Datasets | Yes | SUN RGB-D Dataset is the most popular RGB-D scene recognition dataset. It contains RGB-D images from NYU depth v2, Berkeley B3DO (Janoch et al. 2013), and SUN3D (Xiao, Owens, and Torralba 2013)... NYU Depth Dataset V2 (NYUDv2) contains 1,449 well labeled RGB-D images for scene recognition. |
| Dataset Splits | Yes | Following the official setting in (Song, Lichtenberg, and Xiao 2015), we only use the images from 19 major scene categories, containing 4,845 / 4,659 train / test images. (SUN RGB-D) ... 795 images are for training and 654 are for testing. (NYUDv2) |
| Hardware Specification | Yes | Our proposed CoMAE is implemented on the PyTorch toolbox, and we pre-train and fine-tune our models on eight TITAN Xp GPUs using the AdamW (Loshchilov and Hutter 2019) optimizer with a weight decay of 0.05. |
| Software Dependencies | No | The paper mentions implementation using the 'PyTorch toolbox' but does not specify a version number or any other software dependencies with their versions. |
| Experiment Setup | Yes | In the hybrid pre-training stage, we separately train cross-modal patch-level contrastive learning for 75 epochs and the multi-modal masked autoencoder for 1200 epochs. The base learning rate is set to 1.0 × 10⁻³ for pre-training and 5.0 × 10⁻⁴ for fine-tuning; the batch size is set to 256 for pre-training and 768 for fine-tuning. The effective learning rate follows the linear scaling rule in (Goyal et al. 2017): lr = base_lr × batchsize / 256. Warm-up and layer-decay strategies are used to adjust the learning rate. |
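The linear scaling rule quoted above can be illustrated with a minimal sketch; the function name `effective_lr` and the reference batch size of 256 are taken from the quoted setup, while the helper itself is hypothetical (not from the paper's code):

```python
def effective_lr(base_lr: float, batch_size: int, reference: int = 256) -> float:
    """Linear LR scaling rule (Goyal et al. 2017): lr = base_lr * batch_size / reference."""
    return base_lr * batch_size / reference

# Pre-training: base_lr 1.0e-3 at batch size 256 -> unchanged
pretrain_lr = effective_lr(1.0e-3, 256)
# Fine-tuning: base_lr 5.0e-4 at batch size 768 -> scaled by 3
finetune_lr = effective_lr(5.0e-4, 768)
```

With these settings the fine-tuning batch size (768) is three times the reference, so the effective fine-tuning learning rate is 1.5 × 10⁻³.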