RevColV2: Exploring Disentangled Representations in Masked Image Modeling

Authors: Qi Han, Yuxuan Cai, Xiangyu Zhang

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results suggest that a foundation model with decoupled features can achieve competitive performance across multiple downstream vision tasks such as image classification, semantic segmentation and object detection.
Researcher Affiliation | Industry | Qi Han, MEGVII Technology, Beijing, China (hqer@foxmail.com); Yuxuan Cai, MEGVII Technology, Beijing, China (larryx.tsai@gmail.com); Xiangyu Zhang, MEGVII Technology, Beijing, China (zhangxiangyu@megvii.com)
Pseudocode | No | The paper describes the architecture and operations with equations and diagrams but does not provide structured pseudocode or algorithm blocks. (An illustrative sketch of the reversible update follows the table.)
Open Source Code | No | The paper does not include an explicit statement about releasing source code or a link to a code repository for the described methodology.
Open Datasets | Yes | We pre-train RevColV2 on the ImageNet-1K [25] dataset. To fully exploit the potential of RevColV2, we intermediately fine-tune the models on ImageNet-22K [39] following [34, 12] after MIM pre-training. For semantic segmentation, we evaluate RevColV2 backbones on the ADE20K benchmark [26] with the UperNet [46] and Mask2Former [47] frameworks. For object detection and instance segmentation, we evaluate RevColV2 backbones on the COCO [27] dataset with the Mask R-CNN [55] and Cascade Mask R-CNN [56] frameworks. For the additional teacher, we use the larger Laion400M dataset [28], which contains about 400M unlabeled images, in pre-training.
Dataset Splits | No | The paper mentions using ImageNet-1K, ImageNet-22K, ADE20K, and COCO for training and fine-tuning, but it does not explicitly provide dataset-split details (e.g., percentages or sample counts for training, validation, and test sets) beyond implicitly following standard benchmark practice.
Hardware Specification | Yes | We test the inference latency of variant models in Table 7. As described in RevColV1 [1], fragmented memory access takes a large part of latency. We show throughput (#image/s) under different batch sizes for RevColV2-L and ViT-L on a single A100 GPU. We test training speed and memory cost on a single 8× A100 (80GB) machine, with the same data-loader (implemented for our cluster). This allows RevColV2 to be pre-trained with limited resources, such as an RTX 3090 GPU. (A throughput-measurement sketch follows the table.)
Software Dependencies | No | The paper mentions the AdamW optimizer and implicitly PyTorch (when discussing the vanilla autograd function in PyTorch), but it does not specify version numbers for any software, libraries, or frameworks used in the experiments.
Experiment Setup | Yes | The mask ratio is set as 75% with a random sampling strategy, and the reconstruction target is the normalized raw pixels from the original image. We pre-train RevColV2 models for 1600 epochs. The pre-training image size is 224² and the pre-training optimization parameters are: batch size 4096, base learning rate 1.5e-4 for a 256 batch size and linearly scaled up, AdamW with weight decay 0.05. More details can be found in the supplementary material (see also Tables B.1 to B.5). (A configuration sketch follows the table.)
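
Since the paper gives the reversible column operations as equations rather than pseudocode, the following is a minimal PyTorch sketch of the multi-level reversible update used in the RevCol family, assuming the simplified form x_t = F(x_{t-1}) + γ·x_{t-m} from RevColV1 [1]; the toy block F and the module names are illustrative assumptions, not the paper's exact operations.

    import torch
    import torch.nn as nn

    class ReversibleUnit(nn.Module):
        """Toy multi-level reversible update in the spirit of RevCol:
            x_t = F(x_{t-1}) + gamma * x_{t-m}
        Because gamma is a nonzero scalar, the previous-column feature can be
        recovered exactly, so it need not be cached for the backward pass:
            x_{t-m} = (x_t - F(x_{t-1})) / gamma
        """
        def __init__(self, dim: int, gamma: float = 1.0):
            super().__init__()
            # Stand-in for the paper's column block (illustrative only)
            self.f = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())
            self.gamma = gamma

        def forward(self, x_prev_level, x_prev_col):
            # x_prev_level: x_{t-1};  x_prev_col: x_{t-m} from the previous column
            return self.f(x_prev_level) + self.gamma * x_prev_col

        def inverse(self, x_t, x_prev_level):
            # Reconstruct x_{t-m} during backward instead of storing it
            return (x_t - self.f(x_prev_level)) / self.gamma

This invertibility is what lets the columns discard intermediate activations and trade recomputation for memory.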
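
Throughput figures like the quoted #image/s on a single A100 are typically measured with a synchronized timing loop; a minimal sketch, assuming CUDA is available and `model` is any vision backbone:

    import time
    import torch

    @torch.no_grad()
    def throughput(model, batch_size, img_size=224, iters=50, warmup=10):
        """Measure inference throughput (images/second) on a single GPU."""
        model.cuda().eval()
        x = torch.randn(batch_size, 3, img_size, img_size, device="cuda")
        for _ in range(warmup):      # warm up kernels and the allocator
            model(x)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()
        return iters * batch_size / (time.time() - start)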
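
The quoted pre-training recipe maps onto a standard MAE-style setup. Below is a minimal sketch of the stated hyperparameters, including the linear learning-rate scaling rule; the masking and target helpers are common MAE-style implementations, not code from the paper:

    import torch

    # Hyperparameters as stated in the paper
    MASK_RATIO   = 0.75
    EPOCHS       = 1600
    BATCH_SIZE   = 4096
    BASE_LR      = 1.5e-4                       # defined for batch size 256
    LR           = BASE_LR * BATCH_SIZE / 256   # linear scaling -> 2.4e-3
    WEIGHT_DECAY = 0.05
    IMG_SIZE     = 224

    def random_mask(num_patches, mask_ratio=MASK_RATIO):
        """Random-sampling mask: returns (masked, visible) patch indices."""
        ids = torch.argsort(torch.rand(num_patches))
        num_masked = int(num_patches * mask_ratio)
        return ids[:num_masked], ids[num_masked:]

    def normalized_pixel_target(patches, eps=1e-6):
        """Per-patch normalized raw pixels, the reconstruction target.
        patches: (N, L, patch_dim)"""
        mean = patches.mean(dim=-1, keepdim=True)
        var = patches.var(dim=-1, keepdim=True)
        return (patches - mean) / (var + eps).sqrt()

    # optimizer = torch.optim.AdamW(model.parameters(), lr=LR,
    #                               weight_decay=WEIGHT_DECAY)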