RevColV2: Exploring Disentangled Representations in Masked Image Modeling
Authors: Qi Han, Yuxuan Cai, Xiangyu Zhang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results suggest that a foundation model with decoupled features can achieve competitive performance across multiple downstream vision tasks such as image classification, semantic segmentation and object detection. |
| Researcher Affiliation | Industry | Qi Han, MEGVII Technology, Beijing, China (hqer@foxmail.com); Yuxuan Cai, MEGVII Technology, Beijing, China (larryx.tsai@gmail.com); Xiangyu Zhang, MEGVII Technology, Beijing, China (zhangxiangyu@megvii.com) |
| Pseudocode | No | The paper describes the architecture and operations with equations and diagrams but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not include an explicit statement about releasing source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | We pre-train RevColV2 on the ImageNet-1K [25] dataset. To fully exploit the potential of RevColV2, we intermediately fine-tune the models on ImageNet-22K [39] following [34, 12] after MIM pre-training. For semantic segmentation tasks, we evaluate RevColV2 backbones on the ADE20K benchmark [26] with the UperNet [46] and Mask2Former [47] frameworks. For object detection and instance segmentation tasks, we evaluate RevColV2 backbones on the COCO [27] dataset with the Mask R-CNN [55] and Cascade Mask R-CNN [56] frameworks. For the additional teacher, we use the larger Laion400M dataset [28], which contains about 400M unlabeled images, in pre-training. |
| Dataset Splits | No | The paper mentions using ImageNet-1K, ImageNet-22K, ADE20K, and COCO for training and fine-tuning, but it does not explicitly provide dataset-split details (e.g., percentages or sample counts for training, validation, and test sets) beyond implicitly following standard benchmark practice. |
| Hardware Specification | Yes | We test the inference latency of variant models in Table 7. As described in RevColV1 [1], fragmented memory access accounts for a large part of the latency. We show throughput (images/s) under different batch sizes for RevColV2-L and ViT-L on a single A100 GPU. We test training speed and memory cost on a single 8× A100 (80GB) machine, with the same data loader (implemented for our cluster). This allows RevColV2 to be pre-trained with limited resources, such as an RTX 3090 GPU. (A minimal throughput-measurement sketch follows the table.) |
| Software Dependencies | No | The paper mentions the AdamW optimizer and implicitly PyTorch (when discussing the "vanilla autograd function in PyTorch"), but it does not specify version numbers for any software, libraries, or frameworks used in the experiments. |
| Experiment Setup | Yes | The mask ratio is set to 75% with a random sampling strategy, and the reconstruction target is the normalized raw pixels of the original image. We pre-train RevColV2 models for 1600 epochs. The pre-training image size is 224², and the pre-training optimization parameters are: batch size 4096, base learning rate 1.5e-4 per 256 batch size and linearly scaled up, AdamW with weight decay 0.05. More details can be found in the supplementary material (Tables B.1 to B.5). (An optimizer and masking sketch follows the table.) |
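The throughput numbers quoted in the Hardware Specification row (RevColV2-L vs. ViT-L on an A100) follow a standard GPU benchmarking pattern. Below is a minimal sketch of such a measurement, assuming a CUDA-capable machine with PyTorch installed; the backbone (`resnet50`), batch sizes, and iteration counts are stand-ins for illustration, since RevColV2's code is not released.

```python
import time
import torch
import torchvision.models as models

# Hypothetical stand-in backbone; the paper benchmarks RevColV2-L and ViT-L,
# whose reference implementations are not reproduced here.
model = models.resnet50().cuda().eval()

@torch.no_grad()
def throughput(model, batch_size, image_size=224, warmup=10, iters=50):
    """Measure inference throughput (images/s) for one batch size on a GPU."""
    x = torch.randn(batch_size, 3, image_size, image_size, device="cuda")
    for _ in range(warmup):           # warm-up passes amortize CUDA init/caching
        model(x)
    torch.cuda.synchronize()          # make sure warm-up kernels have finished
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()          # wait for all kernels before stopping the clock
    return batch_size * iters / (time.time() - start)

for bs in (16, 64, 256):              # batch sizes chosen for illustration only
    print(f"batch {bs}: {throughput(model, bs):.1f} images/s")
```

The explicit `torch.cuda.synchronize()` calls matter: CUDA kernels launch asynchronously, so timing without synchronization would under-report the actual compute time.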
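The pre-training recipe in the Experiment Setup row combines AdamW (weight decay 0.05) with the linear learning-rate scaling rule (base lr 1.5e-4 per 256 images, i.e. 1.5e-4 × 4096/256 = 2.4e-3 at batch size 4096) and MAE-style 75% random patch masking. The sketch below illustrates these two pieces under stated assumptions; the model, patch size, and tensor shapes are hypothetical placeholders, not the paper's implementation.

```python
import torch

# Linear learning-rate scaling as stated in the paper:
# base lr 1.5e-4 per 256 batch size, scaled to the reported batch size of 4096.
base_lr, base_batch, batch_size = 1.5e-4, 256, 4096
lr = base_lr * batch_size / base_batch   # = 2.4e-3

# Hypothetical model; RevColV2 itself is not open-sourced per the table above.
model = torch.nn.Linear(768, 768)
optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.05)

def random_mask(images, patch_size=16, mask_ratio=0.75):
    """MAE-style random masking: returns a boolean mask over patches
    (True = masked). Patch size and shapes here are illustrative."""
    B, _, H, W = images.shape
    num_patches = (H // patch_size) * (W // patch_size)
    num_masked = int(num_patches * mask_ratio)
    noise = torch.rand(B, num_patches)            # random score per patch
    ids = noise.argsort(dim=1)                    # shuffled patch indices
    mask = torch.zeros(B, num_patches, dtype=torch.bool)
    mask.scatter_(1, ids[:, :num_masked], True)   # mask the first 75% of the shuffle
    return mask

mask = random_mask(torch.randn(2, 3, 224, 224))
print(mask.shape, mask.float().mean().item())     # torch.Size([2, 196]) ~0.75
```

With 224² inputs and 16-pixel patches there are 14 × 14 = 196 patches per image, of which 147 are masked at the 75% ratio.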