Exploring Target Representations for Masked Autoencoders

Authors: Xingbin Liu, Jinghao Zhou, Tao Kong, Xianming Lin, Rongrong Ji

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we show that a careful choice of the target representation is unnecessary for learning good visual representation. Driven by this observation, we propose a multi-stage masked distillation pipeline and use a randomly initialized model as the teacher, enabling us to effectively train high-capacity models without any effort to carefully design the target representation. On various downstream tasks of classification, transfer learning, object detection, and semantic segmentation, the proposed method to perform masked knowledge distillation with bootstrapped teachers (dBOT) outperforms previous self-supervised methods by nontrivial margins. We hope our findings, as well as the proposed method, could motivate people to rethink the roles of target representations in pre-training masked autoencoders. (A sketch of this bootstrapped distillation loop follows the table.)
Researcher Affiliation | Collaboration | Xingbin Liu (1,2), Jinghao Zhou (2), Tao Kong (2), Xianming Lin (1), Rongrong Ji (1); (1) Xiamen University, (2) ByteDance
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code and pre-trained models are publicly available at https://github.com/liuxingbin/dbot.
Open Datasets | Yes | We pre-train models on ImageNet-1K (Deng et al., 2009) and conduct evaluation under classification on ImageNet, object detection on COCO (Lin et al., 2014), and semantic segmentation on ADE20K (Zhou et al., 2017).
Dataset Splits | Yes | We primarily focus on the end-to-end fine-tuning performance and report the top-1 validation accuracy on the ImageNet-1K (Deng et al., 2009) dataset.
Hardware Specification | Yes | All entries are tested on the same setting, i.e., with 32 NVIDIA A100-80G GPUs.
Software Dependencies | No | The paper mentions software components such as 'AdamW optimizer' and 'Smooth L1 loss', and implies the use of a deep learning framework such as PyTorch and CUDA (given GPU usage), but it does not provide specific version numbers for these software dependencies (e.g., Python 3.x, PyTorch 1.x, CUDA 11.x).
Experiment Setup | Yes | The learning rate is first linearly increased to the initial learning rate for the first 40 epochs and then cosine annealed to 0. The initial learning rate is set as 1.5e-4 × batch size / 256, with batch size being 4096 for all models. We use the AdamW optimizer (Loshchilov & Hutter, 2019) and Smooth L1 loss (Girshick, 2015) to optimize the parameters of the student network. Stochastic drop rates are applied: 0.2 for ViT-B, 0.2 for ViT-L, and 0.3 for ViT-H. ... By default, we pre-train all models for classification with 2 stages, and for object detection and semantic segmentation with 3 stages. (A sketch of this schedule follows the table.)
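The Research Type row above quotes the paper's core recipe: masked knowledge distillation in which the first teacher is simply a randomly initialized network, and each subsequent stage re-initializes the student and uses the previous stage's converged student as the new teacher. The following is a minimal PyTorch-style sketch of that loop, not the authors' released implementation; the `make_vit` factory, the `student(images, mask)` call signature, the 0.75 mask ratio, and the per-stage epoch counts are illustrative assumptions, while the learning rate is derived from the setup row (1.5e-4 × 4096 / 256 = 2.4e-3).

```python
import copy
import torch
import torch.nn.functional as F


def random_masking(batch_size, num_patches, mask_ratio=0.75):
    """Boolean [B, N] mask; True marks patches whose teacher features the student must predict."""
    noise = torch.rand(batch_size, num_patches)
    num_masked = int(mask_ratio * num_patches)
    ids = noise.argsort(dim=1, descending=True)[:, :num_masked]   # indices of masked patches
    mask = torch.zeros(batch_size, num_patches, dtype=torch.bool)
    mask.scatter_(1, ids, True)
    return mask


def train_one_stage(student, teacher, loader, epochs, lr=2.4e-3):
    """One distillation stage: a frozen teacher provides patch-level targets;
    the student predicts them for masked patches under Smooth L1 loss."""
    teacher.eval()
    for p in teacher.parameters():
        p.requires_grad_(False)

    optimizer = torch.optim.AdamW(student.parameters(), lr=lr)
    for _ in range(epochs):
        for images, _ in loader:
            mask = random_masking(images.shape[0], num_patches=196)
            with torch.no_grad():
                targets = teacher(images)                 # [B, N, D] patch features of the full image
            preds = student(images, mask)                 # student sees only the visible patches
            loss = F.smooth_l1_loss(preds[mask], targets[mask])  # loss on masked patches only
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student


def dbot_pretrain(make_vit, loader, num_stages=2, epochs_per_stage=100):
    """Bootstrapped teachers: stage 0 distills from a randomly initialized,
    never-trained teacher; each later stage distills from the previous student."""
    teacher = make_vit()                      # random weights serve as the first target network
    for _ in range(num_stages):
        student = make_vit()                  # student is re-initialized at every stage
        student = train_one_stage(student, teacher, loader, epochs_per_stage)
        teacher = copy.deepcopy(student)      # bootstrap: converged student becomes the next teacher
    return teacher
```

The point the abstract emphasizes is visible in the structure: the quality of the initial target representation does not matter, because the bootstrapping stages, not a carefully designed teacher, shape the final representation.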
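The Experiment Setup row specifies a 40-epoch linear warmup to an initial learning rate of 1.5e-4 × batch size / 256 (with batch size 4096 this works out to 2.4e-3), followed by cosine annealing to 0, with AdamW and Smooth L1 loss. Below is a hedged sketch of that schedule, assuming per-epoch learning-rate updates and an illustrative total epoch count that the excerpt does not state.

```python
import math
import torch

BASE_LR = 1.5e-4
BATCH_SIZE = 4096
WARMUP_EPOCHS = 40
TOTAL_EPOCHS = 800             # assumed placeholder; the excerpt does not give the per-stage epoch count

init_lr = BASE_LR * BATCH_SIZE / 256   # 1.5e-4 * 4096 / 256 = 2.4e-3


def lr_at_epoch(epoch: int) -> float:
    """Linear warmup to init_lr over the first 40 epochs, then cosine decay to 0."""
    if epoch < WARMUP_EPOCHS:
        return init_lr * epoch / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / (TOTAL_EPOCHS - WARMUP_EPOCHS)
    return init_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Stand-in module for illustration; in the paper the student is a ViT with
# stochastic depth (drop path) of 0.2 / 0.2 / 0.3 for ViT-B / ViT-L / ViT-H.
student = torch.nn.Linear(768, 768)
optimizer = torch.optim.AdamW(student.parameters(), lr=init_lr)
criterion = torch.nn.SmoothL1Loss()

for epoch in range(TOTAL_EPOCHS):
    for group in optimizer.param_groups:
        group["lr"] = lr_at_epoch(epoch)
    # ... one epoch of masked distillation using `criterion` on masked-patch predictions ...
```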