Hybrid Distillation: Connecting Masked Autoencoders with Contrastive Learners

Authors: Bowen Shi, Xiaopeng Zhang, Yaoming Wang, Jin Li, Wenrui Dai, Junni Zou, Hongkai Xiong, Qi Tian

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that Hybrid Distill achieves superior performance on various benchmark datasets.
Researcher Affiliation | Collaboration | Bowen Shi¹, Xiaopeng Zhang², Yaoming Wang¹, Jin Li¹, Wenrui Dai¹, Junni Zou¹, Hongkai Xiong¹, Qi Tian²; ¹Shanghai Jiao Tong University, ²Huawei Inc.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/lygsbw/hybriddistill.
Open Datasets | Yes | The input size is 224². For ViT-B, the distillation is based on ImageNet-1K (Russakovsky et al., 2015), and the epoch count is 300 for main results and 100 for ablation studies. For ViT-L, we conduct 300-epoch distillation based on ImageNet-1K and 40-epoch distillation based on ImageNet-21K, respectively. The performance is tested on different downstream tasks, including ImageNet-1K, CIFAR100 (Krizhevsky et al., 2009), Cars (Krause et al., 2013), and iNaturalist19 (Van Horn et al., 2018) classification, COCO (Lin et al., 2014) object detection and instance segmentation, and ADE20K (Zhou et al., 2019) segmentation.
Dataset Splits | No | The paper mentions using standard datasets such as ImageNet-1K, COCO, and ADE20K for training and evaluation, but it neither specifies the training/validation/test splits needed for reproduction nor cites predefined splits.
Hardware Specification | Yes | Our experiments are conducted on 8 V100 GPUs.
Software Dependencies | No | The paper mentions the AdamW optimizer and the ViT and Mask R-CNN frameworks but does not provide version numbers for programming languages or libraries such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | The batch size, learning rate, and weight decay are set to 1024, 6e-4, and 0.05, respectively. The AdamW (Loshchilov & Hutter, 2017) optimizer and a cosine decay (Loshchilov & Hutter, 2016) schedule are used. The input size is 224². For ViT-B, the distillation is based on ImageNet-1K (Russakovsky et al., 2015), and the epoch count is 300 for main results and 100 for ablation studies. For ViT-L, we conduct 300-epoch distillation based on ImageNet-1K and 40-epoch distillation based on ImageNet-21K, respectively. The hyperparameters α and β are set to 1.0, and the redundant token masking set I is set to [0, L/3, 2L/3] following Li et al. (2023).
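
The reported experiment setup can be summarized as a short configuration sketch. The following is a minimal PyTorch reconstruction, not the authors' released code: the torchvision vit_b_16 student, the dummy loss terms, and the random batch are illustrative placeholders, and only the hyperparameter values (batch size 1024, lr 6e-4, weight decay 0.05, AdamW, cosine decay, α = β = 1.0, 224×224 inputs) are taken from the paper.

```python
# Minimal sketch of the reported optimization setup (a reconstruction, not the authors' code).
# Only the hyperparameter values come from the paper; the model, losses, and data are stand-ins.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
from torchvision.models import vit_b_16

EPOCHS = 300            # 300 for ViT-B main results, 100 for ablation studies
BATCH_SIZE = 1024       # reported batch size (the toy batch below is smaller so the sketch runs anywhere)
BASE_LR = 6e-4
WEIGHT_DECAY = 0.05
ALPHA, BETA = 1.0, 1.0  # weights of the two distillation loss terms
MASK_LAYER_IDS = [0, 4, 8]  # I = [0, L/3, 2L/3] with L = 12 blocks for ViT-B (unused in this stub)

student = vit_b_16()    # stand-in ViT-B student; the paper distils ViT-B and ViT-L
optimizer = AdamW(student.parameters(), lr=BASE_LR, weight_decay=WEIGHT_DECAY)
scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS)


def distillation_loss(model: torch.nn.Module, images: torch.Tensor) -> torch.Tensor:
    """Placeholder combining an MAE-teacher term and a contrastive-teacher term."""
    feats = model(images)
    loss_mim = feats.pow(2).mean()          # dummy stand-in for the masked-image-modeling term
    loss_contrastive = feats.abs().mean()   # dummy stand-in for the contrastive-teacher term
    return ALPHA * loss_mim + BETA * loss_contrastive


# One illustrative optimization step on random data standing in for an ImageNet-1K batch.
images = torch.randn(2, 3, 224, 224)
loss = distillation_loss(student, images)
optimizer.zero_grad()
loss.backward()
optimizer.step()
scheduler.step()  # in full training, stepped once per epoch over the cosine schedule
```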