SEPT: Towards Scalable and Efficient Visual Pre-training

Authors: Yiqi Lin, Huabin Zheng, Huaping Zhong, Jinjing Zhu, Weijia Li, Conghui He, Lin Wang

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Results on various downstream tasks demonstrate that SEPT can achieve competitive or even better performance compared with ImageNet pretraining while reducing the size of training samples by one magnitude without resorting to any extra annotations. We conduct experiments on seven classification and three detection tasks with limited labeled samples.
Researcher Affiliation | Collaboration | Yiqi Lin1*, Huabin Zheng2, Huaping Zhong2, Jinjing Zhu1, Weijia Li3, Conghui He2, Lin Wang1,4. 1 AI Thrust, Information Hub, HKUST (Guangzhou), Guangzhou, China; 2 SenseTime Research; 3 Sun Yat-sen University; 4 Department of Computer Science and Engineering, HKUST, Hong Kong, China. {ylin933, jzhu706}@connect.hkust-gz.edu.cn, {zhenghuabin, zhonghuaping, heconghui}@sensetime.com, liweij29@mail.sysu.edu.cn, linwang@ust.hk
Pseudocode | Yes | Algorithm 1: Task-specific Instance Search. Input: an unlabeled dataset U, a target dataset T, a budget (number of images) of the pre-training dataset K, and a feature extractor θ_R well trained on a subset Û ⊂ U. Output: a task-specific pre-training subset D_search. (A hedged code sketch of this search step is given after the table.)
Open Source Code | No | No concrete statement about providing access to source code (e.g., 'We release our code ...', 'Code available at ...') or a direct link to a code repository for the methodology described in this paper.
Open Datasets | Yes | We combine three large-scale datasets, ImageNet-22k (IN22k) (Deng et al. 2009), INTERN (Shao et al. 2021), and YFCC-100m (Thomee et al. 2016), to construct an unlabeled data pool with 155 million images in total, called SEPT-155m.
Dataset Splits | No | The paper describes sampling strategies such as 'randomly sample 5-shot or 10-shot from each category' and 'randomly sample 1,000 images from the original training set', but it does not provide concrete train/validation/test splits (exact percentages, sample counts, or citations to predefined splits) to reproduce the data partitioning for all experiments, nor does it define how the data is split for the main training phase beyond the few-shot sampling used for finetuning. (The k-shot sampling it does describe is sketched after the table.)
Hardware Specification | Yes | The self-supervised pre-training follows MoBy (Xie et al. 2021a) in the 300-epoch setting with batch size 512 on 8 Tesla V100 GPUs.
Software Dependencies | No | The paper mentions software components such as Milvus and refers to models like ViT-S and Swin-T and optimizers like AdamW, but it does not provide version numbers for these software dependencies (e.g., library or framework names with versions such as Python 3.8 or PyTorch 1.9).
Experiment Setup | Yes | The self-supervised pre-training follows MoBy (Xie et al. 2021a) in the 300-epoch setting with batch size 512 on 8 Tesla V100 GPUs. The pre-training adopts AdamW (Loshchilov and Hutter 2018) with a fixed learning rate of 0.001 and a fixed weight decay of 0.05. The key queue size is set to 4096, the temperature to 0.2, and the drop-path rate to 0.2. All finetuning experiments use the same 100-epoch finetuning setting on a single Tesla V100 GPU. In finetuning, we set the batch size to 64 and employ an AdamW optimizer with a base learning rate of 5e-3, weight decay of 0.05, a stochastic depth ratio of 0.1, and a layer-wise learning rate decay of 0.9. We also adopt a cosine learning rate scheduler with a 10-epoch warm-up and follow the same data augmentation used in (Xie et al. 2021b). (A hedged optimizer and scheduler sketch based on these values is given after the table.)
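The Pseudocode row above quotes the header of Algorithm 1 (Task-specific Instance Search). Below is a minimal sketch of that search step, assuming features for the unlabeled pool U and the target dataset T have already been extracted with θ_R; the function name task_specific_instance_search and the per_query knob are illustrative only, and the paper runs this retrieval through an approximate-nearest-neighbour service (Milvus) over the 155m-image pool rather than a dense similarity matrix.

import numpy as np

def task_specific_instance_search(pool_feats, target_feats, budget_k, per_query=50):
    """Select a task-specific pre-training subset from an unlabeled pool.

    pool_feats:   (N, d) features for the unlabeled pool U, produced by a
                  feature extractor trained on a subset of U.
    target_feats: (M, d) features for the labeled target dataset T.
    budget_k:     maximum number of pool images to keep (the budget K).
    per_query:    neighbours retrieved per target image before pooling
                  (an assumed knob, not a value from the paper).
    Returns indices into the pool forming D_search.
    """
    # Cosine similarity via L2-normalised dot products.
    pool = pool_feats / np.linalg.norm(pool_feats, axis=1, keepdims=True)
    target = target_feats / np.linalg.norm(target_feats, axis=1, keepdims=True)

    sims = target @ pool.T                         # (M, N) similarity matrix
    ranked = np.argsort(-sims, axis=1)[:, :per_query]  # best pool images per query

    # Round-robin over the target queries so every target image contributes
    # neighbours until the budget K is exhausted.
    selected, seen = [], set()
    for rank in range(ranked.shape[1]):
        for q in range(ranked.shape[0]):
            idx = int(ranked[q, rank])
            if idx not in seen:
                seen.add(idx)
                selected.append(idx)
                if len(selected) >= budget_k:
                    return np.array(selected)
    return np.array(selected)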
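The Dataset Splits row notes that only the k-shot sampling of downstream training sets is described. A small sketch of that sampling, assuming an (image_path, label) list and an arbitrary seed (the paper does not state one, which is part of why the row is marked 'No'):

import random
from collections import defaultdict

def sample_k_shot(samples, k, seed=0):
    """Randomly keep k labelled images per category (e.g. 5-shot or 10-shot).

    samples: iterable of (image_path, label) pairs from the original training set.
    Returns the reduced few-shot training list; the official validation/test
    splits of each downstream benchmark are left untouched.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[label].append(path)

    few_shot = []
    for label, paths in by_class.items():
        rng.shuffle(paths)
        few_shot.extend((p, label) for p in paths[:k])
    return few_shot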
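The finetuning recipe in the Experiment Setup row (AdamW, base learning rate 5e-3, weight decay 0.05, layer-wise learning-rate decay 0.9, cosine schedule with a 10-epoch warm-up over 100 epochs) can be approximated as follows. This is a minimal PyTorch sketch, not the authors' unreleased code: it assumes a timm-style ViT whose transformer blocks sit under model.blocks, and the helper names layerwise_lr_groups and warmup_cosine are invented for illustration.

import math
import torch

def layerwise_lr_groups(model, base_lr=5e-3, weight_decay=0.05, decay=0.9, num_layers=12):
    """Parameter groups with layer-wise learning-rate decay for a ViT-style model.

    Assumes transformer blocks live under `model.blocks` (true for timm ViTs;
    adapt the lookup for other backbones such as Swin). Deeper blocks keep the
    full base_lr, earlier blocks are scaled down by powers of `decay`.
    """
    groups = []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if name.startswith("blocks."):
            layer = int(name.split(".")[1])
        elif name.startswith(("cls_token", "pos_embed", "patch_embed")):
            layer = -1                      # treat embeddings as the earliest layer
        else:
            layer = num_layers - 1          # head and final norm use the full learning rate
        scale = decay ** (num_layers - 1 - layer)
        groups.append({"params": [param], "lr": base_lr * scale,
                       "weight_decay": weight_decay})
    return groups

def warmup_cosine(epoch, warmup_epochs=10, total_epochs=100):
    """Learning-rate scale factor: linear warm-up, then cosine decay."""
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

# Usage sketch (step the scheduler once per epoch):
# model = ...  # ViT-S / Swin-T backbone initialised from the SEPT pre-trained weights
# optimizer = torch.optim.AdamW(layerwise_lr_groups(model))
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_cosine)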