SEPT: Towards Scalable and Efficient Visual Pre-training
Authors: Yiqi Lin, Huabin Zheng, Huaping Zhong, Jinjing Zhu, Weijia Li, Conghui He, Lin Wang
AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Results on various downstream tasks demonstrate that SEPT can achieve competitive or even better performance compared with ImageNet pretraining while reducing the size of training samples by one magnitude without resorting to any extra annotations. We conduct experiments on seven classification and three detection tasks with limited labeled samples. |
| Researcher Affiliation | Collaboration | Yiqi Lin1*, Huabin Zheng2, Huaping Zhong2, Jinjing Zhu1, Weijia Li3, Conghui He2, Lin Wang1,4 1AI Thrust, Information Hub, HKUST (Guangzhou), Guangzhou, China 2SenseTime Research 3Sun Yat-Sen University 4Department of Computer Science and Engineering, HKUST, Hong Kong, China {ylin933, jzhu706}@connect.hkust-gz.edu.cn, {zhenghuabin,zhonghuaping,heconghui}@sensetime.com liweij29@mail.sysu.edu.cn, linwang@ust.hk |
| Pseudocode | Yes | Algorithm 1: Task-specific Instance Search. Input: an unlabeled dataset U, a target dataset T, a budget (number of images) of the pre-training dataset K, and a feature extractor θ_R well trained on a subset Û ⊆ U. Output: a task-specific pre-training subset D_search. (A retrieval sketch of this algorithm is given below the table.) |
| Open Source Code | No | No concrete statement about providing access to source code (e.g., 'We release our code...', 'Code available at...') or a direct link to a code repository for the methodology described in this paper. |
| Open Datasets | Yes | We combine three large-scale datasets, ImageNet-22k (IN22k) (Deng et al. 2009), INTERN (Shao et al. 2021) and YFCC-100m (Thomee et al. 2016), to construct an unlabeled data pool with totally 155 million images, called SEPT-155m. |
| Dataset Splits | No | The paper describes sampling strategies like 'randomly sample 5-shot or 10-shot from each category' or 'randomly sample 1,000 images from the original training set'. However, it does not provide specific train/validation/test dataset splits (exact percentages, sample counts, or citations to predefined splits) to reproduce the data partitioning for all experiments, nor does it define how the data is split for the main training phase beyond the few-shot sampling for finetuning. |
| Hardware Specification | Yes | The self-supervised pre-training follows MoBy (Xie et al. 2021a) in 300 epochs setting with batch size 512 on 8 Tesla V100 GPUs. |
| Software Dependencies | No | The paper mentions software components such as 'milvus' and refers to models like 'ViT-S' and 'Swin-T' and optimizers like 'AdamW', but it does not provide specific version numbers for these software dependencies (e.g., library or solver names with version numbers like Python 3.8, PyTorch 1.9). |
| Experiment Setup | Yes | The self-supervised pre-training follows MoBy (Xie et al. 2021a) in 300 epochs setting with batch size 512 on 8 Tesla V100 GPUs. The pretraining adopts AdamW (Loshchilov and Hutter 2018) with a fixed learning rate of 0.001 and a fixed weight decay of 0.05. The key queue size is set to 4096, the temperature is set to 0.2, and the drop path rate is set to 0.2. All finetuning experiments use the same 100-epoch finetuning setting on a single Tesla V100 GPU. In finetuning, we set the batch size to 64 and employ an AdamW optimizer with a base learning rate of 5e-3, weight decay of 0.05, a stochastic depth ratio of 0.1, and a layer-wise learning rate decay of 0.9. We also adopt a cosine learning rate scheduler with 10 epochs warm-up and follow the same data augmentation used in (Xie et al. 2021b). (An optimizer/scheduler sketch of this finetuning recipe is given below the table.) |
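
For context, the task-specific instance search quoted in the Pseudocode row can be approximated as follows. This is a minimal sketch, not the authors' implementation: it assumes pre-computed, L2-normalized features for the unlabeled pool U and the target dataset T (the paper performs this retrieval at the 155M scale with a milvus index, while the sketch uses brute-force matrix multiplication), and the aggregation rule plus the names `search_pretraining_subset`, `features_pool`, and `features_target` are hypothetical.

```python
import numpy as np

def search_pretraining_subset(features_pool: np.ndarray,
                              features_target: np.ndarray,
                              budget_k: int) -> np.ndarray:
    """Nearest-neighbour sketch of task-specific instance search (Algorithm 1).

    features_pool:   (N, d) L2-normalised features of the unlabeled pool U,
                     extracted with the self-supervised backbone theta_R.
    features_target: (M, d) L2-normalised features of the target dataset T.
    budget_k:        number of pool images to keep for pre-training.
    Returns indices into the pool forming the subset D_search.
    """
    # Cosine similarity between every target image and every pool image.
    sim = features_target @ features_pool.T          # (M, N)

    # Score each pool image by its best similarity to any target image,
    # then keep the K most target-relevant pool images. The exact merging
    # rule in the paper (per-target top-k retrieval) may differ.
    relevance = sim.max(axis=0)                      # (N,)
    return np.argsort(-relevance)[:budget_k]

# Usage with random stand-in features (replace with real backbone features).
rng = np.random.default_rng(0)
pool = rng.normal(size=(10_000, 384)).astype(np.float32)
pool /= np.linalg.norm(pool, axis=1, keepdims=True)
target = rng.normal(size=(100, 384)).astype(np.float32)
target /= np.linalg.norm(target, axis=1, keepdims=True)
subset_idx = search_pretraining_subset(pool, target, budget_k=1_000)
```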
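
The finetuning recipe in the Experiment Setup row (AdamW, base learning rate 5e-3, weight decay 0.05, layer-wise learning rate decay 0.9, 10-epoch warm-up, cosine schedule, 100 epochs) can be wired up roughly as below. This is a minimal PyTorch sketch, assuming a ViT-style backbone that exposes `patch_embed`, `blocks`, and `head`; the grouping helper and attribute names are hypothetical and would need adapting to the actual Swin-T/ViT-S models.

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def layerwise_param_groups(model, base_lr=5e-3, decay=0.9, weight_decay=0.05):
    """Scale each block's learning rate by decay**(depth - i).

    Assumes a ViT-style model with `patch_embed`, `blocks` (list of transformer
    blocks) and `head`; remaining parameters (cls token, norms) are omitted here.
    """
    num_blocks = len(model.blocks)
    groups = [{"params": model.patch_embed.parameters(),
               "lr": base_lr * decay ** (num_blocks + 1),
               "weight_decay": weight_decay}]
    for i, block in enumerate(model.blocks):
        groups.append({"params": block.parameters(),
                       "lr": base_lr * decay ** (num_blocks - i),
                       "weight_decay": weight_decay})
    groups.append({"params": model.head.parameters(),
                   "lr": base_lr, "weight_decay": weight_decay})
    return groups

def warmup_cosine(epoch, warmup_epochs=10, total_epochs=100):
    """Linear warm-up for 10 epochs, then cosine decay to zero at epoch 100."""
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

# model = ...  # ViT-S / Swin-T backbone with a classification head
# optimizer = AdamW(layerwise_param_groups(model), lr=5e-3, weight_decay=0.05)
# scheduler = LambdaLR(optimizer, lr_lambda=warmup_cosine)  # step once per epoch
```

Because `LambdaLR` multiplies each parameter group's initial learning rate by the schedule value, the layer-wise decay set in the groups is preserved throughout warm-up and cosine decay.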