How Well Does Self-Supervised Pre-Training Perform with Streaming Data?

Authors: Dapeng Hu, Shipeng Yan, Qizhengqiu Lu, Lanqing Hong, Hailin Hu, Yifan Zhang, Zhenguo Li, Xinchao Wang, Jiashi Feng

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "we conduct the first thorough and dedicated investigation on self-supervised pre-training with streaming data, aiming to shed light on the model behavior under this overlooked setup. Specifically, we pre-train over 500 models on four categories of pre-training streaming data from ImageNet and DomainNet and evaluate them on three types of downstream tasks and 12 different downstream datasets."
Researcher Affiliation | Collaboration | National University of Singapore; ShanghaiTech University; AARC, Huawei Technologies; Huawei Noah's Ark Lab
Pseudocode | No | The paper describes methods with formulas and prose but does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper states, "The implementation is based on OpenSelfSup," with a footnote linking to the OpenSelfSup GitHub repository. However, it does not explicitly state that the authors' specific code or modifications for the described methodology are open-source or provided via this link.
Open Datasets | Yes | "For the instance incremental sequence, we split the ImageNet (Russakovsky et al., 2015) training data... The first three types of streaming data are designed with ImageNet (Russakovsky et al., 2015), while the domain incremental sequence consists of five domains in DomainNet (Peng et al., 2019)." (A data-chunking sketch follows the table.)
Dataset Splits | No | The paper mentions tuning the weight decay value for linear evaluation (Table 5), which implies the use of a validation set. However, it does not explicitly provide specific percentages, sample counts, or references to predefined validation splits for any of the datasets used in the experiments.
Hardware Specification | No | The paper states that statistics are "recorded under the same hardware environment" in Table 1, but it does not provide any specific details about the hardware used (e.g., GPU models, CPU models, memory specifications).
Software Dependencies | No | The paper mentions using "MoCo-v2 (Chen et al., 2020c)" and states that "The implementation is based on OpenSelfSup". While software names are given, specific version numbers for these or other dependencies are not provided.
Experiment Setup | Yes | "For both joint training and sequential training, the number of training epochs is 200 for each model training... the queue size is 65,536... The current mini-chunk features from the key encoder are enqueued and the same number of oldest features are dequeued... The training batch size is 256... the regularization coefficient λ is fixed to be 100... Specifically, we set the number of nearest neighbor k=200 for the KNN classification." (A configuration and KNN-evaluation sketch follows the table.)
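To make the streaming setup quoted in the Open Datasets row concrete, the sketch below shows one plausible way to split an ImageNet-style training set into an instance-incremental sequence of disjoint chunks. This is not the authors' released code: the chunk count, helper names (`build_moco_v2`, `pretrain`), and the commented training loop are assumptions for illustration only.

```python
# Minimal sketch (assumed, not from the paper) of building an instance-incremental
# stream: training indices are shuffled once and split into disjoint,
# equal-size chunks that arrive one after another.
import random
from torch.utils.data import Subset

def make_instance_incremental_chunks(dataset, num_chunks, seed=0):
    """Split a dataset into `num_chunks` disjoint, equal-size random chunks."""
    indices = list(range(len(dataset)))
    random.Random(seed).shuffle(indices)
    chunk_size = len(indices) // num_chunks
    return [Subset(dataset, indices[i * chunk_size:(i + 1) * chunk_size])
            for i in range(num_chunks)]

# Usage sketch: sequential pre-training visits each chunk once, warm-starting
# the encoder from the previous chunk, whereas joint training retrains on all
# data seen so far. `build_moco_v2` and `pretrain` are hypothetical helpers.
# dataset = torchvision.datasets.ImageFolder("path/to/imagenet/train", transform=aug)
# chunks = make_instance_incremental_chunks(dataset, num_chunks=4)
# model = build_moco_v2()
# for chunk in chunks:
#     loader = torch.utils.data.DataLoader(chunk, batch_size=256, shuffle=True)
#     model = pretrain(model, loader, epochs=200)
```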
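The Experiment Setup row quotes the main hyperparameters and a KNN evaluation with k=200. The sketch below collects those quoted values into a plain config dict and shows a generic k-NN evaluation on frozen encoder features using scikit-learn. Only the numeric values come from the paper; the dict layout, variable names, and the uniform-vote cosine k-NN routine are assumptions (the paper's exact voting or weighting scheme is not reproduced here).

```python
# Hyperparameter values quoted in the Experiment Setup row, gathered for reference.
# The dict itself is illustrative; only the values are taken from the paper.
MOCO_V2_STREAMING_CONFIG = {
    "epochs_per_training": 200,   # both joint and sequential training
    "queue_size": 65536,          # MoCo negative-feature queue
    "batch_size": 256,
    "lambda_reg": 100,            # regularization coefficient λ quoted in the setup
    "knn_k": 200,                 # nearest neighbours for KNN classification
}

# Generic k-NN evaluation sketch on pre-extracted, frozen encoder features.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_evaluate(train_feats, train_labels, test_feats, test_labels, k=200):
    """Fit a cosine k-NN classifier on L2-normalised training features and
    return top-1 accuracy on the test features."""
    train_feats = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    test_feats = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    knn = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    knn.fit(train_feats, train_labels)
    return (knn.predict(test_feats) == test_labels).mean()
```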