How Well Does Self-Supervised Pre-Training Perform with Streaming Data?

Authors: Dapeng Hu, Shipeng Yan, Qizhengqiu Lu, Lanqing Hong, Hailin Hu, Yifan Zhang, Zhenguo Li, Xinchao Wang, Jiashi Feng

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "we conduct the first thorough and dedicated investigation on self-supervised pre-training with streaming data, aiming to shed light on the model behavior under this overlooked setup. Specifically, we pre-train over 500 models on four categories of pre-training streaming data from ImageNet and DomainNet and evaluate them on three types of downstream tasks and 12 different downstream datasets."
Researcher Affiliation | Collaboration | National University of Singapore; ShanghaiTech University; AARC, Huawei Technologies; Huawei Noah's Ark Lab
Pseudocode | No | The paper describes methods with formulas and prose but does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper states, "The implementation is based on OpenSelfSup," with a footnote linking to the OpenSelfSup GitHub repository. However, it does not explicitly state that the authors' specific code or modifications for the described methodology are open-source or provided via this link.
Open Datasets | Yes | "For the instance incremental sequence, we split the ImageNet (Russakovsky et al., 2015) training data... The first three types of streaming data are designed with ImageNet (Russakovsky et al., 2015), while the domain incremental sequence consists of five domains in DomainNet (Peng et al., 2019)." (A data-chunking sketch follows the table.)
Dataset Splits | No | The paper mentions tuning the weight decay value for linear evaluation (Table 5), which implies the use of a validation set. However, it does not explicitly provide specific percentages, sample counts, or references to predefined validation splits for any of the datasets used in the experiments.
Hardware Specification | No | The paper states that statistics are "recorded under the same hardware environment" in Table 1, but it does not provide any specific details about the hardware used (e.g., GPU models, CPU models, memory specifications).
Software Dependencies | No | The paper mentions using "MoCo-v2 (Chen et al., 2020c)" and states that "The implementation is based on OpenSelfSup". While software names are given, specific version numbers for these or other dependencies are not provided.
Experiment Setup | Yes | "For both joint training and sequential training, the number of training epochs is 200 for each model training... the queue size is 65,536... The current mini-chunk features from the key encoder are enqueued and the same number of oldest features are dequeued... The training batch size is 256... the regularization coefficient λ is fixed to be 100... Specifically, we set the number of nearest neighbor k=200 for the KNN classification." (A configuration and KNN-evaluation sketch follows the table.)
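To make the streaming setup quoted in the Open Datasets row concrete, the sketch below shows one plausible way to split an ImageNet-style training set into an instance-incremental sequence of disjoint chunks. This is not the authors' released code: the chunk count, helper names (`build_moco_v2`, `pretrain`), and the commented training loop are assumptions for illustration only.

```python
# Minimal sketch (assumed, not from the paper) of building an instance-incremental
# stream: training indices are shuffled once and split into disjoint,
# equal-size chunks that arrive one after another.
import random
from torch.utils.data import Subset

def make_instance_incremental_chunks(dataset, num_chunks, seed=0):
    """Split a dataset into `num_chunks` disjoint, equal-size random chunks."""
    indices = list(range(len(dataset)))
    random.Random(seed).shuffle(indices)
    chunk_size = len(indices) // num_chunks
    return [Subset(dataset, indices[i * chunk_size:(i + 1) * chunk_size])
            for i in range(num_chunks)]

# Usage sketch: sequential pre-training visits each chunk once, warm-starting
# the encoder from the previous chunk, whereas joint training retrains on all
# data seen so far. `build_moco_v2` and `pretrain` are hypothetical helpers.
# dataset = torchvision.datasets.ImageFolder("path/to/imagenet/train", transform=aug)
# chunks = make_instance_incremental_chunks(dataset, num_chunks=4)
# model = build_moco_v2()
# for chunk in chunks:
#     loader = torch.utils.data.DataLoader(chunk, batch_size=256, shuffle=True)
#     model = pretrain(model, loader, epochs=200)
```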
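The Experiment Setup row quotes the main hyperparameters and a KNN evaluation with k=200. The sketch below collects those quoted values into a plain config dict and shows a generic k-NN evaluation on frozen encoder features using scikit-learn. Only the numeric values come from the paper; the dict layout, variable names, and the uniform-vote cosine k-NN routine are assumptions (the paper's exact voting or weighting scheme is not reproduced here).

```python
# Hyperparameter values quoted in the Experiment Setup row, gathered for reference.
# The dict itself is illustrative; only the values are taken from the paper.
MOCO_V2_STREAMING_CONFIG = {
    "epochs_per_training": 200,   # both joint and sequential training
    "queue_size": 65536,          # MoCo negative-feature queue
    "batch_size": 256,
    "lambda_reg": 100,            # regularization coefficient λ quoted in the setup
    "knn_k": 200,                 # nearest neighbours for KNN classification
}

# Generic k-NN evaluation sketch on pre-extracted, frozen encoder features.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_evaluate(train_feats, train_labels, test_feats, test_labels, k=200):
    """Fit a cosine k-NN classifier on L2-normalised training features and
    return top-1 accuracy on the test features."""
    train_feats = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    test_feats = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    knn = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    knn.fit(train_feats, train_labels)
    return (knn.predict(test_feats) == test_labels).mean()
```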