Position Prediction as an Effective Pretraining Strategy
Authors: Shuangfei Zhai, Navdeep Jaitly, Jason Ramapuram, Dan Busbridge, Tatiana Likhomanenko, Joseph Y Cheng, Walter Talbott, Chen Huang, Hanlin Goh, Joshua M Susskind
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experiment on both Vision and Speech benchmarks, where our approach brings improvements over strong supervised training baselines and is comparable to modern unsupervised/self-supervised pretraining methods. |
| Researcher Affiliation | Industry | Apple Inc. Correspondence to: Shuangfei Zhai <szhai@apple.com> |
| Pseudocode | Yes | Algorithm 1: Pseudo-code of MP3 in a PyTorch-like style, where we ignore the cls token for simplicity. (A hedged sketch of the position-prediction objective appears after this table.) |
| Open Source Code | No | The paper does not contain any explicit statement about releasing its own source code, nor does it provide a link to a code repository for the methodology described. |
| Open Datasets | Yes | CIFAR-100 (Krizhevsky et al., 2009), Tiny ImageNet http://cs231n.stanford.edu/tiny-imagenet-200.zip, ImageNet-1K (Deng et al., 2009), Google Speech Commands dataset v1 (Warden, 2018) |
| Dataset Splits | Yes | We use the Google Speech Commands dataset v1 (Warden, 2018) and implemented our models using the publicly available implementation of TC-ResNet (Choi et al., 2019), keeping their audio preprocessing routines, data splits and other details intact. and The finetuning phase follows exactly the same protocol as the supervised training recipes suggested in (Touvron et al., 2021). |
| Hardware Specification | Yes | In Table 2 we report the training time (seconds per iteration) and the memory consumption (in gigabytes) on ImageNet-1K with a single A100 GPU. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer and various augmentation techniques (RandAugment, CutMix, MixUp, Random Erasing, Repeated Augmentation), and pseudocode is in a "PyTorch-like style", but it does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | For CIFAR-100...uses AdamW (Loshchilov & Hutter, 2017) optimizer, weight decay of 0.05, drop path (Ghiasi et al., 2018) rate of 0.1...We search for optimal η for each dataset in the pretraining phase, which is 0.5, 0.8, 0.75 for CIFAR-100, Tiny ImageNet and ImageNet-1K, respectively. The batch size is 256, 512 and 2048, respectively. and Optimization is done with Adam (Kingma & Ba, 2015) with a batch size of 256 and early stopping is done based on validation accuracy. Warmup of the learning rate is done for 500 updates with a constant learning rate of 1e-4. Subsequently the learning rate is increased to 1e-3 and dropped by a factor of 2 every 10k updates. For supervised baselines and finetuning phase we also use label smoothing (ε = 0.1) for regularization and we train the models for 30K updates. (A sketch of this learning-rate schedule appears after this table.) |
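The pseudocode row refers to the paper's Algorithm 1, which we do not reproduce here. As a rough illustration only, the following is a minimal PyTorch sketch of the position-prediction idea: patches are encoded without position embeddings and a linear head classifies each patch's original position index. The class and function names (`MP3Sketch`, `mp3_loss`) and all default hyperparameters are hypothetical, and the sketch omits the context masking with ratio η that the paper applies during pretraining.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MP3Sketch(nn.Module):
    """Sketch of position-prediction pretraining: patch tokens are encoded
    WITHOUT position embeddings, and a linear head predicts each patch's
    original position index (one class per patch position)."""

    def __init__(self, img_size=32, patch_size=4, dim=192, depth=6, heads=3):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.pos_head = nn.Linear(dim, self.num_patches)

    def forward(self, images):
        # (B, 3, H, W) -> (B, N, dim); note: no positional embedding is added
        x = self.patch_embed(images).flatten(2).transpose(1, 2)
        x = self.encoder(x)
        return self.pos_head(x)  # (B, N, num_patches) position logits


def mp3_loss(model, images):
    """Cross-entropy between predicted and true patch positions."""
    logits = model(images)
    b, n, _ = logits.shape
    targets = torch.arange(n, device=images.device).expand(b, n)
    return F.cross_entropy(logits.reshape(b * n, n), targets.reshape(b * n))
```

Because the encoder sees no position embeddings, the only cue for recovering a patch's position is its content and its relation to the other patches; the paper additionally restricts which patches serve as attention context (the η masking ratios quoted in the setup row), which this sketch leaves out for brevity.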
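The speech experiment setup quoted above describes a constant-rate warmup followed by step decay. Below is a minimal sketch of that schedule, assuming the decay is counted from the end of warmup (the text does not pin this down) and using a hypothetical placeholder model in place of the TC-ResNet / Transformer encoders.

```python
import torch


def lr_multiplier(step, warmup_steps=500, decay_every=10_000):
    """Multiplier on a base LR of 1e-3: constant 1e-4 for the first 500 updates,
    then a drop by a factor of 2 every 10k updates (assumption: decay is counted
    from the end of warmup)."""
    if step < warmup_steps:
        return 0.1  # 1e-3 * 0.1 = 1e-4 during warmup
    return 0.5 ** ((step - warmup_steps) // decay_every)


# Hypothetical placeholder model; the paper's speech models are TC-ResNet variants
# and Transformer encoders trained with batch size 256 for 30K updates.
model = torch.nn.Linear(40, 12)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_multiplier)
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)  # label smoothing ε = 0.1
```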