Position Prediction as an Effective Pretraining Strategy
Authors: Shuangfei Zhai, Navdeep Jaitly, Jason Ramapuram, Dan Busbridge, Tatiana Likhomanenko, Joseph Y Cheng, Walter Talbott, Chen Huang, Hanlin Goh, Joshua M Susskind
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experiment on both Vision and Speech benchmarks, where our approach brings improvements over strong supervised training baselines and is comparable to modern unsupervised/self-supervised pretraining methods. |
| Researcher Affiliation | Industry | Apple Inc. Correspondence to: Shuangfei Zhai <szhai@apple.com> |
| Pseudocode | Yes | Algorithm 1: Pseudo-code of MP3 in a PyTorch-like style, where we ignore the cls token for simplicity. (A hedged sketch of the position-prediction objective appears after this table.) |
| Open Source Code | No | The paper does not contain any explicit statement about releasing its own source code, nor does it provide a link to a code repository for the methodology described. |
| Open Datasets | Yes | CIFAR-100 (Krizhevsky et al., 2009), Tiny ImageNet http://cs231n.stanford.edu/tiny-imagenet-200.zip, ImageNet-1K (Deng et al., 2009), Google Speech Commands dataset v1 (Warden, 2018) |
| Dataset Splits | Yes | We use the Google Speech Commands dataset v1 (Warden, 2018) and implemented our models using the publicly available implementation of TC-ResNet (Choi et al., 2019), keeping their audio preprocessing routines, data splits and other details intact. and The finetuning phase follows exactly the same protocol as the supervised training recipes suggested in (Touvron et al., 2021). |
| Hardware Specification | Yes | In Table 2 we report the training time (seconds per iteration) and the memory consumption (in gigabytes) on ImageNet-1K with a single A100 GPU. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer and various augmentation techniques (RandAugment, CutMix, MixUp, Random Erasing, Repeated Augmentation), and pseudocode is in a "PyTorch-like style", but it does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | For CIFAR-100...uses AdamW (Loshchilov & Hutter, 2017) optimizer, weight decay of 0.05, drop path (Ghiasi et al., 2018) rate of 0.1...We search for optimal η for each dataset in the pretraining phase, which is 0.5, 0.8, 0.75 for CIFAR-100, Tiny ImageNet and ImageNet-1K, respectively. The batch size is 256, 512 and 2048, respectively. and Optimization is done with Adam (Kingma & Ba, 2015) with a batch size of 256 and early stopping is done based on validation accuracy. Warmup of the learning rate is done for 500 updates with a constant learning rate of 1e-4. Subsequently the learning rate is increased to 1e-3 and dropped by a factor of 2 every 10k updates. For supervised baselines and finetuning phase we also use label smoothing (ε = 0.1) for regularization and we train the models for 30K updates. (A sketch of this learning-rate schedule appears after this table.) |
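The pseudocode row refers to the paper's Algorithm 1, which we do not reproduce here. As a rough illustration only, the following is a minimal PyTorch sketch of the position-prediction idea: patches are encoded without position embeddings and a linear head classifies each patch's original position index. The class and function names (`MP3Sketch`, `mp3_loss`) and all default hyperparameters are hypothetical, and the sketch omits the context masking with ratio η that the paper applies during pretraining.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MP3Sketch(nn.Module):
    """Sketch of position-prediction pretraining: patch tokens are encoded
    WITHOUT position embeddings, and a linear head predicts each patch's
    original position index (one class per patch position)."""

    def __init__(self, img_size=32, patch_size=4, dim=192, depth=6, heads=3):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.pos_head = nn.Linear(dim, self.num_patches)

    def forward(self, images):
        # (B, 3, H, W) -> (B, N, dim); note: no positional embedding is added
        x = self.patch_embed(images).flatten(2).transpose(1, 2)
        x = self.encoder(x)
        return self.pos_head(x)  # (B, N, num_patches) position logits


def mp3_loss(model, images):
    """Cross-entropy between predicted and true patch positions."""
    logits = model(images)
    b, n, _ = logits.shape
    targets = torch.arange(n, device=images.device).expand(b, n)
    return F.cross_entropy(logits.reshape(b * n, n), targets.reshape(b * n))
```

Because the encoder sees no position embeddings, the only cue for recovering a patch's position is its content and its relation to the other patches; the paper additionally restricts which patches serve as attention context (the η masking ratios quoted in the setup row), which this sketch leaves out for brevity.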
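The speech experiment setup quoted above describes a constant-rate warmup followed by step decay. Below is a minimal sketch of that schedule, assuming the decay is counted from the end of warmup (the text does not pin this down) and using a hypothetical placeholder model in place of the TC-ResNet / Transformer encoders.

```python
import torch


def lr_multiplier(step, warmup_steps=500, decay_every=10_000):
    """Multiplier on a base LR of 1e-3: constant 1e-4 for the first 500 updates,
    then a drop by a factor of 2 every 10k updates (assumption: decay is counted
    from the end of warmup)."""
    if step < warmup_steps:
        return 0.1  # 1e-3 * 0.1 = 1e-4 during warmup
    return 0.5 ** ((step - warmup_steps) // decay_every)


# Hypothetical placeholder model; the paper's speech models are TC-ResNet variants
# and Transformer encoders trained with batch size 256 for 30K updates.
model = torch.nn.Linear(40, 12)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_multiplier)
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)  # label smoothing ε = 0.1
```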