Adaptive Computation with Elastic Input Sequence

Authors: Fuzhao Xue, Valerii Likhosherstov, Anurag Arnab, Neil Houlsby, Mostafa Dehghani, Yang You

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments on image recognition tasks, we show that AdaTape can achieve better performance while maintaining the computational cost. To facilitate further research, we have released code at https://github.com/google-research/scenic.
Researcher Affiliation | Collaboration | Fuzhao Xue (1,2), Valerii Likhosherstov (1), Anurag Arnab (1), Neil Houlsby (1), Mostafa Dehghani (1), Yang You (2); 1 Google Brain, 2 National University of Singapore.
Pseudocode | Yes | Algorithm 1: Adaptive Computation Time; Algorithm 2: Adaptive Tape Reading (a hedged halting sketch follows this table).
Open Source Code | Yes | To facilitate further research, we have released code at https://github.com/google-research/scenic.
Open Datasets | Yes | We pre-train AdaTape on JFT-300M (Sun et al., 2017) followed by few-shot learning on a wide range of datasets, including ImageNet (Deng et al., 2009), CIFAR-100 (Krizhevsky et al., 2009) and Pets (Parkhi et al., 2012).
Dataset Splits | Yes | The pre-training is conducted on the JFT-300M dataset and we report precision@1 (%) on the validation dataset. The few-shot experiments are on the ImageNet, CIFAR-100, and Pets datasets with Top-1 accuracy. IN25 denotes the result on ImageNet 25-shot.
Hardware Specification | Yes | This is the limitation of AdaTape. However, please note this only means the training speed is slightly slower on TPU, which is highly optimized for large-scale matrix multiplication. For example, training ViT-B/8 with data parallelism on 512 TPUv3 cores has the OOM issue.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies. It mentions an optimizer (AdamW) and techniques like Mixup, RandAug, and label smoothing, but no explicit versions for libraries or frameworks such as TensorFlow or PyTorch.
Experiment Setup | Yes | We train all models for 10000 steps with batch size 128. The learning rate is set to 3e-5, and we use a linear warm-up for 1000 steps. Customized hyper-parameters for AdaTape are summarized in Table 4. We employ a fixed maximum number of ponder steps for all models, and a smaller τ for AdaTape-L with an input-driven bank. The bank size is 10000 for AdaTape-learn. We use bank size 784 for AdaTape-input, as we set the patch size to 8 to generate tokens from images at 224×224 resolution. AdaTape with a learnable bank can be trained without the halting loss. Also note that we append tape tokens after the first transformer encoder layer for better query quality and tape selection. For ImageNet training from scratch, the data augmentations and corresponding hyper-parameters are summarized in Table 5. (A hedged restatement of these hyper-parameters follows this table.)
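
As a rough illustration of the halting mechanism behind Algorithm 1 named in the Pseudocode row, here is a minimal Python sketch of Adaptive Computation Time-style pondering (Graves, 2016). It is not the authors' implementation from the scenic repository: the `ponder` function, its `step_fn` and `halt_fn` callables, the vector-as-list state, and the default `max_ponder` and `epsilon` values are assumptions made for this example.

    # Minimal sketch of ACT-style halting; step_fn and halt_fn are hypothetical
    # callables supplied by the caller, not part of the AdaTape codebase.
    def ponder(state, step_fn, halt_fn, max_ponder=8, epsilon=0.01):
        """Ponder until the accumulated halting probability exceeds 1 - epsilon,
        returning a halting-probability-weighted sum of intermediate states."""
        threshold = 1.0 - epsilon
        cumulative = 0.0
        weighted = [0.0] * len(state)
        for step in range(1, max_ponder + 1):
            state = step_fn(state)                        # one pondering step
            p = halt_fn(state)                            # halting prob. in [0, 1]
            last = step == max_ponder or cumulative + p >= threshold
            weight = (1.0 - cumulative) if last else p    # remainder on final step
            weighted = [w + weight * x for w, x in zip(weighted, state)]
            cumulative += p
            if last:
                return weighted, step                     # state and ponder count

    # Toy usage: a 2-d state whose halting probability grows with each step.
    out, steps = ponder(
        state=[0.0, 0.0],
        step_fn=lambda s: [x + 1.0 for x in s],
        halt_fn=lambda s: min(0.4, 0.1 * s[0]),
    )

The key property of this scheme is that the weights over pondering steps sum to one, so the output is a convex combination of intermediate states and the number of steps can vary per input.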
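The Experiment Setup row can also be read as a configuration. The sketch below restates the quoted hyper-parameters as a plain Python dictionary; the key names are illustrative assumptions, and only the numeric values (steps, batch size, learning rate, warm-up length, bank sizes, patch size, resolution) are taken from the quote above.

    # Illustrative restatement of the quoted fine-tuning hyper-parameters;
    # key names are assumptions, values come from the Experiment Setup row.
    IMAGE_SIZE = 224
    TAPE_PATCH_SIZE = 8          # patch size used for the input-driven tape bank

    finetune_config = dict(
        total_steps=10_000,
        batch_size=128,
        learning_rate=3e-5,
        warmup_schedule="linear",
        warmup_steps=1_000,
        bank_size_learnable=10_000,                                   # AdaTape-learn
        bank_size_input_driven=(IMAGE_SIZE // TAPE_PATCH_SIZE) ** 2,  # (224/8)^2 = 784
    )

    assert finetune_config["bank_size_input_driven"] == 784

The input-driven bank size of 784 is not an independent hyper-parameter: it follows from tiling a 224×224 image into non-overlapping 8×8 patches, giving 28 × 28 = 784 candidate tape tokens.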