White-Box Transformers via Sparse Rate Reduction

Authors: Yaodong Yu, Sam Buchanan, Druv Pai, Tianzhe Chu, Ziyang Wu, Shengbang Tong, Benjamin Haeffele, Yi Ma

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we conduct experiments to study the performance of our proposed white-box transformer CRATE on real-world datasets and tasks.
Researcher Affiliation | Academia | University of California, Berkeley; TTIC; Johns Hopkins University
Pseudocode | Yes | A PyTorch-style pseudocode can be found in Appendix B.1, which contains more implementation details.
Open Source Code | Yes | Code is at https://github.com/Ma-Lab-Berkeley/CRATE.
Open Datasets | Yes | We mainly consider ImageNet-1K [9] as the testbed for our architecture. Specifically, we apply the Lion optimizer [73] to train CRATE models with different model sizes. Meanwhile, we also evaluate the transfer learning performance of CRATE: by considering the models trained on ImageNet-1K as pre-trained models, we fine-tune CRATE on several commonly used downstream datasets (CIFAR10/100, Oxford Flowers, Oxford-IIIT Pets). (A hedged fine-tuning sketch follows the table.)
Dataset Splits | Yes | Specifically, we evaluate these two terms by using training/validation samples from ImageNet-1K.
Hardware Specification | Yes | One training epoch of CRATE-Base takes around 240 seconds using 16 A100 40GB GPUs.
Software Dependencies | No | The paper mentions software such as PyTorch, the Lion optimizer, and the AdamW optimizer, but it does not specify their version numbers, which are required for reproducibility.
Experiment Setup | Yes | We configure the learning rate as 2.4 × 10⁻⁴, weight decay as 0.5, and batch size as 2,048. We incorporate a warm-up strategy with a linear increase over 5 epochs, followed by training the models for a total of 150 epochs with cosine decay. For data augmentation, we only apply the standard techniques, random cropping and random horizontal flipping, on the ImageNet-1K dataset. We apply label smoothing with smoothing parameter 0.1. (A hedged training-recipe sketch follows the table.)
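
To make the quoted recipe concrete, the sketch below assembles the reported hyperparameters (Lion optimizer, learning rate 2.4 × 10⁻⁴, weight decay 0.5, batch size 2,048, 5-epoch linear warmup followed by cosine decay over 150 epochs, random cropping and horizontal flipping, label smoothing 0.1) into a minimal PyTorch training loop. This is a sketch under assumptions, not the authors' released code: the third-party lion-pytorch package, the trivial placeholder network standing in for CRATE-Base, and the ImageNet data path are all stand-ins.

```python
# Minimal sketch of the reported ImageNet-1K training recipe (assumptions:
# the third-party `lion-pytorch` package for the Lion optimizer, a trivial
# placeholder network standing in for CRATE-Base, and a placeholder data path).
import math
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from lion_pytorch import Lion  # assumed third-party Lion implementation

# Hyperparameters as reported in the Experiment Setup row.
LR, WEIGHT_DECAY, BATCH_SIZE = 2.4e-4, 0.5, 2048
WARMUP_EPOCHS, TOTAL_EPOCHS, LABEL_SMOOTHING = 5, 150, 0.1

# Standard augmentation only: random cropping and random horizontal flipping.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
train_set = datasets.ImageFolder("/path/to/imagenet/train", transform=train_transform)
train_loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True, num_workers=16)

# Placeholder model; swap in CRATE-Base from the released repository.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1000)).cuda()

criterion = nn.CrossEntropyLoss(label_smoothing=LABEL_SMOOTHING)
optimizer = Lion(model.parameters(), lr=LR, weight_decay=WEIGHT_DECAY)

# 5-epoch linear warmup, then cosine decay to zero over the remaining epochs.
steps_per_epoch = len(train_loader)
warmup_steps = WARMUP_EPOCHS * steps_per_epoch
total_steps = TOTAL_EPOCHS * steps_per_epoch

def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

model.train()
for epoch in range(TOTAL_EPOCHS):
    for images, targets in train_loader:
        images, targets = images.cuda(), targets.cuda()
        loss = criterion(model(images), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()
```

With a global batch size of 2,048 on the 16 A100 GPUs cited in the Hardware Specification row, the per-GPU batch would be 128; the sketch omits distributed data parallelism for brevity.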
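
For the transfer-learning evaluation quoted in the Open Datasets row, a minimal fine-tuning sketch is given below, using CIFAR-10 as the downstream dataset. The checkpoint path, the head-replacement step, the AdamW hyperparameters, and the 10-epoch schedule are illustrative assumptions rather than the paper's exact protocol; only the general procedure of reusing the ImageNet-1K pre-trained model for a downstream task comes from the quoted description.

```python
# Minimal transfer-learning sketch: fine-tune an ImageNet-1K pre-trained model
# on CIFAR-10. Checkpoint path, head replacement, and AdamW settings are
# illustrative assumptions, not the paper's exact fine-tuning protocol.
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

NUM_CLASSES = 10  # CIFAR-10; adjust for CIFAR-100, Oxford Flowers, Oxford-IIIT Pets

transform = transforms.Compose([
    transforms.Resize(224),            # match the pre-training input resolution
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
train_set = datasets.CIFAR10("./data", train=True, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=256, shuffle=True, num_workers=8)

# Placeholder for a CRATE model restored from an ImageNet-1K checkpoint.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1000))
# model.load_state_dict(torch.load("crate_base_in1k.pth"))  # hypothetical checkpoint

# Replace the 1000-way ImageNet head with a fresh head for the downstream task.
model[-1] = nn.Linear(model[-1].in_features, NUM_CLASSES)
model = model.cuda()

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.05)

model.train()
for epoch in range(10):                # short fine-tuning schedule (assumed)
    for images, targets in train_loader:
        images, targets = images.cuda(), targets.cuda()
        loss = criterion(model(images), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```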