White-Box Transformers via Sparse Rate Reduction
Authors: Yaodong Yu, Sam Buchanan, Druv Pai, Tianzhe Chu, Ziyang Wu, Shengbang Tong, Benjamin Haeffele, Yi Ma
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we conduct experiments to study the performance of our proposed white-box transformer CRATE on real-world datasets and tasks. |
| Researcher Affiliation | Academia | ¹University of California, Berkeley, ²TTIC, ³Johns Hopkins University |
| Pseudocode | Yes | A PyTorch-style pseudocode can be found in Appendix B.1, which contains more implementation details. |
| Open Source Code | Yes | Code is at https://github.com/Ma-Lab-Berkeley/CRATE. |
| Open Datasets | Yes | We mainly consider ImageNet-1K [9] as the testbed for our architecture. Specifically, we apply the Lion optimizer [73] to train CRATE models with different model sizes. Meanwhile, we also evaluate the transfer learning performance of CRATE: by considering the models trained on ImageNet-1K as pre-trained models, we fine-tune CRATE on several commonly used downstream datasets (CIFAR10/100, Oxford Flowers, Oxford-IIIT Pets). |
| Dataset Splits | Yes | Specifically, we evaluate these two terms by using training/validation samples from ImageNet-1K. |
| Hardware Specification | Yes | One training epoch of CRATE Base takes around 240 seconds using 16 A100 40GB GPUs. |
| Software Dependencies | No | The paper mentions software like PyTorch, Lion optimizer, and AdamW optimizer, but it does not specify their version numbers, which is required for reproducibility. |
| Experiment Setup | Yes | We configure the learning rate as 2.4 × 10⁻⁴, weight decay as 0.5, and batch size as 2,048. We incorporate a warm-up strategy with a linear increase over 5 epochs, followed by training the models for a total of 150 epochs with cosine decay. For data augmentation, we only apply the standard techniques, random cropping and random horizontal flipping, on the ImageNet-1K dataset. We apply label smoothing with smoothing parameter 0.1. (A hedged configuration sketch reflecting these settings follows the table.) |
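The sketch below shows how the quoted training configuration could be assembled in PyTorch: learning rate 2.4 × 10⁻⁴, weight decay 0.5, global batch size 2,048, a 5-epoch linear warm-up followed by cosine decay over 150 total epochs, label smoothing 0.1, and random crop plus horizontal flip augmentation. It is not the authors' code: the `lion_pytorch` package, the scheduler composition via `SequentialLR`, the warm-up start factor, and the stand-in model are all assumptions made for illustration, since the paper names the Lion optimizer but not a specific implementation.

```python
# Hedged sketch of the quoted CRATE training configuration (not the authors' code).
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR
from torchvision import transforms
from lion_pytorch import Lion  # one public Lion implementation; the paper does not name a package

EPOCHS, WARMUP_EPOCHS = 150, 5
BATCH_SIZE = 2048  # global batch size across all GPUs

# Stand-in module so the snippet runs on its own; the real CRATE model
# would come from the authors' repository.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1000))

# Lion optimizer with the quoted learning rate and weight decay.
optimizer = Lion(model.parameters(), lr=2.4e-4, weight_decay=0.5)

# Linear warm-up for 5 epochs, then cosine decay for the remaining 145.
# The warm-up start factor is not specified in the paper; 1e-3 is an assumption.
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=1e-3, total_iters=WARMUP_EPOCHS),
        CosineAnnealingLR(optimizer, T_max=EPOCHS - WARMUP_EPOCHS),
    ],
    milestones=[WARMUP_EPOCHS],
)

# Cross-entropy with the quoted label smoothing of 0.1.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# "Standard techniques" only: random cropping and random horizontal flipping.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```

With a per-epoch `scheduler.step()`, this setup reproduces the stated schedule shape (warm-up then cosine decay); fine-tuning runs mentioned in the paper use AdamW, which would replace the `Lion` optimizer above.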