White-Box Transformers via Sparse Rate Reduction
Authors: Yaodong Yu, Sam Buchanan, Druv Pai, Tianzhe Chu, Ziyang Wu, Shengbang Tong, Benjamin Haeffele, Yi Ma
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we conduct experiments to study the performance of our proposed white-box transformer CRATE on real-world datasets and tasks. |
| Researcher Affiliation | Academia | ¹University of California, Berkeley, ²TTIC, ³Johns Hopkins University |
| Pseudocode | Yes | A PyTorch-style pseudocode can be found in Appendix B.1, which contains more implementation details. |
| Open Source Code | Yes | Code is at https://github.com/Ma-Lab-Berkeley/CRATE. |
| Open Datasets | Yes | We mainly consider ImageNet-1K [9] as the testbed for our architecture. Specifically, we apply the Lion optimizer [73] to train CRATE models with different model sizes. Meanwhile, we also evaluate the transfer learning performance of CRATE: by considering the models trained on ImageNet-1K as pre-trained models, we fine-tune CRATE on several commonly used downstream datasets (CIFAR10/100, Oxford Flowers, Oxford-IIIT Pets). |
| Dataset Splits | Yes | Specifically, we evaluate these two terms by using training/validation samples from ImageNet-1K. |
| Hardware Specification | Yes | One training epoch of CRATE Base takes around 240 seconds using 16 A100 40GB GPUs. |
| Software Dependencies | No | The paper mentions software like PyTorch, Lion optimizer, and AdamW optimizer, but it does not specify their version numbers, which is required for reproducibility. |
| Experiment Setup | Yes | We configure the learning rate as 2.4 × 10⁻⁴, weight decay as 0.5, and batch size as 2,048. We incorporate a warm-up strategy with a linear increase over 5 epochs, followed by training the models for a total of 150 epochs with cosine decay. For data augmentation, we only apply the standard techniques, random cropping and random horizontal flipping, on the ImageNet-1K dataset. We apply label smoothing with smoothing parameter 0.1. (A hedged configuration sketch reflecting these settings follows the table.) |
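The sketch below shows how the quoted training configuration could be assembled in PyTorch: learning rate 2.4 × 10⁻⁴, weight decay 0.5, global batch size 2,048, a 5-epoch linear warm-up followed by cosine decay over 150 total epochs, label smoothing 0.1, and random crop plus horizontal flip augmentation. It is not the authors' code: the `lion_pytorch` package, the scheduler composition via `SequentialLR`, the warm-up start factor, and the stand-in model are all assumptions made for illustration, since the paper names the Lion optimizer but not a specific implementation.

```python
# Hedged sketch of the quoted CRATE training configuration (not the authors' code).
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR
from torchvision import transforms
from lion_pytorch import Lion  # one public Lion implementation; the paper does not name a package

EPOCHS, WARMUP_EPOCHS = 150, 5
BATCH_SIZE = 2048  # global batch size across all GPUs

# Stand-in module so the snippet runs on its own; the real CRATE model
# would come from the authors' repository.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1000))

# Lion optimizer with the quoted learning rate and weight decay.
optimizer = Lion(model.parameters(), lr=2.4e-4, weight_decay=0.5)

# Linear warm-up for 5 epochs, then cosine decay for the remaining 145.
# The warm-up start factor is not specified in the paper; 1e-3 is an assumption.
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=1e-3, total_iters=WARMUP_EPOCHS),
        CosineAnnealingLR(optimizer, T_max=EPOCHS - WARMUP_EPOCHS),
    ],
    milestones=[WARMUP_EPOCHS],
)

# Cross-entropy with the quoted label smoothing of 0.1.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# "Standard techniques" only: random cropping and random horizontal flipping.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```

With a per-epoch `scheduler.step()`, this setup reproduces the stated schedule shape (warm-up then cosine decay); fine-tuning runs mentioned in the paper use AdamW, which would replace the `Lion` optimizer above.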