Scaling White-Box Transformers for Vision

Authors: Jinrui Yang, Xianhang Li, Druv Pai, Yuyin Zhou, Yi Ma, Yaodong Yu, Cihang Xie

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4 experiments overall. The experimental section consists of three parts: (1) Scaling study: We thoroughly investigate the scaling behaviors of CRATE-α from Base to Large size and ultimately to Huge size. (2) Downstream applications: To further verify the broader benefits of scaling the CRATE-α model, we conduct additional experiments on real-world downstream tasks and present preliminary exploration results of CRATE-α on language tasks. (3) Interpretability: In addition to scalability, we study the interpretability of CRATE-α across different model sizes.
Researcher Affiliation | Academia | ¹UC Santa Cruz, ²UC Berkeley
Pseudocode | No | The paper describes mathematical derivations and iterative processes but does not include a dedicated 'Pseudocode' or 'Algorithm' block.
Open Source Code | Yes | The project page is https://rayjryang.github.io/CRATE-alpha/.
Open Datasets | Yes | For the transition from Base to Large size, we pre-train our model on ImageNet-21K and fine-tune it on ImageNet-1K via supervised learning. When scaling from Large to Huge, we utilize the DataComp-1B [10] dataset within a vision-language pre-training paradigm... We include additional experimental results on four downstream datasets (CIFAR-10/100, Oxford Flowers, and Oxford-IIIT Pets). We also examine the dense prediction capability of CRATE-α by training it on segmentation tasks using the ADE20K dataset [51]. For language tasks, we conduct new experiments with CRATE-α using autoregressive training on OpenWebText...
Dataset Splits | No | The paper describes training and fine-tuning procedures on ImageNet-21K and ImageNet-1K, but does not explicitly specify validation dataset splits with percentages or counts for these vision tasks.
Hardware Specification | Yes | The pre-training computation for the top-performing model, CRATE-α-L/8, is resource-intensive on ImageNet-21K. ... measured in TPU v3 core-hours; FLOPs and throughput are calculated based on an input size of 224×224 on an NVIDIA RTX A6000 graphics card.
Software Dependencies | No | The paper mentions the 'AdamW optimizer [24]' and 'AdamW [25]', but does not provide specific version numbers for general software dependencies such as Python or PyTorch.
Experiment Setup | Yes | During the pre-training phase, we set the learning rate to 8×10⁻⁴, weight decay to 0.1, and batch size to 4096. We apply data augmentation techniques such as Inception crop [35] resized to 224 and random horizontal flipping. In the fine-tuning phase, we adjust the base learning rate to 1.6×10⁻⁴, maintain weight decay at 0.1, and keep the batch size at 4096. We apply label smoothing with a smoothing parameter of 0.1 and apply data augmentation methods including Inception crop, random horizontal flipping, and random augmentation with two transformations (magnitude of 9). Also, from Table 9: optimizer: AdamW [25]; optimizer momentum: (0.9, 0.95); batch size: 32768; base lr: 8e-6; minimal lr: 0; warm-up steps: 1600; schedule: cosine decay [23]; weight decay: 0.2; random crop area: (40, 100); resize method: bi-linear; temperature init: 1/0.07 [13, 19].
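
To make the quoted hyperparameters concrete, the sketch below assembles them into a single training-setup snippet. It is a hypothetical PyTorch/torchvision rendering, not the authors' code (their runs used TPU v3 hardware and the exact framework is not quoted here): the vit_b_16 stand-in model, the transform pipeline, and the variable names are assumptions, while the learning rates, weight decay, label smoothing, RandAugment settings, and the Table 9 optimizer values come directly from the excerpts above.

```python
# A minimal sketch, assuming a PyTorch/torchvision pipeline; only the numeric
# hyperparameters are taken from the quoted experiment setup.
import torch
from torch import nn, optim
from torchvision import transforms
from torchvision.models import vit_b_16

# Fine-tuning augmentation: Inception-style crop resized to 224, random
# horizontal flip, and RandAugment with 2 transformations at magnitude 9.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandAugment(num_ops=2, magnitude=9),
    transforms.ToTensor(),
])

model = vit_b_16()  # stand-in backbone; the paper trains CRATE-alpha, not ViT

# Fine-tuning phase: base lr 1.6e-4, weight decay 0.1, label smoothing 0.1
# (the batch size of 4096 would come from the data loader / accumulation setup).
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
optimizer = optim.AdamW(model.parameters(), lr=1.6e-4, weight_decay=0.1)

# Table 9 (vision-language pre-training stage), for reference: AdamW with
# momentum (0.9, 0.95), base lr 8e-6, weight decay 0.2, batch size 32768,
# 1600 warm-up steps, cosine decay to a minimal lr of 0.
clip_optimizer = optim.AdamW(model.parameters(), lr=8e-6,
                             betas=(0.9, 0.95), weight_decay=0.2)
```

The split into two optimizer configurations mirrors the two regimes quoted above: supervised pre-training/fine-tuning on ImageNet-21K/1K versus the CLIP-style vision-language stage summarized in Table 9.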