Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Scaling White-Box Transformers for Vision

Authors: Jinrui Yang, Xianhang Li, Druv Pai, Yuyin Zhou, Yi Ma, Yaodong Yu, Cihang Xie

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4 Experiments. Overall. The experimental section consists of three parts: (1) Scaling study: We thoroughly investigate the scaling behaviors of CRATE-α from Base to Large size and ultimately to Huge size. (2) Downstream applications: To further verify the broader benefits of scaling the CRATE-α model, we conduct additional experiments on real-world downstream tasks and present preliminary exploration results of CRATE-α on language tasks. (3) Interpretability: In addition to scalability, we study the interpretability of CRATE-α across different model sizes.
Researcher Affiliation | Academia | ¹UC Santa Cruz, ²UC Berkeley
Pseudocode | No | The paper describes mathematical derivations and iterative processes but does not include a dedicated 'Pseudocode' or 'Algorithm' block.
Open Source Code | Yes | The project page is https://rayjryang.github.io/CRATE-alpha/.
Open Datasets | Yes | For the transition from Base to Large size, we pre-train our model on ImageNet-21K and fine-tune it on ImageNet-1K via supervised learning. When scaling from Large to Huge, we utilize the DataComp1B [10] dataset within a vision-language pre-training paradigm... We include additional experimental results on four downstream datasets (CIFAR-10/100, Oxford Flowers, and Oxford-IIT Pets). We also examine the dense prediction capability of CRATE-α by training it on segmentation tasks using the ADE20K dataset [51]. For language tasks, we conduct new experiments with CRATE-α using autoregressive training on OpenWebText...
Dataset Splits | No | The paper describes training and fine-tuning procedures on ImageNet-21K and ImageNet-1K, but does not explicitly specify validation dataset splits with percentages or counts for these vision tasks.
Hardware Specification | Yes | The pre-training computation for the top-performing model, CRATE-α-L/8, is resource-intensive on ImageNet-21K. ... measured in TPU v3 core-hours. FLOPs and throughput are calculated based on an input size of 224×224 on an NVIDIA RTX A6000 graphics card.
Software Dependencies | No | The paper mentions 'AdamW optimizer [24]' and 'AdamW [25]', but does not provide specific version numbers for general software dependencies like Python or PyTorch.
Experiment Setup | Yes | During the pre-training phase, we set the learning rate to 8 × 10^-4, weight decay to 0.1, and batch size to 4096. We apply data augmentation techniques such as Inception crop [35] resized to 224 and random horizontal flipping. In the fine-tuning phase, we adjust the base learning rate to 1.6 × 10^-4, maintain weight decay at 0.1, and batch size at 4096. We apply label smoothing with a smoothing parameter of 0.1 and apply data augmentation methods including Inception crop, random horizontal flipping, and random augmentation with two transformations (magnitude of 9). Also from Table 9 (Config: Value): optimizer: AdamW [25]; optimizer momentum: (0.9, 0.95); batch size: 32768; base lr: 8e-6; minimal lr: 0; warm-up steps: 1600; schedule: cosine decay [23]; weight decay: 0.2; random crop area: (40, 100); resize method: bi-linear; temperature init: 1/0.07 [13, 19]
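
The Table 9 settings quoted in the Experiment Setup row describe a warm-up followed by cosine decay. The sketch below shows one way such a schedule could be computed in plain Python. It is an illustration under stated assumptions, not the authors' code: the names `TABLE9_CONFIG` and `lr_at_step` are hypothetical, linear warm-up from zero is assumed, `base lr` is treated as the peak rate without any batch-size scaling, and the total step count is a placeholder because the quoted text does not give it.

```python
import math

# Values copied from the Table 9 excerpt quoted above.
# The dictionary and its key names are hypothetical, chosen for illustration.
TABLE9_CONFIG = {
    "optimizer": "AdamW",
    "betas": (0.9, 0.95),   # "optimizer momentum"
    "batch_size": 32768,
    "base_lr": 8e-6,        # assumed to be the peak learning rate
    "min_lr": 0.0,          # "minimal lr"
    "warmup_steps": 1600,
    "weight_decay": 0.2,
}


def lr_at_step(step, total_steps, cfg=TABLE9_CONFIG):
    """Return the learning rate at `step`.

    Linear warm-up from 0 to base_lr (an assumption), then cosine decay
    from base_lr down to min_lr over the remaining steps.
    """
    base = cfg["base_lr"]
    floor = cfg["min_lr"]
    warmup = cfg["warmup_steps"]
    if step < warmup:
        return base * step / warmup
    # Fraction of the post-warm-up phase completed, clamped to [0, 1].
    progress = (step - warmup) / max(1, total_steps - warmup)
    progress = min(progress, 1.0)
    return floor + 0.5 * (base - floor) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 10_000  # placeholder; the source does not state the step count
    for s in (0, 800, 1600, 5800, 10_000):
        print(s, lr_at_step(s, total))
```

With these settings the rate rises to 8e-6 at step 1600 and falls back to the minimal rate at the final step; only the shape of the curve is meant to carry over, since the real schedule length is not given in the quoted text.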