Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Compute Better Spent: Replacing Dense Layers with Structured Matrices
Authors: Shikai Qiu, Andres Potapczynski, Marc Anton Finzi, Micah Goldblum, Andrew Gordon Wilson
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We measure the scaling laws of different structures to compare how quickly their performance improves with compute. On CIFAR10/100 with augmentation, BTT achieves exponentially lower training loss than dense when training MLPs and ViTs. BTT matches dense ViT-S/32 performance on ImageNet-1k with 3.8 times less compute and is more efficient than dense for training small GPT-2 language models. |
| Researcher Affiliation | Academia | 1New York University 2Carnegie Mellon University. |
| Pseudocode | No | The paper describes algorithms and mathematical formulations but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | We make our code available here. |
| Open Datasets | Yes | On CIFAR10/100 with augmentation, BTT achieves exponentially lower training loss than dense when training MLPs and ViTs. On ImageNet-1k, BTT matches dense ViT-S/32 performance with 3.8 times less compute. We train GPT-2 models on OpenWebText. |
| Dataset Splits | Yes | We use CIFAR-10 and CIFAR-100 datasets, applying random crop, random flip, MixUp (α_mixup = 0.8) augmentations, and label smoothing of 0.3. We train with a global batch size of 3072 for 300 epochs with random crops, horizontal flip, random augmentations (rand-m9-mstd0.5-inc1 from the timm library (Wightman, 2019)), and Mixup of 0.2. We train GPT-2 models on OpenWebText for 600,000 steps with a batch size of 245,760 tokens at a sequence length of 512. |
| Hardware Specification | Yes | We verify this on an Nvidia A100 GPU in Figure 10a. |
| Software Dependencies | No | The paper mentions software like 'CoLA', 'Pytorch image models' (timm library), 'Adam', and 'AdamW', but does not specify their version numbers. |
| Experiment Setup | Yes | We train MLPs for 500 epochs with batch size of 1024, and ViTs for 200 epochs with batch size of 256. We use a base learning rate of η_0 = 3e-3 for a dense MLP at d_0 = 64, and η_0 = 1e-3 for a dense ViT at d_0 = 64. We train with a global batch size of 3072 for 300 epochs with random crops, horizontal flip, random augmentations (rand-m9-mstd0.5-inc1 from the timm library (Wightman, 2019)), and Mixup of 0.2. We train with a global batch size of 480 and a context length of 512 for 600,000 steps. |
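
The table quotes the GPT-2 batch size in two forms: 245,760 tokens per step (Dataset Splits) and a global batch size of 480 at a context length of 512 (Experiment Setup). The short sketch below checks that the two figures agree and derives the implied total token count. It assumes that the global batch size counts sequences and that every sequence fills the full context; the class and field names are hypothetical and are not taken from the paper's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpt2Run:
    """Hypothetical container for the GPT-2 settings quoted in the table."""

    global_batch_size: int = 480  # sequences per step (assumed unit)
    context_length: int = 512     # tokens per sequence
    steps: int = 600_000          # optimizer steps

    @property
    def tokens_per_step(self) -> int:
        # Assumes every sequence is filled to the full context length.
        return self.global_batch_size * self.context_length

    @property
    def total_tokens(self) -> int:
        return self.tokens_per_step * self.steps


run = Gpt2Run()
print(run.tokens_per_step)  # 245760, matching the quoted 245,760 tokens
print(run.total_tokens)     # 147456000000, roughly 1.47e11 tokens
```

Under these assumptions the two batch-size statements describe the same configuration, and the run processes about 147 billion tokens in total (counting repeated passes over the data, if any).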