Compute Better Spent: Replacing Dense Layers with Structured Matrices

Authors: Shikai Qiu, Andres Potapczynski, Marc Anton Finzi, Micah Goldblum, Andrew Gordon Wilson

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We measure the scaling laws of different structures to compare how quickly their performance improves with compute. On CIFAR-10/100 with augmentation, BTT achieves exponentially lower training loss than dense when training MLPs and ViTs. BTT matches dense ViT-S/32 performance on ImageNet-1k with 3.8 times less compute and is more efficient than dense for training small GPT-2 language models.
Researcher Affiliation | Academia | New York University; Carnegie Mellon University.
Pseudocode | No | The paper describes algorithms and mathematical formulations but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | We make our code available here.
Open Datasets | Yes | On CIFAR-10/100 with augmentation, BTT achieves exponentially lower training loss than dense when training MLPs and ViTs. On ImageNet-1k, BTT matches dense ViT-S/32 performance with 3.8 times less compute. We train GPT-2 models on OpenWebText.
Dataset Splits | Yes | We use CIFAR-10 and CIFAR-100 datasets, applying random crop, random flip, Mixup (α_mixup = 0.8) augmentations, and label smoothing of 0.3. We train with a global batch size of 3072 for 300 epochs with random crops, horizontal flip, random augmentations (rand-m9-mstd0.5-inc1 from the timm library (Wightman, 2019)), and Mixup of 0.2. We train GPT-2 models on OpenWebText for 600,000 steps with a batch size of 245,760 tokens at a sequence length of 512.
Hardware Specification | Yes | We verify this on an Nvidia A100 GPU in Figure 10a.
Software Dependencies | No | The paper mentions software like 'CoLA', 'PyTorch Image Models' (the timm library), 'Adam', and 'AdamW', but does not specify their version numbers.
Experiment Setup | Yes | We train MLPs for 500 epochs with a batch size of 1024, and ViTs for 200 epochs with a batch size of 256. We use a base learning rate of η₀ = 3e-3 for a dense MLP at d₀ = 64, and η₀ = 1e-3 for a dense ViT at d₀ = 64. We train with a global batch size of 3072 for 300 epochs with random crops, horizontal flip, random augmentations (rand-m9-mstd0.5-inc1 from the timm library (Wightman, 2019)), and Mixup of 0.2. We train with a global batch size of 480 and a context length of 512 for 600,000 steps.
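
To make the dense-vs-structured comparison summarized above concrete, below is a minimal PyTorch sketch of a BTT-style two-core layer used as a drop-in replacement for a dense linear layer. The class name, core shapes, and index conventions are our assumptions for illustration, not the paper's released implementation; with rank r and width d = m·n, this sketch uses roughly 2·r·d^1.5 parameters when m ≈ n ≈ √d, versus d² for a dense layer.

```python
import math
import torch
import torch.nn as nn

class BTTLinear(nn.Module):
    """Illustrative BTT-style layer: two small cores contracted via einsum.

    Hypothetical sketch (shapes and index order assumed, not the paper's code).
    Maps a d_in = m*n input to a d_out = m_out*n_out output through a rank-r
    connection between the two cores, costing O(r * d^1.5) parameters instead
    of O(d^2) when m, n, m_out, n_out are all close to sqrt(d).
    """

    def __init__(self, m, n, m_out, n_out, rank=1):
        super().__init__()
        self.m, self.n = m, n
        self.m_out, self.n_out = m_out, n_out
        # Core A: for each "row block" index in [n], an (m -> m_out * rank) map.
        self.A = nn.Parameter(torch.randn(n, m, m_out, rank) / math.sqrt(m))
        # Core B: for each output index in [m_out], an (n * rank -> n_out) map.
        self.B = nn.Parameter(torch.randn(m_out, rank, n, n_out) / math.sqrt(n * rank))

    def forward(self, x):                                 # x: (batch, m * n)
        X = x.view(-1, self.n, self.m)                    # (batch, n, m)
        # Contract core A over the m axis, keeping a rank axis a.
        Z = torch.einsum("bij,ijka->bika", X, self.A)     # (batch, n, m_out, rank)
        # Contract core B over the n and rank axes.
        Y = torch.einsum("bika,kail->bkl", Z, self.B)     # (batch, m_out, n_out)
        return Y.reshape(-1, self.m_out * self.n_out)

# Usage: a rank-2 structured stand-in for a 1024x1024 dense layer.
layer = BTTLinear(m=32, n=32, m_out=32, n_out=32, rank=2)
y = layer(torch.randn(8, 1024))
print(y.shape)  # torch.Size([8, 1024])
```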
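The base learning rates quoted in the Experiment Setup row (η₀ at a base width d₀ = 64) imply a rule for rescaling the learning rate as models are made wider. The snippet below is a minimal sketch assuming the common µP-style rule for Adam of scaling inversely with width; the function name and the exact exponent are assumptions, not taken from the paper's code.

```python
def scaled_lr(width, base_lr=3e-3, base_width=64):
    # Assumed muP-style Adam rule: shrink the learning rate in proportion to 1/width.
    return base_lr * base_width / width

for d in (64, 256, 1024):
    print(d, scaled_lr(d))
# 64 0.003 | 256 0.00075 | 1024 0.0001875
```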