Compute Better Spent: Replacing Dense Layers with Structured Matrices
Authors: Shikai Qiu, Andres Potapczynski, Marc Anton Finzi, Micah Goldblum, Andrew Gordon Wilson
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We measure the scaling laws of different structures to compare how quickly their performance improves with compute. On CIFAR-10/100 with augmentation, BTT achieves exponentially lower training loss than dense when training MLPs and ViTs. BTT matches dense ViT-S/32 performance on ImageNet-1k with 3.8 times less compute and is more efficient than dense for training small GPT-2 language models. (A scaling-law fitting sketch follows the table.) |
| Researcher Affiliation | Academia | New York University; Carnegie Mellon University. |
| Pseudocode | No | The paper describes algorithms and mathematical formulations but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | We make our code available here. |
| Open Datasets | Yes | On CIFAR-10/100 with augmentation, BTT achieves exponentially lower training loss than dense when training MLPs and ViTs. On ImageNet-1k, BTT matches dense ViT-S/32 performance with 3.8 times less compute. We train GPT-2 models on OpenWebText. |
| Dataset Splits | Yes | We use the CIFAR-10 and CIFAR-100 datasets, applying random crop, random flip, and MixUp (α_mixup = 0.8) augmentations, and label smoothing of 0.3. We train with a global batch size of 3072 for 300 epochs with random crops, horizontal flips, random augmentations (rand-m9-mstd0.5-inc1 from the timm library (Wightman, 2019)), and MixUp of 0.2. We train GPT-2 models on OpenWebText for 600,000 steps with a batch size of 245,760 tokens at a sequence length of 512. (See the augmentation sketch after the table.) |
| Hardware Specification | Yes | We verify this on an Nvidia A100 GPU in Figure 10a. |
| Software Dependencies | No | The paper mentions software like 'CoLA', 'PyTorch Image Models' (the timm library), 'Adam', and 'AdamW', but does not specify their version numbers. |
| Experiment Setup | Yes | We train MLPs for 500 epochs with a batch size of 1024, and ViTs for 200 epochs with a batch size of 256. We use a base learning rate of η0 = 3e-3 for a dense MLP at d0 = 64, and η0 = 1e-3 for a dense ViT at d0 = 64. We train with a global batch size of 3072 for 300 epochs with random crops, horizontal flips, random augmentations (rand-m9-mstd0.5-inc1 from the timm library (Wightman, 2019)), and MixUp of 0.2. We train with a global batch size of 480 and a context length of 512 for 600,000 steps. (See the optimizer sketch after the table.) |
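
The Research Type row describes measuring compute scaling laws for each structure. As a minimal, hypothetical sketch of how such a comparison can be quantified (not the paper's actual analysis code, and with made-up numbers), one can fit a power law L(C) ≈ a·C^(−b) to (compute, training loss) pairs in log-log space and compare the fitted exponents, or the compute needed to reach a fixed loss, across structures:

```python
import numpy as np

# Hypothetical (training compute, final training loss) measurements for one
# structure, e.g. models of increasing width; the values are illustrative only.
compute = np.array([1e15, 4e15, 1.6e16, 6.4e16])  # training FLOPs
loss = np.array([2.10, 1.85, 1.66, 1.51])         # final training loss

# Assume a power law L(C) ~ a * C**(-b) and fit it as a line in log-log space.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
a, b = np.exp(intercept), -slope

print(f"L(C) ~ {a:.3g} * C^(-{b:.3f})")
# Comparing fitted exponents (or the compute needed to hit a fixed loss) across
# structures is one way to quantify claims such as "matches dense ViT-S/32
# with 3.8 times less compute".
```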
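
The Dataset Splits and Experiment Setup rows quote an ImageNet-1k recipe built on the timm library: random crops, horizontal flips, the RandAugment policy rand-m9-mstd0.5-inc1, and MixUp with α = 0.2. Below is a minimal sketch of that input pipeline using timm; the input size, interpolation, and number of classes are assumptions not stated in the quoted text. Note also that the two GPT-2 rows are mutually consistent: a global batch of 480 sequences at context length 512 is 480 × 512 = 245,760 tokens per step.

```python
from timm.data import Mixup, create_transform

# Training transform: is_training=True enables RandomResizedCrop and horizontal
# flip, and auto_augment selects the RandAugment policy quoted in the paper.
# Input size and interpolation are assumptions.
train_transform = create_transform(
    input_size=224,
    is_training=True,
    auto_augment="rand-m9-mstd0.5-inc1",
    interpolation="bicubic",
)

# MixUp with alpha = 0.2 for ImageNet-1k (the CIFAR runs instead quote
# alpha = 0.8 plus label smoothing of 0.3).
mixup_fn = Mixup(mixup_alpha=0.2, cutmix_alpha=0.0, num_classes=1000)

# In the training loop, given a batch of image tensors and integer targets:
#   images, soft_targets = mixup_fn(images, targets)
#   loss = soft_target_cross_entropy(model(images), soft_targets)
```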
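
The Experiment Setup row lists training lengths, batch sizes, and base learning rates (e.g. 200 epochs at batch size 256 with η0 = 1e-3 for a dense ViT at d0 = 64 on CIFAR). A minimal sketch of a matching optimizer and schedule is below; AdamW is named in the Software Dependencies row, but the weight decay, cosine schedule, and placeholder model are assumptions rather than quoted values.

```python
import torch
import torch.nn as nn

# Placeholder model; the paper trains MLPs and ViTs whose dense layers are
# replaced with structured matrices such as BTT.
model = nn.Sequential(
    nn.Flatten(), nn.Linear(3 * 32 * 32, 64), nn.ReLU(), nn.Linear(64, 10)
)

# CIFAR ViT setting quoted above: 200 epochs, batch size 256, base lr 1e-3.
epochs, batch_size, base_lr = 200, 256, 1e-3
steps_per_epoch = 50_000 // batch_size  # CIFAR-10/100 have 50k training images

# Weight decay and the cosine schedule are assumptions, not quoted values.
optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=epochs * steps_per_epoch
)
```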