The Quantization Model of Neural Scaling
Authors: Eric Michaud, Ziming Liu, Uzay Girit, Max Tegmark
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate this prediction on toy datasets, then study how scaling curves decompose for large language models. Using language model gradients, we automatically decompose model behavior into a diverse set of skills (quanta). |
| Researcher Affiliation | Academia | Eric J. Michaud, Ziming Liu, Uzay Girit, and Max Tegmark (MIT & IAIFI) |
| Pseudocode | No | The paper describes a method named 'Quanta Discovery from Gradients (QDG)' with sequential steps in paragraph format, but it does not present it as a formally structured pseudocode or algorithm block. (A hedged sketch of such a gradient-clustering procedure is given below the table.) |
| Open Source Code | Yes | Project code can be found at: https://github.com/ejmichaud/quantization-model. |
| Open Datasets | Yes | For our experiments, we use the Pythia model suite from Eleuther AI [29], a set of decoder-only transformers of varying size trained on approximately 300 billion tokens of The Pile [30]. (A hedged model-loading sketch is given below the table.) |
| Dataset Splits | No | The paper mentions training and testing on datasets but does not provide specific percentages or counts for training/validation/test splits, nor does it explicitly detail predefined splits with citations for reproducibility. |
| Hardware Specification | Yes | Available GPUs include NVIDIA A100, RTX A6000, Quadro RTX 6000, GeForce RTX 2080 Ti, GeForce RTX 2080, GeForce GTX 1080 Ti, Titan X, and Tesla V100. |
| Software Dependencies | No | The paper mentions using 'scikit-learn [45]' but does not provide specific version numbers for this or any other software dependencies crucial for replication. |
| Experiment Setup | Yes | We use the Adam optimizer with a learning rate of 10^-3. To study scaling with respect to the number of model parameters, we train networks of varying width by sampling batches online... For the results shown, we used n_tasks = 500, n = 100, k = 3, α = 0.4, and a batch size of 20000. We vary training dataset size from 1e4 to 5e6 and vary hidden-layer width from 10 to 500 neurons. We train for 2e5 steps. (A hedged training-loop sketch using these settings is given below the table.) |
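
Since the paper describes QDG only in paragraph form, the following is a minimal sketch, assuming that QDG flattens per-sample loss gradients, compares them by cosine similarity, and clusters the samples with scikit-learn's spectral clustering (the paper cites scikit-learn, but the normalization, affinity construction, function names, and the `n_clusters` default here are illustrative assumptions, not the authors' exact procedure):

```python
# Hedged QDG-style sketch: group samples whose loss gradients point in
# similar directions. Details below are assumptions, not taken verbatim
# from the paper.
import torch
from sklearn.cluster import SpectralClustering

def per_sample_gradients(model, loss_fn, samples):
    """Compute one flattened, L2-normalized gradient vector per sample."""
    grads = []
    for x, y in samples:
        model.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        g = torch.cat([p.grad.flatten() for p in model.parameters()
                       if p.grad is not None])
        grads.append(g / (g.norm() + 1e-12))  # unit norm: dot product = cosine similarity
    return torch.stack(grads)

def discover_quanta(grad_matrix, n_clusters=400):  # n_clusters is an illustrative choice
    """Cluster samples by gradient similarity into candidate quanta."""
    cos_sim = grad_matrix @ grad_matrix.T            # pairwise cosine similarities
    affinity = ((cos_sim + 1.0) / 2.0).numpy()       # map [-1, 1] to [0, 1] for clustering
    clusterer = SpectralClustering(n_clusters=n_clusters, affinity="precomputed")
    return clusterer.fit_predict(affinity)
```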
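
The paper does not specify how the Pythia checkpoints are loaded; the sketch below assumes the Hugging Face `transformers` API and the `EleutherAI/pythia-70m` checkpoint name, which are assumptions for illustration rather than details taken from the paper:

```python
# Hedged sketch of loading one Pythia model from the suite referenced above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-70m"  # other suite sizes follow the same naming scheme
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
```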
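
The experiment-setup row quotes the toy-model hyperparameters; the sketch below wires them into a training loop, assuming a single-hidden-layer ReLU network and a `sample_batch` data generator, both of which are illustrative assumptions rather than the authors' exact code:

```python
# Hedged sketch of the quoted toy-model setup: Adam at lr 1e-3, online
# batches of 20000, 2e5 training steps, hidden width swept from 10 to 500.
import torch
import torch.nn as nn

def train_toy_model(sample_batch, input_dim, n_classes, width,
                    steps=200_000, batch_size=20_000, lr=1e-3):
    model = nn.Sequential(               # architecture assumed for illustration
        nn.Linear(input_dim, width),
        nn.ReLU(),
        nn.Linear(width, n_classes),
    )
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        x, y = sample_batch(batch_size)  # batches sampled online, per the quoted setup
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    return model
```

Sweeping `width` over values between 10 and 500 (and, for data scaling, drawing batches from a fixed dataset of 1e4 to 5e6 samples instead of sampling online) covers the two scaling axes described in the row above.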