Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Pay Attention to Small Weights
Authors: Chao Zhou, Tom Jacobs, Advait Gadhikar, Rebekka Burkholz
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate this for both NLP and vision tasks. [...] We evaluate NANOADAM across a range of NLP and vision tasks, demonstrating superior memory efficiency and generalization compared to baselines such as MicroAdam, AdamW-8bit, and GaLore. Notably, the efficiency benefits become more pronounced at larger scales. Furthermore, NANOADAM significantly reduces performance degradation on previously learned tasks during continual learning, effectively alleviating catastrophic forgetting. [...] Experiments are conducted on a compute node equipped with 4 A100 40GB GPUs. The overall performance results are summarized in Table 2 and 3, while details on peak memory usage, training time, and training dynamics are deferred to Appendix H.2. |
| Researcher Affiliation | Academia | Chao Zhou Tom Jacobs Advait Gadhikar Rebekka Burkholz CISPA Helmholtz Center for Information Security, Saarbrücken, Germany EMAIL |
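| Pseudocode | Yes | Algorithm 1 NANOADAM. Require: initial density $k_0$, mask interval $m$, density interval $d$, total steps $T$, $\beta_1$, $\beta_2$. 1: $m_0, v_0, I, k \leftarrow 0, 0, 0, k_0$ 2: for $t = 0$ to $T$ do 3: $\text{flag}_k \leftarrow \text{False}$ 4: if $t \bmod d = 0$ then 5: $k \leftarrow \text{density schedule}(k, t, T)$ 6: $\text{flag}_k \leftarrow \text{True}$ 7: end if 8: if $t \bmod m = 0$ or $\text{flag}_k = \text{True}$ then 9: $I \leftarrow \text{Bottom}_k(\lvert \theta_t \rvert)$ 10: end if 11: $g_t \leftarrow \nabla_\theta f(\theta_t)[I]$ 12: $m_t \leftarrow \text{momentum update}(m_{t-1}, g_t, \beta_1)$ 13: $v_t \leftarrow \text{momentum update}(v_{t-1}, g_t, \beta_2)$ 14: $\theta_{t+1} \leftarrow \theta_t - \eta_t \frac{m_t}{\sqrt{v_t} + \epsilon}$ 15: end for |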
| Open Source Code | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We provide the code in a zip file. |
| Open Datasets | Yes | Specifically, we fully finetune the BERT-base model on the CoLA dataset from the GLUE benchmark and track the evolution of this relationship for each parameter throughout training. We also conduct a similar experiment in the vision domain, where a ViT-Large model pretrained on ImageNet is finetuned on the CIFAR-10 dataset. [...] For NLP, we evaluate three language models of varying scale: BERT-Base (110M parameters), BERT-Large (335M) [8], and OPT-1.3B [42]. These models are finetuned across multiple tasks from the GLUE benchmark. For CV, we examine two aspects: catastrophic forgetting and parameter shift. Specifically, we finetune a ViT-Large [36], ResNet101, and ResNet18 [13], all pretrained on ImageNet [6]. Each model is first finetuned on CIFAR-10 [18], followed by continued finetuning on the Flowers dataset [25]. [...] We now validate the effectiveness of various optimization methods on a finetuning task. Specifically, we finetune LLaMA2-7B on the GSM-8k dataset, a challenging benchmark for grade-school-level mathematical reasoning. [...] We evaluate our method, NANOADAM, alongside MICROADAM and AdamW on fully fine-tuning the Llama 3.2 3B model on the Commonsense Benchmark. [...] a ResNet-18 model pretrained on ImageNet (a general-domain dataset) on the PathMNIST task from the MedMNIST dataset. |
| Dataset Splits | Yes | Each model is first finetuned on CIFAR-10 (Task 1) for a fixed number of epochs, followed by continued finetuning on the Flowers102 (Task 2). [...] We largely follow the hyperparameter settings established by [24] for finetuning on various GLUE tasks. [...] The model is trained for 3 epochs with a global batch size of 32. The micro-batch size per device is set to auto, and the maximum input sequence length is 512. To ensure robustness, we run experiments across four random seeds: {7, 42, 100, 512}. |
| Hardware Specification | Yes | Experiments are conducted on a compute node equipped with 4 A100 40GB GPUs. [...] All models are trained on a compute node equipped with 4 A100 40GB GPUs. |
| Software Dependencies | No | The paper mentions "Transformer models from the Hugging Face library" but does not specify any version numbers for this or other software components. No other specific software with version numbers is listed. |
| Experiment Setup | Yes | We largely follow the hyperparameter settings established by [24] for finetuning on various GLUE tasks. Specifically, we finetune for 5 epochs with a per-device batch size of 8, a fixed random seed of 42, and no weight decay. Unless otherwise stated, we perform grid search over the learning rate values {1e-6, 3e-6, 5e-6, 7e-6, 1e-5, 3e-5, 5e-5, 7e-5} for all optimizers and models. [...] Table 13: Hyperparameters for NANOADAM across tasks. Mask interval: COLA 6, SST2 52, MRPC 7, STSB 13, QQP 711, MNLI 306, QNLI 81. Density interval: COLA 33, SST2 263, MRPC 14, STSB 27, QQP 1423, MNLI 1533, QNLI 409. [...] Table 18: Common hyperparameters used for finetuning ViT-Large. Batch size: 128; seed: 42; weight decay: 0.0; LR scheduler: CosineAnnealingLR; label smoothing: 0.1; epochs Task 1: 5; epochs Task 2: 5; β: (0.9, 0.999); ϵ: 1e-8. Table 19: Optimizer-specific hyperparameters for ViT-Large. NANOADAM: LR Task 1 1e-3, LR Task 2 2e-3, k0 0.1%, dynamic density off, mask interval 100. MicroAdam: LR Task 1 1e-4, LR Task 2 1e-3, k 0.1%, m = 10. AdamW: LR Task 1 1e-4, LR Task 2 1e-4. |
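
The Algorithm 1 pseudocode quoted in the table can be read as the following minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' released code: the function names `bottom_k_indices`, `constant_density`, and `nanoadam_sketch` are hypothetical, `momentum update` is assumed to be a plain exponential moving average (with the squared gradient for the second moment, as in standard Adam), no bias correction is applied, and the moment buffers are assumed to be reset whenever the mask is recomputed. The density schedule is not specified in the quoted text, so a constant placeholder is used.

```python
import numpy as np


def bottom_k_indices(theta, k):
    """Indices of the fraction k of entries with the smallest magnitude."""
    n = min(theta.size, max(1, int(round(k * theta.size))))
    return np.argpartition(np.abs(theta), n - 1)[:n]


def constant_density(k, t, total_steps):
    """Hypothetical placeholder schedule: keeps the density unchanged."""
    return k


def nanoadam_sketch(theta, grad_fn, k0, mask_interval, density_interval,
                    total_steps, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
                    density_schedule=constant_density):
    """Adam-style updates restricted to the currently smallest weights."""
    theta = np.array(theta, dtype=float)  # work on a copy
    k = k0
    idx = np.empty(0, dtype=int)
    m = np.zeros(0)
    v = np.zeros(0)
    for t in range(total_steps):
        density_changed = False
        if t % density_interval == 0:
            k = density_schedule(k, t, total_steps)
            density_changed = True
        if t % mask_interval == 0 or density_changed:
            idx = bottom_k_indices(theta, k)
            # Assumption: optimizer state restarts with each new mask.
            m = np.zeros(idx.size)
            v = np.zeros(idx.size)
        g = grad_fn(theta)[idx]                # gradient on masked entries only
        m = beta1 * m + (1.0 - beta1) * g      # first-moment moving average
        v = beta2 * v + (1.0 - beta2) * g**2   # second-moment moving average
        theta[idx] -= lr * m / (np.sqrt(v) + eps)
    return theta


if __name__ == "__main__":
    # Toy objective f(theta) = 0.5 * ||theta||^2, so the gradient is theta.
    start = np.array([5.0, -0.1, 3.0, 0.2])
    print(nanoadam_sketch(start, lambda th: th, k0=0.5, mask_interval=10,
                          density_interval=10, total_steps=3, lr=0.01))
```

In this toy run only the two smallest-magnitude entries are updated, and the optimizer state has the size of the mask rather than of the full parameter vector, which is where the memory saving described in the table would come from.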