Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Curvature Tuning: Provable Training-free Model Steering From a Single Parameter

Authors: Leyang Hu, Matteo Gamba, Randall Balestriero

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirically, CT improves both generalization and robustness. For example, it boosts downstream accuracy of ResNet-50/152 by 8.59%/8.34% over linear probing and 4.64%/1.70% over LoRA across 12 datasets, and improves robust accuracy on the ℓ∞ benchmark from RobustBench by 1032.64%/1494.46%. Our code is available at https://github.com/Leon-Leyang/curvature-tuning.
Researcher Affiliation	Academia	Leyang Hu Brown University EMAIL Matteo Gamba KTH EMAIL Randall Balestriero Brown University EMAIL
Pseudocode	Yes	Appendix D provides pseudocode for S-CT and T-CT. The following code provides the Python implementation for S-CT and T-CT: SCTU & TCTU: classes that define the CTU module used in S-CT and T-CT, respectively. replace_module & replace_module_dynamic: functions that apply the appropriate module replacement to integrate S-CT or T-CT into a model.
Open Source Code	Yes	Our code is available at https://github.com/Leon-Leyang/curvature-tuning.
Open Datasets	Yes	Specifically, we transfer ImageNet-pretrained ResNet-18, ResNet-50, ResNet-152 and VGG-11 models to 12 downstream tasks, including Arabic Characters [30], Arabic Digits [31], Beans [32], CUB-200-2011 [33], DTD [34], Fashion MNIST [35], FGVC-Aircraft [36], Flowers102 [37], Food101 [38], and three subsets from MedMNIST: PathMNIST, OCTMNIST, and DermaMNIST [39].
Dataset Splits	Yes	Each dataset is split into training/validation/test sets (details in Appendix B.1). If a dataset does not include a validation set, we hold out 20% of the training data using stratified sampling.
Hardware Specification	Yes	All experiments were conducted using 8 RTX 3090 GPUs and one L40 GPU, with runs performed under random seeds 42, 43, and 44.
Software Dependencies	No	The paper includes Python code snippets using libraries such as 'torch' and 'torch.nn' in Appendices D and E, but it does not specify explicit version numbers for these or any other software dependencies.
Experiment Setup	Yes	All linear classifiers are trained for 20 epochs using the Adam optimizer with a learning rate of 10^-3. We apply linear warm-up during the first epoch and decay the learning rate by a factor of 10 after epoch 10. For T-CT, a learning rate of 10^-1 is applied to the (β, c) parameters and 10^-3 to the linear classifier; for LoRA, a learning rate of 10^-4 is used for both the adapter parameters and the classifier. In T-CT, we initialize all β parameters to 0.8 and all c parameters to 0.5.
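
The Dataset Splits and Experiment Setup rows describe a 20% stratified hold-out and a learning-rate schedule with first-epoch warm-up and a tenfold decay. The sketch below illustrates one possible reading of those two descriptions in plain Python; it is not taken from the paper's repository. The function names `stratified_holdout` and `lr_factor`, the per-step warm-up granularity, the rounding of per-class validation counts, and the interpretation of "after epoch 10" as "from the 11th epoch onward" are all assumptions made for illustration.

```python
import random
from collections import defaultdict

# Base learning rates quoted in the Experiment Setup row (T-CT case).
# The dictionary keys are hypothetical labels, not names from the paper's code.
BASE_LR = {"ct_params": 1e-1, "classifier": 1e-3}


def stratified_holdout(labels, frac=0.2, seed=42):
    """Split indices into (train, val), holding out `frac` of each class.

    Assumption: the per-class validation count is round(n_class * frac).
    """
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    rng = random.Random(seed)
    train, val = [], []
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        n_val = round(len(idxs) * frac)
        val.extend(idxs[:n_val])
        train.extend(idxs[n_val:])
    return sorted(train), sorted(val)


def lr_factor(step, steps_per_epoch, decay_epoch=10, decay=0.1):
    """Multiplier applied to a base learning rate at a given global step.

    Assumptions: warm-up rises linearly per step over the first epoch,
    and the decay applies once `decay_epoch` full epochs have finished.
    """
    epoch = step // steps_per_epoch  # 0-indexed epoch
    if epoch == 0:
        return (step + 1) / steps_per_epoch
    return decay if epoch >= decay_epoch else 1.0


if __name__ == "__main__":
    labels = [0] * 10 + [1] * 20
    train_idx, val_idx = stratified_holdout(labels)
    print(len(train_idx), len(val_idx))
    for name, base in BASE_LR.items():
        print(name, [base * lr_factor(e * 100, 100) for e in (0, 1, 10)])
```

A multiplier of this form could be passed to a per-step scheduler such as PyTorch's `torch.optim.lr_scheduler.LambdaLR`, with one optimizer parameter group per base learning rate; whether the authors structured their training loop this way is not stated in the extracted text.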