Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

StelLA: Subspace Learning in Low-rank Adaptation using Stiefel Manifold

Authors: Zhizhong Li, Sina Sajadmanesh, Jingtao Li, Lingjuan Lyu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical results across a wide range of downstream tasks, including commonsense reasoning, math and code generation, image classification, and image generation, demonstrate the superior performance of our approach against the recent state-of-the-art variants of LoRA. Code is available at https://github.com/SonyResearch/stella.
Researcher Affiliation	Industry	Zhizhong Li, Sina Sajadmanesh, Jingtao Li, Lingjuan Lyu Sony AI Zurich, Switzerland EMAIL
Pseudocode	Yes	Algorithm 1 StelLA: Stiefel Low-Rank Adaptation. Require: Pre-trained weight W ∈ ℝ^{m×n}, loss function L, a Euclidean optimizer's step function step, rank r, scale factor α, number of iterations T.
Open Source Code	Yes	Code is available at https://github.com/SonyResearch/stella.
Open Datasets	Yes	Models and Datasets. We evaluate the performance of StelLA on the commonsense reasoning benchmark, which assesses the reasoning capabilities of large language models across 8 sub-tasks. Following the setup of Liu et al. [40], we train on the combined data from all sub-tasks and evaluate on the test set. We fine-tune two popular LLM checkpoints, LLaMA2-7B [60] and LLaMA3-8B [21].
Dataset Splits	No	The paper describes using standard datasets and general training/evaluation setups (e.g., "train on the combined data from all sub-tasks and evaluate on the test set," "measure the validation top-1 accuracy"), but it does not provide explicit numerical dataset split information (percentages or counts) for all experiments within the main text of this paper. For some experiments, it defers to prior work protocols.
Hardware Specification	Yes	In practice, training a LoRA-adapted LLaMA3-8B model on a commonsense reasoning benchmark takes approximately 4.5 hours on a single H100 GPU, whereas training the same model with StelLA takes around 5.2 hours, about only 15% slower than vanilla LoRA.
Software Dependencies	No	We implement StelLA in PyTorch [48] using optimizer hooks. Specifically, line 5 is implemented as a pre-hook to the optimizer step, while lines 7-8 are implemented as a post-hook. Our implementation is readily integrable with Hugging Face's PEFT library [42], enabling easy adoption by the community. Specifically, we use the gesvda solver [49], which is a CUDA-accelerated SVD implementation that can handle tall matrices efficiently.
Experiment Setup	Yes	For fair comparison, we fix the rank to 32, α to 64, batch size to 16, weight decay to 0, dropout to 0.05, and train for 3 epochs using AdamW. The learning rate is separately tuned for each method and follows a linear decay schedule.
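
The shared hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object, as a rough sketch for readers who want to mirror the setup. The class name `FineTuneConfig`, the function `linear_decay_lr`, and the field names are hypothetical and not taken from the paper or its repository. The row does not report the tuned learning rates, the total step count, or whether the schedule includes warmup, so the peak learning rate and step count are left as caller-supplied arguments, and the decay is assumed to run linearly from the peak down to zero with no warmup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters fixed across methods, as quoted in the Experiment Setup row."""

    rank: int = 32             # low-rank dimension r
    alpha: float = 64.0        # scale factor
    batch_size: int = 16
    weight_decay: float = 0.0
    dropout: float = 0.05
    epochs: int = 3
    optimizer: str = "AdamW"


def linear_decay_lr(step: int, total_steps: int, peak_lr: float) -> float:
    """Return the learning rate at `step` under a linear decay.

    Assumption: the rate starts at `peak_lr` and reaches 0 at `total_steps`,
    with no warmup phase. The source only says "linear decay schedule".
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so out-of-range steps do not produce negative or inflated rates.
    step = min(max(step, 0), total_steps)
    return peak_lr * (1.0 - step / total_steps)
```

For example, `linear_decay_lr(50, 100, peak_lr)` returns half of whatever peak rate was tuned for a given method. The peak itself has to come from the paper's appendix or the released code, since the row above states only that it is tuned separately per method.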