Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
StelLA: Subspace Learning in Low-rank Adaptation using Stiefel Manifold
Authors: Zhizhong Li, Sina Sajadmanesh, Jingtao Li, Lingjuan Lyu
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results across a wide range of downstream tasks, including commonsense reasoning, math and code generation, image classification, and image generation, demonstrate the superior performance of our approach against the recent state-of-the-art variants of LoRA. Code is available at https://github.com/SonyResearch/stella. |
| Researcher Affiliation | Industry | Zhizhong Li, Sina Sajadmanesh, Jingtao Li, Lingjuan Lyu Sony AI Zurich, Switzerland EMAIL |
| Pseudocode | Yes | Algorithm 1 StelLA: Stiefel Low-Rank Adaptation. Require: Pre-trained weight $W \in \mathbb{R}^{m \times n}$, loss function L, a Euclidean optimizer's step function `step`, rank r, scale factor α, number of iterations T. |
| Open Source Code | Yes | Code is available at https://github.com/SonyResearch/stella. |
| Open Datasets | Yes | Models and Datasets. We evaluate the performance of StelLA on the commonsense reasoning benchmark, which assesses the reasoning capabilities of large language models across 8 sub-tasks. Following the setup of Liu et al. [40], we train on the combined data from all sub-tasks and evaluate on the test set. We fine-tune two popular LLM checkpoints, LLaMA2-7B [60] and LLaMA3-8B [21]. |
| Dataset Splits | No | The paper describes using standard datasets and general training/evaluation setups (e.g., "train on the combined data from all sub-tasks and evaluate on the test set," "measure the validation top-1 accuracy"), but it does not provide explicit numerical dataset split information (percentages or counts) for all experiments within the main text of this paper. For some experiments, it defers to prior work protocols. |
| Hardware Specification | Yes | In practice, training a LoRA-adapted LLaMA3-8B model on a commonsense reasoning benchmark takes approximately 4.5 hours on a single H100 GPU, whereas training the same model with StelLA takes around 5.2 hours, about only 15% slower than vanilla LoRA. |
| Software Dependencies | No | We implement StelLA in PyTorch [48] using optimizer hooks. Specifically, line 5 is implemented as a pre-hook to the optimizer step, while lines 7-8 are implemented as a post-hook. Our implementation is readily integrable with Hugging Face's PEFT library [42], enabling easy adoption by the community. Specifically, we use the gesvda solver [49], which is a CUDA-accelerated SVD implementation that can handle tall matrices efficiently. |
| Experiment Setup | Yes | For fair comparison, we fix the rank to 32, α to 64, batch size to 16, weight decay to 0, dropout to 0.05, and train for 3 epochs using AdamW. The learning rate is separately tuned for each method and follows a linear decay schedule. |
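
For readers who want to reuse the quoted settings, the following is a minimal sketch that collects the shared hyperparameters from the Experiment Setup row and checks the timing figures from the Hardware Specification row. It is not taken from the authors' repository. The names `FineTuneSetup`, `linear_decay_lr`, and `relative_slowdown` are hypothetical, and the schedule assumes no warmup and decay to zero, which the quoted text does not specify. The base learning rate in the example is a placeholder, since the excerpt only says it is tuned separately per method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneSetup:
    """Shared hyperparameters as quoted in the Experiment Setup row."""
    rank: int = 32
    alpha: int = 64
    batch_size: int = 16
    weight_decay: float = 0.0
    dropout: float = 0.05
    epochs: int = 3
    optimizer: str = "AdamW"


def linear_decay_lr(base_lr: float, step: int, total_steps: int) -> float:
    """Linearly decay base_lr toward zero over total_steps.

    Assumption: no warmup and a floor of zero; the quoted text only
    states that a linear decay schedule is used.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    remaining = max(0.0, 1.0 - step / total_steps)
    return base_lr * remaining


def relative_slowdown(baseline_hours: float, method_hours: float) -> float:
    """Fractional extra wall-clock time of a method relative to a baseline."""
    return method_hours / baseline_hours - 1.0


setup = FineTuneSetup()

# Timing figures quoted in the Hardware Specification row (single H100 GPU).
slowdown = relative_slowdown(baseline_hours=4.5, method_hours=5.2)

# Placeholder base learning rate, chosen only for illustration.
halfway_lr = linear_decay_lr(base_lr=1e-4, step=50, total_steps=100)

print(setup)
print(f"relative slowdown: {slowdown:.1%}")
print(f"learning rate at the halfway point: {halfway_lr:.2e}")
```

The quoted 4.5 and 5.2 hours give a slowdown of roughly 15.6%, consistent with the "about 15%" figure in the table.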