Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

PoLAR: Polar-Decomposed Low-Rank Adapter Representation

Authors: Kai Lion, Liang Zhang, Bingcong Li, Niao He

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We show that low-rank adaptation of large-scale models suffers from a low stable rank that is well below the linear algebraic rank of the subspace, degrading finetuning performance. To mitigate the underutilization of the allocated subspace, we propose PoLAR, a parameterization inspired by the polar decomposition that factorizes the low-rank update into two direction matrices constrained to Stiefel manifolds and an unconstrained scale matrix. Our theory shows that PoLAR yields an exponentially faster convergence rate on a canonical low-rank adaptation problem. Pairing the parameterization with Riemannian optimization leads to consistent gains on three different benchmarks testing general language understanding, commonsense reasoning, and mathematical problem solving with base model sizes ranging from 350M to 27B.
Researcher Affiliation	Academia	Kai Lion, Liang Zhang, Bingcong Li, Niao He; Department of Computer Science, ETH Zurich, Zurich, Switzerland; EMAIL
Pseudocode	Yes	Algorithm 1 RGD for PoLAR parameterized (4) ... Algorithm 2 PoLAR Fine-tuning ... Algorithm 3 RGD for PoLAR-parameterized (25)
Open Source Code	Yes	The code for our experiments is available at https://github.com/kcc-lion/polar/.
Open Datasets	Yes	We consider the following tasks: BoolQ (Clark et al., 2019), PIQA (Bisk et al., 2019), SIQA (Sap et al., 2019), HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi et al., 2019), ARC-e and ARC-c (Clark et al., 2018), and OpenbookQA (Mihaylov et al., 2018). ... For the mathematical reasoning experiment, we tune the learning rate grid in $\{2 \times 10^{-4}, 4 \times 10^{-4}, 6 \times 10^{-4}\}$. We train for 2 epochs on MetaMathQA (Yu et al., 2024) using rank 16, batch size 128, and tune $\lambda \in \{10^{-3}, 5 \times 10^{-3}\}$. We evaluate on GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021) using lm-evaluation-harness and the above prompt. ... Language Understanding. GLUE (Wang et al., 2019) is designed to provide a general-purpose evaluation of language understanding.
Dataset Splits	Yes	For evaluation, we employ the widely-adopted lm-evaluation-harness framework from Eleuther-AI (Biderman et al., 2024). We report the accuracy based on multiple-choice log-likelihood evaluation to facilitate reproducibility.
Hardware Specification	Yes	Experiments are performed on either of NVIDIA GH200 and NVIDIA H100 GPUs.
Software Dependencies	No	We use PyTorch (Paszke et al., 2019) for all experiments.
Experiment Setup	Yes	For the results in Table 10, we train for 5 epochs on each task with batch size 128 and choose the learning rate within $\{4 \times 10^{-4}, 8 \times 10^{-4}, 4 \times 10^{-3}\}$. We tune $\lambda \in \{10^{-3}, 5 \times 10^{-3}\}$ for PoLAR and set $\alpha = 32$. ... For the mathematical reasoning experiment, we tune the learning rate grid in $\{2 \times 10^{-4}, 4 \times 10^{-4}, 6 \times 10^{-4}\}$. We train for 2 epochs on MetaMathQA (Yu et al., 2024) using rank 16, batch size 128, and tune $\lambda \in \{10^{-3}, 5 \times 10^{-3}\}$.
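
The hyperparameter grids quoted in the Experiment Setup row can be written out as explicit configurations. The following is a minimal sketch, not the authors' code: the names `GRIDS`, `enumerate_configs`, `table_10`, `math_reasoning`, and `lam` are hypothetical, and it assumes the learning rate and λ were searched as a full Cartesian product, which the quoted text does not confirm. Only values stated in the row are included; settings it does not mention (for example, the rank used for Table 10) are left out rather than guessed.

```python
from itertools import product

# Hypothetical container; the numbers are the ones quoted in the
# Experiment Setup row, nothing else is filled in.
GRIDS = {
    "table_10": {
        "fixed": {"epochs": 5, "batch_size": 128, "alpha": 32},
        "learning_rate": [4e-4, 8e-4, 4e-3],
        "lam": [1e-3, 5e-3],
    },
    "math_reasoning": {
        "fixed": {"epochs": 2, "batch_size": 128, "rank": 16},
        "learning_rate": [2e-4, 4e-4, 6e-4],
        "lam": [1e-3, 5e-3],
    },
}


def enumerate_configs(name):
    """Return one dict per (learning_rate, lam) pair for the named grid.

    Assumption: the two tuned values are combined as a Cartesian product.
    """
    grid = GRIDS[name]
    return [
        {**grid["fixed"], "learning_rate": lr, "lam": lam}
        for lr, lam in product(grid["learning_rate"], grid["lam"])
    ]


if __name__ == "__main__":
    for name in GRIDS:
        for config in enumerate_configs(name):
            print(name, config)
```

Under the Cartesian-product assumption, each experiment has three learning rates and two values of λ, so six candidate configurations per experiment.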