Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

AuroRA: Breaking Low-Rank Bottleneck of LoRA with Nonlinear Mapping

Authors: Haonan Dong, Wenhao Zhu, Guojie Song, Liang Wang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on 22 datasets and 6 pretrained models demonstrate that AuroRA: (I) not only matches or surpasses full fine-tuning performance with only 6.18% ~ 25% of LoRA's parameters but also (II) outperforms competitive PEFT methods by up to 10.88% in both NLP and CV tasks, and (III) exhibits robust performance across various rank configurations.
Researcher Affiliation	Collaboration	1 State Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University; 2 Alibaba Group. Corresponding author. EMAIL, EMAIL
Pseudocode	Yes	The algorithm framework is presented in Algo. 1. Algorithm 1: Algorithm workflow of AuroRA
Open Source Code	Yes	The source code is available here. ... We provide the code necessary for replicating the studies described in this paper via an anonymous link, and we detail the experimental setup for the replication in the article itself (See Appendix).
Open Datasets	Yes	For our experiments, we evaluate the ability of AuroRA to achieve parameter-efficient fine-tuning using four categories of datasets spanning both NLP and CV domains: Natural Language Understanding: We employ GLUE (General Language Understanding Evaluation) [39], a widely used multi-task benchmark in NLU, which includes datasets such as SST-2, MRPC, CoLA, QNLI, RTE, and STS-B. ... Commonsense Reasoning: We use a collection of commonly used datasets, including BoolQ [40], PIQA [41], SocialIQA [42], HellaSwag [43], WinoGrande [44], ARC-e, ARC-c [45], and OpenBookQA [46]. ... Image Classification: We use five datasets with small label spaces, Oxford Pets [47], CIFAR-10 [48], DTD [49], EuroSAT [50], and RESISC45 [51], and three datasets with large label spaces, namely Stanford Cars [52], FGVC [53], and CIFAR-100 [48]. Subject-Driven Generation: Following [54], we use the DreamBooth dataset.
Dataset Splits	Yes	Table 8: Detailed task descriptions and dataset statistics for the GLUE benchmark. ... #Train #Val #Test ... Table 9: Details of datasets being evaluated in commonsense reasoning task. ... #Train #Test ... Table 10: Details of the datasets for the Image Classification task. ... #Train #Val #Test
Hardware Specification	Yes	image classification tasks run on four NVIDIA GeForce RTX 4090 (24GB) GPUs. Commonsense reasoning and subject-driven generation tasks run on NVIDIA L20 (48GB).
Software Dependencies	No	The paper mentions "Optimizer AdamW" in Tables 11, 12, and 13, but does not specify version numbers for any software libraries or programming languages.
Experiment Setup	Yes	To ensure the reproducibility of our experimental results, we provide the detailed hyperparameter settings used in our experiments. In all of our experiments, to achieve a better balance between parameter count and performance, we set the hidden layer dimension (Rank er) of AuroRA to 2. Correspondingly, we set the hyperparameter α of AuroRA to 4. ... Tables 11, 12, 13 list specific hyperparameters for different tasks, including Optimizer, LR Schedule, Epochs, Learning Rate, Max Seq. Len, Batch Size, Dropout, Warmup Steps, and Weight Decay.
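
The Yes/No results in the table above can be tallied with a short script. This is a minimal sketch for reading the table, not the index's scoring formula, which this page does not describe. The names RESULTS and tally are hypothetical, and the two categorical variables (Research Type, Researcher Affiliation) are left out on the assumption that only the Yes/No rows are meaningful to count.

```python
# Yes/No results copied from the table above. Research Type and Researcher
# Affiliation are categorical, so they are excluded (an assumption of this sketch).
RESULTS = {
    "Pseudocode": "Yes",
    "Open Source Code": "Yes",
    "Open Datasets": "Yes",
    "Dataset Splits": "Yes",
    "Hardware Specification": "Yes",
    "Software Dependencies": "No",
    "Experiment Setup": "Yes",
}


def tally(results):
    """Return (count of 'Yes' variables, total variables, names not marked 'Yes')."""
    satisfied = [name for name, value in results.items() if value == "Yes"]
    missing = [name for name, value in results.items() if value != "Yes"]
    return len(satisfied), len(results), missing


if __name__ == "__main__":
    yes_count, total, missing = tally(RESULTS)
    print(f"{yes_count}/{total} binary variables satisfied; not satisfied: {missing}")
```

A plain count treats every variable as equally important, which may not match how the index weights them; the list of unsatisfied variables is usually the more useful output when deciding what a replication attempt would have to reconstruct.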