Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

RidgeLoRA: Matrix Ridge Enhanced Low-Rank Adaptation of Large Language Models

Authors: Junda Zhu, Jun Ai, Yujun Li, Yichun Yin, Yasheng Wang, Lifeng Shang, Qun Liu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Furthermore, extensive experiments across multiple domains demonstrate that RidgeLoRA achieves better performance than other LoRA variants, and can even match or surpass full-rank training.
Researcher Affiliation	Collaboration	1 Beihang University; 2 Huawei Noah's Ark Lab
Pseudocode	Yes	Algorithm 1 Weight Initialization of RidgeLoRA Input: Input dimension d_in, Target low rank r, Scaling factor α Output: λ, Σ, A, B
Open Source Code	Yes	https://github.com/chuhac/RidgeLoRA
Open Datasets	Yes	Datasets: In order to showcase the validity of RidgeLoRA and demonstrate its good performance. In comparisons with state-of-the-art low rank methods, we conduct comprehensive experiments across different tasks, which are widely utilized for evaluation in previous works: namely, (i) Commonsense Reasoning, (ii) Math & Code Problems and (iii) Multi-modal Understanding tasks. We adopted all of its training split for fine-tuning for fixed number of steps to ensure fair comparisons. Reported metrics are evaluated on the official test splits. i. For the Commonsense Reasoning datasets, following previous works, we conduct multi-task training with the training split of eight related datasets, namely BoolQ [66], PIQA [67], Social IQA [68], HellaSwag [69], WinoGrande [70], ARC-Easy, ARC-Challenge [71] and OpenbookQA [72]. ii. As for Math&Code problems, we evaluate LLMs with GSM8K [73] and MATH [74] for math, HumanEval [75] and MBPP [76] for code capability. We conduct supervised fine-tuning with MetaMath [77] and Code-Feedback for math and code, respectively, to ensure there is no data leakage. iii. For the multi-modal understanding datasets, we include GQA [78], ScienceQA ([79], SQA in Table 4), TextVQA ([80], VQAT in Table 4) and POPE [81] to test how well the trained model fits with the vision projector and understands the images. iv. We also adopt the GLUE benchmark for the NLU tasks, which consists of the following datasets: a) Single Sentence Classification Tasks: SST-2 [82] or the Stanford Sentiment Treebank's goal is to predict the sentiment (positive/negative) of reviews on different movies, which is a binary classification task. CoLA [83] or the Corpus of Linguistic Acceptability consists of sentences each annotated with whether it is a grammatical English sentence. b) Similarity or Paraphrase Tasks: MRPC [84] or the Microsoft Research Paraphrase Corpus is to identify if a sentence pair consists of sentences paraphrases of each other. QQP or Quora Question Pairs is to determine whether two questions are semantically equivalent, question pairs are collected from the website Quora. STS-B [85] or the Semantic Textual Similarity Benchmark is a collection of sentence pairs drawn from news headlines, video and image captions, and natural language inference data. The task is to evaluate how similar two chunks of texts are with a score from 1 to 5. c) Language Entailment Tasks: MNLI [86] or the Multi-Genre Natural Language Inference is a crowdsourced dataset of sentence pairs with entailment annotations, sourced from diverse materials like speech, fiction, and reports, evaluated on both in-domain and cross-domain sections using private labels. QNLI or the Question Natural Language Inference consists of question-paragraph pairs from Wikipedia, originally from the SQuAD [87] and post-processed when building GLUE. RTE or the Recognizing Textual Entailment is a binary entailment task with a small training dataset, which consists of sentence pairs from four annual textual entailment challenges [88–91].
Dataset Splits	Yes	We adopted all of its training split for fine-tuning for fixed number of steps to ensure fair comparisons. Reported metrics are evaluated on the official test splits. For the Commonsense Reasoning datasets, following previous works, we conduct multi-task training with the training split of eight related datasets
Hardware Specification	No	No specific hardware details (like GPU/CPU models) are provided for running the experiments. The paper mentions the base models used (e.g., Llama-2-7B, Llama-3.1-8B) and discusses computational constraints in the limitations, but not the hardware specifications for the experimental runs.
Software Dependencies	No	No specific software dependencies with version numbers are provided. The paper only mentions using AdamW [48] as the optimizer, but no programming languages, frameworks, or library versions.
Experiment Setup	Yes	Throughout our experiments, we adopt a cosine learning rate schedule and use AdamW [48] as the optimizer. Unless otherwise specified, all LoRA variants share the same maximum learning rate for a given task. Concretely, we use a learning rate of 2e-5 for Math & Code tasks, and 3e-5 for Commonsense tasks. The rank of the low-rank matrices is set to 64 for Math & Code tasks, and 8 for Commonsense tasks.
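
Illustrative sketch (not from the paper): the Experiment Setup row quotes a cosine learning rate schedule with a per-task maximum learning rate and rank. The snippet below shows what such a schedule could look like in plain Python. The learning rates and ranks are the quoted values; the function name cosine_lr, the SETUP dictionary, the absence of warmup, the minimum learning rate of 0, and the total step count are all assumptions, since the quoted text does not state them.

```python
import math

# Values quoted in the Experiment Setup row; the dictionary layout is hypothetical.
SETUP = {
    "math_code": {"max_lr": 2e-5, "rank": 64},
    "commonsense": {"max_lr": 3e-5, "rank": 8},
}


def cosine_lr(step: int, total_steps: int, max_lr: float, min_lr: float = 0.0) -> float:
    """Cosine decay from max_lr at step 0 down to min_lr at total_steps.

    No warmup phase and min_lr = 0 are assumptions; the quoted setup
    does not specify either.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp the step so values outside [0, total_steps] stay on the curve's ends.
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # hypothetical step count, not reported in the quoted text
    for task, cfg in SETUP.items():
        start, middle, end = (cosine_lr(s, total, cfg["max_lr"]) for s in (0, total // 2, total))
        print(f"{task}: rank={cfg['rank']} lr start={start:.2e} mid={middle:.2e} end={end:.2e}")
```

Under these assumptions the schedule starts at the quoted maximum learning rate, passes through half of it at the midpoint, and decays to zero at the final step. The paper's released code may use a different warmup or floor.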