Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

COALA: Numerically Stable and Efficient Framework for Context-Aware Low-Rank Approximation

Authors: Uliana Parkina, Maxim Rakhuba

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	6 Experiments In this section, we evaluate the effectiveness of our regularization-based compression approach in practice. [...] Figure 4: Comparison of the impact of parameter tuning with (Equation (5)) and without considering layer-wise norms on model quality at 70% compression, evaluated on a common-sense reasoning dataset using the Mistral-7B-Instruct model. [...] Table 1: Computation times produced by different methods. Model #Samples Strategy Time, s LLaMA3-1B 64 SVD-LLM 273.93 22.12 [...] The results indicate that in all the considered settings our regularized algorithm systematically achieves better metrics during compression. 6.2 Fine-Tuning Table 4: Results of fine-tuning LLaMA3-1B-Instruct at rank r = 8 using different PEFT initialization methods on the commonsense reasoning dataset with 24 examples for initialization.
Researcher Affiliation	Academia	Uliana Parkina HSE University EMAIL Maxim Rakhuba HSE University
Pseudocode	Yes	Algorithm 1 A Stable Solution to the Weighted Low-Rank Approximation Problem [...] Algorithm 2 A solution to the weighted low-rank approximation problem with regularization
Open Source Code	Yes	1Our code is available at https://github.com/urparkina/COALA.
Open Datasets	Yes	We conducted experiments on the models LLaMA3-8B, LLaMA3-1B [16] and Mistral-7B [5] (including Instruct versions), comparing our approach with existing methods across various datasets: BoolQ [8], OpenBookQA [35], WinoGrande [38], HellaSwag [51], Arc_e [9], Arc_c [10], PIQA [4], MMLU [18].
Dataset Splits	No	The paper states: 'using text samples from the commonsense reasoning dataset, which was also used for validation.' and 'All training runs were conducted on the same dataset consisting of 40,000 examples, presented in the same order across all experiments.' While it mentions the total number of examples for some tasks and a validation dataset, it does not provide specific percentages or counts for training, testing, and validation splits across all experiments, or references to predefined splits with citations for these specific splits.
Hardware Specification	Yes	All calculations were performed on a single NVIDIA A100 GPU. [...] All experiments were performed on an NVIDIA Tesla T4 GPU with Driver Version 535.183.01 and CUDA Version 12.2.
Software Dependencies	Yes	All experiments were performed on an NVIDIA Tesla T4 GPU with Driver Version 535.183.01 and CUDA Version 12.2.
Experiment Setup	Yes	Table 5: Choice of hyperparameters for different methods, which were applied to the matrices Q, K, V, O, Up, Down. Hyperparameter: LoRA / PiSSA / CorDA / COALA. Rank r: 8 / 8 / 8 / 8. α: 12 4 1 2 8. Dropout: 0.0 / 0.0 / 0.0 / 0.0. Optimizer: AdamW / AdamW / AdamW / AdamW. Learning Rate: 1 × 10⁻⁴ / 1 × 10⁻⁴ / 1 × 10⁻⁴ / 1 × 10⁻⁴. LR Scheduler: Cosine / Cosine / Cosine / Cosine. Batch Size: 16 / 16 / 16 / 16. Warmup Steps: 100 / 100 / 100 / 100. Epochs: 1 / 1 / 1 / 1.
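
The hyperparameters shared by all four methods in Table 5 can be collected into a small configuration object, which makes the implied training length easy to check. The following is a minimal sketch, not the authors' code: the names `FineTuneConfig` and `lr_at` are hypothetical, the learning rate is read as 1 × 10⁻⁴, the 40,000-example count comes from the Dataset Splits excerpt, and the schedule shape (linear warmup followed by cosine decay to zero) is an assumption, since the excerpt only says "Cosine". The α row is omitted because its values did not survive extraction.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Values shared by LoRA, PiSSA, CorDA and COALA in Table 5 (hypothetical container)."""
    rank: int = 8
    dropout: float = 0.0
    learning_rate: float = 1e-4   # assumption: "1 10 4" in the excerpt read as 1 x 10^-4
    batch_size: int = 16
    warmup_steps: int = 100
    epochs: int = 1
    num_examples: int = 40_000    # from the Dataset Splits excerpt

    @property
    def total_steps(self) -> int:
        # Optimizer steps over the whole run, assuming no gradient accumulation.
        return math.ceil(self.num_examples / self.batch_size) * self.epochs


def lr_at(step: int, cfg: FineTuneConfig) -> float:
    """Learning rate at a 0-indexed step: linear warmup, then cosine decay to zero (assumed shape)."""
    if step < cfg.warmup_steps:
        return cfg.learning_rate * (step + 1) / cfg.warmup_steps
    decay_steps = max(1, cfg.total_steps - cfg.warmup_steps)
    progress = min(1.0, (step - cfg.warmup_steps) / decay_steps)
    return 0.5 * cfg.learning_rate * (1.0 + math.cos(math.pi * progress))


cfg = FineTuneConfig()
print(cfg.total_steps)  # 40,000 examples / batch size 16 over 1 epoch
```

Under these assumptions, one epoch corresponds to 2,500 optimizer steps, of which the first 100 are warmup.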