Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Accurate and Efficient Low-Rank Model Merging in Core Space

Authors: Aniello Panariello, Daniel Marczak, Simone Magistri, Angelo Porrello, Bartłomiej Twardowski, Andrew D. Bagdanov, Simone Calderara, Joost van de Weijer

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive empirical results demonstrate that Core Space significantly improves existing merging techniques and achieves state-of-the-art results on both vision and language tasks while utilizing a fraction of the computational resources. Codebase is available at https://github.com/apanariello4/core-space-merging. 5 Experimental Results. LLMs merging. We present Llama 3 8B results in natural language inference in Tab. 2. In line with our complexity analysis, merging in Core Space is much more efficient than merging in Full or KnOTS space, bringing up to 600× merging speed-up. Moreover, merging in Core Space improves the performance of all tested merging methods. In particular, it elevates TSV to 94.16% average normalized accuracy, achieving state-of-the-art results. Per-task evaluation in vision setting. We present per-task vision results for ViT-B/32 in Tab. 3.
Researcher Affiliation	Academia	(1) AImage Lab, University of Modena and Reggio Emilia, Italy; (2) Warsaw University of Technology, Poland; (3) IDEAS NCBR, Warsaw, Poland; (4) Media Integration and Communication Center (MICC), University of Florence, Italy; (5) IDEAS Research Institute, Warsaw, Poland; (6) Computer Vision Center, Universitat Autònoma de Barcelona, Spain
Pseudocode	Yes	Algorithm 1 Core Matrix Alignment and Merging. Require: Low-rank updates $\{(A^{(t)}, B^{(t)})\}_{t=1}^{T}$, merging function $\mathcal{M}(\cdot)$. 1: Stack $A^{(t)}$ vertically, $B^{(t)}$ horizontally. 2: Compute $V_A^{\text{ref}}$: $\text{stack}(A^{(t)}) = U_A \Sigma_A V_A^{\text{ref}\top}$ (reference bases). 3: Compute $U_B^{\text{ref}}$: $\text{stack}(B^{(t)}) = U_B^{\text{ref}} \Sigma_B V_B^{\top}$. 4: for $t = 1$ to $T$ do 5: Compute $M^{(t)} = (U_B^{\text{ref}\top} B^{(t)})(A^{(t)} V_A^{\text{ref}})$ (Eq. (8)). 6: Merge aligned core matrices: $M_{\text{merged}} = \mathcal{M}(\{M^{(t)}\}_{t=1}^{T})$. 7: return $W = U_B^{\text{ref}} M_{\text{merged}} V_A^{\text{ref}\top}$ (reconstructed merged model).
Open Source Code	Yes	Codebase is available at https://github.com/apanariello4/core-space-merging. Our Core Space merging implementation is released at https://github.com/apanariello4/core-space-merging.
Open Datasets	Yes	D Additional Experiment Details. Licenses of Used Datasets and Models. In our research, we employed publicly available datasets and models, each governed by specific licenses. Below, we outline the sources and associated licenses for each: KnOTS LoRA Checkpoints [42]: The KnOTS repository, which provides LoRA-adapted model checkpoints and training scripts, is licensed under the MIT License. This permissive license allows for reuse and modification with proper attribution. Cars196 [22]: The Cars196 dataset is available for non-commercial research purposes. Specific licensing details are not explicitly provided. Describable Textures Dataset (DTD) [7]: The DTD is made available to the computer vision community for research purposes. The dataset is licensed under the Creative Commons Attribution 4.0 License (CC BY 4.0). EuroSAT [14]: The EuroSAT dataset is licensed under the MIT License. German Traffic Sign Recognition Benchmark (GTSRB) [41]: The GTSRB dataset is licensed under the Creative Commons Zero (CC0) Public Domain Dedication. MNIST [23]: The MNIST dataset is publicly available for research purposes. Specific licensing details are not explicitly provided; users are advised to consult the dataset's source for more information. NWPU-RESISC45 [5]: The NWPU-RESISC45 dataset is licensed under the Creative Commons Attribution 4.0 License (CC BY 4.0). SUN397 [49]: The SUN397 dataset is available for research purposes only. Specific licensing details are not explicitly provided; users are advised to consult the dataset's source for more information. Street View House Numbers (SVHN) [33]: The SVHN dataset is available for noncommercial use only. Stanford Natural Language Inference (SNLI) [3]: The SNLI dataset is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0). Multi-Genre Natural Language Inference (MNLI) [47]: The MNLI dataset is released under the Open American National Corpus (OANC) license, which permits free use, modification, and sharing under permissive terms. Sentences Involving Compositional Knowledge (SICK) [30]: The SICK dataset is distributed under a Creative Commons Attribution-NonCommercial-ShareAlike license. Question Natural Language Inference (QNLI) [45]: The QNLI dataset is part of the GLUE benchmark. Specific licensing details are not explicitly provided; users are advised to consult the dataset's source for more information. Recognizing Textual Entailment (RTE) [45]: The RTE dataset is part of the GLUE benchmark. Specific licensing details are not explicitly provided; users are advised to consult the dataset's source for more information. SciTail [20]: The SciTail dataset is licensed under the Apache License 2.0.
Dataset Splits	No	To find optimal hyperparameters for each model, we adopt the widely used validation holdout strategy [42, 28, 12, 50]. Specifically, we perform a linear search for hyperparameters on the validation set, starting from a defined minimum value and incrementally increasing it until performance declines, indicating the optimal range. The identified optimal hyperparameters are then applied to the test set.
Hardware Specification	Yes	D.1 Experimental Environment The language experiments with Llama 3 8B were performed with a single 48G NVIDIA L40S. In contrast, the more affordable vision experiments were executed using a single 16G NVIDIA RTX 4080. To keep things fair, the reported times for the language experiments all refer to experiments performed on the same machine.
Software Dependencies	No	Our implementation builds directly on the KnOTS codebase [42] and uses the exact LoRA checkpoints they released.
Experiment Setup	Yes	We employ Llama 3 8B [13] fine-tuned on 6 NLI tasks for the language experiments. All models are fine-tuned with LoRA [16] with rank 16 applied on all matrices (keys, queries, values, and outputs) across all attention layers. D.2 Hyperparameter Search. To find optimal hyperparameters for each model, we adopt the widely used validation holdout strategy [42, 28, 12, 50]. Specifically, we perform a linear search for hyperparameters on the validation set, starting from a defined minimum value and incrementally increasing it until performance declines, indicating the optimal range. The identified optimal hyperparameters are then applied to the test set. We use the following search settings: Scaling factor α starts at 0.1, increasing in increments of 0.1. This is used for every approach. The top-K parameter for TIES and DARE-TIES begins at 10 and increases in increments of 10. The pruning factor p for DARE-TIES starts at 0.1 and increases in increments of 0.1. For CART, the pruning rank is searched over the set {0.04, 0.08, 0.16, 0.32}, following the methodology of the original paper. Additionally, CART includes an extra scaling factor λ in its merging formulation. Specifically, the merged weights are computed as $W_{\text{merged}} = W_0 + \alpha\left(\theta_{\text{avg}} + \lambda \sum_{t=1}^{T} \tau_t\right)$, where $\theta_{\text{avg}}$ denotes the average of the updates and $\tau_t$ represents the centered task vector for task $t$. For further details, we refer the reader to [6].
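
Illustrative sketch (not part of the indexed record): the steps of Algorithm 1 quoted in the Pseudocode row can be written in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' released implementation. The names `core_space_merge`, `merge_fn`, and `task_arithmetic` are hypothetical, each A(t) is assumed to have shape (r, n) and each B(t) shape (m, r), and a plain scaled sum stands in for the merging function.

```python
import numpy as np


def core_space_merge(As, Bs, merge_fn):
    """Sketch of Algorithm 1. As: list of (r, n) arrays; Bs: list of (m, r) arrays.

    merge_fn (hypothetical name) maps a list of core matrices to one merged core.
    """
    # Step 1: stack the A factors vertically and the B factors horizontally.
    A_stack = np.concatenate(As, axis=0)  # shape (T*r, n)
    B_stack = np.concatenate(Bs, axis=1)  # shape (m, T*r)

    # Steps 2-3: reference bases from reduced SVDs of the stacks.
    _, _, Vh_A = np.linalg.svd(A_stack, full_matrices=False)
    U_B, _, _ = np.linalg.svd(B_stack, full_matrices=False)
    V_ref = Vh_A.T  # right singular vectors of the stacked A factors
    U_ref = U_B     # left singular vectors of the stacked B factors

    # Steps 4-5: project every task update into the shared core space.
    cores = [(U_ref.T @ B) @ (A @ V_ref) for A, B in zip(As, Bs)]

    # Step 6: merge the aligned core matrices.
    M_merged = merge_fn(cores)

    # Step 7: map the merged core back to the full weight space.
    return U_ref @ M_merged @ V_ref.T


def task_arithmetic(alpha):
    """Assumed stand-in merging function: a scaled sum of the core matrices."""
    return lambda cores: alpha * sum(cores)


rng = np.random.default_rng(0)
m, n, r, T = 32, 24, 4, 3
As = [rng.standard_normal((r, n)) for _ in range(T)]
Bs = [rng.standard_normal((m, r)) for _ in range(T)]
delta = core_space_merge(As, Bs, task_arithmetic(alpha=1.0))
```

Because the reference bases are orthonormal and span the stacked factors, a linear merging function such as the scaled sum above reproduces the same result as summing the full-space products B(t) A(t), up to floating-point error, while the merge itself operates on small (T·r)-sized square matrices.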