Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Latent Space Factorization in LoRA

Authors: Shashi Kumar, Yacouba Kaloga, John Mitros, Petr Motlicek, Ina Kodrasi

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on text, audio, and image tasks demonstrate that FVAE-LoRA consistently outperforms standard LoRA. Moreover, spurious correlation evaluations confirm that FVAE-LoRA better isolates task-relevant signals, leading to improved robustness under distribution shifts. Our code is publicly available at: https://github.com/idiap/FVAE-LoRA
Researcher Affiliation	Academia	1 Idiap Research Institute, Switzerland; 2 EPFL, Switzerland; 3 BUT, Czech Republic
Pseudocode	No	The paper describes the methodology using mathematical formulations and descriptive text, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Our code is publicly available at: https://github.com/idiap/FVAE-LoRA
Open Datasets	Yes	We evaluate FVAE-LoRA on six diverse image classification datasets: DTD [32], EuroSAT [33], GTSRB [34], RESISC45 [35], SUN397 [36], and SVHN [37]. These datasets span various image types, domains, and complexities. For natural language tasks, we use two benchmark categories: 1. Commonsense Reasoning: Training is done on a predefined corpus [39] of query-answer pairs, and the evaluation set includes seven sub-tasks: PIQA [40] (physical commonsense), SIQA [41] (social interaction understanding), ARC-c and ARC-e [42] (science question answering), OBQA [43] (multi-hop reasoning over facts), HellaSwag [44] (commonsense natural language inference), and WinoGrande [45] (fill-in-the-blank). 2. GLUE Benchmark: A subset of GLUE [46] is used, comprising SST2 (sentiment analysis), CoLA (linguistic acceptability), QNLI (question-answering NLI), MRPC (paraphrase detection), RTE (textual entailment), STSB (semantic textual similarity), and WNLI (coreference resolution). We conduct automatic speech recognition (ASR) on the TIMIT acoustic-phonetic corpus [49] for phoneme recognition. Following prior works [54, 51, 52, 53, 55, 56], we consider three standard benchmarks to introduce spurious correlations: Waterbirds [56], where bird type (landbird vs. waterbird) is correlated with background (land vs. water); CelebA [56], where a target attribute (e.g., blonde hair) might be correlated with another attribute (e.g., being female); and Animals [57], a larger-scale dataset derived from ImageNet [58] with four animal classes spuriously correlated with background types (e.g., waterbirds with water, small dogs with indoor scenes).
Dataset Splits	Yes	Table 7: Statistics of the datasets used in the spurious experiment. SpuCo Animals: 4 classes, 8 groups, train 42000, validation 2100, test 4000, class ratio 25:25:25:25. Waterbirds: 2 classes, 4 groups, train 4795, validation 1199, test 5794, class ratio 76.8:23.2. CelebA: 2 classes, 4 groups, train 162770, validation 19867, test 19962, class ratio 85:15.
Hardware Specification	No	The paper does not explicitly describe the hardware (e.g., specific GPU or CPU models, memory details) used for running its experiments. The limitations section (J) also states that computational cost is not reported, which typically correlates with a lack of hardware specifics.
Software Dependencies	No	The paper mentions using "Optimizer AdamW" in its hyperparameter tables (e.g., Tables 8, 9, 10, 11), but it does not specify version numbers for any software libraries (e.g., PyTorch, TensorFlow, scikit-learn, CUDA) or the programming language used.
Experiment Setup	Yes	This section details the hyperparameters used for the experiments presented in the main paper. For all LoRA-based methods, including FVAE-LoRA, the LoRA rank (r) was set to 16, and LoRA was applied to the query and key matrices of the attention layers. The latent dimension of z1 in FVAE-LoRA corresponds to this LoRA rank. Table 8: Hyperparameters for image classification tasks using ViT-B/16. General training parameters: Optimizer: AdamW; Learning Rate: 5 × 10^-3; LR Scheduler: Linear; Warmup Ratio: 0.1; Batch Size: 32; Number of Epochs: 30; Weight Decay: 0.01; Seeds: 1, 2, 42. LoRA parameters: LoRA Rank (r): 16; LoRA Dropout: 0.1. FVAE-LoRA specific parameters: Latent Dim. z1: 16 (same as LoRA rank); Latent Dim. z2: 16; FVAE q_ϕi(z_i|x) Enc. Arch.: x → Linear → dim(z_i) → ReLU → Hidden State_zi → Linear → (µ_zi, log σ²_zi); FVAE p_θ(x|z1, z2) Dec. Arch.: Concat(z1, z2) → Linear → HD = 128 → ReLU → Linear → x̂ (Input Dim); Prior p1(z1): N(0, I); Prior p2(z2): N(1.5, I); λ (Eq. 6): 1 × 10^-3; ELBO Coeff. α (Reconstr.): 1; ELBO Coeff. β (KL q1||p1): 1 or 10; ELBO Coeff. δ: 1
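
For readers who want to work with the Table 8 settings programmatically, the sketch below transcribes them into a small Python config and evaluates the closed-form KL divergence between a diagonal Gaussian posterior and each of the two unit-variance priors, N(0, I) and N(1.5, I). It is an illustration only and is not taken from the authors' repository: the class name `FVAELoRAImageConfig`, its field names, and the helper `kl_to_unit_variance_prior` are hypothetical, and the paper's full objective (Eq. 6) is not reproduced here.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FVAELoRAImageConfig:
    """Hypothetical container for the Table 8 values (ViT-B/16 image tasks)."""
    optimizer: str = "AdamW"
    learning_rate: float = 5e-3
    lr_scheduler: str = "linear"
    warmup_ratio: float = 0.1
    batch_size: int = 32
    num_epochs: int = 30
    weight_decay: float = 0.01
    seeds: tuple = (1, 2, 42)
    lora_rank: int = 16
    lora_dropout: float = 0.1
    latent_dim_z1: int = 16   # same as the LoRA rank
    latent_dim_z2: int = 16
    decoder_hidden_dim: int = 128
    prior_mean_z1: float = 0.0   # p1(z1) = N(0, I)
    prior_mean_z2: float = 1.5   # p2(z2) = N(1.5, I)
    lambda_eq6: float = 1e-3


def kl_to_unit_variance_prior(mu, log_var, prior_mean):
    """KL( N(mu, diag(exp(log_var))) || N(prior_mean, I) ), summed over dimensions.

    Standard closed form: 0.5 * sum(sigma^2 + (mu - m)^2 - 1 - log sigma^2).
    """
    return 0.5 * sum(
        math.exp(lv) + (m - prior_mean) ** 2 - 1.0 - lv
        for m, lv in zip(mu, log_var)
    )


cfg = FVAELoRAImageConfig()

# A standard-normal posterior (mu = 0, log sigma^2 = 0) in 16 dimensions.
mu = [0.0] * cfg.latent_dim_z1
log_var = [0.0] * cfg.latent_dim_z1

# Divergence from the zero-mean prior vanishes; from the shifted prior it does not.
kl_vs_p1 = kl_to_unit_variance_prior(mu, log_var, cfg.prior_mean_z1)
kl_vs_p2 = kl_to_unit_variance_prior(mu, log_var, cfg.prior_mean_z2)
print(kl_vs_p1, kl_vs_p2)
```

The same standard-normal posterior has zero divergence from N(0, I) but a strictly positive divergence from N(1.5, I), which shows how the two priors listed in Table 8 differ only in their mean.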