Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

LT-Soups: Bridging Head and Tail Classes via Subsampled Model Soups

Authors: Masih Aminbeidokhti, Subhankar Roy, Eric Granger, Elisa Ricci, Marco Pedersoli

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments across six benchmark datasets show that LT-Soups achieves superior trade-offs compared to both PEFT and traditional model soups across a wide range of imbalance regimes.
Researcher Affiliation	Academia	¹École de technologie supérieure, ²University of Bergamo, ³University of Trento, ⁴Fondazione Bruno Kessler (FBK)
Pseudocode	Yes	Algorithm 1 LT-Soups (Parallelizable Pseudocode)
Open Source Code	Yes	Correspondence to: EMAIL. Code at https://github.com/Masseeh/LT-Soups.
Open Datasets	Yes	We evaluate our method on both synthetically constructed and naturally occurring long-tailed (LT) datasets. For synthetic benchmarks, we use CIFAR-100-LT, ImageNet-LT, and Places-LT long-tailed variants derived from their balanced counterparts by sampling class instances according to Pareto or exponential decay distributions [34]. These datasets exhibit sample counts ranging from 1,280 to as few as 5 images per class. For real-world evaluation, we include iNaturalist 2018 (8,142 classes, 437.5K images) and NIH-CXR-LT (20 classes, 88.5K images), which reflect different imbalance structures, with approximately 10% and 90% head classes, respectively. To assess performance across the long-tail spectrum, we also report the average accuracy across all five datasets. Following [34], we evaluate separately on many-shot (>100 samples), medium-shot (20–100), and few-shot (<20) class subsets. For ablation analysis, we use Tiny ImageNet-LT, which contains 200 classes with sample counts ranging from 500 in head classes to 5 in tail classes.
Dataset Splits	No	The paper mentions that a validation set is used for checkpoint selection ("The validation set of each dataset is used to select the best checkpoint.") and that evaluation is done on class subsets (many-shot, medium-shot, few-shot), but it does not specify the exact percentages, counts, or methodology for the overall training/testing/validation data splits.
Hardware Specification	Yes	All models were trained to convergence using a batch size of 128 and mixed-precision training with NVIDIA RTX 3090 GPUs (24GB VRAM), using Python 3.9.15, PyTorch 2.4.0, and CUDA 11.8.
Software Dependencies	Yes	All models were trained to convergence using a batch size of 128 and mixed-precision training with NVIDIA RTX 3090 GPUs (24GB VRAM), using Python 3.9.15, PyTorch 2.4.0, and CUDA 11.8.
Experiment Setup	Yes	We optimize the model using the AdamW optimizer [36]. The batch size is set to 128, with learning rates of 3e-4 for both the representation and the classification stage. A cosine decay learning rate scheduler is employed, gradually reducing the learning rate to 0.1 × max_lr after a warmup period spanning max(100, 0.01 × total_steps) steps. The validation set of each dataset is used to select the best checkpoint. Table 11 shows the hyperparameters we used for each dataset. We select N and λ based on the validation set of each dataset and fix M=2 across all experiments.
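
The learning rate schedule quoted in the Experiment Setup row can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes the floor is 0.1 × max_lr, the warmup length is max(100, 0.01 × total_steps), and the warmup is linear (the quoted text only says that a warmup period exists). The function name `lr_at_step` and its parameters are hypothetical.

```python
import math


def lr_at_step(step, total_steps, max_lr=3e-4, min_ratio=0.1):
    """Learning rate at a given optimizer step (hypothetical helper).

    Warmup followed by cosine decay from max_lr down to min_ratio * max_lr.
    """
    # Warmup length as quoted: max(100, 0.01 * total_steps).
    warmup_steps = max(100, int(0.01 * total_steps))

    if step < warmup_steps:
        # ASSUMPTION: linear warmup; the quoted text does not state its shape.
        return max_lr * (step + 1) / warmup_steps

    # Fraction of the decay phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)

    # Cosine factor goes from 1 (start of decay) to 0 (end of training).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))

    min_lr = min_ratio * max_lr
    return min_lr + (max_lr - min_lr) * cosine
```

Under these assumptions, a 20,000-step run warms up for 200 steps, peaks at 3e-4, and ends at 3e-5. A shorter 1,000-step run still warms up for 100 steps because of the lower bound in the max.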