Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
LT-Soups: Bridging Head and Tail Classes via Subsampled Model Soups
Authors: Masih Aminbeidokhti, Subhankar Roy, Eric Granger, Elisa Ricci, Marco Pedersoli
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments across six benchmark datasets show that LT-Soups achieves superior trade-offs compared to both PEFT and traditional model soups across a wide range of imbalance regimes. |
| Researcher Affiliation | Academia | ¹École de technologie supérieure, ²University of Bergamo, ³University of Trento, ⁴Fondazione Bruno Kessler (FBK) |
| Pseudocode | Yes | Algorithm 1 LT-Soups (Parallelizable Pseudocode) |
| Open Source Code | Yes | Correspondence to: EMAIL. Code at https://github.com/Masseeh/LT-Soups. |
| Open Datasets | Yes | We evaluate our method on both synthetically constructed and naturally occurring long-tailed (LT) datasets. For synthetic benchmarks, we use CIFAR-100-LT, ImageNet-LT, and Places-LT long-tailed variants derived from their balanced counterparts by sampling class instances according to Pareto or exponential decay distributions [34]. These datasets exhibit sample counts ranging from 1,280 to as few as 5 images per class. For real-world evaluation, we include iNaturalist 2018 (8,142 classes, 437.5K images) and NIH-CXR-LT (20 classes, 88.5K images), which reflect different imbalance structures, with approximately 10% and 90% head classes, respectively. To assess performance across the long-tail spectrum, we also report the average accuracy across all five datasets. Following [34], we evaluate separately on many-shot (>100 samples), medium-shot (20–100), and few-shot (<20) class subsets. For ablation analysis, we use Tiny ImageNet-LT, which contains 200 classes with sample counts ranging from 500 in head classes to 5 in tail classes. |
| Dataset Splits | No | The paper mentions that a validation set is used for checkpoint selection ("The validation set of each dataset is used to select the best checkpoint.") and that evaluation is done on class subsets (many-shot, medium-shot, few-shot), but it does not specify the exact percentages, counts, or methodology for the overall training/testing/validation data splits. |
| Hardware Specification | Yes | All models were trained to convergence using a batch size of 128 and mixed-precision training with NVIDIA RTX 3090 GPUs (24GB VRAM), using Python 3.9.15, PyTorch 2.4.0, and CUDA 11.8. |
| Software Dependencies | Yes | All models were trained to convergence using a batch size of 128 and mixed-precision training with NVIDIA RTX 3090 GPUs (24GB VRAM), using Python 3.9.15, PyTorch 2.4.0, and CUDA 11.8. |
| Experiment Setup | Yes | We optimize the model using the AdamW optimizer [36]. The batch size is set to 128, with learning rates of 3e-4 for both the representation and the classification stage. A cosine decay learning rate scheduler is employed, gradually reducing the learning rate to 0.1 × max_lr after a warmup period spanning max(100, 0.01 × total_steps) steps. The validation set of each dataset is used to select the best checkpoint. Table 11 shows the hyperparameters we used for each dataset. We select N and λ based on the validation set of each dataset and fix M=2 across all experiments. |
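
The learning rate schedule quoted in the Experiment Setup row can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' code: the function name `lr_at_step`, the linear shape of the warmup, and the 0-indexed step convention are assumptions, since the quoted text gives only the warmup length, the peak rate (3e-4), and the final rate (0.1 × max_lr).

```python
import math


def lr_at_step(step, total_steps, max_lr=3e-4, min_ratio=0.1):
    """Return the learning rate for a 0-indexed training step.

    Hypothetical helper: warmup for max(100, 0.01 * total_steps) steps,
    then cosine decay from max_lr down to min_ratio * max_lr.
    """
    warmup_steps = max(100, int(0.01 * total_steps))

    if step < warmup_steps:
        # Assumption: linear warmup; the quoted setup does not state its shape.
        return max_lr * (step + 1) / warmup_steps

    # Fraction of the decay phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)

    # Cosine factor goes from 1 (start of decay) to 0 (end of training).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    min_lr = min_ratio * max_lr
    return min_lr + (max_lr - min_lr) * cosine
```

With `total_steps=20000`, for example, the warmup lasts 200 steps, the rate peaks at 3e-4, and it decays towards 3e-5 by the final step. For short runs the `max(100, ...)` floor keeps the warmup at 100 steps.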