Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Thumb on the Scale: Optimal Loss Weighting in Last Layer Retraining

Authors: Nathan Stromberg, Christos Thrampoulidis, Lalitha Sankar

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We show, in theory and practice, that loss weighting is still effective in this regime, but that these weights must take into account the relative overparameterization of the model. This work explores the regime of last layer retraining (LLR) in which the unseen limited (retraining) data is frequently inseparable and the model proportionately sized, falling between the two aforementioned extremes. We compare this optimal weighting scheme to downsampling and show that optimal wERM outperforms downsampling, especially when data is limited. Finally, we show that the trends described for simple Gaussian data in theory appear in real-world image classification problems and that our optimal weighting scheme can outperform the classical ratio of priors by leveraging the notion of an effective (latent) dimension. Section 4 Application to Imbalanced Image Classification We finetune a ResNet34 model on the training split of each dataset using cross entropy loss before retraining the final layer with varying ̒̄ on the validation split with square loss (aligning with theory). The focus on square loss may seem like a major restriction, but in practice, square loss is often as performant as cross entropy loss, especially in the low-data fine-tuning setting [27, 28]. To simulate different ̒ values, since the model size is fixed for LLR, we subsample the validation data uniformly to size n. For each n, this retraining is repeated 10 times with different subsamples to get confidence intervals on the captured metrics.
Researcher Affiliation	Academia	Nathan Stromberg Arizona State University EMAIL Christos Thrampoulidis University of British Columbia EMAIL Lalitha Sankar Arizona State University EMAIL
Pseudocode	No	The paper includes theoretical results with theorems, corollaries, and proofs, and describes experimental procedures, but does not feature any explicit 'Pseudocode' or 'Algorithm' blocks or figures.
Open Source Code	Yes	Code is provided in SM and will be released publicly after the review period.
Open Datasets	Yes	We consider the CelebA [25] dataset which consists of images of celebrity faces, each marked with 40 binary attributes. We also consider a binary version of CIFAR10 [26] with artificial imbalance, where class -1 is truck with 91% of the data and +1 is airplane with 9% of the data.
Dataset Splits	Yes	We finetune a ResNet34 model on the training split of each dataset using cross entropy loss before retraining the final layer with varying ̒̄ on the validation split with square loss (aligning with theory). ... To simulate different ̒ values, since the model size is fixed for LLR, we subsample the validation data uniformly to size n. For each n, this retraining is repeated 10 times with different subsamples to get confidence intervals on the captured metrics. We note that CIFAR10 does not provide a validation split, so we create one from a fixed 10% split of the test data.
Hardware Specification	Yes	All empirical experiments were performed using NVIDIA A100 GPUs while simulations were completed on CPU.
Software Dependencies	No	The paper does not list specific version numbers for key software components like programming languages (e.g., Python), machine learning frameworks (e.g., PyTorch, TensorFlow), or other libraries, which are necessary for reproducible software details.
Experiment Setup	Yes	A full list of hyperparameters is provided in Table 1. Table 1 (Parameter: Value): Backbone: ResNet34; Pretrained Weights: Imagenet1k-V2; Latent Dimension: {128, 256, 512}; Optimizer: AdamW; Learning Rate: 1e-3; Full fine-tuning epochs: 10; MLP Dropout Rate: 0.5; Fine-tuning epochs: 30; Fine-tuning LR: 1e-2.
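
The retraining protocol quoted in the Dataset Splits row (subsample the validation data to size n, retrain the last layer with a weighted square loss, repeat 10 times, report a confidence interval) can be sketched roughly as follows. This is a minimal illustration and not the authors' released code. The function names, the small ridge term, the worst-class error metric, and the normal-approximation interval are all assumptions made for the example, and `w_minority` is a free parameter here, not the paper's optimal weight.

```python
import numpy as np


def weighted_square_loss_head(Z, y, w_minority, minority_label=1, reg=1e-6):
    """Fit a linear last layer by weighted least squares.

    Z: (n, d) latent features from a frozen backbone; y: labels in {-1, +1}.
    w_minority is a hypothetical up-weighting factor for the minority class.
    """
    w = np.where(y == minority_label, w_minority, 1.0)
    Zb = np.hstack([Z, np.ones((Z.shape[0], 1))])  # append a bias column
    # Normal equations of sum_i w_i * (y_i - theta^T z_i)^2, plus a tiny ridge
    # term (an assumption) to keep the system well conditioned.
    A = Zb.T @ (w[:, None] * Zb) + reg * np.eye(Zb.shape[1])
    b = Zb.T @ (w * y)
    return np.linalg.solve(A, b)


def worst_class_error(theta, Z, y):
    """Largest per-class error rate (assumed metric; both classes must occur in y)."""
    Zb = np.hstack([Z, np.ones((Z.shape[0], 1))])
    pred = np.where(Zb @ theta >= 0, 1, -1)
    return max(float(np.mean(pred[y == c] != c)) for c in (-1, 1))


def repeated_subsample_eval(Z_val, y_val, Z_test, y_test, n, w_minority,
                            repeats=10, seed=0):
    """Retrain on `repeats` uniform subsamples of size n; return mean and CI half-width."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(repeats):
        idx = rng.choice(len(y_val), size=n, replace=False)  # uniform subsample
        theta = weighted_square_loss_head(Z_val[idx], y_val[idx], w_minority)
        errs.append(worst_class_error(theta, Z_test, y_test))
    errs = np.asarray(errs)
    # Normal-approximation 95% interval over the repeats (an assumption).
    half_width = 1.96 * errs.std(ddof=1) / np.sqrt(repeats)
    return float(errs.mean()), float(half_width)
```

Sweeping `n` while holding the latent dimension fixed changes the ratio between retraining data and model size, which is how the quoted passage describes simulating different levels of overparameterization.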