Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Improved Balanced Classification with Theoretically Grounded Loss Functions

Authors: Corinna Cortes, Mehryar Mohri, Yutao Zhong

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We report the results of experiments demonstrating that, empirically, both the GCA losses with calibrated class-dependent confidence margins and GLA losses can greatly outperform straightforward class-weighted losses as well as the LA losses. GLA generally performs slightly better in common benchmarks, whereas GCA exhibits a slight edge in highly imbalanced settings.
Researcher Affiliation	Industry	Corinna Cortes, Google Research, New York, NY 10011, EMAIL; Mehryar Mohri, Google Research & CIMS, New York, NY 10011, EMAIL; Yutao Zhong, Google Research, New York, NY 10011, EMAIL
Pseudocode	No	The paper describes algorithms and loss functions mathematically but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: See Section 6 and Appendix C.
Open Datasets	Yes	Section 6 reports empirical results on CIFAR-10, CIFAR-100 [Krizhevsky, 2009], and Tiny ImageNet [Le and Yang, 2015] datasets with respectively 10, 100 and 200 classes.
Dataset Splits	Yes	To simulate class imbalance, we reduced the percentage of examples per class identically in both training and test sets, following exactly the protocol in [Menon et al., 2021]. ... For all methods, including our GLA and GCA losses, we tune the hyperparameters using a validation set held out separately from the training set. ... Performance was primarily evaluated using the balanced error on the imbalanced test sets (i.e., the average of the balanced loss over the test data).
Hardware Specification	Yes	Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: Model training was performed using hardware accelerators providing the equivalent computational power of 64 GPUs.
Software Dependencies	No	All models were trained for 200 epochs using Stochastic Gradient Descent (SGD) with Nesterov momentum [Nesterov, 1983]. We used a batch size of 1,024, a weight decay of 1 × 10⁻³, and a cosine decay learning rate schedule [Loshchilov and Hutter, 2016] without restarts, with an initial learning rate of 0.2. The paper mentions optimizers and techniques but does not specify software versions (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup	Yes	Our experimental setup, including training procedures and neural network architectures, strictly followed Menon et al. [2021]. We used a ResNet-32 architecture with ReLU activations [He et al., 2016]. Standard data augmentation techniques were applied: for CIFAR-10 and CIFAR-100, this involved 4-pixel padding followed by 32 × 32 random crops and random horizontal flips; for Tiny ImageNet, 8-pixel padding was used, followed by 64 × 64 random crops. All models were trained for 200 epochs using Stochastic Gradient Descent (SGD) with Nesterov momentum [Nesterov, 1983]. We used a batch size of 1,024, a weight decay of 1 × 10⁻³, and a cosine decay learning rate schedule [Loshchilov and Hutter, 2016] without restarts, with an initial learning rate of 0.2. ... For all methods, including our GLA and GCA losses, we tune the hyperparameters using a validation set held out separately from the training set. ... Further details about the experiments including baselines are provided in Appendix C.
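
Two quantities quoted in the table, the balanced error and the cosine decay learning rate schedule, can be illustrated with a minimal Python sketch. This is not the authors' code. The function names balanced_error and cosine_lr are hypothetical. The sketch assumes the balanced error is the unweighted mean of per-class error rates, and that the schedule has no warmup phase and decays to zero at the final step; the quoted text confirms neither of these last two details.

```python
import math
from collections import defaultdict


def balanced_error(y_true, y_pred):
    """Mean of per-class error rates, so each class counts equally
    regardless of how many test examples it has (assumed definition)."""
    totals = defaultdict(int)
    wrong = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t != p:
            wrong[t] += 1
    return sum(wrong[c] / totals[c] for c in totals) / len(totals)


def cosine_lr(step, total_steps, base_lr=0.2):
    """Cosine decay without restarts: base_lr at step 0, 0 at total_steps.
    No warmup and a final value of zero are assumptions of this sketch."""
    progress = min(step, total_steps) / total_steps
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

For example, with true labels [0, 0, 0, 0, 1] and a classifier that always predicts class 0, the plain error is 0.2 but balanced_error returns 0.5, because the minority class is misclassified every time. This gap is why balanced error is the natural metric on imbalanced test sets.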