Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Bandit Guided Submodular Curriculum for Adaptive Subset Selection

Authors: Prateek Chanda, Prayas Agrawal, Saral Sureka, Lokesh Reddy Polu, Atharv Kshirsagar, Ganesh Ramakrishnan

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirically, ONLINESUBMOD outperforms both traditional curriculum learning and bi-level optimization approaches across vision and language datasets, showing superior accuracy-efficiency tradeoffs. More broadly, we show that validation-driven reward metrics offer a principled way to guide the curriculum schedule. Our code is publicly available at GitHub.
Researcher Affiliation	Academia	Department of Computer Science and Engineering, Indian Institute of Technology Bombay {prateekch, prayas, ssaral EMAIL
Pseudocode	Yes	Algorithm 1: ONLINESUBMOD
Open Source Code	Yes	Our code is publicly available at GitHub: https://github.com/efficiency-learning/banditsubmod/
Open Datasets	Yes	All vision-related experiments are conducted using NVIDIA 3 A6000 GPUs, while large language model (LLM) experiments are performed on 8 H100 GPUs to ensure fair comparisons with all baselines. We share more details in Appendix Section D. 5.1 Finetuning Large Language Models Model-Training-Evaluation Pairs. We evaluate ONLINESUBMOD using combinations of two LLMs: LLAMA-2-7B [49] and MISTRAL-7B [16] finetuned on LESS [55], with performance assessed on MMLU and TYDIQA (Table 1). We use batch size of 16 and use 2 random validation points for computing the reward utility. We select 50% of the batch data for gradient updates during each step. ... We showcase the utility of our method across 5 datasets primarily CIFAR10, CIFAR100 [22], TINYIMAGENET [23], MNIST [24] and SVHN [36].
Dataset Splits	Yes	For the MNIST dataset, we use 60,000 training instances, 10,000 test instances, and 10,000 validation instances, with training proceeding until full convergence, typically around 200 epochs. On CIFAR-10, we use 50,000 training instances, 10,000 test instances, and 10,000 validation instances, with models trained for up to 300 epochs. For CIFAR-100, we similarly use 50,000 training examples spread across 100 classes (500 per class), and a validation set of 10,000 examples (100 per class). The SVHN dataset comprises 73,257 training images across 10 classes with variable class frequencies, and a validation set of 26,032 images distributed proportionally. Finally, for TINYIMAGENET, we use 100,000 training images across 200 classes (500 per class), and a validation set of 10,000 images (50 per class), covering the same label space as the training data.
Hardware Specification	Yes	All vision-related experiments are conducted using NVIDIA 3 A6000 GPUs, while large language model (LLM) experiments are performed on 8 H100 GPUs to ensure fair comparisons with all baselines.
Software Dependencies	Yes	All experiments were conducted using Python 3.10.13 and PyTorch 2.1.2.
Experiment Setup	Yes	We use batch size of 16 and use 2 random validation points for computing the reward utility. We select 50% of the batch data for gradient updates during each step. ... The data module used a batch size of 128, with four workers for data loading. The model architecture employed was ResNet18 [15], and the training followed a curriculum-based mode, progressively utilizing 10%, 30%, and 50% of the training data. The optimizer used was SGD with a learning rate of 0.05, momentum of 0.9, weight decay of 0.0005, and Nesterov momentum enabled.
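
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object. The following is a minimal sketch for readers who want to check the numbers, not the authors' code: the class name `VisionSetup`, the helper `subset_sizes`, and the field names are hypothetical, and the 50,000-example training set size is taken from the CIFAR-10 figure in the Dataset Splits row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VisionSetup:
    """Values quoted in the Experiment Setup row (vision runs).

    The class and field names are illustrative assumptions; only the
    values come from the table.
    """
    batch_size: int = 128
    num_workers: int = 4
    learning_rate: float = 0.05
    momentum: float = 0.9
    weight_decay: float = 0.0005
    nesterov: bool = True
    # Percentages of the training data used by the curriculum, as quoted.
    subset_percents: tuple = (10, 30, 50)


def subset_sizes(num_train: int, percents) -> list:
    """Hypothetical helper: training examples available at each budget."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return [num_train * p // 100 for p in percents]


setup = VisionSetup()

# CIFAR-10 training set size as quoted in the Dataset Splits row.
cifar10_sizes = subset_sizes(50_000, setup.subset_percents)

# LLM runs: batch size 16 with 50% of each batch selected for the update.
llm_selected_per_batch = 16 * 50 // 100

print(cifar10_sizes)
print(llm_selected_per_batch)
```

Under these assumptions, the optimizer fields map directly onto the `lr`, `momentum`, `weight_decay`, and `nesterov` keyword arguments of `torch.optim.SGD`. How the released repository actually wires them up is not stated in this record.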