Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Adaptive Training Distributions with Scalable Online Bilevel Optimization

Authors: David Grangier, Pierre Ablin, Awni Hannun

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments focus on three application domains: language modeling, machine translation and image classification. Before introducing our experimental setup and discussing our results on each domain, we describe the baselines we considered. Table 2 reports the results of our language modeling experiments. Table 4 reports the results of our machine translation experiments. Table 5 reports the results of our image classification experiments. Figure 1: Specific loss as a function of time for the scaling experiment.
Researcher Affiliation Industry David Grangier, Pierre Ablin, Awni Hannun (Apple)
Pseudocode Yes Algorithm 1 Scalable, Online Bilevel Data Selection
Require: D_generic, D_specific, b_small, b_large (training datasets, batch sizes)
θ_0 ← main_model_initializer()
α_0 ← weight_model_initializer()
for t = 1, ..., T do
  (Sample generic and specific batch.)
  B_generic ← sample(D_generic, b_large)
  B_specific ← sample(D_specific, b_small)
  (Sample generic sub-batches.)
  B_filtered ← filter(B_generic, α_{t−1}, b_small)
  B′_generic ← sample(B_generic, b_small)
  (Inner and outer updates.)
  θ_t ← update_main_model(B_filtered, θ_{t−1})
  α_t ← update_weight_model(B′_generic, B_specific, θ_t, α_{t−1})
end for
return θ_T (trained main model)
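To make the control flow of Algorithm 1 concrete, here is a minimal Python sketch of the online bilevel loop. The update rules and the scoring function are toy stand-ins invented for illustration; only the loop structure (large generic batch, small specific batch, filtering by the weight model, inner then outer update) follows the pseudocode.

```python
# Hypothetical sketch of Algorithm 1's loop structure. The "models" are
# scalars and the updates are placeholders, not the paper's actual method.
import random


def filter_top_k(batch, score_fn, k):
    # Keep the k generic examples the weight model scores highest.
    return sorted(batch, key=score_fn, reverse=True)[:k]


def train_bilevel(d_generic, d_specific, b_small, b_large, num_steps):
    theta = 0.0  # stand-in for main-model parameters
    alpha = 1.0  # stand-in for weight-model parameters
    for _ in range(num_steps):
        # Sample a large generic batch and a small specific batch.
        b_generic = random.sample(d_generic, b_large)
        b_specific = random.sample(d_specific, b_small)
        # Inner update: train the main model on the highest-weighted
        # generic examples (B_filtered in the pseudocode).
        b_filtered = filter_top_k(b_generic, lambda x: alpha * x, b_small)
        theta += 0.1 * sum(b_filtered) / b_small
        # Outer update: nudge the weight model using a fresh generic
        # sub-batch and the specific batch (placeholder gradient signal).
        b_prime = random.sample(b_generic, b_small)
        alpha += 0.01 * (sum(b_specific) - sum(b_prime)) / b_small
    return theta
```

The key scalability point the pseudocode encodes is that only a small sub-batch of the large generic batch ever reaches the expensive main-model update.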
Open Source Code Yes Code to reproduce our experiments is at https://github.com/apple/ml-bilevel-train-dist
Open Datasets Yes Our language modeling (LM) experiments rely on two datasets: the C4 dataset (Raffel et al., 2019) is used as the generic set and the RCV1 dataset (Lewis et al., 2004) as the specific set. Our machine translation (MT) experiments learn a translation model from English into German. They rely on two datasets: our generic set is the Paracrawl dataset (Release 1 for WMT 2018) with 36m sentence pairs (Bañón et al., 2020). Our specific set concatenates the WMT newstest sets (2009–2019) with source-original English sentences, which amounts to 10,015 sentence pairs (Akhbardeh et al., 2021). Our vision setup performs contrastive training over images and captions, CLIP (Radford et al., 2021), for generic training and image classification for specific training. As datasets, we rely on yfcc15m (Radford et al., 2021) for generic training (14.9m image/caption pairs) and the ImageNet67 (Eshed, 2020) dataset for the specific task.
Dataset Splits Yes Our specific set concatenates the WMT newstest sets (2009–2019) with source-original English sentences, which amounts to 10,015 sentence pairs (Akhbardeh et al., 2021). We use the 2020 newstest data (1,997 sentences) as our validation set and leave the 2021 newstest data (1,418 sentences) as our test set. ImageNet67... we consider a setup with limited specific data and take 2,010 specific examples, 30 per class, for training. Held-out evaluation is performed with 50 images per class.
Hardware Specification No The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies No Our implementation is derived from the language model of the Flax library (Heek et al., 2020). Flax: A neural network library and ecosystem for jax, 2020. The paper mentions the Flax and Jax libraries with their publication year, but does not provide specific version numbers for these software dependencies (e.g., Flax 0.3.0) which are required for reproducibility.
Experiment Setup Yes Appendix C Hyper-parameters. Table 13: Hyperparameters for the LM experiment —
  batch_size: 128
  dropout_rate: 0.1
  big_batch_size: 16384
  optimizer: adam
  learning_rate: 0.002
  meta_learning_rate: 0.001
  meta_optimizer: adam
  num_steps: 300000
Table 14: Hyperparameters for the translation experiment. Table 15: Hyperparameters for the vision experiment.
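For anyone re-running the LM experiment, the Table 13 values above can be collected into a single configuration object. This is a plain-Python sketch; the key names are our own shorthand, not identifiers from the paper's released code.

```python
# LM hyper-parameters from Table 13 of the paper, as a plain dict.
# Key names are illustrative, not taken from the authors' codebase.
lm_hparams = {
    "batch_size": 128,          # small batch b_small for model updates
    "big_batch_size": 16384,    # large generic batch b_large for selection
    "dropout_rate": 0.1,
    "optimizer": "adam",
    "learning_rate": 0.002,
    "meta_optimizer": "adam",   # outer (weight-model) optimizer
    "meta_learning_rate": 0.001,
    "num_steps": 300_000,
}

# Selection ratio implied by the two batch sizes: each update keeps
# 1 example out of every big_batch_size / batch_size candidates.
selection_ratio = lm_hparams["big_batch_size"] // lm_hparams["batch_size"]
```

Note the 128:1 ratio between the large generic batch and the training batch, which is what makes the data selection aggressive while keeping the per-step model update cheap.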