Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

TANDEM: Bi-Level Data Mixture Optimization with Twin Networks

Authors: Jiaxing Wang, Deping Xiang, Jin Xu, Mingyang Yi, Guoqiang Gong, Zicheng Zhang, Haoran Li, Pengzhang Liu, Zhen Chen, Ke Zhang, Ju Fan, Qixia Jiang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments validate TANDEM's effectiveness in all scenarios.
Researcher Affiliation	Collaboration	¹JD.com, ²University of Oxford, ³Renmin University of China, ⁴University of Chinese Academy of Sciences
Pseudocode	Yes	Algorithm 1: Twin Networks for bi-level DatA mixturE optiMization (TANDEM). Train set $D_{\mathrm{train}}$, validation set $D_{\mathrm{val}}$ comprised of M domains. Episode number T, episode length E, probing length for each episode K, learning rates $\eta_w$, $\eta_u$, $\eta_\alpha$ for w (reference), u (proxy) and α (mixture) respectively.
Open Source Code	No	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: The code will be released upon the compliance review process of the company.
Open Datasets	Yes	Data-Abundant Scenario: For the data-abundant scenario, we train 160M GPT-style LMs [1] on a 6B sampled version of SlimPajama [31] as in [2]. SlimPajama consists of 7 domains: ArXiv, Books, Common Crawl, C4, Github, Stack Exchange, and Wikipedia. The statistics of this sampled corpus are given in Figure 4. ... Supervised Fine-tuning: For supervised fine-tuning, we use 6 major categories (containing 99 tasks) from Natural Instructions [25, 35].
Dataset Splits	No	By splitting the data into training and validation sets, we can construct the validation and training loss $L_{\mathrm{val}}^{m}(w)$ and $L_{\mathrm{train}}^{m}(w)$ on domain $D_m$.
Hardware Specification	Yes	Experiments are conducted on eight NVIDIA Hopper H-100s.
Software Dependencies	No	The paper does not provide specific software dependencies with version numbers.
Experiment Setup	Yes	Data-Abundant Scenario: For the data-abundant scenario, we set E = 20, K = 5, and train with batch size 8 and context length 2048 for 40000 steps (counted in updates of the proxy model u, so the mixture ratio α is updated for 2000 steps), as in [2]. Though the SlimPajama-6B corpus exhibits significant domain imbalance, 40K steps of training doesn't deplete even the smallest domain, so this setting constitutes a data-abundant one-epoch scenario. The penalty constant γ is set to 1 across all the experiments. ... Data-Restricted Scenario: In this scenario, we train with K = E = 5, batch size 128, and context length 512 for 5000 steps... Supervised Fine-tuning: In this scenario, we train a pretrained Qwen2-500M model [40] with K = 10, E = 10, batch size 64, and context length 512 for 5000 steps. ... Table 6: Hyper-parameters of TANDEM for different application scenarios
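
The hyper-parameters quoted in the Experiment Setup row can be collected into a small configuration sketch, which makes the step arithmetic easy to check. This is an illustration, not the authors' code (which is unreleased): the class name TandemSetup and its field names are hypothetical. The relation "one mixture update per E proxy steps" is confirmed by the quoted text only for the data-abundant scenario (40000 / 20 = 2000); applying it to the other two scenarios is an assumption. The token count is an upper bound that assumes every sequence fills the full context length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TandemSetup:
    """Hypothetical container for the hyper-parameters quoted in the table."""
    name: str
    episode_length: int   # E in the quoted text
    probing_length: int   # K in the quoted text
    batch_size: int
    context_length: int
    proxy_steps: int      # training steps, counted in proxy-model updates
    gamma: float = 1.0    # penalty constant, reported as 1 in all experiments

    @property
    def mixture_updates(self) -> int:
        # Assumption: the mixture ratio is updated once per episode of E
        # proxy steps (stated in the source only for the first scenario).
        return self.proxy_steps // self.episode_length

    @property
    def max_tokens_seen(self) -> int:
        # Upper bound: assumes every sequence is exactly context_length long.
        return self.proxy_steps * self.batch_size * self.context_length


SETUPS = [
    TandemSetup("data-abundant", 20, 5, 8, 2048, 40_000),
    TandemSetup("data-restricted", 5, 5, 128, 512, 5_000),
    TandemSetup("supervised fine-tuning", 10, 10, 64, 512, 5_000),
]

if __name__ == "__main__":
    for s in SETUPS:
        print(f"{s.name}: {s.mixture_updates} mixture updates, "
              f"at most {s.max_tokens_seen:,} tokens")
```

Under these assumptions the data-abundant run processes at most about 0.66B tokens, well below the 6B tokens in the sampled corpus. The quoted claim that no domain is depleted additionally depends on the per-domain sizes in Figure 4, which are not reproduced here.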