Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning

Authors: Jaehun Jung, Seungju Han, Ximing Lu, Skyler Hallinan, David Acuna, Shrimai Prabhumoye, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Yejin Choi

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Through large-scale empirical analyses spanning over 300 training runs, carefully controlled for data scale and quality, we show that data diversity can be a strong predictor of generalization in LLM reasoning as measured by average model performance on unseen out-of-distribution benchmarks.
Researcher Affiliation Collaboration (1) NVIDIA Research; (2) University of Washington; (3) University of Southern California
Pseudocode Yes
Algorithm 1: Higher Diversity Sampling
Input: Data representation D ∈ R^(|D| × d), number of clusters k, and target subset size N_target
Output: Indices of selected subset S ⊆ {1, …, |D|}
  S ← ∅    (Initialize the subset.)
  {c_1, …, c_k} = K-means(D)    (Cluster data. c_i is a set of indices corresponding to cluster i.)
  while |S| < N_target do
    c ← random-sample({c_1, …, c_k}, 1)    (Randomly pick a sampling cluster.)
    S_new ← random-sample(c, ⌈N_target / k⌉)    (Sample new samples from the chosen cluster.)
    S ← S ∪ S_new    (Add new samples to the subset.)
  return S    (Return the sampled subset.)
Algorithm 2: Lower Diversity Sampling
Input: Data representation D ∈ R^(|D| × d), seed set size N_seed, batch size N_batch, target subset size N_target, similarity threshold τ
Output: Indices of selected subset S ⊆ {1, …, |D|}
  S ← random-sample({1, …, |D|}, N_seed)    (Initialize the subset with seed data points.)
  while |S| < N_target do
    S_new ← {i ∈ {1, …, |D|} \ S | max_{j ∈ S} cos-sim(D_i, D_j) > τ}    (Find samples that are similar to the current subset members.)
    S_new ← random-sample(S_new, N_batch)
    S ← S ∪ S_new    (Add new samples to the subset.)
  return S    (Return the sampled subset.)
Open Source Code Yes The code is provided as part of the supplementary material, along with anonymized link for the data samples from Prism Math and Prism NLI.
Open Datasets Yes For seed datasets, we use WANLI [28] for NLI, and a mixture of GSM8k [4] and MATH [15] for math reasoning.
Dataset Splits No No explicit train/validation/test splits are provided for the primary datasets (Prism Math, Prism NLI) or the initially generated 1.5M sample data pool used for model training. The paper describes sampling 300 distinct subsets of varying sizes (N = 100k, 50k, 10k) for evaluating diversity measures, but not specific train/validation/test partitions for the models presented as main results. Models are evaluated on external OOD benchmarks.
Hardware Specification Yes In practice, we parallelize the training of distinct models over up to 8 H100 nodes. We used 4 H100 nodes for fine-tuning models on our data.
Software Dependencies No The paper mentions specific models and evaluation tools, such as 'Qwen2.5-72B-Instruct', 'Llama-3.2-1B', 'Deberta-v3-large', 'Hugging Face lighteval', and 'math-verify', but does not provide version numbers for general software dependencies like programming languages, deep learning frameworks (e.g., Python, PyTorch, TensorFlow), or CUDA.
Experiment Setup No The paper does not explicitly provide details on specific training hyperparameters such as learning rate, batch size, number of epochs, or optimizer settings for the models. While evaluation parameters like 'temp = 0.6 and top-p = 0.95' are mentioned for solution generation, the training configuration remains unspecified.
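The two sampling procedures quoted in the Pseudocode row can be illustrated with a short NumPy sketch. This is not the authors' released code: the function names, the simple Lloyd-style k-means helper, the `seed` argument, and three behaviors are assumptions made so the example is self-contained and always terminates (drawing only from not-yet-selected indices, truncating the last draw so exactly `n_target` indices are returned, and stopping early in the lower-diversity sampler when no remaining point exceeds the threshold). Rows of `D` are assumed to be nonzero so that cosine similarity is defined.

```python
import numpy as np


def kmeans_labels(D, k, rng, n_iter=20):
    """Hypothetical helper: plain Lloyd's k-means, returns a label per row."""
    centers = D[rng.choice(len(D), size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((D[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = D[labels == j].mean(axis=0)
    return labels


def higher_diversity_sample(D, k, n_target, seed=0):
    """Sketch of Algorithm 1: repeatedly draw from randomly chosen clusters."""
    D = np.asarray(D, dtype=float)
    if n_target > len(D):
        raise ValueError("n_target cannot exceed the number of data points")
    rng = np.random.default_rng(seed)
    labels = kmeans_labels(D, k, rng)
    clusters = [np.flatnonzero(labels == j) for j in range(k)]
    clusters = [c for c in clusters if len(c) > 0]  # drop empty clusters
    per_draw = -(-n_target // k)  # ceil(n_target / k)
    chosen = np.zeros(len(D), dtype=bool)
    while chosen.sum() < n_target:
        c = clusters[rng.integers(len(clusters))]  # pick a sampling cluster
        remaining = c[~chosen[c]]  # assumption: skip already selected indices
        if len(remaining) == 0:
            continue
        n_new = min(per_draw, len(remaining), n_target - chosen.sum())
        chosen[rng.choice(remaining, size=n_new, replace=False)] = True
    return np.flatnonzero(chosen)


def lower_diversity_sample(D, n_seed, n_batch, n_target, tau, seed=0):
    """Sketch of Algorithm 2: grow the subset with points similar to it."""
    D = np.asarray(D, dtype=float)
    rng = np.random.default_rng(seed)
    unit = D / np.linalg.norm(D, axis=1, keepdims=True)  # rows must be nonzero
    chosen = np.zeros(len(D), dtype=bool)
    chosen[rng.choice(len(D), size=n_seed, replace=False)] = True
    while chosen.sum() < n_target:
        # Highest cosine similarity of every point to any selected point.
        max_sim = (unit @ unit[chosen].T).max(axis=1)
        candidates = np.flatnonzero(~chosen & (max_sim > tau))
        if len(candidates) == 0:
            break  # assumption: stop early if nothing exceeds tau
        n_new = min(n_batch, len(candidates), n_target - chosen.sum())
        chosen[rng.choice(candidates, size=n_new, replace=False)] = True
    return np.flatnonzero(chosen)
```

Both functions return sorted index arrays into `D`. The dense similarity product in `lower_diversity_sample` is only practical for small inputs; at the scale of the data pool mentioned above, an approximate nearest-neighbor index would likely be needed instead.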