Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

WildChat-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training

Authors: Benjamin Feuer, Chinmay Hegde

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct an extensive comparative analysis and demonstrate the potential of this dataset by creating RE-WILD, our own public SFT mix, which outperforms the recent Tulu-3 SFT mixture from Allen AI with only 40% as many samples. Our dataset, samples and code are available at https://github.com/penfever/wildchat-50m.
Researcher Affiliation | Academia | 1Department of Computer Science and Engineering, New York University, New York City, USA. Correspondence to: Benjamin Feuer <EMAIL>.
Pseudocode | No | The paper describes the methodology in narrative text and does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our dataset, samples and code are available at https://github.com/penfever/wildchat-50m.
Open Datasets | Yes | To close this gap and better understand the downstream effects of DGM choice on synthetic data quality, we develop WILDCHAT-50M, which is the largest and most diverse publicly available dataset of chat transcripts to date. We also show that WILDCHAT-50M is a particularly effective source of post-training data for LLMs. Our core contributions in this work are as follows: 1. We introduce WILDCHAT-50M, the largest publicly available dataset of chat transcripts. Our dataset consists of a vast corpus of synthetically generated chat transcripts using 50 different open-weight models.
Dataset Splits | No | The paper specifies training sample sizes for SFT experiments (e.g., "Most of our experiments were conducted on SFT models trained on 250,000 (250k) conversations" and "We ablate the effect of data scaling at 100k, 250k and 500k samples"), but it does not provide explicit training/validation/test splits of the WILDCHAT-50M dataset itself. Evaluation is performed on external benchmarks.
Hardware Specification | Yes | Our data collection process was conducted over a period of approximately two months on a 12x8 H100 shared research cluster. ... Each of our SFT runs utilizes one 4x H100 node.
Software Dependencies | No | All responses and judgments are generated using VLLM (Kwon et al., 2023)... We conduct our SFT experiments using a modified version of the Axolotl framework (Lian, 2025). The paper names software frameworks such as VLLM and Axolotl but does not provide specific version numbers for these or other key software components.
Experiment Setup | Yes | We use the AdamW optimizer (Loshchilov & Hutter, 2017) with a learning rate of 2e-5, a single epoch, and a cosine learning rate scheduler, with eight steps of gradient accumulation, in bf16 precision. We also utilize several techniques to optimize training speed, such as gradient checkpointing, flash attention, and in some cases, FSDP (full shard, auto-wrap). The base model trained is always Llama 3.1 8B (for us and the baselines).
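The hyperparameters quoted above amount to a compact training configuration. A minimal sketch, assuming a plain Python dataclass representation (the field names, micro-batch size, and per-run GPU count of 4 are illustrative; only the optimizer, learning rate, scheduler, accumulation steps, precision, and base model come from the paper):

```python
# Hypothetical sketch of the SFT configuration reported in the paper.
# Values marked "assumed" are NOT stated in the source.
from dataclasses import dataclass


@dataclass
class SFTConfig:
    base_model: str = "meta-llama/Llama-3.1-8B"   # from the paper
    optimizer: str = "adamw"                      # from the paper
    learning_rate: float = 2e-5                   # from the paper
    num_epochs: int = 1                           # from the paper
    lr_scheduler: str = "cosine"                  # from the paper
    gradient_accumulation_steps: int = 8          # from the paper
    precision: str = "bf16"                       # from the paper
    gradient_checkpointing: bool = True           # from the paper
    flash_attention: bool = True                  # from the paper
    fsdp: str = "full_shard auto_wrap"            # "in some cases", per the paper

    def effective_batch_size(self, micro_batch_size: int, num_gpus: int) -> int:
        # Sequences consumed per optimizer step across all devices.
        return micro_batch_size * num_gpus * self.gradient_accumulation_steps


cfg = SFTConfig()
# Assumed micro-batch of 2 per GPU on the 4x H100 node mentioned above:
print(cfg.effective_batch_size(micro_batch_size=2, num_gpus=4))  # 2 * 4 * 8 = 64
```

With an assumed per-device micro-batch of 2 on a 4-GPU node, the eight accumulation steps yield an effective batch of 64 sequences per optimizer update.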