Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
WildChat-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training
Authors: Benjamin Feuer, Chinmay Hegde
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct an extensive comparative analysis and demonstrate the potential of this dataset by creating RE-WILD, our own public SFT mix, which outperforms the recent Tulu-3 SFT mixture from Allen AI with only 40% as many samples. Our dataset, samples and code are available at https://github.com/penfever/wildchat-50m. |
| Researcher Affiliation | Academia | 1Department of Computer Science and Engineering, New York University, New York City, USA. Correspondence to: Benjamin Feuer <EMAIL>. |
| Pseudocode | No | The paper describes the methodology in narrative text and does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our dataset, samples and code are available at https://github.com/penfever/wildchat-50m. |
| Open Datasets | Yes | To close this gap and better understand the downstream effects of DGM choice on synthetic data quality, we develop WILDCHAT-50M, which is the largest and most diverse publicly available dataset of chat transcripts to date. We also show that WILDCHAT-50M is a particularly effective source of post-training data for LLMs. Our core contributions in this work are as follows: 1. We introduce WILDCHAT-50M, the largest publicly available dataset of chat transcripts. Our dataset consists of a vast corpus of chat transcripts synthetically generated using 50 different open-weight models. |
| Dataset Splits | No | The paper specifies training sample sizes for SFT experiments (e.g., "Most of our experiments were conducted on SFT models trained on 250,000 (250k) conversations" and "We ablate the effect of data scaling at 100k, 250k and 500k samples"), but it does not provide explicit training/validation/test dataset splits from the WILDCHAT-50M dataset itself. Evaluation is performed on external benchmarks. |
| Hardware Specification | Yes | Our data collection process was conducted over a period of approximately two months on a 12x8 H100 shared research cluster. ... Each of our SFT runs utilizes one 4x H100 node. |
| Software Dependencies | No | All responses and judgments are generated using VLLM (Kwon et al., 2023)... We conduct our SFT experiments using a modified version of the Axolotl framework (Lian, 2025). The paper mentions software frameworks like VLLM and Axolotl but does not provide specific version numbers for these or other key software components. |
| Experiment Setup | Yes | We use the AdamW optimizer (Loshchilov & Hutter, 2017) with a learning rate of 2e-5, a single epoch, and a cosine learning rate scheduler, with eight steps of gradient accumulation, in bf16 precision. We also utilize several techniques to optimize training speed, such as gradient checkpointing, flash attention, and in some cases, FSDP (full shard, autowrap). The base model trained is always Llama 3.1 8B (for us and baselines). |
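The SFT recipe quoted in the Experiment Setup row can be collected into a single config sketch. This is illustrative only: the hyperparameter values are taken from the paper, but the dictionary keys, the helper function, and the example per-device batch size are assumptions, not the authors' actual (Axolotl-based) configuration.

```python
# Hedged sketch of the reported SFT recipe; NOT the authors' actual Axolotl config.
sft_config = {
    "base_model": "llama-3.1-8b",       # base model for the paper and all baselines
    "optimizer": "adamw",               # AdamW (Loshchilov & Hutter, 2017)
    "learning_rate": 2e-5,
    "num_epochs": 1,
    "lr_scheduler": "cosine",
    "gradient_accumulation_steps": 8,
    "precision": "bf16",
    "gradient_checkpointing": True,
    "flash_attention": True,
    "fsdp": "full_shard_autowrap",      # used only "in some cases" per the paper
}

def effective_batch_size(per_device_batch: int, num_gpus: int, grad_accum_steps: int) -> int:
    """Effective global batch size = per-device batch x GPUs x accumulation steps."""
    return per_device_batch * num_gpus * grad_accum_steps

# Example: a hypothetical per-device batch of 2 on the paper's 4x H100 SFT node.
print(effective_batch_size(2, 4, sft_config["gradient_accumulation_steps"]))  # -> 64
```

The helper illustrates why gradient accumulation matters here: with eight accumulation steps, even a modest per-device batch yields a sizeable effective batch on a single 4x H100 node.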