Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

TreeSynth: Synthesizing Diverse Data from Scratch via Tree-Guided Subspace Partitioning

Authors: Sheng Wang, Pengan CHEN, Jingqi Zhou, Qintong Li, Jingwei Dong, Jiahui Gao, Boyang XUE, Jiyue Jiang, Lingpeng Kong, Chuan Wu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirically, extensive experiments across diverse benchmarks consistently demonstrate the superior data diversity, model performance, and robust scalability of TREESYNTH compared to both human-crafted datasets and peer data synthesis methods, with an average performance gain reaching 10%. ... Extensive experiments with both open-source and closed-source models across diverse benchmarks, spanning mathematical reasoning, code generation and psychology, demonstrate that TREESYNTH consistently achieves the best downstream performance with superior data diversity compared to both human-crafted datasets and peer data synthesis methods, with the average performance enhancement reaching 10%, underscoring its great effectiveness and generalization. ... 4 Experiments
Researcher Affiliation	Academia	Sheng Wang, Pengan Chen, Jingqi Zhou, Qintong Li, Jingwei Dong (The University of Hong Kong, EMAIL); Jiahui Gao (The University of Hong Kong, EMAIL); Boyang Xue, Jiyue Jiang (The Chinese University of Hong Kong, EMAIL, EMAIL); Lingpeng Kong, Chuan Wu (The University of Hong Kong, EMAIL)
Pseudocode	Yes	We also present the pseudo code to formularize the whole process in Algorithm 1. ... A.1 Pseudo Code of TREESYNTH The pseudo code of TREESYNTH is presented in Algorithm 1.
Open Source Code	Yes	The code is available at https://github.com/cpa2001/TreeSynth. ... We submitted our code to demonstrate the reproducibility of our method.
Open Datasets	Yes	For data synthesis, we first apply standard mathematical reasoning and code generation tasks, including GSM8K [20], MATH [21], MBPP [22] and HumanEval [23], to assess TREESYNTH's data diversity, model performance improvement, and scalability. Besides, we employ SimpleToM [24], a psychological task, to further examine TREESYNTH's effectiveness in promoting data balance.
Dataset Splits	Yes	GSM8K evaluates mathematical reasoning capabilities through 8,500 high-quality grade school math problems developed via human expert annotation. The dataset is partitioned into 7,500 training and 1,000 test problems... MATH dataset comprises 12,500 competition-level mathematics problems, with 7,500 designated for training and 5,000 for testing... Vanilla Data denotes the original GSM8K and MATH training sets, and the Code Alpaca Python subset for HumanEval and MBPP.
Hardware Specification	Yes	A.8 Experiments Compute Resources All experiments are executed on high-performance computing node equipped with eight NVIDIA H100 SXM GPUs (80 GB HBM3 each), dual-socket 128-core CPUs, and 2 TB of system RAM.
Software Dependencies	Yes	The software stack comprised PyTorch 2.6.0 linked against CUDA 12.1 (NCCL 2.17.1).
Experiment Setup	Yes	A.4 Experiments Details ... Model Training. To fine-tune our selected base models (i.e., LLaMA3.1-8B and Qwen2.5-7B), we employ the parameter-efficient fine-tuning method LoRA [61–63]. Specifically, we uniformly set the lora_dropout = 0, weight_decay = 0.1, and trained each model for 5 epochs. For GSM8K-style data, we set the learning rate to 1 × 10^-4. For MATH, Code Alpaca, and SimpleToM-style data, we set the learning rate to 1 × 10^-5 during training (footnote 7). Empirical observations show that these configurations consistently achieves stable and competitive downstream performance across various tasks. Footnote 7: These hyperparameters are selected based on a comprehensive grid search over candidate values: learning rate ∈ {1 × 10^-6, 5 × 10^-6, 1 × 10^-5, 5 × 10^-5, 1 × 10^-4, 5 × 10^-4}, lora_dropout ∈ {0, 0.05}, weight_decay ∈ {0, 0.1}, and epoch count ∈ {3, 5, 7, 10}.
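
The hyperparameter grid quoted in the Experiment Setup row can be written out explicitly to see how large the search space is and where the reported settings sit in it. The following is a minimal sketch, not the authors' code: the candidate values and the selected settings are copied from the quoted text, while the names GRID, SELECTED_LR, enumerate_grid, and selected_config, as well as the data-style keys, are hypothetical labels introduced here for illustration. It only enumerates configurations and does not train anything.

```python
from itertools import product

# Candidate values as quoted in the Experiment Setup row (footnote 7).
GRID = {
    "learning_rate": [1e-6, 5e-6, 1e-5, 5e-5, 1e-4, 5e-4],
    "lora_dropout": [0.0, 0.05],
    "weight_decay": [0.0, 0.1],
    "num_epochs": [3, 5, 7, 10],
}

# Learning rates the quoted text reports as selected per data style.
# The dictionary keys are hypothetical labels, not names from the paper.
SELECTED_LR = {
    "gsm8k": 1e-4,
    "math": 1e-5,
    "code_alpaca": 1e-5,
    "simpletom": 1e-5,
}


def enumerate_grid(grid):
    """Return every combination of the candidate values as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


def selected_config(data_style):
    """Build the configuration the quoted text reports for a given data style."""
    return {
        "learning_rate": SELECTED_LR[data_style],
        "lora_dropout": 0.0,   # reported as set uniformly
        "weight_decay": 0.1,   # reported as set uniformly
        "num_epochs": 5,       # reported as set uniformly
    }


configs = enumerate_grid(GRID)
# Grid size is the product of the candidate-list lengths: 6 * 2 * 2 * 4.
print(len(configs))
# Each reported configuration should be one point of the quoted grid.
print(all(selected_config(style) in configs for style in SELECTED_LR))
```

Under these assumptions, the four candidate lists multiply out to 96 configurations, and each reported per-task setting is one point in that grid. Whether the search was run separately per base model or per dataset is not stated in the quoted excerpt.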