Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale
Authors: Fan Zhou, Zengzhi Wang, Qian Liu, Junlong Li, Pengfei Liu
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that models trained on PROX-refined data consistently outperform other baselines across 10 benchmarks, demonstrating effectiveness across model sizes (up to 1.7B) and pre-training corpora (C4, RedPajama-V2, FineWeb, FineWeb-Edu, and DCLM). |
| Researcher Affiliation | Collaboration | (1) Shanghai Jiao Tong University; (2) Generative AI Research Lab (GAIR); (3) Sea AI Lab; (4) Shanghai Artificial Intelligence Laboratory. Correspondence to: Pengfei Liu <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Document Chunk Splitting Algorithm |
| Open Source Code | No | The paper mentions using third-party open-source codebases such as LitGPT, TinyLlama, llama-factory, and vllm. However, it does not provide any explicit statement or link for the authors' own implementation of the PROX methodology described in this paper. |
| Open Datasets | Yes | For the general domain, we begin with RedPajama-V2 (Together, 2023), a preprocessed large-scale dataset... We further apply PROX on the C4 corpus (Raffel et al., 2020)... and the recent high quality datasets including FineWeb (as well as FineWeb-Edu) (Penedo et al., 2024a) and DCLM (Li et al., 2024). For specific domain experiments, we use OpenWebMath (Paster et al., 2024)... |
| Dataset Splits | Yes | Finally, we use LLAMA-3-70B-INSTRUCT to annotate 51K data, splitting 5K for validation. |
| Hardware Specification | Yes | Such 2-stage synthesis requires approximately 192 A100 GPU hours for processing 60B tokens of data. |
| Software Dependencies | No | The paper mentions using LitGPT (AI, 2023), TinyLlama (Zhang et al., 2024b), FlashAttention (Dao, 2024), llama-factory (Zheng et al., 2024), and vllm (Kwon et al., 2023) but does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | We apply full parameter supervised fine-tuning on our base models: we train on the whole seed dataset for 3 to 5 epochs, with batch size as 64, and cosine learning rate scheduler (lr from 1e-5 to 1e-6)... Table 10: Training hyper-parameters of all base models. |
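
The Experiment Setup row quotes a cosine learning rate schedule running from 1e-5 to 1e-6. The sketch below shows one common form of such a schedule, for illustration only. The function name `cosine_lr`, the absence of warmup, and the step counts are assumptions; the quoted excerpt does not specify them, and this is not the authors' implementation.

```python
import math


def cosine_lr(step: int, total_steps: int,
              lr_max: float = 1e-5, lr_min: float = 1e-6) -> float:
    """Cosine decay from lr_max (first step) to lr_min (last step).

    Hypothetical helper: only the two endpoint values come from the
    quoted setup; no warmup phase is modeled.
    """
    if total_steps <= 1:
        return lr_min
    # Clamp the step into range, then map it to a progress value in [0, 1].
    clamped = min(max(step, 0), total_steps - 1)
    progress = clamped / (total_steps - 1)
    # Half-cosine interpolation between the two endpoints.
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # assumed number of optimizer steps, for illustration
    for s in (0, total // 2, total - 1):
        print(s, cosine_lr(s, total))
```

The schedule starts at the larger rate, ends at the smaller one, and decreases monotonically in between; a real training run would pass the actual number of optimizer steps as `total_steps`.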