Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Token Cleaning: Fine-Grained Data Selection for LLM Supervised Fine-Tuning
Authors: Jinlong Pang, Na Di, Zhaowei Zhu, Jiaheng Wei, Hao Cheng, Chen Qian, Yang Liu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that our framework consistently improves downstream performance. Code is available at https://github.com/UCSC-REAL/TokenCleaning. [...] Comprehensive Experiments. We conduct extensive experiments across multiple tasks, demonstrating that our token cleaning pipeline consistently boosts performance over baselines and validates its practical merits. |
| Researcher Affiliation | Collaboration | ¹University of California, Santa Cruz; ²Northeastern University; ³Docta.ai; ⁴Hong Kong University of Science and Technology (Guangzhou); ⁵Hong Kong Baptist University. |
| Pseudocode | Yes | Algorithm 1 Token Cleaning Pipeline |
| Open Source Code | Yes | Code is available at https://github.com/UCSC-REAL/TokenCleaning. |
| Open Datasets | Yes | Data Pool. We utilize a high-quality data pool with 50k sample size from five popular SFT datasets (300k in total): Flan v2 (Longpre et al., 2023), Open Assistant 1 (Köpf et al., 2024), Stanford Alpaca (Taori et al., 2023), Dolly (Databricks, 2023), and WizardLM (Xu et al., 2023). |
| Dataset Splits | Yes | For the self-evolving cleaning strategy, we heuristically divide the data pool into five equally sized subsets (10k samples). [...] Algorithm 1 Token Cleaning Pipeline: 2: Split dataset $\widetilde{D}$ into a series of subset $\{\widetilde{D}_0, \ldots, \widetilde{D}_T\}$. |
| Hardware Specification | Yes | All experiments are conducted on eight NVIDIA L40S GPUs. |
| Software Dependencies | No | The paper mentions applying the LoRA technique and using the lm-eval-harness repository, but does not provide specific version numbers for any software dependencies like Python, PyTorch, or other libraries. |
| Experiment Setup | Yes | Following the experimental setup (Wang et al., 2023), we apply the LoRA technique (Hu et al., 2022) with a rank-size of 64 and a scaling factor of 16. The overall batch size is 48, with the learning rate at 1e-4 as well as 1 training epoch. By default, the maximum input length is 2048. |
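
The values quoted in the Dataset Splits and Experiment Setup rows can be collected into a small configuration sketch. This is an illustration only, not the authors' code: the names `FineTuneConfig` and `split_pool` are hypothetical, mapping the quoted "scaling factor" to a field called `lora_alpha` is an assumption, and the contiguous split is one possible reading of "heuristically divide", since the excerpt does not say how samples are assigned to subsets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters as quoted in the Experiment Setup row (hypothetical container)."""
    lora_rank: int = 64            # "rank-size of 64"
    lora_alpha: int = 16           # "scaling factor of 16" (field name is an assumption)
    batch_size: int = 48           # overall batch size
    learning_rate: float = 1e-4
    num_epochs: int = 1
    max_input_length: int = 2048


def split_pool(pool, num_subsets=5):
    """Split a list into `num_subsets` contiguous, equally sized subsets.

    Contiguous slicing is an assumption; the source only states that the
    pool is divided into five equally sized subsets.
    """
    if len(pool) % num_subsets != 0:
        raise ValueError("pool size must be divisible by num_subsets")
    size = len(pool) // num_subsets
    return [pool[i * size:(i + 1) * size] for i in range(num_subsets)]


config = FineTuneConfig()
pool = list(range(50_000))         # placeholder indices standing in for the 50k samples
subsets = split_pool(pool)
print(len(subsets), len(subsets[0]))
```

With the 50k-sample pool from the Open Datasets row, this yields five subsets of 10k samples each, matching the figure quoted in the Dataset Splits row.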