Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Rethinking Verification for LLM Code Generation: From Generation to Testing
Authors: Zihan Ma, Taolin Zhang, Maosong Cao, Junnan Liu, Wenwei Zhang, Minnan Luo, Songyang Zhang, Kai Chen
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that SAGA achieves a detection rate of 90.62% and a verifier accuracy of 32.58% on TCGBench. The Verifier Accuracy (Verifier Acc) of the code generation evaluation benchmark synthesized by SAGA is 10.78% higher than that of LiveCodeBench-v6. These results demonstrate the effectiveness of our proposed method. |
| Researcher Affiliation | Collaboration | (1) Shanghai AI Laboratory; (2) School of Computer Science and Technology, Xi'an Jiaotong University, China; (3) MOE KLINNS Lab, Xi'an Jiaotong University, China. {mazihan880}@stu.xjtu.edu.cn EMAIL {minnluo}@xjtu.edu.cn |
| Pseudocode | Yes | def gen_TC2(N=10): case = [] for i in range(1, N): case.append(f"{i} {i+1}") return f"{N}\n" + "\n".join(case) def validate_inputs(input_str): assert len(lines) >= 1, 'Missing N' assert 2 <= N <= 2 * 10**5, ' return True except AssertionError as e: return False, str(e) |
| Open Source Code | Yes | The data demo and prompts can be accessed via https://github.com/open-compass/SAGA |
| Open Datasets | Yes | We construct TCGBench, a comprehensive dataset from competitive programming platforms, to analyze existing Test Case Generation practices. TCGBench, a dataset we curated comprising 1840 recent programming problems from AtCoder, Codeforces, and Nowcoder, along with an average of 36.66 incorrect user submissions per problem. This rich resource (detailed in Appendix D) facilitates rigorous evaluation of TCG methodologies. |
| Dataset Splits | Yes | For focused main comparisons and ablation studies, we curated TCGBench-Lite, a challenging subset of 270 problems from AtCoder, Codeforces, and Nowcoder contests since June 2024. This ensures contemporary relevance and minimizes potential data leakage. To further guarantee the impartiality of our results, particularly for the evaluation of our SAGA-distilled model TCGCoder-7B, we employed a strict chronological split: the training data for TCGCoder-7B was sourced entirely from problems published before 2023. This temporal separation ensures the model is evaluated on genuinely unseen problems. TCGBench-Lite includes an average of 41.41 incorrect submissions (S_wrong) per problem. Its difficulty distribution (Easy: 27.04%, Medium: 32.59%, Hard: 40.37%) was determined by platform tags and contest characteristics (details in Appendix E). |
| Hardware Specification | Yes | Training utilized Fully Sharded Data Parallel (FSDP) across 2 nodes, each with 8 GPUs. |
| Software Dependencies | Yes | TCGCoder-7B, our specialist 7-billion-parameter model for Test Case Generation (TCG), was finetuned from Qwen2.5-Coder-7B-Instruct [46, 18]. ... Key training configurations included ... and the qwen2 chat template. Training utilized Fully Sharded Data Parallel (FSDP) across 2 nodes, each with 8 GPUs. |
| Experiment Setup | Yes | Our analysis employs open-source LLMs: DeepSeek-V3-0324, Qwen2.5-72B-Instruct, and Qwen2.5-Coder-32B-Instruct with the greedy decoding strategy. Key training configurations included 3 epochs, a global batch size of 16, an initial learning rate of 5e-6 (minimum 3e-7), a max sequence length of 61,335 tokens, and the qwen2 chat template. |
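
The excerpt in the Pseudocode row is flattened and truncated, so it cannot run as extracted. The following is a minimal sketch of what a runnable generator and validator pair along those lines might look like. It is not the authors' code: the names `gen_tc2` and `validate_inputs` follow the excerpt, but the input parsing, the "N out of range" message, the line-count check, and the handling of `ValueError` are assumptions added to make the example self-contained.

```python
def gen_tc2(n: int = 10) -> str:
    """Build a path-shaped test input: N on the first line, then N-1 edges "i i+1"."""
    edges = [f"{i} {i + 1}" for i in range(1, n)]
    return f"{n}\n" + "\n".join(edges)


def validate_inputs(input_str: str):
    """Return True for a valid input, otherwise (False, reason)."""
    try:
        lines = input_str.strip().splitlines()
        assert len(lines) >= 1, "Missing N"
        n = int(lines[0])  # raises ValueError if N is not an integer
        # Bound taken from the excerpt; the message text is an assumption.
        assert 2 <= n <= 2 * 10**5, "N out of range"
        # Assumed check: one header line plus N-1 edge lines.
        assert len(lines) == n, "Expected N-1 edge lines"
        return True
    except (AssertionError, ValueError) as e:
        # The excerpt only shows AssertionError; ValueError is caught here
        # as an assumption so that malformed integers are also rejected.
        return False, str(e)
```

For instance, `validate_inputs(gen_tc2(5))` accepts the generated input, while `validate_inputs("")` is rejected with the reason "Missing N".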