Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

QiMeng-MuPa: Mutual-Supervised Learning for Sequential-to-Parallel Code Translation

Authors: Changxin Ke, Rui Zhang, Shuo Wang, Li Ding, Guangli Li, Yuanbo Wen, Shuoming Zhang, Ruiyuan Xu, Jin Qin, Jiaming Guo, Chenxi Wang, Ling Li, Qi Guo, Yunji Chen

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimental results demonstrate that QiMeng-MuPa significantly enhances the performance of the base models: when applied to Qwen2.5-Coder, it not only improves Pass@1 by up to 28.91% and boosts Tester performance by 68.90%, but also outperforms the previous state-of-the-art method CodeRosetta by 1.56 and 6.92 in BLEU and CodeBLEU scores, while achieving performance comparable to DeepSeek-R1 and GPT-4.1. Our code is available at https://github.com/kcxain/mupa.
Researcher Affiliation	Academia	1 State Key Lab of Processors, Institute of Computing Technology, CAS 2 University of Chinese Academy of Sciences 3 Institute of Microelectronics, CAS 4 Intelligent Software Research Center, Institute of Software, CAS
Pseudocode	No	The paper describes the methodology with two steps, Co-verify and Co-evolve, and provides an overview diagram in Figure 1, but it does not contain a formal pseudocode block or algorithm labeled as such.
Open Source Code	Yes	Our code is available at https://github.com/kcxain/mupa.
Open Datasets	Yes	Unpaired training set. We filter the unaligned training set in BabelTower [Wen et al., 2022] which contains 501,732 C functions and 129,497 CUDA kernel functions. ... Paired test sets. The validation set and test set in BabelTower [Wen et al., 2022] consist of 364 pairs of C and CUDA functions.
Dataset Splits	Yes	We filter the unaligned training set in BabelTower [Wen et al., 2022] which contains 501,732 C functions and 129,497 CUDA kernel functions. Considering these functions may cannot be executed due to calls to third-party libraries or user-defined functions, we filtered these functions through compilation, obtaining 14,687 valid C functions and 28,756 valid CUDA kernel functions as our training set. The validation set and test set in BabelTower [Wen et al., 2022] consist of 364 pairs of C and CUDA functions. We use GPT-4 [OpenAI et al., 2024] to generate unit tests for each pair, and ultimately filter out 233 pairs after compilation, each with 5 unit tests.
Hardware Specification	Yes	All executions in our experiments (e.g., Co-verify and evaluation) are conducted on a CPU (Intel i9-14900KF) and a GPU (RTX 4090 with 128 SMs).
Software Dependencies	Yes	We use g++ 9.3.0 to compile C programs and nvcc 12.1 for CUDA programs.
Experiment Setup	Yes	For Fine-tuning, we adopt the same hyper-parameters as existing supervised fine-tuning (SFT) methods [Zheng et al., 2024] for most models: learning rate of 1.0 × 10^-5, cosine learning rate scheduler, warmup ratio of 0.1, and batch size of 32 in total. Additionally, we use the same chat template as in the instruct fine-tuning phase of Llama3 or Qwen2.5.
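
The Experiment Setup row names a cosine learning rate scheduler with a warmup ratio of 0.1 and a peak learning rate of 1.0 × 10^-5. The following Python sketch illustrates how such a schedule is commonly computed; it is not taken from the paper or its repository. The function name `lr_at_step`, the linear warmup shape, the decay floor of zero, and the total step count are all assumptions made for illustration.

```python
import math


def lr_at_step(step, total_steps, peak_lr=1.0e-5, warmup_ratio=0.1):
    """Learning rate at a 0-indexed optimizer step (hypothetical helper).

    peak_lr and warmup_ratio default to the values quoted in the table;
    everything else about the schedule shape is an assumption.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Assumed: linear warmup from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Assumed: cosine decay from peak_lr down to 0 over the remaining steps.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # hypothetical number of optimizer steps
    for s in (0, 50, 100, 550, 1000):
        print(s, lr_at_step(s, total))
```

With a hypothetical 1000 steps, the rate climbs during the first 100 steps, reaches the peak at step 100, and then decays along a half cosine. The actual step count depends on dataset size, epochs, and the total batch size of 32, none of which are fixed by this sketch.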