Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

QiMeng-NeuComBack: Self-Evolving Translation from IR to Assembly Code

Authors: Hainan Fang, Yuanbo Wen, Jun Bi, Yihan Wang, Tonghui He, Yanlin Tang, Di Huang, Jiaming Guo, Rui Zhang, Qi Guo, Yunji Chen

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We first define a foundational Neural Compilation workflow and conduct a comprehensive evaluation of the capabilities of recent frontier LLMs on Neural Compilation, establishing new performance baselines. We further propose a self-evolving prompt optimization method that enables LLMs to iteratively evolve their internal prompt strategies by extracting insights from prior self-debugging traces, thereby enhancing their neural compilation capabilities. Experiments demonstrate that our method significantly improves both the functional correctness and the performance of LLM-generated assembly code.
Researcher Affiliation	Academia	(1) State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; (2) University of Chinese Academy of Sciences, Beijing, China
Pseudocode	No	The paper describes methods and processes (e.g., Section 4: Our Neural Compilation framework leverages a novel automatic prompt-learning mechanism...) and shows a pipeline diagram (Figure 1), but it does not include a formally structured pseudocode or algorithm block with numbered steps typically associated with an 'Algorithm' label.
Open Source Code	No	We plan to release our dataset and code upon acceptance; currently no public repository is linked. We will include a link in the final version.
Open Datasets	Yes	To this end, in this paper, we first introduce NeuComBack, a novel benchmark dataset specifically designed for evaluating IR-to-assembly compilation. Derived from ExeBench (Armengol-Estapé et al., 2022) and TSVC (Maleki et al., 2011), NeuComBack provides a diverse set of programs to systematically assess fundamental compilation and optimization capabilities.
Dataset Splits	Yes	The NeuComBack-L1 dataset was divided into 120 training samples, 40 for validation, and 40 for testing, while NeuComBack-L2 comprised 101 training samples, 25 for validation, and 25 for testing.
Hardware Specification	No	The paper specifies the Large Language Models used (e.g., DeepSeek-R1, GPT-4o). It is clarified that model inference for these LLMs was performed via API calls. Consequently, the specific underlying compute hardware (e.g., GPU type, memory on the provider’s side) for the inference step is managed by the API providers.
Software Dependencies	No	The paper mentions using LLMs such as 'GPT-4o (OpenAI, 2024b)', 'O3-Mini (OpenAI, 2025)', 'O1 (OpenAI, 2024a)', 'DeepSeek-V3 (Liu et al., 2024)', and 'DeepSeek-R1 (Guo et al., 2025)'. It also mentions 'clang' for compilation. However, specific version numbers for 'clang' or other ancillary software components (e.g., programming languages, libraries, operating systems used for running experiments) are not provided.
Experiment Setup	Yes	For both datasets, we conducted prompt learning over three epochs with a batch size of 5, additionally introducing 1 self-debugging round per generation for NeuComBack-L1 and 2 rounds for NeuComBack-L2. From this process, we selected the highest-performing prompt based on validation metrics (referred to as the "Learned Prompt"), which we compare against the baseline prompt detailed in Appendix B.
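
The split sizes and prompt-learning settings reported in the table can be collected into a small configuration sketch, which makes the implied totals easy to check. This is only an illustration, not the authors' code: the names `SplitConfig`, `CONFIGS`, `batches_per_epoch`, and `summarize` are hypothetical, and the handling of a final partial batch (kept, so the count is rounded up) is an assumption the source does not state.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitConfig:
    """Reported split sizes and self-debugging rounds for one dataset level."""
    train: int
    val: int
    test: int
    debug_rounds: int

    @property
    def total(self) -> int:
        return self.train + self.val + self.test


# Values taken from the "Dataset Splits" and "Experiment Setup" rows.
EPOCHS = 3
BATCH_SIZE = 5

CONFIGS = {
    "NeuComBack-L1": SplitConfig(train=120, val=40, test=40, debug_rounds=1),
    "NeuComBack-L2": SplitConfig(train=101, val=25, test=25, debug_rounds=2),
}


def batches_per_epoch(cfg: SplitConfig, batch_size: int = BATCH_SIZE) -> int:
    # Assumption: a final partial batch is kept rather than dropped.
    return math.ceil(cfg.train / batch_size)


def summarize(name: str) -> dict:
    """Derive totals implied by the reported numbers for one dataset level."""
    cfg = CONFIGS[name]
    per_epoch = batches_per_epoch(cfg)
    return {
        "total_samples": cfg.total,
        "batches_per_epoch": per_epoch,
        "total_batches": per_epoch * EPOCHS,
    }


if __name__ == "__main__":
    for name in CONFIGS:
        print(name, summarize(name))
```

Under these assumptions, NeuComBack-L1 has 200 samples in total and NeuComBack-L2 has 151; how many prompt updates actually occur per batch is not specified in the table, so the sketch counts batches only.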