Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

REASONING COMPILER: LLM-Guided Optimizations for Efficient Model Serving

Authors: Annabelle Sujun Tang, Christopher Priebe, Rohan Mahapatra, Lianhui Qin, Hadi Esmaeilzadeh

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate the REASONING COMPILER and compare its improvements and sample efficiency with TVM, which employs evolutionary search. Results show that the REASONING COMPILER consistently achieves significantly higher speedups than what TVM achieves using significantly fewer samples. On five representative benchmarks (Llama-3-8B Attention Layer, DeepSeek-R1 MoE Layer, FLUX Attention Layer, FLUX Convolution Layer, and Llama-4-Scout MLP Layer) and across five hardware platforms (Amazon Graviton2, AMD EPYC 7R13, Apple M2 Pro, Intel Core i9, and Intel Xeon E3), the REASONING COMPILER achieves 5.0× average speedup using 5.8× fewer samples, resulting in an average of 10.8× improvement over TVM in sample efficiency.
Researcher Affiliation	Academia	Annabelle Sujun Tang University of California San Diego EMAIL Christopher Priebe University of California San Diego EMAIL Rohan Mahapatra University of California San Diego EMAIL Lianhui Qin University of California San Diego EMAIL Hadi Esmaeilzadeh University of California San Diego EMAIL
Pseudocode	No	The paper includes figures (Figure 1: Overview of the optimization workflow; Figure 2: Structured tree search) that illustrate the process. It also describes prompt construction and the MCTS steps. However, it does not contain a dedicated pseudocode block or algorithm section with structured, code-like steps (e.g., using keywords like 'for', 'if', 'return') typically associated with pseudocode.
Open Source Code	Yes	Code is available at https://github.com/he-actlab/REASONING_COMPILER
Open Datasets	Yes	We evaluate the REASONING COMPILER on five representative computational kernels drawn from production-scale models: (1) a self-attention layer from Llama-3-8B [17], (2) a mixture-of-experts (MoE) layer from DeepSeek-R1 [18], (3) a self-attention layer from FLUX (stable diffusion) [19], (4) a convolution layer from FLUX [19], and (5) an MLP layer from Llama-4-Scout [20]. In addition, we perform an end-to-end evaluation of Llama-3-8B.
Dataset Splits	No	The paper describes experiments in compiler optimization, where the 'samples' refer to evaluated transformation proposals or configurations explored in a search space, not to data points in a dataset that is split into training, validation, and test sets. Therefore, the concept of dataset splits in the typical machine learning sense is not applicable here.
Hardware Specification	Yes	Our experimental environment is a dedicated Intel Core i9 workstation under a fixed software and hardware stack to isolate scheduling effects. This environment covers all five kernels above and is the ablation environment. To show portability and scalability across consumer and datacenter processors, we evaluate each of the five kernels on five hardware platforms: Amazon Graviton2, AMD EPYC 7R13, Apple M2 Pro, Intel Core i9, and Intel Xeon E3.
Software Dependencies	Yes	All experiments are conducted using Apache TVM v0.20.0 [10, 24].
Experiment Setup	Yes	Compiler optimization is framed as a sequential decision process and guided by MCTS [9] using the Upper Confidence bounds applied to Trees (UCT) criterion [12] with exploration parameter c = 2 and branching factor B = 2 following prior work [21, 22]. During search, the LLM (OpenAI GPT-4o mini [23]) is queried using hierarchical context (specifically, the parent and grandparent schedules and their transformations) to enable informed proposal generation.
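
The UCT criterion named in the Experiment Setup row can be illustrated with a short sketch. This is the textbook UCT formula written in Python, not the authors' implementation: the function names `uct_score` and `select_child`, the dictionary fields, and the use of mean reward as the exploitation term are all assumptions made for illustration. The default `c = 2.0` mirrors the exploration parameter quoted in the row, and the two-element child list mirrors the branching factor B = 2.

```python
import math


def uct_score(value_sum, visits, parent_visits, c=2.0):
    """Textbook UCT score: mean reward plus an exploration bonus.

    value_sum / visits is the average reward seen through this child
    (assumed here; the paper's reward definition may differ).
    """
    if visits == 0:
        # Unvisited children are tried first.
        return math.inf
    exploit = value_sum / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children, parent_visits, c=2.0):
    """Return the child dict with the highest UCT score."""
    return max(
        children,
        key=lambda ch: uct_score(ch["value_sum"], ch["visits"], parent_visits, c),
    )


# Hypothetical node with B = 2 children (illustrative numbers only).
children = [
    {"name": "a", "value_sum": 3.0, "visits": 3},
    {"name": "b", "value_sum": 0.5, "visits": 1},
]
chosen = select_child(children, parent_visits=4)
print(chosen["name"])
```

With these numbers the less-visited child `b` wins under `c = 2.0` because its exploration bonus outweighs its lower mean reward; setting `c = 0` makes the selection purely greedy and picks `a`.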