Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

How efficient is LLM-generated code? A rigorous & high-standard benchmark

Authors: Ruizhong Qiu, Weiliang Zeng, James Ezick, Christopher Lott, Hanghang Tong

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | An extensive study across 30 popular LLMs using our benchmark ENAMEL shows that LLMs still fall short of generating expert-level efficient code. Using two subsets of our problem set, we demonstrate that such deficiency is because current LLMs struggle in designing advanced algorithms and are barely aware of implementation optimization.
Researcher Affiliation | Collaboration | University of Illinois Urbana-Champaign; Qualcomm AI Research; EMAIL EMAIL
Pseudocode | Yes | Algorithm 1: Numerically stable eff@k
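The paper's Algorithm 1 concerns computing the eff@k estimator in a numerically stable way. The sketch below shows one standard approach to such a max-of-k estimator (the expected best efficiency score among k samples drawn from n, analogous to the unbiased pass@k estimator): the binomial weight C(i-1, k-1)/C(n, k) is maintained as a running ratio rather than formed from large factorials. This is an illustrative sketch, not the authors' exact Algorithm 1; the function name and variable names are my own.

```python
def eff_at_k(scores, k):
    """Estimate E[max efficiency score among k of the n samples].

    scores: per-sample efficiency scores (length n >= k).
    The i-th smallest score is the maximum of a uniformly random
    k-subset with probability C(i-1, k-1) / C(n, k); we accumulate
    that weight incrementally to stay numerically stable.
    """
    n = len(scores)
    assert 1 <= k <= n
    e = sorted(scores)  # ascending; e[i-1] is the i-th smallest
    ratio = k / n       # C(n-1, k-1) / C(n, k)
    total = 0.0
    for i in range(n, k - 1, -1):  # i = n, n-1, ..., k
        total += ratio * e[i - 1]
        if i > k:
            # step the weight down: C(i-2, k-1)/C(n, k)
            ratio *= (i - k) / (i - 1)
    return total
```

With k = n this returns the maximum score, and with k = 1 the mean, which is a quick sanity check on the weights summing to 1.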
Open Source Code | Yes | Our benchmark is publicly available at https://github.com/q-rz/enamel.
Open Datasets | Yes | We carefully select 142 problems out of the 164 problems in HumanEval (Chen et al., 2021) and HumanEval+ (Liu et al., 2023a), excluding trivial problems with Θ(1) time complexity.
Dataset Splits | Yes | For each problem i, each level l = 0, 1, ..., L has M_l test cases. If the output of the code does not match the expected output in any test case or does not pass level 0, we will not count it into the pass@k metric. If the code passes level 0 but exceeds the time limit in some level l ≥ 1, we will still count it into the pass@k metric but will skip the remaining levels (i.e., we assume that it will also exceed the time limit for the remaining levels because the input scale increases with the level l). Finally, we compute its efficiency score according to §2.2. ... We use α = 2, R = 6, h_1 = h_2 = 3, h_3 = 4, M_0 = 8, M_1 = M_2 = M_3 = 4.
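The quoted level-skipping rule can be sketched as a small driver: level 0 gates correctness (and hence pass@k), a wrong answer at any level disqualifies the sample, and a timeout at level l ≥ 1 skips the larger levels while the sample still counts toward pass@k. The function name and the 'pass'/'tle'/'wrong' summary interface are hypothetical, for illustration only; they are not the paper's actual harness.

```python
def judge_levels(run_level, L):
    """Illustrative sketch of the per-sample level protocol.

    run_level(l) summarizes the M_l test cases of level l for one
    code sample as 'pass', 'tle' (time limit exceeded), or 'wrong'.
    Returns whether the sample counts toward pass@k and how many
    timed levels (l >= 1) it completed within the limit.
    """
    if run_level(0) != 'pass':
        # fails the correctness level: excluded from pass@k
        return {'passes': False, 'levels_completed': 0}
    completed = 0
    for l in range(1, L + 1):
        result = run_level(l)
        if result == 'wrong':
            # wrong output at any level disqualifies the sample
            return {'passes': False, 'levels_completed': completed}
        if result == 'tle':
            # input scale grows with l: assume TLE at remaining levels too
            break
        completed += 1
    return {'passes': True, 'levels_completed': completed}
```

A sample that times out at level 2 would thus be scored on levels 1 only, yet still contribute to pass@k.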
Hardware Specification | Yes | For other open-source models, we use temperature 0.8 and top-p 0.95 for sampling on a server with 8 NVIDIA A100 80GB GPUs. ... We run evaluation on virtualized cloud servers hosted by Google Cloud (Ubuntu 20.04.6 LTS; Intel Xeon CPU @ 2.20GHz; Python 3.10.12).
Software Dependencies | No | We run evaluation on virtualized cloud servers hosted by Google Cloud (Ubuntu 20.04.6 LTS; Intel Xeon CPU @ 2.20GHz; Python 3.10.12). The paper lists only Python 3.10.12, with no versioned libraries or solvers, which does not meet the criteria for a 'Yes'.
Experiment Setup | Yes | We use α = 2, R = 6, h_1 = h_2 = 3, h_3 = 4, M_0 = 8, M_1 = M_2 = M_3 = 4. To minimize server workload fluctuations, we run evaluation on virtualized cloud servers hosted by Google Cloud (Ubuntu 20.04.6 LTS; Intel Xeon CPU @ 2.20GHz; Python 3.10.12).