Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution

Authors: Alex Gu, Baptiste Roziere, Hugh James Leather, Armando Solar-Lezama, Gabriel Synnaeve, Sida Wang

ICML 2024 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Second, we evaluate twenty code models on our benchmark and discover that many recent high-scoring models on HumanEval show no improvements on our benchmark. |
| Researcher Affiliation | Collaboration | ¹AI at Meta, ²MIT. |
| Pseudocode | No | The paper describes steps for benchmark construction in prose (e.g., "At a high level, our benchmark is constructed as follows.") but does not include any formally structured pseudocode blocks or algorithms. |
| Open Source Code | No | The paper explicitly states that CRUXEVAL is a "publicly available benchmark" (a dataset), but it does not provide an explicit statement or link for the source code of the *methodology* used to create or process this benchmark. |
| Open Datasets | Yes | To our knowledge, our CRUXEVAL is the first publicly available benchmark to measure the execution ability of code LMs. |
| Dataset Splits | No | The paper mentions "train accuracy" and "test accuracy" during fine-tuning, but it does not specify a separate validation split for the CRUXEVAL benchmark itself in the context of reproducing the reported evaluations. |
| Hardware Specification | No | The paper mentions the use of various language models (e.g., GPT-4, Code Llama), but it does not specify the underlying hardware (e.g., specific GPU or CPU models, memory configurations) used to run the experiments. |
| Software Dependencies | No | While Appendix C lists Hugging Face URLs for model checkpoints, the paper does not provide specific version numbers for ancillary software dependencies such as programming languages (e.g., Python 3.x) or deep learning frameworks (e.g., PyTorch 1.x). |
| Experiment Setup | Yes | We use N = 100 samples for all non-GPT models and N = 10 samples for GPT models. We report both pass@1 scores (T = 0.2) and pass@5 scores (T = 0.8). |
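The pass@k metric in the experiment setup row is, by convention, computed with the unbiased estimator introduced for HumanEval (Chen et al., 2021): draw N samples per problem, count the c correct ones, and estimate the probability that at least one of k samples passes. A minimal sketch of that standard estimator (the function name and example counts are illustrative, not taken from the paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn without replacement from n generations is correct, given
    that c of the n generations are correct."""
    if n - c < k:
        # Fewer incorrect samples than k: every draw of k must
        # contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: N = 100 samples, 30 of which pass.
print(pass_at_k(100, 30, 1))  # pass@1 = 0.3 (simply c / n when k = 1)
print(pass_at_k(100, 30, 5))  # pass@5, strictly higher than pass@1
```

Averaging this quantity over all benchmark problems gives the reported pass@1 and pass@5 scores; the two sampling temperatures (T = 0.2 and T = 0.8) trade off precision at k = 1 against diversity at k = 5.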