CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution

Authors: Alex Gu, Baptiste Roziere, Hugh James Leather, Armando Solar-Lezama, Gabriel Synnaeve, Sida Wang

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Second, we evaluate twenty code models on our benchmark and discover that many recent high-scoring models on HumanEval show no improvements on our benchmark."
Researcher Affiliation | Collaboration | AI at Meta; MIT
Pseudocode | No | The paper describes the benchmark construction steps in prose (e.g., "At a high level, our benchmark is constructed as follows.") but includes no formally structured pseudocode blocks or algorithms. (A hedged sketch of such a pipeline appears after the table.)
Open Source Code | No | The paper explicitly states that CRUXEVAL is a "publicly available benchmark" (a dataset), but it provides no explicit statement or link to the source code of the methodology used to create or process the benchmark.
Open Datasets | Yes | "To our knowledge, our CRUXEVAL is the first publicly available benchmark to measure the execution ability of code LMs." (An illustrative task in this style is sketched after the table.)
Dataset Splits | No | The paper mentions "train accuracy" and "test accuracy" during fine-tuning, but it does not specify a separate validation split for the CRUXEVAL benchmark itself in the context of reproducing the evaluations.
Hardware Specification | No | The paper mentions the use of various language models (e.g., GPT-4, Code Llama) but does not specify the underlying hardware (e.g., specific GPU or CPU models, memory configurations) used to run the experiments.
Software Dependencies | No | While Appendix C lists Hugging Face URLs for model checkpoints, the paper does not give version numbers for ancillary software dependencies such as the programming language (e.g., Python 3.x) or deep learning frameworks (e.g., PyTorch 1.x).
Experiment Setup | Yes | "We use N = 100 samples for all non-GPT models and N = 10 samples for GPT models. We report both pass@1 scores (T = 0.2) and pass@5 scores (T = 0.8)." (See the pass@k sketch after the table.)
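
The execution-ability claim quoted under Open Datasets refers to CRUXEval's two tasks: output prediction (CRUXEval-O) and input prediction (CRUXEval-I), both scored by executing an assertion. Below is a minimal, hypothetical item in that style; the function and values are our own illustration, not an actual benchmark entry.

```python
# Hypothetical CRUXEval-style item (illustrative only, not drawn from
# the benchmark). Each item pairs a short Python function with an
# input/output example, and a prediction is scored by running the
# assertion.

def f(s):
    # Keep every other character of s, then reverse the result.
    return s[::2][::-1]

# CRUXEval-O (output prediction): the model sees f and the input
# "abcdef" and must complete the right-hand side of the assertion.
# CRUXEval-I (input prediction): the model sees f and the output
# "eca" and must supply an argument that makes the assertion pass.
assert f("abcdef") == "eca"
```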
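
Since the paper gives its construction procedure only in prose, the following is a hedged sketch of what such a generate-then-filter pipeline could look like: sample candidate (function, input) pairs from a code LM, execute them, and keep only candidates that run cleanly. The hard-coded candidate list stands in for LM samples, and the names and filters are our own assumptions, not the paper's exact procedure.

```python
# Hedged sketch of a generate-then-filter benchmark construction loop.
# CANDIDATES stands in for samples drawn from a code LM; the filters
# below are illustrative assumptions, not the paper's exact criteria.

CANDIDATES = [
    ("def f(n):\n    return n * n + 1", "7"),  # runs cleanly: kept
    ("def f(x):\n    return x / 0", "3"),      # raises at runtime: dropped
    ("def f(", "1"),                           # syntax error: dropped
]

def build_items(candidates):
    items = []
    for func_src, input_expr in candidates:
        try:
            ns = {}
            exec(func_src, ns)                  # candidate must compile
            output = ns["f"](eval(input_expr))  # and execute without error
        except Exception:
            continue                            # discard failing candidates
        # Record the surviving item as an executable assertion.
        items.append(f"{func_src}\nassert f({input_expr}) == {output!r}")
    return items

for item in build_items(CANDIDATES):
    print(item, end="\n\n")
```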
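
The quoted experiment setup follows the standard pass@k protocol: draw N samples per problem at temperature T and estimate the probability that at least one of k samples is correct. The sketch below assumes the usual unbiased estimator, 1 - C(n-c, k) / C(n, k), introduced by Chen et al. (2021); the quoted text does not spell out the estimator, so treating it as this one is our assumption.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    1 - C(n - c, k) / C(n, k), for n samples with c correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    # Evaluate the ratio of binomial coefficients as a stable product.
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 30 of N = 100 samples pass the assertion for one problem.
print(pass_at_k(100, 30, 1))  # 0.30 -> pass@1
print(pass_at_k(100, 30, 5))  # chance that >= 1 of 5 draws is correct
```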