Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
Authors: Alex Gu, Baptiste Roziere, Hugh James Leather, Armando Solar-Lezama, Gabriel Synnaeve, Sida Wang
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Second, we evaluate twenty code models on our benchmark and discover that many recent high-scoring models on HumanEval show no improvements on our benchmark. |
| Researcher Affiliation | Collaboration | ¹AI at Meta, ²MIT. |
| Pseudocode | No | The paper describes steps for benchmark construction in prose (e.g., "At a high level, our benchmark is constructed as follows.") but does not include any formally structured pseudocode blocks or algorithms. |
| Open Source Code | No | The paper explicitly states that CRUXEVAL is a "publicly available benchmark" (a dataset), but it does not provide an explicit statement or link for the source code of the *methodology* used to create or process this benchmark. |
| Open Datasets | Yes | To our knowledge, our CRUXEVAL is the first publicly available benchmark to measure the execution ability of code LMs. |
| Dataset Splits | No | The paper mentions "train accuracy" and "test accuracy" during fine-tuning, but it does not specify a separate "validation" split for the CRUXEVAL benchmark itself in the context of reproducing the experiment evaluations. |
| Hardware Specification | No | The paper mentions the use of various language models (e.g., GPT-4, Code Llama), but it does not specify the underlying hardware (e.g., specific GPU or CPU models, memory configurations) used to run the experiments. |
| Software Dependencies | No | While Appendix C lists Hugging Face URLs for model checkpoints, the paper does not provide specific version numbers for ancillary software dependencies such as programming languages (e.g., Python 3.x) or deep learning frameworks (e.g., PyTorch 1.x). |
| Experiment Setup | Yes | We use N = 100 samples for all non-GPT models and N = 10 samples for GPT models. We report both pass@1 scores (T = 0.2) and pass@5 scores (T = 0.8). |
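The Experiment Setup row reports pass@1 and pass@5 computed from N samples per problem. The excerpt does not spell out the estimator, but benchmarks of this kind typically use the unbiased pass@k estimator of Chen et al. (2021); a minimal sketch under that assumption:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated per problem (e.g. N = 100 or N = 10)
    c: number of samples that pass the check
    k: evaluation budget (e.g. 1 or 5)
    """
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a pass.
        return 1.0
    # 1 minus the probability that a random k-subset contains no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with N = 100 samples of which 50 pass, `pass_at_k(100, 50, 1)` gives 0.5; the per-problem values are then averaged over the benchmark.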