CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
Authors: Alex Gu, Baptiste Roziere, Hugh James Leather, Armando Solar-Lezama, Gabriel Synnaeve, Sida Wang
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Second, we evaluate twenty code models on our benchmark and discover that many recent high-scoring models on HumanEval show no improvements on our benchmark. |
| Researcher Affiliation | Collaboration | ¹AI at Meta, ²MIT. |
| Pseudocode | No | The paper describes steps for benchmark construction in prose (e.g., "At a high level, our benchmark is constructed as follows.") but does not include any formally structured pseudocode blocks or algorithms. |
| Open Source Code | No | The paper explicitly states that CRUXEVAL is a "publicly available benchmark" (a dataset), but it does not provide an explicit statement or link to the source code of the *methodology* used to create or process this benchmark. |
| Open Datasets | Yes | To our knowledge, our CRUXEVAL is the first publicly available benchmark to measure the execution ability of code LMs. |
| Dataset Splits | No | The paper mentions "train accuracy" and "test accuracy" during fine-tuning, but it does not specify a separate "validation" split for the CRUXEVAL benchmark itself in the context of reproducing the experiment evaluations. |
| Hardware Specification | No | The paper mentions the use of various language models (e.g., GPT-4, Code Llama), but it does not specify the underlying hardware (e.g., specific GPU or CPU models, memory configurations) used to run the experiments. |
| Software Dependencies | No | While Appendix C lists Hugging Face URLs for model checkpoints, the paper does not provide specific version numbers for ancillary software dependencies such as programming languages (e.g., Python 3.x) or deep learning frameworks (e.g., PyTorch 1.x). |
| Experiment Setup | Yes | We use N = 100 samples for all non-GPT models and N = 10 samples for GPT models. We report both pass@1 scores (T = 0.2) and pass@5 scores (T = 0.8). |
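
For context on the Experiment Setup row: pass@k over N samples per problem is conventionally computed with the unbiased estimator of Chen et al. (2021) and then averaged across problems. The sketch below illustrates that estimator; it is not the authors' evaluation harness.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: samples drawn per problem (N = 100 or N = 10 here),
    c: number of those samples that are correct,
    k: the k in pass@k (1 or 5 in this paper).
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: a problem where 37 of 100 samples pass.
print(pass_at_k(100, 37, 1))  # ~0.37
print(pass_at_k(100, 37, 5))  # ~0.91
```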
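
On the Open Datasets row: the released benchmark should be loadable directly from Hugging Face. The dataset id and split name below are assumptions for illustration only and are not confirmed by this report.

```python
from datasets import load_dataset  # pip install datasets

# Assumed dataset id and split; verify against the paper's release page.
cruxeval = load_dataset("cruxeval-org/cruxeval", split="test")
print(len(cruxeval), "problems")
print(cruxeval[0])  # each record pairs a short Python function with an (input, output) example
```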