Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution

Authors: Alex Gu, Baptiste Roziere, Hugh James Leather, Armando Solar-Lezama, Gabriel Synnaeve, Sida Wang

ICML 2024 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Second, we evaluate twenty code models on our benchmark and discover that many recent high-scoring models on HumanEval show no improvements on our benchmark. |
| Researcher Affiliation | Collaboration | ¹AI at Meta, ²MIT. |
| Pseudocode | No | The paper describes steps for benchmark construction in prose (e.g., "At a high level, our benchmark is constructed as follows.") but does not include any formally structured pseudocode blocks or algorithms. |
| Open Source Code | No | The paper explicitly states that CRUXEVAL is a "publicly available benchmark" (a dataset), but it does not provide an explicit statement or link for the source code of the *methodology* used to create or process this benchmark. |
| Open Datasets | Yes | To our knowledge, our CRUXEVAL is the first publicly available benchmark to measure the execution ability of code LMs. |
| Dataset Splits | No | The paper mentions "train accuracy" and "test accuracy" during fine-tuning, but it does not specify a separate validation split for the CRUXEVAL benchmark itself in the context of reproducing the reported evaluations. |
| Hardware Specification | No | The paper mentions the use of various language models (e.g., GPT-4, Code Llama), but it does not specify the underlying hardware (e.g., specific GPU or CPU models, memory configurations) used to run the experiments. |
| Software Dependencies | No | While Appendix C lists Hugging Face URLs for model checkpoints, the paper does not provide specific version numbers for ancillary software dependencies such as programming languages (e.g., Python 3.x) or deep learning frameworks (e.g., PyTorch 1.x). |
| Experiment Setup | Yes | We use N = 100 samples for all non-GPT models and N = 10 samples for GPT models. We report both pass@1 scores (T = 0.2) and pass@5 scores (T = 0.8). |
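The pass@k metric in the experiment setup row is, by convention, computed with the unbiased estimator introduced for HumanEval (Chen et al., 2021): draw N samples per problem, count the c correct ones, and estimate the probability that at least one of k samples passes. A minimal sketch of that standard estimator (the function name and example counts are illustrative, not taken from the paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn without replacement from n generations is correct, given
    that c of the n generations are correct."""
    if n - c < k:
        # Fewer incorrect samples than k: every draw of k must
        # contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: N = 100 samples, 30 of which pass.
print(pass_at_k(100, 30, 1))  # pass@1 = 0.3 (simply c / n when k = 1)
print(pass_at_k(100, 30, 5))  # pass@5, strictly higher than pass@1
```

Averaging this quantity over all benchmark problems gives the reported pass@1 and pass@5 scores; the two sampling temperatures (T = 0.2 and T = 0.8) trade off precision at k = 1 against diversity at k = 5.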