Learning Performance-Improving Code Edits
Authors: Alexander G Shypula, Aman Madaan, Yimeng Zeng, Uri Alon, Jacob R. Gardner, Yiming Yang, Milad Hashemi, Graham Neubig, Parthasarathy Ranganathan, Osbert Bastani, Amir Yazdanbakhsh
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | First, we curate a dataset of performance-improving edits made by human programmers of over 77K competitive C++ programming submission pairs, accompanied by extensive unit tests. A major challenge is the significant variability of measuring performance on commodity hardware, which can lead to spurious improvements. To isolate and reliably evaluate the impact of program optimizations, we design an environment based on the gem5 full system simulator, the de facto simulator used in academia and industry. Next, we propose a broad range of adaptation strategies for code optimization; for prompting, these include retrieval-based few-shot prompting and chain-of-thought, and for finetuning, these include performance-conditioned generation and synthetic data augmentation based on self-play. A combination of these techniques achieves a mean speedup of 6.86× with eight generations, higher than average optimizations from individual programmers (3.66×). Using our model's fastest generations, we set a new upper limit on the fastest speedup possible for our dataset at 9.64× compared to using the fastest human submissions available (9.56×). |
| Researcher Affiliation | Collaboration | Alexander Shypula (1), Aman Madaan (2), Yimeng Zeng (1), Uri Alon (4), Jacob Gardner (1), Milad Hashemi (3), Graham Neubig (2), Parthasarathy Ranganathan (3), Osbert Bastani (1), Amir Yazdanbakhsh (4); (1) University of Pennsylvania, (2) Carnegie Mellon University, (3) Google, (4) Google DeepMind. Contact: shypula@seas.upenn.edu and ayazdan@google.com |
| Pseudocode | No | The paper includes C++ code examples in figures (e.g., Figures 3-7) and prompt templates (e.g., Figures 12-15) but does not contain a dedicated pseudocode section or a clearly labeled algorithm block for any specific method. |
| Open Source Code | Yes | Code: https://pie4perf.com |
| Open Datasets | Yes | Our dataset is constructed based on performance-improving edits (PIE) made by human programmers in a range of competitive programming tasks from CodeNet (Puri et al., 2021). |
| Dataset Splits | Yes | We split the resulting dataset of pairs P into train/validation/test sets, ensuring that any particular competitive programming problem only appears in one of them. We obtain a training set of 77,967 pairs from 1,474 problems, a validation set of 2,544 pairs from 77 problems, and a test set of 978 pairs from 41 problems. (A sketch of this problem-level split follows the table.) |
| Hardware Specification | Yes | We fine-tuned the 7B and 13B variants using the Hugging Face Transformers library with FSDP to distribute the training process across 8 × 48GB GPUs (NVIDIA RTX A6000 / NVIDIA L40). ... Parallelized on a 24-core Intel 13900K processor with 64GB of RAM, this took less than 72 hours to complete. |
| Software Dependencies | Yes | We compile all C++ programs with GCC version 9.3.0 and C++17 as well as the -O3 optimization flag; therefore, any reported improvements would be those on top of the optimizing compiler. We fine-tuned the 7B and 13B variants using the Hugging Face Transformers library... All experiments were conducted using the AdamW optimizer. |
| Experiment Setup | Yes | We use BEST@k to denote this strategy with k samples and a temperature of 0.7. ... For the 7B and 13B variants of CodeLlama, we used a batch size of 32 and a learning rate of 1e-5 for all of the experiments. ... For tasks related to full data fine-tuning and performance-conditioned fine-tuning, we only train for 1 epoch... (Sketches of the BEST@k selection loop and the fine-tuning configuration follow the table.) |
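
The "Dataset Splits" row describes a split that keeps all submission pairs for a given competitive-programming problem in a single partition. Below is a minimal sketch of such a problem-level (grouped) split; the field name `problem_id` and the split fractions are illustrative assumptions, not the authors' released code.

```python
# Sketch (not the authors' code): split PIE pairs by problem ID so that no
# competitive-programming problem appears in more than one of train/val/test.
# The paper reports 1,474 / 77 / 41 problems in train / validation / test;
# the fractions below are placeholders.
import random

def split_by_problem(pairs, val_frac=0.05, test_frac=0.03, seed=0):
    """pairs: list of dicts, each carrying a 'problem_id' key (assumed field name)."""
    problems = sorted({p["problem_id"] for p in pairs})
    random.Random(seed).shuffle(problems)

    n_val = int(len(problems) * val_frac)
    n_test = int(len(problems) * test_frac)
    val_ids = set(problems[:n_val])
    test_ids = set(problems[n_val:n_val + n_test])

    train = [p for p in pairs
             if p["problem_id"] not in val_ids and p["problem_id"] not in test_ids]
    val = [p for p in pairs if p["problem_id"] in val_ids]
    test = [p for p in pairs if p["problem_id"] in test_ids]
    return train, val, test
```

Grouping by problem rather than by pair is what prevents near-duplicate solutions to the same problem from leaking across splits.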
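The "Experiment Setup" row defines BEST@k: draw k candidate rewrites at temperature 0.7, keep only candidates that compile and pass the unit tests, and report the fastest one. The sketch below assumes hypothetical helpers `sample_candidates`, `run_tests`, and `measure_runtime` (the paper measures runtime inside the gem5 full-system simulator rather than on bare hardware); only the compiler invocation (GCC with C++17 and -O3, per the "Software Dependencies" row) is taken directly from the paper.

```python
# Sketch of the BEST@k selection strategy; helper functions are supplied by the
# caller and are hypothetical stand-ins for model sampling, unit-test checking,
# and gem5-based timing.
import subprocess
import tempfile

def compile_cpp(src_code: str):
    """Compile a candidate program with the flags reported in the paper (C++17, -O3)."""
    with tempfile.NamedTemporaryFile(suffix=".cpp", delete=False, mode="w") as f:
        f.write(src_code)
        src_path = f.name
    binary = src_path.replace(".cpp", ".out")
    result = subprocess.run(
        ["g++", "-std=c++17", "-O3", src_path, "-o", binary],
        capture_output=True,
    )
    return binary if result.returncode == 0 else None

def best_at_k(sample_candidates, run_tests, measure_runtime,
              slow_program, tests, k=8, temperature=0.7):
    """Return the fastest correct candidate among k samples, if any."""
    best_prog, best_time = None, float("inf")
    for candidate in sample_candidates(slow_program, k, temperature):
        binary = compile_cpp(candidate)
        if binary is None or not run_tests(binary, tests):
            continue  # discard candidates that fail to compile or fail the unit tests
        runtime = measure_runtime(binary, tests)  # gem5 simulation in the paper
        if runtime < best_time:
            best_prog, best_time = candidate, runtime
    return best_prog, best_time
```

Filtering on unit-test correctness before comparing runtimes is what lets the paper report speedups only over functionally equivalent programs.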
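The fine-tuning hyperparameters in the "Experiment Setup" and "Hardware Specification" rows (AdamW, learning rate 1e-5, effective batch size 32, 1 epoch, FSDP across 8 GPUs) can be expressed as a Hugging Face `TrainingArguments` configuration. The split between per-device batch size and the number of GPUs, the bf16 setting, and the output directory are assumptions; model and dataset loading are elided.

```python
# Sketch of a Hugging Face Transformers configuration matching the reported
# hyperparameters; this is an illustration, not the authors' training script.
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="pie-codellama-ft",       # hypothetical output directory
    per_device_train_batch_size=4,       # 4 per device on 8 GPUs gives an effective batch of 32 (assumed split)
    learning_rate=1e-5,
    num_train_epochs=1,                  # 1 epoch for full-data and performance-conditioned runs
    optim="adamw_torch",                 # AdamW optimizer, as stated in the paper
    bf16=True,                           # assumed mixed-precision setting
    fsdp="full_shard auto_wrap",         # shard the 7B/13B models across the 8 GPUs
    logging_steps=10,
    save_strategy="epoch",
)

# trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
# trainer.train()
```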