Language Models Can Teach Themselves to Program Better
Authors: Patrick Haluptzok, Matthew Bowers, Adam Tauman Kalai
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments on publicly-available LMs, test accuracy more than doubles. This work demonstrates the potential for code LMs, with an interpreter, to generate instructive problems and improve their own performance. |
| Researcher Affiliation | Collaboration | Patrick Haluptzok, Microsoft Research (haluptzok@live.com); Matthew Bowers, MIT (mlbowers@mit.edu); Adam Tauman Kalai, Microsoft Research (adam@kal.ai) |
| Pseudocode | No | The paper describes a "Self-improvement Pipeline" with numbered steps, but these steps are presented as descriptive paragraphs rather than as structured pseudocode or an algorithm block (a minimal sketch of such a pipeline is given after this table). |
| Open Source Code | Yes | Second, we release datasets of 1M synthetic puzzles and solutions along with the source code for our work: https://github.com/microsoft/PythonProgrammingPuzzles (ICLR2023 directory) |
| Open Datasets | Yes | The open-source P3 dataset of Python Programming Puzzles demonstrates that programming puzzles can capture this wide range of challenges from various domains, from trivial string manipulation to longstanding open problems in algorithms and mathematics. Our work uses the P3 puzzles but not their solutions. Second, we release datasets of 1M synthetic puzzles and solutions along with the source code for our work: https://github.com/microsoft/PythonProgrammingPuzzles (ICLR2023 directory) |
| Dataset Splits | No | The paper specifies a train/test split (155 puzzles for train and 228 for test) but does not mention a separate validation split for model training or evaluation. The term "validation" is used in the context of verifying solutions, not dataset splits. |
| Hardware Specification | No | The paper does not specify any particular hardware (e.g., GPU models, CPU types) used for running the experiments. It mentions using the Codex API and GPT-Neo models, but not the underlying hardware. |
| Software Dependencies | No | The paper mentions using a "Python interpreter" and the "scikit-learn" library, but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | The fixed temperature of 0.9 from prior work [Schuster et al., 2021] was used in all puzzle-solving for generating fine-tuning data, while a temperature of 0.8 was used for testing the fine-tuned model, per Chen et al. [2021]. Each of the 3 Neo model sizes was fine-tuned for 1 epoch (1 pass through the generated data) using each of the 4 different datasets of 1M synthetic verified puzzle-solution pairs, yielding 12 fine-tuning runs. |
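
To make the verification step concrete: in the P3 format referenced in the "Open Datasets" row, a puzzle is a Python function `f` that returns `True` on a correct answer, and a solution is a function `g` whose return value satisfies `f`. Below is a minimal sketch of that interpreter check, assuming this puzzle/solution convention; the example puzzle and the `verify` helper are illustrative and are not taken from the paper's released code.

```python
# Minimal sketch of P3-style puzzle verification (illustrative, not the
# paper's released code). A puzzle is a function f(answer) -> bool, and a
# candidate solution is a function g() whose return value makes f True.

def f(s: str) -> bool:
    """Example puzzle: find a 5-character string that starts with 'ab'."""
    return len(s) == 5 and s.startswith("ab")

def g() -> str:
    """Candidate solution, e.g. one sampled from the language model."""
    return "abcde"

def verify(puzzle, solution) -> bool:
    """Run the solution and check it against the puzzle with the interpreter."""
    try:
        return puzzle(solution()) is True
    except Exception:
        # Any runtime error counts as a failed solution.
        return False

if __name__ == "__main__":
    print(verify(f, g))  # True -> the (puzzle, solution) pair would be kept
```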
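
The self-improvement pipeline noted in the "Pseudocode" row, combined with the settings quoted in the "Experiment Setup" row (temperature 0.9 for generating fine-tuning data, one fine-tuning epoch), can be summarized as a short loop: generate puzzles with the LM, sample candidate solutions, keep only interpreter-verified pairs, and fine-tune on them. The sketch below is one way to express that loop; the callables `sample_puzzles`, `sample_solutions`, and `fine_tune` are hypothetical stand-ins for the Codex API and GPT-Neo fine-tuning used in the paper.

```python
# Sketch of one self-improvement round (generate -> solve -> verify -> fine-tune).
# The callables passed in are hypothetical stand-ins for the actual LM APIs.

from typing import Callable, List, Tuple

def self_improvement_round(
    sample_puzzles: Callable[[int, float], List[str]],        # LM puzzle generator
    sample_solutions: Callable[[str, int, float], List[str]],  # LM solver
    verify: Callable[[str, str], bool],                        # Python interpreter check
    fine_tune: Callable[[List[Tuple[str, str]]], None],        # one-epoch fine-tune
    n_puzzles: int = 1000,
    n_attempts: int = 32,
    temperature: float = 0.9,  # temperature used when generating fine-tuning data
) -> List[Tuple[str, str]]:
    """Collect verified (puzzle, solution) pairs and fine-tune on them."""
    verified: List[Tuple[str, str]] = []
    for puzzle in sample_puzzles(n_puzzles, temperature):
        for solution in sample_solutions(puzzle, n_attempts, temperature):
            if verify(puzzle, solution):
                # Only interpreter-verified pairs enter the fine-tuning set.
                verified.append((puzzle, solution))
                break  # keep one verified solution per puzzle
    fine_tune(verified)  # e.g. a single pass through the verified synthetic data
    return verified
```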