Language Models Can Teach Themselves to Program Better
Authors: Patrick Haluptzok, Matthew Bowers, Adam Tauman Kalai
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments on publicly-available LMs, test accuracy more than doubles. This work demonstrates the potential for code LMs, with an interpreter, to generate instructive problems and improve their own performance. |
| Researcher Affiliation | Collaboration | Patrick Haluptzok, Microsoft Research (haluptzok@live.com); Matthew Bowers, MIT (mlbowers@mit.edu); Adam Tauman Kalai, Microsoft Research (adam@kal.ai) |
| Pseudocode | No | The paper describes a "Self-improvement Pipeline" with numbered steps, but these steps are presented as descriptive paragraphs rather than as structured pseudocode or an algorithm block (a minimal sketch of such a pipeline is given after this table). |
| Open Source Code | Yes | Second, we release datasets of 1M synthetic puzzles and solutions along with the source code for our work: https://github.com/microsoft/PythonProgrammingPuzzles (ICLR2023 directory) |
| Open Datasets | Yes | The open-source P3 dataset of Python Programming Puzzles demonstrates that programming puzzles can capture this wide range of challenges from various domains, from trivial string manipulation to longstanding open problems in algorithms and mathematics. Our work uses the P3 puzzles but not their solutions. Second, we release datasets of 1M synthetic puzzles and solutions along with the source code for our work: https://github.com/microsoft/PythonProgrammingPuzzles (ICLR2023 directory) |
| Dataset Splits | No | The paper specifies a train/test split (155 puzzles for train and 228 for test) but does not mention a separate validation split for model training or evaluation. The term "validation" is used in the context of verifying solutions, not dataset splits. |
| Hardware Specification | No | The paper does not specify any particular hardware (e.g., GPU models, CPU types) used for running the experiments. It mentions using the Codex API and GPT-Neo models, but not the underlying hardware. |
| Software Dependencies | No | The paper mentions using a "Python interpreter" and the "scikit-learn" library, but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | The fixed temperature of 0.9 from prior work [Schuster et al., 2021] was used in all puzzle-solving for generating fine-tuning data, while a temperature of 0.8 was used for testing the fine-tuned model, per Chen et al. [2021]. Each of the 3 Neo model sizes was fine-tuned for 1 epoch (1 pass through the generated data) using each of the 4 different datasets of 1M synthetic verified puzzle-solution pairs, yielding 12 fine-tuning runs. |
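
To make the verification step concrete: in the P3 format referenced in the "Open Datasets" row, a puzzle is a Python function `f` that returns `True` on a correct answer, and a solution is a function `g` whose return value satisfies `f`. Below is a minimal sketch of that interpreter check, assuming this puzzle/solution convention; the example puzzle and the `verify` helper are illustrative and are not taken from the paper's released code.

```python
# Minimal sketch of P3-style puzzle verification (illustrative, not the
# paper's released code). A puzzle is a function f(answer) -> bool, and a
# candidate solution is a function g() whose return value makes f True.

def f(s: str) -> bool:
    """Example puzzle: find a 5-character string that starts with 'ab'."""
    return len(s) == 5 and s.startswith("ab")

def g() -> str:
    """Candidate solution, e.g. one sampled from the language model."""
    return "abcde"

def verify(puzzle, solution) -> bool:
    """Run the solution and check it against the puzzle with the interpreter."""
    try:
        return puzzle(solution()) is True
    except Exception:
        # Any runtime error counts as a failed solution.
        return False

if __name__ == "__main__":
    print(verify(f, g))  # True -> the (puzzle, solution) pair would be kept
```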
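
The self-improvement pipeline noted in the "Pseudocode" row, combined with the settings quoted in the "Experiment Setup" row (temperature 0.9 for generating fine-tuning data, one fine-tuning epoch), can be summarized as a short loop: generate puzzles with the LM, sample candidate solutions, keep only interpreter-verified pairs, and fine-tune on them. The sketch below is one way to express that loop; the callables `sample_puzzles`, `sample_solutions`, and `fine_tune` are hypothetical stand-ins for the Codex API and GPT-Neo fine-tuning used in the paper.

```python
# Sketch of one self-improvement round (generate -> solve -> verify -> fine-tune).
# The callables passed in are hypothetical stand-ins for the actual LM APIs.

from typing import Callable, List, Tuple

def self_improvement_round(
    sample_puzzles: Callable[[int, float], List[str]],        # LM puzzle generator
    sample_solutions: Callable[[str, int, float], List[str]],  # LM solver
    verify: Callable[[str, str], bool],                        # Python interpreter check
    fine_tune: Callable[[List[Tuple[str, str]]], None],        # one-epoch fine-tune
    n_puzzles: int = 1000,
    n_attempts: int = 32,
    temperature: float = 0.9,  # temperature used when generating fine-tuning data
) -> List[Tuple[str, str]]:
    """Collect verified (puzzle, solution) pairs and fine-tune on them."""
    verified: List[Tuple[str, str]] = []
    for puzzle in sample_puzzles(n_puzzles, temperature):
        for solution in sample_solutions(puzzle, n_attempts, temperature):
            if verify(puzzle, solution):
                # Only interpreter-verified pairs enter the fine-tuning set.
                verified.append((puzzle, solution))
                break  # keep one verified solution per puzzle
    fine_tune(verified)  # e.g. a single pass through the verified synthetic data
    return verified
```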