Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
ReGAL: Refactoring Programs to Discover Generalizable Abstractions
Authors: Elias Stengel-Eskin, Archiki Prasad, Mohit Bansal
ICML 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On five datasets (LOGO graphics generation, Date reasoning, TextCraft (a Minecraft-based text-game), MATH, and TabMWP), both open-source and proprietary LLMs improve in accuracy when predicting programs with REGAL functions. |
| Researcher Affiliation | Academia | Elias Stengel-Eskin*¹, Archiki Prasad*¹, Mohit Bansal¹. ¹UNC Chapel Hill. |
| Pseudocode | Yes | Algorithm 1 REGAL: Training Algorithm; Algorithm 2 REGAL: Testing Algorithm |
| Open Source Code | Yes | Code: https://github.com/esteng/regal_program_learning. |
| Open Datasets | Yes | We explore five datasets: LOGO (Ellis et al., 2021; Wong et al., 2021), a program induction task; a date reasoning task (Srivastava et al., 2022) known to challenge LLMs (Suzgun et al., 2022); TextCraft (Prasad et al., 2023), a text-based game for crafting Minecraft objects; a subset of MATH (Hendrycks et al., 2021)... and TabMWP (Lu et al., 2022)... |
| Dataset Splits | Yes | We use the small train/test splits (200/111) from Wong et al. (2021) and take 100 dev examples from the large train set. ... Specifically, we split their predicted programs from GPT-3.5 into train, dev, and test splits (66/113/180) ... giving us a train/dev/test split of 190/50/77. ... This gives us a train/dev/test split of 194/61/74. ... This gives us a train/dev/test split of 194/60/74. |
| Hardware Specification | No | The paper mentions various LLMs used (e.g., Code Llama, GPT-3.5, Lemur) but does not provide any specific details about the hardware (e.g., GPU, CPU models, memory) used to run the experiments. |
| Software Dependencies | Yes | For GPT-3.5, we use the gpt-3.5-turbo version (0613). All Code Llama models use the Code Llama-Instruct-hf versions, and we use the lemur-70b-v1 version of Lemur. |
| Experiment Setup | Yes | We use the dev set to select hyperparameter values, reported in Appendix C. All prompts can be found in Appendix D. ... Table 9 lists the refactoring and testing hyperparameters used for each domain. Values are given as LOGO / Date / TextCraft. Rounds of refactoring: 3 / 1 / 1; editEvery: 5 / 5 / 5; pruneEvery: 5 / 5 / 5; Add comments: True / False / False; Batch size: 5 / 3 / 4; Filtering threshold: 0.0 / 0.0 / 0.0; Filter before testing: True / True / False; ICL budget ratio: 0.5 / 0.5 / 0.5. |
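
The Table 9 values quoted in the Experiment Setup row can be easier to compare when written as a per-domain configuration. The sketch below is only an illustration: the class name `RefactorConfig`, its field names, the `TABLE_9` dictionary, and the helper `settings_that_differ` are hypothetical and are not taken from the ReGAL codebase. Only the numeric and boolean values come from the row above.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RefactorConfig:
    """Hypothetical container for the Table 9 settings (field names are assumptions)."""
    refactor_rounds: int
    edit_every: int
    prune_every: int
    add_comments: bool
    batch_size: int
    filter_threshold: float
    filter_before_testing: bool
    icl_budget_ratio: float


# Values transcribed from the Experiment Setup row, one entry per domain.
TABLE_9 = {
    "LOGO": RefactorConfig(3, 5, 5, True, 5, 0.0, True, 0.5),
    "Date": RefactorConfig(1, 5, 5, False, 3, 0.0, True, 0.5),
    "TextCraft": RefactorConfig(1, 5, 5, False, 4, 0.0, False, 0.5),
}


def settings_that_differ(configs):
    """Return the setting names whose value is not identical across all domains."""
    rows = [asdict(cfg) for cfg in configs.values()]
    return [name for name in rows[0] if len({row[name] for row in rows}) > 1]


if __name__ == "__main__":
    print(settings_that_differ(TABLE_9))
```

Read this way, only four of the eight settings vary by domain; the edit and prune intervals, the filtering threshold, and the ICL budget ratio are shared across LOGO, Date, and TextCraft.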