Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

ReGAL: Refactoring Programs to Discover Generalizable Abstractions

Authors: Elias Stengel-Eskin, Archiki Prasad, Mohit Bansal

ICML 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On five datasets (LOGO graphics generation, Date reasoning, TextCraft (a Minecraft-based text-game), MATH, and TabMWP), both open-source and proprietary LLMs improve in accuracy when predicting programs with REGAL functions.
Researcher Affiliation | Academia | Elias Stengel-Eskin* 1, Archiki Prasad* 1, Mohit Bansal 1 (1: UNC Chapel Hill).
Pseudocode | Yes | Algorithm 1 REGAL: Training Algorithm; Algorithm 2 REGAL: Testing Algorithm
Open Source Code | Yes | Code: https://github.com/esteng/regal_program_learning.
Open Datasets | Yes | We explore five datasets: LOGO (Ellis et al., 2021; Wong et al., 2021), a program induction task; a date reasoning task (Srivastava et al., 2022) known to challenge LLMs (Suzgun et al., 2022); TextCraft (Prasad et al., 2023), a text-based game for crafting Minecraft objects; a subset of MATH (Hendrycks et al., 2021)... and TabMWP (Lu et al., 2022)...
Dataset Splits | Yes | We use the small train/test splits (200/111) from Wong et al. (2021) and take 100 dev examples from the large train set. ... Specifically, we split their predicted programs from GPT-3.5 into train, dev, and test splits (66/113/180) ... giving us a train/dev/test split of 190/50/77. ... This gives us a train/dev/test split of 194/61/74. ... This gives us a train/dev/test split of 194/60/74.
Hardware Specification | No | The paper mentions various LLMs used (e.g., Code Llama, GPT-3.5, Lemur) but does not provide any specific details about the hardware (e.g., GPU, CPU models, memory) used to run the experiments.
Software Dependencies | Yes | For GPT-3.5, we use the gpt-3.5-turbo version (0613). All Code Llama models use the CodeLlama-Instruct-hf versions, and we use the lemur-70b-v1 version of Lemur.
Experiment Setup | Yes | We use the dev set to select hyperparameter values, reported in Appendix C. All prompts can be found in Appendix D. ... Table 9 lists the refactoring and testing hyperparameters used for each domain. Setting (LOGO / Date / TextCraft): Rounds of refactoring 3 / 1 / 1; editEvery 5 / 5 / 5; pruneEvery 5 / 5 / 5; Add comments True / False / False; Batch size 5 / 3 / 4; Filtering threshold 0.0 / 0.0 / 0.0; Filter before testing True / True / False; ICL budget ratio 0.5 / 0.5 / 0.5.
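
For readers who want to reuse the Table 9 values quoted in the Experiment Setup row, the following is a minimal sketch that records them as a per-domain configuration. The class name `RefactorConfig`, its field names, and the helper `settings_that_vary` are hypothetical and chosen for illustration; they are not taken from the paper or from the linked repository. Only the numeric and boolean values come from the row above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RefactorConfig:
    """Hypothetical container for the per-domain settings quoted from Table 9."""
    rounds_of_refactoring: int
    edit_every: int
    prune_every: int
    add_comments: bool
    batch_size: int
    filtering_threshold: float
    filter_before_testing: bool
    icl_budget_ratio: float


# Values transcribed from the Experiment Setup row (LOGO / Date / TextCraft).
TABLE_9 = {
    "LOGO": RefactorConfig(
        rounds_of_refactoring=3, edit_every=5, prune_every=5,
        add_comments=True, batch_size=5, filtering_threshold=0.0,
        filter_before_testing=True, icl_budget_ratio=0.5,
    ),
    "Date": RefactorConfig(
        rounds_of_refactoring=1, edit_every=5, prune_every=5,
        add_comments=False, batch_size=3, filtering_threshold=0.0,
        filter_before_testing=True, icl_budget_ratio=0.5,
    ),
    "TextCraft": RefactorConfig(
        rounds_of_refactoring=1, edit_every=5, prune_every=5,
        add_comments=False, batch_size=4, filtering_threshold=0.0,
        filter_before_testing=False, icl_budget_ratio=0.5,
    ),
}


def settings_that_vary(configs):
    """Return the names of settings whose value differs between domains."""
    names = [f.name for f in fields(RefactorConfig)]
    return [
        name for name in names
        if len({getattr(cfg, name) for cfg in configs.values()}) > 1
    ]


if __name__ == "__main__":
    for name in settings_that_vary(TABLE_9):
        values = {domain: getattr(cfg, name) for domain, cfg in TABLE_9.items()}
        print(name, values)
```

Listing only the settings that differ makes it easier to see which choices were tuned per domain and which were shared across all three.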