reproducibilityindex.ai

ReGAL: Refactoring Programs to Discover Generalizable Abstractions

Authors: Elias Stengel-Eskin, Archiki Prasad, Mohit Bansal

ICML 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	On five datasets (LOGO graphics generation, Date reasoning, TextCraft (a Minecraft-based text game), MATH, and TabMWP), both open-source and proprietary LLMs improve in accuracy when predicting programs with REGAL functions.
Researcher Affiliation	Academia	Elias Stengel-Eskin* (1), Archiki Prasad* (1), Mohit Bansal (1). (1) UNC Chapel Hill.
Pseudocode	Yes	Algorithm 1 REGAL: Training Algorithm; Algorithm 2 REGAL: Testing Algorithm
Open Source Code	Yes	Code: https://github.com/esteng/regal_program_learning.
Open Datasets	Yes	We explore five datasets: LOGO (Ellis et al., 2021; Wong et al., 2021), a program induction task; a date reasoning task (Srivastava et al., 2022) known to challenge LLMs (Suzgun et al., 2022); TextCraft (Prasad et al., 2023), a text-based game for crafting Minecraft objects; a subset of MATH (Hendrycks et al., 2021)... and TabMWP (Lu et al., 2022)...
Dataset Splits	Yes	We use the small train/test splits (200/111) from Wong et al. (2021) and take 100 dev examples from the large train set. ... Specifically, we split their predicted programs from GPT-3.5 into train, dev, and test splits (66/113/180) ... giving us a train/dev/test split of 190/50/77. ... This gives us a train/dev/test split of 194/61/74. ... This gives us a train/dev/test split of 194/60/74.
Hardware Specification	No	The paper mentions various LLMs used (e.g., Code Llama, GPT-3.5, Lemur) but does not provide any specific details about the hardware (e.g., GPU, CPU models, memory) used to run the experiments.
Software Dependencies	Yes	For GPT-3.5, we use the gpt-3.5-turbo version (0613). All Code Llama models use the Code Llama-Instruct-hf versions, and we use the lemur-70b-v1 version of Lemur.
Experiment Setup	Yes	We use the dev set to select hyperparameter values, reported in Appendix C. All prompts can be found in Appendix D. ... Table 9 lists the refactoring and testing hyperparameters used for each domain:

Setting	LOGO	Date	TextCraft
Rounds of refactoring	3	1	1
editEvery	5	5	5
pruneEvery	5	5	5
Add comments	True	False	False
Batch size	5	3	4
Filtering threshold	0.0	0.0	0.0
Filter before testing	True	True	False
ICL budget ratio	0.5	0.5	0.5
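
For anyone trying to reproduce the setup, the Table 9 values quoted above could be recorded as a small per-domain configuration. The following is a minimal sketch only: the class name `RefactorConfig`, its field names, and the helper `differing_settings` are hypothetical and do not come from the paper or its repository. Only the numeric and boolean values are taken from the table.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RefactorConfig:
    """Hypothetical container for one column of Table 9 (field names are illustrative)."""
    rounds_of_refactoring: int
    edit_every: int
    prune_every: int
    add_comments: bool
    batch_size: int
    filtering_threshold: float
    filter_before_testing: bool
    icl_budget_ratio: float


# Values copied from Table 9 as quoted above; one entry per domain.
TABLE_9 = {
    "LOGO": RefactorConfig(
        rounds_of_refactoring=3, edit_every=5, prune_every=5, add_comments=True,
        batch_size=5, filtering_threshold=0.0, filter_before_testing=True,
        icl_budget_ratio=0.5,
    ),
    "Date": RefactorConfig(
        rounds_of_refactoring=1, edit_every=5, prune_every=5, add_comments=False,
        batch_size=3, filtering_threshold=0.0, filter_before_testing=True,
        icl_budget_ratio=0.5,
    ),
    "TextCraft": RefactorConfig(
        rounds_of_refactoring=1, edit_every=5, prune_every=5, add_comments=False,
        batch_size=4, filtering_threshold=0.0, filter_before_testing=False,
        icl_budget_ratio=0.5,
    ),
}


def differing_settings(configs):
    """Return the field names whose values are not identical across all domains."""
    names = []
    for f in fields(RefactorConfig):
        values = {getattr(cfg, f.name) for cfg in configs.values()}
        if len(values) > 1:
            names.append(f.name)
    return names


if __name__ == "__main__":
    # Print only the settings that actually vary between domains.
    for name in differing_settings(TABLE_9):
        per_domain = {domain: getattr(cfg, name) for domain, cfg in TABLE_9.items()}
        print(name, per_domain)
```

Laid out this way, it is easy to see that only four settings vary across the three domains (rounds of refactoring, comment insertion, batch size, and pre-test filtering), while the remaining four are shared.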