reproducibilityindex.ai

LILO: Learning Interpretable Libraries by Compressing and Documenting Code

Authors: Gabriel Grand, Lionel Wong, Matthew Bowers, Theo X. Olausson, Muxin Liu, Joshua B. Tenenbaum, Jacob Andreas

ICLR 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate LILO on three inductive program synthesis benchmarks for string editing, scene reasoning, and graphics composition. Compared to existing methods, including the state-of-the-art library learning algorithm DreamCoder, LILO solves more complex tasks and learns richer libraries that are grounded in linguistic knowledge.
Researcher Affiliation	Academia	1MIT CSAIL 2MIT Brain and Cognitive Sciences 3Harvey Mudd College
Pseudocode	Yes	Algorithm 1 Library learning loop with LILO
Open Source Code	Yes	Code for this paper is available at: github.com/gabegrand/lilo.
Open Datasets	Yes	We evaluate LILO against a language-guided DreamCoder variant on three challenging program synthesis domains: string editing with regular expressions (Andreas et al., 2018), scene reasoning on the CLEVR dataset (Johnson et al., 2017), and graphics composition in the 2D Logo turtle graphics language (Abelson & diSessa, 1986).
Dataset Splits	No	Performance is measured as the percentage of tasks solved from an i.i.d. test set. (Only train and test splits are explicitly mentioned; no validation split is described.)
Hardware Specification	Yes	We ran all experiments on AWS EC2 instances with machine specs tailored to suit the computational workload of each experiment. For experiments involving enumerative search... we ran on 96-CPU c5.24xlarge instances... these experiments are run on c5.2xlarge machines with 8 CPUs each.
Software Dependencies	No	The paper mentions specific LLM models used (OpenAI's Codex model (code-davinci-002), gpt-3.5-turbo, gpt-4) and the Stitch Python bindings, but does not provide specific version numbers for software libraries or dependencies like Python, PyTorch, or Rust.
Experiment Setup	Yes	Appendix A.5: HYPERPARAMETERS... Batch size: 96 tasks, Global iterations: 10 (CLEVR, LOGO), 16 (REGEX), Search timeouts: 600s (CLEVR), 1000s (REGEX), 1800s (LOGO)... Prompts per task: 4, Samples per prompt: 4, GPT Model: code-davinci-002, Temperature: 0.90, Max completion tokens β: 4.0x... Max usage examples: 10, GPT Model: gpt-3.5-turbo-0301 / gpt-4-0314, Top-P: 0.10, Max completion tokens: 256
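
For readers who want to carry the reported settings into their own scripts, the values in the Experiment Setup row can be collected into a small configuration object. The sketch below is illustrative only: the class name, field names, and grouping are hypothetical and are not taken from the LILO codebase, and only values that appear in the row above are included (elided settings are left out). The per-task sample count assumes every prompt is sampled independently.

```python
from dataclasses import dataclass

# Values copied from the Experiment Setup row above.
# All identifiers here are hypothetical, not names from the LILO repository.
BATCH_SIZE = 96  # tasks per batch
GLOBAL_ITERATIONS = {"CLEVR": 10, "LOGO": 10, "REGEX": 16}
SEARCH_TIMEOUT_S = {"CLEVR": 600, "REGEX": 1000, "LOGO": 1800}


@dataclass(frozen=True)
class SamplingConfig:
    """Reported LLM sampling settings (hypothetical container)."""
    prompts_per_task: int = 4
    samples_per_prompt: int = 4
    model: str = "code-davinci-002"
    temperature: float = 0.90

    def samples_per_task(self) -> int:
        # Assumes each prompt yields `samples_per_prompt` independent samples.
        return self.prompts_per_task * self.samples_per_prompt


if __name__ == "__main__":
    cfg = SamplingConfig()
    for domain, timeout in SEARCH_TIMEOUT_S.items():
        print(domain, GLOBAL_ITERATIONS[domain], timeout, cfg.samples_per_task())
```

Under these assumptions, 4 prompts with 4 samples each correspond to 16 sampled completions per task.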