LILO: Learning Interpretable Libraries by Compressing and Documenting Code

Authors: Gabriel Grand, Lionel Wong, Matthew Bowers, Theo X. Olausson, Muxin Liu, Joshua B. Tenenbaum, Jacob Andreas

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate LILO on three inductive program synthesis benchmarks for string editing, scene reasoning, and graphics composition. Compared to existing methods, including the state-of-the-art library learning algorithm DreamCoder, LILO solves more complex tasks and learns richer libraries that are grounded in linguistic knowledge.
Researcher Affiliation | Academia | MIT CSAIL; MIT Brain and Cognitive Sciences; Harvey Mudd College
Pseudocode | Yes | Algorithm 1: Library learning loop with LILO (an illustrative sketch follows the table).
Open Source Code | Yes | Code for this paper is available at: github.com/gabegrand/lilo.
Open Datasets | Yes | We evaluate LILO against a language-guided DreamCoder variant on three challenging program synthesis domains: string editing with regular expressions (Andreas et al., 2018), scene reasoning on the CLEVR dataset (Johnson et al., 2017), and graphics composition in the 2D LOGO turtle graphics language (Abelson & diSessa, 1986).
Dataset Splits | No | Performance is measured as the percentage of tasks solved from an i.i.d. test set. (Only train and test splits are explicitly mentioned, not a validation split.)
Hardware Specification | Yes | We ran all experiments on AWS EC2 instances with machine specs tailored to suit the computational workload of each experiment. For experiments involving enumerative search... we ran on 96-CPU c5.24xlarge instances... these experiments are run on c5.2xlarge machines with 8 CPUs each.
Software Dependencies | No | The paper names the specific LLMs used (OpenAI's Codex model code-davinci-002, gpt-3.5-turbo, and gpt-4) and the Stitch Python bindings, but does not provide version numbers for software libraries or dependencies such as Python, PyTorch, or Rust.
Experiment Setup | Yes | Appendix A.5: Hyperparameters... Batch size: 96 tasks; global iterations: 10 (CLEVR, LOGO), 16 (REGEX); search timeouts: 600s (CLEVR), 1000s (REGEX), 1800s (LOGO)... Prompts per task: 4; samples per prompt: 4; GPT model: code-davinci-002; temperature: 0.90; max completion tokens β: 4.0x... Max usage examples: 10; GPT model: gpt-3.5-turbo-0301 / gpt-4-0314; top-p: 0.10; max completion tokens: 256. (A consolidated configuration sketch also follows the table.)
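
To make the "Pseudocode" row concrete, below is a minimal, illustrative Python sketch of what a LILO-style library learning loop (Algorithm 1) might look like, pieced together from the details quoted above: LLM-guided synthesis with Codex, enumerative search under a time budget, Stitch-based compression, and LLM documentation of new abstractions. All helper callables (llm_solver, enumerative_search, compress, autodoc) and the Library container are hypothetical stand-ins, not the authors' implementation or the actual Stitch API.

```python
from dataclasses import dataclass, field

# Hypothetical container for the evolving library of named, documented abstractions.
@dataclass
class Library:
    primitives: list = field(default_factory=list)
    abstractions: list = field(default_factory=list)

def lilo_loop(tasks, library, n_iterations, llm_solver, enumerative_search,
              compress, autodoc):
    """Illustrative sketch of a LILO-style library learning loop.

    The four callables are hypothetical stand-ins for:
      * llm_solver         -- LLM-guided program synthesis (e.g., code-davinci-002 prompts)
      * enumerative_search -- enumerative search under a per-domain time budget
      * compress           -- Stitch-style compression proposing reusable abstractions
      * autodoc            -- LLM documentation of new abstractions (names, docstrings)
    """
    solutions = {}  # task -> best program found so far
    for _ in range(n_iterations):
        # 1. Search: try to solve each task with the current library.
        for task in tasks:
            program = llm_solver(task, library) or enumerative_search(task, library)
            if program is not None:
                solutions[task] = program
        # 2. Compress: extract shared structure from solved programs as abstractions.
        new_abstractions = compress(list(solutions.values()))
        # 3. Document: give each abstraction a human-readable name and usage examples.
        for abstraction in new_abstractions:
            library.abstractions.append(autodoc(abstraction, solutions))
    return library, solutions
```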
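
For readability, the Appendix A.5 values quoted in the "Experiment Setup" row can be consolidated into a single configuration sketch. The key names below are ours, not the repository's, and the grouping of the second parameter block under "autodoc" is an assumption based on the paper's documentation step.

```python
# Illustrative consolidation of the Appendix A.5 hyperparameters quoted above.
# Key names are ours, not the repository's; values follow the quoted settings.
EXPERIMENT_CONFIG = {
    "batch_size": 96,  # tasks per iteration
    "global_iterations": {"CLEVR": 10, "LOGO": 10, "REGEX": 16},
    "search_timeout_s": {"CLEVR": 600, "REGEX": 1000, "LOGO": 1800},
    "llm_solver": {
        "model": "code-davinci-002",
        "prompts_per_task": 4,
        "samples_per_prompt": 4,
        "temperature": 0.90,
        "max_completion_tokens_beta": 4.0,  # multiplier, per the quoted "4.0x"
    },
    "autodoc": {  # assumption: the second parameter block concerns documentation
        "models": ["gpt-3.5-turbo-0301", "gpt-4-0314"],
        "max_usage_examples": 10,
        "top_p": 0.10,
        "max_completion_tokens": 256,
    },
}
```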