LILO: Learning Interpretable Libraries by Compressing and Documenting Code
Authors: Gabriel Grand, Lionel Wong, Matthew Bowers, Theo X. Olausson, Muxin Liu, Joshua B. Tenenbaum, Jacob Andreas
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate LILO on three inductive program synthesis benchmarks for string editing, scene reasoning, and graphics composition. Compared to existing methods, including the state-of-the-art library learning algorithm DreamCoder, LILO solves more complex tasks and learns richer libraries that are grounded in linguistic knowledge. |
| Researcher Affiliation | Academia | ¹MIT CSAIL, ²MIT Brain and Cognitive Sciences, ³Harvey Mudd College |
| Pseudocode | Yes | Algorithm 1: Library learning loop with LILO (a hedged sketch of this loop appears below the table). |
| Open Source Code | Yes | Code for this paper is available at: github.com/gabegrand/lilo. |
| Open Datasets | Yes | We evaluate LILO against a language-guided DreamCoder variant on three challenging program synthesis domains: string editing with regular expressions (Andreas et al., 2018), scene reasoning on the CLEVR dataset (Johnson et al., 2017), and graphics composition in the 2D Logo turtle graphics language (Abelson & diSessa, 1986). |
| Dataset Splits | No | Performance is measured as the percentage of tasks solved from an i.i.d. test set. (Only train and test splits are explicitly mentioned, not validation). |
| Hardware Specification | Yes | We ran all experiments on AWS EC2 instances with machine specs tailored to suit the computational workload of each experiment. For experiments involving enumerative search... we ran on 96-CPU c5.24xlarge instances... these experiments are run on c5.2xlarge machines with 8 CPUs each. |
| Software Dependencies | No | The paper mentions the specific LLM models used (OpenAI's Codex model (code-davinci-002), gpt-3.5-turbo, gpt-4) and the Stitch Python bindings, but does not provide version numbers for software dependencies such as Python, PyTorch, or Rust. |
| Experiment Setup | Yes | Appendix A.5: Hyperparameters... Batch size: 96 tasks; Global iterations: 10 (CLEVR, LOGO), 16 (REGEX); Search timeouts: 600s (CLEVR), 1000s (REGEX), 1800s (LOGO)... LLM solver: Prompts per task: 4; Samples per prompt: 4; GPT model: code-davinci-002; Temperature: 0.90; Max completion tokens (β): 4.0×... AutoDoc: Max usage examples: 10; GPT model: gpt-3.5-turbo-0301 / gpt-4-0314; Top-P: 0.10; Max completion tokens: 256 (gathered into the config sketch below the table). |
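
To make the pseudocode row concrete, here is a minimal, runnable Python sketch of the library learning loop that Algorithm 1 describes. It is an illustration under assumptions, not the authors' implementation (see github.com/gabegrand/lilo): the three stub functions stand in for LILO's real components, namely LLM-guided plus enumerative program search, Stitch compression, and LLM auto-documentation (AutoDoc). All names here are hypothetical.

```python
"""Minimal sketch of LILO's library learning loop (Algorithm 1).

All components are simplified stand-ins for the real system described
in the paper; nothing here calls an LLM or a compressor.
"""

from dataclasses import dataclass, field


@dataclass
class Abstraction:
    body: str            # lambda-calculus body found by compression
    name: str = ""       # human-readable name assigned by AutoDoc
    docstring: str = ""  # usage documentation assigned by AutoDoc


@dataclass
class Library:
    abstractions: list[Abstraction] = field(default_factory=list)


def search_programs(task, library):
    """Stand-in for dual synthesis: in LILO, an LLM solver samples candidate
    programs alongside a timed enumerative search, both conditioned on the
    current library; only verified task solutions are returned."""
    return []  # placeholder: no real search here


def compress(solutions):
    """Stand-in for compression: LILO uses the Stitch compressor to extract
    reusable abstractions from the corpus of solved programs."""
    return []  # placeholder: no real compression here


def autodoc(abstraction, solutions):
    """Stand-in for AutoDoc: an LLM names and documents each abstraction
    from a bounded set of usage examples."""
    abstraction.name = f"fn_{hash(abstraction.body) % 1000}"
    abstraction.docstring = "generated by an LLM in the real system"
    return abstraction


def lilo_loop(tasks, library, iterations=10):
    """One LILO run: alternate synthesis, compression, and documentation."""
    solutions = {}
    for _ in range(iterations):
        # (1) Synthesize: solve as many tasks as possible with the library.
        for task in tasks:
            solutions[task] = search_programs(task, library)
        # (2) Compress: refactor solutions into candidate abstractions.
        candidates = compress(solutions)
        # (3) Document: name each abstraction so both the LLM solver and
        # human readers can interpret and reuse it in later iterations.
        library.abstractions += [autodoc(a, solutions) for a in candidates]
    return library
```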
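For reference, the values quoted in the Experiment Setup row can be gathered into a single configuration. The dictionary below is a hypothetical arrangement: the key names and the grouping into solver and AutoDoc settings are my own, while the values are taken verbatim from Appendix A.5 as quoted above.

```python
# Hyperparameters reported in Appendix A.5, collected into one config dict.
# Keys and grouping are illustrative; values are the paper's.
LILO_HPARAMS = {
    "batch_size": 96,  # tasks per iteration
    "global_iterations": {"CLEVR": 10, "LOGO": 10, "REGEX": 16},
    "search_timeout_s": {"CLEVR": 600, "REGEX": 1000, "LOGO": 1800},
    "llm_solver": {
        "prompts_per_task": 4,
        "samples_per_prompt": 4,
        "model": "code-davinci-002",
        "temperature": 0.90,
        "max_completion_tokens_multiplier": 4.0,  # the paper's β: 4.0x
    },
    "autodoc": {
        "max_usage_examples": 10,
        "models": ["gpt-3.5-turbo-0301", "gpt-4-0314"],
        "top_p": 0.10,
        "max_completion_tokens": 256,
    },
}
```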