Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
LILO: Learning Interpretable Libraries by Compressing and Documenting Code
Authors: Gabriel Grand, Lionel Wong, Matthew Bowers, Theo X. Olausson, Muxin Liu, Joshua B. Tenenbaum, Jacob Andreas
ICLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate LILO on three inductive program synthesis benchmarks for string editing, scene reasoning, and graphics composition. Compared to existing methods, including the state-of-the-art library learning algorithm DreamCoder, LILO solves more complex tasks and learns richer libraries that are grounded in linguistic knowledge. |
| Researcher Affiliation | Academia | ¹MIT CSAIL, ²MIT Brain and Cognitive Sciences, ³Harvey Mudd College |
| Pseudocode | Yes | Algorithm 1 Library learning loop with LILO |
| Open Source Code | Yes | Code for this paper is available at: github.com/gabegrand/lilo. |
| Open Datasets | Yes | We evaluate LILO against a language-guided DreamCoder variant on three challenging program synthesis domains: string editing with regular expressions (Andreas et al., 2018), scene reasoning on the CLEVR dataset (Johnson et al., 2017), and graphics composition in the 2D Logo turtle graphics language (Abelson & di Sessa, 1986). |
| Dataset Splits | No | Performance is measured as the percentage of tasks solved from an i.i.d. test set. (Only train and test splits are explicitly mentioned, not validation). |
| Hardware Specification | Yes | We ran all experiments on AWS EC2 instances with machine specs tailored to suit the computational workload of each experiment. For experiments involving enumerative search... we ran on 96-CPU c5.24xlarge instances... these experiments are run on c5.2xlarge machines with 8 CPUs each. |
| Software Dependencies | No | The paper mentions specific LLM models used (OpenAI's Codex model (code-davinci-002), gpt-3.5-turbo, gpt-4) and the Stitch Python bindings, but does not provide specific version numbers for software libraries or dependencies like Python, PyTorch, or Rust. |
| Experiment Setup | Yes | Appendix A.5: HYPERPARAMETERS... Batch size: 96 tasks, Global iterations: 10 (CLEVR, LOGO), 16 (REGEX), Search timeouts: 600s (CLEVR), 1000s (REGEX), 1800s (LOGO)... Prompts per task: 4, Samples per prompt: 4, GPT Model: code-davinci-002, Temperature: 0.90, Max completion tokens β: 4.0x... Max usage examples: 10, GPT Model: gpt-3.5-turbo-0301 / gpt-4-0314, Top-P: 0.10, Max completion tokens: 256 |
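
The hyperparameters quoted in the Experiment Setup row can be gathered into a small configuration object, which may help when attempting a reproduction. The sketch below is only illustrative: the values are copied from the row above, but the three-way grouping, the class names, and every field name are assumptions made here, not identifiers from the LILO codebase. Settings elided by "..." in the row are left out rather than guessed.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SearchConfig:
    """Per-domain search settings (hypothetical grouping and field names)."""
    batch_size_tasks: int
    global_iterations: int
    search_timeout_s: int


@dataclass(frozen=True)
class SamplingConfig:
    """Settings listed alongside code-davinci-002 (hypothetical names)."""
    model: str
    prompts_per_task: int
    samples_per_prompt: int
    temperature: float
    # Reported as "Max completion tokens β: 4.0x"; what the multiplier
    # scales is elided in the row above, so it is stored as a bare factor.
    max_completion_tokens_beta: float

    def samples_per_task(self) -> int:
        # Simple product of the two reported counts.
        return self.prompts_per_task * self.samples_per_prompt


@dataclass(frozen=True)
class UsageExampleConfig:
    """Settings listed alongside gpt-3.5-turbo / gpt-4 (hypothetical names)."""
    models: Tuple[str, ...]
    max_usage_examples: int
    top_p: float
    max_completion_tokens: int


# Values as reported: batch size 96 for all domains; iterations and
# timeouts differ per domain.
SEARCH = {
    "CLEVR": SearchConfig(96, 10, 600),
    "REGEX": SearchConfig(96, 16, 1000),
    "LOGO": SearchConfig(96, 10, 1800),
}

SAMPLING = SamplingConfig(
    model="code-davinci-002",
    prompts_per_task=4,
    samples_per_prompt=4,
    temperature=0.90,
    max_completion_tokens_beta=4.0,
)

USAGE_EXAMPLES = UsageExampleConfig(
    models=("gpt-3.5-turbo-0301", "gpt-4-0314"),
    max_usage_examples=10,
    top_p=0.10,
    max_completion_tokens=256,
)

if __name__ == "__main__":
    for domain, cfg in SEARCH.items():
        print(domain, cfg.global_iterations, cfg.search_timeout_s)
    print("samples per task:", SAMPLING.samples_per_task())
```

Frozen dataclasses are used so that a reproduction script cannot silently mutate a reported value; any deviation from the published settings has to be made explicitly by constructing a new object.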