Automated Statistical Model Discovery with Language Models
Authors: Michael Y. Li, Emily Fox, Noah Goodman
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method in three settings in probabilistic modeling: searching within a restricted space of models, searching over an open-ended space, and improving expert models under natural language constraints (e.g., this model should be interpretable to an ecologist). Our method identifies models on par with human expert designed models and extends classic models in interpretable ways. Our results highlight the promise of LM-driven model discovery. |
| Researcher Affiliation | Collaboration | (1) Department of Computer Science, Stanford University; (2) Department of Statistics, Stanford University; (3) Chan Zuckerberg Biohub San Francisco; (4) Department of Psychology, Stanford University. |
| Pseudocode | Yes | Algorithm 1 (Automated Model Discovery with LMs). Input: dataset D, number of rounds T, number of exemplars k, number of proposals per round m, (optional) warm-start example z_0, function for scoring a program score, (optional) function for producing natural language feedback criticize, exemplar set Z. While t < T do: {z_i^t}_{i=1}^m ~ q_LM(· | Z, z_0, h_t, D); {s_i}_{i=1}^m ← score-all(score, {z_i^t}_{i=1}^m, D); Z ← select-exemplars(k, {z_i^t}_{i=1}^m, {s_i}_{i=1}^m); h_{t+1} ← criticize({z_i^t}_{i=1}^m, {s_i}_{i=1}^m, h_t); end while. (A minimal Python sketch of this loop appears after the table.) |
| Open Source Code | No | The paper does not provide an explicit statement about, or a link to, open-source code for its own methodology. |
| Open Datasets | Yes | We consider four real world datasets from the Stan Posterior DB dataset (Magnusson et al., 2023) |
| Dataset Splits | No | The paper mentions 'training datapoints' and 'held-out test data' for different experiments but does not provide specific train/validation/test dataset splits (e.g., percentages or counts) or a detailed splitting methodology applicable across all experiments. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory specifications) used for running the experiments. |
| Software Dependencies | Yes | In our experiments, we use GPT-4V (Achiam et al., 2023) (gpt-4-1106-preview), which has multimodal capabilities. We leverage pymc (Abril-Pla et al., 2023), a Python probabilistic programming library. We use diffrax (Kidger, 2021), a JAX-based ODE library. (A hedged pymc scoring sketch appears after the table.) |
| Experiment Setup | Yes | "We run our pipeline for two rounds with three proposals each round. We use a temperature of 0.2 for the Proposal LM and temperature of 0.0 for the Critic LM. We use three in-context exemplars." and "We run a hyperparameter search over four widths (4, 8, 16, 32) and 3 depths (1, 2, 4). We use a learning rate of 3e-3 and train using full-batch gradient descent with Adam for 1500 iterations." (A sketch of this hyperparameter grid appears after the table.) |
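
To make the control flow of Algorithm 1 concrete, here is a minimal Python sketch of the propose/score/select/criticize loop. The function names (`propose`, `score_program`, `criticize`) and the keep-the-k-best selection rule are illustrative assumptions, not the authors' released interface (no code is released; see the Open Source Code row).

```python
# Minimal sketch of the propose/score/select/criticize loop in Algorithm 1.
# The LM proposal, scoring, and critic calls are passed in as callables;
# their names and signatures are illustrative assumptions.
from typing import Callable, List, Optional, Tuple

def model_discovery_loop(
    dataset,
    num_rounds: int,                          # T in Algorithm 1
    k: int,                                   # exemplars kept per round
    m: int,                                   # proposals sampled per round
    propose: Callable[..., str],              # proposal LM: returns one program
    score_program: Callable[[str, object], float],
    criticize: Optional[Callable[..., str]] = None,
    warm_start: Optional[str] = None,         # z_0: optional warm-start example
) -> List[Tuple[str, float]]:
    exemplars: List[Tuple[str, float]] = []   # Z: (program, score) pairs
    feedback: Optional[str] = None            # h_t: natural-language critique
    for _ in range(num_rounds):
        # {z_i^t} ~ q_LM(· | Z, z_0, h_t, D): sample m candidate programs.
        proposals = [propose(exemplars, warm_start, feedback, dataset) for _ in range(m)]
        # {s_i} <- score-all(score, {z_i^t}, D): score every proposal.
        scores = [score_program(p, dataset) for p in proposals]
        # Z <- select-exemplars(k, {z_i^t}, {s_i}): keep the k best programs.
        exemplars = sorted(zip(proposals, scores), key=lambda ps: ps[1], reverse=True)[:k]
        # h_{t+1} <- criticize({z_i^t}, {s_i}, h_t): optional critic feedback.
        if criticize is not None:
            feedback = criticize(proposals, scores, feedback)
    return exemplars
```

In this sketch, exemplar selection ranks only the current round's proposals; the paper's select-exemplars step could equally maintain a running best-of set across rounds.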
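
Since the paper reports fitting proposed programs with pymc, the following sketch shows one plausible way a single candidate program could be scored. The concrete model (Gaussian linear regression), its priors, and the use of ArviZ's LOO-based expected log predictive density as the score are assumptions for illustration, not the paper's actual evaluation code.

```python
import numpy as np
import pymc as pm
import arviz as az

def score_candidate(x: np.ndarray, y: np.ndarray) -> float:
    """Fit one candidate program and return an assumed predictive-fit score."""
    with pm.Model():
        # Candidate model (here: Gaussian linear regression) standing in for
        # an LM-proposed probabilistic program.
        alpha = pm.Normal("alpha", mu=0.0, sigma=10.0)
        beta = pm.Normal("beta", mu=0.0, sigma=10.0)
        sigma = pm.HalfNormal("sigma", sigma=1.0)
        pm.Normal("obs", mu=alpha + beta * x, sigma=sigma, observed=y)
        idata = pm.sample(
            1000, tune=1000, chains=2, progressbar=False,
            idata_kwargs={"log_likelihood": True},
        )
    # Pareto-smoothed importance-sampling LOO as the program's score
    # (an assumed stand-in for whatever metric the authors used).
    return float(az.loo(idata).elpd_loo)

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=50)
print(score_candidate(x, y))
```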
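
The hyperparameter search quoted in the Experiment Setup row (four widths × three depths, Adam at 3e-3 for 1500 full-batch iterations) can be written as a simple grid; the `train_model` callable below is a hypothetical stand-in, since the paper does not specify the surrounding training code.

```python
# Hedged sketch of the reported hyperparameter grid: four widths x three depths,
# each configuration trained with full-batch Adam (learning rate 3e-3) for
# 1500 iterations. `train_model` is a hypothetical stand-in.
import itertools
from typing import Callable, Dict, Tuple

WIDTHS = [4, 8, 16, 32]
DEPTHS = [1, 2, 4]
LEARNING_RATE = 3e-3
NUM_ITERATIONS = 1500

def run_grid_search(train_model: Callable[..., float], dataset) -> Dict[Tuple[int, int], float]:
    results = {}
    for width, depth in itertools.product(WIDTHS, DEPTHS):
        results[(width, depth)] = train_model(
            dataset,
            width=width,
            depth=depth,
            learning_rate=LEARNING_RATE,
            num_iterations=NUM_ITERATIONS,   # full-batch gradient descent with Adam
        )
    return results
```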