Automated Statistical Model Discovery with Language Models

Authors: Michael Y. Li, Emily Fox, Noah Goodman

ICML 2024

Reproducibility checklist. Each entry lists the variable assessed, the result, and the supporting LLM response or evidence.

Research Type: Experimental
"We evaluate our method in three settings in probabilistic modeling: searching within a restricted space of models, searching over an open-ended space, and improving expert models under natural language constraints (e.g., this model should be interpretable to an ecologist). Our method identifies models on par with human expert designed models and extends classic models in interpretable ways. Our results highlight the promise of LM-driven model discovery."

Researcher Affiliation: Collaboration
1. Department of Computer Science, Stanford University; 2. Department of Statistics, Stanford University; 3. Chan Zuckerberg Biohub San Francisco; 4. Department of Psychology, Stanford University.

Pseudocode: Yes
Algorithm 1: Automated Model Discovery with LMs
Input: dataset D, number of rounds T, number of exemplars k, number of proposals per round m, (optional) warm-start example z0, a function score for scoring a program, (optional) a function criticize for producing natural language feedback.

    Z ← ∅
    while t < T do
        {z_i^t}_{i=1}^m ∼ q_LM(· | Z, z0, h_t, D)
        {s_i}_{i=1}^m ← score-all(score, {z_i^t}_{i=1}^m, D)
        Z ← select-exemplars(k, {z_i^t}_{i=1}^m, {s_i}_{i=1}^m)
        h_{t+1} ← criticize({z_i^t}_{i=1}^m, {s_i}_{i=1}^m, h_t)
    end while

A minimal runnable sketch of this loop appears after the checklist.

Open Source Code: No
The paper does not provide an explicit statement or link to open-source code for its own methodology.

Open Datasets: Yes
"We consider four real world datasets from the Stan Posterior DB dataset (Magnusson et al., 2023)."

Dataset Splits: No
The paper mentions "training datapoints" and "held-out test data" for different experiments but does not give specific train/validation/test splits (e.g., percentages or counts) or a splitting methodology that applies across all experiments.

Hardware Specification: No
The paper does not specify the hardware (e.g., CPU/GPU models, memory) used to run the experiments.

Software Dependencies: Yes
"In our experiments, we use GPT-4V (Achiam et al., 2023) (gpt-4-1106-preview), which has multimodal capabilities. We leverage pymc (Abril-Pla et al., 2023), a Python probabilistic programming library. We use diffrax (Kidger, 2021), a JAX-based ODE library."
A minimal pymc example is sketched after the checklist.

Experiment Setup: Yes
"We run our pipeline for two rounds with three proposals each round. We use a temperature of 0.2 for the Proposal LM and a temperature of 0.0 for the Critic LM. We use three in-context exemplars." and "We run a hyperparameter search over four widths (4, 8, 16, 32) and three depths (1, 2, 4). We use a learning rate of 3e-3 and train using full-batch gradient descent with Adam for 1500 iterations."
A sketch of this hyperparameter sweep follows the checklist.
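
To make the control flow of Algorithm 1 concrete, here is a minimal Python sketch of the propose-score-select-criticize loop. The helpers llm_propose, score_program, and criticize_fn are hypothetical stand-ins for the paper's LM and evaluation code (which is not released), not APIs from the paper.

    from typing import Callable, List, Optional

    def discover_models(
        dataset,
        rounds: int,                 # T: number of refinement rounds
        k: int,                      # number of exemplars kept per round
        m: int,                      # number of proposals per round
        llm_propose: Callable,       # stands in for q_LM(. | Z, z0, h_t, D)
        score_program: Callable,     # scores one candidate program on the data
        criticize_fn: Optional[Callable] = None,  # optional Critic LM feedback
        warm_start=None,             # optional warm-start example z0
    ):
        exemplars: List = []         # Z: current exemplar programs
        feedback = None              # h_t: natural-language feedback
        for _ in range(rounds):
            # Sample m candidate probabilistic programs from the proposal LM.
            proposals = [llm_propose(exemplars, warm_start, feedback, dataset)
                         for _ in range(m)]
            # Score every proposal on the dataset.
            scores = [score_program(z, dataset) for z in proposals]
            # Keep the k best-scoring programs as the next exemplar set.
            ranked = sorted(zip(proposals, scores), key=lambda p: p[1], reverse=True)
            exemplars = [z for z, _ in ranked[:k]]
            # Optionally ask the Critic LM for feedback on this round.
            if criticize_fn is not None:
                feedback = criticize_fn(proposals, scores, feedback)
        return exemplars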
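
For context on the pymc dependency, the following is a minimal sketch of the kind of probabilistic program the pipeline proposes and fits. The data, priors, and variable names here are purely illustrative assumptions, not taken from the paper.

    import numpy as np
    import pymc as pm

    # Toy regression data, purely illustrative.
    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 50)
    y = 2.0 * x + rng.normal(0.0, 0.1, size=50)

    with pm.Model():
        # Simple linear-regression program with weakly informative priors.
        slope = pm.Normal("slope", mu=0.0, sigma=1.0)
        intercept = pm.Normal("intercept", mu=0.0, sigma=1.0)
        noise = pm.HalfNormal("noise", sigma=1.0)
        pm.Normal("obs", mu=slope * x + intercept, sigma=noise, observed=y)
        # Posterior samples could then feed a score such as held-out log likelihood.
        idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)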
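
The width/depth sweep in the setup above can be read as a small grid search. Below is a sketch under stated assumptions: train_mlp is a hypothetical helper standing in for the paper's unreleased training code, and selecting by validation loss is an assumption.

    from itertools import product

    WIDTHS = (4, 8, 16, 32)
    DEPTHS = (1, 2, 4)
    LEARNING_RATE = 3e-3
    NUM_ITERATIONS = 1500  # full-batch Adam steps

    def grid_search(train_mlp, train_data, val_data):
        best = None
        for width, depth in product(WIDTHS, DEPTHS):
            # train_mlp is assumed to train one MLP configuration with
            # full-batch Adam and return (model, validation_loss).
            model, val_loss = train_mlp(
                train_data, val_data,
                width=width, depth=depth,
                lr=LEARNING_RATE, iterations=NUM_ITERATIONS,
            )
            if best is None or val_loss < best[0]:
                best = (val_loss, width, depth, model)
        return best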