Automated Statistical Model Discovery with Language Models
Authors: Michael Y. Li, Emily Fox, Noah Goodman
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method in three settings in probabilistic modeling: searching within a restricted space of models, searching over an open-ended space, and improving expert models under natural language constraints (e.g., this model should be interpretable to an ecologist). Our method identifies models on par with human expert designed models and extends classic models in interpretable ways. Our results highlight the promise of LM-driven model discovery. |
| Researcher Affiliation | Collaboration | (1) Department of Computer Science, Stanford University; (2) Department of Statistics, Stanford University; (3) Chan Zuckerberg Biohub San Francisco; (4) Department of Psychology, Stanford University. |
| Pseudocode | Yes | Algorithm 1 (Automated Model Discovery with LMs). Input: dataset D, number of rounds T, number of exemplars k, number of proposals per round m, (optional) warm-start example z_0, function for scoring a program score, (optional) function for producing natural language feedback criticize, exemplar set Z. While t < T do: {z_i^t}_{i=1}^m ~ q_LM(· | Z, z_0, h_t, D); {s_i}_{i=1}^m ← score-all(score, {z_i^t}_{i=1}^m, D); Z ← select-exemplars(k, {z_i^t}_{i=1}^m, {s_i}_{i=1}^m); h_{t+1} ← criticize({z_i^t}_{i=1}^m, {s_i}_{i=1}^m, h_t); end while. (A minimal Python sketch of this loop appears after the table.) |
| Open Source Code | No | The paper does not provide an explicit statement about, or a link to, open-source code for its own methodology. |
| Open Datasets | Yes | We consider four real world datasets from the Stan Posterior DB dataset (Magnusson et al., 2023) |
| Dataset Splits | No | The paper mentions 'training datapoints' and 'held-out test data' for different experiments but does not provide specific train/validation/test dataset splits (e.g., percentages or counts) or a detailed splitting methodology applicable across all experiments. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory specifications) used for running the experiments. |
| Software Dependencies | Yes | In our experiments, we use GPT-4V (Achiam et al., 2023) (gpt-4-1106-preview), which has multimodal capabilities. We leverage pymc (Abril-Pla et al., 2023), a Python probabilistic programming library. We use diffrax (Kidger, 2021), a JAX-based ODE library. (A hedged pymc scoring sketch appears after the table.) |
| Experiment Setup | Yes | "We run our pipeline for two rounds with three proposals each round. We use a temperature of 0.2 for the Proposal LM and temperature of 0.0 for the Critic LM. We use three in-context exemplars." and "We run a hyperparameter search over four widths (4, 8, 16, 32) and 3 depths (1, 2, 4). We use a learning rate of 3e-3 and train using full-batch gradient descent with Adam for 1500 iterations." (A sketch of this hyperparameter grid appears after the table.) |
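
To make the control flow of Algorithm 1 concrete, here is a minimal Python sketch of the propose/score/select/criticize loop. The function names (`propose`, `score_program`, `criticize`) and the keep-the-k-best selection rule are illustrative assumptions, not the authors' released interface (no code is released; see the Open Source Code row).

```python
# Minimal sketch of the propose/score/select/criticize loop in Algorithm 1.
# The LM proposal, scoring, and critic calls are passed in as callables;
# their names and signatures are illustrative assumptions.
from typing import Callable, List, Optional, Tuple

def model_discovery_loop(
    dataset,
    num_rounds: int,                          # T in Algorithm 1
    k: int,                                   # exemplars kept per round
    m: int,                                   # proposals sampled per round
    propose: Callable[..., str],              # proposal LM: returns one program
    score_program: Callable[[str, object], float],
    criticize: Optional[Callable[..., str]] = None,
    warm_start: Optional[str] = None,         # z_0: optional warm-start example
) -> List[Tuple[str, float]]:
    exemplars: List[Tuple[str, float]] = []   # Z: (program, score) pairs
    feedback: Optional[str] = None            # h_t: natural-language critique
    for _ in range(num_rounds):
        # {z_i^t} ~ q_LM(· | Z, z_0, h_t, D): sample m candidate programs.
        proposals = [propose(exemplars, warm_start, feedback, dataset) for _ in range(m)]
        # {s_i} <- score-all(score, {z_i^t}, D): score every proposal.
        scores = [score_program(p, dataset) for p in proposals]
        # Z <- select-exemplars(k, {z_i^t}, {s_i}): keep the k best programs.
        exemplars = sorted(zip(proposals, scores), key=lambda ps: ps[1], reverse=True)[:k]
        # h_{t+1} <- criticize({z_i^t}, {s_i}, h_t): optional critic feedback.
        if criticize is not None:
            feedback = criticize(proposals, scores, feedback)
    return exemplars
```

In this sketch, exemplar selection ranks only the current round's proposals; the paper's select-exemplars step could equally maintain a running best-of set across rounds.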
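
Since the paper reports fitting proposed programs with pymc, the following sketch shows one plausible way a single candidate program could be scored. The concrete model (Gaussian linear regression), its priors, and the use of ArviZ's LOO-based expected log predictive density as the score are assumptions for illustration, not the paper's actual evaluation code.

```python
import numpy as np
import pymc as pm
import arviz as az

def score_candidate(x: np.ndarray, y: np.ndarray) -> float:
    """Fit one candidate program and return an assumed predictive-fit score."""
    with pm.Model():
        # Candidate model (here: Gaussian linear regression) standing in for
        # an LM-proposed probabilistic program.
        alpha = pm.Normal("alpha", mu=0.0, sigma=10.0)
        beta = pm.Normal("beta", mu=0.0, sigma=10.0)
        sigma = pm.HalfNormal("sigma", sigma=1.0)
        pm.Normal("obs", mu=alpha + beta * x, sigma=sigma, observed=y)
        idata = pm.sample(
            1000, tune=1000, chains=2, progressbar=False,
            idata_kwargs={"log_likelihood": True},
        )
    # Pareto-smoothed importance-sampling LOO as the program's score
    # (an assumed stand-in for whatever metric the authors used).
    return float(az.loo(idata).elpd_loo)

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=50)
print(score_candidate(x, y))
```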
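
The hyperparameter search quoted in the Experiment Setup row (four widths × three depths, Adam at 3e-3 for 1500 full-batch iterations) can be written as a simple grid; the `train_model` callable below is a hypothetical stand-in, since the paper does not specify the surrounding training code.

```python
# Hedged sketch of the reported hyperparameter grid: four widths x three depths,
# each configuration trained with full-batch Adam (learning rate 3e-3) for
# 1500 iterations. `train_model` is a hypothetical stand-in.
import itertools
from typing import Callable, Dict, Tuple

WIDTHS = [4, 8, 16, 32]
DEPTHS = [1, 2, 4]
LEARNING_RATE = 3e-3
NUM_ITERATIONS = 1500

def run_grid_search(train_model: Callable[..., float], dataset) -> Dict[Tuple[int, int], float]:
    results = {}
    for width, depth in itertools.product(WIDTHS, DEPTHS):
        results[(width, depth)] = train_model(
            dataset,
            width=width,
            depth=depth,
            learning_rate=LEARNING_RATE,
            num_iterations=NUM_ITERATIONS,   # full-batch gradient descent with Adam
        )
    return results
```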