Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Large Language Bayes

Authors: Justin Domke

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimentally, this produces sensible predictions from only data and an informal problem description, without the need to specify a formal model. ... Experiments illustrating that the final approximated posterior captures user intent and is typically better than taking a naive average of formal models. (Sec. 4)
Researcher Affiliation | Academia | Justin Domke University of Massachusetts Amherst
Pseudocode | Yes | Algorithm 1 Theoretical exact LLB algorithm (intractable) ... Algorithm 2 Suggested generic approximate LLB recipe. ... Algorithm 3 The variant of Alg. 2 used in the experiments of this paper.
Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: It is difficult to provide code runnable by a third party given the usage of a local cluster and many non-portable modifications. Every effort has been made to make the results reproducible from the given description.
Open Datasets | No | It is likely that all standard models and datasets are included in LLM training data. To avoid the risk that the LLM would simply remember human-written models, these experiments use all-new problems and datasets. ... Since this paper will presumably also be included in future LLM datasets, any future work in this direction should not use these problems for evaluation with any LLM with a knowledge cutoff after the date this paper was first made public, namely April 21, 2025.
Dataset Splits | No | The paper uses
Hardware Specification | Yes | Using a single A100 GPU, generating 1024 models took 10-15 minutes, depending on the problem. ... Models were generated using a single A100.
Software Dependencies | No | We experimented with using various LLMs to generate formal Bayesian models in various PPLs, including Stan [5], NumPyro [26] and PyMC [1]. LLMs seemed better at generating Stan code, perhaps since more Stan code is available and included in LLM datasets. ... Models were generated using Llama-3.3-70B [14, 21] with 4-bit AWQ quantization [19].
Experiment Setup | Yes | For the problems below, 1024 models were generated using Llama-3.3-70B [14, 21] with 4-bit AWQ quantization [19]. ... For models that compiled, 2 chains of Stan’s implementation of NUTS [17] were run for 10,000 iterations each. ... The typical approach is to create some variational distribution q(z) (e.g. a Gaussian) and optimize it to maximize the lower-bound ... to provide the LLM with six examples of ostensible user inputs, along with high quality outputs (Appendix G).
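
The Research Type entry contrasts the final approximated posterior with "a naive average of formal models". The excerpts above do not say how the paper weights the generated models, so the following is only a minimal sketch under an assumption: each model's weight is taken proportional to an estimate of its marginal likelihood, as in standard Bayesian model averaging with a uniform prior over models. The function names and all numbers are hypothetical and are not taken from the paper.

```python
import math


def normalized_weights(log_marginals):
    """Turn per-model log marginal likelihood estimates into weights.

    ASSUMPTION: weights are proportional to exp(log marginal likelihood),
    i.e. a uniform prior over models. Subtracting the maximum before
    exponentiating avoids underflow for very negative values.
    """
    peak = max(log_marginals)
    unnormalized = [math.exp(value - peak) for value in log_marginals]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def weighted_average(estimates, weights):
    """Combine per-model estimates using the given weights."""
    return sum(w * e for w, e in zip(weights, estimates))


# Hypothetical inputs: one log marginal likelihood estimate and one
# posterior estimate of the same quantity for each of three models.
log_marginals = [-120.0, -118.0, -150.0]
estimates = [0.40, 0.50, 0.90]

weights = normalized_weights(log_marginals)
naive = sum(estimates) / len(estimates)         # every model counts equally
weighted = weighted_average(estimates, weights)  # poorly fitting models fade

print("weights: ", [round(w, 4) for w in weights])
print("naive:   ", round(naive, 4))
print("weighted:", round(weighted, 4))
```

In this toy setting the third model has a much lower marginal likelihood estimate, so its outlying estimate has almost no effect on the weighted result, while it pulls the naive average upward.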