Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Large Language Bayes

Authors: Justin Domke

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimentally, this produces sensible predictions from only data and an informal problem description, without the need to specify a formal model. ... Experiments illustrating that the final approximated posterior captures user intent and is typically better than taking a naive average of formal models. (Sec. 4)
Researcher Affiliation	Academia	Justin Domke University of Massachusetts Amherst
Pseudocode	Yes	Algorithm 1 Theoretical exact LLB algorithm (intractable) ... Algorithm 2 Suggested generic approximate LLB recipe. ... Algorithm 3 The variant of Alg. 2 used in the experiments of this paper.
Open Source Code	No	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: It is difficult to provide code runnable by a third party given the usage of a local cluster and many non-portable modifications. Every effort has been made to make the results reproducible from the given description.
Open Datasets	No	It is likely that all standard models and datasets are included in LLM training data. To avoid the risk that the LLM would simply remember human-written models, these experiments use all-new problems and datasets. ... Since this paper will presumably also be included in future LLM datasets, any future work in this direction should not use these problems for evaluation with any LLM with a knowledge cutoff after the date this paper was first made public, namely April 21, 2025.
Dataset Splits	No	The paper uses
Hardware Specification	Yes	Using a single A100 GPU, generating 1024 models took 10-15 minutes, depending on the problem. ... Models were generated using a single A100.
Software Dependencies	No	We experimented with using various LLMs to generate formal Bayesian models in various PPLs, including Stan [5], NumPyro [26] and PyMC [1]. LLMs seemed better at generating Stan code, perhaps since more Stan code is available and included in LLM datasets. ... Models were generated using Llama-3.3-70B [14, 21] with 4-bit AWQ quantization [19].
Experiment Setup	Yes	For the problems below, 1024 models were generated using Llama-3.3-70B [14, 21] with 4-bit AWQ quantization [19]. ... For models that compiled, 2 chains of Stan’s implementation of NUTS [17] were run for 10,000 iterations each. ... The typical approach is to create some variational distribution q(z) (e.g. a Gaussian) and optimize it to maximize the lower-bound ... to provide the LLM with six examples of ostensible user inputs, along with high quality outputs (Appendix G).
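
Illustration (not from the paper): the Research Type row quotes the claim that the final approximated posterior is "typically better than taking a naive average of formal models." The sketch below contrasts a naive average with an average in which each model is weighted by how well it explains the data. It assumes the weights are proportional to an estimated marginal likelihood per model, which this page does not state; the function names and all numbers are hypothetical.

```python
import math


def normalized_weights(log_evidences):
    """Turn per-model log marginal-likelihood estimates into weights summing to 1.

    Subtracting the maximum before exponentiating avoids underflow when the
    log values are large and negative.
    """
    top = max(log_evidences)
    unnormalized = [math.exp(value - top) for value in log_evidences]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def weighted_mean(values, weights):
    """Average of values under the given weights."""
    return sum(v * w for v, w in zip(values, weights))


# Hypothetical inputs: three generated models, each with an estimated log
# marginal likelihood and a posterior-mean prediction for the same quantity.
log_evidences = [-120.0, -100.0, -101.0]
model_means = [5.0, 1.0, 2.0]

weights = normalized_weights(log_evidences)
naive = sum(model_means) / len(model_means)   # every model counts equally
weighted = weighted_mean(model_means, weights)  # poorly fitting models are discounted

print(f"naive average:    {naive:.4f}")
print(f"weighted average: {weighted:.4f}")
```

In this toy case the first model fits the data far worse than the other two, so it receives almost no weight and barely affects the weighted prediction, whereas it pulls the naive average noticeably upward.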