Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Large Language Bayes
Authors: Justin Domke
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally, this produces sensible predictions from only data and an informal problem description, without the need to specify a formal model. ... Experiments illustrating that the final approximated posterior captures user intent and is typically better than taking a naive average of formal models. (Sec. 4) |
| Researcher Affiliation | Academia | Justin Domke University of Massachusetts Amherst |
| Pseudocode | Yes | Algorithm 1 Theoretical exact LLB algorithm (intractable) ... Algorithm 2 Suggested generic approximate LLB recipe. ... Algorithm 3 The variant of Alg. 2 used in the experiments of this paper. |
| Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: It is difficult to provide code runnable by a third party given the usage of a local cluster and many non-portable modifications. Every effort has been made to make the results reproducible from the given description. |
| Open Datasets | No | It is likely that all standard models and datasets are included in LLM training data. To avoid the risk that the LLM would simply remember human-written models, these experiments use all-new problems and datasets. ... Since this paper will presumably also be included in future LLM datasets, any future work in this direction should not use these problems for evaluation with any LLM with a knowledge cutoff after the date this paper was first made public, namely April 21, 2025. |
| Dataset Splits | No | The paper uses |
| Hardware Specification | Yes | Using a single A100 GPU, generating 1024 models took 10-15 minutes, depending on the problem. ... Models were generated using a single A100. |
| Software Dependencies | No | We experimented with using various LLMs to generate formal Bayesian models in various PPLs, including Stan [5], NumPyro [26] and PyMC [1]. LLMs seemed better at generating Stan code, perhaps since more Stan code is available and included in LLM datasets. ... Models were generated using Llama-3.3-70B [14, 21] with 4-bit AWQ quantization [19]. |
| Experiment Setup | Yes | For the problems below, 1024 models were generated using Llama-3.3-70B [14, 21] with 4-bit AWQ quantization [19]. ... For models that compiled, 2 chains of Stan’s implementation of NUTS [17] were run for 10,000 iterations each. ... The typical approach is to create some variational distribution q(z) (e.g. a Gaussian) and optimize it to maximize the lower-bound ... to provide the LLM with six examples of ostensible user inputs, along with high quality outputs (Appendix G). |
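
The Research Type row contrasts the final approximated posterior with "a naive average of formal models". The sketch below illustrates that contrast on made-up numbers. It assumes each generated model comes with an estimated log marginal likelihood and that models are weighted in proportion to it; this weighting rule, the function names, and all values are illustrative assumptions, not details confirmed by the excerpts above.

```python
import math

def normalized_weights(log_evidences):
    """Turn log marginal-likelihood estimates into weights that sum to 1.

    Subtracting the maximum before exponentiating (the log-sum-exp trick)
    avoids underflow when the log values are large and negative.
    """
    peak = max(log_evidences)
    unnormalized = [math.exp(value - peak) for value in log_evidences]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def weighted_mean(values, weights):
    """Average per-model estimates using the given weights."""
    return sum(v * w for v, w in zip(values, weights))

# Hypothetical inputs: three generated models, each with an estimated
# log marginal likelihood and a posterior mean for the quantity of interest.
log_evidences = [-120.0, -118.0, -135.0]
posterior_means = [0.40, 0.55, 0.90]

weights = normalized_weights(log_evidences)
naive_average = sum(posterior_means) / len(posterior_means)
weighted_average = weighted_mean(posterior_means, weights)

print("weights:", [round(w, 4) for w in weights])
print("naive average:", round(naive_average, 4))
print("weighted average:", round(weighted_average, 4))
```

In this toy case the third model fits the data far worse than the other two, so it receives almost no weight and the weighted estimate stays close to the two better-supported models, while the naive average is pulled toward the outlier.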