Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

G-Sim: Generative Simulations with Large Language Models and Gradient-Free Calibration

Authors: Samuel Holt, Max Ruiz Luyten, Antonin Berthon, Mihaela van der Schaar

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this section, we evaluate G-Sim to verify that it can generate simulators with higher fidelity than existing discovery or data-driven world models. Our experiments use both GFO and SBI for calibration. Benchmark Environments. We evaluate G-Sim on three real-world-inspired simulation tasks that together capture (1) stochastic transitions, (2) rich, discrete state updates, and (3) partially observed states. Each task provides a dataset of state-action trajectories and a textual description of the environment, sampled from a carefully hand-designed simulator. [...] We evaluated all benchmark methods across the three environments, with results tabulated in Table 2. G-Sim consistently achieves the lowest Wasserstein distance on the held-out test data, indicating that its generated simulators model the ground-truth system dynamics with the highest fidelity. The performance gap is particularly pronounced in the complex Hospital Bed Scheduling task, where data-driven methods struggle significantly.
Researcher Affiliation	Academia	Samuel Holt*¹, Max Ruiz Luyten*¹, Antonin Berthon¹, Mihaela van der Schaar¹. ¹University of Cambridge. Correspondence to: Samuel Holt <EMAIL>.
Pseudocode	Yes	Algorithm 1 G-Sim: High-Level Pseudocode. Require: Domain knowledge K (text descriptions, constraints), Training data D = {D(1), . . . , D(L)}, LLM with a prompt function PromptLLM(), Calibration engine CalibrateParams() (either GFO or SBI), Diagnostics function Diag(λ, ω; D), Maximum iterations G, patience for early stopping. Ensure: A fully calibrated simulator (λ, ω) minimizing the diagnostic score. [...] In the following we detail the full methodology for G-Sim, including pseudocode, training procedures, prompt templates, and diagnostics-driven refinement. Our approach builds on the framework described in Section 3 of the main paper.
Open Source Code	Yes	Code is available at https://github.com/samholt/generative-simulations and we provide a broader research group code base at https://github.com/vanderschaarlab/generative-simulations.
Open Datasets	No	Benchmark Environments. We evaluate G-Sim on three real-world-inspired simulation tasks that together capture (1) stochastic transitions, (2) rich, discrete state updates, and (3) partially observed states. Each task provides a dataset of state-action trajectories and a textual description of the environment, sampled from a carefully hand-designed simulator. [...] We generate state-action trajectories by simulating over a fixed horizon T: Initial state: We set inventory = 20, pipeline empty, backlog = 0, and t = 0. Policy: For simplicity, an agent might follow a simple reorder policy (e.g., (s, S) policy or a constant order) or an ε-greedy approach. Alternatively, actions can be random to promote exploration. Stochastic demand: Each day, demand is drawn from Poisson(λ_demand). After N = 100 simulated rollouts of T = 60, we collect (state_t, action_t, state_{t+1}) tuples to form a dataset.
Dataset Splits	Yes	Sampling procedure for dataset generation. To create training and evaluation datasets: [...] We repeat this process for N initial seeds, thereby obtaining N state-action trajectories of length T. We then split these trajectories into training, validation, and test sets (e.g., N_train = 100, N_val = 100, N_test = 100). With each trajectory, we store the transitions (s_t, a_t, s_{t+1}) for subsequent fitting and analysis. [...] We collect the resulting day-by-day trajectories of the state and produce train/validation/test splits for model calibration and evaluation.
Hardware Specification	Yes	All experiments and training were performed using a single Intel Core i9-12900K CPU @ 3.20GHz, 64GB RAM with an Nvidia RTX3090 GPU 24GB.
Software Dependencies	No	Implementation with EvoTorch. We implement the GFO step using the GeneticAlgorithm class from EvoTorch. [...] We use the Neural Posterior Estimation (NPE) algorithm from the sbi library. [...] The code generated should include the complete step function body in NumPy, fully functional, no placeholders.
Experiment Setup	Yes	Our key EvoTorch settings are: Population size: 200, Number of generations: 10, Search operators: SimulatedBinaryCrossOver with tournament size 4, crossover rate 1.0, and η = 8, GaussianMutation with standard deviation stdev=0.03. [...] Simulation Budget. Based on our implementation, we use a simulation budget of 1,000 simulations to train the SBI posterior estimator. [...] Hyperparameters for G-Sim. In our experiments, we typically use a maximum of 5 refinement loops, a patience of 3 for early stopping, a population size of 200 in evolutionary search, 10 generations, and a mutation rate of 0.03 for parameter changes.
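
The dataset-generation procedure quoted in the Open Datasets row (initial state, random actions, Poisson demand, N = 100 rollouts of T = 60) can be illustrated with a minimal sketch. This is not the authors' simulator: the transition rule in step(), the helper sample_poisson(), and the values of demand_rate, lead_time, and max_order are all hypothetical, chosen only to make the example runnable with the Python standard library.

```python
import math
import random


def sample_poisson(rng, lam):
    """Draw from Poisson(lam) with Knuth's method (adequate for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def step(state, order, demand, lead_time):
    """One simulated day of a toy inventory model.

    Hypothetical dynamics for illustration only; the paper's hand-designed
    simulator is not reproduced here.
    """
    inventory, pipeline, backlog, t = state
    pipeline = pipeline + (order,)            # today's order enters the pipeline
    arrived = 0
    if len(pipeline) > lead_time:             # oldest order arrives after the lag
        arrived, pipeline = pipeline[0], pipeline[1:]
    available = inventory + arrived
    owed = backlog + demand                   # unmet demand carries over
    shipped = min(available, owed)
    return (available - shipped, pipeline, owed - shipped, t + 1)


def rollout(seed, horizon=60, demand_rate=5.0, lead_time=2, max_order=10):
    """Simulate one trajectory and return its (state, action, next_state) tuples."""
    rng = random.Random(seed)
    state = (20, (), 0, 0)                    # inventory 20, empty pipeline, backlog 0, t 0
    transitions = []
    for _ in range(horizon):
        action = rng.randint(0, max_order)    # random actions to promote exploration
        demand = sample_poisson(rng, demand_rate)
        next_state = step(state, action, demand, lead_time)
        transitions.append((state, action, next_state))
        state = next_state
    return transitions


# N = 100 rollouts of horizon T = 60, one per seed.
dataset = [rollout(seed) for seed in range(100)]
```

Seeding each rollout separately keeps trajectories reproducible and lets whole trajectories, rather than individual transitions, be assigned to training, validation, and test sets.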