Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

The Fundamental Incompatibility of Scalable Hamiltonian Monte Carlo and Naive Data Subsampling

Author: Michael Betancourt

ICML 2015

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | When the full data are used, numerical trajectories generated by the second-order symplectic integrator constructed above closely follow the true trajectories (Figure 2a). Approximating the potential with a subsample of the data introduces the aforementioned bias, which shifts the stochastic trajectory away from the exact trajectory despite negligible error from the symplectic integrator itself (Figure 2b). Only when the size of each subsample approaches the full data set, and the computational benefit of subsampling fades, does the stochastic trajectory provide a reasonable approximation to the exact trajectory (Figure 2c). As a surrogate for the accuracy of the resulting samples I will use the average Metropolis acceptance probability of each new state using the full data.

Researcher Affiliation | Academia | Michael Betancourt EMAIL Department of Statistics, University of Warwick, Coventry, UK CV4 7AL

Pseudocode | No | The paper describes algorithms conceptually and mathematically but does not include any structured pseudocode or algorithm blocks.

Open Source Code | No | The paper does not provide any statement or link regarding the availability of open-source code for the methodology described.

Open Datasets | No | The paper mentions generating data: 'Here I take σ = 2, m = 0, s = 1, and generate N = 500 data points assuming µ = 1.' and refers to a 'multivariate generalization of (2)' where the true µd are sampled from µd ∼ N(0, 1). These are synthetic datasets generated for the experiments, not publicly available datasets with concrete access information.

Dataset Splits | No | The paper discusses evaluations using the 'full data' and a 'subsample' but does not specify conventional training, validation, or testing splits with percentages, sample counts, or citations to predefined splits.

Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments, such as GPU or CPU models or memory specifications.

Software Dependencies | No | The paper does not list any specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9) required for reproducibility.
Experiment Setup | Yes | Here I take σ = 2, m = 0, s = 1, and generate N = 500 data points assuming µ = 1. The data were partitioned into J = 25 batches of B = 20 data, the subsample used for each trajectory is randomly selected from the first five batches, and the step size of the subsampled trajectory is reduced by a factor of N/(5B) = 5 to equalize the computational cost with full-data trajectories.
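The quoted setup and the full-data acceptance surrogate can be sketched in a few lines. The following is a minimal illustration, not the paper's code: it assumes the one-dimensional conjugate Gaussian model suggested by the quoted parameters (data y_n ∼ N(µ, σ²) with prior µ ∼ N(m, s²)), fixes a single 100-point subsample rather than redrawing one per trajectory, and uses hypothetical step-size and trajectory-length values (`eps`, `L`, `n_traj`).

```python
import numpy as np

rng = np.random.default_rng(0)

# Gaussian model consistent with the quoted parameters (an assumption about
# the paper's example): y_n ~ N(mu, sigma^2), mu ~ N(m, s^2).
sigma, m, s, mu_true, N = 2.0, 0.0, 1.0, 1.0, 500
y = rng.normal(mu_true, sigma, size=N)

# Partition into J = 25 batches of B = 20; for simplicity this sketch fixes
# one subsample of the first five batches (100 points).
J, B = 25, 20
subsample = y.reshape(J, B)[:5].reshape(-1)

def U(mu, data, scale=1.0):
    # Potential energy (negative log posterior up to a constant);
    # `scale` reweights a subsampled likelihood to mimic the full data.
    return scale * np.sum((data - mu) ** 2) / (2 * sigma**2) + (mu - m) ** 2 / (2 * s**2)

def grad_U(mu, data, scale=1.0):
    return scale * np.sum(mu - data) / sigma**2 + (mu - m) / s**2

def leapfrog(mu, p, eps, n_steps, data, scale):
    # Second-order symplectic (leapfrog) integrator.
    p -= 0.5 * eps * grad_U(mu, data, scale)
    for _ in range(n_steps - 1):
        mu += eps * p
        p -= eps * grad_U(mu, data, scale)
    mu += eps * p
    p -= 0.5 * eps * grad_U(mu, data, scale)
    return mu, p

def avg_accept(eps, n_steps, data, scale, n_traj=200):
    # Average Metropolis acceptance of each proposed state, always evaluated
    # with the FULL data -- the paper's surrogate for sample accuracy.
    acc = []
    for _ in range(n_traj):
        mu0, p0 = 1.0, rng.normal()  # start near the posterior mode
        mu1, p1 = leapfrog(mu0, p0, eps, n_steps, data, scale)
        dH = (U(mu1, y) + 0.5 * p1**2) - (U(mu0, y) + 0.5 * p0**2)
        acc.append(np.exp(min(0.0, -dH)))  # min(1, exp(-dH)) without overflow
    return float(np.mean(acc))

# Hypothetical step size and length; the subsampled run shrinks the step size
# by N/(5B) = 5 and takes 5x more steps, equalizing cost as in the quoted setup.
eps, L = 0.05, 20
full = avg_accept(eps, L, y, 1.0)
sub = avg_accept(eps / 5, 5 * L, subsample, N / subsample.size)
print(f"full-data acceptance ~ {full:.3f}, subsampled ~ {sub:.3f}")
```

Because the subsampled trajectories conserve a biased Hamiltonian, their full-data acceptance probability typically falls below that of full-data trajectories, which is the incompatibility the paper quantifies.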