The Fundamental Incompatibility of Scalable Hamiltonian Monte Carlo and Naive Data Subsampling
Authors: Michael Betancourt
ICML 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | When the full data are used, numerical trajectories generated by the second-order symplectic integrator constructed above closely follow the true trajectories (Figure 2a). Approximating the potential with a subsample of the data introduces the aforementioned bias, which shifts the stochastic trajectory away from the exact trajectory despite negligible error from the symplectic integrator itself (Figure 2b). Only when the size of each subsample approaches the full data set, and the computational benefit of subsampling fades, does the stochastic trajectory provide a reasonable approximation to the exact trajectory (Figure 2c). As a surrogate for the accuracy of the resulting samples I will use the average Metropolis acceptance probability of each new state using the full data. |
| Researcher Affiliation | Academia | Michael Betancourt BETANALPHA@GMAIL.COM Department of Statistics, University of Warwick, Coventry, UK CV4 7AL |
| Pseudocode | No | The paper describes algorithms conceptually and mathematically but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any statement or link regarding the availability of open-source code for the methodology described. |
| Open Datasets | No | The paper mentions generating data: 'Here I take σ = 2, m = 0, s = 1, and generate N = 500 data points assuming µ = 1.' and refers to a 'multivariate generalization of (2)' where the true µd are sampled from µd ∼ N(0, 1). These are synthetic datasets generated by the author, not publicly available datasets with concrete access information. |
| Dataset Splits | No | The paper discusses evaluations using 'full data' and 'subsample' but does not specify conventional training, validation, or testing dataset splits with percentages, sample counts, or citations to predefined splits. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used for running experiments, such as GPU or CPU models, or memory specifications. |
| Software Dependencies | No | The paper does not list any specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9) required for reproducibility. |
| Experiment Setup | Yes | Here I take σ = 2, m = 0, s = 1, and generate N = 500 data points assuming µ = 1. The data were partitioned into J = 25 batches of B = 20 data, the subsample used for each trajectory is randomly selected from the first five batches, and the step size of the subsampled trajectory is reduced by N/(J B) = 5 to equalize the computational cost with full data trajectories. |
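The quoted setup can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the conjugate Gaussian model (y_n ~ N(µ, σ²), prior µ ~ N(m, s²)), a standard leapfrog integrator, and the choice of the first five batches as the fixed subsample are all simplifying assumptions, and every function name is illustrative. The subsampled trajectory rescales the likelihood gradient by N over the subsample size and reduces the step size by the same factor, as described in the quote, and both trajectories are scored with the full-data Metropolis acceptance probability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Paper's Gaussian example: y_n ~ N(mu, sigma^2) with prior mu ~ N(m, s^2).
sigma, m, s, mu_true, N = 2.0, 0.0, 1.0, 1.0, 500
y = rng.normal(mu_true, sigma, N)

# Partition into J = 25 batches of B = 20; the subsample here is fixed to the
# first five batches (an assumption -- the paper redraws it per trajectory).
J, B = 25, 20
subsample = y[: 5 * B]
scale = N / subsample.size  # = 5: rescales the subsampled likelihood gradient

def potential(mu, data):
    """Negative log posterior (up to a constant) for the full data."""
    return np.sum((data - mu) ** 2) / (2 * sigma**2) + (mu - m) ** 2 / (2 * s**2)

def grad_U(mu, data, weight=1.0):
    """Gradient of the potential; `weight` rescales the likelihood term."""
    return -weight * np.sum(data - mu) / sigma**2 + (mu - m) / s**2

def leapfrog(mu, p, eps, L, data, weight=1.0):
    """Standard second-order (leapfrog) symplectic integrator, L full steps."""
    p = p - 0.5 * eps * grad_U(mu, data, weight)
    for _ in range(L - 1):
        mu = mu + eps * p
        p = p - eps * grad_U(mu, data, weight)
    mu = mu + eps * p
    p = p - 0.5 * eps * grad_U(mu, data, weight)
    return mu, -p  # momentum flip for reversibility

def accept_prob(mu0, p0, mu1, p1):
    """Metropolis acceptance probability using the *full-data* Hamiltonian."""
    H0 = potential(mu0, y) + 0.5 * p0**2
    H1 = potential(mu1, y) + 0.5 * p1**2
    return min(1.0, float(np.exp(H0 - H1)))

eps, L = 0.01, 100  # illustrative integrator settings, not from the paper
probs_full, probs_sub = [], []
mu = mu_true
for _ in range(100):
    p = rng.normal()
    # Full-data trajectory.
    mu1, p1 = leapfrog(mu, p, eps, L, y)
    probs_full.append(accept_prob(mu, p, mu1, p1))
    # Subsampled trajectory: step size reduced by `scale`, step count raised
    # by the same factor, so integration time and nominal cost match.
    mu2, p2 = leapfrog(mu, p, eps / scale, int(L * scale), subsample, weight=scale)
    probs_sub.append(accept_prob(mu, p, mu2, p2))

print(f"mean acceptance (full data):  {np.mean(probs_full):.3f}")
print(f"mean acceptance (subsampled): {np.mean(probs_sub):.3f}")
```

In this sketch the gradient bias from the fixed subsample shifts the stochastic trajectory away from the exact one, which shows up as a lower full-data acceptance probability than the full-data trajectories achieve, mirroring the surrogate accuracy measure quoted in the Research Type row.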