Efficient Training of Energy-Based Models Using Jarzynski Equality

Authors: Davide Carbone, Mengjian Hua, Simon Coste, Eric Vanden-Eijnden

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We illustrate these results with numerical experiments on Gaussian mixture distributions as well as the MNIST and CIFAR-10 datasets.
Researcher Affiliation | Academia | Davide Carbone, Dipartimento di Scienze Matematiche, Politecnico di Torino, and Istituto Nazionale di Fisica Nucleare, Sezione di Torino (davide.carbone@polito.it); Mengjian Hua, Courant Institute of Mathematical Sciences, New York University (mh5113@nyu.edu); Simon Coste, LPSM, Université Paris-Cité (simon.coste@u-paris.fr); Eric Vanden-Eijnden, Courant Institute of Mathematical Sciences, New York University (eve2@nyu.edu)
Pseudocode | Yes | Algorithm 1: Sequential Monte-Carlo training with Jarzynski correction (a simplified sketch of this update is given after the table)
Open Source Code | Yes | The code used to perform these new experiments is available in the anonymized GitHub referenced in our paper. Up-to-date images are available at https://github.com/submissionx12/EBMs_Jarzynski.
Open Datasets | Yes | Next, we perform empirical experiments on the MNIST dataset to answer the following question: when it comes to high-dimensional datasets with multiple modes, can our method produce an EBM that generates high-quality samples and captures the relative weights of the modes accurately?
Dataset Splits | No | No dataset split information (exact percentages, sample counts, citations to predefined splits, or a detailed splitting methodology) needed to reproduce the data partitioning was found.
Hardware Specification | Yes | All the experiments were performed on a single A100 GPU.
Software Dependencies | No | No specific ancillary software details (e.g., library or solver names with version numbers) were found.
Experiment Setup | Yes | The hyperparameters are the same in all cases: we take N = 4096 Langevin walkers with a mini-batch size of 256. We use the Adam optimizer with learning rate η = 10^-4 and inject Gaussian noise of standard deviation σ = 3 × 10^-2 into the dataset, while performing gradient clipping in Langevin sampling for better performance. All the experiments were performed on a single A100 GPU. Training for 600 epochs took about 34 hours with the PCD algorithm (with and without data augmentation) and about 36 hours with our method.
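
To make the Pseudocode entry concrete, here is a minimal PyTorch sketch of one training update in the spirit of Algorithm 1: persistent Langevin walkers carry Jarzynski-style log-weights that are updated after every parameter change, and the model term of the cross-entropy gradient is estimated with the self-normalized weights. This is a simplification of ours, not the authors' implementation: it uses the standard annealed-importance-sampling weight increment, omits the paper's exact discrete-time (ULA) correction and the resampling step, and the function name, Langevin step size, and gradient-clip value are illustrative.

```python
import torch

def training_step(energy_net, optimizer, data_batch, walkers, log_w,
                  langevin_step=1e-4, noise_std=3e-2, grad_clip=1.0):
    """One simplified update in the spirit of Algorithm 1 (resampling omitted)."""
    # Unadjusted Langevin move of the persistent walkers under the current energy,
    # with gradient clipping as mentioned in the reported setup (clip value illustrative).
    walkers = walkers.detach().requires_grad_(True)
    grad_U, = torch.autograd.grad(energy_net(walkers).sum(), walkers)
    grad_U = grad_U.clamp(-grad_clip, grad_clip)
    with torch.no_grad():
        walkers = walkers - langevin_step * grad_U \
                  + (2.0 * langevin_step) ** 0.5 * torch.randn_like(walkers)

    # Cross-entropy gradient: data term minus self-normalized weighted model term.
    optimizer.zero_grad()
    noisy_data = data_batch + noise_std * torch.randn_like(data_batch)  # noise injection
    U_model = energy_net(walkers).view(-1)
    U_old = U_model.detach()                    # energies at the pre-update parameters
    weights = torch.softmax(log_w, dim=0)       # self-normalized Jarzynski weights
    loss = energy_net(noisy_data).mean() - (weights * U_model).sum()
    loss.backward()
    optimizer.step()                            # parameters change here

    # Jarzynski-style log-weight update: minus the "work" done by the parameter change.
    with torch.no_grad():
        log_w = log_w + U_old - energy_net(walkers).view(-1)

    return walkers.detach(), log_w
```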
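
As a usage illustration, the reported hyperparameters (N = 4096 walkers, mini-batch size 256, Adam with learning rate 10^-4, noise standard deviation 3 × 10^-2) would plug into the sketch above roughly as follows; the tiny two-dimensional energy network and random stand-in dataset are placeholders of ours, not the models used for MNIST or CIFAR-10.

```python
import torch

# Placeholder energy network and data; the paper's MNIST/CIFAR-10 models differ.
energy_net = torch.nn.Sequential(torch.nn.Linear(2, 128), torch.nn.SiLU(),
                                 torch.nn.Linear(128, 1))
optimizer = torch.optim.Adam(energy_net.parameters(), lr=1e-4)  # reported learning rate
walkers = torch.randn(4096, 2)      # N = 4096 persistent Langevin walkers
log_w = torch.zeros(4096)           # uniform initial Jarzynski log-weights
dataset = torch.randn(25600, 2)     # stand-in for the training set

for epoch in range(3):
    for batch in dataset.split(256):             # reported mini-batch size
        walkers, log_w = training_step(energy_net, optimizer, batch,
                                       walkers, log_w, noise_std=3e-2)
```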