Reward-Directed Conditional Diffusion: Provable Distribution Estimation and Reward Improvement

Authors: Hui Yuan, Kaixuan Huang, Chengzhuo Ni, Minshuo Chen, Mengdi Wang

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide empirical results to validate our theory and highlight the relationship between the strength of extrapolation and the quality of generated samples.
Researcher Affiliation | Academia | Department of Electrical and Computer Engineering, Princeton University. Author emails are: {huiyuan, kaixuanh, cn10, mc0750, mengdiw}@princeton.edu
Pseudocode | Yes | Algorithm 1: Reward-Conditioned Generation via Diffusion Model (RCGDM). 1: Input: datasets $\mathcal{D}_{\text{unlabel}}$ and $\mathcal{D}_{\text{label}}$, target reward value $a$, early-stopping time $t_0$, noise level $\nu$. (Note: in the following pseudocode, $\phi_t(x)$ is the Gaussian density and $\eta$ is the step size of the discretized backward SDE; see Section 3.3 for elaborations on conditional diffusion.) 2: Reward learning: estimate the reward function by $\hat{f} \in \arg\min_{f \in \mathcal{F}} \sum_{(x_i, y_i) \in \mathcal{D}_{\text{label}}} \ell(f(x_i), y_i)$ (3.1), where $\ell$ is a loss and $\mathcal{F}$ is a function class. 3: Pseudo labeling: use the learned $\hat{f}$ to evaluate the unlabeled data $\mathcal{D}_{\text{unlabel}}$ and augment it with pseudo labels $\widetilde{\mathcal{D}} = \{(x_j, \hat{y}_j = \hat{f}(x_j) + \xi_j)\}_{j=1}^{n_1}$, where $\xi_j \overset{\text{i.i.d.}}{\sim} N(0, \nu^2)$. (A hedged Python sketch of steps 2-3 appears after the table.)
Open Source Code | Yes | Our code is available at https://github.com/Kaffaljidhmah2/RCGDM.
Open Datasets | Yes | Instead of training a diffusion model from scratch, we use Stable Diffusion v1.5 [39], pre-trained on the LAION dataset [42]. ... We start from an ImageNet [10] pre-trained ResNet-18 [17] model... We use the images and the corresponding outputs as the training dataset. We use the ground-truth reward model to compute a scalar output for each instance in the CIFAR-10 [24] training dataset
Dataset Splits | No | We use the ground-truth reward model to compute a scalar output for each instance in the CIFAR-10 [24] training dataset and perturb the output by adding Gaussian noise from N(0, 0.01). We use the images and the corresponding outputs as the training dataset. The paper does not specify train/validation/test splits for the datasets used in its experiments.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or cloud instance types) used for running its experiments.
Software Dependencies | No | We use the 1-dimensional version of the UNet [40] to approximate the score function. ... When training the score function, we choose Adam as the optimizer with learning rate 8 × 10⁻⁵. The paper mentions software like UNet and the Adam optimizer but does not provide specific version numbers for these or any other software libraries.
Experiment Setup | Yes | The predictor is trained using 8192 samples and the score function is trained using 65536 samples. When training the score function, we choose Adam as the optimizer with learning rate 8 × 10⁻⁵. We train the score function for 10 epochs, each epoch doing a full iteration over the whole training dataset with batch size 32.
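
For orientation, below is a minimal Python/PyTorch sketch of steps 2-3 of Algorithm 1 (reward learning and pseudo labeling) as quoted in the Pseudocode row. The function names, the squared loss, and the optimizer settings here are illustrative assumptions, not taken from the released RCGDM code.

```python
import torch
import torch.nn as nn


def learn_reward(f_hat: nn.Module, labeled_loader, epochs: int = 10, lr: float = 1e-3):
    """Step 2 (reward learning): fit f_hat on D_label by minimizing a squared loss, as in Eq. (3.1)."""
    optimizer = torch.optim.Adam(f_hat.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for x, y in labeled_loader:
            optimizer.zero_grad()
            loss = loss_fn(f_hat(x).squeeze(-1), y)
            loss.backward()
            optimizer.step()
    return f_hat


@torch.no_grad()
def pseudo_label(f_hat: nn.Module, x_unlabeled: torch.Tensor, nu: float):
    """Step 3 (pseudo labeling): label each unlabeled x with y_hat = f_hat(x) + xi, xi ~ N(0, nu^2)."""
    y_hat = f_hat(x_unlabeled).squeeze(-1) + nu * torch.randn(x_unlabeled.shape[0])
    return x_unlabeled, y_hat
```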
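
Similarly, the score-model training described in the Experiment Setup row (Adam, learning rate 8 × 10⁻⁵, batch size 32, 10 full-pass epochs) could be wired up as in the sketch below. The 1-D UNet `score_net` and the `denoising_loss` callable are assumed placeholders; the authors' exact objective and architecture are in the released code.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def train_score_model(score_net, denoising_loss, x_pseudo, y_pseudo,
                      epochs=10, batch_size=32, lr=8e-5, device="cuda"):
    """Train a conditional score network with the hyperparameters quoted above."""
    loader = DataLoader(TensorDataset(x_pseudo, y_pseudo), batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(score_net.parameters(), lr=lr)
    score_net.to(device).train()
    for _ in range(epochs):  # 10 epochs, each a full iteration over the training set
        for x, y in loader:
            optimizer.zero_grad()
            # denoising_loss is assumed to sample a diffusion time, corrupt x, and
            # compare the predicted conditional score against the denoising target
            loss = denoising_loss(score_net, x.to(device), y.to(device))
            loss.backward()
            optimizer.step()
    return score_net
```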