Reward-Directed Conditional Diffusion: Provable Distribution Estimation and Reward Improvement
Authors: Hui Yuan, Kaixuan Huang, Chengzhuo Ni, Minshuo Chen, Mengdi Wang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide empirical results to validate our theory and highlight the relationship between the strength of extrapolation and the quality of generated samples. |
| Researcher Affiliation | Academia | Department of Electrical and Computer Engineering, Princeton University. Author emails are: {huiyuan, kaixuanh, cn10, mc0750, mengdiw}@princeton.edu |
| Pseudocode | Yes | Algorithm 1: Reward-Conditioned Generation via Diffusion Model (RCGDM). 1: Input: Datasets $\mathcal{D}_{\text{unlabel}}$, $\mathcal{D}_{\text{label}}$, target reward value $a$, early-stopping time $t_0$, noise level $\nu$. (Note: in the following pseudocode, $\phi_t(x)$ is the Gaussian density and $\eta$ is the step size of the discrete backward SDE; see Section 3.3 for elaborations on conditional diffusion.) 2: Reward Learning: Estimate the reward function by $\hat{f} \in \operatorname{argmin}_{f \in \mathcal{F}} \sum_{(x_i, y_i) \in \mathcal{D}_{\text{label}}} \ell(f(x_i), y_i)$ (3.1), where $\ell$ is a loss and $\mathcal{F}$ is a function class. 3: Pseudo labeling: Use the learned function $\hat{f}$ to evaluate the unlabeled data $\mathcal{D}_{\text{unlabel}}$ and augment it with pseudo labels: $\widetilde{\mathcal{D}} = \{(x_j, \hat{y}_j = \hat{f}(x_j) + \xi_j)\}_{j=1}^{n_1}$ with $\xi_j \overset{\text{i.i.d.}}{\sim} N(0, \nu^2)$. (An illustrative code sketch of steps 2–3 appears after the table.) |
| Open Source Code | Yes | Our code is available at https://github.com/Kaffaljidhmah2/RCGDM. |
| Open Datasets | Yes | Instead of training a diffusion model from scratch, we use Stable Diffusion v1.5 [39], pre-trained on the LAION dataset [42]. ... We start from an ImageNet [10] pre-trained ResNet-18 [17] model... We use the images and the corresponding outputs as the training dataset. We use the ground-truth reward model to compute a scalar output for each instance in the CIFAR-10 [24] training dataset |
| Dataset Splits | No | We use the ground-truth reward model to compute a scalar output for each instance in the CIFAR-10 [24] training dataset and perturb the output by adding Gaussian noise from N(0, 0.01). We use the images and the corresponding outputs as the training dataset. The paper does not specify train/validation/test splits for the datasets used in its experiments. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or cloud instance types) used for running its experiments. |
| Software Dependencies | No | We use the 1-dimensional version of the UNet [40] to approximate the score function. ... When training the score function, we choose Adam as the optimizer with learning rate $8 \times 10^{-5}$. The paper names the UNet architecture and the Adam optimizer but does not provide version numbers for these or for any other software libraries. |
| Experiment Setup | Yes | The predictor is trained using 8192 samples and the score function is trained using 65536 samples. When training the score function, we choose Adam as the optimizer with learning rate $8 \times 10^{-5}$. We train the score function for 10 epochs, each epoch doing a full iteration over the whole training dataset with batch size 32. (A minimal training-loop sketch using these settings appears after the table.) |
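
The following Python sketch illustrates the reward-learning and pseudo-labeling steps quoted in the Pseudocode row (steps 2–3 of Algorithm 1). It is not the authors' implementation: the data dimensions, the two-layer regression network `f_hat`, the squared loss, and the noise level `nu` are illustrative placeholders, chosen only to make the excerpt concrete.

```python
# Illustrative sketch (not the authors' code) of Algorithm 1, steps 2-3:
# reward learning on D_label, then pseudo-labeling of D_unlabel.
import torch
import torch.nn as nn

torch.manual_seed(0)

# --- Toy stand-ins for D_label = {(x_i, y_i)} and D_unlabel = {x_j} ---
dim = 16
x_label = torch.randn(512, dim)
y_label = x_label @ torch.randn(dim, 1)          # hypothetical ground-truth rewards
x_unlabel = torch.randn(2048, dim)

# --- Step 2: Reward Learning -- fit f_hat by minimizing a squared loss over D_label ---
f_hat = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(f_hat.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(f_hat(x_label), y_label)   # l(f(x_i), y_i)
    loss.backward()
    opt.step()

# --- Step 3: Pseudo labeling -- label D_unlabel with f_hat plus N(0, nu^2) noise ---
nu = 0.1
with torch.no_grad():
    y_pseudo = f_hat(x_unlabel) + nu * torch.randn(x_unlabel.shape[0], 1)
d_tilde = (x_unlabel, y_pseudo)   # augmented dataset D_tilde for the conditional diffusion model
print(loss.item(), y_pseudo.shape)
```

In the paper, the augmented dataset $\widetilde{\mathcal{D}}$ is then used to train the reward-conditioned diffusion model; that stage is omitted here.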
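
The sketch below mirrors the reported score-training setup (Adam with learning rate $8 \times 10^{-5}$, 10 epochs, batch size 32, 65536 training samples). The small MLP `score_net` stands in for the paper's 1-dimensional UNet, and the epsilon-prediction loss and forward corruption schedule are assumptions made only to keep the example self-contained.

```python
# Minimal training-loop sketch matching the reported settings; the network and
# objective are stand-ins, not the paper's implementation.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

dim, n_samples = 64, 65536                                   # reduce n_samples for a quick run
data = TensorDataset(torch.randn(n_samples, dim))            # placeholder for the real training data
loader = DataLoader(data, batch_size=32, shuffle=True)       # batch size 32, as reported

score_net = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))
opt = torch.optim.Adam(score_net.parameters(), lr=8e-5)      # learning rate from the paper

for epoch in range(10):                                      # 10 epochs, full pass each
    for (x0,) in loader:
        t = torch.rand(x0.shape[0], 1)                       # diffusion time in (0, 1)
        noise = torch.randn_like(x0)
        # assumed variance-preserving corruption: x_t = sqrt(a_t) x_0 + sqrt(1 - a_t) eps
        a_t = torch.exp(-t)
        x_t = a_t.sqrt() * x0 + (1 - a_t).sqrt() * noise
        pred = score_net(torch.cat([x_t, t], dim=1))
        loss = nn.functional.mse_loss(pred, noise)           # epsilon-prediction surrogate for score matching
        opt.zero_grad()
        loss.backward()
        opt.step()
```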