Extracting Reward Functions from Diffusion Models

Authors: Felipe Nuti, Tim Franzmeyer, João F. Henriques

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically evaluate our reward learning method along three axes. In the Maze2D environments [12], we learn a reward function by comparing a base diffusion model trained on exploratory trajectories and an expert diffusion model trained on goal-directed trajectories, as illustrated in Figure 1. We can see in Figure 3 that our method learns the correct reward function for varying maze configurations. In the common locomotion environments Hopper, Half Cheetah, and Walker2D [12, 7], we learn a reward function by comparing a low-performance base model to an expert diffusion model and demonstrate that steering the base model with the learned reward function results in significantly improved performance. Beyond sequential decision-making, we learn a reward-like function by comparing a base image generation diffusion model (Stable Diffusion, [54]) to a safer version of Stable Diffusion [59]. Figure 2 shows that the learned reward function penalizes images with harmful content, such as violence and hate, while rewarding harmless images. (A hedged code sketch of this base-vs-expert comparison follows the table.)
Researcher Affiliation | Academia | Felipe Nuti, Tim Franzmeyer, João F. Henriques ({nuti, frtim, joao}@robots.ox.ac.uk), University of Oxford
Pseudocode | Yes | Algorithm 1: Relative reward function training. Algorithm 2: Relative reward function training with access only to diffusion models. (A sketch of the diffusion-models-only setting follows the table.)
Open Source Code | Yes | Video and Code at https://www.robots.ox.ac.uk/~vgg/research/reward-diffusion/
Open Datasets | Yes | In the Maze2D environments [12]... In the common locomotion environments Hopper, Half Cheetah, and Walker2D [12, 7]... we use the I2P prompt dataset introduced by Schramowski et al. [59]. (A dataset-loading sketch follows the table.)
Dataset Splits | No | A portion of the generated dataset containing an equal number of base and expert samples is set aside for model evaluation. We sample batches from the validation set. This indicates that a validation set was used, but the explicit split proportions or counts for train/validation/test needed to reproduce the data partitioning are not given. (An illustrative split helper follows the table.)
Hardware Specification | Yes | F.N. also used TPUs granted by the Google TPU Research Cloud (TRC) in the initial exploratory stages of the project. Reward functions took around 2 hours to train for 100,000 steps in the Maze environments, and around 4.5 hours for 50,000 steps in the Locomotion environments, also on an NVIDIA Tesla P40 GPU. For the Stable Diffusion experiment, it took around 50 minutes to run 6,000 training steps on an NVIDIA Tesla M40 GPU.
Software Dependencies | No | The paper mentions software components such as PyTorch [46], NumPy [18], and Hugging Face Diffusers [67], but it does not specify their version numbers. (A version-recording snippet follows the table.)
Experiment Setup | Yes | In Table 5 we indicate the learning rate and batch size used for training the reward functions, as well as the number of denoising timesteps of the diffusion models they are trained against. We report the number of training steps for the models used to generate the plots and numerical results in the main paper. We use Adam [46] as an optimizer, without weight decay. (An optimizer-setup sketch follows the table.)
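
The Research Type row describes learning a reward by comparing a base diffusion model with an expert one. The following is a minimal sketch, not the authors' code: it assumes the training objective matches the gradient of a learned reward network to the difference between the base and expert models' noise predictions, i.e., the direction that would steer the base model toward expert behavior under guidance. The network architecture, timestep conditioning, and scaling of the target are illustrative assumptions, and `eps_base` / `eps_expert` are hypothetical callables wrapping pretrained noise-prediction networks.

```python
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """r_psi(x_t, t): scalar relative-reward estimate for a noised sample x_t at timestep t."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, t):
        # Crude timestep conditioning (an assumption; the paper's architecture may differ).
        t_feat = t.float().unsqueeze(-1) / 1000.0
        return self.net(torch.cat([x, t_feat], dim=-1)).squeeze(-1)

def relative_reward_step(reward_net, eps_base, eps_expert, x_t, t, optimizer):
    """One update: align grad_x r(x_t, t) with eps_base(x_t, t) - eps_expert(x_t, t)."""
    x_t = x_t.detach().requires_grad_(True)
    r = reward_net(x_t, t).sum()
    # create_graph=True so the loss can be backpropagated into the reward network's weights.
    grad_r = torch.autograd.grad(r, x_t, create_graph=True)[0]
    with torch.no_grad():
        target = eps_base(x_t, t) - eps_expert(x_t, t)  # direction steering base toward expert
    loss = ((grad_r - target) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Whether the target should additionally be rescaled by the noise schedule is a detail left to the paper's Algorithm 1; the sketch only illustrates the gradient-matching idea.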
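
Algorithm 2 is named as requiring access only to diffusion models. A plausible reading, sketched below under that assumption, is that the noised inputs x_t are collected from the base model's own DDPM reverse process rather than from an offline dataset; `eps_base` is again a hypothetical noise-prediction callable and `betas` a standard noise schedule.

```python
import torch

@torch.no_grad()
def collect_denoising_states(eps_base, betas, shape, device="cpu"):
    """Run a standard DDPM reverse process with the base model and keep every (x_t, t) pair."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape, device=device)
    states = []
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        states.append((x.clone(), t_batch))
        eps = eps_base(x, t_batch)
        # Mean of the DDPM reverse step given the predicted noise.
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return states  # feed these (x_t, t) pairs to relative_reward_step above
```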
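
The datasets named in the Open Datasets row are distributed through D4RL (Maze2D and the locomotion tasks) and, for I2P, the Hugging Face Hub. The environment name and the I2P dataset ID below are assumptions chosen for illustration, not values given in the report.

```python
import gym
import d4rl  # registers the Maze2D and locomotion offline environments/datasets
from datasets import load_dataset

env = gym.make("maze2d-umaze-v1")    # one Maze2D variant; locomotion analogue: "hopper-medium-v2"
trajectories = env.get_dataset()     # dict with observations, actions, rewards, terminals

i2p = load_dataset("AIML-TUDA/i2p")  # I2P prompts (Schramowski et al. [59]); the ID is an assumption
print(trajectories["observations"].shape, i2p)
```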
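
Since the split is not reported, the helper below only illustrates the setup described in the Dataset Splits row, holding out an equal number of base and expert samples for validation; the 10% fraction is an assumption.

```python
import torch

def hold_out_equal(base_samples, expert_samples, val_frac=0.1, seed=0):
    """Set aside an equal number of base and expert samples for validation (fraction assumed)."""
    g = torch.Generator().manual_seed(seed)
    n_val = int(min(len(base_samples), len(expert_samples)) * val_frac)
    perm_b = torch.randperm(len(base_samples), generator=g)
    perm_e = torch.randperm(len(expert_samples), generator=g)
    val = (base_samples[perm_b[:n_val]], expert_samples[perm_e[:n_val]])
    train = (base_samples[perm_b[n_val:]], expert_samples[perm_e[n_val:]])
    return train, val
```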
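
Because exact library versions are not stated, anyone re-running the experiments may want to log the versions of the mentioned libraries alongside their results; a minimal snippet:

```python
import platform
import torch, numpy, diffusers

print("python   ", platform.python_version())
print("torch    ", torch.__version__)
print("numpy    ", numpy.__version__)
print("diffusers", diffusers.__version__)
```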
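
The only optimizer detail quoted in the Experiment Setup row is Adam without weight decay; the learning rate and batch size below are placeholders, since the actual values are listed in the paper's Table 5 and not reproduced here.

```python
import torch

reward_net = torch.nn.Linear(4, 1)  # stand-in for the reward network
optimizer = torch.optim.Adam(reward_net.parameters(), lr=1e-4, weight_decay=0.0)  # lr is a placeholder
loader = torch.utils.data.DataLoader(torch.randn(1024, 4), batch_size=64, shuffle=True)  # batch size is a placeholder
```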