Decision Stacks: Flexible Reinforcement Learning via Modular Generative Models

Authors: Siyan Zhao, Aditya Grover

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical results demonstrate the effectiveness of Decision Stacks for offline policy optimization for several MDP and POMDP environments, outperforming existing methods and enabling flexible generative decision making.
Researcher Affiliation | Academia | Siyan Zhao, Department of Computer Science, University of California, Los Angeles, siyanz@cs.ucla.edu; Aditya Grover, Department of Computer Science, University of California, Los Angeles, adityag@cs.ucla.edu
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | The project website and code can be found here: https://siyan-zhao.github.io/decision-stacks/
Open Datasets | Yes | We experiment with D4RL environments and parameterize Decision Stacks with a diffusion-based observation model, an autoregressive transformer-based reward model, and an autoregressive transformer-based action model.
Dataset Splits | No | The paper uses D4RL environments, but no explicit mention of training, validation, and test splits (e.g., percentages or sample counts) was found.
Hardware Specification | Yes | Each model was trained on a single NVIDIA A5000 GPU.
Software Dependencies | No | The paper mentions the "Adam optimizer" and "ReLU activations" but does not specify versions for these or any other software dependencies.
Experiment Setup | Yes | For other models, we use a batch size of 32, a learning rate of 3e-4, and 2e6 training steps with the Adam optimizer [Kingma and Ba, 2015]. The MLP action model and the MLP reward model are each a two-layer MLP with 512 hidden units and ReLU activations. The diffusion model's noise model backbone is a U-Net with six repeated residual blocks; each block consists of two temporal convolutions, each followed by group norm [Wu and He, 2018], and a final Mish nonlinearity [Misra, 2019]. For the Maze2D experiments, different mazes require different average episode lengths to reach the target, so we use a planning horizon of 180 for umaze, 256 for medium maze, and 300 for large maze. We also use a warm-starting strategy that performs a reduced number of forward diffusion steps from a previously generated plan, as in Diffuser [Janner et al., 2022], to speed up computation. The training of all three models, including the observation, action, and reward models, is conducted using the teacher-forcing technique [Williams and Zipser, 1989]. Additional hyperparameters can be found in the configuration files within our codebase.
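
The Experiment Setup row above pins down a few concrete hyperparameters (batch size 32, learning rate 3e-4, Adam, two-layer MLPs with 512 hidden units and ReLU). The sketch below is a minimal PyTorch illustration of that configuration only; the input/output dimensions, the two-hidden-layer reading of "two-layer MLP", and the placeholder training step are assumptions, and the authors' configuration files remain the authoritative source.

```python
# Minimal sketch of the MLP action/reward models and optimizer settings quoted
# in the Experiment Setup row. Dimensions, the two-hidden-layer interpretation,
# and the dummy batch are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM = 17, 6                     # assumed dimensions (e.g., a MuJoCo task)
BATCH_SIZE, LR, TRAIN_STEPS = 32, 3e-4, int(2e6)

def two_layer_mlp(in_dim: int, out_dim: int, hidden: int = 512) -> nn.Sequential:
    """MLP with 512 hidden units and ReLU activations, read here as two hidden layers."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

action_model = two_layer_mlp(OBS_DIM, ACT_DIM)
reward_model = two_layer_mlp(OBS_DIM, 1)

optimizer = torch.optim.Adam(
    list(action_model.parameters()) + list(reward_model.parameters()), lr=LR
)

# One teacher-forced regression step on a random placeholder batch.
obs = torch.randn(BATCH_SIZE, OBS_DIM)
act_target = torch.randn(BATCH_SIZE, ACT_DIM)
rew_target = torch.randn(BATCH_SIZE, 1)

loss = nn.functional.mse_loss(action_model(obs), act_target) \
     + nn.functional.mse_loss(reward_model(obs), rew_target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```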
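
The D4RL datasets cited in the Open Datasets row can be pulled with the standard d4rl API, as in the short sketch below; the task name is an assumed example, and the exact datasets and versions used in the paper are specified in its codebase.

```python
# Hedged sketch of loading a D4RL offline dataset with the standard d4rl API.
# "maze2d-umaze-v1" is an assumed example task, not necessarily the version
# used in the paper's experiments.
import gym
import d4rl  # importing d4rl registers the offline-RL environments with gym

env = gym.make("maze2d-umaze-v1")
dataset = env.get_dataset()  # dict of numpy arrays keyed by field name
print(dataset["observations"].shape,
      dataset["actions"].shape,
      dataset["rewards"].shape)
```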