Decision Stacks: Flexible Reinforcement Learning via Modular Generative Models
Authors: Siyan Zhao, Aditya Grover
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical results demonstrate the effectiveness of Decision Stacks for offline policy optimization for several MDP and POMDP environments, outperforming existing methods and enabling flexible generative decision making. |
| Researcher Affiliation | Academia | Siyan Zhao, Department of Computer Science, University of California, Los Angeles (siyanz@cs.ucla.edu); Aditya Grover, Department of Computer Science, University of California, Los Angeles (adityag@cs.ucla.edu) |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | The project website and code can be found here: https://siyan-zhao.github.io/decision-stacks/ |
| Open Datasets | Yes | We experiment with D4RL environments and parameterize Decision Stacks with a diffusion-based observation model, an autoregressive transformer-based reward model, and an autoregressive transformer-based action model. |
| Dataset Splits | No | The paper uses D4RL environments, but no explicit mention of training, validation, and test splits (e.g., percentages or sample counts) was found. |
| Hardware Specification | Yes | Each model was trained on a single NVIDIA A5000 GPU. |
| Software Dependencies | No | The paper mentions the "Adam optimizer" and "ReLU activations" but does not specify version numbers for these or for any other software dependencies. |
| Experiment Setup | Yes | For other models, we use a batch size of 32, a learning rate of 3e-4, and 2e6 training steps with the Adam optimizer [Kingma and Ba, 2015]. The MLP action model and the MLP reward model are two-layer MLPs with 512 hidden units and ReLU activations. The diffusion model's noise-prediction backbone is a U-Net with six repeated residual blocks; each block consists of two temporal convolutions, each followed by group norm [Wu and He, 2018], and a final Mish nonlinearity [Misra, 2019]. For Maze2D experiments, different mazes require different average episode lengths to reach the target, so we use a planning horizon of 180 for umaze, 256 for medium maze, and 300 for large maze. For Maze2D experiments, we also use a warm-starting strategy that performs a reduced number of forward diffusion steps from a previously generated plan, as in Diffuser [Janner et al., 2022], to speed up computation. Training of all three models (observation, action, and reward) is conducted with the teacher-forcing technique [Williams and Zipser, 1989]. Additional hyperparameters can be found in the configuration files within our codebase. |
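The Open Datasets row refers to the standard D4RL benchmark suite. Below is a minimal sketch of loading one of its offline datasets through the usual `gym` + `d4rl` interface; the specific environment name and version suffix are illustrative assumptions, not values read from the paper's configuration files.

```python
# Minimal sketch of loading a D4RL offline dataset (assumed environment name).
import gym
import d4rl  # importing d4rl registers its environments with gym

env = gym.make("maze2d-umaze-v1")      # one of the Maze2D tasks mentioned in the paper
dataset = env.get_dataset()             # dict of numpy arrays for the offline trajectories

observations = dataset["observations"]  # shape (N, obs_dim)
actions = dataset["actions"]            # shape (N, act_dim)
rewards = dataset["rewards"]            # shape (N,)
terminals = dataset["terminals"]        # shape (N,) episode-termination flags
print(observations.shape, actions.shape)
```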
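The Experiment Setup row quotes a two-layer MLP with 512 hidden units and ReLU activations for the MLP variants of the action and reward models, trained with Adam at a learning rate of 3e-4 and a batch size of 32. The sketch below interprets "two-layer" as two hidden layers of 512 units; the input/output dimensions and conditioning inputs are assumptions for illustration, not details from the codebase.

```python
# Hedged sketch of the MLP action/reward heads and the quoted Adam configuration.
import torch
import torch.nn as nn

def make_mlp_head(in_dim: int, out_dim: int, hidden: int = 512) -> nn.Sequential:
    """Two hidden layers of 512 units with ReLU activations (one possible reading
    of "two-layer MLP"); the paper's phrasing could also mean two linear layers."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

action_model = make_mlp_head(in_dim=17, out_dim=6)       # e.g. observation -> action (dims assumed)
reward_model = make_mlp_head(in_dim=17 + 6, out_dim=1)   # e.g. (observation, action) -> reward

optimizer = torch.optim.Adam(action_model.parameters(), lr=3e-4)
batch_size, train_steps = 32, int(2e6)                   # values quoted in the setup row
```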
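The same row describes the diffusion model's U-Net backbone as six repeated residual blocks, each with two temporal convolutions followed by group norm and a Mish nonlinearity. A hedged sketch of one such block is below; the kernel size, group count, residual projection, and the exact placement of the Mish activations (applied here after each GroupNorm, in the style of Diffuser's temporal blocks) are assumptions, and the time-step embedding used by a real diffusion backbone is omitted.

```python
# Hedged sketch of one residual block of a temporal U-Net noise model.
import torch
import torch.nn as nn

class TemporalResidualBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 5, groups: int = 8):
        super().__init__()
        pad = kernel_size // 2
        self.block = nn.Sequential(
            nn.Conv1d(in_ch, out_ch, kernel_size, padding=pad),   # temporal convolution 1
            nn.GroupNorm(groups, out_ch),
            nn.Mish(),
            nn.Conv1d(out_ch, out_ch, kernel_size, padding=pad),  # temporal convolution 2
            nn.GroupNorm(groups, out_ch),
            nn.Mish(),
        )
        # 1x1 convolution so the skip connection matches the output channel count
        self.skip = nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, horizon) -- convolutions run over the time axis
        return self.block(x) + self.skip(x)

# Usage example with the umaze planning horizon of 180 from the setup row.
block = TemporalResidualBlock(in_ch=32, out_ch=64)
y = block(torch.randn(8, 32, 180))   # -> shape (8, 64, 180)
```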