Bayesian Reparameterization of Reward-Conditioned Reinforcement Learning with Energy-based Models

Authors: Wenhao Ding, Tong Che, Ding Zhao, Marco Pavone

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we conduct several experiments on two standard benchmarks to answer the following questions: Q1: How is the performance of our proposed method compared to existing offline RL methods? Q2: How do different target RTG strategies during inference influence the results? Q3: How does the observed RTG match the target RTG during the inference stage? Q4: How do different components in BR-RCRL influence the performance?
Researcher Affiliation | Collaboration | 1 Carnegie Mellon University, Pittsburgh, PA, US; 2 NVIDIA Research, Santa Clara, CA, US; 3 Stanford University, Palo Alto, CA, US. Correspondence to: Wenhao Ding <wenhaod@andrew.cmu.edu>, Tong Che <tongc@nvidia.com>.
Pseudocode | Yes | Algorithm 1: Adaptive Inference for BR-RCRL
Open Source Code | No | The source code of our experiments will be released after the blind review process.
Open Datasets | Yes | We evaluate our method in 9 Gym-MuJoCo tasks (Fu et al., 2020) and 4 Atari games (Mnih et al., 2013), which are both standard offline RL benchmarks and cover continuous and discrete action spaces. ... The offline dataset of this benchmark is collected from the replay buffer of an online DQN agent (Mnih et al., 2015). (A data-loading sketch for these benchmarks appears after this table.)
Dataset Splits | No | The paper mentions using 10% of the Atari replay buffer for experiments and different dataset types (Medium, Medium-Replay, Medium-Expert) for Gym-MuJoCo, but it does not explicitly provide the percentages or counts for the training, validation, and test splits needed for reproduction.
Hardware Specification | Yes | The experiments were conducted on a device with 256 GB of memory and 2 NVIDIA RTX A6000 GPUs. The Atari experiments require 150 GB of memory to load the 10% Atari dataset.
Software Dependencies | No | The paper mentions using a "derivative-free optimizer (DFO) proposed in (Florence et al., 2022)" and refers to "Langevin MCMC (Welling & Teh, 2011)", but it does not provide specific version numbers for these or for any other software libraries or frameworks used. (A generic Langevin MCMC sketch appears after this table.)
Experiment Setup | Yes | The hyperparameters used in the Gym-MuJoCo experiments and the Atari experiments are summarized in Table 7 and Table 8, respectively. We use the same hyperparameters for all experiments in the same benchmark. ... Table 7 lists parameters such as training iterations (70,000), learning rate (0.0005), batch size (512), and the λ weight of L1(θ) (1.0). (A configuration sketch mirroring these quoted values appears after this table.)
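
For the open-datasets row, the Gym-MuJoCo benchmark corresponds to D4RL (Fu et al., 2020), whose offline datasets can be pulled through the standard d4rl package. The snippet below is a minimal loading sketch rather than the paper's pipeline; the specific task name ("halfcheetah-medium-v2") is an illustrative assumption.

```python
import gym
import d4rl  # registers the D4RL offline RL environments with gym

# Illustrative task choice; the paper evaluates Medium, Medium-Replay, and
# Medium-Expert variants across 9 Gym-MuJoCo tasks.
env = gym.make("halfcheetah-medium-v2")
dataset = d4rl.qlearning_dataset(env)  # dict of observations, actions, rewards, terminals

print(dataset["observations"].shape, dataset["actions"].shape)
```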
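The software-dependencies row quotes the paper's use of Langevin MCMC (Welling & Teh, 2011) for sampling from the energy-based model. As a reference point only, here is a minimal, generic Langevin sampler in PyTorch; the function name, step sizes, and the toy Gaussian energy in the usage line are assumptions, not the paper's implementation.

```python
import torch

def langevin_sample(energy_fn, x_init, n_steps=50, step_size=0.01, noise_scale=0.01):
    """Generic Langevin MCMC: draw samples from p(x) proportional to exp(-E(x))."""
    x = x_init.clone().requires_grad_(True)
    for _ in range(n_steps):
        energy = energy_fn(x).sum()
        grad = torch.autograd.grad(energy, x)[0]
        with torch.no_grad():
            # Gradient step toward low energy plus Gaussian noise (Welling & Teh, 2011).
            x = x - 0.5 * step_size * grad + noise_scale * torch.randn_like(x)
        x = x.requires_grad_(True)
    return x.detach()

# Usage example with a toy quadratic energy, i.e. sampling from a standard Gaussian.
samples = langevin_sample(lambda x: 0.5 * (x ** 2).sum(dim=-1), x_init=torch.randn(128, 2))
```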
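The experiment-setup row reports a handful of Table 7 values for Gym-MuJoCo. Below is a hypothetical configuration dictionary collecting just those quoted values; the field names are illustrative, and all remaining hyperparameters would need to be taken from Tables 7 and 8 of the paper.

```python
# Hypothetical config mirroring the Gym-MuJoCo values quoted from Table 7.
# Only the four values below come from the report; the field names are illustrative.
gym_mujoco_hparams = {
    "training_iterations": 70_000,
    "learning_rate": 5e-4,      # 0.0005
    "batch_size": 512,
    "lambda_l1_weight": 1.0,    # λ weight on the L1(θ) loss term
}
```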