Bayesian Reparameterization of Reward-Conditioned Reinforcement Learning with Energy-based Models

Authors: Wenhao Ding, Tong Che, Ding Zhao, Marco Pavone

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we conduct several experiments on two standard benchmarks to answer the following questions: Q1: How is the performance of our proposed method compared to existing offline RL methods? Q2: How do different target RTG strategies during inference influence the results? Q3: How does the observed RTG match the target RTG during the inference stage? Q4: How do different components in BR-RCRL influence the performance?
Researcher Affiliation | Collaboration | 1 Carnegie Mellon University, Pittsburgh, PA, US; 2 NVIDIA Research, Santa Clara, CA, US; 3 Stanford University, Palo Alto, CA, US. Correspondence to: Wenhao Ding <wenhaod@andrew.cmu.edu>, Tong Che <tongc@nvidia.com>.
Pseudocode | Yes | Algorithm 1: Adaptive Inference for BR-RCRL
Open Source Code | No | The source code of our experiments will be released after the blind review process.
Open Datasets | Yes | We evaluate our method in 9 Gym-MuJoCo tasks (Fu et al., 2020) and 4 Atari games (Mnih et al., 2013), which are both standard offline RL benchmarks and cover continuous and discrete action spaces. ... The offline dataset of this benchmark is collected from the replay buffer of an online DQN agent (Mnih et al., 2015). (A data-loading sketch for these benchmarks appears after this table.)
Dataset Splits | No | The paper mentions using 10% of the Atari replay buffer for experiments and different dataset types (Medium, Medium-Replay, Medium-Expert) for Gym-MuJoCo, but it does not explicitly provide the percentages or counts for the training, validation, and test splits needed for reproduction.
Hardware Specification | Yes | The experiments were conducted on a device with 256 GB of memory and 2 NVIDIA RTX A6000 GPUs. The Atari experiments require 150 GB of memory to load the 10% Atari dataset.
Software Dependencies | No | The paper mentions using a "derivative-free optimizer (DFO) proposed in (Florence et al., 2022)" and refers to "Langevin MCMC (Welling & Teh, 2011)", but it does not provide specific version numbers for these or for any other software libraries or frameworks used. (A generic Langevin MCMC sketch appears after this table.)
Experiment Setup | Yes | The hyperparameters used in the Gym-MuJoCo experiments and the Atari experiments are summarized in Table 7 and Table 8, respectively. We use the same hyperparameters for all experiments in the same benchmark. ... Table 7 lists parameters such as training iterations (70,000), learning rate (0.0005), batch size (512), and the λ weight of L1(θ) (1.0). (A configuration sketch mirroring these quoted values appears after this table.)
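
For the open-datasets row, the Gym-MuJoCo benchmark corresponds to D4RL (Fu et al., 2020), whose offline datasets can be pulled through the standard d4rl package. The snippet below is a minimal loading sketch rather than the paper's pipeline; the specific task name ("halfcheetah-medium-v2") is an illustrative assumption.

```python
import gym
import d4rl  # registers the D4RL offline RL environments with gym

# Illustrative task choice; the paper evaluates Medium, Medium-Replay, and
# Medium-Expert variants across 9 Gym-MuJoCo tasks.
env = gym.make("halfcheetah-medium-v2")
dataset = d4rl.qlearning_dataset(env)  # dict of observations, actions, rewards, terminals

print(dataset["observations"].shape, dataset["actions"].shape)
```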
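The software-dependencies row quotes the paper's use of Langevin MCMC (Welling & Teh, 2011) for sampling from the energy-based model. As a reference point only, here is a minimal, generic Langevin sampler in PyTorch; the function name, step sizes, and the toy Gaussian energy in the usage line are assumptions, not the paper's implementation.

```python
import torch

def langevin_sample(energy_fn, x_init, n_steps=50, step_size=0.01, noise_scale=0.01):
    """Generic Langevin MCMC: draw samples from p(x) proportional to exp(-E(x))."""
    x = x_init.clone().requires_grad_(True)
    for _ in range(n_steps):
        energy = energy_fn(x).sum()
        grad = torch.autograd.grad(energy, x)[0]
        with torch.no_grad():
            # Gradient step toward low energy plus Gaussian noise (Welling & Teh, 2011).
            x = x - 0.5 * step_size * grad + noise_scale * torch.randn_like(x)
        x = x.requires_grad_(True)
    return x.detach()

# Usage example with a toy quadratic energy, i.e. sampling from a standard Gaussian.
samples = langevin_sample(lambda x: 0.5 * (x ** 2).sum(dim=-1), x_init=torch.randn(128, 2))
```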
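The experiment-setup row reports a handful of Table 7 values for Gym-MuJoCo. Below is a hypothetical configuration dictionary collecting just those quoted values; the field names are illustrative, and all remaining hyperparameters would need to be taken from Tables 7 and 8 of the paper.

```python
# Hypothetical config mirroring the Gym-MuJoCo values quoted from Table 7.
# Only the four values below come from the report; the field names are illustrative.
gym_mujoco_hparams = {
    "training_iterations": 70_000,
    "learning_rate": 5e-4,      # 0.0005
    "batch_size": 512,
    "lambda_l1_weight": 1.0,    # λ weight on the L1(θ) loss term
}
```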