Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
# Is Value Learning Really the Main Bottleneck in Offline RL?
Authors: Seohong Park, Kevin Frans, Sergey Levine, Aviral Kumar
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | we perform a systematic empirical study of (1) value learning, (2) policy extraction, and (3) policy generalization in offline RL problems, analyzing how these components affect performance. |
| Researcher Affiliation | Academia | 1University of California, Berkeley 2Carnegie Mellon University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our implementation is based on jaxrl_minimal [20] and the official implementation of HIQL [44] (for offline goal-conditioned RL). ... [20] Dibya Ghosh. dibyaghosh/jaxrl_m, 2023. URL https://github.com/dibyaghosh/jaxrl_m. |
| Open Datasets | Yes | antmaze-large and gc-antmaze-large are based on the antmaze-large-diverse-v2 environment from the D4RL suite [12] |
| Dataset Splits | Yes | We randomly split the trajectories in a dataset into a training set (95%) and a validation set (5%) in our experiments. |
| Hardware Specification | Yes | We use an internal cluster consisting of A5000 GPUs to run our experiments. |
| Software Dependencies | No | Our implementation is based on jaxrl_minimal [20] and the official implementation of HIQL [44] (for offline goal-conditioned RL). We use an internal cluster consisting of A5000 GPUs to run our experiments. ... Table 2: Optimizer Adam [24]. Table 3: Layer Norm [3]. |
| Experiment Setup | Yes | We train agents for 1M steps (500K steps for gc-roboverse) with each pair of value learning and policy extraction algorithms. We evaluate the performance of the agent every 100K steps with 50 rollouts, and report the performance averaged over the last 3 evaluations and over 8 seeds. ... Table 2: Learning rate 0.0003, Discount factor γ 0.99. Table 3: Minibatch size, MLP dimensions, IQL expectile, AWR α, DDPG+BC α, SfBC N. |
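The dataset-split procedure quoted above (a random 95%/5% trajectory-level train/validation split) can be sketched as follows. This is a minimal illustration, not the authors' code; the function name `split_trajectories` and the fixed seed are assumptions for the example.

```python
import random

def split_trajectories(trajectories, val_frac=0.05, seed=0):
    """Randomly partition trajectories into a training set (95%)
    and a validation set (5%), as described in the paper's setup.
    Splitting at the trajectory level (not the transition level)
    keeps whole trajectories out of the training set."""
    rng = random.Random(seed)
    indices = list(range(len(trajectories)))
    rng.shuffle(indices)
    n_val = max(1, int(len(trajectories) * val_frac))
    val_idx = set(indices[:n_val])
    train = [t for i, t in enumerate(trajectories) if i not in val_idx]
    val = [t for i, t in enumerate(trajectories) if i in val_idx]
    return train, val

# Example: 100 placeholder trajectories -> 95 train, 5 validation.
trajs = [f"traj_{i}" for i in range(100)]
train_set, val_set = split_trajectories(trajs)
```

Splitting by trajectory rather than by individual transition is the natural reading of "split the trajectories in a dataset", since it avoids leaking states from a validation trajectory into training.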