Is Value Learning Really the Main Bottleneck in Offline RL?

Authors: Seohong Park, Kevin Frans, Sergey Levine, Aviral Kumar

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "we perform a systematic empirical study of (1) value learning, (2) policy extraction, and (3) policy generalization in offline RL problems, analyzing how these components affect performance."
Researcher Affiliation | Academia | "University of California, Berkeley; Carnegie Mellon University"
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | "Our implementation is based on jaxrl_minimal [20] and the official implementation of HIQL [44] (for offline goal-conditioned RL)." ... [20] Dibya Ghosh. dibyaghosh/jaxrl_m, 2023. URL https://github.com/dibyaghosh/jaxrl_m.
Open Datasets | Yes | "antmaze-large and gc-antmaze-large are based on the antmaze-large-diverse-v2 environment from the D4RL suite [12]" (see the loading sketch after the table)
Dataset Splits | Yes | "We randomly split the trajectories in a dataset into a training set (95%) and a validation set (5%) in our experiments." (see the splitting sketch after the table)
Hardware Specification | Yes | "We use an internal cluster consisting of A5000 GPUs to run our experiments."
Software Dependencies | No | "Our implementation is based on jaxrl_minimal [20] and the official implementation of HIQL [44] (for offline goal-conditioned RL). We use an internal cluster consisting of A5000 GPUs to run our experiments." ... Table 2: Optimizer Adam [24]. Table 3: Layer Norm [3].
Experiment Setup | Yes | "We train agents for 1M steps (500K steps for gc-roboverse) with each pair of value learning and policy extraction algorithms. We evaluate the performance of the agent every 100K steps with 50 rollouts, and report the performance averaged over the last 3 evaluations and over 8 seeds." ... Table 2: Learning rate 0.0003, Discount factor γ 0.99. Table 3: Minibatch size, MLP dimensions, IQL expectile, AWR α, DDPG+BC α, SfBC N. (see the protocol sketch after the table)
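
The dataset named in the Open Datasets row can be obtained through the public D4RL package. The snippet below is a minimal loading sketch, assuming the gym and d4rl packages are installed; it is not the paper's own data-loading code.

```python
# Minimal sketch of loading the antmaze-large-diverse-v2 dataset from D4RL.
# Assumes the public `gym` and `d4rl` packages; not the paper's own code.
import gym
import d4rl  # importing d4rl registers the D4RL environments with gym

env = gym.make("antmaze-large-diverse-v2")
dataset = d4rl.qlearning_dataset(env)  # dict with observations, actions, rewards, terminals, next_observations
print(dataset["observations"].shape, dataset["actions"].shape)
```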
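
For the Dataset Splits row, the quoted 95%/5% trajectory-level split can be sketched as follows. `split_trajectories` is a hypothetical helper written for illustration; the paper does not show its splitting code.

```python
# Hypothetical sketch of the 95% train / 5% validation trajectory split
# described in the paper; the exact splitting code is not shown there.
import numpy as np

def split_trajectories(trajectories, val_frac=0.05, seed=0):
    """Randomly assign whole trajectories to training and validation sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(trajectories))
    n_val = int(round(len(trajectories) * val_frac))
    val_ids, train_ids = order[:n_val], order[n_val:]
    train = [trajectories[i] for i in train_ids]
    val = [trajectories[i] for i in val_ids]
    return train, val
```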
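
The Experiment Setup row describes the evaluation protocol: train for 1M steps, evaluate every 100K steps with 50 rollouts, and report the mean over the last 3 evaluations and 8 seeds. The constants below follow that quoted setup, but the aggregation function itself is an assumption for illustration, not the authors' released code.

```python
# Sketch of the reported evaluation protocol; constants follow the quoted setup,
# but the aggregation function is an assumption, not the authors' code.
import numpy as np

NUM_STEPS = 1_000_000       # 500K for gc-roboverse
EVAL_INTERVAL = 100_000     # evaluate every 100K steps
NUM_EVAL_ROLLOUTS = 50      # rollouts per evaluation
NUM_SEEDS = 8
LEARNING_RATE = 3e-4        # Adam (Table 2)
DISCOUNT = 0.99             # gamma (Table 2)

def reported_score(eval_returns):
    """eval_returns: array of shape (NUM_SEEDS, NUM_STEPS // EVAL_INTERVAL),
    one mean return per seed and evaluation. Returns the average over the
    last 3 evaluations and all seeds, as reported in the paper."""
    eval_returns = np.asarray(eval_returns)
    return float(eval_returns[:, -3:].mean())
```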