Is Value Learning Really the Main Bottleneck in Offline RL?
Authors: Seohong Park, Kevin Frans, Sergey Levine, Aviral Kumar
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | we perform a systematic empirical study of (1) value learning, (2) policy extraction, and (3) policy generalization in offline RL problems, analyzing how these components affect performance. |
| Researcher Affiliation | Academia | ¹University of California, Berkeley; ²Carnegie Mellon University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our implementation is based on jaxrl_minimal [20] and the official implementation of HIQL [44] (for offline goal-conditioned RL). ... [20] Dibya Ghosh. dibyaghosh/jaxrl_m, 2023. URL https://github.com/dibyaghosh/jaxrl_m. |
| Open Datasets | Yes | antmaze-large and gc-antmaze-large are based on the antmaze-large-diverse-v2 environment from the D4RL suite [12] |
| Dataset Splits | Yes | We randomly split the trajectories in a dataset into a training set (95%) and a validation set (5%) in our experiments. |
| Hardware Specification | Yes | We use an internal cluster consisting of A5000 GPUs to run our experiments. |
| Software Dependencies | No | Our implementation is based on jaxrl_minimal [20] and the official implementation of HIQL [44] (for offline goal-conditioned RL). We use an internal cluster consisting of A5000 GPUs to run our experiments. ... Table 2: Optimizer Adam [24]. Table 3: Layer Norm [3]. |
| Experiment Setup | Yes | We train agents for 1M steps (500K steps for gc-roboverse) with each pair of value learning and policy extraction algorithms. We evaluate the performance of the agent every 100K steps with 50 rollouts, and report the performance averaged over the last 3 evaluations and over 8 seeds. ... Table 2: Learning rate 0.0003, Discount factor γ 0.99. Table 3: Minibatch size, MLP dimensions, IQL expectile, AWR α, DDPG+BC α, SfBC N. |
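
The dataset split and the reporting protocol quoted in the table can be illustrated with a minimal Python sketch. It is not taken from the paper's code: the function names `split_trajectories` and `final_score`, the seed handling, and the toy data are all assumptions made for illustration. It only shows one plausible reading of "95% / 5% trajectory-level split" and "average over the last 3 evaluations and over 8 seeds".

```python
import random
from statistics import mean


def split_trajectories(trajectories, val_fraction=0.05, seed=0):
    """Randomly split whole trajectories into (train, validation) lists.

    Hypothetical helper: the split is done per trajectory, not per
    transition, so no trajectory appears in both sets.
    """
    rng = random.Random(seed)
    indices = list(range(len(trajectories)))
    rng.shuffle(indices)
    # Keep at least one validation trajectory when the dataset is non-empty.
    num_val = max(1, round(len(trajectories) * val_fraction)) if trajectories else 0
    val_ids = set(indices[:num_val])
    train = [t for i, t in enumerate(trajectories) if i not in val_ids]
    val = [t for i, t in enumerate(trajectories) if i in val_ids]
    return train, val


def final_score(eval_curves, last_k=3):
    """Average the last `last_k` evaluations of each seed, then average over seeds."""
    per_seed = [mean(curve[-last_k:]) for curve in eval_curves]
    return mean(per_seed)


# Toy data (assumption): 100 trajectories of 10 placeholder transitions each.
trajectories = [[(i, t) for t in range(10)] for i in range(100)]
train_set, val_set = split_trajectories(trajectories)

# Toy curves (assumption): 8 seeds x 10 evaluations, i.e. one evaluation
# per 100K steps of a 1M-step run.
curves = [[0.1 * k for k in range(1, 11)] for _ in range(8)]
score = final_score(curves)
```

Splitting at the trajectory level rather than the transition level keeps validation transitions from sharing an episode with training transitions; whether the authors' implementation handles edge cases (rounding, minimum validation size) the same way is not stated in the quoted text.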