Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
# Is Value Learning Really the Main Bottleneck in Offline RL?
Authors: Seohong Park, Kevin Frans, Sergey Levine, Aviral Kumar
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | we perform a systematic empirical study of (1) value learning, (2) policy extraction, and (3) policy generalization in offline RL problems, analyzing how these components affect performance. |
| Researcher Affiliation | Academia | 1University of California, Berkeley 2Carnegie Mellon University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our implementation is based on jaxrl_minimal [20] and the official implementation of HIQL [44] (for offline goal-conditioned RL). ... [20] Dibya Ghosh. dibyaghosh/jaxrl_m, 2023. URL https://github.com/dibyaghosh/jaxrl_m. |
| Open Datasets | Yes | antmaze-large and gc-antmaze-large are based on the antmaze-large-diverse-v2 environment from the D4RL suite [12] |
| Dataset Splits | Yes | We randomly split the trajectories in a dataset into a training set (95%) and a validation set (5%) in our experiments. |
| Hardware Specification | Yes | We use an internal cluster consisting of A5000 GPUs to run our experiments. |
| Software Dependencies | No | Our implementation is based on jaxrl_minimal [20] and the official implementation of HIQL [44] (for offline goal-conditioned RL). We use an internal cluster consisting of A5000 GPUs to run our experiments. ... Table 2: Optimizer Adam [24]. Table 3: Layer Norm [3]. |
| Experiment Setup | Yes | We train agents for 1M steps (500K steps for gc-roboverse) with each pair of value learning and policy extraction algorithms. We evaluate the performance of the agent every 100K steps with 50 rollouts, and report the performance averaged over the last 3 evaluations and over 8 seeds. ... Table 2: Learning rate 0.0003, Discount factor γ 0.99. Table 3: Minibatch size, MLP dimensions, IQL expectile, AWR α, DDPG+BC α, SfBC N. |
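The dataset-split procedure quoted above (a random 95%/5% trajectory-level train/validation split) can be sketched as follows. This is a minimal illustration, not the authors' code; the function name `split_trajectories` and the fixed seed are assumptions for the example.

```python
import random

def split_trajectories(trajectories, val_frac=0.05, seed=0):
    """Randomly partition trajectories into a training set (95%)
    and a validation set (5%), as described in the paper's setup.
    Splitting at the trajectory level (not the transition level)
    keeps whole trajectories out of the training set."""
    rng = random.Random(seed)
    indices = list(range(len(trajectories)))
    rng.shuffle(indices)
    n_val = max(1, int(len(trajectories) * val_frac))
    val_idx = set(indices[:n_val])
    train = [t for i, t in enumerate(trajectories) if i not in val_idx]
    val = [t for i, t in enumerate(trajectories) if i in val_idx]
    return train, val

# Example: 100 placeholder trajectories -> 95 train, 5 validation.
trajs = [f"traj_{i}" for i in range(100)]
train_set, val_set = split_trajectories(trajs)
```

Splitting by trajectory rather than by individual transition is the natural reading of "split the trajectories in a dataset", since it avoids leaking states from a validation trajectory into training.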