Improving Offline RL by Blending Heuristics
Authors: Sinong Geng, Aldo Pacchiano, Andrey Kolobov, Ching-An Cheng
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we run HUBL with the four aforementioned offline RL methods (CQL, TD3+BC, IQL, and ATAC) and show that enhancing these SoTA algorithms with HUBL can improve their performance by 9% on average across 27 datasets of D4RL (Fu et al., 2020) and Meta-World (Yu et al., 2020). |
| Researcher Affiliation | Collaboration | Sinong Geng (Princeton University, Princeton, NJ); Aldo Pacchiano (Boston University; Broad Institute of MIT and Harvard, Boston, MA); Andrey Kolobov (Microsoft Research, Redmond, WA); Ching-An Cheng (Microsoft Research, Redmond, WA) |
| Pseudocode | Yes | Algorithm 1 (HUBL + Offline RL): 1: Input: dataset D = {(s, a, s′, r, γ)}; 2: Compute h_t for each trajectory in D; 3: Compute λ_t for each trajectory in D; 4: Relabel r and γ by h_t and λ_t as r̃ and γ̃ and create D̃ = {(s, a, s′, r̃, γ̃)}; 5: π̂ ← Offline RL on D̃. (A relabeling sketch is given after the table.) |
| Open Source Code | No | The paper lists code sources for the base offline RL methods (ATAC, CQL, IQL, TD3+BC) in Table 3. However, it does not provide a direct link or explicit statement that the code for HUBL itself, or the authors' implementation of HUBL, is open-source or publicly available. |
| Open Datasets | Yes | We study 27 benchmark datasets in D4RL and Meta-World. ... on 27 datasets of D4RL (Fu et al., 2020) and Meta-World (Yu et al., 2020). |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits with specific percentages, counts, or references to predefined splits. |
| Hardware Specification | Yes | Experiments with ATAC, IQL, and TD3+BC are run on Standard F4S V2 nodes of Azure, and experiments with CQL are run on NC6S V2 nodes of Azure. |
| Software Dependencies | No | The paper mentions that "The first-order optimization is implemented by ADAM (Kingma and Ba, 2014)" and notes that the base methods are implemented in PyTorch (Table 3). However, it does not provide specific version numbers for any software components (e.g., Python, PyTorch, or other libraries). |
| Experiment Setup | Yes | For each dataset, the hyperparameters of the base methods are tuned over six different configurations suggested by the original papers; these configurations are summarized in Table 5. ... The first-order optimization is implemented by ADAM (Kingma and Ba, 2014) with a minibatch size of 256. The learning rates are selected following the original implementations and are reported in Table 4. (An optimizer sketch follows the relabeling sketch below.) |
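
To make the relabeling step in the pseudocode row concrete, below is a minimal, hypothetical Python sketch (not the authors' code). It assumes the heuristic h_t is a Monte-Carlo return-to-go computed per logged trajectory, that the blended reward and discount take the form r̃ = r + γλh(s′) and γ̃ = γ(1 − λ), and that the blending factor λ is a constant; function and variable names are illustrative.

```python
import numpy as np

def relabel_trajectory(rewards, gamma=0.99, lam=0.5):
    """Sketch of HUBL-style relabeling for one logged trajectory.

    rewards: per-step rewards r_0, ..., r_{T-1} from the offline dataset.
    gamma:   original discount factor.
    lam:     blending factor lambda (a constant here for simplicity).
    Returns (r_tilde, gamma_tilde) arrays to store in the relabeled dataset D~.
    """
    rewards = np.asarray(rewards, dtype=float)
    T = len(rewards)

    # Heuristic h_t as the Monte-Carlo return-to-go, computed backwards:
    # h_t = r_t + gamma * h_{t+1}, with h_T = 0.
    h = np.zeros(T + 1)
    for t in reversed(range(T)):
        h[t] = rewards[t] + gamma * h[t + 1]

    # Blend part of the heuristic at the next state into the reward ...
    r_tilde = rewards + gamma * lam * h[1:]
    # ... and shrink the discount so the base offline RL method bootstraps less from its critic.
    gamma_tilde = np.full(T, gamma * (1.0 - lam))
    return r_tilde, gamma_tilde
```

Under this assumed form, the base method's Bellman backup on the relabeled data becomes r + γλh(s′) + γ(1 − λ)V(s′), i.e., a blend of the heuristic and the bootstrapped value, while the base offline RL algorithm itself is left unchanged and simply trained on D̃.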
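For the experiment-setup row, a minimal sketch of the reported optimization settings, assuming PyTorch as used by the base methods; the network and learning rate below are placeholders, since the paper selects learning rates per base method following the original implementations (Table 4).

```python
import torch

# Hypothetical one-layer critic standing in for any of the base methods' networks.
critic = torch.nn.Linear(17, 1)

# Reported settings: Adam optimizer with a minibatch size of 256.
# The learning rate is a placeholder (actual values are per method, Table 4 of the paper).
optimizer = torch.optim.Adam(critic.parameters(), lr=3e-4)
BATCH_SIZE = 256
```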