Improving Offline RL by Blending Heuristics

Authors: Sinong Geng, Aldo Pacchiano, Andrey Kolobov, Ching-An Cheng

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we run HUBL with the four aforementioned offline RL methods CQL, TD3+BC, IQL, and ATAC, and show that enhancing these SoTA algorithms with HUBL can improve their performance by 9% on average across 27 datasets of D4RL (Fu et al., 2020) and Meta-World (Yu et al., 2020).
Researcher Affiliation | Collaboration | Sinong Geng (Princeton University, Princeton, NJ); Aldo Pacchiano (Boston University and Broad Institute of MIT and Harvard, Boston, MA); Andrey Kolobov (Microsoft Research, Redmond, WA); Ching-An Cheng (Microsoft Research, Redmond, WA)
Pseudocode | Yes | Algorithm 1 (HUBL + Offline RL): 1: Input: dataset D = {(s, a, s', r, γ)}; 2: compute h_t for each trajectory in D; 3: compute λ_t for each trajectory in D; 4: relabel r and γ using h_t and λ_t as r̃ and γ̃, creating D̃ = {(s, a, s', r̃, γ̃)}; 5: π̂ ← Offline RL on D̃. (A Python sketch of this relabeling step follows the table.)
Open Source Code | No | The paper lists code sources for the base offline RL methods (ATAC, CQL, IQL, TD3+BC) in Table 3. However, it does not provide a direct link or an explicit statement that the authors' implementation of HUBL itself is open-source or publicly available.
Open Datasets | Yes | We study 27 benchmark datasets in D4RL and Meta-World. ... on 27 datasets of D4RL (Fu et al., 2020) and Meta-World (Yu et al., 2020).
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits with specific percentages, counts, or references to predefined splits.
Hardware Specification | Yes | Experiments with ATAC, IQL, and TD3+BC are run on Standard F4S V2 nodes of Azure, and experiments with CQL are run on NC6S V2 nodes of Azure.
Software Dependencies | No | The paper mentions that "The first-order optimization is implemented by ADAM (Kingma and Ba, 2014)" and refers to base methods using PyTorch (in Table 3). However, it does not provide specific version numbers for any software components (e.g., Python, PyTorch, or other libraries).
Experiment Setup | Yes | For each dataset, the hyperparameters of the base methods are tuned over six different configurations suggested by the original papers. Such configurations are summarized in Table 5. ... The first-order optimization is implemented by ADAM (Kingma and Ba, 2014) with a minibatch size of 256. The learning rates are selected following the original implementation and are reported in Table 4. (A minimal sketch of these optimizer settings follows the table.)
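
To make the relabeling step in Algorithm 1 concrete, here is a minimal Python sketch. It is an illustration rather than the authors' implementation: the heuristic h_t is taken to be the discounted Monte Carlo return-to-go of the logged trajectory, the per-trajectory blending factors λ_t are collapsed into a single constant lam, and the blending rule r̃ = r + λγh(s'), γ̃ = (1 - λ)γ is assumed; the trajectory keys ("obs", "act", "next_obs", "rew") and the function name are placeholders.

```python
import numpy as np

def relabel_with_hubl(trajectories, gamma, lam=0.5):
    """Sketch of HUBL-style relabeling (cf. Algorithm 1).

    Each trajectory is a dict of aligned arrays: "obs", "act",
    "next_obs", "rew". `lam` stands in for the paper's per-trajectory
    blending factors lambda_t.
    """
    relabeled = []
    for traj in trajectories:
        rew = np.asarray(traj["rew"], dtype=np.float64)
        T = len(rew)

        # Heuristic h_t: discounted Monte Carlo return-to-go of the
        # logged trajectory, computed backwards in time (assumption).
        h = np.zeros(T + 1)
        for t in reversed(range(T)):
            h[t] = rew[t] + gamma * h[t + 1]

        # Blend the heuristic into the reward and shrink the discount
        # (assumed form): r~_t = r_t + lam * gamma * h_{t+1},
        # gamma~_t = (1 - lam) * gamma.
        r_tilde = rew + lam * gamma * h[1:]
        gamma_tilde = (1.0 - lam) * gamma * np.ones(T)

        relabeled.append({
            "obs": traj["obs"],
            "act": traj["act"],
            "next_obs": traj["next_obs"],
            "rew": r_tilde,
            "discount": gamma_tilde,
        })
    return relabeled
```

The relabeled transitions then replace the original dataset as input to an unmodified base offline RL learner (CQL, TD3+BC, IQL, or ATAC), which is why HUBL can be layered on top of these methods without changing their code.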
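
For the optimization settings quoted in the Experiment Setup row, a short PyTorch sketch follows. Only the optimizer choice (Adam) and the minibatch size of 256 come from the paper; the network shape, the learning rate value, and the synthetic data are placeholders, since the paper selects per-method learning rates (its Table 4) and six hyperparameter configurations (its Table 5) that are not reproduced here.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder critic network and synthetic transitions; only Adam and
# the minibatch size of 256 are taken from the paper's setup.
critic = torch.nn.Sequential(
    torch.nn.Linear(17, 256), torch.nn.ReLU(), torch.nn.Linear(256, 1)
)
optimizer = torch.optim.Adam(critic.parameters(), lr=3e-4)  # lr is per-method (Table 4)

data = TensorDataset(torch.randn(1024, 17), torch.randn(1024, 1))
loader = DataLoader(data, batch_size=256, shuffle=True)  # minibatch size of 256
```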