OPAL: Offline Primitive Discovery for Accelerating Offline Reinforcement Learning

Authors: Anurag Ajay, Aviral Kumar, Pulkit Agrawal, Sergey Levine, Ofir Nachum

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we focus on this offline setting. Our main insight is that, when presented with offline data composed of a variety of behaviors, an effective way to leverage this data is to extract a continuous space of recurring and temporally extended primitive behaviors before using these primitives for downstream task learning. To the best of our knowledge, ours is the first work to theoretically justify and experimentally verify the benefits of primitive learning in offline RL settings, showing that hierarchies can provide temporal abstraction that allows us to reduce the effect of the compounding error issue in offline RL. In this section, we will empirically show that OPAL improves learning of downstream tasks with offline RL, and then briefly show the same with few-shot imitation learning, online RL, and online multi-task transfer learning.
Researcher Affiliation | Collaboration | Anurag Ajay (MIT), Aviral Kumar (UC Berkeley), Pulkit Agrawal (MIT), Sergey Levine (Google Research, UC Berkeley), Ofir Nachum (Google Research). Work done during an internship at Google Brain.
Pseudocode | No | The paper describes algorithmic steps within the text and refers to algorithms from other works (e.g., Algorithm 2 from Jabri et al. (2019) in Appendix F), but it does not include any self-contained, structured pseudocode blocks or figures explicitly labeled 'Algorithm' or 'Pseudocode'.
Open Source Code | Yes | Visualizations and code are available at https://sites.google.com/view/opal-iclr
Open Datasets | Yes | We use environments and datasets provided in D4RL (Fu et al., 2020). Since the aim of our method is specifically to perform offline RL in settings where the offline data comprises varied and undirected multi-task behavior, we focus on Antmaze medium (diverse dataset), Antmaze large (diverse dataset), and Franka kitchen (mixed and partial datasets). (A minimal loading sketch for these datasets appears after the table.)
Dataset Splits | No | The paper references datasets from D4RL and describes experimental results, but it does not explicitly provide details on how the datasets were split into training, validation, and test sets (e.g., specific percentages, counts, or a detailed splitting methodology). While D4RL provides these datasets, the paper does not specify the exact splits used for its own experiments.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU models, CPU types, or memory specifications. Generic terms like 'simulated ant robot' or 'franka robot' refer to the environment models, not the computing hardware.
Software Dependencies | No | The paper mentions software components and algorithms like 'Adam optimizer (Kingma & Ba, 2014)', 'SAC (Haarnoja et al., 2018)', 'PPO (Schulman et al., 2017)', 'Double DQN (Van Hasselt et al., 2015)', and the 'rlkit code base'. However, it does not provide specific version numbers for Python, PyTorch, TensorFlow, or any other libraries or frameworks, which are crucial for reproducibility.
Experiment Setup | Yes | Unless otherwise stated, we use c = 10 and dim(Z) = 8. We use H = 200 for antmaze environments and H = 256 for kitchen environments. In both cases, OPAL was trained for 100 epochs with a fixed learning rate of 1e-3, β = 0.1 (Lynch et al., 2020), Adam optimizer (Kingma & Ba, 2014), and a batch size of 50. We used a policy learning rate of 3e-5, a Q-value learning rate of 3e-4, and a primitive learning rate of 3e-4. For antmaze tasks, we used the CQL(H) variant with τ = 5 and learned α. For kitchen tasks we used the CQL(ρ) variant with fixed α = 10. In both cases, we ensured α never dropped below 0.001. (A configuration sketch collecting these values appears after the table.)
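
The D4RL datasets named in the Open Datasets row can be pulled with the standard d4rl package. The sketch below is a minimal loading example under that assumption; the "-v0" environment identifiers are an assumption about the dataset versions, since the paper does not state which it used.

    # Minimal sketch of loading the D4RL datasets named in the Open Datasets row.
    # Assumes the standard `d4rl` package; the -v0 identifiers are an assumption,
    # since the paper does not state the dataset versions it used.
    import gym
    import d4rl  # importing d4rl registers its environments with gym

    DATASET_NAMES = [
        "antmaze-medium-diverse-v0",
        "antmaze-large-diverse-v0",
        "kitchen-mixed-v0",
        "kitchen-partial-v0",
    ]

    for name in DATASET_NAMES:
        env = gym.make(name)
        data = env.get_dataset()  # dict of arrays: 'observations', 'actions', 'rewards', 'terminals', ...
        print(name, data["observations"].shape, data["actions"].shape)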
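
The hyperparameters quoted in the Experiment Setup row are scattered through the prose, so a single configuration sketch may be easier to scan. The structure and field names below are illustrative (the paper does not release a config file in this form); the values follow the quoted text.

    # Hyperparameters quoted in the Experiment Setup row, gathered in one place.
    # Field names are illustrative; values follow the paper's stated settings.
    OPAL_CONFIG = {
        "primitive_length_c": 10,                      # sub-trajectory length c
        "latent_dim": 8,                               # dim(Z)
        "trajectory_length_H": {"antmaze": 200, "kitchen": 256},
        "opal_pretraining": {
            "epochs": 100,
            "learning_rate": 1e-3,
            "kl_weight_beta": 0.1,                     # β from Lynch et al. (2020)
            "optimizer": "Adam",
            "batch_size": 50,
        },
        "downstream_cql": {
            "policy_lr": 3e-5,
            "q_value_lr": 3e-4,
            "primitive_lr": 3e-4,
            "antmaze": {"variant": "CQL(H)", "tau": 5, "alpha": "learned"},
            "kitchen": {"variant": "CQL(rho)", "alpha": 10.0},
            "alpha_floor": 1e-3,                       # α never allowed below 0.001
        },
    }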