OPAL: Offline Primitive Discovery for Accelerating Offline Reinforcement Learning

Authors: Anurag Ajay, Aviral Kumar, Pulkit Agrawal, Sergey Levine, Ofir Nachum

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we focus on this offline setting. Our main insight is that, when presented with offline data composed of a variety of behaviors, an effective way to leverage this data is to extract a continuous space of recurring and temporally extended primitive behaviors before using these primitives for downstream task learning. To the best of our knowledge, ours is the first work to theoretically justify and experimentally verify the benefits of primitive learning in offline RL settings, showing that hierarchies can provide temporal abstraction that allows us to reduce the effect of the compounding error issue in offline RL. In this section, we will empirically show that OPAL improves learning of downstream tasks with offline RL, and then briefly show the same with few-shot imitation learning, online RL, and online multi-task transfer learning.
Researcher Affiliation | Collaboration | Anurag Ajay (MIT), Aviral Kumar (UC Berkeley), Pulkit Agrawal (MIT), Sergey Levine (Google Research, UC Berkeley), Ofir Nachum (Google Research). Work done during an internship at Google Brain.
Pseudocode | No | The paper describes algorithmic steps within the text and refers to algorithms from other works (e.g., Algorithm 2 from Jabri et al. (2019) in Appendix F), but it does not include any self-contained, structured pseudocode blocks or figures explicitly labeled 'Algorithm' or 'Pseudocode'.
Open Source Code | Yes | Visualizations and code are available at https://sites.google.com/view/opal-iclr
Open Datasets | Yes | We use environments and datasets provided in D4RL (Fu et al., 2020). Since the aim of our method is specifically to perform offline RL in settings where the offline data comprises varied and undirected multi-task behavior, we focus on Antmaze medium (diverse dataset), Antmaze large (diverse dataset), and Franka kitchen (mixed and partial datasets). (A minimal loading sketch for these datasets appears after the table.)
Dataset Splits | No | The paper references datasets from D4RL and describes experimental results, but it does not explicitly provide details on how the datasets were split into training, validation, and test sets (e.g., specific percentages, counts, or a detailed splitting methodology). While D4RL provides these datasets, the paper does not specify the exact splits used for its own experiments.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU models, CPU types, or memory specifications. Generic terms like 'simulated ant robot' or 'franka robot' refer to the environment models, not the computing hardware.
Software Dependencies | No | The paper mentions software components and algorithms like 'Adam optimizer (Kingma & Ba, 2014)', 'SAC (Haarnoja et al., 2018)', 'PPO (Schulman et al., 2017)', 'Double DQN (Van Hasselt et al., 2015)', and the 'rlkit code base'. However, it does not provide specific version numbers for Python, PyTorch, TensorFlow, or any other libraries or frameworks, which are crucial for reproducibility.
Experiment Setup | Yes | Unless otherwise stated, we use c = 10 and dim(Z) = 8. We use H = 200 for antmaze environments and H = 256 for kitchen environments. In both cases, OPAL was trained for 100 epochs with a fixed learning rate of 1e-3, β = 0.1 (Lynch et al., 2020), Adam optimizer (Kingma & Ba, 2014), and a batch size of 50. We used a policy learning rate of 3e-5, a Q-value learning rate of 3e-4, and a primitive learning rate of 3e-4. For antmaze tasks, we used the CQL(H) variant with τ = 5 and learned α. For kitchen tasks we used the CQL(ρ) variant with fixed α = 10. In both cases, we ensured α never dropped below 0.001. (A configuration sketch collecting these values appears after the table.)
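
The D4RL datasets named in the Open Datasets row can be pulled with the standard d4rl package. The sketch below is a minimal loading example under that assumption; the "-v0" environment identifiers are an assumption about the dataset versions, since the paper does not state which it used.

    # Minimal sketch of loading the D4RL datasets named in the Open Datasets row.
    # Assumes the standard `d4rl` package; the -v0 identifiers are an assumption,
    # since the paper does not state the dataset versions it used.
    import gym
    import d4rl  # importing d4rl registers its environments with gym

    DATASET_NAMES = [
        "antmaze-medium-diverse-v0",
        "antmaze-large-diverse-v0",
        "kitchen-mixed-v0",
        "kitchen-partial-v0",
    ]

    for name in DATASET_NAMES:
        env = gym.make(name)
        data = env.get_dataset()  # dict of arrays: 'observations', 'actions', 'rewards', 'terminals', ...
        print(name, data["observations"].shape, data["actions"].shape)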
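
The hyperparameters quoted in the Experiment Setup row are scattered through the prose, so a single configuration sketch may be easier to scan. The structure and field names below are illustrative (the paper does not release a config file in this form); the values follow the quoted text.

    # Hyperparameters quoted in the Experiment Setup row, gathered in one place.
    # Field names are illustrative; values follow the paper's stated settings.
    OPAL_CONFIG = {
        "primitive_length_c": 10,                      # sub-trajectory length c
        "latent_dim": 8,                               # dim(Z)
        "trajectory_length_H": {"antmaze": 200, "kitchen": 256},
        "opal_pretraining": {
            "epochs": 100,
            "learning_rate": 1e-3,
            "kl_weight_beta": 0.1,                     # β from Lynch et al. (2020)
            "optimizer": "Adam",
            "batch_size": 50,
        },
        "downstream_cql": {
            "policy_lr": 3e-5,
            "q_value_lr": 3e-4,
            "primitive_lr": 3e-4,
            "antmaze": {"variant": "CQL(H)", "tau": 5, "alpha": "learned"},
            "kitchen": {"variant": "CQL(rho)", "alpha": 10.0},
            "alpha_floor": 1e-3,                       # α never allowed below 0.001
        },
    }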