Goal-conditioned Offline Planning from Curious Exploration

Authors: Marco Bagatella, Georg Martius

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Algorithms are evaluated in four MDP instantiations, namely maze_large and kitchen from D4RL [14], fetch_push from gymnasium-robotics [13], and an adapted pinpad environment [18]. These environments were chosen to differ drastically in dynamics, dimensionality, and task duration. In particular, maze_large involves navigating to distant goals in a 2D maze by controlling the acceleration of a point mass, kitchen requires controlling a 9-DoF robot through diverse tasks, including manipulating a kettle and operating switches, fetch_push is a manipulation task controlled through operational space control, and pinpad requires navigating a room and pressing a sequence of buttons in the right order. (An illustrative environment-setup sketch follows the table.)
Researcher Affiliation | Academia | Marco Bagatella (ETH Zürich & Max Planck Institute for Intelligent Systems, Tübingen, Germany; mbagatella@ethz.ch); Georg Martius (University of Tübingen & Max Planck Institute for Intelligent Systems, Tübingen, Germany; georg.martius@uni-tuebingen.de)
Pseudocode | Yes | Algorithm S1: TD3 + MPC + Aggregation (a hedged planning sketch follows the table)
Open Source Code | Yes | Our codebase builds upon mbrl-lib [34] and adapts it to implement unsupervised exploration and goal-conditioned value-learning algorithms. To ensure reproducibility, we make it publicly available (code available at sites.google.com/view/gcopfce).
Open Datasets | No | For each environment, we collect 200k exploratory transitions by curious exploration, corresponding to 2 hours of real-time interaction at 30 Hz, which is an order of magnitude less than existing benchmarks [14, 24]. The paper describes collecting its own dataset but does not explicitly state that this collected dataset is publicly available or provide a link for access to it. (A data-collection sketch follows the table.)
Dataset Splits | No | The paper does not explicitly provide percentages or counts for training, validation, and test splits. It mentions that "All algorithms are evaluated by success rate" in "test episodes" and refers to Appendix I for details. However, Appendix I focuses on implementation specifics and does not detail explicit data splits for train/validation/test.
Hardware Specification | Yes | Our method's training costs (TD3 + model-based planning + aggregation) are those of TD3 (∼70 minutes on a single NVIDIA RTX 3060 GPU).
Software Dependencies | No | The paper mentions using "mbrl-lib [34]", "Adam [21]", "TD3 [14]", "CRR [50]", "MBPO [19]", "MOPO [53]", "gymnasium-robotics [13]", and "D4RL [14]", but it does not provide specific version numbers for these software components or libraries.
Experiment Setup | Yes | Algorithm-specific hyperparameters were tuned separately for each method through grid search, and kept constant across environments. Each algorithm is allowed the same number of gradient steps (sufficient to ensure convergence for all methods). Appendix I provides tables (e.g., Table S3, Table S4) and text describing specific hyperparameters such as "Batch size 512", "Critic learning rate 1e-5", "# of ensemble members 7", "Population size 400", and "Planning horizon H = 30" (or H = 10/15 depending on environment). (The quoted values are gathered into an illustrative config sketch below.)
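The four evaluation environments named in the Research Type row can be instantiated roughly as follows. This is a minimal sketch: the registry IDs are the standard D4RL and gymnasium-robotics names and may not match the adapted versions used by the authors, and the pinpad environment has no public registry ID.

```python
# Illustrative setup of the four evaluation environments (assumed registry IDs;
# the authors' adapted versions may differ).
import gym
import d4rl  # noqa: F401 -- importing d4rl registers the maze2d and kitchen envs
import gymnasium
import gymnasium_robotics  # noqa: F401 -- registers the Fetch manipulation envs

maze_large = gym.make("maze2d-large-v1")     # point-mass navigation via acceleration control
kitchen = gym.make("kitchen-mixed-v0")       # 9-DoF robot, kettle / switch manipulation
fetch_push = gymnasium.make("FetchPush-v2")  # manipulation via operational space control
# pinpad (adapted from [18]) is not part of a public registry and is omitted here.
```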
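The Pseudocode row refers to Algorithm S1 in the appendix. The sketch below is only a generic reconstruction of the receding-horizon planner it suggests: candidate action sequences are rolled out through a learned dynamics model and scored with an aggregate of goal-conditioned critics. The `model` and `critics` interfaces and the mean aggregation are assumptions; the paper defines its own aggregation functions.

```python
import numpy as np

def plan_first_action(model, critics, state, goal, horizon=30, population=400,
                      rng=None):
    """Score random action sequences under a learned model and return the first
    action of the best one (MPC-style control with a terminal value bootstrap).

    `model.step(states, actions)`, `model.action_dim`, and the callable `critics`
    are hypothetical placeholders; the mean over critics is only one possible
    aggregation and is not necessarily the one used in the paper.
    """
    rng = rng or np.random.default_rng()
    # Random-shooting candidates; actions are assumed to lie in [-1, 1].
    actions = rng.uniform(-1.0, 1.0, size=(population, horizon, model.action_dim))
    states = np.repeat(state[None], population, axis=0)
    returns = np.zeros(population)
    for t in range(horizon):
        states, rewards = model.step(states, actions[:, t])  # learned dynamics rollout
        returns += rewards
    # Bootstrap with an aggregate of goal-conditioned value estimates.
    terminal_values = np.stack([q(states, goal) for q in critics])
    returns += terminal_values.mean(axis=0)
    return actions[int(np.argmax(returns)), 0]
```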
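The Open Datasets row describes 200k exploratory transitions gathered by curious exploration. The loop below is a minimal sketch of such a collection run under the gymnasium step API; the ensemble-disagreement intrinsic reward is a common curiosity signal and an assumption here, not necessarily the exploration objective used in the paper.

```python
import numpy as np

def collect_exploratory_transitions(env, policy, dynamics_ensemble,
                                    n_transitions=200_000):
    """Fill a replay buffer with curiosity-driven transitions (gymnasium API)."""
    buffer = []
    obs, _ = env.reset()
    while len(buffer) < n_transitions:
        action = policy(obs)
        next_obs, _, terminated, truncated, _ = env.step(action)
        # Intrinsic reward: disagreement between learned dynamics models
        # (assumed curiosity signal, stored in place of any task reward).
        predictions = np.stack([m.predict(obs, action) for m in dynamics_ensemble])
        intrinsic_reward = predictions.var(axis=0).mean()
        buffer.append((obs, action, intrinsic_reward, next_obs))
        obs = next_obs
        if terminated or truncated:
            obs, _ = env.reset()
    return buffer
```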
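For convenience, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration. Only values explicitly listed above are included; the key names are illustrative rather than the names used in the released codebase.

```python
# Hyperparameters quoted from Appendix I (Tables S3/S4); key names are illustrative.
config = {
    "batch_size": 512,
    "critic_learning_rate": 1e-5,
    "num_ensemble_members": 7,
    "population_size": 400,
    "planning_horizon": 30,  # 10 or 15 in some environments
}
```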