Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

GOPlan: Goal-conditioned Offline Reinforcement Learning by Planning with Learned Models

Authors: Mianchu Wang, Rui Yang, Xi Chen, Hao Sun, Meng Fang, Giovanni Montana

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental With thorough experimental evaluations, we demonstrate that GOPlan achieves state-of-the-art performance on various offline multi-goal navigation and manipulation tasks. Moreover, our results highlight the superior ability of GOPlan to handle small data budgets and generalize to OOD goals. In the experiments, we evaluate GOPlan across multiple multi-goal navigation and manipulation tasks, and demonstrate that GOPlan outperforms prior state-of-the-art (SOTA) offline GCRL algorithms.
Researcher Affiliation Academia Mianchu Wang, University of Warwick; Rui Yang, Hong Kong University of Science and Technology; Xi Chen, Tsinghua University; Hao Sun, University of Cambridge; Meng Fang, University of Liverpool; Giovanni Montana, University of Warwick
Pseudocode Yes Algorithm 1 Goal-conditioned Offline Planning (GOPlan). Algorithm 2 Model-based Planning.
Open Source Code No The paper does not contain any explicit statement about releasing source code, nor a link to a code repository.
Open Datasets Yes To conduct benchmark experiments, we utilize offline datasets from (Yang et al., 2022b). The dataset contains 1 × 10^5 transitions for low-dimensional tasks and 2 × 10^6 for four high-dimensional tasks (Fetch Push, Fetch Pick, Fetch Slide and Hand Reach).
Dataset Splits No The paper describes the total size of the datasets used and mentions creating smaller subsets (e.g., "-s" and "-es", containing 10% and 1% of the transitions). However, it does not explicitly provide training, validation, or test splits for these offline datasets; it appears the entire offline dataset is used for learning, with evaluation performed in the environment.
Hardware Specification Yes ACGAN-plan is stable and efficient, but it imposes excessive computation for online interaction (30 Hz on GTX 3090).
Software Dependencies No The paper mentions using "Adam optimizer" and refers to a policy implemented with "3-layer multi-layer perceptrons" and "batch normalization," but it does not specify software versions for any programming languages, libraries, or frameworks used (e.g., Python, PyTorch, TensorFlow, CUDA versions).
Experiment Setup Yes We use the Adam optimizer with a learning rate of 1 × 10^-4. During the fine-tuning phase, we gather Jintra = 2000 and Jinter = 2000 trajectories for intra-trajectory reanalysis and inter-trajectory reanalysis, respectively. After each collection, the prior policy is fine-tuned with 500 gradient steps. This process is performed a total of I = 10 times. To make the differences visible in the ablation study, we use Jintra = 200 and Jinter = 200 there. Table 4 gives a list and description of the hyperparameters, as well as their default values. (Table 4 includes: γ discount factor 0.98; β coefficient in the exponential advantage weight 60; N number of dynamics models 3; ACGAN noise dimension 64; batch size 512; I finetuning episodes 10; Iintra # collected intra-trajectories each episode 1000; Iinter # collected inter-trajectories each episode 1000; u uncertainty threshold 0.1; finetuning updates every episode 1000; pretraining updates 5 × 10^5; size of reanalysis buffer 2 × 10^6; κ exponential weight 2; K planning horizon 20; C number of initial actions 20; H copies of initial actions 10.)
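For reference, the Table 4 defaults quoted above can be collected into a single configuration sketch. This is a hypothetical layout assuming a plain Python dict; the key names are illustrative and do not come from any released GOPlan codebase.

```python
# Hypothetical config sketch of the GOPlan defaults reported in Table 4.
# Key names are invented for readability; values are as quoted in the paper.
GOPLAN_DEFAULTS = {
    "gamma": 0.98,                    # discount factor
    "beta": 60,                       # exponential advantage weight coefficient
    "num_dynamics_models": 3,         # N, ensemble size
    "acgan_noise_dim": 64,            # ACGAN noise dimension
    "batch_size": 512,
    "finetuning_episodes": 10,        # I
    "intra_trajs_per_episode": 1000,  # Iintra
    "inter_trajs_per_episode": 1000,  # Iinter
    "uncertainty_threshold": 0.1,     # u
    "finetuning_updates_per_episode": 1000,
    "pretraining_updates": 5 * 10**5,
    "reanalysis_buffer_size": 2 * 10**6,
    "kappa": 2,                       # exponential weight in planning
    "planning_horizon": 20,           # K
    "num_initial_actions": 20,        # C
    "initial_action_copies": 10,      # H
    "learning_rate": 1e-4,            # Adam
}
```

Note the reported mismatch between the running text (Jintra = Jinter = 2000) and the Table 4 defaults (1000 each); the sketch follows Table 4.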