Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Flattening Hierarchies with Policy Bootstrapping

Authors: John L. Zhou, Jonathan C. Kao

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Across a comprehensive suite of state- and pixel-based locomotion and manipulation benchmarks, our method matches or surpasses state-of-the-art offline GCRL algorithms and scales to complex, long-horizon tasks where prior approaches fail. ... 6 Experiments To assess SAW's ability to reason over long horizons and handle high-dimensional observations, we conduct experiments across 20 datasets corresponding to 7 locomotion and manipulation environments [Figure 2] with both state- and pixel-based observation spaces. We report performance averaged over 5 state-goal pairs for each dataset, yielding 100 total evaluation tasks.
Researcher Affiliation	Academia	John L. Zhou University of California, Los Angeles EMAIL Jonathan C. Kao University of California, Los Angeles EMAIL
Pseudocode	Yes	Finally, we use GCIVL to learn V, resulting in the full training scheme outlined in Algorithm 1. Algorithm 1 Subgoal Advantage-Weighted Policy Bootstrapping (SAW)
Open Source Code	Yes	Question: Does the paper provide open access to the data and code...? Answer: [Yes] Justification: Anonymized code is attached in the supplementary materials, and a link to an open-source implementation will be added to the abstract after the peer review process.
Open Datasets	Yes	We select several environments and their corresponding datasets from the recently released OGBench suite [37], a comprehensive benchmark specifically designed for offline GCRL. OGBench provides multiple state-goal pairs for evaluation and datasets tailored to evaluate desirable properties of offline GCRL algorithms, such as the ability to reason over long horizons and stitch across multiple trajectories or combinatorial goal sequences. We use the baselines from the original OGBench paper, which include both one-step and hierarchical state-of-the-art offline GCRL methods.
Dataset Splits	No	The paper does not explicitly provide training/test/validation dataset splits. It mentions "OGBench provides multiple state-goal pairs for evaluation" and "We report performance averaged over 5 state-goal pairs for each dataset, yielding 100 total evaluation tasks," both of which refer to evaluation tasks, not to a train/test/validation split of the dataset. It also describes goal-sampling distributions for training but not dataset splits.
Hardware Specification	Yes	All experiments were conducted on a cluster consisting of Nvidia GeForce RTX 3090 GPUs with 24 GB of VRAM and Nvidia GeForce RTX 3070 GPUs with 8 GB of VRAM.
Software Dependencies	No	The paper mentions architectural components like 'Impala encoder' and 'Resnet block' but does not specify software dependencies like programming languages or libraries with version numbers (e.g., Python 3.x, PyTorch 1.x).
Experiment Setup	Yes	Table 2: SAW hyperparameters. Each cell indicates the hyperparameters for the corresponding environment and dataset. From left to right, these hyperparameters are: the expectile parameter τ for GCIVL, the one-step AWR temperature α (used for training both the target and policy networks), the temperature on the KL divergence term β, and the number of subgoal steps k. ... Implementation details and hyperparameter settings are discussed in Appendices F and G, respectively.