Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Towards Robust Zero-Shot Reinforcement Learning

Authors: Kexin ZHENG, Lauriane Teyssier, Yinan Zheng, Yu Luo, Xianyuan Zhan

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on ExORL and D4RL Kitchen demonstrate that BREEZE achieves the best or near-the-best performance while exhibiting superior robustness compared to prior offline zero-shot RL methods. We conduct extensive experiments on the ExORL benchmark [70] and the D4RL Kitchen dataset [14], under both full datasets and small-sample data regimes.
Researcher Affiliation	Collaboration	1 The Chinese University of Hong Kong; 2 Tsinghua University; 3 Huawei Noah's Ark Lab; 4 Shanghai Artificial Intelligence Laboratory
Pseudocode	Yes	C Pseudo-Code Algorithm 1 provides the pseudocode.
Open Source Code	Yes	The official implementation is available at: https://github.com/Whiterrrrr/BREEZE.
Open Datasets	Yes	Extensive experiments on ExORL and D4RL Kitchen demonstrate that BREEZE achieves the best or near-the-best performance while exhibiting superior robustness compared to prior offline zero-shot RL methods. The official implementation is available at: https://github.com/Whiterrrrr/BREEZE. We conducted extensive experiments on the ExORL benchmark [70] and the D4RL Kitchen dataset [14], under both full datasets and small-sample data regimes.
Dataset Splits	No	Our main experiments are conducted on the ExORL benchmark [70], which provides a variety of datasets collected by several unsupervised RL algorithms [34]. We select datasets collected by 4 algorithms: RND [5], APS [38], DIAYN [11], and PROTO [69]. The experiments span 3 domains and 12 tasks (Walker: Stand, Walk, Run, Flip; Jaco: Reach Top/Bottom Left/Right; Quadruped: Stand, Walk, Run, Jump), bringing the total to 48 state-based complex tasks for performance calculation. In addition, we consider four challenging multi-stage tasks in the D4RL [14] Franka Kitchen domain [19] with two datasets (mixed and partial), which require long-horizon sequential manipulation on 4 subtasks.
Hardware Specification	Yes	Our implementation uses PyTorch [45], with all experiments conducted on a single NVIDIA A6000 GPU. Experiments are conducted on a single NVIDIA A6000 GPU.
Software Dependencies	No	Our implementation uses PyTorch [45], with all experiments conducted on a single NVIDIA A6000 GPU.
Experiment Setup	Yes	This section provides the detailed hyperparameter setup. In our experiments, the model architecture and basic algorithm hyperparameters remain unchanged, as detailed in Table 4. Domain-specific hyperparameters are detailed in Table 5 and Table 6.
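
The count of 48 tasks quoted in the Dataset Splits row can be reproduced by crossing the 4 data-collection algorithms with the 12 domain tasks. The sketch below is illustrative only and is not taken from the BREEZE repository: the names `ALGORITHMS`, `TASKS`, and `evaluation_grid` are hypothetical, and expanding "Reach Top/Bottom Left/Right" into four Jaco tasks is an assumption about how the paper's shorthand should be read.

```python
from itertools import product

# Data-collection algorithms named in the Dataset Splits row.
ALGORITHMS = ["RND", "APS", "DIAYN", "PROTO"]

# Domains and tasks named in the same row. The four Jaco entries are an
# assumed expansion of "Reach Top/Bottom Left/Right".
TASKS = {
    "Walker": ["Stand", "Walk", "Run", "Flip"],
    "Jaco": [
        "Reach Top Left",
        "Reach Top Right",
        "Reach Bottom Left",
        "Reach Bottom Right",
    ],
    "Quadruped": ["Stand", "Walk", "Run", "Jump"],
}


def evaluation_grid():
    """Return every (algorithm, domain, task) combination as a list of tuples."""
    domain_tasks = [(domain, task) for domain, tasks in TASKS.items() for task in tasks]
    return [(algo, domain, task) for algo, (domain, task) in product(ALGORITHMS, domain_tasks)]


if __name__ == "__main__":
    grid = evaluation_grid()
    # 4 algorithms x 12 domain tasks
    print(len(grid))
```

The D4RL Franka Kitchen tasks (two datasets, mixed and partial) are reported separately and are not part of this count.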