Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

How to Leverage Unlabeled Data in Offline Reinforcement Learning

Authors: Tianhe Yu, Aviral Kumar, Yevgen Chebotar, Karol Hausman, Chelsea Finn, Sergey Levine

ICML 2022 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our empirical evaluation confirms these findings in simulated robotic locomotion, navigation, and manipulation settings. In our experiments, we aim to evaluate whether the theoretical potential for simple minimum-reward relabeling to attain good results is reflected in benchmark tasks and more complex offline RL settings.
Researcher Affiliation	Collaboration	¹Stanford University ²Google Research ³UC Berkeley.
Pseudocode	No	The paper provides mathematical formulations for optimization objectives in Appendix G.1 but does not present them in a clearly labeled pseudocode or algorithm block.
Open Source Code	No	The paper does not contain any explicit statement about releasing the source code for the described methodology or a link to a code repository.
Open Datasets	Yes	Single-task hopper domains. We use the hopper environment and datasets from D4RL (Fu et al., 2020). Multi-task Meta-World domains. We use the door open, door close, drawer open and drawer close environments introduced in (Yu et al., 2021a) from the public Meta-World (Yu et al., 2020b) repo. The Meta-World environment can be found at the open-sourced repo https://github.com/rlworkgroup/metaworld
Dataset Splits	No	The paper describes the datasets used and mentions training and testing, but it does not provide specific details on how the data was split into training, validation, and test sets (e.g., percentages or exact counts for each split).
Hardware Specification	Yes	We train UDS and CDS+UDS on a single NVIDIA GeForce RTX 2080 Ti for one day on the state-based domains. For the vision-based robotic picking and placing experiments, it takes 3 days to train it on 16 TPUs.
Software Dependencies	No	The paper does not provide specific version numbers for software dependencies such as programming languages, libraries, or frameworks used in the experiments.
Experiment Setup	Yes	For more details on experimental set-up and hyperparameter settings, please see Appendix G. On the hopper domain, when the unlabeled data is random, we use the version of CQL that does not maximize the term $\mathbb{E}_{s,a \sim \mathcal{D}_L \cup \mathcal{D}_U}[\hat{Q}(s, a)]$ to prevent overestimating Q-values on low-quality random data and use β = 1.0. We use β = 5.0 in the other settings in the hopper domain.
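
The "simple minimum-reward relabeling" mentioned under Research Type can be illustrated with a minimal sketch. This is not the authors' code (the table above records that none was released). The tuple layout, the function name relabel_with_min_reward, and the fallback of taking the minimum from the labeled rewards when r_min is not given are all assumptions made for illustration.

```python
def relabel_with_min_reward(labeled, unlabeled, r_min=None):
    """Merge labeled and unlabeled transitions for offline RL.

    labeled:   list of (state, action, reward) tuples      (hypothetical layout)
    unlabeled: list of (state, action) tuples with no reward
    r_min:     reward assigned to every unlabeled transition; if None,
               fall back to the smallest reward seen in the labeled data
               (an assumption of this sketch, not taken from the paper).
    """
    if r_min is None:
        r_min = min(reward for _, _, reward in labeled)
    # Give every unlabeled transition the minimum reward.
    relabeled = [(state, action, r_min) for state, action in unlabeled]
    # The combined dataset is what an offline RL algorithm would train on.
    return labeled + relabeled


# Toy data: two labeled transitions and one unlabeled transition.
labeled = [((0.0,), 0, 1.0), ((1.0,), 1, 0.0)]
unlabeled = [((2.0,), 1)]
merged = relabel_with_min_reward(labeled, unlabeled)
```

In this toy run, merged holds three transitions and the unlabeled one carries reward 0.0. Passing r_min explicitly would be the safer choice whenever the true minimum reward of the task is known.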