How to Leverage Unlabeled Data in Offline Reinforcement Learning
Authors: Tianhe Yu, Aviral Kumar, Yevgen Chebotar, Karol Hausman, Chelsea Finn, Sergey Levine
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical evaluation confirms these findings in simulated robotic locomotion, navigation, and manipulation settings. In our experiments, we aim to evaluate whether the theoretical potential for simple minimum-reward relabeling to attain good results is reflected in benchmark tasks and more complex offline RL settings. |
| Researcher Affiliation | Collaboration | ¹Stanford University, ²Google Research, ³UC Berkeley. |
| Pseudocode | No | The paper provides mathematical formulations for optimization objectives in Appendix G.1 but does not present them in a clearly labeled pseudocode or algorithm block. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing the source code for the described methodology or a link to a code repository. |
| Open Datasets | Yes | Single-task hopper domains. We use the hopper environment and datasets from D4RL (Fu et al., 2020). Multi-task Meta-World domains. We use the door open, door close, drawer open and drawer close environments introduced in (Yu et al., 2021a) from the public Meta-World (Yu et al., 2020b) repo. Footnote 1: The Meta-World environment can be found at the open-sourced repo https://github.com/rlworkgroup/metaworld |
| Dataset Splits | No | The paper describes the datasets used and mentions training and testing, but it does not provide specific details on how the data was split into training, validation, and test sets (e.g., percentages or exact counts for each split). |
| Hardware Specification | Yes | We train UDS and CDS+UDS on a single NVIDIA GeForce RTX 2080 Ti for one day on the state-based domains. For the vision-based robotic picking and placing experiments, it takes 3 days to train it on 16 TPUs. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as programming languages, libraries, or frameworks used in the experiments. |
| Experiment Setup | Yes | For more details on experimental set-up and hyperparameter settings, please see Appendix G. On the hopper domain, when the unlabeled data is random, we use the version of CQL that does not maximize the term $\mathbb{E}_{s,a \sim \mathcal{D}_L \cup \mathcal{D}_U}\left[\hat{Q}(s, a)\right]$ to prevent overestimating Q-values on low-quality random data and use β = 1.0. We use β = 5.0 in the other settings in the hopper domain. |
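
The "minimum-reward relabeling" mentioned in the Research Type row can be illustrated with a minimal sketch. The paper releases no code, so everything below is an assumption made for illustration: the dictionary-based transition format, the function name `relabel_with_min_reward`, and the fallback of using the smallest labeled reward when no lower bound is supplied. It is not the authors' implementation.

```python
from typing import Any, Dict, List, Optional

# Hypothetical transition format: a plain dict with state, action, next_state,
# and (for labeled data only) a "reward" key.
Transition = Dict[str, Any]


def relabel_with_min_reward(
    labeled: List[Transition],
    unlabeled: List[Transition],
    min_reward: Optional[float] = None,
) -> List[Transition]:
    """Merge both datasets, assigning the minimum reward to unlabeled transitions."""
    if min_reward is None:
        # Assumption: with no known lower bound on the reward, fall back to
        # the smallest reward observed in the labeled data.
        min_reward = min(float(t["reward"]) for t in labeled)
    # Copy each unlabeled transition so the input list is left unmodified.
    relabeled = [{**t, "reward": min_reward} for t in unlabeled]
    return list(labeled) + relabeled


# Toy data, invented for this example.
labeled = [
    {"state": 0, "action": 1, "reward": 1.0, "next_state": 1},
    {"state": 1, "action": 0, "reward": 0.0, "next_state": 2},
]
unlabeled = [
    {"state": 2, "action": 1, "next_state": 3},
]

merged = relabel_with_min_reward(labeled, unlabeled)
```

The merged list could then be passed to any offline RL learner, such as the CQL variant described in the Experiment Setup row. Passing `min_reward` explicitly is likely the safer choice when the true lower bound of the reward function is known.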