Semi-Supervised Offline Reinforcement Learning with Action-Free Trajectories
Authors: Qinqing Zheng, Mikael Henaff, Brandon Amos, Aditya Grover
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we find this simple pipeline to be highly successful on several D4RL benchmarks (Fu et al., 2020): certain offline RL algorithms can match the performance of variants trained on a fully labelled dataset even when we label only 10% of trajectories which are highly suboptimal. To strengthen our understanding, we perform a large-scale controlled empirical study investigating the interplay of data-centric properties of the labelled and unlabelled datasets with algorithmic design choices (e.g., choice of inverse dynamics, offline RL algorithm) to identify general trends and best practices for training RL agents on semi-supervised offline datasets. |
| Researcher Affiliation | Collaboration | ¹Meta AI Research, ²UCLA. Correspondence to: Qinqing Zheng <zhengqinqing@gmail.com>. |
| Pseudocode | Yes | Algorithm 1: Semi-supervised offline RL (SS-ORL) and Algorithm 2: Self-Training for the Inverse Dynamics Model (hedged sketches of both appear after this table). |
| Open Source Code | Yes | For all the offline RL methods we consider, we use our own implementations adopted from the following codebases: DT: https://github.com/facebookresearch/online-dt; TD3BC: https://github.com/sfujim/TD3_BC; CQL: https://github.com/scottemmons/youngs-cql |
| Open Datasets | Yes | We focus on two Gym locomotion tasks, hopper and walker, with the v2 medium-expert, medium and medium-replay datasets from the D4RL benchmark (Fu et al., 2020). |
| Dataset Splits | Yes | To prevent overfitting, we randomly sample 10% of the labelled trajectories as the validation set, and use the IDM that yields the best validation error within 100k iterations. |
| Hardware Specification | No | The paper does not specify any particular hardware details such as GPU models, CPU models, or cloud computing instance types used for running the experiments. |
| Software Dependencies | No | The paper mentions optimizers like LAMB and Adam, but it does not specify version numbers for key software components such as deep learning frameworks (e.g., PyTorch, TensorFlow) or Python. |
| Experiment Setup | Yes | In this section, we provide more details about our experiments. For all the offline RL methods we consider, we use our own implementations adopted from the following codebases: [links to codebases]. Table A.1: The hyperparameters used for DT. Table A.2: The hyperparameters used for TD3BC. Table A.3: The hyperparameters used for CQL. We use batch size 256 and context length 20 for DT, where each batch contains 5120 states. Correspondingly, we use batch size 5120 for CQL and TD3BC. |
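
The pseudocode row above cites Algorithm 1 (SS-ORL). The following is a minimal illustrative sketch of that pipeline, assuming a simplified inverse dynamics model (IDM) that predicts the action from the state pair (s_t, s_{t+1}) rather than the paper's context-conditioned variant; all class, function, and variable names are hypothetical and do not come from the authors' codebase.

```python
# Minimal sketch of the semi-supervised offline RL (SS-ORL) pipeline.
# Hypothetical names and shapes; the IDM here maps (s_t, s_{t+1}) -> a_t,
# a simplification of the paper's inverse dynamics model.
import torch
import torch.nn as nn


class InverseDynamicsModel(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, s_t, s_next):
        return self.net(torch.cat([s_t, s_next], dim=-1))


def train_idm(idm, labelled, epochs=100, lr=1e-3):
    """Fit the IDM on the small action-labelled dataset (supervised regression)."""
    opt = torch.optim.Adam(idm.parameters(), lr=lr)
    s, s_next, a = labelled  # tensors: (N, state_dim), (N, state_dim), (N, action_dim)
    for _ in range(epochs):
        loss = nn.functional.mse_loss(idm(s, s_next), a)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return idm


@torch.no_grad()
def label_actions(idm, unlabelled):
    """Impute proxy actions for the action-free trajectories."""
    s, s_next = unlabelled
    return idm(s, s_next)


def ss_orl(labelled, unlabelled, offline_rl_train):
    """SS-ORL sketch: fit IDM on labelled data, impute actions on the
    unlabelled data, then run any offline RL learner on the union."""
    state_dim = labelled[0].shape[-1]
    action_dim = labelled[2].shape[-1]
    idm = train_idm(InverseDynamicsModel(state_dim, action_dim), labelled)
    a_hat = label_actions(idm, unlabelled)
    full_s = torch.cat([labelled[0], unlabelled[0]])
    full_s_next = torch.cat([labelled[1], unlabelled[1]])
    full_a = torch.cat([labelled[2], a_hat])
    return offline_rl_train(full_s, full_a, full_s_next)
```

Per the paper, the imputed actions are then consumed unchanged by standard offline RL algorithms (DT, TD3BC, CQL), which is why the sketch treats the learner as an opaque callable.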
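Similarly, below is a hedged sketch of a self-training loop for the inverse dynamics model in the spirit of Algorithm 2, combined with the 10% validation split quoted in the Dataset Splits row. It reuses `train_idm` and the IDM class from the sketch above; the round structure, the `make_idm` factory, and the pseudo-labelling details are assumptions rather than the authors' exact procedure.

```python
import torch


def self_train_idm(make_idm, labelled, unlabelled, rounds=3, val_frac=0.1):
    """Self-training sketch: hold out ~10% of labelled data for validation,
    iterate train -> pseudo-label -> retrain, and keep the IDM with the
    lowest validation error (hypothetical loop structure)."""
    s, s_next, a = labelled
    n_val = max(1, int(val_frac * len(s)))
    perm = torch.randperm(len(s))
    val_idx, tr_idx = perm[:n_val], perm[n_val:]
    train_set = (s[tr_idx], s_next[tr_idx], a[tr_idx])
    val_set = (s[val_idx], s_next[val_idx], a[val_idx])

    best_idm, best_err = None, float("inf")
    pseudo = None
    for _ in range(rounds):
        idm = make_idm()
        # Combine ground-truth labels with pseudo-labels from the previous round.
        data = train_set if pseudo is None else tuple(
            torch.cat([x, y]) for x, y in zip(train_set, pseudo))
        train_idm(idm, data)  # from the SS-ORL sketch above
        with torch.no_grad():
            err = torch.nn.functional.mse_loss(
                idm(val_set[0], val_set[1]), val_set[2]).item()
            if err < best_err:
                best_idm, best_err = idm, err
            # Re-label the action-free data for the next round.
            pseudo = (unlabelled[0], unlabelled[1],
                      idm(unlabelled[0], unlabelled[1]))
    return best_idm
```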