Semi-Supervised Offline Reinforcement Learning with Action-Free Trajectories

Authors: Qinqing Zheng, Mikael Henaff, Brandon Amos, Aditya Grover

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we find this simple pipeline to be highly successful: on several D4RL benchmarks (Fu et al., 2020), certain offline RL algorithms can match the performance of variants trained on a fully labelled dataset even when we label only 10% of trajectories which are highly suboptimal. To strengthen our understanding, we perform a large-scale controlled empirical study investigating the interplay of data-centric properties of the labelled and unlabelled datasets with algorithmic design choices (e.g., choice of inverse dynamics model, offline RL algorithm) to identify general trends and best practices for training RL agents on semi-supervised offline datasets.
Researcher Affiliation | Collaboration | 1Meta AI Research, 2UCLA. Correspondence to: Qinqing Zheng <zhengqinqing@gmail.com>.
Pseudocode | Yes | Algorithm 1: Semi-supervised offline RL (SS-ORL) and Algorithm 2: Self-Training for the Inverse Dynamics Model (a minimal code sketch of the SS-ORL pipeline follows the table).
Open Source Code | Yes | For all the offline RL methods we consider, we use our own implementations adapted from the following codebases: DT: https://github.com/facebookresearch/online-dt; TD3BC: https://github.com/sfujim/TD3_BC; CQL: https://github.com/scottemmons/youngs-cql
Open Datasets | Yes | We focus on two Gym locomotion tasks, hopper and walker, with the v2 medium-expert, medium and medium-replay datasets from the D4RL benchmark (Fu et al., 2020) (a dataset-loading sketch follows the table).
Dataset Splits | Yes | To prevent overfitting, we randomly sample 10% of the labelled trajectories as the validation set, and use the IDM that yields the best validation error within 100k iterations (see the split sketch after the table).
Hardware Specification | No | The paper does not specify any particular hardware details such as GPU models, CPU models, or cloud computing instance types used for running the experiments.
Software Dependencies | No | The paper mentions optimizers like LAMB and Adam, but it does not specify version numbers for key software components such as deep learning frameworks (e.g., PyTorch, TensorFlow) or Python.
Experiment Setup | Yes | In this section, we provide more details about our experiments. For all the offline RL methods we consider, we use our own implementations adapted from the following codebases: [links to codebases]. Table A.1: The hyperparameters used for DT. Table A.2: The hyperparameters used for TD3BC. Table A.3: The hyperparameters used for CQL. We use batch size 256 and context length 20 for DT, where each batch contains 5120 states. Correspondingly, we use batch size 5120 for CQL and TD3BC (see the batch-size sketch after the table).
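
The SS-ORL pipeline summarized in the Pseudocode row (Algorithm 1) trains an inverse dynamics model (IDM) on the small action-labelled dataset, uses it to fill in proxy actions for the action-free trajectories, and then runs a standard offline RL algorithm on the combined data. Below is a minimal PyTorch-style sketch of that flow; the `InverseDynamicsModel` architecture, the `train_idm` and `relabel` helpers, and the `labelled.sample_transitions` interface are illustrative assumptions rather than the authors' implementation, and Algorithm 2's self-training refinement of the IDM is omitted.

```python
import torch
import torch.nn as nn


class InverseDynamicsModel(nn.Module):
    """Predicts the action a_t from the transition (s_t, s_{t+1}). Hypothetical architecture."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, s, s_next):
        return self.net(torch.cat([s, s_next], dim=-1))


def train_idm(idm, labelled, iters=100_000, lr=1e-4):
    """Fit the IDM on the small labelled dataset with an MSE loss.
    `labelled.sample_transitions` is an assumed data-loader interface."""
    opt = torch.optim.Adam(idm.parameters(), lr=lr)
    for _ in range(iters):
        s, s_next, a = labelled.sample_transitions(batch_size=256)
        loss = ((idm(s, s_next) - a) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return idm


def relabel(idm, action_free_trajs):
    """Fill in proxy actions for action-free trajectories; the predicted
    actions cover the first T-1 steps of each length-T trajectory."""
    with torch.no_grad():
        for traj in action_free_trajs:
            obs = traj["observations"]
            traj["actions"] = idm(obs[:-1], obs[1:])
    return action_free_trajs
```

After relabelling, the labelled and relabelled trajectories are pooled and handed to whichever offline RL learner is being evaluated (DT, TD3BC, or CQL in the paper).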
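
The D4RL datasets named in the Open Datasets row can be pulled with the standard `d4rl` package; the snippet below is a hedged loading sketch, assuming `gym` and `d4rl` are installed (the walker task corresponds to the `walker2d-*` dataset names).

```python
import gym
import d4rl  # registers the D4RL environments with gym

DATASETS = [
    "hopper-medium-expert-v2", "hopper-medium-v2", "hopper-medium-replay-v2",
    "walker2d-medium-expert-v2", "walker2d-medium-v2", "walker2d-medium-replay-v2",
]

for name in DATASETS:
    env = gym.make(name)
    data = d4rl.qlearning_dataset(env)  # dict with observations, actions, rewards, terminals
    print(name, data["observations"].shape, data["actions"].shape)
```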
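
The trajectory-level hold-out described in the Dataset Splits row (a random 10% of the labelled trajectories used to early-stop the IDM) could look like the following; the helper name and seeding are assumptions, not the paper's code.

```python
import numpy as np


def split_trajectories(labelled_trajs, val_frac=0.10, seed=0):
    """Hold out a random fraction of the labelled trajectories for IDM validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(labelled_trajs))
    n_val = max(1, int(val_frac * len(labelled_trajs)))
    val = [labelled_trajs[i] for i in idx[:n_val]]
    train = [labelled_trajs[i] for i in idx[n_val:]]
    return train, val
```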
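
The batch-size bookkeeping quoted in the Experiment Setup row, written out explicitly (only the values stated there; the remaining hyperparameters live in the paper's Tables A.1 to A.3):

```python
# Values quoted in the Experiment Setup row; everything else is in Tables A.1-A.3.
DT_CONFIG = {"batch_size": 256, "context_length": 20}  # 256 trajectories x 20 steps = 5120 states
TD3BC_CONFIG = {"batch_size": 5120}  # transition batch matched to DT's per-batch state count
CQL_CONFIG = {"batch_size": 5120}
```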