Contrastive Learning as Goal-Conditioned Reinforcement Learning

Authors: Benjamin Eysenbach, Tianjun Zhang, Sergey Levine, Russ R. Salakhutdinov

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Across a range of goal-conditioned RL tasks, we demonstrate that contrastive RL methods achieve higher success rates than prior non-contrastive methods, including in the offline RL setting. We also show that contrastive RL outperforms prior methods on image-based tasks, without using data augmentation or auxiliary objectives."
Researcher Affiliation | Collaboration | Benjamin Eysenbach (CMU, Google Research), Tianjun Zhang (UC Berkeley), Sergey Levine (Google Research, UC Berkeley), Ruslan Salakhutdinov (CMU)
Pseudocode | Yes | "Alg. 1 provides a JAX [13] implementation of the actor and critic losses." (a hedged sketch of such losses appears after the table)
Open Source Code | Yes | "Project website with videos and code: https://ben-eysenbach.github.io/contrastive_rl"
Open Datasets | Yes | "We use the benchmark Ant Maze tasks from the D4RL benchmark [36]" (a loading example appears after the table)
Dataset Splits | No | The paper mentions using a replay buffer, environment steps, and batch sizes for training. For the offline RL setting, it uses the D4RL benchmark, but it does not explicitly state specific training, validation, and test dataset splits (e.g., percentages or sample counts) within the text.
Hardware Specification | Yes | "On a single TPUv2, training proceeds at 1100 batches/sec for state-based tasks and 105 batches/sec for image-based tasks; for comparison, our implementation of DrQ on the same hardware setup runs at 28 batches/sec (3.9× slower)."
Software Dependencies | No | The paper states that its implementation is based on JAX [13] and ACME [57], but it does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | "Architectures and hyperparameters are described in Appendix E.7"; "We use a replay buffer size of 10^6 for all tasks. For state-based tasks, the training proceeds for 3 million environment steps. For image-based tasks, training proceeds for 1 million environment steps. Each policy update uses a batch size of 256. For state-based tasks, we take 1000 critic steps and 1000 actor steps. For image-based tasks, we take 250 critic steps and 250 actor steps." (these settings are collected in a config sketch after the table)
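
The actor and critic losses referenced in the Pseudocode row can be summarized as follows. This is a minimal sketch, not the paper's Algorithm 1 verbatim: it assumes the critic is the inner product of a state-action encoder and a goal encoder, with the other goals in the batch serving as negatives; `sa_encoder`, `g_encoder`, and `policy` are hypothetical stand-ins for the paper's networks.

```python
# Minimal contrastive RL loss sketch in JAX (assumed encoder/policy interfaces).
import jax.numpy as jnp
import optax


def critic_loss(critic_params, sa_encoder, g_encoder, states, actions, goals):
    """Binary NCE loss: diagonal entries are the positive (s, a, g) pairs."""
    sa_repr = sa_encoder(critic_params, states, actions)   # [B, d]
    g_repr = g_encoder(critic_params, goals)                # [B, d]
    logits = sa_repr @ g_repr.T                             # [B, B] pairwise critic values
    labels = jnp.eye(logits.shape[0])                       # positives on the diagonal
    return jnp.mean(optax.sigmoid_binary_cross_entropy(logits, labels))


def actor_loss(policy_params, critic_params, policy, sa_encoder, g_encoder,
               states, goals, rng):
    """The actor picks actions that the critic scores highly for the commanded goal."""
    actions = policy(policy_params, states, goals, rng)     # a ~ pi(. | s, g)
    sa_repr = sa_encoder(critic_params, states, actions)
    g_repr = g_encoder(critic_params, goals)
    return -jnp.mean(jnp.sum(sa_repr * g_repr, axis=-1))    # maximize critic(s, a, g)
```

Passing `critic_params` unchanged into the actor loss reflects the usual actor-critic split (only the policy parameters are updated there); the exact stop-gradient handling in the paper's Algorithm 1 may differ.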
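
The offline experiments draw on the D4RL Ant Maze tasks. Below is a hedged loading example assuming the standard `d4rl` Python package; the task id `antmaze-large-diverse-v0` is only an illustrative choice, not necessarily the exact variant used in the paper.

```python
# Illustrative D4RL Ant Maze loading example (task id is an assumption).
import gym
import d4rl  # noqa: F401  (importing d4rl registers its environments with gym)

env = gym.make('antmaze-large-diverse-v0')
dataset = d4rl.qlearning_dataset(env)  # dict of observations, actions, rewards, terminals, ...
print(dataset['observations'].shape)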
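
The quoted experiment settings can also be gathered in one place for reference. The dictionary names and keys below are hypothetical; only the values come from the quoted setup.

```python
# Hypothetical config dicts summarizing the quoted hyperparameters.
STATE_BASED = dict(
    replay_buffer_size=10**6,  # 10^6 transitions, shared across all tasks
    env_steps=3_000_000,       # 3 million environment steps
    batch_size=256,
    critic_steps=1000,
    actor_steps=1000,
)

IMAGE_BASED = dict(
    replay_buffer_size=10**6,
    env_steps=1_000_000,       # 1 million environment steps
    batch_size=256,
    critic_steps=250,
    actor_steps=250,
)
```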