Contrastive Learning as Goal-Conditioned Reinforcement Learning
Authors: Benjamin Eysenbach, Tianjun Zhang, Sergey Levine, Russ R. Salakhutdinov
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Across a range of goal-conditioned RL tasks, we demonstrate that contrastive RL methods achieve higher success rates than prior non-contrastive methods, including in the offline RL setting. We also show that contrastive RL outperforms prior methods on image-based tasks, without using data augmentation or auxiliary objectives. |
| Researcher Affiliation | Collaboration | Benjamin Eysenbach (CMU, Google Research), Tianjun Zhang (UC Berkeley), Sergey Levine (Google Research, UC Berkeley), Ruslan Salakhutdinov (CMU) |
| Pseudocode | Yes | Alg. 1 provides a JAX [13] implementation of the actor and critic losses (a hedged sketch of these losses appears after the table). |
| Open Source Code | Yes | Project website with videos and code: https://ben-eysenbach.github.io/contrastive_rl |
| Open Datasets | Yes | We use the benchmark Ant Maze tasks from the D4RL benchmark [36] |
| Dataset Splits | No | The paper mentions using a replay buffer, environment steps, and batch sizes for training. For the offline RL setting, it uses the D4RL benchmark, but it does not explicitly state specific training, validation, and test dataset splits (e.g., percentages or sample counts) within the text. |
| Hardware Specification | Yes | On a single TPUv2, training proceeds at 1100 batches/sec for state-based tasks and 105 batches/sec for image-based tasks; for comparison, our implementation of DrQ on the same hardware setup runs at 28 batches/sec (3.9× slower). |
| Software Dependencies | No | The paper states that its implementation is based on JAX [13] and ACME [57], but it does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | Architectures and hyperparameters are described in Appendix E.7; We use a replay buffer size of 10^6 for all tasks. For state-based tasks, training proceeds for 3 million environment steps. For image-based tasks, training proceeds for 1 million environment steps. Each policy update uses a batch size of 256. For state-based tasks, we take 1000 critic steps and 1000 actor steps. For image-based tasks, we take 250 critic steps and 250 actor steps. (These settings are collected into a configuration sketch after the table.) |
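
The "Pseudocode" row notes that Alg. 1 in the paper gives a JAX implementation of the actor and critic losses. The following is a minimal sketch of that contrastive objective, not the paper's Alg. 1: the `sa_encoder`, `g_encoder`, and `policy_sample` functions are assumed placeholders, and the critic pairs each (state, action) with a goal sampled from its own trajectory's future (positive) against goals from other batch rows (negatives).

```python
import jax.numpy as jnp
import optax


def critic_loss(sa_encoder, g_encoder, states, actions, future_states):
    """Contrastive (binary NCE) critic loss -- sketch, not the paper's Alg. 1.

    Diagonal entries of the logit matrix are positive (state, action, goal)
    triples; off-diagonal entries reuse goals from other rows as negatives.
    """
    sa_repr = sa_encoder(states, actions)               # (B, d)
    g_repr = g_encoder(future_states)                   # (B, d)
    logits = jnp.einsum("ik,jk->ij", sa_repr, g_repr)   # (B, B) inner products
    labels = jnp.eye(logits.shape[0])                   # positives on the diagonal
    return jnp.mean(optax.sigmoid_binary_cross_entropy(logits, labels))


def actor_loss(policy_sample, sa_encoder, g_encoder, states, goals, rng):
    """Actor maximizes the critic's score for the commanded goal (sketch)."""
    actions = policy_sample(rng, states, goals)          # reparameterized action sample
    sa_repr = sa_encoder(states, actions)
    g_repr = g_encoder(goals)
    scores = jnp.einsum("ik,ik->i", sa_repr, g_repr)     # per-example similarity
    return -jnp.mean(scores)
```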
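
For convenience, the settings quoted in the "Experiment Setup" row can be gathered into a single configuration dictionary. The grouping and key names below are illustrative and do not come from the paper's code.

```python
# Hyperparameters as quoted in the "Experiment Setup" row; key names are assumptions.
EXPERIMENT_SETUP = {
    "replay_buffer_size": 1_000_000,  # 10^6 transitions, all tasks
    "batch_size": 256,                # per policy update
    "state_based": {
        "env_steps": 3_000_000,
        "critic_steps": 1_000,
        "actor_steps": 1_000,
    },
    "image_based": {
        "env_steps": 1_000_000,
        "critic_steps": 250,
        "actor_steps": 250,
    },
}
```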