C-Learning: Learning to Achieve Goals via Recursive Classification
Authors: Benjamin Eysenbach, Ruslan Salakhutdinov, Sergey Levine
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate that C-learning more accurately estimates the density over future states, while remaining competitive with recent goal-conditioned RL methods across a suite of simulated robotic tasks. |
| Researcher Affiliation | Collaboration | Benjamin Eysenbach (CMU, Google Brain; beysenba@cs.cmu.edu); Ruslan Salakhutdinov (CMU); Sergey Levine (UC Berkeley, Google Brain) |
| Pseudocode | Yes | Algorithm 1 Monte Carlo C-learning (Page 4); Algorithm 2 Off-Policy C-learning (Page 5); Algorithm 3 Goal-Conditioned C-learning (Page 5). A hedged sketch of the Monte Carlo variant appears after the table. |
| Open Source Code | Yes | Project website with videos and code: https://ben-eysenbach.github.io/c_learning/ |
| Open Datasets | Yes | We collected a dataset of experience from agents pretrained to solve three locomotion tasks from OpenAI Gym. We used the expert data provided for each task in Fu et al. (2020). |
| Dataset Splits | Yes | We split these trajectories into train (80%) and test (20%) sets. We randomly sampled 1000 state-action pairs from the validation set and computed the average MSE with the empirical expected future state (see the evaluation sketch after the table). |
| Hardware Specification | No | The paper does not provide specific details on the hardware used for running the experiments, such as GPU models, CPU specifications, or memory. |
| Software Dependencies | No | The paper states that "Our implementation of C-learning is based on the TD3 implementation in Guadarrama et al. (2018)", which refers to "Tf-agents: A library for reinforcement learning in tensorflow", implying the use of TensorFlow; however, no version numbers are given for TensorFlow or any other software components. |
| Experiment Setup | Yes | Each of the algorithms used a 2-layer neural network with a hidden layer of size 32, optimized for 1000 iterations using the Adam optimizer with a learning rate of 3e-3 and a batch size of 256. All methods (MC C-learning, TD C-learning, and the 1-step dynamics model) used the same architecture (one hidden layer of size 256 with ReLU activation). Actor network: 2 fully-connected layers of size 256 with ReLU activations. Critic network: 2 fully-connected layers of size 256 with ReLU activations. Replay buffer size: 1e6. Target network updates: Polyak averaging at every iteration with τ = 0.005. Batch size: 256. Optimizer: Adam with a learning rate of 3e-4 and default values for β. Data collection: one environment transition is collected per gradient step. A configuration sketch for the RL setup appears after the table. |
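
The Pseudocode row above lists Algorithms 1-3. As a rough illustration of Algorithm 1 (Monte Carlo C-learning), the sketch below trains a classifier with a cross-entropy loss to distinguish future states, sampled from the same trajectory at a Geometric(1 − γ) time offset, from states drawn at random from the dataset. The trajectory format, sampling helpers, and single-hidden-layer classifier are assumptions for illustration, not the authors' code.

```python
# Minimal sketch of Monte Carlo C-learning, assuming trajectories are dicts
# with "obs" (T x obs_dim) and "act" (T x act_dim) NumPy arrays.
import numpy as np
import tensorflow as tf

GAMMA = 0.99  # discount used for the geometric future-state sampling

# Classifier C(s, a, s_f): probability that s_f is a (discounted) future state of (s, a).
classifier = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu"),  # hidden size is an assumption
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
optimizer = tf.keras.optimizers.Adam(3e-4)
bce = tf.keras.losses.BinaryCrossentropy()


def sample_batch(trajectories, batch_size=256):
    """Half the batch pairs (s_t, a_t) with a future state from the same trajectory
    (label 1); the other half pairs it with a randomly drawn state (label 0)."""
    obs, act, fut, labels = [], [], [], []
    for _ in range(batch_size):
        traj = trajectories[np.random.randint(len(trajectories))]
        t = np.random.randint(len(traj["obs"]) - 1)
        obs.append(traj["obs"][t])
        act.append(traj["act"][t])
        if np.random.rand() < 0.5:
            # Positive: future state at offset dt ~ Geometric(1 - GAMMA),
            # clipped at the trajectory end for simplicity.
            dt = np.random.geometric(1.0 - GAMMA)
            fut.append(traj["obs"][min(t + dt, len(traj["obs"]) - 1)])
            labels.append(1.0)
        else:
            # Negative: a state from a random trajectory and time step.
            other = trajectories[np.random.randint(len(trajectories))]
            fut.append(other["obs"][np.random.randint(len(other["obs"]))])
            labels.append(0.0)
    x = np.concatenate([obs, act, fut], axis=-1).astype(np.float32)
    return x, np.array(labels, dtype=np.float32)


def train_step(trajectories):
    """One cross-entropy update of the classifier on a freshly sampled batch."""
    x, y = sample_batch(trajectories)
    with tf.GradientTape() as tape:
        loss = bce(y, tf.squeeze(classifier(x), axis=-1))
    grads = tape.gradient(loss, classifier.trainable_variables)
    optimizer.apply_gradients(zip(grads, classifier.trainable_variables))
    return float(loss)
```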
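
For the Dataset Splits row, a minimal sketch of the 80/20 trajectory split and the reported evaluation (average MSE against the empirical expected future state) is given below. The function names, the `predict_expected_future` callable, and the data formats are hypothetical.

```python
import numpy as np


def split_trajectories(trajectories, train_frac=0.8, seed=0):
    """Shuffle trajectories and split them into train (80%) and test (20%) sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(trajectories))
    n_train = int(train_frac * len(trajectories))
    return ([trajectories[i] for i in idx[:n_train]],
            [trajectories[i] for i in idx[n_train:]])


def average_mse(predict_expected_future, eval_pairs, empirical_expectations):
    """Average MSE between a model's predicted expected future state and the
    empirical expected future state, over sampled (state, action) pairs."""
    errors = [np.mean((predict_expected_future(s, a) - mu) ** 2)
              for (s, a), mu in zip(eval_pairs, empirical_expectations)]
    return float(np.mean(errors))
```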
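
For the Experiment Setup row, the snippet below reconstructs the reported actor/critic architecture and RL hyperparameters in plain TensorFlow/Keras as an illustrative sketch. The observation, action, and goal dimensions and the sigmoid critic output are assumptions, and the authors' actual implementation is built on TF-Agents rather than this standalone code.

```python
import tensorflow as tf

OBS_DIM, ACT_DIM, GOAL_DIM = 17, 6, 17  # assumed, task-dependent dimensions


def mlp(input_dim, output_dim, output_activation=None):
    """Two fully-connected layers of size 256 with ReLU, as reported above."""
    net = tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(output_dim, activation=output_activation),
    ])
    net.build(input_shape=(None, input_dim))
    return net


actor = mlp(OBS_DIM + GOAL_DIM, ACT_DIM, output_activation="tanh")
critic = mlp(OBS_DIM + ACT_DIM + GOAL_DIM, 1, output_activation="sigmoid")
target_critic = mlp(OBS_DIM + ACT_DIM + GOAL_DIM, 1, output_activation="sigmoid")

actor_optimizer = tf.keras.optimizers.Adam(3e-4)
critic_optimizer = tf.keras.optimizers.Adam(3e-4)
BATCH_SIZE = 256
REPLAY_BUFFER_SIZE = int(1e6)
TAU = 0.005  # Polyak averaging coefficient, applied at every iteration


def polyak_update(target, online, tau=TAU):
    """Soft-update target-network weights toward the online network."""
    for t_var, o_var in zip(target.variables, online.variables):
        t_var.assign(tau * o_var + (1.0 - tau) * t_var)
```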