C-Learning: Learning to Achieve Goals via Recursive Classification

Authors: Benjamin Eysenbach, Ruslan Salakhutdinov, Sergey Levine

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that C-learning more accurately estimates the density over future states, while remaining competitive with recent goal-conditioned RL methods across a suite of simulated robotic tasks.
Researcher Affiliation | Collaboration | Benjamin Eysenbach (CMU, Google Brain, beysenba@cs.cmu.edu); Ruslan Salakhutdinov (CMU); Sergey Levine (UC Berkeley, Google Brain)
Pseudocode | Yes | Algorithm 1: Monte Carlo C-learning (page 4); Algorithm 2: Off-Policy C-learning (page 5); Algorithm 3: Goal-Conditioned C-learning (page 5). A sketch of the Monte Carlo variant is given after the table.
Open Source Code | Yes | Project website with videos and code: https://ben-eysenbach.github.io/c_learning/
Open Datasets | Yes | We collected a dataset of experience from agents pretrained to solve three locomotion tasks from OpenAI Gym. We used the expert data provided for each task in Fu et al. (2020).
Dataset Splits | Yes | We split these trajectories into train (80%) and test (20%) splits. We randomly sampled 1000 state-action pairs from the validation set and computed the average MSE with the empirical expected future state. A sketch of this evaluation is given after the table.
Hardware Specification | No | The paper does not provide specific details on the hardware used to run the experiments, such as GPU models, CPU specifications, or memory.
Software Dependencies | No | The paper states that "Our implementation of C-learning is based on the TD3 implementation in Guadarrama et al. (2018)", i.e. TF-Agents, which implies the use of TensorFlow, but no version numbers are provided for TensorFlow or any other software components.
Experiment Setup | Yes | Each of the algorithms used a 2-layer neural network with a hidden layer of size 32, optimized for 1000 iterations using the Adam optimizer with a learning rate of 3e-3 and a batch size of 256. All methods (MC C-learning, TD C-learning, and the 1-step dynamics model) used the same architecture (one hidden layer of size 256 with ReLU activation). Actor network: 2 fully-connected layers of size 256 with ReLU activations. Critic network: 2 fully-connected layers of size 256 with ReLU activations. Replay buffer size: 1e6. Target network updates: Polyak averaging at every iteration with τ = 0.005. Batch size: 256. Optimizer: Adam with a learning rate of 3e-4 and default values for β. Data collection: one transition is collected per gradient step. A configuration sketch is given after the table.
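
For the pseudocode row, the following is a minimal sketch of the Monte Carlo variant (Algorithm 1), assuming the classifier C(s, a, s_f) is trained with cross-entropy to distinguish future states, sampled at geometrically distributed offsets Δ ~ Geom(1 - γ) along the same trajectory, from states drawn from the marginal over all data. The network, dimensions, and sampling helpers are hypothetical stand-ins written in PyTorch, not the authors' TF-Agents implementation.

# Hypothetical sketch of Monte Carlo C-learning (Algorithm 1): train a
# classifier C(s, a, s_f) to distinguish discounted-future states (label 1)
# from states sampled from the marginal distribution (label 0).
import numpy as np
import torch
import torch.nn as nn

GAMMA = 0.99
OBS_DIM, ACT_DIM = 17, 6                     # placeholder dimensions

classifier = nn.Sequential(                  # C(s, a, s_f) -> probability
    nn.Linear(2 * OBS_DIM + ACT_DIM, 256), nn.ReLU(),
    nn.Linear(256, 1), nn.Sigmoid(),
)
opt = torch.optim.Adam(classifier.parameters(), lr=3e-4)

def mc_c_learning_step(trajectories, batch_size=256):
    """One gradient step; `trajectories` is a list of (states, actions) arrays."""
    s, a, s_pos, s_neg = [], [], [], []
    for _ in range(batch_size):
        states, actions = trajectories[np.random.randint(len(trajectories))]
        t = np.random.randint(len(states) - 1)
        # Positive: a state Delta steps ahead, Delta ~ Geom(1 - gamma),
        # truncated at the end of the trajectory.
        delta = min(np.random.geometric(1 - GAMMA), len(states) - 1 - t)
        # Negative: a state sampled uniformly from another trajectory.
        other, _ = trajectories[np.random.randint(len(trajectories))]
        s.append(states[t]); a.append(actions[t])
        s_pos.append(states[t + delta])
        s_neg.append(other[np.random.randint(len(other))])
    to_tensor = lambda x: torch.as_tensor(np.array(x), dtype=torch.float32)
    s, a, s_pos, s_neg = map(to_tensor, (s, a, s_pos, s_neg))
    p_pos = classifier(torch.cat([s, a, s_pos], dim=-1))
    p_neg = classifier(torch.cat([s, a, s_neg], dim=-1))
    # Cross-entropy: future states are positives, marginal states negatives.
    loss = -(torch.log(p_pos) + torch.log(1.0 - p_neg)).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return float(loss)

At convergence the classifier's odds C / (1 - C) approximate the ratio between the discounted future-state density and the marginal, which is how the paper uses it to estimate the density over future states.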
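
The dataset-splits row describes an 80%/20% trajectory split and an evaluation that averages the MSE against the empirical expected future state over 1000 held-out state-action pairs. The sketch below shows one way to implement that bookkeeping; the predict_fn interface and the use of a truncated, renormalized discounted average as the empirical expectation are assumptions for illustration, not details taken from the paper.

# Sketch of the 80/20 trajectory split and held-out MSE evaluation.
# The "empirical expected future state" is computed here as a normalized
# discounted average of the remaining states in the trajectory (an assumption).
import numpy as np

GAMMA = 0.99
rng = np.random.default_rng(0)

def split_trajectories(trajectories, train_frac=0.8):
    idx = rng.permutation(len(trajectories))
    n_train = int(train_frac * len(trajectories))
    return ([trajectories[i] for i in idx[:n_train]],
            [trajectories[i] for i in idx[n_train:]])

def empirical_expected_future_state(states, t, gamma=GAMMA):
    deltas = np.arange(1, len(states) - t)
    weights = (1 - gamma) * gamma ** (deltas - 1)
    weights /= weights.sum()                      # renormalize for truncation
    return (weights[:, None] * states[t + deltas]).sum(axis=0)

def eval_mse(predict_fn, test_trajectories, num_pairs=1000):
    errors = []
    for _ in range(num_pairs):
        states, actions = test_trajectories[rng.integers(len(test_trajectories))]
        t = rng.integers(len(states) - 1)
        target = empirical_expected_future_state(states, t)
        pred = predict_fn(states[t], actions[t])  # model's expected future state
        errors.append(np.mean((pred - target) ** 2))
    return float(np.mean(errors))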
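
Finally, the hyperparameters in the experiment-setup row map onto a goal-conditioned actor/critic pair roughly as follows. This is written in PyTorch for brevity rather than the authors' TF-Agents/TD3 code; only the settings stated in the row (layer sizes, ReLU activations, Adam with learning rate 3e-4, batch size 256, replay buffer 1e6, τ = 0.005) are encoded, and the observation/goal/action dimensions are placeholders.

# Sketch of the reported actor/critic configuration, using PyTorch as a
# stand-in for the authors' TF-Agents/TD3 implementation.
import copy
import torch
import torch.nn as nn

OBS_DIM, GOAL_DIM, ACT_DIM = 17, 17, 6      # placeholder dimensions
TAU = 0.005                                 # Polyak averaging coefficient
BATCH_SIZE = 256
REPLAY_BUFFER_SIZE = int(1e6)

def mlp(in_dim, out_dim, hidden=256):
    # Two fully-connected layers of size 256 with ReLU activations.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

actor = mlp(OBS_DIM + GOAL_DIM, ACT_DIM)                    # pi(s, g) -> a
critic = mlp(OBS_DIM + ACT_DIM + GOAL_DIM, 1)               # classifier logits
target_critic = copy.deepcopy(critic)

actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)   # default betas
critic_opt = torch.optim.Adam(critic.parameters(), lr=3e-4)

@torch.no_grad()
def polyak_update(online, target, tau=TAU):
    # Target network update applied at every iteration.
    for p, p_targ in zip(online.parameters(), target.parameters()):
        p_targ.mul_(1 - tau).add_(tau * p)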