C-Learning: Horizon-Aware Cumulative Accessibility Estimation
Authors: Panteha Naderian, Gabriel Loaiza-Ganem, Harry J. Braviner, Anthony L. Caterini, Jesse C. Cresswell, Tong Li, Animesh Garg
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach on a set of multi-goal discrete and continuous control tasks. We show that our method outperforms state-of-the-art goal-reaching algorithms in success rate, sample complexity, and path optimality. |
| Researcher Affiliation | Collaboration | Panteha Naderian, Gabriel Loaiza-Ganem, Harry J. Braviner, Anthony L. Caterini, Jesse C. Cresswell & Tong Li (Layer 6 AI) {panteha, gabriel, harry, anthony, jesse, tong}@layer6.ai; Animesh Garg (University of Toronto, Vector Institute, Nvidia) garg@cs.toronto.edu |
| Pseudocode | Yes | Algorithm 1: Training C-learning Network |
| Open Source Code | Yes | Our code is available at https://github.com/layer6ai-labs/CAE |
| Open Datasets | Yes | 3. FetchPickAndPlace-v1 (Brockman et al., 2016) is a complex, higher-dimensional environment in which a robotic arm needs to pick up a block and move it to the goal location... 4. HandManipulatePenFull-v0 (Brockman et al., 2016) is a realistic environment known to be a difficult goal-reaching problem... (see the environment-loading sketch below the table) |
| Dataset Splits | No | The paper mentions training and testing but does not explicitly provide details about a validation dataset split or percentages. |
| Hardware Specification | No | The paper does not provide specific hardware details such as CPU/GPU models or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers, such as Python versions or library versions. |
| Experiment Setup | Yes | For all methods, we train for 300 episodes, each of maximal length 50 steps; we use a learning rate of 10⁻³, a batch size of 256, and 64 gradient steps per episode. We use an ε-greedy behavior policy with ε = 0.1. We use a neural network with two hidden layers of sizes 60 and 40 with ReLU activations. We use 15 fully random exploration episodes before we start training. We take p(s0) as uniform among non-hole states during training, and set it as a point mass at (1, 0) for testing. We set p(g) as uniform among states during training, and we evaluate at every goal during testing. For C-learning, we use κ = 3 and copy the target network every 10 steps. (See the configuration sketch after the table.) |
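The environments quoted in the Open Datasets row are standard OpenAI Gym robotics tasks. Below is a minimal loading sketch, assuming a pre-0.26 `gym` release with the MuJoCo-backed robotics suite installed (the paper's excerpts do not pin library versions):

```python
import gym

# Goal-conditioned robotics environments cited in the Open Datasets row.
# Assumes the MuJoCo-based gym robotics environments are installed.
for env_id in ["FetchPickAndPlace-v1", "HandManipulatePenFull-v0"]:
    env = gym.make(env_id)
    obs = env.reset()  # older gym API: reset() returns the observation dict directly
    # Observations are dicts with 'observation', 'achieved_goal', 'desired_goal'.
    print(env_id, obs["observation"].shape, obs["desired_goal"].shape)
    env.close()
```

With newer `gym`/`gymnasium` releases the environment IDs and reset signature differ, so this snippet is tied to the 2021-era API the paper would have used.

The Experiment Setup row pins down most of the reported hyperparameters. The sketch below collects them into a PyTorch configuration; the optimizer choice (Adam), the input/output dimensions, and the framework itself are assumptions not confirmed by the quoted text, and the C-learning loss is omitted since only the name of Algorithm 1 is quoted:

```python
import torch
import torch.nn as nn

# Hyperparameters quoted in the Experiment Setup row.
LEARNING_RATE = 1e-3
BATCH_SIZE = 256
GRAD_STEPS_PER_EPISODE = 64
EPISODES = 300
MAX_EPISODE_LEN = 50
EPSILON = 0.1            # epsilon-greedy behavior policy
KAPPA = 3                # C-learning kappa parameter
TARGET_COPY_EVERY = 10   # hard target-network copy interval (steps)

def make_network(input_dim: int, output_dim: int) -> nn.Module:
    """Two hidden layers of sizes 60 and 40 with ReLU activations, as quoted."""
    return nn.Sequential(
        nn.Linear(input_dim, 60), nn.ReLU(),
        nn.Linear(60, 40), nn.ReLU(),
        nn.Linear(40, output_dim),
    )

# Dimensions below are placeholders for illustration only.
c_net = make_network(input_dim=8, output_dim=4)
target_net = make_network(input_dim=8, output_dim=4)
target_net.load_state_dict(c_net.state_dict())  # hard copy, repeated every TARGET_COPY_EVERY steps

optimizer = torch.optim.Adam(c_net.parameters(), lr=LEARNING_RATE)  # Adam is an assumption
```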