Chain of Thought Imitation with Procedure Cloning
Authors: Mengjiao (Sherry) Yang, Dale Schuurmans, Pieter Abbeel, Ofir Nachum
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through empirical analysis on navigation, simulated robotic manipulation, and game-playing environments, we show that imitating the intermediate computations of an expert's behavior enables procedure cloning to learn policies exhibiting significant generalization to unseen environment configurations, including those configurations for which running the expert's procedure directly is infeasible. |
| Researcher Affiliation | Collaboration | Mengjiao Yang, Google Brain, UC Berkeley, sherryy@google.com; Dale Schuurmans, Google Brain, University of Alberta, schuurmans@google.com; Pieter Abbeel, UC Berkeley, pabbeel@cs.berkeley.edu; Ofir Nachum, Google Brain, ofirnachum@google.com |
| Pseudocode | Yes | See pseudocode for collecting procedure observations in Appendix A.2. |
| Open Source Code | Yes | https://github.com/google-research/google-research/tree/master/procedure_cloning. |
| Open Datasets | Yes | We adopt the Ant Maze environment from D4RL [82]; MinAtar is a miniature version of the Atari Arcade Learning Environment [85]. We generate a set of mazes S_0 ⊂ S and split S_0 into disjoint training S_0^train and testing S_0^test sets. We then generate expert trajectories by running BFS on only the training set of mazes S_0^train. |
| Dataset Splits | No | The paper mentions 'early stopping' which implies the use of a validation set, but it does not provide explicit details about the split (e.g., percentages or counts) of the validation data. |
| Hardware Specification | Yes | We use 4 Nvidia V100 GPUs for each experiment unless otherwise specified. |
| Software Dependencies | No | The paper mentions 'We use JAX [87] for implementation.' but does not specify the version number for JAX or any other software dependencies, which is required for reproducibility. |
| Experiment Setup | Yes | BC policies are parametrized with convolutional neural networks (CNN) and multi-layer perceptrons (MLPs) (see hyperparameters in Appendix A.5). |
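The maze setup described in the Open Datasets row (generate mazes, split them into disjoint train/test sets, then produce expert trajectories by running BFS only on the training mazes) can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the maze size, wall density, and start/goal positions are arbitrary assumptions, and the paper's actual pipeline additionally records the BFS intermediate computations as procedure observations.

```python
import random
from collections import deque

def make_maze(size, wall_prob, rng):
    """Random grid maze: 0 = free cell, 1 = wall. Start/goal kept free.
    (Hypothetical generator; the paper's maze construction may differ.)"""
    grid = [[1 if rng.random() < wall_prob else 0 for _ in range(size)]
            for _ in range(size)]
    grid[0][0] = 0
    grid[size - 1][size - 1] = 0
    return grid

def bfs_path(grid, start, goal):
    """Breadth-first search; returns the shortest start->goal path, or None."""
    size = len(grid)
    prev = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            # Reconstruct the path by walking predecessor links backward.
            path, node = [], goal
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < size and 0 <= nc < size
                    and grid[nr][nc] == 0 and (nr, nc) not in prev):
                prev[(nr, nc)] = (r, c)
                queue.append((nr, nc))
    return None  # goal unreachable in this maze

rng = random.Random(0)
mazes = [make_maze(8, 0.2, rng) for _ in range(100)]
rng.shuffle(mazes)

# Disjoint train/test split over maze configurations (80/20 is an assumption).
train_mazes, test_mazes = mazes[:80], mazes[80:]

# Expert trajectories come from running BFS on the *training* mazes only;
# test mazes are held out to measure generalization to unseen configurations.
expert_trajectories = []
for maze in train_mazes:
    path = bfs_path(maze, (0, 0), (7, 7))
    if path is not None:  # skip mazes where the goal is unreachable
        expert_trajectories.append(path)
```

The key point the sketch illustrates is that the train/test sets are disjoint at the level of maze configurations, so a cloned policy evaluated on `test_mazes` has never seen those layouts during training.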