Thought Cloning: Learning to Think while Acting by Imitating Human Thinking
Authors: Shengran Hu, Jeff Clune
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | here we conduct experiments in a domain where the thinking and action data are synthetically generated. Results reveal that Thought Cloning learns much faster than Behavioral Cloning and its performance advantage grows the further out of distribution test tasks are, highlighting its ability to better handle novel situations. Our experimental results illustrate that Thought Cloning outperforms Behavioral Cloning, even when Behavioral Cloning agents have the ability to think (in latent vectors), but have to learn that skill without the supervision of thinking provided by Thought Cloning. We also demonstrate that Thought Cloning generalizes better than Behavioral Cloning in out-of-distribution tasks in both zero-shot and fine-tuning settings. Finally, we provide empirical evidence for the previously discussed advantages of Thought Cloning in terms of Safety and Interpretability, where unsafe behavior can be near perfectly stopped before execution. |
| Researcher Affiliation | Academia | Shengran Hu1,2 srhu@cs.ubc.ca Jeff Clune1,2,3 jclune@gmail.com 1Department of Computer Science, University of British Columbia 2Vector Institute 3Canada CIFAR AI Chair |
| Pseudocode | Yes | Algorithm 1 Thought Cloning |
| Open Source Code | Yes | The source code, model weights, and dataset are available at https://github.com/ShengranHu/Thought-Cloning. |
| Open Datasets | Yes | The source code, model weights, and dataset are available at https://github.com/ShengranHu/Thought-Cloning. and This paper employs BabyAI [26], a simulated partially observable 2D gridworld domain. |
| Dataset Splits | No | The training iterates for 8 epochs on the 1 million episode dataset, corresponding to a total of 7 × 10^8 training frames. and In our experiments, the performance of agents is evaluated based on their success rate in held-out test environments. The paper mentions training and testing but does not specify a distinct validation split for hyperparameter tuning. |
| Hardware Specification | Yes | Producing all the main results in the paper took about ten A40 GPUs for one week. |
| Software Dependencies | No | Mixed precision in PyTorch is also adopted during training, which speeds up training without sacrificing much performance [61]. The paper mentions PyTorch but no specific version. |
| Experiment Setup | Yes | The training iterates for 8 epochs on the 1 million episode dataset, corresponding to a total of 7 × 10^8 training frames. The Thought Cloning loss parameter α (Eq. 1) is set to 2. During training, we employ teacher-forcing [35], which is adopted when decoding thoughts. It conditions the Action Generator on the ground truth thoughts from the dataset. The teacher-forcing ratio linearly decreases from 100% to 0% during the training process. [...] The Adam optimizer [59] is adopted to train TC and the TC variant, with a batch size of 180 and a learning rate of 5e-4. Similar to the setting in the baselines [32, 26], we train BC with a batch size of 296 and a learning rate of 5e-5. The learning rate schedule begins with a warm-up phase of 5T training steps, where T = 51200, linearly increasing from 1e-4 to 5e-4 for every T steps, and then decaying by 50% at 120T training steps, similar to the practices in [26, 60]. The teacher-forcing ratio linearly decreases from 100% to 0% from the 10T training step to the end, for every T steps. In line 5 of Algorithm 1, the input thought could be the ground truth from the dataset (th_t) or the generated thought from the Thought Generator (ˆth_t), depending on whether teacher forcing is used. For training efficiency, Backpropagation Through Time was truncated at 20 steps in TC. Mixed precision in PyTorch is also adopted during training, which speeds up training without sacrificing much performance [61]. In fine-tuning experiments, due to the increased difficulty of the levels and longer episodes requiring more memory, we reduced the batch size from 180 to 40 and trained with an auto-regressive strategy. Detailed hyperparameter settings are shown in Table 1. |
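The learning-rate and teacher-forcing schedules quoted in the Experiment Setup row can be sketched as below. The constants (T = 51200, warm-up over 5T steps from 1e-4 to 5e-4, 50% decay at 120T, teacher forcing decaying from 10T to the end) come from the quoted text; the piecewise-constant update once per T steps and the `total_steps` parameter are assumptions, since the paper excerpt does not state the total number of training steps.

```python
T = 51_200  # schedule unit from the paper


def learning_rate(step: int) -> float:
    """Warm up from 1e-4 to 5e-4 over the first 5T steps (stepped once
    every T steps, an assumption), then decay by 50% at 120T steps."""
    if step < 5 * T:
        k = step // T  # 5 warm-up levels: k = 0..4
        return 1e-4 + (5e-4 - 1e-4) * (k / 4)
    if step < 120 * T:
        return 5e-4
    return 5e-4 * 0.5


def teacher_forcing_ratio(step: int, total_steps: int) -> float:
    """100% until the 10T-th step, then a linear decrease to 0% at the
    end of training, stepped once every T steps (an assumption)."""
    if step < 10 * T:
        return 1.0
    k = (step - 10 * T) // T
    n = max((total_steps - 10 * T) // T, 1)
    return max(1.0 - k / n, 0.0)
```

This is only a reconstruction of the schedule description for readers checking reproducibility; Table 1 of the paper remains the authoritative source for the hyperparameters.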