Large Language Models can Implement Policy Iteration
Authors: Ethan Brooks, Logan Walls, Richard L Lewis, Satinder Singh
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We illustrate the method empirically on six small illustrative RL tasks (chain, distractor-chain, maze, mini-catch, mini-invaders, and point-mass) in which the method very quickly finds good policies. We also compare five pretrained Large Language Models (LLMs)... |
| Researcher Affiliation | Academia | Ethan Brooks¹, Logan Walls², Richard L. Lewis², Satinder Singh¹; ¹Computer Science and Engineering, University of Michigan; ²Department of Psychology, University of Michigan |
| Pseudocode | Yes | Algorithm 2 (Computing Q-values) and Algorithm 1 (Training Loop) are provided in pseudocode format; a hedged sketch of this loop appears below the table. |
| Open Source Code | Yes | Code for our implementation is available at https://github.com/ethanabrooks/icpi. |
| Open Datasets | Yes | GPT-J (B. Wang et al. 2021), 6 billion parameters, trained on The Pile (Leo Gao et al. 2020), an 825GB English corpus including Wikipedia, GitHub, and academic publications; OPT-30B (Zhang et al. 2022), 30 billion parameters, trained on 180B tokens of predominantly English data including The Pile (Leo Gao et al. 2020) and Pushshift.io Reddit (Baumgartner et al. 2020). |
| Dataset Splits | No | The paper describes training in the context of reinforcement learning (iterative policy improvement) and does not refer to explicit train/validation/test dataset splits as commonly found in supervised learning experiments. |
| Hardware Specification | Yes | each running on one Nvidia A40 GPU. |
| Software Dependencies | No | For GPT-J (B. Wang et al. 2021), InCoder (Fried et al. 2022) and OPT-30B (Zhang et al. 2022), we used the open-source implementations from Huggingface Transformers (Wolf et al. 2020). Specific version numbers for the software libraries used (e.g., Huggingface Transformers, PyTorch, TensorFlow) are not provided. |
| Experiment Setup | Yes | c = 8 (the number of most recent successful trajectories to include in the prompt). All language models use a sampling temperature of 0.1. All results use 4 seeds. Tabular Q is a standard tabular Q-learning algorithm, which uses a learning rate of 1.0 and optimistically initializes the Q-values to 1.0. PPO hyperparameters: number of hidden layers {1, 2}; hidden size {256, 512, 1024}; actor learning rate {0.001, 0.002, 0.005}; critic learning rate {0.0001, 0.0005, 0.001}. |
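
The Pseudocode and Experiment Setup rows above outline the method: at each environment step, Q-values are estimated from LLM-simulated rollouts conditioned on the c = 8 most recent successful trajectories, and the agent acts greedily with respect to those estimates. Below is a minimal sketch of that loop under explicit assumptions: the `llm_complete` stub, the trajectory/prompt text format, `parse_transition`, and the `env` interface are hypothetical placeholders invented for illustration, and `GAMMA` and `MAX_STEPS` are illustrative constants not taken from the paper. The authors' actual implementation is the repository linked in the Open Source Code row.

```python
# Hedged sketch of the loop summarized by Algorithm 1 (training loop) and
# Algorithm 2 (computing Q-values). Not the authors' implementation; see
# https://github.com/ethanabrooks/icpi for the real code.

GAMMA = 0.8        # discount factor (illustrative value, not from the paper)
CONTEXT_SIZE = 8   # c = 8 most recent successful trajectories (per the paper)
MAX_STEPS = 16     # cap on simulated rollout length (assumption)


def llm_complete(prompt: str) -> str:
    """Stub for a completion call to a pretrained LLM (e.g. GPT-J, Codex)."""
    raise NotImplementedError("plug an LLM completion API in here")


def build_prompt(successes: list[str], fragment: str) -> str:
    """Prepend the c most recent successful trajectories to a query fragment."""
    return "\n".join(successes[-CONTEXT_SIZE:] + [fragment])


def parse_transition(reply: str):
    """Parse the assumed completion format 'reward <r>, done <0|1>, next <state>'."""
    reward_part, done_part, next_part = (p.strip() for p in reply.split(","))
    reward = float(reward_part.split()[-1])
    done = bool(int(done_part.split()[-1]))
    next_state = next_part.split(maxsplit=1)[-1]
    return reward, done, next_state


def rollout_q(successes: list[str], state: str, action: str) -> float:
    """Estimate Q(s, a): the LLM acts as world model and rollout policy,
    and the discounted return of the simulated trajectory is the estimate."""
    ret, discount = 0.0, 1.0
    for _ in range(MAX_STEPS):
        reply = llm_complete(build_prompt(successes, f"{state} {action} ->"))
        reward, done, state = parse_transition(reply)
        ret += discount * reward
        if done:
            break
        discount *= GAMMA
        action = llm_complete(build_prompt(successes, f"{state} action:")).strip()
    return ret


def run_episode(env, actions: list[str], successes: list[str]) -> None:
    """One episode of the training loop: act greedily with respect to the
    rollout Q-values, then keep the trajectory only if it was successful."""
    state, done, lines, total = env.reset(), False, [], 0.0
    while not done:
        action = max(actions, key=lambda a: rollout_q(successes, state, a))
        next_state, reward, done = env.step(action)
        lines.append(f"{state} {action} -> reward {reward}, next {next_state}")
        total += reward
        state = next_state
    if total > 0:  # crude success criterion (assumption)
        successes.append("\n".join(lines))
```

In this sketch, the LLM rollouts play the role of policy evaluation and the greedy action choice plays the role of policy improvement, which is the sense in which the prompting scheme implements policy iteration.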