Large Language Models can Implement Policy Iteration

Authors: Ethan Brooks, Logan Walls, Richard L. Lewis, Satinder Singh

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We illustrate the method empirically on six small illustrative RL tasks (chain, distractor-chain, maze, mini-catch, mini-invaders, and point-mass) in which the method very quickly finds good policies. We also compare five pretrained Large Language Models (LLMs)...
Researcher Affiliation | Academia | Ethan Brooks (1), Logan Walls (2), Richard L. Lewis (2), Satinder Singh (1); (1) Computer Science and Engineering, University of Michigan; (2) Department of Psychology, University of Michigan
Pseudocode | Yes | Algorithm 1 (Training Loop) and Algorithm 2 (Computing Q-values) are provided in pseudocode form; a sketch of how the two fit together appears after this table.
Open Source Code | Yes | Code for our implementation is available at https://github.com/ethanabrooks/icpi.
Open Datasets | Yes | GPT-J (B. Wang et al. 2021), 6 billion parameters, trained on The Pile (Leo Gao et al. 2020), an 825GB English corpus incl. Wikipedia, GitHub, academic pubs; OPT-30B (Zhang et al. 2022), 30 billion parameters, trained on 180B tokens of predominantly English data including The Pile (Leo Gao et al. 2020) and Pushshift.io Reddit (Baumgartner et al. 2020).
Dataset Splits | No | The paper describes training in the context of reinforcement learning (iterative policy improvement) and does not refer to explicit train/validation/test dataset splits as commonly found in supervised learning experiments.
Hardware Specification | Yes | each running on one Nvidia A40 GPU.
Software Dependencies | No | For GPT-J (B. Wang et al. 2021), InCoder (Fried et al. 2022) and OPT-30B (Zhang et al. 2022), we used the open-source implementations from Huggingface Transformers (Wolf et al. 2020). Specific version numbers for the software libraries used (e.g., Huggingface Transformers, PyTorch, TensorFlow) are not provided.
Experiment Setup | Yes | c = 8 (the number of most recent successful trajectories to include in the prompt). All language models use a sampling temperature of 0.1. All results use 4 seeds. Tabular Q is a standard tabular Q-learning algorithm, which uses a learning rate of 1.0 and optimistically initializes the Q-values to 1.0 (a sketch of this baseline follows the table). PPO hyperparameters searched: Number of Hidden Layers {1, 2}; Hidden Size {256, 512, 1024}; Actor Learning Rate {0.001, 0.002, 0.005}; Critic Learning Rate {0.0001, 0.0005, 0.001}.
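
The Pseudocode row above refers to the paper's Algorithm 1 (Training Loop) and Algorithm 2 (Computing Q-values). Below is a minimal sketch of how a loop of that shape could be organized. The helpers `build_prompt` and `llm_rollout_step`, the environment interface (`reset`, `step`, `num_actions`), the discount factor, the rollout horizon, and the success criterion are all illustrative assumptions; only the greedy choice over per-action Q-estimates and the use of the c most recent successful trajectories in the prompt come from the paper's description.

```python
def compute_q(state, action, prompt, llm_rollout_step, gamma=0.99, horizon=8):
    """Estimate Q(state, action) by simulating a rollout in which the LLM
    serves as both world model and rollout policy (Algorithm 2's role).
    gamma and horizon are assumed values, not taken from the paper."""
    q_value, discount = 0.0, 1.0
    for _ in range(horizon):
        # Hypothetical helper: conditioned on the prompt of past trajectories,
        # the LLM predicts the reward, termination flag, next state, and
        # next action for this simulated step.
        reward, done, state, action = llm_rollout_step(prompt, state, action)
        q_value += discount * reward
        discount *= gamma
        if done:
            break
    return q_value


def training_loop(env, build_prompt, llm_rollout_step, num_episodes=50, c=8):
    """Sketch of Algorithm 1's role: act greedily with respect to the
    LLM-estimated Q-values and grow the pool of prompt trajectories."""
    successful_trajectories = []  # only the c most recent go into the prompt
    for _ in range(num_episodes):
        state, done, trajectory = env.reset(), False, []
        while not done:
            prompt = build_prompt(successful_trajectories[-c:])
            # Greedy action choice over per-action Q-value estimates.
            action = max(
                range(env.num_actions),
                key=lambda a: compute_q(state, a, prompt, llm_rollout_step),
            )
            next_state, reward, done = env.step(action)
            trajectory.append((state, action, reward, next_state, done))
            state = next_state
        # Assumed success criterion: positive return for the episode.
        if sum(step[2] for step in trajectory) > 0:
            successful_trajectories.append(trajectory)
    return successful_trajectories
```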
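
The Tabular Q baseline in the Experiment Setup row is described only by its learning rate (1.0) and optimistic initialization of Q-values to 1.0. A minimal sketch of such a baseline is shown below; the epsilon-greedy exploration, discount factor, and environment interface are assumptions, not details reported in the table.

```python
from collections import defaultdict
import random


def tabular_q_learning(env, num_episodes=100, gamma=0.99,
                       learning_rate=1.0, optimistic_init=1.0, epsilon=0.1):
    """Tabular Q-learning with learning rate 1.0 and Q-values optimistically
    initialized to 1.0, as stated above; other settings are assumed."""
    q = defaultdict(lambda: optimistic_init)  # unseen (s, a) pairs start at 1.0

    def greedy_action(state):
        return max(range(env.num_actions), key=lambda a: q[(state, a)])

    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy behaviour policy (exploration scheme assumed).
            if random.random() < epsilon:
                action = random.randrange(env.num_actions)
            else:
                action = greedy_action(state)
            next_state, reward, done = env.step(action)
            # With learning_rate = 1.0 the update fully replaces the old
            # estimate with the bootstrapped target.
            target = reward if done else reward + gamma * max(
                q[(next_state, a)] for a in range(env.num_actions))
            q[(state, action)] += learning_rate * (target - q[(state, action)])
            state = next_state
    return q
```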