Large Language Models can Implement Policy Iteration

Authors: Ethan Brooks, Logan Walls, Richard L. Lewis, Satinder Singh

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We illustrate the method empirically on six small illustrative RL tasks (chain, distractor-chain, maze, mini-catch, mini-invaders, and point-mass) in which the method very quickly finds good policies. We also compare five pretrained Large Language Models (LLMs)...
Researcher Affiliation | Academia | Ethan Brooks (1), Logan Walls (2), Richard L. Lewis (2), Satinder Singh (1); (1) Computer Science and Engineering, University of Michigan; (2) Department of Psychology, University of Michigan
Pseudocode | Yes | Algorithm 1 (Training Loop) and Algorithm 2 (Computing Q-values) are provided in pseudocode form; a sketch of how the two fit together appears after this table.
Open Source Code | Yes | Code for our implementation is available at https://github.com/ethanabrooks/icpi.
Open Datasets | Yes | GPT-J (B. Wang et al. 2021), 6 billion parameters, trained on The Pile (Leo Gao et al. 2020), an 825GB English corpus incl. Wikipedia, GitHub, academic pubs; OPT-30B (Zhang et al. 2022), 30 billion parameters, trained on 180B tokens of predominantly English data including The Pile (Leo Gao et al. 2020) and Pushshift.io Reddit (Baumgartner et al. 2020).
Dataset Splits | No | The paper describes training in the context of reinforcement learning (iterative policy improvement) and does not refer to explicit train/validation/test dataset splits as commonly found in supervised learning experiments.
Hardware Specification | Yes | each running on one Nvidia A40 GPU.
Software Dependencies | No | For GPT-J (B. Wang et al. 2021), InCoder (Fried et al. 2022) and OPT-30B (Zhang et al. 2022), we used the open-source implementations from Huggingface Transformers (Wolf et al. 2020). Specific version numbers for the software libraries used (e.g., Huggingface Transformers, PyTorch, TensorFlow) are not provided.
Experiment Setup | Yes | c = 8 (the number of most recent successful trajectories to include in the prompt). All language models use a sampling temperature of 0.1. All results use 4 seeds. Tabular Q is a standard tabular Q-learning algorithm, which uses a learning rate of 1.0 and optimistically initializes the Q-values to 1.0 (a sketch of this baseline follows the table). PPO hyperparameters searched: Number of Hidden Layers {1, 2}; Hidden Size {256, 512, 1024}; Actor Learning Rate {0.001, 0.002, 0.005}; Critic Learning Rate {0.0001, 0.0005, 0.001}.
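
The Pseudocode row above refers to the paper's Algorithm 1 (Training Loop) and Algorithm 2 (Computing Q-values). Below is a minimal sketch of how a loop of that shape could be organized. The helpers `build_prompt` and `llm_rollout_step`, the environment interface (`reset`, `step`, `num_actions`), the discount factor, the rollout horizon, and the success criterion are all illustrative assumptions; only the greedy choice over per-action Q-estimates and the use of the c most recent successful trajectories in the prompt come from the paper's description.

```python
def compute_q(state, action, prompt, llm_rollout_step, gamma=0.99, horizon=8):
    """Estimate Q(state, action) by simulating a rollout in which the LLM
    serves as both world model and rollout policy (Algorithm 2's role).
    gamma and horizon are assumed values, not taken from the paper."""
    q_value, discount = 0.0, 1.0
    for _ in range(horizon):
        # Hypothetical helper: conditioned on the prompt of past trajectories,
        # the LLM predicts the reward, termination flag, next state, and
        # next action for this simulated step.
        reward, done, state, action = llm_rollout_step(prompt, state, action)
        q_value += discount * reward
        discount *= gamma
        if done:
            break
    return q_value


def training_loop(env, build_prompt, llm_rollout_step, num_episodes=50, c=8):
    """Sketch of Algorithm 1's role: act greedily with respect to the
    LLM-estimated Q-values and grow the pool of prompt trajectories."""
    successful_trajectories = []  # only the c most recent go into the prompt
    for _ in range(num_episodes):
        state, done, trajectory = env.reset(), False, []
        while not done:
            prompt = build_prompt(successful_trajectories[-c:])
            # Greedy action choice over per-action Q-value estimates.
            action = max(
                range(env.num_actions),
                key=lambda a: compute_q(state, a, prompt, llm_rollout_step),
            )
            next_state, reward, done = env.step(action)
            trajectory.append((state, action, reward, next_state, done))
            state = next_state
        # Assumed success criterion: positive return for the episode.
        if sum(step[2] for step in trajectory) > 0:
            successful_trajectories.append(trajectory)
    return successful_trajectories
```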
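
The Tabular Q baseline in the Experiment Setup row is described only by its learning rate (1.0) and optimistic initialization of Q-values to 1.0. A minimal sketch of such a baseline is shown below; the epsilon-greedy exploration, discount factor, and environment interface are assumptions, not details reported in the table.

```python
from collections import defaultdict
import random


def tabular_q_learning(env, num_episodes=100, gamma=0.99,
                       learning_rate=1.0, optimistic_init=1.0, epsilon=0.1):
    """Tabular Q-learning with learning rate 1.0 and Q-values optimistically
    initialized to 1.0, as stated above; other settings are assumed."""
    q = defaultdict(lambda: optimistic_init)  # unseen (s, a) pairs start at 1.0

    def greedy_action(state):
        return max(range(env.num_actions), key=lambda a: q[(state, a)])

    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy behaviour policy (exploration scheme assumed).
            if random.random() < epsilon:
                action = random.randrange(env.num_actions)
            else:
                action = greedy_action(state)
            next_state, reward, done = env.step(action)
            # With learning_rate = 1.0 the update fully replaces the old
            # estimate with the bootstrapped target.
            target = reward if done else reward + gamma * max(
                q[(next_state, a)] for a in range(env.num_actions))
            q[(state, action)] += learning_rate * (target - q[(state, action)])
            state = next_state
    return q
```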