Artificial Generational Intelligence: Cultural Accumulation in Reinforcement Learning
Authors: Jonathan Cook, Chris Lu, Edward Hughes, Joel Z. Leibo, Jakob Foerster
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of both the in-context and in-weights models by showing sustained generational performance gains on several tasks requiring exploration under partial observability. On each task, we find that accumulating agents outperform those that learn for a single lifetime of the same total experience budget. |
| Researcher Affiliation | Collaboration | Jonathan Cook, FLAIR, University of Oxford (jonathan.cook2@hertford.ox.ac.uk); Chris Lu, FLAIR, University of Oxford (christopher.lu@eng.ox.ac.uk); Edward Hughes, Google DeepMind (edwardhughes@google.com); Joel Z. Leibo, Google DeepMind (jzl@google.com); Jakob Foerster, FLAIR, University of Oxford (jakob@eng.ox.ac.uk) |
| Pseudocode | Yes | Algorithm 1 Training Loop for In-Context Accumulation (changes to RL2 in red); Algorithm 2 In-Context Accumulation During Evaluation |
| Open Source Code | Yes | Code can be found at https://github.com/FLAIROx/cultural-accumulation. |
| Open Datasets | No | The paper introduces custom environments (Memory Sequence, Goal Sequence, Travelling Salesperson) which are released as part of their open-source code, but does not provide concrete access information for a pre-existing, static public dataset. |
| Dataset Splits | No | The paper describes training and testing on environment instances but does not explicitly provide details about a separate validation dataset split. |
| Hardware Specification | Yes | Memory Sequence and TSP experiments were run on a single NVIDIA RTX A40 GPU (40GB memory)... Training of in-context learners in Goal Sequence was run in under 8 minutes on 4 A40s... In-weights accumulation in Goal Sequence was run in 30 minutes on 4 A40s. |
| Software Dependencies | No | The paper mentions software components such as the PureJaxRL codebase, PPO, and S5, but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | Appendix F (Hyperparameters) specifies population size, learning rate, batch size, rollout length, update epochs, number of minibatches, discount γ, λ_GAE, clipping ε, entropy coefficient, value coefficient, max gradient norm, and learning-rate annealing for the Memory Sequence, TSP, and Goal Sequence experiments. |