Artificial Generational Intelligence: Cultural Accumulation in Reinforcement Learning

Authors: Jonathan Cook, Chris Lu, Edward Hughes, Joel Z. Leibo, Jakob Foerster

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of both the in-context and in-weights models by showing sustained generational performance gains on several tasks requiring exploration under partial observability. On each task, we find that accumulating agents outperform those that learn for a single lifetime of the same total experience budget.
Researcher Affiliation | Collaboration | Jonathan Cook (FLAIR, University of Oxford) jonathan.cook2@hertford.ox.ac.uk; Chris Lu (FLAIR, University of Oxford) christopher.lu@eng.ox.ac.uk; Edward Hughes (Google DeepMind) edwardhughes@google.com; Joel Z. Leibo (Google DeepMind) jzl@google.com; Jakob Foerster (FLAIR, University of Oxford) jakob@eng.ox.ac.uk
Pseudocode | Yes | Algorithm 1: Training Loop for In-Context Accumulation (changes to RL² in red); Algorithm 2: In-Context Accumulation During Evaluation. A hedged sketch of the evaluation-time accumulation loop is given after this table.
Open Source Code | Yes | Code can be found at https://github.com/FLAIROx/cultural-accumulation.
Open Datasets | No | The paper introduces custom environments (Memory Sequence, Goal Sequence, Travelling Salesperson), which are released as part of the open-source code, but it does not provide concrete access information for a pre-existing, static public dataset.
Dataset Splits | No | The paper describes training and testing on environment instances but does not explicitly describe a separate validation split.
Hardware Specification | Yes | Memory Sequence and TSP experiments were run on a single NVIDIA RTX A40 GPU (40GB memory)... Training of in-context learners in Goal Sequence was run in under 8 minutes on 4 A40s... In-weights accumulation in Goal Sequence was run in 30 minutes on 4 A40s.
Software Dependencies | No | The paper mentions software components like 'Pure Jax RL codebase', 'PPO', and 'S5' but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | Appendix F (Hyperparameters) specifies population size, learning rate, batch size, rollout length, update epochs, minibatches, γ, λ_GAE, ε_clip, entropy coefficient, value coefficient, max gradient norm, and learning-rate annealing for the Memory Sequence, TSP, and Goal Sequence experiments. An illustrative configuration sketch follows the table.
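
The pseudocode row above references Algorithm 2, In-Context Accumulation During Evaluation. The following is a minimal, hypothetical Python sketch of what such an evaluation loop could look like, inferred only from the algorithm title: each generation conditions on the previous generation's trajectory and its own rollout then becomes the context for the next generation. The function names (`demo_policy`, `run_episode`, `in_context_accumulation`) and the toy environment dynamics are illustrative placeholders, not the authors' implementation.

```python
import random

def demo_policy(obs, context):
    # Placeholder stand-in for a trained RL^2-style recurrent policy (assumption).
    return random.choice([0, 1])

def run_episode(policy, context, episode_len=10):
    # Roll out one episode while conditioning the policy on in-context history.
    trajectory = []
    obs = 0  # dummy observation for this sketch
    for _ in range(episode_len):
        action = policy(obs, context)
        obs = (obs + action) % 5  # dummy transition dynamics
        trajectory.append((obs, action))
    return trajectory

def in_context_accumulation(policy, num_generations=3):
    # Each generation observes the previous generation's trajectory in context,
    # acts, and then its own rollout becomes the context for the next generation.
    context = []  # the first generation starts with no demonstration
    for _ in range(num_generations):
        trajectory = run_episode(policy, context)
        context = trajectory
    return context

if __name__ == "__main__":
    print(in_context_accumulation(demo_policy))
```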
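The experiment-setup row lists the hyperparameter fields reported in Appendix F. The snippet below is only an illustrative sketch of such a configuration: the keys mirror the fields named above, but every value is a generic PPO placeholder, not the paper's reported setting.

```python
# Illustrative PPO-style configuration mirroring the hyperparameter fields
# listed in Appendix F; all values are placeholders, not the paper's settings.
ppo_config = {
    "population_size": 8,       # placeholder
    "learning_rate": 3e-4,      # placeholder
    "batch_size": 256,          # placeholder
    "rollout_length": 128,      # placeholder
    "update_epochs": 4,         # placeholder
    "num_minibatches": 4,       # placeholder
    "gamma": 0.99,              # discount factor (placeholder)
    "gae_lambda": 0.95,         # GAE lambda (placeholder)
    "clip_eps": 0.2,            # PPO clipping epsilon (placeholder)
    "ent_coef": 0.01,           # entropy coefficient (placeholder)
    "vf_coef": 0.5,             # value-loss coefficient (placeholder)
    "max_grad_norm": 0.5,       # gradient clipping (placeholder)
    "anneal_lr": True,          # anneal learning rate (placeholder)
}
```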