First-Explore, then Exploit: Meta-Learning to Solve Hard Exploration-Exploitation Trade-Offs

Authors: Ben Norman, Jeff Clune

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Standard reinforcement learning (RL) agents never intelligently explore like a human (i.e., taking into account complex domain priors and adapting quickly based on previous exploration). Across episodes, RL agents struggle to perform even simple exploration strategies, for example a systematic search that avoids exploring the same location multiple times. This poor exploration limits performance on challenging domains. Meta-RL is a potential solution, as unlike standard RL, meta-RL can learn to explore, and potentially learn highly complex strategies far beyond those of standard RL, such as experimenting in early episodes to learn new skills, or conducting experiments to learn about the current environment. Traditional meta-RL focuses on the problem of learning to optimally balance exploration and exploitation to maximize the cumulative reward of the episode sequence (e.g., aiming to maximize the total wins in a tournament while also improving as a player). We identify a new challenge with state-of-the-art cumulative-reward meta-RL methods: when optimal behavior requires exploration that sacrifices immediate reward to enable higher subsequent reward, these methods become stuck on the local optimum of failing to explore. Our method, First-Explore, overcomes this limitation by learning two policies: one to solely explore, and one to solely exploit. When exploring requires forgoing early-episode reward, First-Explore significantly outperforms existing cumulative-reward meta-RL methods. By identifying and solving the previously unrecognized problem of forgoing reward in early episodes, First-Explore represents a significant step towards developing meta-RL algorithms capable of human-like exploration on a broader range of domains. On all three evaluation domains, (a) the meta-RL controls perform poorly, and (b) First-Explore significantly outperforms the controls. Further, two of the domains have modified versions that do not require forgoing immediate reward, and this change significantly improves control-policy performance.
Researcher Affiliation | Collaboration | Ben Norman (1,2), btnorman@cs.ubc.ca; Jeff Clune (1,2,3), jclune@gmail.com. Affiliations: (1) Department of Computer Science, University of British Columbia; (2) Vector Institute; (3) Canada CIFAR AI Chair.
Pseudocode | Yes | Appendix B ("Training Pseudocode") provides training pseudocode, beginning with def rollout(env, π, ψ, cπ, cψ): (a hedged sketch of this rollout appears after the table).
Open Source Code | Yes | To ensure full replicability, we are releasing the code used to train First-Explore and the controls, along with the environments trained on. We are also releasing the weights of a trained model for each domain. Each model contains both the explore and exploit policies as separate heads on a shared trunk (a sketch of this two-head arrangement appears after the table). The code is available at https://github.com/btnorman/First-Explore.
Open Datasets | No | The paper uses custom-generated environments (e.g., Bandits with One Fixed Arm, Dark Treasure Rooms, Ray Maze) rather than pre-existing publicly available datasets. The environments are described in Appendix C.
Dataset Splits | No | No explicit mention of specific training, validation, and test dataset splits with percentages or sample counts in the traditional sense, as the paper generates environments on the fly rather than using static datasets.
Hardware Specification | Yes | Each training run commanded a single GPU, specifically an Nvidia T4, and up to 8 CPU cores.
Software Dependencies | No | The architecture for both domains is a GPT-2 transformer [13], specifically the JAX-framework [26] implementation provided by Hugging Face [27], with the code modified so that token embeddings can be passed rather than token IDs. For training we use AdamW [28] with a piece-wise linear warm-up schedule... (an optimizer-schedule sketch appears after the table).
Experiment Setup | Yes | Table 8: Optimization Hyperparameters (lists Batch Size, Optimizer, Weight Decay, Learning Rate). Table 9: Training Rollout Hyperparameters (lists Exploit Sampling Temperature, Explore Sampling Temperature, Policy Update Frequency, ϵ chance of random action selection, Baseline Reward, Training Updates).
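
Rollout sketch. The following is a minimal, hedged sketch of the rollout routine quoted from Appendix B (def rollout(env, π, ψ, cπ, cψ)), written from the paper's high-level description rather than from the released code. It assumes π is the explore policy, ψ the exploit policy, and cπ / cψ their episode-history contexts; the env/policy interfaces, the strict explore-then-exploit alternation, and n_episodes are illustrative stand-ins, not the authors' implementation.

# Sketch only: stand-in interfaces, not the released First-Explore code.
def run_episode(env, policy, context):
    """Roll one episode with `policy` conditioned on `context`; return the episode and its return."""
    obs, episode, total_reward, done = env.reset(), [], 0.0, False
    while not done:
        action = policy.act(context, episode, obs)   # in-context (meta-learned) action selection
        next_obs, reward, done, _ = env.step(action)
        episode.append((obs, action, reward))
        total_reward += reward
        obs = next_obs
    return episode, total_reward


def rollout(env, explore_policy, exploit_policy, explore_ctx, exploit_ctx, n_episodes=8):
    """Explore episodes grow both contexts; exploit episodes are scored on their own return."""
    exploit_returns = []
    for _ in range(n_episodes):
        # Explore episode: gathered purely to inform later behaviour; its own reward is not the objective.
        explore_ep, _ = run_episode(env, explore_policy, explore_ctx)
        explore_ctx = explore_ctx + [explore_ep]
        exploit_ctx = exploit_ctx + [explore_ep]
        # Exploit episode: conditioned on everything explored so far, judged on the reward it earns.
        _, exploit_ret = run_episode(env, exploit_policy, exploit_ctx)
        exploit_returns.append(exploit_ret)
    return explore_ctx, exploit_ctx, exploit_returns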
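
Architecture sketch. The released models are described as holding the explore and exploit policies as separate heads on a shared trunk. The snippet below sketches only that two-head arrangement in Flax; the placeholder MLP Trunk stands in for the Hugging Face Flax GPT-2 backbone (which the authors modified to accept token embeddings), and all module names, sizes, and the action count are assumptions, not the released architecture.

# Sketch only: Trunk is a placeholder for the modified Hugging Face Flax GPT-2 backbone.
import jax
import jax.numpy as jnp
import flax.linen as nn

class Trunk(nn.Module):
    hidden: int = 128

    @nn.compact
    def __call__(self, x):
        # Placeholder feature extractor over per-timestep context embeddings.
        return nn.relu(nn.Dense(self.hidden)(x))

class FirstExploreModel(nn.Module):  # hypothetical name
    n_actions: int = 4

    @nn.compact
    def __call__(self, embeddings):
        h = Trunk()(embeddings)                                         # shared trunk
        explore_logits = nn.Dense(self.n_actions, name="explore_head")(h)  # explore policy head
        exploit_logits = nn.Dense(self.n_actions, name="exploit_head")(h)  # exploit policy head
        return explore_logits, exploit_logits

# Example initialisation with a dummy (batch, context length, embedding dim) input.
model = FirstExploreModel()
params = model.init(jax.random.PRNGKey(0), jnp.zeros((1, 16, 32)))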
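
Optimizer sketch. The paper names AdamW with a piece-wise linear warm-up schedule, but the actual Table 8/9 values are not reproduced in this summary. Below is a minimal sketch of one way to express such a schedule, assuming Optax (the paper only names JAX and AdamW); every numeric value is a placeholder, not a setting from the paper.

# Sketch only: placeholder values, Optax assumed as the optimizer library.
import optax

warmup_steps = 1_000      # placeholder
peak_lr = 3e-4            # placeholder

schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(init_value=0.0, end_value=peak_lr,
                              transition_steps=warmup_steps),  # linear warm-up segment
        optax.constant_schedule(peak_lr),                      # hold after warm-up
    ],
    boundaries=[warmup_steps],
)

optimizer = optax.adamw(learning_rate=schedule, weight_decay=0.01)  # weight decay is a placeholder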