Anti-Concentrated Confidence Bonuses For Scalable Exploration

Authors: Jordan T. Ash, Cyril Zhang, Surbhi Goel, Akshay Krishnamurthy, Sham M. Kakade

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We develop a practical variant for deep reinforcement learning that is competitive with contemporary intrinsic reward heuristics on Atari benchmarks. Our algorithm is both theoretically principled and computationally tractable, and we demonstrate that its empirical performance on Atari games is often competitive with popular baselines.
Researcher Affiliation | Collaboration | Jordan T. Ash (Microsoft Research NYC), Cyril Zhang (Microsoft Research NYC), Surbhi Goel (Microsoft Research NYC), Akshay Krishnamurthy (Microsoft Research NYC), Sham Kakade (Microsoft Research NYC; Harvard University)
Pseudocode | Yes | Algorithm 1: ACB for linear bandits; Algorithm 2: ACB exploration for reinforcement learning; Algorithm 3: Lazy-ACB for fixed-action linear bandits. (A hedged sketch of the bonus computation underlying Algorithm 1 is given after the table.)
Open Source Code | No | See attached code for more details. (No concrete access link or explicit statement of public release)
Open Datasets | Yes | on a variety of Atari benchmarks (Figure 1).
Dataset Splits | No | No explicit training/validation/test dataset splits (percentages, sample counts, or predefined citations) were provided.
Hardware Specification | Yes | Compute resources per job included a single CPU and either a P100 or V100 NVIDIA GPU, with experiments each taking between three and five days.
Software Dependencies | Yes | All model updates are performed via Adam except for the ACB auxiliary weights, which use RMSprop. Policy optimization is done with PPO (Schulman et al., 2017).
Experiment Setup | Yes | In all experiments, we use 128 parallel agents and rollouts of 128 timesteps, making τ in Algorithm 2 128². In experiments shown here, α is fixed at 10⁻⁶. We use an ensemble of 128 auxiliary weights and regularize aggressively towards their initialization with λ = 10³. (A configuration sketch combining this row and the Software Dependencies row follows the table.)
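
The paper's pseudocode is not reproduced on this page. As context for Algorithm 1, the sketch below illustrates the general anti-concentration idea the title refers to: the maximum absolute inner product between a feature vector x and an ensemble of Gaussian weight vectors with covariance Σ⁻¹ tracks the classical elliptical bonus ‖x‖_{Σ⁻¹}. This is a minimal NumPy illustration under that reading, not the paper's algorithm; the function name, ensemble size, and the √(2 log m) normalization are assumptions made here for the example.

```python
# Minimal sketch (not the paper's Algorithm 1): approximate the elliptical
# bonus ||x||_{Sigma^{-1}} by the max of |<w_i, x>| over Gaussian samples
# w_i ~ N(0, Sigma^{-1}); the max over m such samples concentrates around
# sqrt(2 log m) * ||x||_{Sigma^{-1}}.
import numpy as np

def ensemble_elliptical_bonus(x, Sigma_inv, m=128, rng=None):
    """Ensemble estimate of ||x||_{Sigma^{-1}} (illustrative only)."""
    rng = np.random.default_rng() if rng is None else rng
    L = np.linalg.cholesky(Sigma_inv)                 # Sigma_inv = L @ L.T
    W = rng.standard_normal((m, x.shape[0])) @ L.T    # rows ~ N(0, Sigma^{-1})
    return np.max(np.abs(W @ x)) / np.sqrt(2.0 * np.log(m))

# Quick numerical check against the exact bonus on random data.
rng = np.random.default_rng(0)
d = 16
A = rng.standard_normal((d, d))
Sigma = A @ A.T + np.eye(d)                           # a well-conditioned covariance
x = rng.standard_normal(d)
exact = np.sqrt(x @ np.linalg.solve(Sigma, x))
approx = ensemble_elliptical_bonus(x, np.linalg.inv(Sigma), m=128, rng=rng)
print(f"exact={exact:.3f}  approx={approx:.3f}")
```

With m = 128 the two printed values should be of the same order rather than identical; the point of the construction is that the ensemble maximum can stand in for the exact bonus without ever forming or inverting the full covariance at bonus-computation time.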
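
For concreteness, the Software Dependencies and Experiment Setup rows can be collected into a single configuration sketch. Only the quoted numbers (128 parallel agents, 128-step rollouts, τ = 128², α = 10⁻⁶, an ensemble of 128 auxiliary weights, λ = 10³) come from the paper text; the network shapes, learning rates, action count, and the exact form of the regularizer toward the initialization are assumptions for illustration.

```python
# Hedged configuration sketch. Numbers marked "reported" come from the quoted
# text; everything else (shapes, learning rates, regularizer form) is assumed.
import copy
import torch
import torch.nn as nn

NUM_ENVS, ROLLOUT_LEN = 128, 128        # reported: 128 parallel agents, 128-step rollouts
TAU = NUM_ENVS * ROLLOUT_LEN            # reported: tau in Algorithm 2 is 128^2
ALPHA = 1e-6                            # reported: alpha fixed in all experiments
ENSEMBLE_SIZE = 128                     # reported: 128 auxiliary ACB weights
LAMBDA_REG = 1e3                        # reported: aggressive pull toward the initialization

feat_dim = 512                          # assumed feature dimension
policy = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 18))  # assumed actor head
acb_heads = nn.Linear(feat_dim, ENSEMBLE_SIZE, bias=False)  # 128 auxiliary linear heads
acb_init = copy.deepcopy(acb_heads)                         # frozen copy of the initialization
for p in acb_init.parameters():
    p.requires_grad_(False)

# "All model updates are performed via Adam except for the ACB auxiliary
# weights, which use RMSprop."
policy_opt = torch.optim.Adam(policy.parameters(), lr=2.5e-4)    # lr assumed
acb_opt = torch.optim.RMSprop(acb_heads.parameters(), lr=1e-3)   # lr assumed

def acb_regularizer():
    """lambda * ||W - W_0||^2: one reading of 'regularize towards their initialization'."""
    return LAMBDA_REG * sum(
        ((p - p0) ** 2).sum()
        for p, p0 in zip(acb_heads.parameters(), acb_init.parameters())
    )
```

Policy optimization itself is done with PPO (Schulman et al., 2017), which would supply the gradients stepped by `policy_opt` above; that loop is omitted here.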