Anti-Concentrated Confidence Bonuses For Scalable Exploration
Authors: Jordan T. Ash, Cyril Zhang, Surbhi Goel, Akshay Krishnamurthy, Sham M. Kakade
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We develop a practical variant for deep reinforcement learning that is competitive with contemporary intrinsic reward heuristics on Atari benchmarks. Our algorithm is both theoretically principled and computationally tractable, and we demonstrate that its empirical performance on Atari games is often competitive with popular baselines. |
| Researcher Affiliation | Collaboration | Jordan T. Ash (Microsoft Research NYC); Cyril Zhang (Microsoft Research NYC); Surbhi Goel (Microsoft Research NYC); Akshay Krishnamurthy (Microsoft Research NYC); Sham Kakade (Microsoft Research NYC and Harvard University) |
| Pseudocode | Yes | Algorithm 1: ACB for linear bandits; Algorithm 2: ACB exploration for reinforcement learning; Algorithm 3: Lazy-ACB for fixed-action linear bandits (see the linear-bandit sketch after this table) |
| Open Source Code | No | See attached code for more details. (No concrete access link or explicit statement of public release) |
| Open Datasets | Yes | on a variety of Atari benchmarks (Figure 1). |
| Dataset Splits | No | No explicit training/validation/test dataset splits (percentages, sample counts, or predefined citations) were provided. |
| Hardware Specification | Yes | Compute resources per job included a single CPU and either a P100 or V100 NVIDIA GPU, with experiments each taking between three and five days. |
| Software Dependencies | Yes | All model updates are performed via Adam except for the ACB auxiliary weights, which use RMSprop. Policy optimization is done with PPO (Schulman et al., 2017). |
| Experiment Setup | Yes | In all experiments, we use 128 parallel agents and rollouts of 128 timesteps, making τ in Algorithm 2 equal to 128². In experiments shown here, α is fixed at 10⁻⁶. We use an ensemble of 128 auxiliary weights and regularize aggressively towards their initialization with λ = 10³. (See the configuration sketch after this table.) |
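
The pseudocode row names ACB for linear bandits (Algorithm 1). As a point of reference, below is a minimal NumPy sketch of that construction, assuming the standard ACB recipe: each auxiliary weight vector is a ridge-regression solution fit to independent Gaussian noise targets, and the bonus for a feature vector is the maximum inner product over the ensemble. Dimensions, the regularizer, and all variable names are illustrative, not taken from the authors' code.

```python
# Hedged sketch of an ACB-style bonus for linear bandits (cf. Algorithm 1).
# Assumption: auxiliary weights are ridge-regression solutions against i.i.d.
# Gaussian noise targets; the bonus is the maximum inner product over the ensemble.
import numpy as np

rng = np.random.default_rng(0)

d = 16     # feature dimension (illustrative)
m = 128    # ensemble size, matching the "128 auxiliary weights" reported above
lam = 1.0  # ridge regularizer (illustrative)

# Historical feature vectors observed so far (t x d).
X = rng.normal(size=(200, d))

# Each ensemble member regresses the same features onto fresh Gaussian noise targets.
A = lam * np.eye(d) + X.T @ X             # regularized empirical covariance
noise = rng.normal(size=(X.shape[0], m))  # independent noise targets per member
W = np.linalg.solve(A, X.T @ noise)       # (d x m) ridge solutions, one column per member

def acb_bonus(x):
    """Exploration bonus for a candidate feature vector x: max over the ensemble."""
    return np.max(W.T @ x)

# The ensemble maximum tracks the elliptical bonus ||x||_{A^{-1}} up to log(m) factors.
x = rng.normal(size=d)
elliptical = np.sqrt(x @ np.linalg.solve(A, x))
print(acb_bonus(x), elliptical)
```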
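The software-dependency and experiment-setup rows pin down the optimizer split (Adam for model updates, RMSprop for the ACB auxiliary weights), PPO for policy optimization, and the hyperparameters quoted above. The configuration sketch below shows one way those pieces might be wired together in PyTorch; network shapes, learning rates, and the bonus form (α times the maximum auxiliary prediction) are assumptions rather than details quoted in the table.

```python
# Hedged configuration sketch: only the optimizer split, PPO, and the quoted
# hyperparameters come from the table; everything else is a placeholder.
import torch
import torch.nn as nn

config = {
    "num_envs": 128,        # "128 parallel agents"
    "rollout_len": 128,     # "rollouts of 128 timesteps"
    "tau": 128 ** 2,        # transitions per update of Algorithm 2
    "alpha": 1e-6,          # intrinsic-bonus scale reported in the setup row
    "ensemble_size": 128,   # ACB auxiliary weights
    "aux_reg_lambda": 1e3,  # regularization toward the auxiliary weights' initialization
}

feature_dim = 512  # illustrative width of the shared encoder's features

policy_value_net = nn.Sequential(nn.Linear(feature_dim, 256), nn.ReLU(), nn.Linear(256, 1))
aux_weights = nn.Linear(feature_dim, config["ensemble_size"], bias=False)
aux_init = {k: v.clone().detach() for k, v in aux_weights.state_dict().items()}

# Adam for all model updates, RMSprop for the ACB auxiliary weights (per the paper);
# learning rates here are placeholders.
model_opt = torch.optim.Adam(policy_value_net.parameters(), lr=2.5e-4)
aux_opt = torch.optim.RMSprop(aux_weights.parameters(), lr=1e-3)

def aux_regularizer():
    """Penalty pulling the auxiliary weights back toward their initialization."""
    return config["aux_reg_lambda"] * sum(
        ((p - aux_init[name]) ** 2).sum() for name, p in aux_weights.named_parameters()
    )

def intrinsic_bonus(features):
    """Assumed ACB-style bonus: alpha times the max auxiliary prediction per state."""
    with torch.no_grad():
        return config["alpha"] * aux_weights(features).max(dim=-1).values
```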