Anti-Concentrated Confidence Bonuses For Scalable Exploration
Authors: Jordan T. Ash, Cyril Zhang, Surbhi Goel, Akshay Krishnamurthy, Sham M. Kakade
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We develop a practical variant for deep reinforcement learning that is competitive with contemporary intrinsic reward heuristics on Atari benchmarks. Our algorithm is both theoretically principled and computationally tractable, and we demonstrate that its empirical performance on Atari games is often competitive with popular baselines. |
| Researcher Affiliation | Collaboration | Jordan T. Ash (Microsoft Research NYC); Cyril Zhang (Microsoft Research NYC); Surbhi Goel (Microsoft Research NYC); Akshay Krishnamurthy (Microsoft Research NYC); Sham Kakade (Microsoft Research NYC and Harvard University) |
| Pseudocode | Yes | Algorithm 1: ACB for linear bandits; Algorithm 2: ACB exploration for reinforcement learning; Algorithm 3: Lazy-ACB for fixed-action linear bandits (see the linear-bandit sketch after this table) |
| Open Source Code | No | See attached code for more details. (No concrete access link or explicit statement of public release) |
| Open Datasets | Yes | on a variety of Atari benchmarks (Figure 1). |
| Dataset Splits | No | No explicit training/validation/test dataset splits (percentages, sample counts, or predefined citations) were provided. |
| Hardware Specification | Yes | Compute resources per job included a single CPU and either a P100 or V100 NVIDIA GPU, with experiments each taking between three and five days. |
| Software Dependencies | Yes | All model updates are performed via Adam except for the ACB auxiliary weights, which use RMSprop. Policy optimization is done with PPO (Schulman et al., 2017). |
| Experiment Setup | Yes | In all experiments, we use 128 parallel agents and rollouts of 128 timesteps, making τ in Algorithm 2 equal to 128². In experiments shown here, α is fixed at 10⁻⁶. We use an ensemble of 128 auxiliary weights and regularize aggressively towards their initialization with λ = 10³. (See the configuration sketch after this table.) |
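
The pseudocode row names ACB for linear bandits (Algorithm 1). As a point of reference, below is a minimal NumPy sketch of that construction, assuming the standard ACB recipe: each auxiliary weight vector is a ridge-regression solution fit to independent Gaussian noise targets, and the bonus for a feature vector is the maximum inner product over the ensemble. Dimensions, the regularizer, and all variable names are illustrative, not taken from the authors' code.

```python
# Hedged sketch of an ACB-style bonus for linear bandits (cf. Algorithm 1).
# Assumption: auxiliary weights are ridge-regression solutions against i.i.d.
# Gaussian noise targets; the bonus is the maximum inner product over the ensemble.
import numpy as np

rng = np.random.default_rng(0)

d = 16     # feature dimension (illustrative)
m = 128    # ensemble size, matching the "128 auxiliary weights" reported above
lam = 1.0  # ridge regularizer (illustrative)

# Historical feature vectors observed so far (t x d).
X = rng.normal(size=(200, d))

# Each ensemble member regresses the same features onto fresh Gaussian noise targets.
A = lam * np.eye(d) + X.T @ X             # regularized empirical covariance
noise = rng.normal(size=(X.shape[0], m))  # independent noise targets per member
W = np.linalg.solve(A, X.T @ noise)       # (d x m) ridge solutions, one column per member

def acb_bonus(x):
    """Exploration bonus for a candidate feature vector x: max over the ensemble."""
    return np.max(W.T @ x)

# The ensemble maximum tracks the elliptical bonus ||x||_{A^{-1}} up to log(m) factors.
x = rng.normal(size=d)
elliptical = np.sqrt(x @ np.linalg.solve(A, x))
print(acb_bonus(x), elliptical)
```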
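The software-dependency and experiment-setup rows pin down the optimizer split (Adam for model updates, RMSprop for the ACB auxiliary weights), PPO for policy optimization, and the hyperparameters quoted above. The configuration sketch below shows one way those pieces might be wired together in PyTorch; network shapes, learning rates, and the bonus form (α times the maximum auxiliary prediction) are assumptions rather than details quoted in the table.

```python
# Hedged configuration sketch: only the optimizer split, PPO, and the quoted
# hyperparameters come from the table; everything else is a placeholder.
import torch
import torch.nn as nn

config = {
    "num_envs": 128,        # "128 parallel agents"
    "rollout_len": 128,     # "rollouts of 128 timesteps"
    "tau": 128 ** 2,        # transitions per update of Algorithm 2
    "alpha": 1e-6,          # intrinsic-bonus scale reported in the setup row
    "ensemble_size": 128,   # ACB auxiliary weights
    "aux_reg_lambda": 1e3,  # regularization toward the auxiliary weights' initialization
}

feature_dim = 512  # illustrative width of the shared encoder's features

policy_value_net = nn.Sequential(nn.Linear(feature_dim, 256), nn.ReLU(), nn.Linear(256, 1))
aux_weights = nn.Linear(feature_dim, config["ensemble_size"], bias=False)
aux_init = {k: v.clone().detach() for k, v in aux_weights.state_dict().items()}

# Adam for all model updates, RMSprop for the ACB auxiliary weights (per the paper);
# learning rates here are placeholders.
model_opt = torch.optim.Adam(policy_value_net.parameters(), lr=2.5e-4)
aux_opt = torch.optim.RMSprop(aux_weights.parameters(), lr=1e-3)

def aux_regularizer():
    """Penalty pulling the auxiliary weights back toward their initialization."""
    return config["aux_reg_lambda"] * sum(
        ((p - aux_init[name]) ** 2).sum() for name, p in aux_weights.named_parameters()
    )

def intrinsic_bonus(features):
    """Assumed ACB-style bonus: alpha times the max auxiliary prediction per state."""
    with torch.no_grad():
        return config["alpha"] * aux_weights(features).max(dim=-1).values
```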