Context-dependent upper-confidence bounds for directed exploration
Authors: Raksha Kumaraswamy, Matthew Schlegel, Adam White, Martha White
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate that our algorithm can converge more quickly than other incremental exploration strategies using confidence estimates on action-values. We demonstrate in several simulated domains that UCLS outperforms DGPQ, UCBootstrap, and RLSVI. Our experiments were intentionally conducted in small though carefully selected simulation domains so that we could conduct extensive parameter sweeps, hundreds of runs for averaging, and compare numerous state-of-the-art exploration algorithms. |
| Researcher Affiliation | Collaboration | Raksha Kumaraswamy¹, Matthew Schlegel¹, Adam White¹,², Martha White¹; ¹Department of Computing Science, University of Alberta; ²DeepMind |
| Pseudocode | Yes | The complete pseudocode for UCLS is given in the Appendix (Algorithm 2). |
| Open Source Code | No | The paper does not explicitly state that the source code for their methodology is released or provide a link to it. |
| Open Datasets | Yes | Sparse Mountain Car is a version of the classic mountain car problem of Sutton and Barto [40]... River Swim is a standard continuing exploration benchmark [42]... |
| Dataset Splits | No | The paper describes the environments and experimental budget (e.g., 50,000 steps, episode cutoff), but does not specify dataset splits (e.g., train/validation/test percentages or counts) as typically found in supervised learning tasks. |
| Hardware Specification | No | The paper mentions "Calcul Québec (www.calculquebec.ca) and Compute Canada (www.computecanada.ca) for the computing resources used in this work," but does not specify any particular hardware models (e.g., GPU/CPU types, memory). |
| Software Dependencies | No | The paper mentions DGPQ uses a kernel-based representation and refers to algorithms like Sarsa, but does not provide specific version numbers for any software libraries, frameworks, or languages used. |
| Experiment Setup | Yes | Our primary concern is early learning performance, thus each experiment is restricted to 50,000 steps, with an episode cutoff (in Sparse Mountain Car and Puddle World) at 10,000 steps. For all the algorithms that utilize eligibility traces we set λ to be 0.9. For algorithms which use exponential averaging, β is set to 0.001, and the regularizer is set to be 0.0001. The parameters for UCLS are fixed. All the algorithms except DGPQ use the same representation: (1) Sparse Mountain Car 8 tilings of 8x8, hashed to a memory space of 512, (2) River Swim 4 tilings of granularity 32, hashed to a memory space of 128, and (3) Puddle World 5 tilings of granularity 5x5, hashed to a memory space of 128. |
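
For readers re-implementing this setup, the snippet below gathers the reported settings (step budget, episode cutoff, λ, β, regularizer, and the per-domain tile-coding representations) into one place. It is a minimal sketch: the dictionary layout and key names are hypothetical conveniences, and only the numeric values come from the experiment-setup description quoted above.

```python
# Hypothetical configuration sketch. Only the numeric values are taken from the
# paper's experiment-setup description; the structure and names are illustrative.

EXPERIMENT = {
    "max_steps": 50_000,       # total interaction budget per run
    "episode_cutoff": 10_000,  # applies to Sparse Mountain Car and Puddle World
    "lambda": 0.9,             # eligibility-trace decay for trace-based methods
    "beta": 0.001,             # exponential-averaging rate, where applicable
    "regularizer": 0.0001,
}

# Tile-coding representation shared by all algorithms except DGPQ.
REPRESENTATION = {
    "sparse_mountain_car": {"tilings": 8, "tiles_per_dim": (8, 8), "memory": 512},
    "river_swim":          {"tilings": 4, "tiles_per_dim": (32,),  "memory": 128},
    "puddle_world":        {"tilings": 5, "tiles_per_dim": (5, 5), "memory": 128},
}

if __name__ == "__main__":
    for domain, rep in REPRESENTATION.items():
        print(f"{domain}: representation={rep}, experiment={EXPERIMENT}")
```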