Beyond Optimism: Exploration With Partially Observable Rewards
Authors: Simone Parisi, Alireza Kazemipour, Michael Bowling
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We further propose a collection of tabular environments for benchmarking exploration in RL (with and without unobservable rewards) and show that our method outperforms existing ones. |
| Researcher Affiliation | Academia | Simone Parisi University of Alberta; Amii parisi@ualberta.ca; Alireza Kazemipour University of Alberta kazemipo@ualberta.ca; Michael Bowling University of Alberta; Amii mbowling@ualberta.ca |
| Pseudocode | Yes | Algorithm 1: Directed Exploration-Exploitation |
| Open Source Code | Yes | Source code at https://github.com/AmiiThinks/mon_mdp_neurips24. |
| Open Datasets | Yes | We validate our exploration strategy on tabular MDPs (Figure 4) characterized by different challenges, e.g., sparse rewards, distracting rewards, stochastic transitions. For each MDP, we propose the following Mon-MDP versions of increasing difficulty. |
| Dataset Splits | No | The paper does not explicitly mention using a separate validation set. It describes testing the greedy policies at regular intervals during training. |
| Hardware Specification | Yes | We ran our experiments on a SLURM-based cluster, using 32 Intel E5-2683 v4 Broadwell @ 2.1GHz CPUs. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as programming languages, libraries, or frameworks used in the experiments. |
| Experiment Setup | Yes | For all algorithms, γ = 0.99, and ε_t starts at 1 and decays linearly to 0. The learning-rate schedule α_t depends on the environment: constant 0.5 for Hazard and Two-Room (3×5) (because of the quicksand cell), linear decay from 0.5 to 0.05 in River Swim (because of the stochastic transitions), and constant 1 otherwise. For the Random Experts monitor, we linearly decay the learning rate to 0.1 in all environments (0.05 in River Swim) because of the random monitor state. |
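The linear schedules reported in the setup row (ε from 1 to 0, and α from 0.5 to 0.05 for River Swim) can be sketched with a small helper. This is a minimal illustration, not the authors' code; the `total_steps` horizon and the hold-at-end behavior are assumptions not specified in the table.

```python
def linear_decay(start: float, end: float, step: int, total_steps: int) -> float:
    """Linearly interpolate from `start` to `end` over `total_steps` steps,
    then hold at `end` (a common convention; assumed here, not stated in the paper)."""
    frac = min(step / total_steps, 1.0)
    return start + frac * (end - start)

# Exploration rate: eps_t starts at 1 and linearly decays to 0.
eps_schedule = [linear_decay(1.0, 0.0, t, 100) for t in range(101)]

# Learning rate for River Swim: linear decay from 0.5 to 0.05.
alpha_schedule = [linear_decay(0.5, 0.05, t, 100) for t in range(101)]
```

For environments with a constant schedule (e.g., α = 1), the same call with `start == end` returns the constant value at every step.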