Beyond Optimism: Exploration With Partially Observable Rewards

Authors: Simone Parisi, Alireza Kazemipour, Michael Bowling

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We further propose a collection of tabular environments for benchmarking exploration in RL (with and without unobservable rewards) and show that our method outperforms existing ones.
Researcher Affiliation | Academia | Simone Parisi (University of Alberta; Amii) parisi@ualberta.ca; Alireza Kazemipour (University of Alberta) kazemipo@ualberta.ca; Michael Bowling (University of Alberta; Amii) mbowling@ualberta.ca
Pseudocode | Yes | Algorithm 1: Directed Exploration-Exploitation (see the illustrative sketch after the table)
Open Source Code | Yes | Source code at https://github.com/AmiiThinks/mon_mdp_neurips24.
Open Datasets | Yes | We validate our exploration strategy on tabular MDPs (Figure 4) characterized by different challenges, e.g., sparse rewards, distracting rewards, stochastic transitions. For each MDP, we propose the following Mon-MDP versions of increasing difficulty.
Dataset Splits | No | The paper does not explicitly mention using a separate validation set; instead, it evaluates the greedy policy at regular intervals during training (see the evaluation sketch after the table).
Hardware Specification | Yes | We ran our experiments on a SLURM-based cluster, using 32 Intel E5-2683 v4 Broadwell @ 2.1GHz CPUs.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as programming languages, libraries, or frameworks used in the experiments.
Experiment Setup | Yes | For all algorithms, γ = 0.99 and ϵ_t starts at 1 and decays linearly to 0. The learning-rate schedule α_t depends on the environment: constant 0.5 for Hazard and Two-Room (3×5) (because of the quicksand cell), linear decay from 0.5 to 0.05 in River Swim (because of the stochastic transitions), and constant 1 otherwise. For the Random Experts Monitor the learning rate is decayed linearly to 0.1 in all environments (0.05 in River Swim) because of the random monitor state. (See the schedule sketch after the table.)
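
The Pseudocode row refers to Algorithm 1 (Directed Exploration-Exploitation); its exact steps are given in the paper and repository, not here. As a rough, non-authoritative sketch of the general directed exploration-exploitation pattern in a tabular setting, the snippet below pairs epsilon-greedy Q-learning with a count-based heuristic that, when exploring, takes the least-visited action instead of a uniformly random one. The toy chain environment, the count-based rule, and all constants are illustrative assumptions, not the authors' method.

```python
import numpy as np

# Hypothetical toy chain MDP used only to make the sketch runnable;
# it is NOT one of the paper's Mon-MDP environments.
N_STATES, N_ACTIONS = 5, 2
GOAL_REWARD = 1.0

def step(state, action, rng):
    """Deterministic chain: action 1 moves right, action 0 moves left."""
    next_state = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    reward = GOAL_REWARD if next_state == N_STATES - 1 else 0.0
    done = next_state == N_STATES - 1
    return next_state, reward, done

def train(episodes=200, gamma=0.99, seed=0):
    rng = np.random.default_rng(seed)
    q = np.zeros((N_STATES, N_ACTIONS))       # exploitation values
    visits = np.zeros((N_STATES, N_ACTIONS))  # visit counts drive the "directed" part
    for ep in range(episodes):
        eps = max(0.0, 1.0 - ep / episodes)   # epsilon decays linearly, as reported in the table
        alpha = 0.5                           # constant learning rate (one of the reported schedules)
        state, done, t = 0, False, 0
        while not done and t < 100:
            if rng.random() < eps:
                # Directed exploration: take the least-visited action in this state
                action = int(np.argmin(visits[state]))
            else:
                action = int(np.argmax(q[state]))
            next_state, reward, done = step(state, action, rng)
            visits[state, action] += 1
            target = reward + (0.0 if done else gamma * np.max(q[next_state]))
            q[state, action] += alpha * (target - q[state, action])
            state, t = next_state, t + 1
    return q, visits

if __name__ == "__main__":
    q, visits = train()
    print("Greedy policy:", np.argmax(q, axis=1))
```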
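
The Dataset Splits row notes that the paper evaluates greedy policies at regular intervals during training rather than on a held-out split. Below is a minimal sketch of such a periodic-evaluation protocol, reusing `np` and the toy `step` environment from the previous sketch; the evaluation interval, episode count, and horizon are hypothetical.

```python
def evaluate_greedy(q, n_episodes=10, horizon=100):
    """Run the current greedy policy with no exploration or learning and return the mean return."""
    returns = []
    for _ in range(n_episodes):
        state, done, total, t = 0, False, 0.0, 0
        while not done and t < horizon:
            action = int(np.argmax(q[state]))   # purely greedy: no epsilon, no updates
            state, reward, done = step(state, action, None)
            total += reward
            t += 1
        returns.append(total)
    return sum(returns) / len(returns)

# Hypothetical usage inside the training loop: evaluate every 50 episodes.
# if ep % 50 == 0:
#     print(f"episode {ep}: greedy return = {evaluate_greedy(q):.2f}")
```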
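
The Experiment Setup row lists the ϵ and α schedules. A small sketch of how such linear-decay schedules could be implemented is below; the `linear_schedule` helper, the `total_steps` budget, and the variable names are assumptions, while the start and end values mirror those reported in the row.

```python
def linear_schedule(start, end, duration):
    """Return a function mapping training step t to a value that decays linearly from start to end."""
    def value(t):
        frac = min(max(t / duration, 0.0), 1.0)
        return start + frac * (end - start)
    return value

# Schedules matching the values reported in the Experiment Setup row
# (total_steps is a placeholder; the paper's exact step budget is not restated here).
total_steps = 10_000
epsilon = linear_schedule(1.0, 0.0, total_steps)            # epsilon: 1 -> 0 for all algorithms
alpha_river_swim = linear_schedule(0.5, 0.05, total_steps)  # River Swim: 0.5 -> 0.05
alpha_hazard = lambda t: 0.5                                # constant 0.5 for Hazard / Two-Room (3x5)
alpha_default = lambda t: 1.0                               # constant 1 otherwise

print(epsilon(0), epsilon(total_steps // 2), epsilon(total_steps))
```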