reproducibilityindex.ai

Beyond Optimism: Exploration With Partially Observable Rewards

Authors: Simone Parisi, Alireza Kazemipour, Michael Bowling

NeurIPS 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We further propose a collection of tabular environments for benchmarking exploration in RL (with and without unobservable rewards) and show that our method outperforms existing ones.
Researcher Affiliation	Academia	Simone Parisi University of Alberta; Amii parisi@ualberta.ca; Alireza Kazemipour University of Alberta kazemipo@ualberta.ca; Michael Bowling University of Alberta; Amii mbowling@ualberta.ca
Pseudocode	Yes	Algorithm 1: Directed Exploration-Exploitation
Open Source Code	Yes	Source code at https://github.com/AmiiThinks/mon_mdp_neurips24.
Open Datasets	Yes	We validate our exploration strategy on tabular MDPs (Figure 4) characterized by different challenges, e.g., sparse rewards, distracting rewards, stochastic transitions. For each MDP, we propose the following Mon-MDP versions of increasing difficulty.
Dataset Splits	No	The paper does not explicitly mention using a separate validation set. It describes testing the greedy policies at regular intervals during training.
Hardware Specification	Yes	We ran our experiments on a SLURM-based cluster, using 32 Intel E5-2683 v4 Broadwell @ 2.1GHz CPUs.
Software Dependencies	No	The paper does not provide specific version numbers for software dependencies such as programming languages, libraries, or frameworks used in the experiments.
Experiment Setup	Yes	For all algorithms, γ = 0.99 and ϵt starts at 1 and linearly decays to 0. The schedule αt depends on the environment: constant 0.5 for Hazard and Two-Room (3×5) (because of the quicksand cell), linear decay from 0.5 to 0.05 in River Swim (because of the stochastic transition), and constant 1 otherwise. For the Random Experts Monitor we linearly decay the learning rate to 0.1 in all environments (0.05 in River Swim) because of the random monitor state.
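
The schedules quoted in the Experiment Setup row can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the `total_steps` parameter, and the environment-name strings are hypothetical, and the starting learning rate under the Random Experts Monitor is an assumption (the quoted text gives only the final value), taken here to be the environment's usual initial value.

```python
GAMMA = 0.99  # discount factor, as quoted in the table


def linear_schedule(start, end, total_steps):
    """Return f(t): linear interpolation from start to end, held at end afterwards."""
    def value(t):
        frac = min(max(t / total_steps, 0.0), 1.0)
        return (1.0 - frac) * start + frac * end
    return value


def make_schedules(env_name, total_steps, random_experts_monitor=False):
    """Build (epsilon, alpha) schedules for one environment (hypothetical helper)."""
    # Exploration rate: starts at 1 and decays linearly to 0.
    epsilon = linear_schedule(1.0, 0.0, total_steps)

    # Initial learning rate per environment, following the quoted setup.
    if env_name in ("Hazard", "Two-Room (3×5)", "River Swim"):
        alpha_start = 0.5
    else:
        alpha_start = 1.0

    if random_experts_monitor:
        # Quoted: decay to 0.1 in all environments (0.05 in River Swim).
        # ASSUMPTION: the decay starts from alpha_start; the source does not say.
        alpha_end = 0.05 if env_name == "River Swim" else 0.1
    elif env_name == "River Swim":
        alpha_end = 0.05  # linear decay from 0.5 to 0.05
    else:
        alpha_end = alpha_start  # constant schedule

    alpha = linear_schedule(alpha_start, alpha_end, total_steps)
    return epsilon, alpha


if __name__ == "__main__":
    eps, alpha = make_schedules("River Swim", total_steps=1000)
    for t in (0, 500, 1000):
        print(t, eps(t), alpha(t))
```

Representing a constant schedule as a linear schedule with equal endpoints keeps a single code path for every environment.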