Belief Reward Shaping in Reinforcement Learning
Authors: Ofir Marom, Benjamin Rosman
AAAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments are run on a gridworld and a more complex backgammon domain, showing that tasks can be learned significantly faster when intuitive priors are specified on the reward distribution. Each algorithm is tested on a grid of parameter values: BRS(μ0, λ) is run for μ0 ∈ {0.5, 1, 5, 10, 100} and λ ∈ {100, 500, 1000, 5000, 10000}, while PBRS(μ0, μ1) and PBA(μ0, μ1, μ2) are run with each parameter taking values in {0.5, 1, 5, 10, 100}. Each algorithm and parameter setting is run over 1000 episodes and averaged over 50 independent runs; plots are then averaged over 10 consecutive points with error bars included (see the configuration sketch after this table). |
| Researcher Affiliation | Academia | Ofir Marom University of the Witwatersrand Johannesburg, South Africa Benjamin Rosman University of the Witwatersrand Johannesburg, South Africa, and Council for Scientific and Industrial Research Pretoria, South Africa |
| Pseudocode | Yes | Algorithm 1: Q-learning algorithm augmented with BRS for episodic tasks (a hedged sketch follows this table). |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | No | The paper describes experiments on a 'gridworld' and 'backgammon domain' and mentions using an 'identical setup to TDG0.0' for backgammon. However, it does not provide concrete access information (links, DOIs, specific citations with authors/year) for any publicly available or open datasets used for training. |
| Dataset Splits | No | The paper describes training procedures (e.g., 'train a baseline neural network (BL) over 50,000 games') and uses terms like 'episodes', but it does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning into train/validation/test sets. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions algorithms like 'Q-learning' and 'TD(λ) neural networks' but does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | For learning to solve this RL problem we use an ϵ-greedy policy that explores random actions with some probability ϵ and acts greedily otherwise. We use γ = 1, a constant learning rate α = 0.05, and ϵ = 0.1. We set p_jump = 0.2 and apply an early-termination criterion of 100 steps per episode. In our setup, the states for each of these concepts are grouped into a separate belief cluster and we use \|μ0\| = 0.5 and λ = 3000 for all belief clusters; the sign of μ0 depends on whether we want to incentivise or disincentivise the states in that cluster (see the configuration sketch after this table). |
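
The Pseudocode row above cites Algorithm 1 of the paper (Q-learning augmented with BRS for episodic tasks). The snippet below is a minimal illustrative sketch, not the authors' exact algorithm: it assumes a conjugate-normal belief per state cluster with prior mean μ0 and prior pseudo-count λ, and adds the cluster's posterior mean to the environment reward inside a tabular ϵ-greedy Q-learning loop. The `env` interface, `cluster_of` mapping, and `BeliefCluster` class are hypothetical names introduced for illustration only.

```python
import random
from collections import defaultdict


class BeliefCluster:
    """Hypothetical conjugate-normal belief over the reward of a group of states,
    with prior mean mu0 and prior pseudo-count lam (the paper's lambda)."""

    def __init__(self, mu0, lam):
        self.mu0, self.lam = mu0, lam
        self.n, self.total = 0, 0.0

    def update(self, observed_reward):
        # Incorporate one observed reward for a state in this cluster.
        self.n += 1
        self.total += observed_reward

    def posterior_mean(self):
        # Prior dominates early (strong shaping); data dominates later.
        return (self.lam * self.mu0 + self.total) / (self.lam + self.n)


def q_learning_with_belief_shaping(env, clusters, cluster_of, episodes=1000,
                                   alpha=0.05, gamma=1.0, epsilon=0.1,
                                   max_steps=100):
    """Tabular epsilon-greedy Q-learning where the environment reward is
    supplemented by the posterior mean of the next state's belief cluster
    (an illustrative stand-in for the paper's BRS critic)."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        for _ in range(max_steps):
            actions = env.actions(s)
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(s, a)

            # Belief-based shaping term for the next state's cluster (sketch).
            c = clusters[cluster_of(s_next)]
            shaped_r = r + c.posterior_mean()

            if done:
                target = shaped_r
            else:
                target = shaped_r + gamma * max(
                    Q[(s_next, a_)] for a_ in env.actions(s_next))
            Q[(s, a)] += alpha * (target - Q[(s, a)])

            c.update(r)  # accumulated evidence gradually washes out the prior
            s = s_next
            if done:
                break
    return Q
```

In this sketch, when λ is large relative to the number of observed rewards the prior term dominates early learning and then decays as evidence accumulates, which is the qualitative behaviour the parameter grid above explores.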
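
For reference, the hyperparameters and parameter grids quoted in the Research Type and Experiment Setup rows can be collected into a single configuration sketch. The numeric values are the ones quoted above; the dictionary layout and variable names are assumptions made for illustration, not the authors' code.

```python
from itertools import product

# Values quoted in the table above.
EXPERIMENT_CONFIG = {
    "gamma": 1.0,            # discount factor
    "alpha": 0.05,           # constant learning rate
    "epsilon": 0.1,          # epsilon-greedy exploration probability
    "p_jump": 0.2,           # gridworld jump probability
    "max_steps": 100,        # early-termination criterion per episode
    "episodes": 1000,        # episodes per algorithm / parameter setting
    "runs": 50,              # independent runs averaged per setting
    "smoothing_window": 10,  # plots averaged over 10 consecutive points
    "belief_mu0_abs": 0.5,   # |mu0| for all belief clusters (sign set per cluster)
    "belief_lambda": 3000,   # lambda for all belief clusters
}

# Parameter grids quoted for each shaping algorithm.
BRS_GRID = list(product([0.5, 1, 5, 10, 100],            # mu0
                        [100, 500, 1000, 5000, 10000]))  # lambda
PBRS_GRID = list(product([0.5, 1, 5, 10, 100], repeat=2))  # (mu0, mu1)
PBA_GRID = list(product([0.5, 1, 5, 10, 100], repeat=3))   # (mu0, mu1, mu2)
```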