Belief Reward Shaping in Reinforcement Learning
Authors: Ofir Marom, Benjamin Rosman
AAAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments are run on a gridworld and a more complex backgammon domain, showing that tasks can be learned significantly faster when intuitive priors are specified on the reward distribution. Each algorithm is tested on a grid of parameter values: BRS(μ0, λ) is run for μ0 ∈ {0.5, 1, 5, 10, 100} and λ ∈ {100, 500, 1000, 5000, 10000}, while PBRS(μ0, μ1) and PBA(μ0, μ1, μ2) are run with each parameter taking values in {0.5, 1, 5, 10, 100}. Each algorithm and parameter setting is run over 1000 episodes and averaged over 50 independent runs; plots are then averaged over 10 consecutive points with error bars included (see the configuration sketch after this table). |
| Researcher Affiliation | Academia | Ofir Marom University of the Witwatersrand Johannesburg, South Africa Benjamin Rosman University of the Witwatersrand Johannesburg, South Africa, and Council for Scientific and Industrial Research Pretoria, South Africa |
| Pseudocode | Yes | Algorithm 1: Q-learning algorithm augmented with BRS for episodic tasks (a hedged sketch follows this table). |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | No | The paper describes experiments on a 'gridworld' and 'backgammon domain' and mentions using an 'identical setup to TDG0.0' for backgammon. However, it does not provide concrete access information (links, DOIs, specific citations with authors/year) for any publicly available or open datasets used for training. |
| Dataset Splits | No | The paper describes training procedures (e.g., 'train a baseline neural network (BL) over 50,000 games') and uses terms like 'episodes', but it does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning into train/validation/test sets. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions algorithms like 'Q-learning' and 'TD(λ) neural networks' but does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | For learning to solve this RL problem we use an ϵ-greedy policy that explores random actions with some probability ϵ and acts greedily otherwise. We use γ = 1, a constant learning rate α = 0.05, and ϵ = 0.1. We set p_jump = 0.2 and apply an early-termination criterion of 100 steps per episode. In our setup, the states for each of these concepts are grouped into a separate belief cluster and we use \|μ0\| = 0.5 and λ = 3000 for all belief clusters; the sign of μ0 depends on whether we want to incentivise or disincentivise the states in that cluster (see the configuration sketch after this table). |
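
The Pseudocode row above cites Algorithm 1 of the paper (Q-learning augmented with BRS for episodic tasks). The snippet below is a minimal illustrative sketch, not the authors' exact algorithm: it assumes a conjugate-normal belief per state cluster with prior mean μ0 and prior pseudo-count λ, and adds the cluster's posterior mean to the environment reward inside a tabular ϵ-greedy Q-learning loop. The `env` interface, `cluster_of` mapping, and `BeliefCluster` class are hypothetical names introduced for illustration only.

```python
import random
from collections import defaultdict


class BeliefCluster:
    """Hypothetical conjugate-normal belief over the reward of a group of states,
    with prior mean mu0 and prior pseudo-count lam (the paper's lambda)."""

    def __init__(self, mu0, lam):
        self.mu0, self.lam = mu0, lam
        self.n, self.total = 0, 0.0

    def update(self, observed_reward):
        # Incorporate one observed reward for a state in this cluster.
        self.n += 1
        self.total += observed_reward

    def posterior_mean(self):
        # Prior dominates early (strong shaping); data dominates later.
        return (self.lam * self.mu0 + self.total) / (self.lam + self.n)


def q_learning_with_belief_shaping(env, clusters, cluster_of, episodes=1000,
                                   alpha=0.05, gamma=1.0, epsilon=0.1,
                                   max_steps=100):
    """Tabular epsilon-greedy Q-learning where the environment reward is
    supplemented by the posterior mean of the next state's belief cluster
    (an illustrative stand-in for the paper's BRS critic)."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        for _ in range(max_steps):
            actions = env.actions(s)
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(s, a)

            # Belief-based shaping term for the next state's cluster (sketch).
            c = clusters[cluster_of(s_next)]
            shaped_r = r + c.posterior_mean()

            if done:
                target = shaped_r
            else:
                target = shaped_r + gamma * max(
                    Q[(s_next, a_)] for a_ in env.actions(s_next))
            Q[(s, a)] += alpha * (target - Q[(s, a)])

            c.update(r)  # accumulated evidence gradually washes out the prior
            s = s_next
            if done:
                break
    return Q
```

In this sketch, when λ is large relative to the number of observed rewards the prior term dominates early learning and then decays as evidence accumulates, which is the qualitative behaviour the parameter grid above explores.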
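
For reference, the hyperparameters and parameter grids quoted in the Research Type and Experiment Setup rows can be collected into a single configuration sketch. The numeric values are the ones quoted above; the dictionary layout and variable names are assumptions made for illustration, not the authors' code.

```python
from itertools import product

# Values quoted in the table above.
EXPERIMENT_CONFIG = {
    "gamma": 1.0,            # discount factor
    "alpha": 0.05,           # constant learning rate
    "epsilon": 0.1,          # epsilon-greedy exploration probability
    "p_jump": 0.2,           # gridworld jump probability
    "max_steps": 100,        # early-termination criterion per episode
    "episodes": 1000,        # episodes per algorithm / parameter setting
    "runs": 50,              # independent runs averaged per setting
    "smoothing_window": 10,  # plots averaged over 10 consecutive points
    "belief_mu0_abs": 0.5,   # |mu0| for all belief clusters (sign set per cluster)
    "belief_lambda": 3000,   # lambda for all belief clusters
}

# Parameter grids quoted for each shaping algorithm.
BRS_GRID = list(product([0.5, 1, 5, 10, 100],            # mu0
                        [100, 500, 1000, 5000, 10000]))  # lambda
PBRS_GRID = list(product([0.5, 1, 5, 10, 100], repeat=2))  # (mu0, mu1)
PBA_GRID = list(product([0.5, 1, 5, 10, 100], repeat=3))   # (mu0, mu1, mu2)
```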