Global Policy Construction in Modular Reinforcement Learning

Authors: Ruohan Zhang, Zhao Song, Dana Ballard

AAAI 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments: Our test domain is a navigation task in a 2D grid world of size 9 × 9, as shown in Figure 1(a). ... We compare the performance of our three algorithms with two baseline algorithms: a random agent and a reflex agent. The two performance criteria are average success rate and average number of steps to complete a successful trial. ... The results are shown in Figure 1(b) and (c). Our algorithms have a higher success rate and require fewer steps to succeed.
Researcher Affiliation | Academia | Ruohan Zhang, Zhao Song, and Dana H. Ballard; Department of Computer Science, The University of Texas at Austin; 2317 Speedway, Stop D9500, Austin, Texas 78712-1757, USA; {zharu,zhaos}@utexas.edu, dana@cs.utexas.edu
Pseudocode | No | The paper describes the algorithms in prose but does not provide structured pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not provide any concrete access to source code for the methodology described.
Open Datasets | No | Our test domain is a navigation task in a 2D grid world of size 9 × 9, as shown in Figure 1(a). Our agent starts at the center, and its action space is A = {up, down, left, right}. There are prizes that need to be collected. There are also cells that are obstacles; stepping onto an obstacle incurs a negative reward. The dark dot is a predator, starting at the upper-left corner of the map, which chases the agent with probability .5 and chooses a random action otherwise. Being captured by the predator results in termination of an experiment trial and a large negative reward. The paper describes the environment but does not provide access to it as a public dataset. (A hypothetical sketch of this environment is given after the table.)
Dataset Splits | No | The paper mentions generating maps and running trials but does not provide specific train/validation/test dataset splits.
Hardware Specification | No | The paper does not provide any specific hardware details (such as CPU/GPU models or memory) used for running its experiments.
Software Dependencies | No | The paper refers to algorithms such as Sarsa(λ) but does not provide specific ancillary software details with version numbers.
Experiment Setup | Yes | For the reward function, we set R_prize = +10, R_obstacle = -10, R_predator = -100 for entering the state (0, 0). For the discount factors, γ_prize = .7, γ_obstacle = 0, γ_predator = .1. ... A trial is successful if the agent collects all prizes within 250 steps without being captured by the predator. ... We randomly pick 10% of cells to contain a prize. Let p_obstacle denote the proportion of cells that are obstacles. Since this value defines task difficulty, we choose p_obstacle ∈ [0, .2] with a step size of .01, resulting in 21 levels of difficulty. For each level, we randomly generate 10^3 maps with different layouts, and the agent navigates each map for 5 trials, testing one algorithm per trial. (A hedged sketch of this map-generation and evaluation protocol is given after the table.)
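
To make the environment description in the "Open Datasets" row concrete, here is a minimal sketch of the described 9 × 9 grid-world task. This is not the authors' code: the class name GridWorld, the (row, column) coordinate convention, and the greedy chase rule for the predator are assumptions; the reward magnitudes follow the figures quoted in the "Experiment Setup" row, with signs matching the "negative reward" wording.

```python
import random

# Hypothetical sketch of the 9 x 9 grid-world navigation task: the agent
# starts at the center, prizes must be collected, obstacles give a negative
# reward, and a predator starting at the upper-left corner chases the agent
# with probability .5 (random move otherwise). Capture ends the trial with
# a large negative reward.

GRID_SIZE = 9
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

class GridWorld:
    def __init__(self, prize_cells, obstacle_cells):
        self.agent = (GRID_SIZE // 2, GRID_SIZE // 2)   # agent starts at the center
        self.predator = (0, 0)                          # predator starts at the upper-left corner
        self.prizes = set(prize_cells)
        self.obstacles = set(obstacle_cells)

    @staticmethod
    def _clip(pos):
        r, c = pos
        return (min(max(r, 0), GRID_SIZE - 1), min(max(c, 0), GRID_SIZE - 1))

    def _move_predator(self):
        if random.random() < 0.5:
            # Greedy chase (an assumption): step along one axis toward the agent.
            dr = (self.agent[0] > self.predator[0]) - (self.agent[0] < self.predator[0])
            dc = (self.agent[1] > self.predator[1]) - (self.agent[1] < self.predator[1])
            step = (dr, 0) if dr != 0 else (0, dc)
        else:
            step = random.choice(list(ACTIONS.values()))
        self.predator = self._clip((self.predator[0] + step[0], self.predator[1] + step[1]))

    def step(self, action):
        """Apply one agent action; return (reward, done)."""
        dr, dc = ACTIONS[action]
        self.agent = self._clip((self.agent[0] + dr, self.agent[1] + dc))
        self._move_predator()
        if self.agent == self.predator:
            return -100, True                 # captured: large negative reward, trial terminates
        reward = 0
        if self.agent in self.obstacles:
            reward -= 10                      # stepping onto an obstacle
        if self.agent in self.prizes:
            reward += 10
            self.prizes.discard(self.agent)
        return reward, len(self.prizes) == 0  # done once every prize is collected
```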
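
The evaluation protocol in the "Experiment Setup" row can be sketched in the same hedged way, building on the GridWorld class above: 21 obstacle densities from 0 to .2, 10% prize cells, 10^3 random maps per level, trials capped at 250 steps, and one trial per algorithm per map. Function names, the seeding scheme, and keeping the start cell free of prizes and obstacles are assumptions; the per-module discount factors belong to the learning algorithms themselves and are not reproduced here.

```python
import random

MAX_STEPS = 250
PRIZE_FRACTION = 0.10
DIFFICULTY_LEVELS = [i * 0.01 for i in range(21)]    # p_obstacle in [0, .2], step .01

def generate_map(p_obstacle, rng):
    # 10% of cells hold a prize and a fraction p_obstacle are obstacles;
    # keeping the start cell free is an assumption, not stated in the paper.
    cells = [(r, c) for r in range(GRID_SIZE) for c in range(GRID_SIZE)]
    cells.remove((GRID_SIZE // 2, GRID_SIZE // 2))
    rng.shuffle(cells)
    n_prizes = round(PRIZE_FRACTION * GRID_SIZE ** 2)
    n_obstacles = round(p_obstacle * GRID_SIZE ** 2)
    return cells[:n_prizes], cells[n_prizes:n_prizes + n_obstacles]

def run_trial(policy, prize_cells, obstacle_cells):
    # A trial succeeds if all prizes are collected within 250 steps
    # without the agent being captured by the predator.
    env = GridWorld(prize_cells, obstacle_cells)
    for step in range(1, MAX_STEPS + 1):
        _, done = env.step(policy(env))
        if done:
            return len(env.prizes) == 0, step   # success only if every prize was collected
    return False, MAX_STEPS

def evaluate(policies, maps_per_level=10 ** 3, seed=0):
    # `policies` maps an algorithm name to a function env -> action; each
    # algorithm navigates every generated map once ("one algorithm per trial").
    rng = random.Random(seed)
    results = {}
    for p_obstacle in DIFFICULTY_LEVELS:
        wins = {name: 0 for name in policies}
        steps_to_success = {name: [] for name in policies}
        for _ in range(maps_per_level):
            layout = generate_map(p_obstacle, rng)
            for name, policy in policies.items():
                success, steps = run_trial(policy, *layout)
                if success:
                    wins[name] += 1
                    steps_to_success[name].append(steps)
        results[p_obstacle] = {
            name: (wins[name] / maps_per_level,                        # average success rate
                   sum(steps_to_success[name]) / max(len(steps_to_success[name]), 1))
            for name in policies
        }
    return results

# Example usage with a random baseline agent (one of the paper's two baselines):
# evaluate({"random": lambda env: random.choice(list(ACTIONS))})
```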