Global Policy Construction in Modular Reinforcement Learning
Authors: Ruohan Zhang, Zhao Song, Dana Ballard
AAAI 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments: Our test domain is a navigation task in a 2D grid world of size 9 × 9, as shown in Figure 1(a). ... We compare the performance of our three algorithms with two baseline algorithms: a random agent and a reflex agent. The two performance criteria are average success rate and average number of steps to complete a successful trial. ... The results are shown in Figure 1(b) and (c). Our algorithms have a higher success rate and require fewer steps to success. |
| Researcher Affiliation | Academia | Ruohan Zhang, Zhao Song, and Dana H. Ballard; Department of Computer Science, The University of Texas at Austin; 2317 Speedway, Stop D9500, Austin, Texas 78712-1757, USA; {zharu,zhaos}@utexas.edu, dana@cs.utexas.edu |
| Pseudocode | No | The paper describes the algorithms in prose but does not provide structured pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not provide any concrete access to source code for the methodology described. |
| Open Datasets | No | Our test domain is a navigation task in a 2D grid world of size 9 × 9, as shown in Figure 1(a). Our agent starts at the center, and its action space is A = {up, down, left, right}. There are prizes that need to be collected. There are also cells that are obstacles; stepping onto an obstacle incurs a negative reward. The dark dot is a predator, starting at the upper-left corner of the map, which chases the agent with probability .5 and chooses a random action otherwise. Being captured by the predator results in termination of an experiment trial and a large negative reward. The paper describes the environment but does not provide access to it as a public dataset. (A minimal code sketch of this environment appears after the table.) |
| Dataset Splits | No | The paper mentions generating maps and running trials but does not provide specific train/validation/test dataset splits. |
| Hardware Specification | No | The paper does not provide any specific hardware details (like CPU/GPU models or memory) used for running its experiments. |
| Software Dependencies | No | The paper refers to algorithms like Sarsa(λ) but does not provide specific ancillary software details with version numbers. |
| Experiment Setup | Yes | For the reward function, we set R_prize = +10, R_obstacle = −10, R_predator = −100 for entering the state (0, 0). For the discount factor, γ_prize = 0.7, γ_obstacle = 0, γ_predator = 0.1. ... A trial is successful if the agent collects all prizes within 250 steps without being captured by the predator. ... We randomly pick 10% of cells to contain a prize. Let p_obstacle denote the proportion of cells that are obstacles. Since this value defines task difficulty, we choose p_obstacle ∈ [0, 0.2] with a step size of 0.01, resulting in 21 levels of difficulty. For each level, we randomly generate 10³ maps with different layouts, and the agent navigates each map for 5 trials, testing one algorithm per trial. (A sketch of this protocol appears after the table.) |
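
The environment excerpt quoted above is concrete enough to sketch in code. The following is a minimal, illustrative reconstruction of the 9 × 9 grid world, not the authors' implementation: the class and method names (`GridWorld`, `step`), the layout-generation procedure, and the tie-breaking details of the predator's chase move are assumptions filled in for readability.

```python
# Illustrative sketch only: names and layout-generation details are assumptions,
# not the authors' code.
import random

SIZE = 9
ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

class GridWorld:
    """9 x 9 navigation task: collect all prizes, avoid obstacles and the predator."""

    def __init__(self, p_obstacle=0.1, p_prize=0.1, seed=None):
        self.rng = random.Random(seed)
        self.agent = (SIZE // 2, SIZE // 2)      # agent starts at the center
        self.predator = (0, 0)                   # predator starts at the upper-left corner
        cells = [(x, y) for x in range(SIZE) for y in range(SIZE)
                 if (x, y) not in (self.agent, self.predator)]
        self.rng.shuffle(cells)
        n_prize = round(p_prize * SIZE * SIZE)       # ~10% of cells contain a prize
        n_obstacle = round(p_obstacle * SIZE * SIZE)
        self.prizes = set(cells[:n_prize])
        self.obstacles = set(cells[n_prize:n_prize + n_obstacle])
        self.steps = 0

    def _move(self, pos, action):
        dx, dy = ACTIONS[action]
        return (min(max(pos[0] + dx, 0), SIZE - 1),
                min(max(pos[1] + dy, 0), SIZE - 1))

    @staticmethod
    def _dist(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    def step(self, action):
        """Apply the agent's action, then the predator's; return (reward, done)."""
        self.steps += 1
        self.agent = self._move(self.agent, action)
        reward = 0.0
        if self.agent in self.prizes:            # collecting a prize: +10
            self.prizes.discard(self.agent)
            reward += 10.0
        if self.agent in self.obstacles:         # stepping onto an obstacle: -10
            reward -= 10.0
        # The predator chases the agent with probability .5, otherwise moves randomly.
        if self.rng.random() < 0.5:
            chase = min(ACTIONS,
                        key=lambda a: self._dist(self._move(self.predator, a), self.agent))
            self.predator = self._move(self.predator, chase)
        else:
            self.predator = self._move(self.predator, self.rng.choice(list(ACTIONS)))
        if self.predator == self.agent:          # capture: -100 and the trial ends
            return reward - 100.0, True
        done = not self.prizes or self.steps >= 250   # success, or the 250-step limit
        return reward, done
```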
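
Similarly, the experiment-setup cell pins down a protocol that can be written as a small driver loop. The sketch below reuses the `GridWorld` class above; `CONFIG` collects the module rewards and discount factors quoted from the paper, while `agent_factories` and `run_trial` are hypothetical placeholders for the three proposed algorithms, the two baselines, and an episode runner, none of which are specified in code by the paper.

```python
# Illustrative sketch only: agent_factories and run_trial are hypothetical placeholders.
CONFIG = {
    "R_prize": +10.0, "R_obstacle": -10.0, "R_predator": -100.0,       # module rewards
    "gamma_prize": 0.7, "gamma_obstacle": 0.0, "gamma_predator": 0.1,  # module discount factors
    "max_steps": 250,   # a trial succeeds only if all prizes are collected within 250 steps
}

def evaluate(agent_factories, run_trial, maps_per_level=1000):
    """agent_factories: {name: callable(config) -> agent}; run_trial plays one episode
    and returns (success: bool, n_steps: int)."""
    results = {name: {} for name in agent_factories}
    for level in range(21):                          # p_obstacle from 0.00 to 0.20, step 0.01
        p_obstacle = level * 0.01
        stats = {name: {"succ": 0, "steps": []} for name in agent_factories}
        for m in range(maps_per_level):              # 10^3 random layouts per difficulty level
            seed = hash((level, m))                  # same layout for every algorithm's trial
            for name, factory in agent_factories.items():   # 5 trials per map, one per algorithm
                env = GridWorld(p_obstacle=p_obstacle, p_prize=0.1, seed=seed)
                success, n_steps = run_trial(factory(CONFIG), env, CONFIG["max_steps"])
                if success:
                    stats[name]["succ"] += 1
                    stats[name]["steps"].append(n_steps)
        for name, s in stats.items():
            results[name][round(p_obstacle, 2)] = {
                "success_rate": s["succ"] / maps_per_level,
                "avg_steps_to_success": (sum(s["steps"]) / len(s["steps"])
                                         if s["steps"] else float("nan")),
            }
    return results
```

Generating the layout once per map and replaying it for each algorithm mirrors the paper's "5 trials per map, one algorithm per trial" protocol (three proposed algorithms plus the random and reflex baselines), so that all algorithms face identical maps at each difficulty level.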