Planning and Learning for Decentralized MDPs With Event Driven Rewards

Authors: Tarun Gupta, Akshat Kumar, Praveen Paruchuri

AAAI 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experiment on: (a) the Mars-rover problem from (Becker et al. 2004), and (b) multiagent coverage under uncertainty and partial observability. We show that our multiagent RL scales well for this problem whereas EM and NLP fail, and also provides much better solution quality than independently optimizing agent policies, confirming the effectiveness of incorporating joint-events for computing gradients. Thus, our work significantly advances the scalability of multiagent planning for real-world problems. We tested on two domains: the Mars-rover problem and the multiagent coverage problem. Figure 3a shows the runtime results for NLP and EM for the four categories of problems (Easy, Medium, Hard, All) on the x-axis. Figure 5 shows the average reward quality achieved by MARL for different settings of the reset time k.
Researcher Affiliation | Academia | (1) Machine Learning Lab, Kohli Center on Intelligent Systems, IIIT Hyderabad; (2) School of Information Systems, Singapore Management University
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It presents mathematical formulations and equations but no procedural, code-like descriptions.
Open Source Code | No | The paper does not include any explicit statement about releasing source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets | Yes | We experiment on: (a) the Mars-rover problem from (Becker et al. 2004), and (b) multiagent coverage under uncertainty and partial observability. We test our multiagent RL approach on the real MRT map of Singapore.
Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, or detailed splitting methodology) for training, validation, or testing.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory, or cloud instance types) used to run its experiments.
Software Dependencies | No | The paper mentions software such as NLP solvers, SNOPT, and deep neural networks but does not specify version numbers for these, which are needed for reproducible software dependencies.
Experiment Setup | Yes | The time horizon was 1024 minutes (about 17 hours). To claim the reward, agents must successfully inspect a location once every k time steps (also called the reset time), and k (in hours) was varied over {0.5, 1, 2, 4}. The inspect action consumes 15 minutes, and moving to the next location on the line takes 3 minutes. The joint reward for inspecting any shared location was much higher than for private locations, given that shared locations are heavily crowded and thus more important.
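
The reset-time reward rule in the experiment setup lends itself to a short illustration. The Python sketch below is not from the paper: the function name coverage_reward, the fixed-window reading of "inspect once every k time steps", and the reward_per_window value are illustrative assumptions. It shows one plausible way to score a location's inspection history over the 1024-minute horizon.

    # Illustrative sketch (not the authors' code) of the event-driven reward rule:
    # a location yields reward only if it is inspected at least once in every
    # window of k time steps (the "reset time"). Values below follow the setup
    # quoted above; reward_per_window is an assumed placeholder.

    HORIZON_MIN = 1024   # time horizon: 1024 minutes (~17 hours)
    INSPECT_MIN = 15     # inspect action duration (not used by the rule below)
    MOVE_MIN = 3         # move-to-next-location duration (not used below)

    def coverage_reward(inspection_times, k_hours, reward_per_window=1.0):
        """Count how many consecutive windows of length k contain at least
        one inspection, and return the accumulated reward."""
        k_min = int(k_hours * 60)
        covered = 0
        for window_start in range(0, HORIZON_MIN, k_min):
            window_end = window_start + k_min
            if any(window_start <= t < window_end for t in inspection_times):
                covered += 1
        return covered * reward_per_window

    # Example: with k = 1 hour, inspections at minutes 10, 70, and 200 cover
    # only three of the ~17 hourly windows.
    print(coverage_reward([10, 70, 200], k_hours=1))

A sliding-deadline reading of the reset time (the clock restarts after each successful inspection) is also consistent with the quoted description; only the windowing logic would change.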