Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Dependency Structure Search Bayesian Optimization for Decision Making Models

Authors: Mohit Rajpal, Lac Gia Tran, Yehong Zhang, Bryan Kian Hsiang Low

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our approach shows strong empirical results under malformed or sparse reward. We validate our approach on several multi-agent benchmarks and show our approach outperforms related works for compact models fit for memory-constrained scenarios. Our DSS-GP-UCB also overcomes sparse reward behavior in the reward function in multiple settings, showing its effectiveness in Decision Making Models in both the single-agent and multi-agent settings.
Researcher Affiliation | Academia | Mohit Rajpal (EMAIL), Department of Computer Science, National University of Singapore; Lac Gia Tran (EMAIL), Department of Computer Science, National University of Singapore; Yehong Zhang (EMAIL), Peng Cheng Lab; Bryan Kian Hsiang Low (EMAIL), Department of Computer Science, National University of Singapore
Pseudocode | Yes | Algorithm 1: Role Assignment; Algorithm 2: Role Interaction; Algorithm 3: gen-Policy; Algorithm 4: DSS-GP-UCB
Open Source Code | Yes | The code is available in the supplementary materials and will be open sourced.
Open Datasets | Yes | We conduct ablation experiments on Multiagent Ant with 6 agents, Pred Prey with 3 agents, and Heterogeneous Pred Prey with 3 agents (Peng et al., 2021). Multiagent Ant is a MuJoCo (Todorov et al., 2012) locomotion task where each agent controls an individual appendage. Pred Prey is a task where predators must work together to catch faster, more agile prey. Het. Pred Prey is similar, except the predators have different capabilities of speed and acceleration. We repeat this validation in Appendix B with MARL algorithms in multi-agent settings and consider a delayed feedback setting with similar results. For single-agent RL we compared against SAC (Haarnoja et al., 2018), PPO (Schulman et al., 2017), TD3 (Fujimoto et al., 2018), and DDPG (Lillicrap et al., 2015), as well as an algorithm using intrinsic motivation (Zheng et al., 2018). ... The tested environments were standard OpenAI Gym benchmarks of Ant, Hopper, Swimmer, and Walker2D.
Dataset Splits | No | The paper discusses various environments like Multiagent Ant, Pred Prey, Heterogeneous Pred Prey, and standard OpenAI Gym benchmarks. It also mentions different numbers of agents (e.g., "Multiagent Ant with 6 agents") and epoch lengths for environments (e.g., "epoch length of 1000, for Predator-Prey environments, epoch length was 25"). However, it does not provide specific details on training, validation, or test dataset splits (e.g., percentages or sample counts) for reproducibility.
Hardware Specification | No | All experiments were performed on commodity CPUs and GPUs. Each experimental setting took no more than 2 days to complete on a single GPU.
Software Dependencies | Yes | We used Trieste (Berkeley et al., 2022), TensorFlow (Abadi et al., 2015), and GPflow (Matthews et al., 2017) to build our work, and perform comparisons using MushroomRL (D'Eramo et al., 2021), Multi-Agent MuJoCo (de Witt et al., 2020), OpenAI Gym (Brockman et al., 2016), and the Multi-agent Particle environment (Lowe et al., 2017).
Experiment Setup | Yes | All presented figures are averages of 5 runs with shading representing Standard Error; the y-axis represents cumulative reward, the x-axis displayed above represents interactions with the environment in RL, and the x-axis displayed below represents iterations of BO. Commensurate with our focus on memory-constrained devices, all policy models consist of < 500 parameters. We observed that using neural networks of three layers with four neurons each was sufficiently balanced across a wide variety of tasks. All MARL environments were trained for 2,000,000 timesteps. We used a batch size of 15 in our comparison experiments. In this setting, all MuJoCo environments use the default epoch (total number of interactions with the environment for computing reward) length of 1000; for Predator-Prey environments, the epoch length was 25; for the Drone Delivery environment, the epoch length was 150. In the single-agent setting, we trained related work for 200,000 timesteps. In the MARL setting, we trained for 2,000,000 timesteps. In both the single-agent and multi-agent settings, all policy networks for both DSS-GP-UCB and related work were 3 layers of 10 neurons each. For computational efficiency, the epoch length for MuJoCo environments was reduced to 500. We used the Matern-5/2 as the base kernel in all our models.
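The setup above reports a GP surrogate with a Matern-5/2 base kernel optimized via a UCB-style acquisition. As a minimal sketch of those two ingredients only (a toy NumPy implementation for 1-D inputs, not the authors' DSS-GP-UCB code; the function names, the noise level, and the beta value are illustrative assumptions):

```python
import numpy as np

def matern52(X1, X2, lengthscale=1.0, variance=1.0):
    # Matern-5/2 kernel: k(r) = s^2 * (1 + sqrt(5)r/l + 5r^2/(3l^2)) * exp(-sqrt(5)r/l)
    r = np.abs(X1[:, None] - X2[None, :])
    a = np.sqrt(5.0) * r / lengthscale
    return variance * (1.0 + a + a**2 / 3.0) * np.exp(-a)

def gp_posterior(X, y, Xs, noise=1e-6):
    # Standard GP regression posterior mean/std at test points Xs.
    K = matern52(X, X) + noise * np.eye(len(X))
    Ks = matern52(X, Xs)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.diag(matern52(Xs, Xs) - v.T @ v)
    return mu, np.sqrt(np.maximum(var, 0.0))

def ucb(mu, sigma, beta=2.0):
    # GP-UCB acquisition: mean plus exploration bonus scaled by sqrt(beta).
    return mu + np.sqrt(beta) * sigma
```

In a BO loop, one would repeatedly fit `gp_posterior` to the evaluated points and pick the next query as the maximizer of `ucb` over candidates; the paper's method additionally searches over dependency structure, which this sketch omits.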