Towards a Pretrained Model for Restless Bandits via Multi-arm Generalization

Authors: Yunfan Zhao, Nikhil Behari, Edward Hughes, Edwin Zhang, Dheeraj Nagaraj, Karl Tuyls, Aparna Taneja, Milind Tambe

IJCAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We theoretically prove the benefits of multi-arm generalization and empirically demonstrate the advantages of our approach on several challenging, real-world inspired problems. We provide experimental evaluations of our model in three separate domains, including a synthetic setting, an epidemic modeling setting, as well as a maternal healthcare intervention setting. In Appendix B, we provide ablation studies over (1) a wider range of opt-in rates (2) different feature mappings (3) DDLPO topline with and without features (4) more problem settings.
Researcher Affiliation | Collaboration | Yunfan Zhao¹, Nikhil Behari¹, Edward Hughes², Edwin Zhang¹, Dheeraj Nagaraj², Karl Tuyls², Aparna Taneja², and Milind Tambe¹,² (¹Harvard University, ²Google)
Pseudocode | Yes | Algorithm 1 PreFeRMAB (Training), Algorithm 2 State Shaping Subroutine, Algorithm 3 PreFeRMAB (Inference)
Open Source Code | Yes | Code is available at https://github.com/yzhao3685/PreFeRMAB
Open Datasets | Yes | Following [Killian et al., 2022], we consider a synthetic dataset with binary states and binary actions. Inspired by the vast literature on agent-based epidemic modeling, we adapt the SIS model given in [Yaesoubi and Cohen, 2011], following a similar experiment setup as described in [Killian et al., 2022]. Similar to the set up in [Biswas et al., 2021; Killian et al., 2022], we model the real world maternal health problem as a discrete state RMAB. (An illustrative environment sketch appears after the table.)
Dataset Splits | No | The paper discusses training and testing but does not provide explicit details about a validation dataset split (e.g., percentages, sample counts, or specific methodology).
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using the PPO algorithm and the Ray RLlib library, but it does not specify any version numbers for these or other software dependencies.
Experiment Setup | Yes | In Appendix A, we provide additional details, including hyperparameters and a State Shaping illustration. All experiments use the PPO algorithm [Schulman et al., 2017] implemented with the Ray RLlib library. We set the discount factor to β = 0.99. Unless specified, we set the number of arms N = 21, budget B = 7 for Synthetic experiments; N = 20, B = 16 for SIS experiments; N = 25, B = 7 for ARMMAN experiments. The batch size is 512, with a learning rate of 0.0001. (An illustrative configuration sketch appears after the table.)
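
To make the synthetic setting quoted in the Open Datasets row concrete, below is a minimal sketch of a binary-state, binary-action restless arm in the style of [Killian et al., 2022]. This is not the authors' code: the transition sampling, the function names, and the trivial act-on-the-first-B policy are illustrative placeholders only; the paper instead learns its policy with PPO.

```python
import numpy as np

# Illustrative sketch (not the paper's code): one binary-state, binary-action
# RMAB arm in the style of the synthetic setting from Killian et al., 2022.
# All transition probabilities below are made-up placeholders.

def sample_arm_transitions(rng):
    """Draw P[s, a] = probability of moving to the 'good' state 1,
    given current state s and action a (act = 1, passive = 0)."""
    p = rng.uniform(0.0, 1.0, size=(2, 2))
    # Acting should help: enforce P(good | s, act) >= P(good | s, passive).
    p[:, 1] = np.maximum(p[:, 1], p[:, 0])
    return p

def step_arm(state, action, p, rng):
    """Advance a single arm by one timestep and return (next_state, reward)."""
    next_state = int(rng.random() < p[state, action])
    reward = next_state  # reward 1 whenever the arm is in the good state
    return next_state, reward

rng = np.random.default_rng(0)
arms = [sample_arm_transitions(rng) for _ in range(21)]   # N = 21 arms
states = [1] * len(arms)
budget = 7                                                # B = 7 actions per step
# Toy policy for illustration only: act on the first B arms.
actions = [1 if i < budget else 0 for i in range(len(arms))]
results = [step_arm(s, a, p, rng) for s, a, p in zip(states, actions, arms)]
```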
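
The Experiment Setup row reports PPO trained through Ray RLlib with discount factor 0.99, batch size 512, and learning rate 0.0001, but no library versions. The sketch below shows how those hyperparameters would map onto a Ray RLlib 2.x-style PPOConfig under that assumption; the stand-in environment and the number of training iterations are placeholders, not taken from the paper.

```python
# Illustrative sketch only (the paper reports no library versions):
# wiring the stated hyperparameters into a Ray RLlib 2.x-style PPOConfig.
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    # Stand-in environment so the sketch runs; the paper's RMAB environment
    # would be registered and referenced here instead.
    .environment(env="CartPole-v1")
    .training(
        gamma=0.99,            # discount factor beta = 0.99
        lr=1e-4,               # learning rate 0.0001
        train_batch_size=512,  # batch size 512
    )
)

algo = config.build()
for _ in range(10):            # number of training iterations is illustrative
    algo.train()
```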