Safe Offline Reinforcement Learning with Real-Time Budget Constraints

Authors: Qian Lin, Bo Tang, Zifan Wu, Chao Yu, Shangqin Mao, Qianlong Xie, Xingxing Wang, Dong Wang

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results on a wide range of simulation tasks and a real-world large-scale advertising application demonstrate the capability of TREBI in solving real-time budget constraint problems under offline settings.
Researcher Affiliation | Collaboration | Sun Yat-sen University and Meituan.
Pseudocode | Yes | Algorithm 1: Trajectory-based REal-time Budget Inference (TREBI). (A hedged sketch of the planning loop follows the table.)
Open Source Code | Yes | The overall implementation of TREBI is based on Diffuser (Janner et al., 2022) and can be found at https://github.com/qianlin04/Safe-offline-RL-with-diffusion-model.
Open Datasets | Yes | Experiments are run on two OpenAI Gym tasks with additional safety constraints (Pendulum swing-up and Reacher) (Sootla et al., 2022), three MuJoCo tasks (Hopper-v2, HalfCheetah-v2, Walker2d-v2) (Todorov et al., 2012) with a speed limit (Zhang et al., 2020; Yang et al., 2022), and two Bullet-Safety-Gym tasks (SafetyCarCircle-v0, SafetyBallReach-v0) (Gronauer, 2022). For the MuJoCo tasks, three dataset types from the D4RL benchmark (Fu et al., 2020) are used. (A loading sketch follows the table.)
Dataset Splits | No | The paper discusses training and evaluation and uses the D4RL benchmark, which has pre-defined splits, but it does not explicitly state train/validation/test splits (e.g., percentages or absolute counts) for the datasets used in its experiments; it focuses on evaluation protocols for different budgets rather than on dataset partitioning.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper mentions Python 3.8 in the implementation details but does not give version numbers for other key software components, libraries (such as PyTorch or TensorFlow), or specialized solvers needed to replicate the experiments.
Experiment Setup | Yes | The 'Hyper-parameter Settings' section in Appendix D specifies numerous hyperparameters, including 'the learning rate is 1e-5 for the actor and 1e-3 for the critic', 'the time interval of regenerating the trajectory for decision making (i.e., control frequency) is set to 1', 'the cost discount factor is set to γ = 1 for Pendulum, Reacher and the two Bullet-Safety-Gym tasks, and γ = 0.99 for all MuJoCo tasks', 'the max episode length is set to 200 for Pendulum, 50 for Reacher, 1000 for all MuJoCo tasks, 500 for SafetyCarCircle-v0 and 250 for SafetyBallReach-v0', 'the hyper-parameter n... is set to 1000 for Pendulum, Reacher, the Ads bidding task and 100 for all MuJoCo tasks and Bullet-Safety-Gym tasks', 'the length of trajectory is set to 128 for Pendulum and the Ads bidding task, and 32 for Reacher, all the MuJoCo tasks and Bullet-Safety-Gym tasks', and 'the hyper-parameter α in Eq. (16) is set to 0.1 for all tasks'. (These values are collected into a config sketch after the table.)
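
For orientation, the following is a minimal, illustrative sketch of the kind of receding-horizon, budget-aware planning loop that Algorithm 1 describes: sample candidate trajectories, keep those whose cumulative cost fits the remaining episode budget, execute the first action of the best one, and replan. The sampler, dimensions, and fallback rule below are placeholders of our own and not the paper's Diffuser-based implementation.

```python
# Illustrative-only sketch of a receding-horizon, budget-aware planning loop in the
# spirit of Algorithm 1 (TREBI). The sampler below is a random stand-in, NOT the
# diffusion model used in the paper's Diffuser-based implementation.
import numpy as np

def sample_trajectories(state, horizon, n_samples, action_dim, rng):
    """Placeholder trajectory sampler: returns (actions, rewards, costs) per sample."""
    actions = rng.uniform(-1.0, 1.0, size=(n_samples, horizon, action_dim))
    rewards = rng.normal(size=(n_samples, horizon))
    costs = rng.uniform(0.0, 0.1, size=(n_samples, horizon))
    return actions, rewards, costs

def plan_action(state, remaining_budget, horizon=32, n_samples=64, action_dim=3, rng=None):
    """Pick the first action of the best-return trajectory whose cumulative cost
    stays within the remaining (per-episode) budget."""
    if rng is None:
        rng = np.random.default_rng(0)
    actions, rewards, costs = sample_trajectories(state, horizon, n_samples, action_dim, rng)
    returns = rewards.sum(axis=1)
    total_costs = costs.sum(axis=1)
    feasible = total_costs <= remaining_budget
    if feasible.any():
        best = np.argmax(np.where(feasible, returns, -np.inf))
    else:  # no feasible sample: fall back to the least costly one (our own choice)
        best = np.argmin(total_costs)
    return actions[best, 0], costs[best, 0]

# Receding-horizon rollout: replan every step and shrink the budget by the cost incurred.
budget, state = 25.0, np.zeros(11)
for t in range(5):
    action, step_cost = plan_action(state, remaining_budget=budget)
    budget -= step_cost  # in a real rollout this would be the environment's cost signal
    # state = env.step(action) ...  (environment interaction omitted in this sketch)
```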
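The MuJoCo experiments draw on the D4RL benchmark; as a loading sketch, the snippet below shows the standard way such a dataset is fetched, assuming the `d4rl` package is installed. The dataset id 'hopper-medium-v2' follows the D4RL naming convention and may differ from the exact names used in the TREBI repository.

```python
# Minimal sketch of loading a D4RL MuJoCo dataset (standard d4rl usage, not the TREBI code).
import gym
import d4rl  # registers the D4RL environments with gym

env = gym.make('hopper-medium-v2')
dataset = env.get_dataset()  # dict with 'observations', 'actions', 'rewards', 'terminals', ...
print(dataset['observations'].shape, dataset['actions'].shape)
```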
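For convenience, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch. The dictionary keys below are our own naming; the values are taken verbatim from the paper's Appendix D as quoted above.

```python
# Hyperparameters quoted from Appendix D ('Hyper-parameter Settings'); key names are ours.
TREBI_HPARAMS = {
    "actor_lr": 1e-5,
    "critic_lr": 1e-3,
    "replan_interval": 1,  # regenerate the planned trajectory every step (control frequency)
    "cost_discount": {     # gamma
        "Pendulum": 1.0, "Reacher": 1.0, "Bullet-Safety-Gym": 1.0, "MuJoCo": 0.99,
    },
    "max_episode_length": {
        "Pendulum": 200, "Reacher": 50, "MuJoCo": 1000,
        "SafetyCarCircle-v0": 500, "SafetyBallReach-v0": 250,
    },
    "n": {  # the hyper-parameter n from the paper
        "Pendulum": 1000, "Reacher": 1000, "Ads bidding": 1000,
        "MuJoCo": 100, "Bullet-Safety-Gym": 100,
    },
    "trajectory_length": {
        "Pendulum": 128, "Ads bidding": 128,
        "Reacher": 32, "MuJoCo": 32, "Bullet-Safety-Gym": 32,
    },
    "alpha": 0.1,  # alpha in Eq. (16), same for all tasks
}
```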