Safe Offline Reinforcement Learning with Real-Time Budget Constraints

Authors: Qian Lin, Bo Tang, Zifan Wu, Chao Yu, Shangqin Mao, Qianlong Xie, Xingxing Wang, Dong Wang

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results on a wide range of simulation tasks and a real-world large-scale advertising application demonstrate the capability of TREBI in solving real-time budget constraint problems under offline settings.
Researcher Affiliation | Collaboration | Sun Yat-sen University and Meituan.
Pseudocode | Yes | Algorithm 1: Trajectory-based REal-time Budget Inference (TREBI). (A hedged sketch of the planning loop follows the table.)
Open Source Code | Yes | The overall implementation of TREBI is based on Diffuser (Janner et al., 2022) and can be found at https://github.com/qianlin04/Safe-offline-RL-with-diffusion-model.
Open Datasets | Yes | Experiments are run on two OpenAI Gym tasks with additional safety constraints (Pendulum swing-up and Reacher) (Sootla et al., 2022), three MuJoCo tasks (Hopper-v2, HalfCheetah-v2, Walker2d-v2) (Todorov et al., 2012) with a speed limit (Zhang et al., 2020; Yang et al., 2022), and two Bullet-Safety-Gym tasks (SafetyCarCircle-v0, SafetyBallReach-v0) (Gronauer, 2022). For the MuJoCo tasks, three dataset types from the D4RL benchmark (Fu et al., 2020) are used. (A loading sketch follows the table.)
Dataset Splits | No | The paper discusses training and evaluation and uses the D4RL benchmark, which has pre-defined splits, but it does not explicitly state train/validation/test splits (e.g., percentages or absolute counts) for the datasets used in its experiments; it focuses on evaluation protocols for different budgets rather than on dataset partitioning.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper mentions Python 3.8 in the implementation details but does not give version numbers for other key software components, libraries (such as PyTorch or TensorFlow), or specialized solvers needed to replicate the experiments.
Experiment Setup | Yes | The 'Hyper-parameter Settings' section in Appendix D specifies numerous hyperparameters, including 'the learning rate is 1e-5 for the actor and 1e-3 for the critic', 'the time interval of regenerating the trajectory for decision making (i.e., control frequency) is set to 1', 'the cost discount factor is set to γ = 1 for Pendulum, Reacher and the two Bullet-Safety-Gym tasks, and γ = 0.99 for all MuJoCo tasks', 'the max episode length is set to 200 for Pendulum, 50 for Reacher, 1000 for all MuJoCo tasks, 500 for SafetyCarCircle-v0 and 250 for SafetyBallReach-v0', 'the hyper-parameter n... is set to 1000 for Pendulum, Reacher, the Ads bidding task and 100 for all MuJoCo tasks and Bullet-Safety-Gym tasks', 'the length of trajectory is set to 128 for Pendulum and the Ads bidding task, and 32 for Reacher, all the MuJoCo tasks and Bullet-Safety-Gym tasks', and 'the hyper-parameter α in Eq. (16) is set to 0.1 for all tasks'. (These values are collected into a config sketch after the table.)
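
For orientation, the following is a minimal, illustrative sketch of the kind of receding-horizon, budget-aware planning loop that Algorithm 1 describes: sample candidate trajectories, keep those whose cumulative cost fits the remaining episode budget, execute the first action of the best one, and replan. The sampler, dimensions, and fallback rule below are placeholders of our own and not the paper's Diffuser-based implementation.

```python
# Illustrative-only sketch of a receding-horizon, budget-aware planning loop in the
# spirit of Algorithm 1 (TREBI). The sampler below is a random stand-in, NOT the
# diffusion model used in the paper's Diffuser-based implementation.
import numpy as np

def sample_trajectories(state, horizon, n_samples, action_dim, rng):
    """Placeholder trajectory sampler: returns (actions, rewards, costs) per sample."""
    actions = rng.uniform(-1.0, 1.0, size=(n_samples, horizon, action_dim))
    rewards = rng.normal(size=(n_samples, horizon))
    costs = rng.uniform(0.0, 0.1, size=(n_samples, horizon))
    return actions, rewards, costs

def plan_action(state, remaining_budget, horizon=32, n_samples=64, action_dim=3, rng=None):
    """Pick the first action of the best-return trajectory whose cumulative cost
    stays within the remaining (per-episode) budget."""
    if rng is None:
        rng = np.random.default_rng(0)
    actions, rewards, costs = sample_trajectories(state, horizon, n_samples, action_dim, rng)
    returns = rewards.sum(axis=1)
    total_costs = costs.sum(axis=1)
    feasible = total_costs <= remaining_budget
    if feasible.any():
        best = np.argmax(np.where(feasible, returns, -np.inf))
    else:  # no feasible sample: fall back to the least costly one (our own choice)
        best = np.argmin(total_costs)
    return actions[best, 0], costs[best, 0]

# Receding-horizon rollout: replan every step and shrink the budget by the cost incurred.
budget, state = 25.0, np.zeros(11)
for t in range(5):
    action, step_cost = plan_action(state, remaining_budget=budget)
    budget -= step_cost  # in a real rollout this would be the environment's cost signal
    # state = env.step(action) ...  (environment interaction omitted in this sketch)
```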
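The MuJoCo experiments draw on the D4RL benchmark; as a loading sketch, the snippet below shows the standard way such a dataset is fetched, assuming the `d4rl` package is installed. The dataset id 'hopper-medium-v2' follows the D4RL naming convention and may differ from the exact names used in the TREBI repository.

```python
# Minimal sketch of loading a D4RL MuJoCo dataset (standard d4rl usage, not the TREBI code).
import gym
import d4rl  # registers the D4RL environments with gym

env = gym.make('hopper-medium-v2')
dataset = env.get_dataset()  # dict with 'observations', 'actions', 'rewards', 'terminals', ...
print(dataset['observations'].shape, dataset['actions'].shape)
```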
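For convenience, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch. The dictionary keys below are our own naming; the values are taken verbatim from the paper's Appendix D as quoted above.

```python
# Hyperparameters quoted from Appendix D ('Hyper-parameter Settings'); key names are ours.
TREBI_HPARAMS = {
    "actor_lr": 1e-5,
    "critic_lr": 1e-3,
    "replan_interval": 1,  # regenerate the planned trajectory every step (control frequency)
    "cost_discount": {     # gamma
        "Pendulum": 1.0, "Reacher": 1.0, "Bullet-Safety-Gym": 1.0, "MuJoCo": 0.99,
    },
    "max_episode_length": {
        "Pendulum": 200, "Reacher": 50, "MuJoCo": 1000,
        "SafetyCarCircle-v0": 500, "SafetyBallReach-v0": 250,
    },
    "n": {  # the hyper-parameter n from the paper
        "Pendulum": 1000, "Reacher": 1000, "Ads bidding": 1000,
        "MuJoCo": 100, "Bullet-Safety-Gym": 100,
    },
    "trajectory_length": {
        "Pendulum": 128, "Ads bidding": 128,
        "Reacher": 32, "MuJoCo": 32, "Bullet-Safety-Gym": 32,
    },
    "alpha": 0.1,  # alpha in Eq. (16), same for all tasks
}
```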