Constrained Decision Transformer for Offline Safe Reinforcement Learning

Authors: Zuxin Liu, Zijian Guo, Yihang Yao, Zhepeng Cen, Wenhao Yu, Tingnan Zhang, Ding Zhao

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show the advantages of the proposed method in learning an adaptive, safe, robust, and high-reward policy. CDT outperforms its variants and strong offline safe RL baselines by a large margin with the same hyperparameters across all tasks, while keeping the zero-shot adaptation capability to different constraint thresholds, making our approach more suitable for real-world RL under constraints. In this section, we aim to evaluate the proposed approach and empirically answer the following questions: 1) can CDT learn a safe policy from a small ϵ-reducible offline dataset? 2) what is the importance of each component in CDT? 3) can CDT achieve zero-shot adaptation to different constraint thresholds? 4) is CDT robust to conflicting reward returns?
Researcher Affiliation | Collaboration | 1 Carnegie Mellon University, 2 Google DeepMind.
Pseudocode | Yes | Algorithm 1: Data Augmentation via Relabeling; Algorithm 2: Returns Conditioned Evaluation for CDT; Algorithm 3: CDT Training Procedure; Algorithm 4: CPPO. (A rollout sketch of returns-conditioned evaluation appears after the table.)
Open Source Code | No | No statement or link regarding the public release of source code for the described methodology is provided in the paper.
Open Datasets | Yes | The dataset format follows the D4RL benchmark (Fu et al., 2020), where we add another cost entry to record binary constraint violation signals. (A sketch of the expected data layout appears after the table.)
Dataset Splits | No | The paper does not explicitly state training/validation/test dataset splits (e.g., percentages, sample counts, or citations to predefined splits) needed for reproduction.
Hardware Specification | No | No specific hardware details (such as GPU/CPU models, memory, or specific computing environments) used for running the experiments are mentioned in the paper.
Software Dependencies | No | The paper mentions 'Bullet safety gym (Gronauer, 2022)' but does not provide version numbers for this or any other software dependency, library, or programming language used.
Experiment Setup | Yes | The complete hyperparameters used in the experiments are shown in Table 4. (The CDT column is collected into a config sketch below.)
CDT (all tasks): Number of layers 3 | Number of attention heads 8 | Embedding dimension 128 | Batch size 2048 | Context length K 10 | Learning rate 0.0001 | Dropout 0.1 | Adam betas (0.9, 0.999) | Grad norm clip 0.25
Baselines: Actor hidden size [256, 256] (BCQ-Lag, BEAR-Lag) or [300, 300] (CPQ) | VAE hidden size [750, 750] (BEAR-Lag) or [400, 400] (CPQ) | [KP, KI, KD] = [0.1, 0.003, 0.001] (BCQ-Lag, BEAR-Lag) | Batch size 512 | Critic learning rate 0.001 (all tasks)
Baselines, per task (Ant-Run / Car-Circle / Car-Run / Drone-Circle / Drone-Run): Rollout length 200 / 300 / 200 / 300 / 100 | Actor learning rate 0.0001 / 0.001 / 0.0001 / 0.0001 / 0.001
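The listed Algorithm 2 (Returns Conditioned Evaluation for CDT) follows the Decision Transformer recipe of conditioning on returns-to-go, extended with a cost return. Below is a minimal sketch of such a rollout loop, assuming a generic `model.predict` interface and a gym-style `env`; these names, and reading the per-step cost from `info["cost"]`, are illustrative assumptions rather than the authors' actual code.

```python
def evaluate_returns_conditioned(env, model, target_reward_return, target_cost_return,
                                 max_ep_len=300):
    """Sketch of a returns-conditioned rollout: condition the policy on desired
    reward/cost returns-to-go and decrement them by the observed signals."""
    obs = env.reset()
    states, actions = [obs], []
    reward_rtg, cost_rtg = [target_reward_return], [target_cost_return]
    ep_reward, ep_cost = 0.0, 0.0

    for _ in range(max_ep_len):
        # The transformer attends over the recent context of
        # (reward-return, cost-return, state, action) tokens.
        action = model.predict(states, actions, reward_rtg, cost_rtg)
        obs, reward, done, info = env.step(action)
        cost = info.get("cost", 0.0)  # binary constraint-violation signal (assumed key)

        # Decrement both returns-to-go by the observed reward and cost.
        reward_rtg.append(reward_rtg[-1] - reward)
        cost_rtg.append(cost_rtg[-1] - cost)
        states.append(obs)
        actions.append(action)
        ep_reward += reward
        ep_cost += cost
        if done:
            break
    return ep_reward, ep_cost
```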
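The Open Datasets row implies a D4RL-style flat transition layout with one extra cost array. A hedged sketch of what such a dataset could look like follows; the key name "costs" and the array sizes are our assumptions, while the remaining keys follow the published D4RL convention.

```python
import numpy as np

# Illustrative sizes only; the real datasets differ per task.
num_steps, obs_dim, act_dim = 100_000, 76, 8

dataset = {
    "observations": np.zeros((num_steps, obs_dim), dtype=np.float32),
    "actions":      np.zeros((num_steps, act_dim), dtype=np.float32),
    "rewards":      np.zeros((num_steps,), dtype=np.float32),
    "costs":        np.zeros((num_steps,), dtype=np.float32),  # added entry: 0/1 violation per step
    "terminals":    np.zeros((num_steps,), dtype=bool),
    "timeouts":     np.zeros((num_steps,), dtype=bool),
}
```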
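For convenience, the CDT column of Table 4 can be collected into a single configuration object; the field names below are ours, and the values are taken verbatim from the Experiment Setup row above.

```python
cdt_config = dict(
    n_layers=3,
    n_attention_heads=8,
    embedding_dim=128,
    batch_size=2048,
    context_length_K=10,
    learning_rate=1e-4,
    dropout=0.1,
    adam_betas=(0.9, 0.999),
    grad_norm_clip=0.25,
)
```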