Constrained Decision Transformer for Offline Safe Reinforcement Learning
Authors: Zuxin Liu, Zijian Guo, Yihang Yao, Zhepeng Cen, Wenhao Yu, Tingnan Zhang, Ding Zhao
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show the advantages of the proposed method in learning an adaptive, safe, robust, and high-reward policy. CDT outperforms its variants and strong offline safe RL baselines by a large margin with the same hyperparameters across all tasks, while keeping the zero-shot adaptation capability to different constraint thresholds, making our approach more suitable for real-world RL under constraints. In this section, we aim to evaluate the proposed approach and empirically answer the following questions: 1) can CDT learn a safe policy from a small ϵ-reducible offline dataset? 2) what is the importance of each component in CDT? 3) can CDT achieve zero-shot adaptation to different constraint thresholds? 4) is CDT robust to conflicting reward returns? |
| Researcher Affiliation | Collaboration | 1 Carnegie Mellon University, 2 Google DeepMind. |
| Pseudocode | Yes | Algorithm 1: Data Augmentation via Relabeling; Algorithm 2: Returns Conditioned Evaluation for CDT; Algorithm 3: CDT Training Procedure; Algorithm 4: CPPO. (A hedged sketch of the returns-conditioned evaluation loop appears after this table.) |
| Open Source Code | No | No statement or link regarding the public release of source code for the described methodology is provided in the paper. |
| Open Datasets | Yes | The dataset format follows the D4RL benchmark (Fu et al., 2020), where we add another cost entry to record binary constraint violation signals. (A minimal sketch of this dataset format appears after the table.) |
| Dataset Splits | No | The paper does not explicitly state specific training/validation/test dataset splits (e.g., percentages, sample counts, or predefined split citations) needed for reproduction. |
| Hardware Specification | No | No specific hardware details (such as GPU/CPU models, memory, or specific computing environments) used for running the experiments are mentioned in the paper. |
| Software Dependencies | No | The paper mentions 'Bullet safety gym (Gronauer, 2022)' but does not provide specific version numbers for it or for any other software dependencies, libraries, or programming languages used. |
| Experiment Setup | Yes | The complete hyperparameters used in the experiments are shown in Table 4. CDT (all tasks): number of layers 3; number of attention heads 8; embedding dimension 128; batch size 2048; context length K 10; learning rate 0.0001; dropout 0.1; Adam betas (0.9, 0.999); grad norm clip 0.25. Baselines: actor hidden size [256, 256] (BCQ-Lag, BEAR-Lag) and [300, 300] (CPQ); VAE hidden size [750, 750] (BEAR-Lag) and [400, 400] (CPQ); [KP, KI, KD] = [0.1, 0.003, 0.001] (BCQ-Lag, BEAR-Lag); batch size 512; critic learning rate 0.001 for all tasks; rollout length 200 (Ant-Run), 300 (Car-Circle), 200 (Car-Run), 300 (Drone-Circle), 100 (Drone-Run); actor learning rate 0.0001 (Ant-Run), 0.001 (Car-Circle), 0.0001 (Car-Run), 0.0001 (Drone-Circle), 0.001 (Drone-Run). (The CDT values are also collected into a config-dict sketch after the table.) |
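
The Pseudocode row lists Algorithm 2, "Returns Conditioned Evaluation for CDT", which is not reproduced in this summary. The sketch below only illustrates the generic returns-conditioned rollout loop used by decision-transformer-style policies: the model is conditioned on both a target reward return and a target cost return, and both are decremented as rewards and costs are received. The `model.predict_action` interface, the `info["cost"]` field, and the gym-style environment API are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def evaluate_returns_conditioned(model, env, target_reward, target_cost,
                                 context_len=10, max_steps=300):
    """Roll out a CDT-style policy conditioned on target reward/cost returns.

    `model.predict_action(...)` and the old-style gym `env.step(...)` API are
    hypothetical interfaces used for illustration only.
    """
    state = env.reset()
    states, actions = [state], []
    rewards_to_go, costs_to_go = [float(target_reward)], [float(target_cost)]
    ep_reward, ep_cost = 0.0, 0.0

    for _ in range(max_steps):
        # Condition on the most recent K-step context of states, actions,
        # and the two returns-to-go sequences.
        action = model.predict_action(
            np.asarray(states[-context_len:]),
            np.asarray(actions[-context_len:]),
            np.asarray(rewards_to_go[-context_len:]),
            np.asarray(costs_to_go[-context_len:]),
        )
        state, reward, done, info = env.step(action)
        cost = float(info.get("cost", 0.0))  # binary constraint-violation signal

        ep_reward += reward
        ep_cost += cost
        # Shrink the remaining target returns by what was just received.
        rewards_to_go.append(rewards_to_go[-1] - reward)
        costs_to_go.append(costs_to_go[-1] - cost)
        states.append(state)
        actions.append(action)
        if done:
            break
    return ep_reward, ep_cost
```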
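
The Open Datasets row quotes the paper's statement that the dataset follows the D4RL format with an extra cost entry recording binary constraint-violation signals. A minimal sketch of what such a transition buffer could look like is given below; the exact key names and array shapes are assumptions, not the released data schema.

```python
import numpy as np

# A D4RL-style offline dataset with an extra "costs" entry, as described in
# the paper; key names and dimensions here are illustrative assumptions.
num_transitions, obs_dim, act_dim = 10_000, 17, 6
dataset = {
    "observations":      np.zeros((num_transitions, obs_dim), dtype=np.float32),
    "actions":           np.zeros((num_transitions, act_dim), dtype=np.float32),
    "rewards":           np.zeros((num_transitions,), dtype=np.float32),
    "costs":             np.zeros((num_transitions,), dtype=np.float32),  # binary violation signal
    "terminals":         np.zeros((num_transitions,), dtype=bool),
    "next_observations": np.zeros((num_transitions, obs_dim), dtype=np.float32),
}
```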
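
For readability, the CDT hyperparameters reported for all tasks in Table 4 can be collected into a single configuration mapping, as in the sketch below. The key names are illustrative assumptions; the values are the ones quoted in the Experiment Setup row.

```python
# CDT hyperparameters reported for all tasks (Table 4); key names are
# assumptions, values are as quoted in the paper.
cdt_config = {
    "num_layers": 3,
    "num_attention_heads": 8,
    "embedding_dim": 128,
    "batch_size": 2048,
    "context_length": 10,
    "learning_rate": 1e-4,
    "dropout": 0.1,
    "adam_betas": (0.9, 0.999),
    "grad_norm_clip": 0.25,
}
```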