An Offline Adaptation Framework for Constrained Multi-Objective Reinforcement Learning
Authors: Qian Lin, Zongkai Liu, Danying Mo, Chao Yu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results on offline multi-objective and safe tasks demonstrate the capability of our framework to infer policies that align with real preferences while meeting the constraints implied by the provided demonstrations. |
| Researcher Affiliation | Collaboration | ¹Sun Yat-sen University, Guangzhou, China; ²Pengcheng Laboratory, Shenzhen, China; ³MoE Key Laboratory of Information Technology, Guangzhou, China |
| Pseudocode | Yes | Algorithm 1 Preference Distribution Offline Adaptation |
| Open Source Code | Yes | Codes and instructions are provided in supplemental material to generate the dataset used and reproduce the main results in the paper. |
| Open Datasets | Yes | We utilize the D4MORL dataset [Zhu et al., 2023] collected from multi-objective MuJoCo environments... We utilize the datasets in the DSRL benchmark [Liu et al., 2023b] that are collected by a set of behavior policies trained under various safe thresholds. |
| Dataset Splits | No | The unselected trajectories in the datasets constitute the training set. |
| Hardware Specification | Yes | The training and testing were conducted on 1 NVIDIA GeForce RTX 3090 GPU. |
| Software Dependencies | No | For MODF, we use the original implementation in https://github.com/qianlin04/PRMORL. |
| Experiment Setup | Yes | The weight η of the regularization term $(\|\mu\|_1 - 1)^2$ in Eq. (7) is set to 1.0. For each target, the number of gradient updates is set to 1000, with 64 preferences sampled from the distribution for each gradient update. All samples in the demonstration set are used for gradient updates within one batch. We use the Adam optimizer with a learning rate of 0.05. The conservatism weight α in Eq. (10) is set to 1.0 for MORL tasks and 0.7 for safe RL and CMORL tasks. The weight of the TD reward in Eq. (6) is set to 0.01 for PDOA [MODF]. |
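
For reference, the reported experiment-setup hyperparameters can be collected into a single configuration object and a skeleton of the adaptation loop they parameterize. The sketch below is illustrative only; the names `PDOAConfig`, `adaptation_model`, `sample_preferences`, and `loss` are hypothetical stand-ins and do not come from the paper, whose actual implementation is provided in the authors' supplemental material.

```python
from dataclasses import dataclass

import torch


@dataclass
class PDOAConfig:
    """Hyperparameters quoted in the experiment setup above."""
    eta: float = 1.0                # weight of the regularization term in Eq. (7)
    num_updates: int = 1000         # gradient updates per target
    prefs_per_update: int = 64      # preferences sampled from the distribution per update
    lr: float = 0.05                # Adam learning rate
    alpha_morl: float = 1.0         # conservatism weight in Eq. (10) for MORL tasks
    alpha_safe: float = 0.7         # conservatism weight for safe RL / CMORL tasks
    td_reward_weight: float = 0.01  # weight of the TD reward in Eq. (6) for PDOA [MODF]


def adapt_preference_distribution(adaptation_model, demonstrations, cfg: PDOAConfig):
    """Minimal adaptation-loop skeleton: all demonstration samples form one batch,
    and cfg.prefs_per_update preferences are drawn for every gradient step.
    `sample_preferences` and `loss` are hypothetical interfaces, not the paper's API."""
    optimizer = torch.optim.Adam(adaptation_model.parameters(), lr=cfg.lr)
    for _ in range(cfg.num_updates):
        prefs = adaptation_model.sample_preferences(cfg.prefs_per_update)
        loss = adaptation_model.loss(prefs, demonstrations, eta=cfg.eta)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return adaptation_model
```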