Policy-conditioned Environment Models are More Generalizable
Authors: Ruifeng Chen, Xiong-Hui Chen, Yihao Sun, Siyuan Xiao, Minhui Li, Yang Yu
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments are conducted based on MuJoCo (Todorov et al., 2012). We first conducted a proof-of-concept experiment, utilizing our custom-made dataset, which verified the effectiveness of the policy-aware mechanism for improving model prediction accuracy. We then apply PCM to several downstream tasks. Results show that PCM improves the performance of off-policy evaluation on the DOPE benchmark by a large margin, and derives significantly better policies in offline policy selection and model predictive control compared with the standard model learning method. |
| Researcher Affiliation | Collaboration | (1) National Key Laboratory for Novel Software Technology, Nanjing University, China & School of Artificial Intelligence, Nanjing University, China; (2) Polixir Technologies. |
| Pseudocode | Yes | The pseudocode of PCM via policy representation is listed in Alg. 1 |
| Open Source Code | Yes | code: https://github.com/xionghuichen/policy-conditioned-model.git |
| Open Datasets | Yes | We evaluate these approaches on a variety of tasks from DOPE-D4RL and DOPE-RL-Unplugged benchmarks (Fu et al., 2021a), where the data in these tasks is collected by diverse policies. |
| Dataset Splits | No | The paper mentions collecting datasets and training models on them, but it does not provide specific details on training/test/validation dataset splits (e.g., percentages or sample counts). |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU models, memory) used to run its experiments. |
| Software Dependencies | No | The paper references PyTorch via the cited library ("OfflineRL-Kit: An elegant PyTorch offline reinforcement learning library") but does not provide specific version numbers for its software dependencies or libraries. |
| Experiment Setup | Yes | Table 3 (training hyperparameters of PAM and PCM): Batch size = 32 (batch size for gradient descent); Optimizer = Adam; Learning rate = 1e-4 (for gradient descent); Dropout rate = 0.1; λ = 0.01 (weight of the policy representation loss). A hedged configuration sketch is given below the table. |
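
As a reading aid, here is a minimal PyTorch sketch of how the Table 3 hyperparameters could be assembled into a training step. The network architecture, the dimension sizes, and the `policy_repr_loss` term are illustrative assumptions rather than the authors' implementation; only the numeric values (batch size 32, Adam, learning rate 1e-4, dropout 0.1, λ = 0.01) come from the reported table.

```python
# Illustrative sketch only: architecture and loss names are assumptions;
# the numeric hyperparameters are taken from Table 3 of the paper.
import torch
import torch.nn as nn

BATCH_SIZE = 32        # batch size for gradient descent (Table 3)
LEARNING_RATE = 1e-4   # Adam learning rate (Table 3)
DROPOUT_RATE = 0.1     # dropout rate (Table 3)
LAMBDA_REPR = 0.01     # weight of the policy representation loss (Table 3)


class DynamicsModel(nn.Module):
    """Hypothetical policy-conditioned dynamics model: predicts the next state
    from (state, action, policy embedding). All sizes are placeholders."""

    def __init__(self, state_dim=17, action_dim=6, embed_dim=16, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + embed_dim, hidden),
            nn.ReLU(),
            nn.Dropout(DROPOUT_RATE),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Dropout(DROPOUT_RATE),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action, policy_embed):
        return self.net(torch.cat([state, action, policy_embed], dim=-1))


model = DynamicsModel()
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)


def training_step(batch, policy_embed, policy_repr_loss):
    """One gradient step: dynamics prediction loss plus the policy
    representation loss weighted by lambda = 0.01 (as in Table 3)."""
    state, action, next_state = batch
    pred = model(state, action, policy_embed)
    dynamics_loss = nn.functional.mse_loss(pred, next_state)
    loss = dynamics_loss + LAMBDA_REPR * policy_repr_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Usage with random tensors, purely to show shapes and the update call:
state = torch.randn(BATCH_SIZE, 17)
action = torch.randn(BATCH_SIZE, 6)
next_state = torch.randn(BATCH_SIZE, 17)
policy_embed = torch.randn(BATCH_SIZE, 16)
repr_loss = torch.tensor(0.0)  # stand-in for the policy representation loss
training_step((state, action, next_state), policy_embed, repr_loss)
```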