Policy-conditioned Environment Models are More Generalizable

Authors: Ruifeng Chen, Xiong-Hui Chen, Yihao Sun, Siyuan Xiao, Minhui Li, Yang Yu

ICML 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Experiments are conducted based on MuJoCo (Todorov et al., 2012). We first conduct a proof-of-concept experiment on a custom-made dataset, which verifies the effectiveness of the policy-aware mechanism for improving model prediction accuracy. We then apply PCM to several downstream tasks. Results show that PCM improves the performance of off-policy evaluation on the DOPE benchmark by a large margin, and derives significantly better policies in offline policy selection and model predictive control than the standard model-learning method. |
| Researcher Affiliation | Collaboration | 1) National Key Laboratory for Novel Software Technology, Nanjing University, China & School of Artificial Intelligence, Nanjing University, China; 2) Polixir Technologies. |
| Pseudocode | Yes | The pseudocode of PCM via policy representation is listed in Alg. 1 (a hedged training-step sketch follows this table). |
| Open Source Code | Yes | Code: https://github.com/xionghuichen/policy-conditioned-model.git |
| Open Datasets | Yes | We evaluate these approaches on a variety of tasks from the DOPE-D4RL and DOPE-RL-Unplugged benchmarks (Fu et al., 2021a), where the data in these tasks is collected by diverse policies (a dataset-loading example follows this table). |
| Dataset Splits | No | The paper mentions collecting datasets and training models on them, but it does not provide specific details on training/test/validation dataset splits (e.g., percentages or sample counts). |
| Hardware Specification | No | The paper does not explicitly describe the hardware (e.g., GPU model, CPU model, memory) used to run its experiments. |
| Software Dependencies | No | The paper mentions PyTorch in a cited repository name ("OfflineRL-Kit: An elegant PyTorch offline reinforcement learning library") but does not provide version numbers for software dependencies or libraries. |
| Experiment Setup | Yes | Table 3 (training hyperparameters of PAM and PCM): batch size 32 (for gradient descent); optimizer Adam; learning rate 1e-4; dropout rate 0.1; λ = 0.01 (weight of the policy representation loss). These values are restated as a config sketch below. |
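
To accompany the Pseudocode row, the following is a minimal, hedged sketch of how a policy-conditioned model with a policy representation might be trained in PyTorch. The class names (PolicyEncoder, ConditionedDynamicsModel), the context-set encoding, and the exact form of the policy-representation loss are illustrative assumptions, not the authors' implementation in Alg. 1; only the optimizer, learning rate, batch size, dropout rate, and λ are taken from Table 3.

```python
# Hypothetical sketch of policy-conditioned model training (not the authors' code).
# A policy encoder maps a policy's context transitions to an embedding z; the
# dynamics model predicts the next state conditioned on (s, a, z).
import torch
import torch.nn as nn


class PolicyEncoder(nn.Module):  # hypothetical name
    def __init__(self, obs_dim, act_dim, z_dim=16, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Dropout(0.1),  # dropout rate from Table 3
            nn.Linear(hidden, z_dim),
        )

    def forward(self, states, actions):
        # Mean-pool per-transition features into one policy embedding.
        feats = self.net(torch.cat([states, actions], dim=-1))  # (B, K, z_dim)
        return feats.mean(dim=1)                                 # (B, z_dim)


class ConditionedDynamicsModel(nn.Module):  # hypothetical name
    def __init__(self, obs_dim, act_dim, z_dim=16, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim + z_dim, hidden), nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden, obs_dim),
        )

    def forward(self, s, a, z):
        return self.net(torch.cat([s, a, z], dim=-1))


def train_step(encoder, model, decoder, optimizer, batch, lam=0.01):
    """One gradient step; `batch` holds tensors drawn from a single policy's data."""
    # batch["context_s"], batch["context_a"]: (B, K, dim) context transitions
    # batch["s"], batch["a"], batch["s_next"]: (B, dim) prediction targets
    z = encoder(batch["context_s"], batch["context_a"])
    dyn_loss = nn.functional.mse_loss(model(batch["s"], batch["a"], z),
                                      batch["s_next"])
    # Auxiliary loss: the embedding should let a decoder reproduce the policy's
    # actions (one plausible choice of "policy representation loss"; assumption).
    rep_loss = nn.functional.mse_loss(
        decoder(torch.cat([batch["s"], z], dim=-1)), batch["a"])
    loss = dyn_loss + lam * rep_loss  # λ = 0.01 from Table 3
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

One possible setup, matching Table 3: `decoder = nn.Linear(obs_dim + 16, act_dim)` and `optimizer = torch.optim.Adam([*encoder.parameters(), *model.parameters(), *decoder.parameters()], lr=1e-4)`, with batches of 32 transitions per gradient step.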
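
For the Open Datasets row, the D4RL side of the DOPE benchmark can be loaded through the standard d4rl API. The sketch below only illustrates data access, not the DOPE evaluation protocol, and the specific task name is an example rather than a task taken from the paper.

```python
# Illustrative D4RL data loading (the DOPE policy set is distributed separately).
import gym
import d4rl  # registers the D4RL environments with gym

env = gym.make("halfcheetah-medium-replay-v2")  # example task name
dataset = d4rl.qlearning_dataset(env)           # dict of numpy arrays

# Keys: observations, actions, next_observations, rewards, terminals
print({k: v.shape for k, v in dataset.items()})
```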
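
The Table 3 hyperparameters quoted in the Experiment Setup row can also be written as a plain configuration dictionary; the values are as reported in the paper, while the key names are ours.

```python
# Training hyperparameters of PAM and PCM, as reported in Table 3 of the paper.
TRAIN_CONFIG = {
    "batch_size": 32,       # batch size for gradient descent
    "optimizer": "Adam",    # optimizer
    "learning_rate": 1e-4,  # learning rate for gradient descent
    "dropout_rate": 0.1,    # dropout rate
    "lambda": 0.01,         # weight of the policy representation loss
}
```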