Policy-conditioned Environment Models are More Generalizable

Authors: Ruifeng Chen, Xiong-Hui Chen, Yihao Sun, Siyuan Xiao, Minhui Li, Yang Yu

ICML 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Experiments are conducted based on MuJoCo (Todorov et al., 2012). We first conduct a proof-of-concept experiment on a custom-made dataset, which verifies the effectiveness of the policy-aware mechanism for improving model prediction accuracy. We then apply PCM to several downstream tasks. Results show that PCM improves the performance of off-policy evaluation on the DOPE benchmark by a large margin, and derives significantly better policies in offline policy selection and model predictive control than the standard model-learning method. |
| Researcher Affiliation | Collaboration | 1) National Key Laboratory for Novel Software Technology, Nanjing University, China & School of Artificial Intelligence, Nanjing University, China; 2) Polixir Technologies. |
| Pseudocode | Yes | The pseudocode of PCM via policy representation is listed in Alg. 1 (a hedged training-step sketch follows this table). |
| Open Source Code | Yes | Code: https://github.com/xionghuichen/policy-conditioned-model.git |
| Open Datasets | Yes | We evaluate these approaches on a variety of tasks from the DOPE-D4RL and DOPE-RL-Unplugged benchmarks (Fu et al., 2021a), where the data in these tasks is collected by diverse policies (a dataset-loading example follows this table). |
| Dataset Splits | No | The paper mentions collecting datasets and training models on them, but it does not provide specific details on training/test/validation dataset splits (e.g., percentages or sample counts). |
| Hardware Specification | No | The paper does not explicitly describe the hardware (e.g., GPU model, CPU model, memory) used to run its experiments. |
| Software Dependencies | No | The paper mentions PyTorch in a cited repository name ("OfflineRL-Kit: An elegant PyTorch offline reinforcement learning library") but does not provide version numbers for software dependencies or libraries. |
| Experiment Setup | Yes | Table 3 (training hyperparameters of PAM and PCM): batch size 32 (for gradient descent); optimizer Adam; learning rate 1e-4; dropout rate 0.1; λ = 0.01 (weight of the policy representation loss). These values are restated as a config sketch below. |
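
To accompany the Pseudocode row, the following is a minimal, hedged sketch of how a policy-conditioned model with a policy representation might be trained in PyTorch. The class names (PolicyEncoder, ConditionedDynamicsModel), the context-set encoding, and the exact form of the policy-representation loss are illustrative assumptions, not the authors' implementation in Alg. 1; only the optimizer, learning rate, batch size, dropout rate, and λ are taken from Table 3.

```python
# Hypothetical sketch of policy-conditioned model training (not the authors' code).
# A policy encoder maps a policy's context transitions to an embedding z; the
# dynamics model predicts the next state conditioned on (s, a, z).
import torch
import torch.nn as nn


class PolicyEncoder(nn.Module):  # hypothetical name
    def __init__(self, obs_dim, act_dim, z_dim=16, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Dropout(0.1),  # dropout rate from Table 3
            nn.Linear(hidden, z_dim),
        )

    def forward(self, states, actions):
        # Mean-pool per-transition features into one policy embedding.
        feats = self.net(torch.cat([states, actions], dim=-1))  # (B, K, z_dim)
        return feats.mean(dim=1)                                 # (B, z_dim)


class ConditionedDynamicsModel(nn.Module):  # hypothetical name
    def __init__(self, obs_dim, act_dim, z_dim=16, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim + z_dim, hidden), nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden, obs_dim),
        )

    def forward(self, s, a, z):
        return self.net(torch.cat([s, a, z], dim=-1))


def train_step(encoder, model, decoder, optimizer, batch, lam=0.01):
    """One gradient step; `batch` holds tensors drawn from a single policy's data."""
    # batch["context_s"], batch["context_a"]: (B, K, dim) context transitions
    # batch["s"], batch["a"], batch["s_next"]: (B, dim) prediction targets
    z = encoder(batch["context_s"], batch["context_a"])
    dyn_loss = nn.functional.mse_loss(model(batch["s"], batch["a"], z),
                                      batch["s_next"])
    # Auxiliary loss: the embedding should let a decoder reproduce the policy's
    # actions (one plausible choice of "policy representation loss"; assumption).
    rep_loss = nn.functional.mse_loss(
        decoder(torch.cat([batch["s"], z], dim=-1)), batch["a"])
    loss = dyn_loss + lam * rep_loss  # λ = 0.01 from Table 3
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

One possible setup, matching Table 3: `decoder = nn.Linear(obs_dim + 16, act_dim)` and `optimizer = torch.optim.Adam([*encoder.parameters(), *model.parameters(), *decoder.parameters()], lr=1e-4)`, with batches of 32 transitions per gradient step.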
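
For the Open Datasets row, the D4RL side of the DOPE benchmark can be loaded through the standard d4rl API. The sketch below only illustrates data access, not the DOPE evaluation protocol, and the specific task name is an example rather than a task taken from the paper.

```python
# Illustrative D4RL data loading (the DOPE policy set is distributed separately).
import gym
import d4rl  # registers the D4RL environments with gym

env = gym.make("halfcheetah-medium-replay-v2")  # example task name
dataset = d4rl.qlearning_dataset(env)           # dict of numpy arrays

# Keys: observations, actions, next_observations, rewards, terminals
print({k: v.shape for k, v in dataset.items()})
```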
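
The Table 3 hyperparameters quoted in the Experiment Setup row can also be written as a plain configuration dictionary; the values are as reported in the paper, while the key names are ours.

```python
# Training hyperparameters of PAM and PCM, as reported in Table 3 of the paper.
TRAIN_CONFIG = {
    "batch_size": 32,       # batch size for gradient descent
    "optimizer": "Adam",    # optimizer
    "learning_rate": 1e-4,  # learning rate for gradient descent
    "dropout_rate": 0.1,    # dropout rate
    "lambda": 0.01,         # weight of the policy representation loss
}
```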