Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Policy-conditioned Environment Models are More Generalizable
Authors: Ruifeng Chen, Xiong-Hui Chen, Yihao Sun, Siyuan Xiao, Minhui Li, Yang Yu
ICML 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments are conducted based on MuJoCo (Todorov et al., 2012). We first conducted a proof-of-concept experiment, utilizing our custom-made dataset, which verified the effectiveness of the policy-aware mechanism for improving the model prediction accuracy. Then apply PCM in several downstream tasks. Results show that PCM improves the performance of off-policy evaluation in the DOPE benchmark with a large margin, and derives significantly better policies in offline policy selection and model predictive control compared with the standard model learning method. |
| Researcher Affiliation | Collaboration | (1) National Key Laboratory for Novel Software Technology, Nanjing University, China & School of Artificial Intelligence, Nanjing University, China; (2) Polixir Technologies. |
| Pseudocode | Yes | The pseudocode of PCM via policy representation is listed in Alg. 1 |
| Open Source Code | Yes | code: https://github.com/xionghuichen/policy-conditioned-model.git |
| Open Datasets | Yes | We evaluate these approaches on a variety of tasks from DOPE-D4RL and DOPE-RL-Unplugged benchmarks (Fu et al., 2021a), where the data in these tasks is collected by diverse policies. |
| Dataset Splits | No | The paper mentions collecting datasets and training models on them, but it does not provide specific details on training/test/validation dataset splits (e.g., percentages or sample counts). |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU models, memory) used to run its experiments. |
| Software Dependencies | No | The paper mentions using specific frameworks like 'Pytorch' in the repository name (Offlinerl-kit: An elegant pytorch offline reinforcement learning library) but does not provide specific version numbers for software dependencies or libraries. |
| Experiment Setup | Yes | Table 3: Training hyperparameters of PAM and PCM. Batch size: 32 (batch size for gradient descent); Optimizer: Adam; Learning rate: 1e-4 (learning rate for gradient descent); Dropout rate: 0.1; λ: 0.01 (weight of policy representation loss). |
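
The hyperparameters quoted from Table 3 in the Experiment Setup row can be collected into a small configuration object. The following is a minimal sketch, not the authors' code: the class name `PCMTrainConfig`, the field names, and the helper `combined_loss` are hypothetical, and only the values come from the table. In particular, the sketch assumes that λ enters the objective as a simple additive weight on the policy representation loss, which the quoted table does not state.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class PCMTrainConfig:
    """Hypothetical container; the default values are those quoted from Table 3."""
    batch_size: int = 32                    # batch size for gradient descent
    optimizer: str = "Adam"                 # optimizer name
    learning_rate: float = 1e-4             # learning rate for gradient descent
    dropout_rate: float = 0.1               # dropout rate
    policy_repr_loss_weight: float = 0.01   # lambda: weight of policy representation loss


def combined_loss(model_loss: float, policy_repr_loss: float,
                  cfg: PCMTrainConfig) -> float:
    """Assumed form of the objective: model loss plus lambda times the
    policy representation loss. This additive form is an assumption."""
    return model_loss + cfg.policy_repr_loss_weight * policy_repr_loss


if __name__ == "__main__":
    cfg = PCMTrainConfig()
    # Print the configuration as a plain dictionary for logging.
    print(asdict(cfg))
    # Example with made-up loss values, only to show how lambda scales the second term.
    print(combined_loss(1.0, 2.0, cfg))
```

Keeping the values in one frozen dataclass makes it harder to change a setting accidentally during a run, and `asdict` gives a plain dictionary that can be written alongside experiment results.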