Offline Transition Modeling via Contrastive Energy Learning
Authors: Ruifeng Chen, Chengxing Jia, Zefang Huang, Tian-Shuo Liu, Xu-Hui Liu, Yang Yu
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we conduct a series of experiments to answer the following questions: (1). Does ETM better recover the discontinuous transition behaviors than standard FTMs? (2). Does ETM have a smaller transition error on out-of-distribution transitions? (3). Can ETM facilitate sequential decision-making tasks like off-policy evaluation and offline RL? |
| Researcher Affiliation | Collaboration | ¹National Key Laboratory for Novel Software Technology, Nanjing University, China & School of Artificial Intelligence, Nanjing University, China; ²Polixir Technologies. |
| Pseudocode | Yes | Algorithm 1 Energy-based Transition Model Learning |
| Open Source Code | Yes | Code: https://github.com/Ruifeng-Chen/Energy-Transition-Models.git |
| Open Datasets | Yes | We also conduct experiments on D4RL benchmarks (Fu et al., 2020), where the improvement of model accuracy boosts the performance of policy optimization. |
| Dataset Splits | No | No explicit statement providing specific percentages, sample counts, or clear predefined split references for training, validation, and test datasets was found for their experiments. |
| Hardware Specification | No | No specific hardware details (GPU/CPU models, memory, or cloud instance types) used for running the experiments were provided in the paper. |
| Software Dependencies | No | The paper mentions software like 'Offline RLKit' and 'Soft Actor Critic (SAC)' and that the implementation is based on 'Pytorch', but no specific version numbers for any software dependencies are provided. |
| Experiment Setup | Yes | The detailed hyperparameter settings are listed in Appendix C: Table 2 lists the hyperparameters, Table 3 gives the base settings for all the Gym-Mujoco tasks, and two hyperparameters, the penalty coefficient β and the rollout length h, are tuned per task and listed in Table 4. |
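To make the method behind the "Energy-based Transition Model Learning" row concrete, below is a minimal PyTorch sketch of a transition model that scores triples (s, a, s') with a scalar energy and is trained with an InfoNCE-style contrastive loss, where negatives are formed by shuffling next-states within the batch. The network sizes, the negative sampler, and all names here are illustrative assumptions, not the authors' exact Algorithm 1; consult the released code for the real implementation.

```python
# Hypothetical sketch: energy-based transition model with a contrastive loss.
# Lower energy E(s, a, s') should mean a more plausible transition.
import torch
import torch.nn as nn

class EnergyTransitionModel(nn.Module):
    """Scalar energy E_theta(s, a, s') over transition triples."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a, s_next):
        # Concatenate (s, a, s') and return one energy per row: shape (batch,)
        return self.net(torch.cat([s, a, s_next], dim=-1)).squeeze(-1)

def contrastive_loss(model, s, a, s_next, num_neg: int = 8):
    """InfoNCE-style objective: the observed s' must out-score negatives
    obtained by permuting next-states within the batch (an assumed sampler)."""
    b = s.shape[0]
    pos = -model(s, a, s_next)                   # logit for the true transition
    negs = []
    for _ in range(num_neg):
        perm = torch.randperm(b)
        negs.append(-model(s, a, s_next[perm]))  # logits for mismatched transitions
    logits = torch.stack([pos] + negs, dim=-1)   # (batch, 1 + num_neg)
    labels = torch.zeros(b, dtype=torch.long)    # true transition sits at index 0
    return nn.functional.cross_entropy(logits, labels)

if __name__ == "__main__":
    torch.manual_seed(0)
    s, a, s_next = torch.randn(32, 4), torch.randn(32, 2), torch.randn(32, 4)
    model = EnergyTransitionModel(state_dim=4, action_dim=2)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss = contrastive_loss(model, s, a, s_next)
    opt.zero_grad(); loss.backward(); opt.step()
    print(f"contrastive loss: {loss.item():.4f}")
```

Because the model outputs an unnormalized energy rather than a Gaussian next-state prediction, it can in principle represent the discontinuous transition behaviors the paper's first experimental question targets; a standard forward transition model would have to smooth over such discontinuities.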