Learning Pseudometric-based Action Representations for Offline Reinforcement Learning

Authors: Pengjie Gu, Mengchen Zhao, Chen Chen, Dong Li, Jianye Hao, Bo An

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The paper states: "Experimental results show that our methods significantly improve the performance of two typical offline RL methods in environments with large and discrete action spaces." "Experimental results on two simulated tasks and two real-world applications show that policies trained under the MERLION framework significantly outperform those trained using existing baselines in both offline and online settings." "In this section, we empirically show that MERLION could be used as a drop-in extension for improving the policy performance in existing offline RL algorithms for problems with large discrete action spaces."
Researcher Affiliation | Collaboration | (1) School of Computer Science and Engineering, Nanyang Technological University, Singapore; (2) Noah's Ark Lab, Huawei; (3) College of Intelligence and Computing, Tianjin University.
Pseudocode | Yes | The paper states: "Its scheme is also described by Fig. 1(a) and Alg. 1 in the appendix." "Further details of the action encoder and all models are described in Alg. 2 in the appendix." (Appendix A.2: Algorithm 1, Train policy; Algorithm 2, Pseudometric-based representation learning)
Open Source Code | No | The paper states: "For the applied offline RL algorithms: BCQ (Fujimoto et al., 2019b), CQL (Kumar et al., 2020), and their discrete versions, we all adopt their open-source implementations released by the authors." (Appendix A.3). This refers to the open-source code of *other* authors for the baseline algorithms, not the open-source code for MERLION, the method proposed in this paper.
Open Datasets | No | The paper states: "Since there are no open-source datasets for the offline RL tasks with large discrete action spaces, we collected logged experience trajectories generated from online RL policies." (Appendix A.4). While the authors utilize open-source platforms for environment simulation, the specific logged experience trajectories (the dataset) used for their experiments were collected by the authors and are not explicitly stated as publicly available, nor is a link or citation provided for them.
Dataset Splits | No | The paper mentions collecting "100000 pieces of transition data in each environment" (Appendix A.4) and describes training parameters such as batch size and learning steps (Appendix A.3). However, it does not provide specific details on how this collected dataset is split into training, validation, and test sets (e.g., percentages or sample counts).
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as CPU or GPU models, memory specifications, or cloud computing instances.
Software Dependencies | No | The paper mentions using "the PyTorch python package" and the "Adam" optimizer (Appendix A.3). However, it does not provide specific version numbers for PyTorch or any other software libraries or frameworks, which are necessary for reproducibility.
Experiment Setup | Yes | The paper states: "We set the batch size as 128 and set the training gradient steps for all models as 10000. It is conducted using Adam with a learning rate of 10^-2, and with no momentum or weight decay. We set the dimension of the action representations |E| = 2, 2, 10, 30 and the penalty coefficient p = 0.01, 0.1, 0.1, 0.3 in the maze environment, the multi-step maze environment, the recommender system, and the dialogue system, respectively. For the discrete CQL, we have searched over the crucial hyperparameter α = {0.1, 0.3, 0.5, 0.7, 0.9}, which determines the extent of conservative estimation of value functions. The best settings for the four environments (Maze, Multi-step maze, Recommendation system, Dialogue system) are 0.5, 0.5, 0.5, 0.3, respectively. For the discrete BCQ, we have searched over {0.1, 0.3, 0.5, 0.7, 0.9} for the threshold hyperparameter τ, which determines the range of the candidate actions. The best settings for the four environments (Maze, Multi-step maze, Recommendation system, Dialogue system) are 0.3, 0.3, 0.3, 0.3, respectively."
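
For concreteness, the reported settings above can be gathered into a small configuration sketch. This is a hypothetical illustration rather than the authors' released code: the names ENV_CONFIGS and make_optimizer and the placeholder encoder module are assumptions introduced here; only the numeric values (batch size, gradient steps, learning rate, |E|, p, α, τ) are taken from the excerpt above.

```python
# Hypothetical configuration sketch; the numeric values are transcribed from the
# quoted experiment setup, while the names and the placeholder model are
# illustrative assumptions, not the authors' code.
import torch

ENV_CONFIGS = {
    # env: representation dim |E|, penalty coefficient p,
    #      best CQL alpha, best BCQ threshold tau
    "maze":            {"repr_dim": 2,  "penalty_p": 0.01, "cql_alpha": 0.5, "bcq_tau": 0.3},
    "multi_step_maze": {"repr_dim": 2,  "penalty_p": 0.1,  "cql_alpha": 0.5, "bcq_tau": 0.3},
    "recommender":     {"repr_dim": 10, "penalty_p": 0.1,  "cql_alpha": 0.5, "bcq_tau": 0.3},
    "dialogue":        {"repr_dim": 30, "penalty_p": 0.3,  "cql_alpha": 0.3, "bcq_tau": 0.3},
}

BATCH_SIZE = 128          # "batch size as 128"
GRADIENT_STEPS = 10_000   # "training gradient steps for all models as 10000"


def make_optimizer(params):
    """Adam with lr = 1e-2 and no weight decay, as reported in Appendix A.3."""
    return torch.optim.Adam(params, lr=1e-2, weight_decay=0.0)


if __name__ == "__main__":
    cfg = ENV_CONFIGS["maze"]
    # Stand-in for the (unreleased) action encoder: any module mapping raw
    # action features to an |E|-dimensional representation would fit here.
    encoder = torch.nn.Linear(16, cfg["repr_dim"])
    optimizer = make_optimizer(encoder.parameters())
    print(cfg, BATCH_SIZE, GRADIENT_STEPS, optimizer)
```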