Behavior Prior Representation learning for Offline Reinforcement Learning

Authors: Hongyu Zang, Xin Li, Jie Yu, Chen Liu, Riashat Islam, Remi Tachet des Combes, Romain Laroche

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we show that BPR combined with existing state-of-the-art Offline RL algorithms leads to significant improvements across several offline control benchmarks. The code is available at https://github.com/bit1029public/offline_bpr.
Researcher Affiliation | Collaboration | Hongyu Zang¹, Xin Li¹, Jie Yu¹, Chen Liu¹, Riashat Islam², Rémi Tachet des Combes, Romain Laroche. ¹Beijing Institute of Technology, China; ²Mila, Quebec AI Institute, Canada. {zanghyu,xinli,yujie,chenliu}@bit.edu.cn, riashat.islam@mail.mcgill.ca, {remi.tachet,romain.laroche}@gmail.com. Work done while at Microsoft Research Montreal.
Pseudocode | Yes | We also provide the pseudocode of the pretraining process and the co-training process in Algorithm 1 and 2 in Appendix. (An illustrative sketch of such a pretraining loop appears after the table.)
Open Source Code | Yes | The code is available at https://github.com/bit1029public/offline_bpr.
Open Datasets | Yes | We analyze our proposed method BPR on the D4RL benchmark (Fu et al., 2020) of OpenAI Gym MuJoCo tasks (Todorov et al., 2012), which includes a variety of datasets that have been commonly used in the Offline RL community. (An example of loading a D4RL dataset appears after the table.)
Dataset Splits | No | The paper does not provide specific details on how the datasets (e.g., D4RL) are explicitly split into training, validation, and test sets by the authors themselves. It mentions evaluating models periodically during training, but this is not an explicit dataset split for validation purposes with specified percentages or sample counts.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as particular GPU models, CPU specifications, or memory sizes. It only mentions general terms like 'code platforms' (implying computational resources).
Software Dependencies | No | The paper mentions software platforms like 'PyTorch and TensorFlow' but does not specify their version numbers, which are needed for reproducible software dependency information.
Experiment Setup | Yes | We first pretrain the encoder during 100k timesteps... Further details on the experiment setup are included in Appendix G. In the experiment on D4RL tasks, all representation objectives use the same encoder architecture, i.e., a 4-layer MLP activated by ReLU, followed by another linear layer activated by Tanh, where the final output feature dimension of the encoder is 256. Besides, all representation objectives follow the same optimizer settings, pre-training data, and the number of pre-training epochs. (A sketch of this encoder appears below.)
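As context for the D4RL benchmark referenced in the Open Datasets row, offline datasets are typically loaded as shown below. The task name is chosen purely for illustration and this snippet is not taken from the paper or its released code.

```python
# Loading a D4RL offline dataset (the task name is only an example).
import gym
import d4rl  # importing d4rl registers the offline environments with gym

env = gym.make("halfcheetah-medium-v2")
data = d4rl.qlearning_dataset(env)  # dict of numpy arrays

print(data["observations"].shape)   # (N, obs_dim)
print(data["actions"].shape)        # (N, act_dim)
print(data["rewards"].shape)        # (N,)
print(data["terminals"].shape)      # (N,)
```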
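The encoder described in the Experiment Setup row (a 4-layer ReLU MLP followed by a Tanh-activated linear layer producing 256-dimensional features) could be sketched as follows. The hidden width (256 here) is an assumption; the quoted setup does not state it.

```python
# Sketch of the described encoder: 4 ReLU-activated linear layers, then a
# Tanh-activated linear layer with a 256-dimensional output feature.
# hidden_dim=256 is an assumed value, not stated in the quoted text.
import torch.nn as nn


def make_encoder(obs_dim: int, hidden_dim: int = 256, feature_dim: int = 256) -> nn.Module:
    return nn.Sequential(
        nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
        nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        nn.Linear(hidden_dim, feature_dim), nn.Tanh(),
    )
```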
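For the Pseudocode row, the following is a minimal sketch of what a BPR-style pretraining loop could look like, assuming a behavior-cloning objective in which a decoder regresses the dataset action from the encoded state. The names (`pretrain_bpr`, `encoder`, `action_decoder`) are illustrative and not taken from the authors' Algorithm 1; the co-training phase with the downstream offline RL algorithm (Algorithm 2) is not shown.

```python
# Hypothetical sketch of BPR-style encoder pretraining; not the authors' code.
import torch
import torch.nn as nn


def pretrain_bpr(encoder, action_decoder, dataloader, steps=100_000, lr=3e-4,
                 device="cpu"):
    """Pretrain a state encoder by predicting the behavior policy's actions."""
    params = list(encoder.parameters()) + list(action_decoder.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    it = iter(dataloader)
    for _ in range(steps):
        try:
            states, actions = next(it)
        except StopIteration:
            it = iter(dataloader)
            states, actions = next(it)
        states, actions = states.to(device), actions.to(device)
        # Behavior-prior objective: regress the dataset action from the encoded state.
        pred_actions = action_decoder(encoder(states))
        loss = nn.functional.mse_loss(pred_actions, actions)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return encoder
```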