Model-Based Transfer Learning for Contextual Reinforcement Learning

Authors: Jung-Hoon Cho, Vindula Jayawardana, Sirui Li, Cathy Wu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experimentally validate our methods using urban traffic and standard continuous control benchmarks. The experimental results suggest that MBTL can achieve up to 43x improved sample efficiency compared with canonical independent training and multi-task training.
Researcher Affiliation | Academia | Jung-Hoon Cho, MIT (jhooncho@mit.edu); Vindula Jayawardana, MIT (vindula@mit.edu); Sirui Li, MIT (siruil@mit.edu); Cathy Wu, MIT (cathywu@mit.edu)
Pseudocode | Yes | Appendix A.2: Model-Based Transfer Learning (MBTL) Algorithm. (A hedged sketch of this loop appears after the table.)
Open Source Code | Yes | Code is available at https://github.com/jhoon-cho/MBTL/.
Open Datasets | Yes | Our experiments consider CMDPs that span standard and real-world benchmarks. In particular, we consider standard continuous control benchmarks from the CARL library [5]. In addition, we study problems from RL for intelligent transportation systems, using [49] to model the CMDPs. ... We used the microscopic traffic simulation called Simulation of Urban MObility (SUMO) [26] v.1.16.0 ... License: CARL falls under the Apache License 2.0 as is permitted by all work that we use [5].
Dataset Splits | No | The paper specifies training on K source tasks and evaluating on N target tasks. For example: 'We evaluate our method by the average performance across all N target tasks after training up to K = 15 source tasks or the number of source tasks needed to achieve a certain level of performance.' However, it does not specify a separate validation split (e.g., percentages or counts for a held-out validation set) distinct from the training and evaluation tasks.
Hardware Specification | Yes | All experiments are done on a distributed computing cluster equipped with 48 Intel Xeon Platinum 8260 CPUs.
Software Dependencies | Yes | We used the microscopic traffic simulation called Simulation of Urban MObility (SUMO) [26] v.1.16.0 and PPO for RL algorithm [36]. We utilized the default implementation of the PPO algorithm with default hyperparameters provided by the Stable-Baselines3 library [34]. (A minimal sketch of this setup follows the table.)
Experiment Setup | Yes | We utilize Deep Q-Networks (DQN) for discrete action spaces [29] and Proximal Policy Optimization (PPO) for continuous action spaces [36]. For statistical reliability, we run each experiment three times with different random seeds. We employ min-max normalization of the rewards for each task, and we provide comprehensive details about our model in Appendix A.4.1. ... We used the Gaussian Process Regressor implementation from scikit-learn... We vary the GP hyperparameters, including noise standard deviation over the set {0.001, 0.01, 0.1, 1}, the number of restarts for the optimizer over {5, 6, ..., 15}, and explore several kernel configurations on the synthetic data. (A hedged sketch of this GP sweep follows the table.)
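
The Pseudocode row points to the MBTL algorithm in Appendix A.2. The following is a minimal sketch of that source-task selection loop, not the paper's exact procedure: it assumes a 1-D normalized context space, a generalization gap that grows linearly with context distance at a fixed slope `gap_slope`, a GP posterior mean in place of the paper's full Bayesian-optimization acquisition, and a hypothetical `train_and_evaluate` callable standing in for RL training on one source context.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel


def mbtl_select_sources(contexts, train_and_evaluate, K=15, gap_slope=1.0):
    """Greedily pick K source tasks from a 1-D context array (K < len(contexts))."""
    gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
    trained_idx, perfs = [], []
    idx = len(contexts) // 2  # arbitrary first source task (middle of the range)

    for _ in range(K):
        # Train a policy on the chosen source context and record its
        # normalized training performance (hypothetical callable).
        trained_idx.append(idx)
        perfs.append(train_and_evaluate(contexts[idx]))
        xs, ys = contexts[trained_idx], np.array(perfs)

        # Current estimated generalization performance on each target task:
        # best trained source minus a gap assumed to grow linearly with
        # context distance.
        current = np.max(
            ys[None, :] - gap_slope * np.abs(contexts[:, None] - xs[None, :]), axis=1
        )

        # GP posterior mean predicts training performance at untrained sources.
        gp.fit(xs.reshape(-1, 1), ys)
        mu = gp.predict(contexts.reshape(-1, 1))

        # Simplified acquisition: total predicted improvement across all
        # target tasks if a candidate context were trained next.
        gains = np.array([
            np.maximum(mu[j] - gap_slope * np.abs(contexts - contexts[j]) - current, 0.0).sum()
            for j in range(len(contexts))
        ])
        gains[trained_idx] = -np.inf  # never re-select an already-trained task
        idx = int(np.argmax(gains))

    return contexts[trained_idx]
```

A call such as `mbtl_select_sources(np.linspace(0, 1, 50), lambda c: 1.0 - (c - 0.3) ** 2, K=15)` illustrates the intended usage with a toy performance function in place of actual RL training.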
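The Software Dependencies row reports PPO trained with the Stable-Baselines3 defaults. The snippet below is a minimal sketch of that setup; the environment name and timestep budget are placeholders rather than values from the paper, and the SUMO v1.16.0 traffic environments are not reproduced here.

```python
import gymnasium as gym
from stable_baselines3 import PPO

# Standard continuous-control task as a placeholder environment.
env = gym.make("Pendulum-v1")

# Default SB3 PPO hyperparameters (e.g., learning_rate=3e-4, n_steps=2048),
# matching the reported "default implementation ... with default hyperparameters".
model = PPO("MlpPolicy", env, verbose=1, seed=0)
model.learn(total_timesteps=100_000)  # timestep budget is a placeholder
model.save("ppo_pendulum")
```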
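The Experiment Setup row describes a scikit-learn GaussianProcessRegressor sweep over noise standard deviations, optimizer restarts, and kernel configurations. Below is a hedged sketch of such a sweep; the synthetic data, the specific kernel families, and the model-selection criterion (log marginal likelihood) are assumptions, not details taken from the paper.

```python
from itertools import product

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, RationalQuadratic

# Placeholder synthetic data standing in for the paper's synthetic experiments.
rng = np.random.default_rng(0)
X = np.linspace(0, 1, 30).reshape(-1, 1)
y = np.sin(4 * X).ravel() + 0.05 * rng.standard_normal(30)

kernels = [RBF(), Matern(nu=2.5), RationalQuadratic()]  # assumed kernel families
noise_stds = [0.001, 0.01, 0.1, 1]                      # reported noise std grid
restarts = range(5, 16)                                 # reported restart grid

best = None
for kernel, noise_std, n_restarts in product(kernels, noise_stds, restarts):
    gp = GaussianProcessRegressor(
        kernel=kernel,
        alpha=noise_std**2,  # scikit-learn adds noise *variance* to the kernel diagonal
        n_restarts_optimizer=n_restarts,
        normalize_y=True,
    )
    gp.fit(X, y)
    score = gp.log_marginal_likelihood_value_  # assumed selection criterion
    if best is None or score > best[0]:
        best = (score, kernel, noise_std, n_restarts)

print("best configuration:", best[1:], "log-marginal-likelihood:", best[0])
```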