Model-Based Transfer Learning for Contextual Reinforcement Learning
Authors: Jung-Hoon Cho, Vindula Jayawardana, Sirui Li, Cathy Wu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally validate our methods using urban traffic and standard continuous control benchmarks. The experimental results suggest that MBTL can achieve up to 43x improved sample efficiency compared with canonical independent training and multi-task training. |
| Researcher Affiliation | Academia | Jung-Hoon Cho (MIT, jhooncho@mit.edu); Vindula Jayawardana (MIT, vindula@mit.edu); Sirui Li (MIT, siruil@mit.edu); Cathy Wu (MIT, cathywu@mit.edu) |
| Pseudocode | Yes | A.2 Model-Based Transfer Learning (MBTL) Algorithm |
| Open Source Code | Yes | Code is available at https://github.com/jhoon-cho/MBTL/. |
| Open Datasets | Yes | Our experiments consider CMDPs that span standard and real-world benchmarks. In particular, we consider standard continuous control benchmarks from the CARL library [5]. In addition, we study problems from RL for intelligent transportation systems, using [49] to model the CMDPs. ... We used the microscopic traffic simulation called Simulation of Urban MObility (SUMO) [26] v.1.16.0 ... License: CARL falls under the Apache License 2.0 as is permitted by all work that we use [5]. |
| Dataset Splits | No | The paper specifies training on K source tasks and evaluating on N target tasks. For example, 'We evaluate our method by the average performance across all N target tasks after training up to K = 15 source tasks or the number of source tasks needed to achieve a certain level of performance.' However, it does not report a separate validation split (e.g., percentages or counts for a held-out validation set) distinct from the training and evaluation tasks. |
| Hardware Specification | Yes | All experiments are done on a distributed computing cluster equipped with 48 Intel Xeon Platinum 8260 CPUs. |
| Software Dependencies | Yes | We used the microscopic traffic simulation called Simulation of Urban MObility (SUMO) [26] v.1.16.0 and PPO for RL algorithm [36]. We utilized the default implementation of the PPO algorithm with default hyperparameters provided by the Stable-Baselines3 library [34]. |
| Experiment Setup | Yes | We utilize Deep Q-Networks (DQN) for discrete action spaces [29] and Proximal Policy Optimization (PPO) for continuous action spaces [36]. For statistical reliability, we run each experiment three times with different random seeds. We employ min-max normalization of the rewards for each task, and we provide comprehensive details about our model in Appendix A.4.1. ... We used the Gaussian Process Regressor implementation from scikit-learn... We vary the GP hyperparameters, including noise standard deviation over the set {0.001, 0.01, 0.1, 1}, the number of restarts for the optimizer over {5, 6, . . . , 15}, and explore several kernel configurations on the synthetic data. |
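
The Software Dependencies row above reports that PPO was run with the default hyperparameters provided by Stable-Baselines3. The sketch below illustrates such a run; the environment name is a stand-in assumption, since the paper's tasks come from the CARL benchmarks and SUMO-based traffic simulation rather than Pendulum-v1, and the training budget is illustrative.

```python
# Minimal sketch: PPO with Stable-Baselines3 default hyperparameters.
# "Pendulum-v1" is a placeholder for a CARL / SUMO source task.
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("Pendulum-v1")           # stand-in continuous-control task
model = PPO("MlpPolicy", env, seed=0)   # SB3 defaults, one of three seeds
model.learn(total_timesteps=100_000)    # illustrative training budget
model.save("ppo_source_task")           # source policy to transfer to target tasks
```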
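The Experiment Setup row above describes a sweep over Gaussian Process hyperparameters using scikit-learn's GaussianProcessRegressor: noise standard deviation in {0.001, 0.01, 0.1, 1}, optimizer restarts in {5, ..., 15}, and several kernels, evaluated on synthetic data. Below is a minimal sketch of such a sweep; the synthetic data, the specific kernels, and the scoring by log marginal likelihood are assumptions for illustration, not the authors' exact configuration.

```python
# Minimal sketch of the reported GP hyperparameter sweep (scikit-learn).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(15, 1))                 # e.g., K = 15 source contexts
y = np.sin(4.0 * X).ravel() + 0.05 * rng.standard_normal(15)

for noise_std in [0.001, 0.01, 0.1, 1.0]:               # noise standard deviation grid
    for n_restarts in range(5, 16):                     # optimizer restarts 5..15
        for kernel in [RBF(length_scale=0.2),
                       Matern(length_scale=0.2, nu=2.5)]:
            gp = GaussianProcessRegressor(
                kernel=kernel,
                alpha=noise_std ** 2,                   # alpha is the noise variance
                n_restarts_optimizer=n_restarts,
            )
            gp.fit(X, y)
            # Score each configuration, here via the fitted log marginal likelihood.
            print(noise_std, n_restarts, kernel, gp.log_marginal_likelihood_value_)
```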