An Imitation from Observation Approach to Transfer Learning with Dynamics Mismatch
Authors: Siddharth Desai, Ishan Durugkar, Haresh Karnan, Garrett Warnell, Josiah Hanna, Peter Stone
NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate our hypothesis we derive a new algorithm, generative adversarial reinforced action transformation (GARAT), based on adversarial imitation from observation techniques. We run experiments in several domains with mismatched dynamics, and find that agents trained with GARAT achieve higher returns in the target environment compared to existing black-box transfer methods. (A hedged sketch of this adversarial setup appears after the table.) |
| Researcher Affiliation | Collaboration | Siddharth Desai, Department of Mechanical Engineering, The University of Texas at Austin (sidrdesai@utexas.edu); Ishan Durugkar, Department of Computer Science, The University of Texas at Austin (ishand@cs.utexas.edu); Haresh Karnan, Department of Mechanical Engineering, The University of Texas at Austin (haresh.miriyala@utexas.edu); Garrett Warnell, Army Research Laboratory (garrett.a.warnell.civ@mail.mil); Josiah P. Hanna, School of Informatics, The University of Edinburgh (josiah.hanna@ed.ac.uk); Peter Stone, Department of Computer Science, The University of Texas at Austin and Sony AI (pstone@cs.utexas.edu) |
| Pseudocode | Yes | Algorithm 1 lays out its details. |
| Open Source Code | No | The paper mentions using implementations from the 'stable-baselines library [17]' for TRPO and PPO, but does not state that its own code for GARAT or its experiments is publicly available. |
| Open Datasets | Yes | We validate GARAT for transfer by transferring the agent policy between OpenAI Gym [7] simulated environments with different transition dynamics. For various MuJoCo [47] environments... Apart from the MuJoCo simulator, we also show successful transfer in the PyBullet simulator [9] using the Ant domain. |
| Dataset Splits | No | The paper describes training policies and evaluating them in different environments or across a number of episodes, but it does not specify explicit training/validation/test dataset splits with percentages or sample counts in the way typically required for static datasets. |
| Hardware Specification | No | The paper does not specify the hardware used for running the experiments, such as specific CPU or GPU models, or details about computational resources. |
| Software Dependencies | No | The paper states 'We use the implementations of TRPO and PPO provided in the stable-baselines library [17].' but does not specify a version number for stable-baselines or any other software dependency. (A hedged usage sketch follows the table.) |
| Experiment Setup | Yes | The specific hyperparameters used are provided in Appendix C. |
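
For the Research Type row above, the following is a minimal, hedged sketch of the adversarial imitation-from-observation setup GARAT builds on: an action-transformer policy is rewarded for making grounded source-environment transitions (s, s') indistinguishable from target-environment transitions. The network sizes, binary cross-entropy objective, and REINFORCE-style surrogate are illustrative assumptions on my part; the paper's Algorithm 1 uses TRPO and its own loss details.

```python
# Hedged sketch only: layer sizes, the BCE objective, and the REINFORCE-style
# surrogate are illustrative assumptions, not the paper's exact procedure.
import torch
import torch.nn as nn

class ActionTransformer(nn.Module):
    """pi_g(a_grounded | s, a): Gaussian policy over grounded actions."""
    def __init__(self, s_dim, a_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim + a_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, a_dim))
        self.log_std = nn.Parameter(torch.zeros(a_dim))

    def dist(self, s, a):
        mean = self.net(torch.cat([s, a], dim=-1))
        return torch.distributions.Normal(mean, self.log_std.exp())

class TransitionDiscriminator(nn.Module):
    """D(s, s'): logit that a state transition came from the target env."""
    def __init__(self, s_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * s_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 1))

    def forward(self, s, s_next):
        return self.net(torch.cat([s, s_next], dim=-1))

def adversarial_step(tfm, disc, tfm_opt, d_opt,
                     src_s, src_a, src_a_grounded, src_s_next,
                     tgt_s, tgt_s_next):
    """One GAN-style update. src_* comes from rolling out the grounded source
    env (src_s_next results from executing src_a_grounded); tgt_* comes from
    a small amount of target-environment experience."""
    bce = nn.BCEWithLogitsLoss()
    # Discriminator: label target transitions 1, grounded-source transitions 0.
    d_loss = (bce(disc(tgt_s, tgt_s_next), torch.ones(len(tgt_s), 1)) +
              bce(disc(src_s, src_s_next), torch.zeros(len(src_s), 1)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()
    # Transformer: imitation-from-observation reward = fooling the discriminator.
    with torch.no_grad():
        reward = torch.sigmoid(disc(src_s, src_s_next)).squeeze(-1)
    log_prob = tfm.dist(src_s, src_a).log_prob(src_a_grounded).sum(-1)
    tfm_loss = -(log_prob * reward).mean()
    tfm_opt.zero_grad(); tfm_loss.backward(); tfm_opt.step()

# Smoke test with random tensors standing in for real rollout data.
tfm, disc = ActionTransformer(4, 2), TransitionDiscriminator(4)
adversarial_step(tfm, disc,
                 torch.optim.Adam(tfm.parameters(), lr=3e-4),
                 torch.optim.Adam(disc.parameters(), lr=3e-4),
                 torch.randn(32, 4), torch.randn(32, 2), torch.randn(32, 2),
                 torch.randn(32, 4), torch.randn(32, 4), torch.randn(32, 4))
```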
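For the Software Dependencies row: the paper cites stable-baselines for TRPO and PPO but gives no versions, so any reconstruction involves guesswork. Below is a minimal sketch of the kind of train-in-source, evaluate-in-target loop involved, using the stable-baselines 2.x (TensorFlow 1 era) API the citation implies. The environment ID, the mass perturbation used to create a dynamics mismatch, and the timestep budget are my assumptions, not values from the paper.

```python
# Illustrative only: environment choice, mass scaling, and timesteps are assumed.
import gym
from stable_baselines import TRPO  # PPO2 is the analogous PPO implementation

source_env = gym.make("HalfCheetah-v2")          # "sim" environment
target_env = gym.make("HalfCheetah-v2")          # "real" environment
# Hypothetical dynamics mismatch: scale the MuJoCo link masses in the target.
target_env.unwrapped.model.body_mass[:] *= 1.5

agent = TRPO("MlpPolicy", source_env, verbose=0)
agent.learn(total_timesteps=1_000_000)

# Zero-shot evaluation in the mismatched target environment.
obs, ret, done = target_env.reset(), 0.0, False
while not done:
    action, _ = agent.predict(obs, deterministic=True)
    obs, r, done, _ = target_env.step(action)
    ret += r
print("target-environment return:", ret)
```

Reproducing the paper's numbers would additionally require the GARAT grounding step between source-environment training runs, plus the hyperparameters in its Appendix C.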