An Imitation from Observation Approach to Transfer Learning with Dynamics Mismatch

Authors: Siddharth Desai, Ishan Durugkar, Haresh Karnan, Garrett Warnell, Josiah Hanna, Peter Stone

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | To validate our hypothesis we derive a new algorithm, Generative Adversarial Reinforced Action Transformation (GARAT), based on adversarial imitation from observation techniques. We run experiments in several domains with mismatched dynamics, and find that agents trained with GARAT achieve higher returns in the target environment compared to existing black-box transfer methods. (A schematic sketch of the learned action transformation appears after the table.) |
| Researcher Affiliation | Collaboration | Siddharth Desai, Department of Mechanical Engineering, The University of Texas at Austin (sidrdesai@utexas.edu); Ishan Durugkar, Department of Computer Science, The University of Texas at Austin (ishand@cs.utexas.edu); Haresh Karnan, Department of Mechanical Engineering, The University of Texas at Austin (haresh.miriyala@utexas.edu); Garrett Warnell, Army Research Laboratory (garrett.a.warnell.civ@mail.mil); Josiah P. Hanna, School of Informatics, The University of Edinburgh (josiah.hanna@ed.ac.uk); Peter Stone, Department of Computer Science, The University of Texas at Austin and Sony AI (pstone@cs.utexas.edu) |
| Pseudocode | Yes | Algorithm 1 lays out its details. |
| Open Source Code | No | The paper mentions using implementations from the stable-baselines library [17] for TRPO and PPO, but does not state that its own code for GARAT or its experiments is publicly available. |
| Open Datasets | Yes | We validate GARAT for transfer by transferring the agent policy between OpenAI Gym [7] simulated environments with different transition dynamics. For various MuJoCo [47] environments... Apart from the MuJoCo simulator, we also show successful transfer in the PyBullet simulator [9] using the Ant domain. (An illustrative dynamics-perturbation snippet appears after the table.) |
| Dataset Splits | No | The paper trains policies and evaluates them across environments and episodes, but it does not specify explicit training/validation/test splits with percentages or sample counts, as is typically expected for static datasets. |
| Hardware Specification | No | The paper does not specify the hardware used for running the experiments, such as specific CPU or GPU models, or details about computational resources. |
| Software Dependencies | No | The paper states 'We use the implementations of TRPO and PPO provided in the stable-baselines library [17].' but does not specify a version number for stable-baselines or any other software dependency. (An example invocation, with the library version assumed, appears after the table.) |
| Experiment Setup | Yes | The specific hyperparameters used are provided in Appendix C. |
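
The core object GARAT learns is an action transformation that sits between the agent and the source simulator, so that transitions in the "grounded" simulator imitate the target environment's dynamics. The following is a minimal sketch of that interface, assuming a classic Gym-style API; the wrapper class and the `transform_fn` callable are hypothetical illustrations, and the actual adversarial training procedure is the paper's Algorithm 1, not shown here.

```python
import gym


class ActionTransformationWrapper(gym.Wrapper):
    """Hypothetical sketch of a GARAT-style grounded simulator:
    a learned function g(s, a) rewrites the agent's action before the
    simulator executes it, so simulator transitions better match the
    target environment. Training g adversarially is the paper's
    Algorithm 1 and is not reproduced here."""

    def __init__(self, env, transform_fn):
        super().__init__(env)
        # transform_fn: (observation, action) -> transformed action.
        self.transform_fn = transform_fn
        self._last_obs = None

    def reset(self, **kwargs):
        self._last_obs = self.env.reset(**kwargs)
        return self._last_obs

    def step(self, action):
        transformed = self.transform_fn(self._last_obs, action)
        obs, reward, done, info = self.env.step(transformed)
        self._last_obs = obs
        return obs, reward, done, info


# Usage: an identity transform_fn recovers the unmodified simulator.
identity = lambda obs, act: act
env = ActionTransformationWrapper(gym.make("HalfCheetah-v2"), identity)
```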
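The Open Datasets row refers to simulated environments rather than static datasets. One common way to create the kind of dynamics mismatch the paper studies, assuming mujoco-py-backed Gym environments, is to perturb physical parameters of one copy of an environment; the scaling factor below is illustrative, not a setting taken from the paper.

```python
import gym

# Source environment: the unmodified simulator the agent trains in.
source_env = gym.make("HalfCheetah-v2")

# Target environment: same task, perturbed transition dynamics.
# Scaling every body mass by 1.5 is an illustrative perturbation,
# not the specific mismatch evaluated in the paper.
target_env = gym.make("HalfCheetah-v2")
target_env.unwrapped.model.body_mass[:] *= 1.5
```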
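Since the paper reports using TRPO and PPO from stable-baselines without stating a version, the sketch below assumes the TensorFlow-based stable-baselines v2 API; the policy choice and timestep budget are illustrative defaults, not the hyperparameters from the paper's Appendix C.

```python
import gym
from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv

# Vectorized environment wrapper required by stable-baselines algorithms.
env = DummyVecEnv([lambda: gym.make("HalfCheetah-v2")])

# Train an agent policy in the simulator. "MlpPolicy" and the timestep
# budget are library defaults used for illustration only.
agent = PPO2("MlpPolicy", env, verbose=1)
agent.learn(total_timesteps=100000)
agent.save("transfer_policy")
```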