Unsupervised Cross-Domain Transfer in Policy Gradient Reinforcement Learning via Manifold Alignment

Authors: Haitham Bou Ammar, Eric Eaton, Paul Ruvolo, Matthew Taylor

AAAI 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results on diverse dynamical systems, including an application to quadrotor control, demonstrate its effectiveness for cross-domain transfer in the context of policy gradient RL.
Researcher Affiliation | Academia | Haitham Bou Ammar (Univ. of Pennsylvania, haithamb@seas.upenn.edu); Eric Eaton (Univ. of Pennsylvania, eeaton@cis.upenn.edu); Paul Ruvolo (Olin College of Engineering, paul.ruvolo@olin.edu); Matthew E. Taylor (Washington State Univ., taylorm@eecs.wsu.edu)
Pseudocode | Yes | Algorithm 1: Manifold Alignment Cross-Domain Transfer for Policy Gradients (MAXDT-PG)
Open Source Code | No | The paper does not explicitly state that source code for the described methodology is publicly available, nor does it provide a link to a code repository.
Open Datasets | No | The paper describes experiments on simulated dynamical systems (Simple Mass Spring Damper, Cart Pole, Three-Link Cart Pole, Quadrotor) for which traces and samples are generated, rather than on a pre-existing, publicly available dataset with access information.
Dataset Splits | No | The paper describes generating traces and samples from simulated dynamical systems and evaluating performance over learning iterations, but does not specify explicit train/validation/test splits with percentages, counts, or a predefined partitioning methodology.
Hardware Specification | No | The paper does not provide hardware details, such as exact GPU/CPU models, processor types, or memory amounts, for the machines used to run its experiments.
Software Dependencies | No | The paper does not list the ancillary software, such as library or solver names with version numbers, needed to replicate the experiments.
Experiment Setup | Yes | Figure 3 shows MAXDT-PG's performance using varying numbers of source and target samples to learn χS. These results reveal that transfer-initialized policies outperform standard policy gradient initialization. Further, as the number of samples used to learn χS increases, so do both the initial and final performance in all domains. All initializations result in equal per-iteration computational cost; therefore, MAXDT-PG both improves sample complexity and reduces wall-clock learning time. [...] Rewards were averaged over 500 traces collected from 150 initial states.
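
To make the quoted evaluation protocol concrete, the following is a minimal sketch, not the authors' code: it averages returns over traces started from a fixed set of initial states and compares a transfer-initialized policy against a standard random initialization on a toy mass-spring-damper system (one of the domains listed above). The dynamics constants, quadratic reward, linear-Gaussian policy form, and the parameter vectors theta_standard and theta_transfer are illustrative assumptions; only the trace-averaging protocol follows the paper's description.

```python
import numpy as np

# Sketch of the evaluation protocol described above (assumed details marked):
# average returns over traces collected from a fixed set of initial states,
# comparing a transfer-initialized policy to a standard initialization.

M, K, C, DT = 1.0, 10.0, 1.0, 0.05   # assumed mass, spring, damping, time step


def step(x, u):
    """One Euler step of a toy mass-spring-damper; quadratic cost (assumed)."""
    pos, vel = x
    acc = (-K * pos - C * vel + u) / M
    x_next = np.array([pos + DT * vel, vel + DT * acc])
    reward = -(pos ** 2 + 0.01 * u ** 2)
    return x_next, reward


def trace_return(theta, x0, horizon=200, noise=0.1, rng=None):
    """Cumulative reward of one trace under a linear-Gaussian policy (assumed form)."""
    rng = rng or np.random.default_rng()
    x, total = np.asarray(x0, dtype=float), 0.0
    for _ in range(horizon):
        u = float(theta @ x) + noise * rng.standard_normal()
        x, r = step(x, u)
        total += r
    return total


def average_reward(theta, initial_states, traces_per_state, rng=None):
    """Average return over `traces_per_state` traces from each initial state."""
    rng = rng or np.random.default_rng()
    returns = [trace_return(theta, x0, rng=rng)
               for x0 in initial_states
               for _ in range(traces_per_state)]
    return float(np.mean(returns))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    initial_states = rng.uniform(-1.0, 1.0, size=(150, 2))  # 150 initial states
    theta_standard = 0.01 * rng.standard_normal(2)           # standard PG initialization (assumed)
    theta_transfer = np.array([-3.0, -1.5])                  # stand-in for a transferred policy
    # The paper reports ~500 traces over 150 initial states; the allocation here is illustrative.
    for name, theta in [("standard init", theta_standard), ("transfer init", theta_transfer)]:
        print(name, average_reward(theta, initial_states, traces_per_state=3, rng=rng))
```

Under this protocol, an initialization that already stabilizes the system yields a higher average reward before any policy gradient iterations, which is the kind of head start the quoted passage attributes to transfer-initialized policies.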