Unsupervised Cross-Domain Transfer in Policy Gradient Reinforcement Learning via Manifold Alignment
Authors: Haitham Bou Ammar, Eric Eaton, Paul Ruvolo, Matthew Taylor
AAAI 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results on diverse dynamical systems, including an application to quadrotor control, demonstrate its effectiveness for cross-domain transfer in the context of policy gradient RL. |
| Researcher Affiliation | Academia | Haitham Bou Ammar, Univ. of Pennsylvania (haithamb@seas.upenn.edu); Eric Eaton, Univ. of Pennsylvania (eeaton@cis.upenn.edu); Paul Ruvolo, Olin College of Engineering (paul.ruvolo@olin.edu); Matthew E. Taylor, Washington State Univ. (taylorm@eecs.wsu.edu) |
| Pseudocode | Yes | Algorithm 1 Manifold Alignment Cross-Domain Transfer for Policy Gradients (MAXDT-PG) |
| Open Source Code | No | The paper does not explicitly state that the source code for the described methodology is publicly available or provide a direct link to a code repository. |
| Open Datasets | No | The paper describes experiments conducted on simulated dynamical systems (Simple Mass Spring Damper, Cart Pole, Three-Link Cart Pole, Quadrotor) by generating traces and samples, rather than using a pre-existing, publicly available dataset for which access information is provided. |
| Dataset Splits | No | The paper describes generating 'traces' and 'samples' from simulated dynamical systems, and evaluating performance based on learning iterations, but does not specify explicit train/validation/test dataset splits with percentages, counts, or predefined partition methodologies. |
| Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details, such as library or solver names with version numbers, needed to replicate the experiment. |
| Experiment Setup | Yes | Figure 3 shows MAXDT-PG's performance using varying numbers of source and target samples to learn χS. These results reveal that transfer-initialized policies outperform standard policy gradient initialization. Further, as the number of samples used to learn χS increases, so do both the initial and final performance in all domains. All initializations result in equal per-iteration computational cost. Therefore, MAXDT-PG both improves sample complexity and reduces wall-clock learning time. [...] Rewards were averaged over 500 traces collected from 150 initial states. |
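
Since the paper provides pseudocode (Algorithm 1, MAXDT-PG) but no public source code, the sketch below illustrates the general flavor of the transfer step: learning an inter-task mapping χS from unlabeled source and target state samples and using it to map source states into the target state space. This is a minimal stand-in only; the PCA embedding, the Procrustes rotation, and the paired-sample heuristic are assumptions for illustration and do not reproduce the authors' unsupervised manifold-alignment procedure.

```python
# Hedged sketch of an inter-task state mapping chi_S learned from raw samples.
# NOTE: This is an illustrative simplification, not the MAXDT-PG algorithm.
import numpy as np

def learn_chi_s(source_states, target_states, latent_dim=3):
    """Return a function mapping source-task states to target-task states.

    source_states: (n_s, d_s) array of states sampled from the source task.
    target_states: (n_t, d_t) array of states sampled from the target task.
    """
    # Center each domain and embed it into a shared latent dimension via PCA.
    mu_s, mu_t = source_states.mean(axis=0), target_states.mean(axis=0)
    Xs, Xt = source_states - mu_s, target_states - mu_t
    _, _, Vs = np.linalg.svd(Xs, full_matrices=False)
    _, _, Vt = np.linalg.svd(Xt, full_matrices=False)
    Ps, Pt = Vs[:latent_dim].T, Vt[:latent_dim].T      # (d_s, k), (d_t, k)

    # Align the two latent spaces with an orthogonal Procrustes rotation.
    # Pairing the first n embedded samples is a crude assumption; the paper's
    # manifold alignment is unsupervised and needs no known correspondences.
    n = min(len(Xs), len(Xt))
    Zs, Zt = Xs[:n] @ Ps, Xt[:n] @ Pt
    U, _, Wt = np.linalg.svd(Zs.T @ Zt)
    R = U @ Wt                                          # (k, k) rotation

    def chi_s(s):
        """Map a batch of source states into the target state space."""
        z = (np.atleast_2d(s) - mu_s) @ Ps @ R          # source -> aligned latent
        return z @ Pt.T + mu_t                          # latent -> target space

    return chi_s
```

In MAXDT-PG the learned mapping is used to transfer source-task experience so that the target policy gradient learner starts from a transfer-initialized policy rather than a standard initialization; the sketch above only covers the state-space mapping, and any downstream initialization step would have to follow the paper's Algorithm 1.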