Hypernetworks for Zero-Shot Transfer in Reinforcement Learning
Authors: Sahand Rezaei-Shoshtari, Charlotte Morissette, Francois R. Hogan, Gregory Dudek, David Meger
AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically evaluate the effectiveness of our method for zero-shot transfer to new reward and transition dynamics on a series of continuous control tasks from DeepMind Control Suite. |
| Researcher Affiliation | Collaboration | 1McGill University 2Mila - Québec AI Institute 3Samsung AI Center Montreal srezaei@cim.mcgill.ca |
| Pseudocode | Yes | Algorithm 1 shows the pseudo-code of our learning framework. |
| Open Source Code | Yes | Our learning code, generated datasets, and custom continuous control environments, which are built upon DeepMind Control Suite, are publicly available at: https://sites.google.com/view/hyperzero-rl |
| Open Datasets | Yes | Our learning code, generated datasets, and custom continuous control environments, which are built upon DeepMind Control Suite, are publicly available at: https://sites.google.com/view/hyperzero-rl |
| Dataset Splits | Yes | To reliably evaluate the zero-shot transfer abilities of HyperZero to novel reward/dynamics settings against the baselines, and to rule out the possibility of selective choosing of train/test tasks, we randomly divide task settings into train (85%) and test (15%) sets. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models or memory specifications used for the experiments. |
| Software Dependencies | No | The paper mentions using TD3 as the RL algorithm and Deep Mind Control Suite for environments but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | RL Training and Dataset Collection. We use TD3 (Fujimoto, Hoof, and Meger 2018) as the RL algorithm that is to be approximated. Each MDP M_i ∈ M, generated by sampling ψ_i ∼ p(ψ) and μ_i ∼ p(μ), is used to independently train a standard TD3 agent on proprioceptive states for 1 million steps. Consequently, the final solution is used to generate 10 rollouts to be added to the dataset D. ... Train/Test Split of the Tasks. ... we randomly divide task settings into train (85%) and test (15%) sets. We consequently report the mean and standard deviation of the average return obtained on 5 seeds. |
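
The "Pseudocode" row refers to Algorithm 1 of the paper, which is not reproduced here. As a generic illustration of the hypernetwork idea only (not the paper's HyperZero architecture or Algorithm 1), the PyTorch sketch below maps a task descriptor (e.g., reward parameters ψ and dynamics parameters μ) to the weights of a small policy MLP. All layer sizes, dimensions, and names are illustrative assumptions.

```python
# Generic hypernetwork sketch: a small MLP takes the task parameters and
# outputs the weights and biases of a one-hidden-layer policy network.
# This is NOT the paper's exact architecture; dimensions are arbitrary.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyHypernetwork(nn.Module):
    def __init__(self, task_dim, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.state_dim, self.action_dim, self.hidden_dim = state_dim, action_dim, hidden_dim
        # Total parameter count of the generated policy MLP:
        # state_dim -> hidden_dim (W1, b1) and hidden_dim -> action_dim (W2, b2).
        n_params = (state_dim * hidden_dim + hidden_dim
                    + hidden_dim * action_dim + action_dim)
        self.hyper = nn.Sequential(
            nn.Linear(task_dim, 256), nn.ReLU(),
            nn.Linear(256, n_params),
        )

    def forward(self, task_params, state):
        """Generate policy weights from task_params, then run the policy on one state."""
        theta = self.hyper(task_params)
        s, h, a = self.state_dim, self.hidden_dim, self.action_dim
        i = 0
        W1 = theta[i:i + s * h].view(h, s); i += s * h
        b1 = theta[i:i + h];                i += h
        W2 = theta[i:i + h * a].view(a, h); i += h * a
        b2 = theta[i:i + a]
        x = F.relu(F.linear(state, W1, b1))
        return torch.tanh(F.linear(x, W2, b2))  # bounded continuous action

# Example with made-up dimensions (single task/state pair, no batching).
net = PolicyHypernetwork(task_dim=4, state_dim=17, action_dim=6)
action = net(torch.randn(4), torch.randn(17))
```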
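The "Dataset Splits" and "Experiment Setup" rows describe a random 85%/15% train/test split of task settings and reporting the mean and standard deviation of the average return over 5 seeds. The NumPy sketch below illustrates that evaluation protocol; the function names and the dummy return values are hypothetical and are not taken from the released code.

```python
# Minimal sketch of the reported protocol: random 85/15 task split and
# mean/std of average return over several seeds (the paper uses 5).
import numpy as np

def split_tasks(task_ids, train_frac=0.85, seed=0):
    """Randomly partition task settings into train and test sets."""
    rng = np.random.default_rng(seed)
    ids = rng.permutation(task_ids)
    n_train = int(round(train_frac * len(ids)))
    return ids[:n_train], ids[n_train:]

def report_over_seeds(returns_per_seed):
    """Aggregate average returns collected over multiple seeds."""
    returns = np.asarray(returns_per_seed, dtype=float)
    return returns.mean(), returns.std()

if __name__ == "__main__":
    train_tasks, test_tasks = split_tasks(np.arange(100))
    print(len(train_tasks), len(test_tasks))  # 85 15
    mean, std = report_over_seeds([712.0, 698.5, 731.2, 705.9, 720.3])  # dummy returns
    print(f"average return: {mean:.1f} +/- {std:.1f}")
```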