PlayVirtual: Augmenting Cycle-Consistent Virtual Trajectories for Reinforcement Learning
Authors: Tao Yu, Cuiling Lan, Wenjun Zeng, Mingxiao Feng, Zhizheng Zhang, Zhibo Chen
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate the effectiveness of our designs on the Atari and DeepMind Control Suite benchmarks. Our method achieves the state-of-the-art performance on both benchmarks. |
| Researcher Affiliation | Collaboration | Tao Yu¹, Cuiling Lan², Wenjun Zeng², Mingxiao Feng¹, Zhizheng Zhang², Zhibo Chen¹; ¹University of Science and Technology of China, ²Microsoft Research Asia; yutao666@mail.ustc.edu.cn, {culan,wezeng}@microsoft.com, fmxustc@mail.ustc.edu.cn, zhizzhang@microsoft.com, chenzhibo@ustc.edu.cn |
| Pseudocode | No | The paper describes the methodology in text and equations but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/microsoft/Playvirtual. |
| Open Datasets | Yes | We evaluate our method on the commonly used discrete control benchmark of Atari [2], and the continuous control benchmark of DMControl [43]. |
| Dataset Splits | No | The paper describes training and evaluation protocols (e.g., '100k interaction steps', '500k environment steps') and mentions using established benchmarks (Atari, DMControl), but it does not specify explicit numerical training, validation, or test dataset splits (e.g., percentages or sample counts for data partitions). |
| Hardware Specification | No | The paper does not explicitly provide details about the specific hardware (e.g., GPU or CPU models, memory) used for running its experiments. |
| Software Dependencies | No | All our models are implemented via PyTorch [39]. |
| Experiment Setup | Yes | We set the number of prediction steps K to 9 by default... We simply set the number of action sets, i.e., the number of virtual trajectories M to 2|A|... We set K to 6, and set M to a fixed number 10... We set λ_pred = 1 and λ_cyc = 1. For d_M, we use the distance metric as in SPR [40]... We follow the training settings in CURL except the batch size (reduced from 512 to 128 to save memory cost) and learning rate. |
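
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object. The following is a minimal sketch, not the authors' code: the class name `PlayVirtualSetup`, its field names, and the helper `weighted_aux_loss` are hypothetical, and the row does not say which benchmark each setting belongs to. The sketch assumes the first setting (K = 9, M = 2|A|) applies to a discrete action set of size |A|, and the second (K = 6, M = 10) to a setting where M is fixed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PlayVirtualSetup:
    """Hypothetical container for the hyperparameters quoted above."""

    prediction_steps: int                      # K, number of prediction steps
    lambda_pred: float = 1.0                   # weight of the prediction loss
    lambda_cyc: float = 1.0                    # weight of the cycle-consistency loss
    fixed_num_trajectories: Optional[int] = None  # M, when it is a fixed number

    def num_virtual_trajectories(self, num_actions: Optional[int] = None) -> int:
        """Return M: the fixed value if one is set, otherwise 2 * |A|."""
        if self.fixed_num_trajectories is not None:
            return self.fixed_num_trajectories
        if num_actions is None:
            raise ValueError("num_actions is required when M is set to 2|A|")
        return 2 * num_actions


def weighted_aux_loss(setup: PlayVirtualSetup, pred_loss: float, cyc_loss: float) -> float:
    """Combine the two auxiliary losses with their weights (illustrative only)."""
    return setup.lambda_pred * pred_loss + setup.lambda_cyc * cyc_loss


# First quoted setting: K = 9, M = 2|A| (assumes a discrete action set).
first_setting = PlayVirtualSetup(prediction_steps=9)
# Second quoted setting: K = 6, M fixed to 10.
second_setting = PlayVirtualSetup(prediction_steps=6, fixed_num_trajectories=10)

# The action count of 18 is an illustrative value, not taken from the table.
m_first = first_setting.num_virtual_trajectories(num_actions=18)
m_second = second_setting.num_virtual_trajectories()
```

The batch size of 128 and the SPR distance metric are left out because the row gives no further detail (for example, the learning rate is not stated), so they would have to be read from the paper or the released code.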