PlayVirtual: Augmenting Cycle-Consistent Virtual Trajectories for Reinforcement Learning

Authors: Tao Yu, Cuiling Lan, Wenjun Zeng, Mingxiao Feng, Zhizheng Zhang, Zhibo Chen

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate the effectiveness of our designs on the Atari and DeepMind Control Suite benchmarks. Our method achieves the state-of-the-art performance on both benchmarks.
Researcher Affiliation | Collaboration | Tao Yu¹, Cuiling Lan², Wenjun Zeng², Mingxiao Feng¹, Zhizheng Zhang², Zhibo Chen¹; ¹University of Science and Technology of China, ²Microsoft Research Asia; yutao666@mail.ustc.edu.cn, {culan,wezeng}@microsoft.com, fmxustc@mail.ustc.edu.cn, zhizzhang@microsoft.com, chenzhibo@ustc.edu.cn
Pseudocode | No | The paper describes the methodology in text and equations but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at https://github.com/microsoft/Playvirtual.
Open Datasets | Yes | We evaluate our method on the commonly used discrete control benchmark of Atari [2], and the continuous control benchmark of DMControl [43].
Dataset Splits | No | The paper describes training and evaluation protocols (e.g., '100k interaction steps', '500k environment steps') and mentions using established benchmarks (Atari, DMControl), but it does not specify explicit numerical training, validation, or test dataset splits (e.g., percentages or sample counts for data partitions).
Hardware Specification | No | The paper does not explicitly provide details about the specific hardware (e.g., GPU or CPU models, memory) used for running its experiments.
Software Dependencies | No | All our models are implemented via PyTorch [39].
Experiment Setup | Yes | We set the number of prediction steps K to 9 by default... We simply set the number of action sets, i.e., the number of virtual trajectories M to 2|A|... We set K to 6, and set M to a fixed number 10... We set λ_pred = 1 and λ_cyc = 1. For d_M, we use the distance metric as in SPR [40]... We follow the training settings in CURL except the batch size (reduced from 512 to 128 to save memory cost) and learning rate.
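
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. This is an illustration only, not the authors' code. The class and function names are hypothetical. The mapping of K = 9, M = 2|A| to Atari and of K = 6, M = 10 with batch size 128 to DMControl is an assumption, since the quoted text elides which benchmark each setting belongs to. Combining the loss terms as a weighted sum is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PlayVirtualSetup:
    """Hypothetical container for the quoted hyperparameters."""
    prediction_steps: int              # K
    num_virtual_trajectories: int      # M
    lambda_pred: float = 1.0           # weight of the prediction loss
    lambda_cyc: float = 1.0            # weight of the cycle-consistency loss
    batch_size: Optional[int] = None   # None where the summary gives no value


def atari_setup(num_actions: int) -> PlayVirtualSetup:
    # Assumed discrete-control setting: K = 9, M = 2|A|.
    return PlayVirtualSetup(prediction_steps=9,
                            num_virtual_trajectories=2 * num_actions)


def dmcontrol_setup() -> PlayVirtualSetup:
    # Assumed continuous-control setting: K = 6, M = 10, batch size 128.
    return PlayVirtualSetup(prediction_steps=6,
                            num_virtual_trajectories=10,
                            batch_size=128)


def total_loss(rl_loss: float, pred_loss: float, cyc_loss: float,
               setup: PlayVirtualSetup) -> float:
    # Assumed form: RL loss plus the two weighted auxiliary losses.
    return (rl_loss
            + setup.lambda_pred * pred_loss
            + setup.lambda_cyc * cyc_loss)
```

For example, `atari_setup(18)` would describe a game with 18 discrete actions, giving 36 virtual trajectories under the 2|A| rule.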