VRL3: A Data-Driven Framework for Visual Deep Reinforcement Learning
Authors: Che Wang, Xufang Luo, Keith Ross, Dongsheng Li
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On a set of challenging hand manipulation tasks with sparse reward and realistic visual inputs, compared to the previous SOTA, VRL3 achieves an average of 780% better sample efficiency. And on the hardest task, VRL3 is 1220% more sample efficient (2440% when using a wider encoder) and solves the task with only 10% of the computation. These significant results clearly demonstrate the great potential of data-driven deep reinforcement learning. |
| Researcher Affiliation | Collaboration | Che Wang (1, 2), Xufang Luo (3), Keith Ross (1), Dongsheng Li (3); (1) New York University Shanghai, (2) New York University, (3) Microsoft Research Asia, Shanghai, China |
| Pseudocode | Yes | We focus on the high-level ideas and provide additional technical details in Appendix A and pseudocode in Appendix C. |
| Open Source Code | No | (Source code is being reviewed and cleaned and will be put on GitHub soon). We provide source code [2] and a full set of technical details to maximize reproducibility. (Footnote 2 points to https://sites.google.com/nyu.edu/vrl3) |
| Open Datasets | Yes | In stage 1, we learn from large, existing non-RL datasets such as the ImageNet dataset. Note that for Adroit, we have a standard 25 expert demonstrations per task (collected by human users with VR) [66]. For DMC, we collect 25K data with fully trained DrQv2 agents to enable stage 2 training. |
| Dataset Splits | No | The paper does not explicitly provide specific percentages or counts for training/validation/test dataset splits, nor does it explicitly reference standard predefined splits for the RL environments used beyond general mention of ImageNet. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used to run its experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions building upon DrQv2 and standard libraries (e.g., for ResNet), but does not provide specific version numbers for software dependencies like Python, PyTorch, or other key packages. |
| Experiment Setup | Yes | We focus on the high-level ideas and provide additional technical details in Appendix A and pseudocode in Appendix C. Let α be the learning rate for the policy network and the Q networks. Let β_enc be the encoder learning rate scale, so that the encoder learning rate is α_enc = β_enc · α. For all tasks, we set a maximum Q target value. We also use Polyak averaging hyperparameter τ to update target networks, as is typically done. |
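
The Experiment Setup row references three conventions: a scaled encoder learning rate (α_enc = β_enc · α), a capped Q target value, and Polyak-averaged target networks. The sketch below illustrates how these could look in PyTorch; the specific values of α, β_enc, τ, and the Q cap, along with the module shapes, are illustrative assumptions rather than the paper's reported settings.

```python
import copy
import torch
import torch.nn as nn

# Illustrative hyperparameters; concrete values are assumptions, not the paper's settings.
alpha = 1e-4      # learning rate alpha for the policy and Q networks
beta_enc = 0.1    # encoder learning-rate scale, so alpha_enc = beta_enc * alpha
tau = 0.01        # Polyak averaging coefficient for target-network updates
q_max = 100.0     # maximum allowed Q target value

# Toy modules standing in for the convolutional encoder and Q network.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(84 * 84 * 3, 50))
q_net = nn.Linear(50 + 4, 1)          # feature dim 50 + 4-dim action (assumed)
q_target_net = copy.deepcopy(q_net)   # target Q network starts as a copy

# One optimizer with per-parameter-group learning rates: the encoder group
# uses alpha_enc = beta_enc * alpha, the Q-network group uses alpha.
optimizer = torch.optim.Adam([
    {"params": encoder.parameters(), "lr": beta_enc * alpha},
    {"params": q_net.parameters(), "lr": alpha},
])


def clipped_q_target(reward, discount, next_q):
    """Bellman target capped at q_max, per the 'maximum Q target value' note."""
    return torch.clamp(reward + discount * next_q, max=q_max)


@torch.no_grad()
def polyak_update(online, target, tau):
    """Soft update: target <- tau * online + (1 - tau) * target."""
    for p, p_targ in zip(online.parameters(), target.parameters()):
        p_targ.mul_(1.0 - tau).add_(tau * p)
```

After each gradient step on `q_net`, calling `polyak_update(q_net, q_target_net, tau)` keeps the target network a slow-moving average of the online network, which is the standard role of the τ hyperparameter quoted above.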