Distributional Successor Features Enable Zero-Shot Policy Optimization
Authors: Chuning Zhu, Xinqi Wang, Tyler Han, Simon S. Du, Abhishek Gupta
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a practical instantiation of DiSPOs using diffusion models and show their efficacy as a new class of transferable models, both theoretically and empirically across various simulated robotics problems. Videos and code are available at https://weirdlabuw.github.io/dispo/. |
| Researcher Affiliation | Academia | Chuning Zhu, University of Washington, zchuning@cs.washington.edu; Xinqi Wang, University of Washington, wxqkaxdd@cs.washington.edu; Tyler Han, University of Washington, than123@cs.washington.edu; Simon Shaolei Du, University of Washington, ssdu@cs.washington.edu; Abhishek Gupta, University of Washington, abhgupta@cs.washington.edu |
| Pseudocode | Yes | Appendix F Algorithm Pseudocode |
| Open Source Code | Yes | Videos and code are available at https://weirdlabuw.github.io/dispo/. |
| Open Datasets | Yes | We use the D4RL dataset for pretraining and dense rewards described in Appendix D for adaptation. ... We use the offline dataset from [9] for pretraining and shaped rewards for adaptation. ... D4RL: Datasets for deep data-driven reinforcement learning. https://arxiv.org/abs/2004.07219, 2020. |
| Dataset Splits | No | The paper explicitly mentions using 'offline dataset for pretraining' and adapting to 'test-time rewards', but it does not specify a distinct validation set or its split ratios/counts for hyperparameter tuning or model selection during its experimental setup. |
| Hardware Specification | Yes | Each experiment (pretraining + adaptation) takes 3 hours on a single Nvidia L40 GPU. |
| Software Dependencies | No | The paper mentions using the 'AdamW optimizer [35]' and 'conditional DDIMs [47]' but does not provide specific version numbers for programming languages, libraries, or other software dependencies. |
| Experiment Setup | Yes | We set d = 128 for all of our experiments. ... The noise prediction network is implemented as a 1-D UNet with down dimensions [256, 512, 1024]. ... We train our models on the offline dataset for 100,000 gradient steps using the AdamW optimizer [35] with batch size 2048. The learning rates for the outcome model and the policy are set to 3e-4 and adjusted according to a cosine learning rate schedule with 500 warmup steps. |
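
The Open Datasets row above cites D4RL as the pretraining data source. As a hedged illustration of what loading such an offline dataset looks like, the sketch below uses the standard D4RL interface; the environment id `halfcheetah-medium-v2` and the variable names are assumptions for illustration only and are not taken from the paper.

```python
# Minimal sketch of loading a D4RL offline dataset for pretraining.
# The environment id "halfcheetah-medium-v2" is an illustrative assumption,
# not a task identified in the paper.
import gym
import d4rl  # registers D4RL environments with gym

env = gym.make("halfcheetah-medium-v2")
dataset = env.get_dataset()  # dict of arrays: observations, actions, rewards, terminals

observations = dataset["observations"]  # (N, obs_dim) states
actions = dataset["actions"]            # (N, act_dim) actions
rewards = dataset["rewards"]            # (N,) per-step rewards
terminals = dataset["terminals"]        # (N,) episode termination flags
print(observations.shape, actions.shape)
```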
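
The Experiment Setup row reports AdamW with learning rate 3e-4, batch size 2048, 100,000 gradient steps, and a cosine learning-rate schedule with 500 warmup steps. The sketch below shows one way to reproduce that schedule in PyTorch; the placeholder model and the `lr_lambda` helper are assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of the reported optimization setup:
# AdamW at lr 3e-4, batch size 2048, 100,000 gradient steps, and a cosine
# learning-rate schedule with 500 linear warmup steps.
import math
import torch

TOTAL_STEPS = 100_000   # gradient steps reported for pretraining
WARMUP_STEPS = 500      # warmup steps reported for the cosine schedule
LEARNING_RATE = 3e-4    # learning rate for the outcome model and policy
BATCH_SIZE = 2048       # reported batch size (used when sampling offline data)

# Placeholder network standing in for the d = 128 outcome model / 1-D UNet.
model = torch.nn.Linear(128, 128)
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)


def lr_lambda(step: int) -> float:
    """Linear warmup for WARMUP_STEPS, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Inspect the schedule at a few representative steps.
for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS - 1):
    print(f"step {step}: lr = {LEARNING_RATE * lr_lambda(step):.2e}")
```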