Distributional Successor Features Enable Zero-Shot Policy Optimization

Authors: Chuning Zhu, Xinqi Wang, Tyler Han, Simon S. Du, Abhishek Gupta

NeurIPS 2024

Reproducibility Variable Result LLM Response
Research Type | Experimental | "We present a practical instantiation of DiSPOs using diffusion models and show their efficacy as a new class of transferable models, both theoretically and empirically across various simulated robotics problems. Videos and code are available at https://weirdlabuw.github.io/dispo/."
Researcher Affiliation | Academia | Chuning Zhu (University of Washington, zchuning@cs.washington.edu); Xinqi Wang (University of Washington, wxqkaxdd@cs.washington.edu); Tyler Han (University of Washington, than123@cs.washington.edu); Simon Shaolei Du (University of Washington, ssdu@cs.washington.edu); Abhishek Gupta (University of Washington, abhgupta@cs.washington.edu)
Pseudocode | Yes | Appendix F: Algorithm Pseudocode
Open Source Code | Yes | "Videos and code are available at https://weirdlabuw.github.io/dispo/."
Open Datasets | Yes | "We use the D4RL dataset for pretraining and dense rewards described in Appendix D for adaptation. ... We use the offline dataset from [9] for pretraining and shaped rewards for adaptation. ... D4RL: Datasets for deep data-driven reinforcement learning. https://arxiv.org/abs/2004.07219, 2020."
Dataset Splits | No | The paper explicitly mentions using an 'offline dataset for pretraining' and adapting to 'test-time rewards', but it does not specify a distinct validation set or its split ratios/counts for hyperparameter tuning or model selection.
Hardware Specification | Yes | "Each experiment (pretraining + adaptation) takes 3 hours on a single Nvidia L40 GPU."
Software Dependencies | No | The paper mentions using the 'AdamW optimizer [35]' and 'conditional DDIMs [47]' but does not provide specific version numbers for programming languages, libraries, or other software dependencies.
Experiment Setup | Yes | "We set d = 128 for all of our experiments. ... The noise prediction network is implemented as a 1-D UNet with down dimensions [256, 512, 1024]. ... We train our models on the offline dataset for 100,000 gradient steps using the AdamW optimizer [35] with batch size 2048. The learning rates for the outcome model and the policy are set to 3e-4 and adjusted according to a cosine learning rate schedule with 500 warmup steps."
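The quoted schedule (peak learning rate 3e-4, 500 warmup steps, 100,000 gradient steps, cosine decay) can be sketched as a small pure-Python function. This is a minimal sketch under assumptions the report does not state: the function name is ours, warmup is assumed linear from zero, and the cosine is assumed to decay to zero at the final step.

```python
import math

def lr_at_step(step, base_lr=3e-4, warmup_steps=500, total_steps=100_000):
    """Cosine learning-rate schedule with linear warmup.

    Hyperparameters match those quoted from the paper (peak lr 3e-4,
    500 warmup steps, 100,000 gradient steps). Linear warmup and decay
    to zero at the final step are assumptions, not stated in the paper.
    """
    if step < warmup_steps:
        # Linear warmup: ramp from 0 to base_lr over the first 500 steps.
        return base_lr * step / warmup_steps
    # Cosine decay: base_lr at the end of warmup, 0 at total_steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In a typical PyTorch setup this function would be attached to an AdamW optimizer via `torch.optim.lr_scheduler.LambdaLR`, with the batch size of 2048 quoted above.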