Offline Behavior Distillation

Authors: Shiye Lei, Sen Zhang, Dacheng Tao

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Extensive experiments on multiple D4RL datasets reveal that Av-PBC offers significant improvements in OBD performance, faster distillation convergence, and robust cross-architecture/optimizer generalization. |
| Researcher Affiliation | Academia | Shiye Lei, School of Computer Science, The University of Sydney (shiye.lei@sydney.edu.au); Sen Zhang, School of Computer Science, The University of Sydney (sen.zhang@sydney.edu.au); Dacheng Tao, College of Computing & Data Science, Nanyang Technological University (dacheng.tao@ntu.edu.sg) |
| Pseudocode | Yes | Algorithm 1: Action-value weighted PBC (a hedged loss sketch follows the table). |
| Open Source Code | Yes | The code is available at https://github.com/LeavesLei/OBD. |
| Open Datasets | Yes | We conduct offline behavior distillation on D4RL [Fu et al., 2020], a widely used offline RL benchmark (a loading sketch follows the table). |
| Dataset Splits | No | The paper uses D4RL datasets but does not explicitly state how they are split into training, validation, and test sets for the authors' experimental setup, beyond using the full dataset for Cal-QL and synthesizing D_syn for BC training. |
| Hardware Specification | Yes | The OBD process is still computationally expensive (25 hours for 50k distillation steps on a single NVIDIA V100 GPU). |
| Software Dependencies | No | The paper mentions using Cal-QL and standard SGD but does not provide version numbers for these or for other software dependencies such as the programming language or deep learning framework. |
| Experiment Setup | Yes | A four-layer MLP serves as the default architecture for policy networks. The size of the synthetic data N_syn is set to 256. Standard SGD is employed in both inner and outer optimization, with learning rates α0 = 0.1 (inner) and α1 = 0.1 (outer), and corresponding momentum rates β0 = 0 and β1 = 0.9 (a configuration sketch follows the table). |