Weight Diffusion for Future: Learn to Generalize in Non-Stationary Environments

Authors: Mixue Xie, Shuang Li, Binhui Xie, Chi Liu, Jian Liang, Zixun Sun, Ke Feng, Chengwei Zhu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments on both synthetic and real-world datasets show the superior generalization performance of W-Diff on unseen domains in the future.
Researcher Affiliation | Collaboration | Mixue Xie, Beijing Institute of Technology, mxxie@bit.edu.cn ... Jian Liang, Kuaishou Technology, liangjian03@kuaishou.com
Pseudocode | Yes | Algorithm 1: Training procedure for W-Diff ... Algorithm 2: Testing procedure for W-Diff
Open Source Code | Yes | Code is available at https://github.com/BIT-DA/W-Diff.
Open Datasets | Yes | Benchmark Datasets. We evaluate W-Diff on both synthetic and real-world datasets [2, 48], including two text classification datasets (Huffpost, Arxiv), three image classification datasets (Yearbook, RMNIST, fMoW) and two multivariate classification datasets (2-Moons, ONP). ... For more details on datasets, please refer to Appendix D.1.
Dataset Splits | Yes | For each source domain, we randomly divide the data into training and validation sets in the ratio of 9:1.
Hardware Specification | Yes | All experiments are conducted using the PyTorch packages and run on a single NVIDIA GeForce RTX 4090 GPU with 24GB memory.
Software Dependencies | No | The paper mentions 'PyTorch packages' but does not specify version numbers for PyTorch or any other software dependencies.
Experiment Setup | Yes | For all datasets, we set the batch size B = 64, the loss tradeoff λ = 10 and the maximum length L = 8 for the reference point queue Qr. To optimize the task model, we adopt the Adam optimizer with momentum 0.9. As for the warm-up hyperparameter ρ, we set ρ = 0.6 for Huffpost, fMoW and ρ = 0.2 for Arxiv, Yearbook, RMNIST, 2-Moons, ONP. For the conditional diffusion model, we set the maximum diffusion step S = 1000 and use the AdamW optimizer with batch size M = 32... Training details on different datasets are given in Table 8.
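The Experiment Setup values above can be gathered into a single configuration. The sketch below is a minimal PyTorch illustration of the reported hyperparameters; the learning rates and the model definitions are placeholders rather than the authors' settings (those are given in Table 8 of the paper).

```python
import torch

# Reported hyperparameters; per-dataset learning rates and model sizes are
# placeholders (see Table 8 of the paper), not values taken from the release.
config = {
    "batch_size": 64,            # B
    "loss_tradeoff": 10.0,       # λ
    "queue_length": 8,           # maximum length L of the reference point queue Qr
    "warmup_rho": {"Huffpost": 0.6, "fMoW": 0.6, "Arxiv": 0.2, "Yearbook": 0.2,
                   "RMNIST": 0.2, "2-Moons": 0.2, "ONP": 0.2},
    "diffusion_steps": 1000,     # maximum diffusion step S
    "diffusion_batch_size": 32,  # M
}

# Placeholder networks standing in for the task model and the conditional
# diffusion model; the real architectures are dataset-specific.
task_model = torch.nn.Linear(128, 10)
diffusion_model = torch.nn.Linear(256, 256)

# Adam (first-moment momentum 0.9) for the task model, AdamW for the conditional
# diffusion model; the learning rates here are illustrative assumptions.
task_optimizer = torch.optim.Adam(task_model.parameters(), lr=1e-3, betas=(0.9, 0.999))
diffusion_optimizer = torch.optim.AdamW(diffusion_model.parameters(), lr=1e-4)
```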
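Similarly, the 9:1 per-domain split described under Dataset Splits can be reproduced with a plain random partition. This is a minimal sketch assuming PyTorch datasets indexed by source domain; it is not taken from the released code, and the fixed seed is an assumption for repeatability.

```python
import torch
from torch.utils.data import random_split

def split_domain(domain_dataset, train_ratio=0.9, seed=0):
    """Randomly split one source domain into training/validation sets (9:1 by default)."""
    n_total = len(domain_dataset)
    n_train = int(train_ratio * n_total)
    n_val = n_total - n_train
    generator = torch.Generator().manual_seed(seed)  # fixed seed (assumption)
    return random_split(domain_dataset, [n_train, n_val], generator=generator)

# Usage: apply the split independently to every source domain, e.g.
# train_set, val_set = split_domain(source_domains[t])
```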