DiffusionNAG: Predictor-guided Neural Architecture Generation with Diffusion Models

Authors: Sohyun An, Hayeon Lee, Jaehyeong Jo, Seanie Lee, Sung Ju Hwang

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate the effectiveness of DiffusionNAG through extensive experiments in two predictor-based NAS scenarios: Transferable NAS and Bayesian Optimization (BO)-based NAS. DiffusionNAG achieves superior performance with speedups of up to 35× when compared to the baselines on Transferable NAS benchmarks. Furthermore, when integrated into a BO-based algorithm, DiffusionNAG outperforms existing BO-based NAS approaches, particularly in the large MobileNetV3 search space on the ImageNet 1K dataset. Code is available at https://github.com/CownowAn/DiffusionNAG.
Researcher Affiliation | Collaboration | KAIST, DeepAuto.ai, Seoul, South Korea; {sohyunan, hayeon926, harryjo97, lsnfamily02, sjhwang82}@kaist.ac.kr
Pseudocode | Yes | Algorithm 1: General Bayesian Optimization NAS and Algorithm 2: Bayesian Optimization with DiffusionNAG are provided in Appendix C.6. (A minimal sketch of such a BO loop is given after this table.)
Open Source Code | Yes | Code is available at https://github.com/CownowAn/DiffusionNAG.
Open Datasets | Yes | We evaluate our approach on four datasets following Lee et al. (2021a): CIFAR-10 (Krizhevsky, 2009), CIFAR-100 (Krizhevsky, 2009), Aircraft (Maji et al., 2013), and Oxford-IIIT Pets (Parkhi et al., 2012).
Dataset Splits | Yes | For CIFAR-10 and CIFAR-100, we use the predefined splits from the NAS-Bench-201 benchmark. For Aircraft and Oxford-IIIT Pets, we create random validation and test splits by dividing the test set into two equal-sized subsets. (A sketch of this split is given after this table.)
Hardware Specification | Yes | The training process required 21.33 GPU hours (MBv3) and 3.43 GPU hours (NB201) on Tesla V100-SXM2, respectively. Our generation process, with a sampling batch size of 256, takes up to 2.02 GPU minutes on Tesla V100-SXM2 to sample one batch.
Software Dependencies | No | The paper mentions software components, but it does not provide specific version information for its dependencies.
Experiment Setup | Yes | Following the training pipeline presented in Dong & Yang (2020b), we train each architecture using SGD with Nesterov momentum and employ the cross-entropy loss for 200 epochs. For regularization, we set the weight decay to 0.0005 and decay the learning rate from 0.1 to 0 using a cosine annealing schedule (Loshchilov & Hutter, 2016). We maintain consistency by utilizing the same set of hyperparameters across different datasets. (A hedged PyTorch sketch of this configuration is given after this table.)
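
As a companion to the Pseudocode row, the following is a minimal, hypothetical Python sketch of a generic Bayesian-optimization NAS loop in which a generator proposes candidate architectures and a surrogate predictor scores them. The function names sample_architectures, fit_predictor, and train_and_evaluate are placeholders, not the paper's actual API; the paper's Algorithms 1 and 2 in Appendix C.6 are the authoritative description.

```python
# Hypothetical sketch of a BO-style NAS loop with a generative proposal step.
# All callables passed in are placeholders, not the paper's API.
import random


def bo_nas_loop(sample_architectures, fit_predictor, train_and_evaluate,
                n_rounds=10, n_candidates=256, n_select=5):
    """Generic BO NAS: propose -> score with surrogate -> evaluate -> refit."""
    history = []  # list of (architecture, true_score) pairs
    for _ in range(n_rounds):
        # 1) Propose candidate architectures (e.g., from a conditional generator).
        candidates = sample_architectures(n_candidates)
        # 2) Score candidates with a surrogate predictor fit on evaluated pairs.
        predictor = fit_predictor(history) if history else None
        if predictor is None:
            selected = random.sample(candidates, n_select)
        else:
            selected = sorted(candidates, key=predictor, reverse=True)[:n_select]
        # 3) Evaluate the selected architectures by actually training them.
        for arch in selected:
            history.append((arch, train_and_evaluate(arch)))
    # Return the best architecture found so far.
    return max(history, key=lambda pair: pair[1])
```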
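For the Dataset Splits row, dividing the Aircraft and Oxford-IIIT Pets test sets into two equal-sized random subsets could be reproduced roughly as sketched below. torch.utils.data.random_split is a standard PyTorch utility; the seed value is an arbitrary illustration choice, not taken from the paper.

```python
# Sketch: split a test set into equal-sized validation and test halves at random.
# The seed is an arbitrary choice for illustration, not the paper's value.
import torch
from torch.utils.data import random_split


def split_test_set(test_dataset, seed=0):
    n = len(test_dataset)
    val_size = n // 2
    test_size = n - val_size
    generator = torch.Generator().manual_seed(seed)
    val_subset, test_subset = random_split(
        test_dataset, [val_size, test_size], generator=generator)
    return val_subset, test_subset
```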
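The Experiment Setup row maps onto a standard PyTorch training configuration. The sketch below is only a minimal rendering of the quoted hyperparameters (SGD with Nesterov momentum, weight decay 0.0005, cross-entropy loss, 200 epochs, cosine annealing from 0.1 to 0); the momentum value of 0.9 is a common default assumed here rather than stated in the quote, and the model and data loader are left abstract.

```python
# Minimal PyTorch sketch of the quoted training recipe.
# momentum=0.9 and the model/loader objects are assumptions for illustration.
import torch
import torch.nn as nn


def train(model, train_loader, device="cuda", epochs=200):
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                                nesterov=True, weight_decay=5e-4)
    # Cosine annealing decays the learning rate from 0.1 to 0 over all epochs.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs, eta_min=0.0)
    model.to(device)
    for _ in range(epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()
    return model
```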