Towards Neural Architecture Search through Hierarchical Generative Modeling
Authors: Lichuan Xiang, Łukasz Dudziak, Mohamed S. Abdelfattah, Abhinav Mehrotra, Nicholas Donald Lane, Hongkai Wen
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method on the typical datasets: CIFAR-10, CIFAR-100 (Krizhevsky, 2009) and ImageNet-1k (Deng et al., 2009), abbreviated C-10, C-100 and IN-1k, respectively. To test the generalizability of our method to less-studied tasks, we also evaluate the recently proposed NAS-Bench-360 (NB360; Tu et al., 2022). |
| Researcher Affiliation | Collaboration | 1. University of Warwick, UK; 2. Samsung AI Centre Cambridge, UK; 3. Cornell University, US; 4. University of Cambridge, UK; 5. Flower Labs, UK. |
| Pseudocode | Yes | Algorithm 1 HL-Evo |
| Open Source Code | No | Code will be available upon acceptance. |
| Open Datasets | Yes | We evaluate our method on the typical datasets: CIFAR-10, CIFAR-100 (Krizhevsky, 2009) and ImageNet-1k (Deng et al., 2009), abbreviated C-10, C-100 and IN-1k, respectively. To test the generalizability of our method to less-studied tasks, we also evaluate the recently proposed NAS-Bench-360 (NB360; Tu et al., 2022). |
| Dataset Splits | Yes | We report the best model we obtained from validation. |
| Hardware Specification | Yes | The total pretraining cost is around 30 GPU hours on a single V100 GPU |
| Software Dependencies | Yes | Notably, the network employs the dopri5 ODE solver (Hairer et al., 1993) with both absolute and relative tolerance set to 1e-5. For training the sequence generator, GPT-Neo-125M (Black et al., 2021) is employed as the foundation through pretrained checkpoints. The model consists of 12 transformer blocks (Vaswani et al., 2017) using GELU (Hendrycks & Gimpel, 2016) and Layer Norm (Ba et al., 2016). |
| Experiment Setup | Yes | The G-VAE is configured with four graph convolution layers (Kipf & Welling, 2017), utilizing a hidden state of size 512 and projecting to an output latent space of size 256 for the metrics predictor. This metrics predictor is structured as a two-layer MLP with ReLU activation. The first layer serves as a hidden state, mirroring the size of the latent space at 256, and the subsequent layer produces four conditions, aligning with our target outcomes. In terms of training, the metrics predictor undergoes joint training with the G-VAE. The AdamW optimizer (Loshchilov & Hutter, 2019), combined with a cosine annealing schedule (Loshchilov & Hutter, 2017), is employed, initiating with a learning rate of 1e-3 and decaying to a minimum value of 0. A weight decay is also incorporated, set at 5e-4. The training procedure is constrained by a maximum of 500,000 steps and incorporates an early stopping mechanism. This mechanism ceases training if a reduction in loss is not observed over a span of 10 epochs. The Continuous Conditional Normalizing Flow (CCNF) incorporates nine Concat+Squash+Linear layers, each having a hidden dimension of 512. The input layer is structured to accommodate a latent feature size of 256 and is engineered to project these latent features into a Gaussian distribution, preserving the identical dimensional size of 256. For training, we utilize 400,000 unlabeled graphs as our dataset. These graphs generate latent features through G-VAE, which are further predicted into ZC vectors by the metrics predictor. The CCNF is trained with a fixed learning rate of 1e-3 and a batch size of 256. Weight decay is set at 0.01, serving as the default value for the Adam optimizer (Kingma & Ba, 2015). The latent features are then transformed into a standard Gaussian distribution, which acts as the prior distribution by minimizing log-likelihood. Notably, the network employs the dopri5 ODE solver (Hairer et al., 1993) with both absolute and relative tolerance set to 1e-5. For training the sequence generator, GPT-Neo-125M (Black et al., 2021) is employed as the foundation through pretrained checkpoints. The model consists of 12 transformer blocks (Vaswani et al., 2017) using GELU (Hendrycks & Gimpel, 2016) and Layer Norm (Ba et al., 2016). We finetune this model with a consistent learning rate of 1e-3, utilizing the 60,000 networks and ZC vectors gathered from the HL-Evo procedure and maintaining a batch size of 1. To augment the dataset and mitigate the risk of overfitting, we enhance our condition token by randomly dropping tokens from the set that includes Param, Flops, and ZC values. |
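The Experiment Setup row above describes a four-layer GCN encoder (hidden size 512, latent size 256) trained jointly with a two-layer ReLU metrics predictor under AdamW (lr 1e-3, weight decay 5e-4) and cosine annealing. Since the paper's code is not released, the sketch below is only a minimal PyTorch rendering of that stated configuration; the class names, the dense-adjacency GCN layer, the mean pooling, and the input feature size of 16 are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseGCNLayer(nn.Module):
    """Single graph convolution (Kipf & Welling style) over a dense,
    pre-normalized adjacency matrix A_hat: H' = relu(A_hat @ H @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, h, a_hat):
        return F.relu(a_hat @ self.lin(h))


class GVAEEncoder(nn.Module):
    """Four graph-conv layers with hidden size 512, projected to a
    256-dimensional latent space via mean/log-variance heads."""
    def __init__(self, node_feat_dim, hidden=512, latent=256, num_layers=4):
        super().__init__()
        dims = [node_feat_dim] + [hidden] * num_layers
        self.gcn = nn.ModuleList(
            DenseGCNLayer(dims[i], dims[i + 1]) for i in range(num_layers)
        )
        self.mu = nn.Linear(hidden, latent)
        self.logvar = nn.Linear(hidden, latent)

    def forward(self, node_feats, a_hat):
        h = node_feats
        for layer in self.gcn:
            h = layer(h, a_hat)
        h = h.mean(dim=-2)  # pool node features to a graph-level embedding (assumed pooling)
        return self.mu(h), self.logvar(h)


class MetricsPredictor(nn.Module):
    """Two-layer MLP with ReLU: hidden size mirrors the latent size (256),
    output is the four target conditions (e.g. Params/FLOPs/ZC metrics)."""
    def __init__(self, latent=256, num_conditions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent, latent),
            nn.ReLU(),
            nn.Linear(latent, num_conditions),
        )

    def forward(self, z):
        return self.net(z)


# Joint optimization as stated in the setup: AdamW, lr 1e-3 decaying to 0 with
# cosine annealing, weight decay 5e-4, at most 500k steps (early stopping and
# the VAE reconstruction/KL losses are omitted from this sketch).
encoder = GVAEEncoder(node_feat_dim=16)  # 16 is an arbitrary placeholder
predictor = MetricsPredictor()
params = list(encoder.parameters()) + list(predictor.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-3, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=500_000, eta_min=0.0
)
```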
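The same row specifies a CCNF of nine Concat+Squash+Linear layers with hidden dimension 512 over the 256-dimensional latents, trained with Adam (lr 1e-3, weight decay 0.01, batch size 256) and integrated with the dopri5 solver at absolute and relative tolerances of 1e-5. The sketch below assumes the FFJORD-style ConcatSquashLinear gating and the `torchdiffeq` package; the conditioning inputs and the log-det-Jacobian bookkeeping needed for the actual likelihood objective are omitted.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint  # assumed dependency providing the dopri5 solver

class ConcatSquashLinear(nn.Module):
    """FFJORD-style layer: a linear map gated and shifted by the ODE time t."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)
        self.gate = nn.Linear(1, out_dim)
        self.shift = nn.Linear(1, out_dim, bias=False)

    def forward(self, t, x):
        t = t.view(1, 1)
        return self.lin(x) * torch.sigmoid(self.gate(t)) + self.shift(t)


class CNFDynamics(nn.Module):
    """Nine ConcatSquashLinear layers, hidden dim 512, acting on the
    256-dimensional latent features (conditioning omitted in this sketch)."""
    def __init__(self, latent=256, hidden=512, num_layers=9):
        super().__init__()
        dims = [latent] + [hidden] * (num_layers - 1) + [latent]
        self.layers = nn.ModuleList(
            ConcatSquashLinear(dims[i], dims[i + 1]) for i in range(num_layers)
        )

    def forward(self, t, x):
        h = x
        for i, layer in enumerate(self.layers):
            h = layer(t, h)
            if i < len(self.layers) - 1:
                h = torch.tanh(h)
        return h


dynamics = CNFDynamics()
optimizer = torch.optim.Adam(dynamics.parameters(), lr=1e-3, weight_decay=0.01)

# Map a batch of latents toward the standard Gaussian base distribution with
# dopri5 at atol = rtol = 1e-5, as stated. A full CNF would also integrate the
# trace of the Jacobian alongside the state to evaluate the log-likelihood.
z0 = torch.randn(256, 256)          # batch size 256, latent size 256
t_span = torch.tensor([0.0, 1.0])
z1 = odeint(dynamics, z0, t_span, method="dopri5", rtol=1e-5, atol=1e-5)[-1]
```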
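Finally, the sequence generator is described as pretrained GPT-Neo-125M fine-tuned at a constant learning rate of 1e-3 with batch size 1, with condition tokens (Param, FLOPs, ZC values) randomly dropped as data augmentation. The snippet below shows how that could look with Hugging Face `transformers`; the checkpoint id, prompt format, architecture serialization, and drop probability are all assumptions, since the paper's actual tokenization scheme is not given in the table.

```python
import random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch only: checkpoint id, prompt layout and drop probability are assumed.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125m")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # constant LR, batch size 1

def build_prompt(conditions, arch_sequence, drop_prob=0.3):
    """Randomly drop condition tokens (Param / FLOPs / ZC values) to augment
    the data and reduce overfitting, then append the architecture sequence."""
    kept = [f"{k}={v}" for k, v in conditions.items() if random.random() > drop_prob]
    return " ".join(kept) + " | " + arch_sequence

# One hypothetical training step at batch size 1.
example_conditions = {"param": 3.1e6, "flops": 4.2e8, "zc": 0.87}
text = build_prompt(example_conditions, "node0->node1:conv3x3 node1->node2:skip")
batch = tokenizer(text, return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```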