CATE: Computation-aware Neural Architecture Encoding with Transformers

Authors: Shen Yan, Kaiqiang Song, Fei Liu, Mi Zhang

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We compare CATE with eleven encodings under three major encoding-dependent NAS subroutines in both small and large search spaces. Our experiments show that CATE is beneficial to the downstream search, especially in the large search space. Moreover, the outside search space experiment demonstrates its superior generalization ability beyond the search space on which it was trained.
Researcher Affiliation | Collaboration | Michigan State University; University of Central Florida; Tencent AI Lab.
Pseudocode | Yes | Algorithm 1: Floyd Algorithm (a hedged sketch of the classic Floyd routine follows the table).
Open Source Code | Yes | Our code is available at: https://github.com/MSU-MLSys-Lab/CATE.
Open Datasets | Yes | NAS-Bench-101: the NAS-Bench-101 search space (Ying et al., 2019) consists of 423,624 architectures. NAS-Bench-301 (Siems et al., 2020) is a new surrogate benchmark on the DARTS (Liu et al., 2019a) search space.
Dataset Splits | Yes | The NAS-Bench-101 search space (Ying et al., 2019) consists of 423,624 architectures. Each architecture has its pre-computed validation and test accuracies on CIFAR-10. ... We split the dataset into 95% training and 5% held-out test sets for pre-training. (A sketch of this split follows the table.)
Hardware Specification | Yes | We trained our model with batch size of 1024 on NVIDIA Quadro RTX 8000 GPUs.
Software Dependencies | No | The paper mentions "AdamW" as an optimizer but does not specify software names with version numbers for libraries or frameworks used (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | We use an L = 12 layer Transformer encoder and an Lc = 24 layer cross-attention Transformer encoder, each with 8 attention heads. The hidden state size is dh = dc = 64 for all the encoders. The hidden dimension is dff = 64 for all the feed-forward layers. We employ AdamW (Loshchilov & Hutter, 2019) as our optimizer. The initial learning rate is 1e-3. The momentum parameters are set to 0.9 and 0.999. The weight decay is 0.01 for regular layers and 0 for dropout and layer normalization. ... Each queried architecture is trained for 50 epochs with a batch size of 96, using 32 initial channels and 8 cell layers. (A sketch of the encoder and optimizer configuration follows the table.)
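
The pseudocode row points to Algorithm 1, "Floyd Algorithm". As a point of reference only, below is a minimal Python sketch of the classic Floyd(-Warshall) all-pairs shortest-path routine; how the paper instantiates it (what the nodes and edge weights are in CATE's setting) is not specified in the table above, so the function and graph here are purely illustrative.

```python
# Minimal sketch of the classic Floyd(-Warshall) all-pairs shortest-path
# algorithm. Purely illustrative: the nodes, edge weights, and the exact role
# of "Algorithm 1 (Floyd Algorithm)" in CATE should be checked in the paper.
import math

def floyd(dist):
    """dist: n x n matrix of edge weights, with math.inf where no edge exists.
    Returns the matrix of shortest-path distances between all node pairs."""
    n = len(dist)
    d = [row[:] for row in dist]  # copy so the input matrix is not modified
    for k in range(n):            # allow node k as an intermediate hop
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

# Tiny example on a 3-node graph.
inf = math.inf
graph = [[0, 3, inf],
         [3, 0, 1],
         [inf, 1, 0]]
print(floyd(graph))  # distance 0 -> 2 becomes 4 via node 1
```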
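
For the dataset-split row, the sketch below shows a seeded 95% / 5% split over architecture identifiers. The placeholder identifiers are an assumption for illustration; in practice the 423,624 NAS-Bench-101 architectures would be enumerated through the benchmark's own API (e.g., the nasbench package), which is not shown here.

```python
# Hedged sketch of the 95% training / 5% held-out split used for pre-training.
# The placeholder architecture identifiers are illustrative; the real list
# would come from the NAS-Bench-101 benchmark.
import random

def split_train_test(items, train_frac=0.95, seed=0):
    """Shuffle items reproducibly, then split into train / held-out test sets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_frac)
    return items[:cut], items[cut:]

arch_ids = [f"arch_{i}" for i in range(423624)]   # placeholder identifiers
train_set, test_set = split_train_test(arch_ids)
print(len(train_set), len(test_set))  # roughly 402k training, 21k held-out
```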
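
For the experiment-setup row, the PyTorch sketch below wires together the reported encoder size and AdamW settings. The use of nn.TransformerEncoder as a stand-in, the parameter-grouping heuristic for weight decay, and the placeholder input shape are assumptions; only the numbers (12 layers, 8 heads, hidden and feed-forward sizes of 64, learning rate 1e-3, betas (0.9, 0.999), weight decay 0.01 except for normalization and bias parameters, batch size 1024) come from the table. The Lc = 24 layer cross-attention encoder is omitted.

```python
# Hedged sketch of the reported encoder and optimizer settings in PyTorch.
# The module choice and the decay/no-decay grouping are assumptions; the
# numeric hyperparameters are the ones quoted in the table above.
import torch
from torch import nn

d_model, d_ff, n_heads, n_layers = 64, 64, 8, 12

encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=n_heads, dim_feedforward=d_ff, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)

# Weight decay of 0.01 on regular weights, 0 on normalization and bias terms.
decay, no_decay = [], []
for name, param in encoder.named_parameters():
    if param.ndim == 1 or name.endswith(".bias"):  # LayerNorm weights, biases
        no_decay.append(param)
    else:
        decay.append(param)

optimizer = torch.optim.AdamW(
    [{"params": decay, "weight_decay": 0.01},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=1e-3, betas=(0.9, 0.999))

# Dummy forward pass: a batch of 1024 sequences of length 7 (placeholder length).
x = torch.randn(1024, 7, d_model)
print(encoder(x).shape)  # torch.Size([1024, 7, 64])
```

Excluding normalization and bias parameters from weight decay is a common AdamW convention and matches the table's statement that the decay is 0 for dropout and layer normalization.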