CATE: Computation-aware Neural Architecture Encoding with Transformers
Authors: Shen Yan, Kaiqiang Song, Fei Liu, Mi Zhang
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare CATE with eleven encodings under three major encoding-dependent NAS subroutines in both small and large search spaces. Our experiments show that CATE is beneficial to the downstream search, especially in the large search space. Moreover, the outside search space experiment demonstrates its superior generalization ability beyond the search space on which it was trained. |
| Researcher Affiliation | Collaboration | 1Michigan State University 2University of Central Florida 3Tencent AI Lab. |
| Pseudocode | Yes | Algorithm 1 Floyd Algorithm |
| Open Source Code | Yes | Our code is available at: https://github.com/MSU-MLSys-Lab/CATE. |
| Open Datasets | Yes | NAS-Bench-101. The NAS-Bench-101 search space (Ying et al., 2019) consists of 423,624 architectures. NAS-Bench-301 (Siems et al., 2020) is a new surrogate benchmark on the DARTS (Liu et al., 2019a) search space. |
| Dataset Splits | Yes | NAS-Bench-101. The NAS-Bench-101 search space (Ying et al., 2019) consists of 423,624 architectures. Each architecture has its pre-computed validation and test accuracies on CIFAR-10. ... We split the dataset into 95% training and 5% held-out test sets for pre-training. |
| Hardware Specification | Yes | We trained our model with batch size of 1024 on NVIDIA Quadro RTX 8000 GPUs. |
| Software Dependencies | No | The paper mentions "AdamW" as its optimizer but does not specify software library or framework names with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We use an L = 12 layer Transformer encoder and an L_c = 24 layer cross-attention Transformer encoder, each with 8 attention heads. The hidden state size is d_h = d_c = 64 for all the encoders. The hidden dimension is d_ff = 64 for all the feed-forward layers. We employ AdamW (Loshchilov & Hutter, 2019) as our optimizer. The initial learning rate is 1e-3. The momentum parameters are set to 0.9 and 0.999. The weight decay is 0.01 for regular layers and 0 for dropout and layer normalization. ... Each queried architecture is trained for 50 epochs with a batch size of 96, using 32 initial channels and 8 cell layers. |
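
The hyperparameters quoted in the Experiment Setup row can be expressed as a short configuration sketch. The snippet below is a hedged illustration in PyTorch, assuming standard `nn.TransformerEncoder` and `torch.optim.AdamW` building blocks; it reproduces the quoted values (12 encoder layers, 8 heads, hidden and feed-forward size 64, AdamW with learning rate 1e-3, betas 0.9/0.999, weight decay 0.01 except for normalization/bias parameters) but is not the authors' released code, which is linked in the Open Source Code row.

```python
# Illustrative sketch of the pre-training setup quoted above (not the authors' code).
import torch
import torch.nn as nn

d_model, d_ff, n_heads, n_layers = 64, 64, 8, 12  # values from the quoted setup

# 12-layer Transformer encoder with 8 attention heads; the 24-layer cross-attention
# encoder described in the paper is omitted here for brevity.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=n_heads, dim_feedforward=d_ff, batch_first=True
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)

# Weight decay of 0.01 on regular weights and 0 on LayerNorm/bias parameters,
# mirroring "0.01 for regular layers and 0 for dropout and layer normalization".
decay, no_decay = [], []
for name, param in encoder.named_parameters():
    if param.ndim == 1 or "norm" in name.lower():
        no_decay.append(param)
    else:
        decay.append(param)

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 0.01},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=1e-3,
    betas=(0.9, 0.999),
)
```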