CATE: Computation-aware Neural Architecture Encoding with Transformers

Authors: Shen Yan, Kaiqiang Song, Fei Liu, Mi Zhang

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We compare CATE with eleven encodings under three major encoding-dependent NAS subroutines in both small and large search spaces. Our experiments show that CATE is beneficial to the downstream search, especially in the large search space. Moreover, the outside search space experiment demonstrates its superior generalization ability beyond the search space on which it was trained.
Researcher Affiliation | Collaboration | Michigan State University; University of Central Florida; Tencent AI Lab.
Pseudocode | Yes | Algorithm 1: Floyd Algorithm (a hedged sketch of the classic Floyd routine follows the table).
Open Source Code | Yes | Our code is available at: https://github.com/MSU-MLSys-Lab/CATE.
Open Datasets | Yes | NAS-Bench-101: the NAS-Bench-101 search space (Ying et al., 2019) consists of 423,624 architectures. NAS-Bench-301 (Siems et al., 2020) is a new surrogate benchmark on the DARTS (Liu et al., 2019a) search space.
Dataset Splits | Yes | The NAS-Bench-101 search space (Ying et al., 2019) consists of 423,624 architectures. Each architecture has its pre-computed validation and test accuracies on CIFAR-10. ... We split the dataset into 95% training and 5% held-out test sets for pre-training. (A sketch of this split follows the table.)
Hardware Specification | Yes | We trained our model with batch size of 1024 on NVIDIA Quadro RTX 8000 GPUs.
Software Dependencies | No | The paper mentions "AdamW" as an optimizer but does not specify software names with version numbers for libraries or frameworks used (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | We use an L = 12 layer Transformer encoder and an Lc = 24 layer cross-attention Transformer encoder, each with 8 attention heads. The hidden state size is dh = dc = 64 for all the encoders. The hidden dimension is dff = 64 for all the feed-forward layers. We employ AdamW (Loshchilov & Hutter, 2019) as our optimizer. The initial learning rate is 1e-3. The momentum parameters are set to 0.9 and 0.999. The weight decay is 0.01 for regular layers and 0 for dropout and layer normalization. ... Each queried architecture is trained for 50 epochs with a batch size of 96, using 32 initial channels and 8 cell layers. (A sketch of the encoder and optimizer configuration follows the table.)
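
The pseudocode row points to Algorithm 1, "Floyd Algorithm". As a point of reference only, below is a minimal Python sketch of the classic Floyd(-Warshall) all-pairs shortest-path routine; how the paper instantiates it (what the nodes and edge weights are in CATE's setting) is not specified in the table above, so the function and graph here are purely illustrative.

```python
# Minimal sketch of the classic Floyd(-Warshall) all-pairs shortest-path
# algorithm. Purely illustrative: the nodes, edge weights, and the exact role
# of "Algorithm 1 (Floyd Algorithm)" in CATE should be checked in the paper.
import math

def floyd(dist):
    """dist: n x n matrix of edge weights, with math.inf where no edge exists.
    Returns the matrix of shortest-path distances between all node pairs."""
    n = len(dist)
    d = [row[:] for row in dist]  # copy so the input matrix is not modified
    for k in range(n):            # allow node k as an intermediate hop
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

# Tiny example on a 3-node graph.
inf = math.inf
graph = [[0, 3, inf],
         [3, 0, 1],
         [inf, 1, 0]]
print(floyd(graph))  # distance 0 -> 2 becomes 4 via node 1
```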
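
For the dataset-split row, the sketch below shows a seeded 95% / 5% split over architecture identifiers. The placeholder identifiers are an assumption for illustration; in practice the 423,624 NAS-Bench-101 architectures would be enumerated through the benchmark's own API (e.g., the nasbench package), which is not shown here.

```python
# Hedged sketch of the 95% training / 5% held-out split used for pre-training.
# The placeholder architecture identifiers are illustrative; the real list
# would come from the NAS-Bench-101 benchmark.
import random

def split_train_test(items, train_frac=0.95, seed=0):
    """Shuffle items reproducibly, then split into train / held-out test sets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_frac)
    return items[:cut], items[cut:]

arch_ids = [f"arch_{i}" for i in range(423624)]   # placeholder identifiers
train_set, test_set = split_train_test(arch_ids)
print(len(train_set), len(test_set))  # roughly 402k training, 21k held-out
```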
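
For the experiment-setup row, the PyTorch sketch below wires together the reported encoder size and AdamW settings. The use of nn.TransformerEncoder as a stand-in, the parameter-grouping heuristic for weight decay, and the placeholder input shape are assumptions; only the numbers (12 layers, 8 heads, hidden and feed-forward sizes of 64, learning rate 1e-3, betas (0.9, 0.999), weight decay 0.01 except for normalization and bias parameters, batch size 1024) come from the table. The Lc = 24 layer cross-attention encoder is omitted.

```python
# Hedged sketch of the reported encoder and optimizer settings in PyTorch.
# The module choice and the decay/no-decay grouping are assumptions; the
# numeric hyperparameters are the ones quoted in the table above.
import torch
from torch import nn

d_model, d_ff, n_heads, n_layers = 64, 64, 8, 12

encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=n_heads, dim_feedforward=d_ff, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)

# Weight decay of 0.01 on regular weights, 0 on normalization and bias terms.
decay, no_decay = [], []
for name, param in encoder.named_parameters():
    if param.ndim == 1 or name.endswith(".bias"):  # LayerNorm weights, biases
        no_decay.append(param)
    else:
        decay.append(param)

optimizer = torch.optim.AdamW(
    [{"params": decay, "weight_decay": 0.01},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=1e-3, betas=(0.9, 0.999))

# Dummy forward pass: a batch of 1024 sequences of length 7 (placeholder length).
x = torch.randn(1024, 7, d_model)
print(encoder(x).shape)  # torch.Size([1024, 7, 64])
```

Excluding normalization and bias parameters from weight decay is a common AdamW convention and matches the table's statement that the decay is 0 for dropout and layer normalization.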