Understanding Architectures Learnt by Cell-based Neural Architecture Search

Authors: Yao Shu, Wei Wang, Shaofeng Cai

Venue: ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We sample and train variants of popular NAS architectures with random connections. Comparing randomly connected variants with the popular NAS architectures, we find that architectures with wider and shallower cells indeed converge faster and are therefore selected by NAS algorithms (Section 4.1). To understand why the wider and shallower cell contributes to faster convergence, we further investigate the loss landscape and gradient variance of popular NAS architectures and their variants via both empirical experiments (Section 4.2) and theoretical analysis (Section 4.3). (A sketch of sampling randomly connected cell variants appears after this table.)
Researcher Affiliation | Academia | Yao Shu, Wei Wang & Shaofeng Cai, School of Computing, National University of Singapore, {shuyao,wangwei,shaofeng}@comp.nus.edu.sg
Pseudocode | No | The paper describes its methods in text and mathematical formulations but does not include pseudocode or algorithm blocks.
Open Source Code | No | The paper does not include any explicit statement or link indicating the release of source code for the described methodology.
Open Datasets | Yes | Our experiments are conducted on CIFAR-10/100 (Krizhevsky et al., 2009) and Tiny-ImageNet-200.
Dataset Splits | Yes | CIFAR-10/100 contains 50,000 training images and 10,000 test images of 32×32 pixels in 10 and 100 classes respectively. Tiny-ImageNet-200 consists of 100,000 training images, 10,000 validation images and 10,000 test images in 200 classes. (See the dataset-loading sketch after this table.)
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments.
Software Dependencies | No | The paper mentions optimization methods such as stochastic gradient descent (SGD) but does not list specific software dependencies with version numbers (e.g., libraries or frameworks).
Experiment Setup | Yes | In the default training setting, we apply stochastic gradient descent (SGD) with learning rate 0.025, momentum 0.9, weight decay 3×10⁻⁴ and batch size 80 to train the models for 600 epochs on CIFAR-10/100 and 300 epochs on Tiny-ImageNet-200 to ensure convergence. The learning rate is gradually annealed to zero following the standard cosine annealing schedule. (A hedged training-loop sketch with these hyperparameters appears after this table.)
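The "randomly connected variants" in the Research Type row come from sampling cell topologies at random. The sketch below is a minimal illustration, not the authors' code, of one way such a DARTS-style cell could be sampled and its depth measured; the node count, the two-inputs-per-node rule, and all function names are assumptions made for this example.

```python
# Hedged sketch, not the authors' implementation: sample a cell topology in a
# DARTS-like search space where each intermediate node draws two inputs from
# earlier nodes, then measure the cell depth (longest path from a cell input).
import random

def sample_random_cell(num_intermediate=4, inputs_per_node=2, seed=None):
    """Return {node_index: [predecessor indices]}; nodes 0 and 1 are the cell inputs."""
    rng = random.Random(seed)
    return {node: rng.sample(range(node), inputs_per_node)
            for node in range(2, 2 + num_intermediate)}

def cell_depth(edges):
    """Depth of a node = 1 + max predecessor depth; cell inputs have depth 0."""
    depth = {0: 0, 1: 0}
    for node in sorted(edges):
        depth[node] = 1 + max(depth[p] for p in edges[node])
    # The cell output concatenates all intermediate nodes, so report the maximum.
    return max(depth[node] for node in edges)

if __name__ == "__main__":
    depths = [cell_depth(sample_random_cell(seed=s)) for s in range(1000)]
    print("mean depth over 1000 random cells:", sum(depths) / len(depths))
```

In this framing, a "shallower" cell is one with a smaller `cell_depth`; the paper's finding is that NAS algorithms tend to select the wider, shallower topologies because they converge faster.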
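For the Dataset Splits row, a minimal sketch of loading the CIFAR-10 splits with torchvision and checking the quoted 50,000/10,000 train/test sizes. The data path and transform are placeholders; Tiny-ImageNet-200 is not bundled with torchvision and would need to be downloaded separately (e.g., then read with torchvision.datasets.ImageFolder).

```python
# Hedged sketch: verify the CIFAR-10 split sizes quoted in the table.
import torchvision
import torchvision.transforms as T

transform = T.Compose([T.ToTensor()])
train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                          download=True, transform=transform)
test_set = torchvision.datasets.CIFAR10(root="./data", train=False,
                                         download=True, transform=transform)
assert len(train_set) == 50_000 and len(test_set) == 10_000
print(len(train_set), "train /", len(test_set), "test images,",
      len(train_set.classes), "classes")
```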
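The Experiment Setup row quotes concrete hyperparameters but the paper releases no code, so the following is only a hedged sketch of how they might map onto a standard PyTorch SGD plus cosine-annealing training loop. `model`, `train_loader` (built with batch size 80), and the cross-entropy loss are placeholders, not the authors' setup.

```python
# Hedged sketch of the quoted default training setting: SGD with lr 0.025,
# momentum 0.9, weight decay 3e-4, batch size 80, and a cosine annealing
# schedule that decays the learning rate to zero over training.
import torch
import torch.nn as nn

def train(model, train_loader, epochs=600, device="cuda"):
    model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.025,
                                momentum=0.9, weight_decay=3e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer,
                                                           T_max=epochs,
                                                           eta_min=0.0)
    for _ in range(epochs):
        model.train()
        for images, labels in train_loader:  # loader assumed to use batch_size=80
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()  # anneal the learning rate once per epoch
```

Per the quoted setting, `epochs` would be 600 for CIFAR-10/100 and 300 for Tiny-ImageNet-200.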