Understanding Architectures Learnt by Cell-based Neural Architecture Search
Authors: Yao Shu, Wei Wang, Shaofeng Cai
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We sample and train variants of popular NAS architectures with random connections. Comparing randomly connected variants with the popular NAS architectures, we find that architectures with wider and shallower cells indeed converge faster so that they are selected by NAS algorithms (Section 4.1). To understand why the wider and shallower cell contributes to faster convergence, we further investigate the loss landscape and gradient variance of popular NAS architectures and their variants via both empirical experiments (Section 4.2) and theoretical analysis (Section 4.3). |
| Researcher Affiliation | Academia | Yao Shu, Wei Wang & Shaofeng Cai, School of Computing, National University of Singapore. {shuyao,wangwei,shaofeng}@comp.nus.edu.sg |
| Pseudocode | No | The paper describes methods in text and mathematical formulations but does not include pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not include any explicit statements or links indicating the release of source code for the described methodology. |
| Open Datasets | Yes | Our experiments are conducted on CIFAR-10/100 (Krizhevsky et al., 2009) and Tiny-ImageNet-200. |
| Dataset Splits | Yes | CIFAR-10/100 contains 50,000 training images and 10,000 test images of 32×32 pixels in 10 and 100 classes respectively. Tiny-ImageNet-200 consists of 100,000 training images, 10,000 validation images and 10,000 test images in 200 classes. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments. |
| Software Dependencies | No | The paper mentions optimization methods like stochastic gradient descent (SGD) but does not list specific software dependencies with version numbers (e.g., libraries, frameworks). |
| Experiment Setup | Yes | In the default training setting, we apply stochastic gradient descent (SGD) with learning rate 0.025, momentum 0.9, weight decay 3×10⁻⁴ and batch size 80 to train the models for 600 epochs on CIFAR-10/100 and 300 epochs on Tiny-ImageNet-200 to ensure the convergence. The learning rate is gradually annealed to zero following the standard cosine annealing schedule. |
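
The experiment setup row above fully specifies the optimizer and schedule, so a minimal sketch of that training loop is given below. The paper does not release code or name a framework, so PyTorch/torchvision are assumed here; the ResNet-18 backbone and data pipeline are placeholders standing in for the paper's NAS cell-based architectures, and only the quoted hyperparameters (lr 0.025, momentum 0.9, weight decay 3e-4, batch size 80, 600 epochs on CIFAR-10, cosine annealing to zero) are taken from the source.

```python
# Sketch of the reported default training setting, NOT the authors' code.
# Assumed: PyTorch/torchvision, ResNet-18 as a placeholder model, CIFAR-10.
import torch
import torchvision
import torchvision.transforms as T


def main():
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # CIFAR-10: 50,000 training images of 32x32 pixels, as stated in the paper.
    train_set = torchvision.datasets.CIFAR10(
        root="./data", train=True, download=True, transform=T.ToTensor()
    )
    train_loader = torch.utils.data.DataLoader(
        train_set, batch_size=80, shuffle=True, num_workers=2
    )

    # Placeholder backbone; the paper trains NAS architectures and their
    # randomly connected variants instead.
    model = torchvision.models.resnet18(num_classes=10).to(device)
    criterion = torch.nn.CrossEntropyLoss()

    # SGD with the quoted learning rate, momentum, and weight decay.
    optimizer = torch.optim.SGD(
        model.parameters(), lr=0.025, momentum=0.9, weight_decay=3e-4
    )
    # Cosine annealing over the full 600-epoch budget, decaying the lr to zero.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=600, eta_min=0.0
    )

    for epoch in range(600):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()


if __name__ == "__main__":
    main()
```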