Hardware-adaptive Efficient Latency Prediction for NAS via Meta-Learning

Authors: Hayeon Lee, Sewoong Lee, Song Chong, Sung Ju Hwang

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate the proposed HELP for its latency estimation performance on unseen platforms, on which it achieves high estimation performance with as few as 10 measurement samples, outperforming all relevant baselines. We also validate end-to-end NAS frameworks using HELP against ones without it, and show that it largely reduces the total time cost of the base NAS method, in latency-constrained settings.
Researcher Affiliation | Collaboration | KAIST, AITRICS, Seoul, South Korea
Pseudocode | No | The paper does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/HayeonLee/HELP.
Open Datasets | Yes | We validate the latency estimation performance of HELP on the NAS-Bench-201 space [9] with various devices from different hardware platforms, utilizing the latency dataset for an extensive device pool in the HW-NAS-Bench dataset [18].
Dataset Splits | Yes | For WMT'14 En-De, we follow [33, 35] for the training, validation, and test splits of the datasets.
Hardware Specification | Yes | To construct the Meta-Training Pool, we collect the latency measurements from 18 heterogeneous devices, including GPUs, CPUs, and mobile devices (NVIDIA 1080ti, Titan X, Titan XP, RTX 2080ti, Xeon Silver 4114, Silver 4210r, Samsung A50, S7, Google Pixel3, Essential Ph 1). Unseen Devices include NVIDIA GPU Titan RTX, Intel CPU Xeon Gold 6226, and Google Pixel2, which are different from the devices in the meta-training pool but belong to the same categories (GPU, CPU, mobile device). On the other hand, Unseen Platforms include Jetson AGX Xavier, Raspi4, ASIC-Eyeriss, and FPGA, which are completely unseen categories of devices.
Software Dependencies | No | The paper does not explicitly state specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, CUDA versions).
Experiment Setup | Yes | For GPUs, we consider three different batch sizes [1, 32, 256(64)], and for all other hardware devices, we use a batch size of 1. In our experiments, we set d = 10. The inner-loop update is $\theta_\tau^{(t+1)} = \theta_\tau^{(t)} - \alpha \circ \nabla_{\theta^{(t)}} \mathcal{L}(f(X_\tau, V_\tau^h; \theta^{(t)}), Y_\tau)$ for t = 1, ..., T, where t denotes the t-th inner gradient step, T is the total number of inner gradient steps, and $\alpha$ denotes the multi-dimensional global learning rate vector [19].