AutoGO: Automated Computation Graph Optimization for Neural Network Evolution

Authors: Mohammad Salameh, Keith Mills, Negar Hassanpour, Fred Han, Shuting Zhang, Wei Lu, Shangling Jui, Chunhua Zhou, Fengyu Sun, Di Niu

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experimental results show that AutoGO can automatically evolve several typical large convolutional networks to achieve significant task performance improvement and FLOPs reduction on a range of CV tasks, ranging from Classification, Semantic Segmentation, Human Pose Estimation, to Super Resolution, yet without introducing any newer primitive operations. We also demonstrate the lightweight deployment results of AutoGO-optimized super-resolution and denoising U-Nets on a cycle simulator for a Neural Processing Unit (NPU), achieving PSNR improvement and latency/power reduction simultaneously."
Researcher Affiliation | Collaboration | (1) Huawei Technologies Canada; (2) Dept. ECE, University of Alberta; (3) Huawei Kirin Solution, China.
Pseudocode | Yes | "Algorithm 1: Sample AutoGO pseudocode for one iteration"
Open Source Code | Yes | "Code available at https://github.com/Ascend-Research/AutoGO."
Open Datasets | Yes | "We construct our database by extracting segments from 5 CIFAR-10 [33] benchmark families: NAS-Bench-101 [71], NAS-Bench-201 [17], HiAML, Inception, and Two-Path [48]." "We train each network on ImageNet [58]. Then, we fine-tune the network on different tasks. For Semantic Segmentation (SS), we use a PSPNet [76] head structure and fine-tune on Cityscapes [14] to obtain mean Intersection over Union (mIoU) performance. For Human Pose Estimation (HPE), we adopt the method of [78] to fine-tune on MPII [4] to measure the Percentage of Correct Keypoints (PCK) of an architecture."
Dataset Splits | Yes | "We split each family into training, validation, and testing partitions containing 80%, 10% and 10% of the overall CGs in that family." (See the split sketch after this table.)
Hardware Specification | Yes | "We run our experiments on rack servers using Intel Xeon Gold 6140 CPUs. Each server is equipped with 8 NVIDIA V100 32GB GPUs and 756GB RAM. ... We measure latency on an NVIDIA RTX 2080 Ti GPU..."
Software Dependencies | Yes | "We execute our search and experiments on Python 3 using PyTorch==1.8.1 and TensorFlow==1.15.0. We implement our predictors using PyTorch-Geometric==1.7.1. We use SentencePiece [34] to perform BPE. Finally, we implement our MILP using a Coin-CBC solver [18] and pyomo==6.4.0 [23]."
Experiment Setup | Yes | "We train our predictors for 40 epochs with a batch size of 32 and an initial learning rate of 1e-4. ... We evaluate CIFAR-10 networks by training them 3 times for 200 epochs with a batch size of 256. We optimize the models using RMSProp with an initial learning rate of 1e-3 and a momentum factor of 0.9. We anneal the learning rate according to a cosine schedule." (See the training-setup sketch after this table.)
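
To make the Dataset Splits row concrete, the following is a minimal sketch of an 80%/10%/10% per-family partition. The function and variable names (split_family, load_family, cgs) are illustrative assumptions, not identifiers from the AutoGO codebase.

```python
import random

def split_family(cgs, seed=0):
    """Partition one benchmark family's computation graphs (CGs) into
    80% train / 10% validation / 10% test, matching the ratios
    reported in the paper."""
    cgs = list(cgs)
    random.Random(seed).shuffle(cgs)
    n_train = int(0.8 * len(cgs))
    n_val = int(0.1 * len(cgs))
    return (cgs[:n_train],
            cgs[n_train:n_train + n_val],
            cgs[n_train + n_val:])

# The same split is applied to each of the five CIFAR-10 families:
families = ["NAS-Bench-101", "NAS-Bench-201", "HiAML", "Inception", "Two-Path"]
# splits = {f: split_family(load_family(f)) for f in families}  # load_family is hypothetical
```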
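
The Experiment Setup row fixes the CIFAR-10 evaluation hyperparameters: 200 epochs, batch size 256, RMSProp with an initial learning rate of 1e-3 and momentum 0.9, annealed with a cosine schedule. Below is a minimal PyTorch sketch of that optimizer/scheduler configuration; the model, dataset, and training-loop helpers are placeholders, not the authors' code.

```python
import torch

def make_optimizer_and_scheduler(model, epochs=200):
    """RMSProp (lr=1e-3, momentum=0.9) with a cosine learning-rate
    schedule over the full run, as described in the setup."""
    optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler

# Outer loop sketch: each evaluated network is trained 3 times for
# 200 epochs with a batch size of 256.
# for seed in range(3):
#     model = build_model()                                  # hypothetical
#     loader = torch.utils.data.DataLoader(cifar10_train,    # hypothetical dataset
#                                          batch_size=256, shuffle=True)
#     opt, sched = make_optimizer_and_scheduler(model)
#     for epoch in range(200):
#         train_one_epoch(model, loader, opt)                # hypothetical
#         sched.step()
```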