Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Per-Architecture Training-Free Metric Optimization for Neural Architecture Search

Authors: Mingzhuo Lin, Jianping Luo

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments demonstrate the effectiveness of our approach. Our code has been made publicly available at https://github.com/LMZ-Zhuo/PO-NAS. [Section 4: Experiments; 4.1 PO-NAS on NAS Benchmark] To validate the robustness and exceptional predictive performance of PO-NAS across various tasks, we conduct comprehensive experiments on multiple popular NAS benchmarks. These benchmarks include 20 distinct training tasks: NAS-Bench-201 (21), TransNAS-Bench-101 (34), and DARTS (11). PO-NAS utilizes 6 metrics outlined by (5): grad_norm, snip, grasp, fisher, synflow, and jacob_cov. Implementation details and more empirical results can be found in Appendix B and Appendix C. Table 1, Table 2, Table 3 and Table 4 present the performance results of PO-NAS on NAS-Bench-201, DARTS and TransNAS-Bench-101. [4.2 Ablation Studies] We conduct a series of ablation studies on PO-NAS to investigate the impact of different components and parameter settings.
Researcher Affiliation	Academia	Guangdong Key Laboratory of Intelligent Information Processing, College of Electronic and Information Engineering, Shenzhen University, China
Pseudocode	Yes	Algorithm 1: Pseudo code for PO-NAS; Algorithm 2: calculation minimal operation path and cost of two nodes
Open Source Code	Yes	Our code has been made publicly available at https://github.com/LMZ-Zhuo/PO-NAS.
Open Datasets	Yes	To validate the robustness and exceptional predictive performance of PO-NAS across various tasks, we conduct comprehensive experiments on multiple popular NAS benchmarks. These benchmarks include 20 distinct training tasks: NAS-Bench-201 (21), TransNAS-Bench-101 (34), and DARTS (11). For NAS-Bench-201 and TransNAS-Bench-101, to assess the effectiveness of our method through ablation studies, we adopt the same training-free metrics as those used in the RoBoT experiment. These metrics include the six training-free metrics outlined by (5): grad_norm, snip, grasp, fisher, synflow, and jacob_cov. To ensure the reproducibility of the experimental results, we utilize NAS-Bench-Suite-Zero (13) to calculate these training-free metrics for both NAS-Bench-201 and TransNAS-Bench-101. These architectures have been evaluated across three datasets: CIFAR-10 (C10) (55), CIFAR-100 (C100), and ImageNet-16-120 (IN-16) (56).
Dataset Splits	Yes	During the pre-training phase, we divide the training dataset for the architecture encoder into training and test sets in a 1:4 ratio and randomly mask 20% of the architectures for the mask reconstruction task. For NAS-Bench-201, we maintain the experimental conditions consistent with RoBoT and HNAS, using the CIFAR-10 validation performance after 12 training epochs from the table data in NAS-Bench-201 as the objective evaluation metric for all three datasets, and calculate the search costs displayed in the same manner (i.e., the training cost of 20 architectures). However, we report the full training test accuracy of the proposed architectures after 200 epochs. For the training tasks Segmentation, Normal, and Autoencoding in TransNAS-Bench-101, to maintain consistency in experimental conditions, we do not use the training-free metric Synflow and only employ the remaining five training-free metrics. For TransNAS-Bench-101, we only report the validation performance of the architectures identified through the search process. For ImageNet, we train a 14-layer architecture from scratch for 250 epochs with a batch size of 1024.
Hardware Specification	Yes	The search costs are evaluated on an Nvidia 1080Ti.
Software Dependencies	No	The paper mentions code availability but does not specify software dependencies with version numbers.
Experiment Setup	Yes	During the pre-training phase, we divide the training dataset for the architecture encoder into training and test sets in a 1:4 ratio and randomly mask 20% of the architectures for the mask reconstruction task. We train the architecture encoder using the initial architectures and their corresponding training-free metrics, employing stochastic gradient descent (SGD) over 100 epochs. In the first 5 epochs, the learning rate is initially increased to 5 × 10^-3 and then gradually reduced to 0 according to a cosine annealing schedule, with a batch size of 64. In the Bayesian Optimization (BO) phase, we set a loss threshold of 0.1 (for some tasks in TransNAS-Bench-101, adjustments are made due to the peculiarity of the loss metric values) and a maximum number of iterations of 100. Each iteration begins with a training period of 100 epochs, increasing by 10% per iteration, and the model weights are reset after each iteration until the model reaches the loss threshold or the maximum number of iterations is reached, after which the best model weights are saved. The loss difference threshold T_th is set to 0.1 (adjustments are made for some tasks in TransNAS-Bench-101 due to the peculiarity of the loss metric values). We use the Adaptive Moment Estimation (Adam) optimizer to train the surrogate model. In the first 10 epochs, the learning rate is initially increased to 3 × 10^-4 and then gradually reduced to 0 according to a cosine annealing schedule, with a weight decay of 0.01. Additionally, we employ gradient clipping with max norm = 1 to prevent gradient explosion. For the CIFAR-10 and CIFAR-100 datasets, we allocate a budget of 25 search attempts for PO-NAS, with each optimal architecture identified undergoing 10 epochs of training. For the ImageNet dataset, we set a budget of 10 search attempts, with each optimal architecture undergoing 3 epochs of training.
We select three initial architectures to initialize the surrogate model, which are the top three based on the average scores of the training-free metrics. We initiate the evolutionary algorithm at the 10th epoch (3rd epoch on ImageNet) of the Bayesian Optimization (BO) phase. Following the experimental setup of DARTS (11), we construct a 20-layer network architecture based on the identified cell structures. The initial number of channels for these architectures is set to 36, with the auxiliary tower weight set to 0.4 for CIFAR-10, located at the 13th layer; for CIFAR-100, the auxiliary tower weight is set to 0.6. We test these architectures on CIFAR-10 and CIFAR-100 through 600 epochs of stochastic gradient descent (SGD). Using a cosine annealing strategy, the learning rate decays from 0.025 to 0 for CIFAR-10 and from 0.035 to 0.001 for CIFAR-100. Momentum is set to 0.9, weight decay to 3 × 10^-4, and batch size to 96. Additionally, we employ Cutout (64) and Scheduled Drop Path as regularization techniques, with the drop-path rate linearly increased from 0 to 0.2 for CIFAR-10 and from 0 to 0.3 for CIFAR-100. For ImageNet, we train a 14-layer architecture from scratch for 250 epochs with a batch size of 1024. In the first five epochs, the learning rate is initially increased to 0.7, then gradually decreases to zero according to a cosine schedule. When using the SGD optimizer, momentum is 0.9, and weight decay is 3 × 10^-5.
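The encoder pre-training split described in the Dataset Splits and Experiment Setup rows (a 1:4 train/test split plus a random 20% mask for mask reconstruction) can be sketched as below. This is a hypothetical illustration, not code from the PO-NAS repository: it reads "1 : 4" literally as train:test (20% train), assumes the masked architectures are drawn from the encoder's training split, and the function name `split_and_mask` and the fixed seed are illustrative choices.

```python
import random
from typing import Hashable, List, Sequence, Set, Tuple


def split_and_mask(
    arch_ids: Sequence[Hashable],
    train_fraction: float = 0.2,  # ASSUMPTION: "1 : 4" read as train:test
    mask_fraction: float = 0.2,   # "randomly mask 20% of the architectures"
    seed: int = 0,
) -> Tuple[List[Hashable], List[Hashable], Set[Hashable]]:
    """Shuffle architectures, split them into train/test, and pick a subset to mask.

    Hypothetical sketch: drawing the masked subset from the training split is an
    assumption about how the mask-reconstruction task is set up.
    """
    rng = random.Random(seed)
    ids = list(arch_ids)
    rng.shuffle(ids)
    n_train = round(len(ids) * train_fraction)
    train, test = ids[:n_train], ids[n_train:]
    # Architectures whose inputs are hidden for the reconstruction objective.
    masked = set(rng.sample(train, round(len(train) * mask_fraction)))
    return train, test, masked
```

For 100 architectures this yields 20 training and 80 test architectures, 4 of which (20% of the training split) are masked; a fixed seed keeps the split repeatable across runs.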
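The Experiment Setup row reuses one schedule shape several times: a short warmup to a peak learning rate followed by cosine annealing (encoder: 5 warmup epochs to 5 × 10^-3 over 100 epochs; surrogate: 10 warmup epochs to 3 × 10^-4; ImageNet: 5 warmup epochs to 0.7 over 250 epochs), plus cosine decay without warmup and a linearly increasing drop-path rate for the DARTS-space evaluation. A minimal per-epoch sketch follows, assuming linear warmup and per-epoch (not per-step) updates, neither of which the text states; `warmup_cosine_lr` and `linear_ramp` are hypothetical helper names.

```python
import math


def warmup_cosine_lr(
    epoch: int,
    total_epochs: int,
    warmup_epochs: int,
    peak_lr: float,
    final_lr: float = 0.0,
) -> float:
    """Per-epoch learning rate: linear warmup to peak_lr, then cosine decay to final_lr.

    ASSUMPTION: warmup is linear and epochs are 0-indexed; the paper only says the
    rate is "initially increased" during warmup.
    """
    if epoch < warmup_epochs:
        # Reach peak_lr at the last warmup epoch.
        return peak_lr * (epoch + 1) / warmup_epochs
    # Progress through the decay phase: 0 at the first decay epoch, 1 at the last epoch.
    decay_epochs = max(1, total_epochs - warmup_epochs - 1)
    progress = min(1.0, (epoch - warmup_epochs) / decay_epochs)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


def linear_ramp(epoch: int, total_epochs: int, final_value: float) -> float:
    """Increase a value linearly from 0 (first epoch) to final_value (last epoch)."""
    return final_value * epoch / max(1, total_epochs - 1)


# Encoder pre-training: 100 epochs, 5 warmup epochs, peak 5e-3, decays to 0.
encoder_lrs = [warmup_cosine_lr(e, 100, 5, 5e-3) for e in range(100)]
# DARTS-space CIFAR-100 evaluation: 600 epochs, no warmup, 0.035 -> 0.001.
c100_lrs = [warmup_cosine_lr(e, 600, 0, 0.035, final_lr=0.001) for e in range(600)]
# Scheduled drop-path rate for CIFAR-10: 0 -> 0.2 over 600 epochs.
drop_path = [linear_ramp(e, 600, 0.2) for e in range(600)]
```

Setting `warmup_epochs=10` and `peak_lr=3e-4` gives the surrogate-model schedule; because the surrogate's epoch budget changes from one BO iteration to the next, `total_epochs` would be recomputed per iteration. In a PyTorch setup the same shape is usually built from a warmup scheduler chained with `CosineAnnealingLR`.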
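The BO-phase stopping rule in the Experiment Setup row (loss threshold 0.1, at most 100 iterations, a 100-epoch budget that grows by 10% per iteration, weights reset between iterations, best weights kept) can be read as a restart loop. The sketch below is a hypothetical illustration: `train_from_scratch` is an assumed callback that re-initialises the surrogate, trains it for the given number of epochs, and returns its final loss and weights; it also assumes the 10% growth compounds and is rounded to whole epochs.

```python
import math
from typing import Any, Callable, Optional, Tuple


def fit_surrogate_with_restarts(
    train_from_scratch: Callable[[int], Tuple[float, Any]],
    loss_threshold: float = 0.1,
    max_iterations: int = 100,
    base_epochs: int = 100,
    growth: float = 0.10,
) -> Tuple[float, Optional[Any], int]:
    """Retrain from freshly initialised weights with a growing epoch budget.

    `train_from_scratch(epochs)` is a hypothetical callback: it resets the
    surrogate's weights, trains for `epochs` epochs, and returns (loss, weights).
    ASSUMPTION: the 10% per-iteration increase compounds.
    """
    best_loss, best_weights = math.inf, None
    epochs, iterations = base_epochs, 0
    while iterations < max_iterations:
        iterations += 1
        loss, weights = train_from_scratch(epochs)
        if loss < best_loss:  # keep the best weights seen so far
            best_loss, best_weights = loss, weights
        if loss <= loss_threshold:  # stop once the loss threshold is reached
            break
        epochs = round(epochs * (1 + growth))  # next iteration trains ~10% longer
    return best_loss, best_weights, iterations
```

Under these assumptions the budgets run 100, 110, 121, ... epochs; the loop returns the best weights even when the threshold is never met, matching the text's "after which the best model weights are saved".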
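The surrogate is seeded with "the top three [architectures] based on the average scores of the training-free metrics". Raw proxy values live on very different scales (synflow can be orders of magnitude larger than snip), so the sketch below standardises each metric across architectures before averaging. That normalisation, and the assumption that higher is better for every metric, are choices made for this illustration and are not stated in the text; `top_k_initial_architectures` and the demo scores are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List


def top_k_initial_architectures(
    scores: Dict[str, Dict[str, float]], k: int = 3
) -> List[str]:
    """Rank architectures by the mean of their z-scored training-free metrics.

    `scores[arch][metric]` holds raw proxy values (e.g. snip, synflow); every
    architecture must report the same metrics.
    ASSUMPTIONS: higher is better for each metric, and metrics are standardised
    across architectures before averaging because their raw scales differ.
    """
    metrics = sorted(next(iter(scores.values())))
    stats = {}
    for m in metrics:
        values = [per_arch[m] for per_arch in scores.values()]
        spread = pstdev(values) or 1.0  # avoid dividing by zero for a constant metric
        stats[m] = (mean(values), spread)
    avg = {
        arch: mean((per_arch[m] - stats[m][0]) / stats[m][1] for m in metrics)
        for arch, per_arch in scores.items()
    }
    return sorted(avg, key=avg.get, reverse=True)[:k]


# Hypothetical scores for four architectures.
demo = {
    "a": {"snip": 4.0, "synflow": 4e6},
    "b": {"snip": 3.0, "synflow": 1e6},
    "c": {"snip": 2.0, "synflow": 3e6},
    "d": {"snip": 1.0, "synflow": 2e6},
}
print(top_k_initial_architectures(demo))  # ['a', 'c', 'b']
```

Averaging the raw values instead would let synflow dominate and pick `a`, `c`, `d`; standardising first gives both metrics equal weight, which is one reasonable reading of "average scores".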