Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

GradSign: Model Performance Inference with Theoretical Insights

Authors: Zhihao Zhang, Zhihao Jia

ICLR 2022 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Evaluation on seven NAS benchmarks across three training datasets shows that GradSign generalizes well to real-world neural networks and consistently outperforms state-of-the-art gradient-based methods for MPI evaluated by Spearman's ρ and Kendall's τ.
Researcher Affiliation Academia Zhihao Zhang, Carnegie Mellon University, EMAIL; Zhihao Jia, Carnegie Mellon University, EMAIL
Pseudocode Yes Algorithm 1: GradSign. Result: GradSign score τf for a function class fθ. Given S = {(xi, yi)}, i ∈ [n], randomly select initialization point θ0; initialize g[n, m]; for i = 1, 2, …, n do: for k = 1, 2, …, m do: g[i, k] = sign([∇θ l(fθ(xi), yi)|θ0]k) end end; τf = Σk |Σi g[i, k]|; return τf
Open Source Code Yes Code is available at https://github.com/JackFram/GradSign
Open Datasets Yes Evaluation on seven NAS benchmarks (i.e., NAS-Bench-101, NAS-Bench-201, and five design spaces of NDS) across three datasets (i.e., CIFAR-10, CIFAR-100, and ImageNet16-120)
Dataset Splits Yes Table 5: Mean ± std accuracy evaluated on NAS-Bench-201. All results are averaged over 500 runs. All searches are conducted on CIFAR-10 while the selected architectures are evaluated on CIFAR-10, CIFAR-100, and ImageNet16-120. N in parentheses is the number of networks sampled in each run.
Hardware Specification Yes The hardware we used was Amazon EC2 C5 instances with no GPU involved and a p3 instance with one V100 Tensor Core GPU.
Software Dependencies No The paper mentions 'PyTorch (Paszke et al., 2017) and TensorFlow (Abadi et al., 2016)' but does not provide specific version numbers for these or other software dependencies.
Experiment Setup Yes We use a randomly sampled subset with approximately 4500 architectures of the original search space and a batch size of 64 in this experiment.
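For readers checking the pseudocode in Algorithm 1, the score reduces to a sign-then-sum over per-sample gradients: τf = Σk |Σi sign(g[i, k])|. The following minimal sketch computes that quantity with NumPy. The linear-regression model, squared loss, and function name `gradsign_score` are illustrative assumptions, not the authors' implementation; in practice the per-sample gradients would come from a framework such as PyTorch.

```python
import numpy as np

def gradsign_score(grads):
    """GradSign score from per-sample gradients (Algorithm 1).

    grads: (n, m) array; row i holds the gradient of the loss on
    sample i w.r.t. the m parameters at a random init theta_0.
    Returns sum_k | sum_i sign(grads[i, k]) |.
    """
    g = np.sign(grads)                 # g[i, k] = sign of gradient entry
    return np.abs(g.sum(axis=0)).sum() # agreement of signs across samples

# Hypothetical toy setup: linear model f(x) = w @ x with squared loss
# l = 0.5 * (w @ x - y)^2, whose per-sample gradient is (w @ x - y) * x.
# Chosen only so the sketch runs without a deep-learning framework.
rng = np.random.default_rng(0)
n, m = 8, 5                            # n samples, m parameters
X = rng.normal(size=(n, m))
y = rng.normal(size=n)
w0 = rng.normal(size=m)                # random initialization theta_0
grads = (X @ w0 - y)[:, None] * X      # (n, m) per-sample gradients
score = gradsign_score(grads)
```

The score is large when the per-sample gradients agree in sign coordinate-wise, which is the property the paper links to trainability; it is bounded by n × m.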