Efficient Lifelong Model Evaluation in an Era of Rapid Progress

Authors: Ameya Prabhu, Vishaal Udandarao, Philip Torr, Matthias Bethge, Adel Bibi, Samuel Albanie

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive empirical evaluations across 31,000 models demonstrate that S&S achieves highly efficient approximate accuracy measurement, reducing compute cost from 180 GPU days to 5 GPU hours (~1000x reduction) on a single A100 GPU, with low approximation error and a memory cost of <100 MB.
Researcher Affiliation | Academia | 1. Tübingen AI Center, University of Tübingen; 2. University of Cambridge; 3. University of Oxford
Pseudocode | Yes | Here, we provide Pythonic pseudocode for the constituent algorithms of Sort & Search, which we described in detail in Section 3.
Open Source Code | Yes | https://github.com/bethgelab/sort-and-search
Open Datasets | Yes | For Lifelong-CIFAR10, we use 31,250 CIFAR-10 pre-trained models from the NATS-Bench-Topology search space [25]. For Lifelong-ImageNet, we use 167 ImageNet-1K and ImageNet-21K pre-trained models, sourced primarily from timm [98] and imagenet-testbed [84].
Dataset Splits | No | The paper describes splits for evaluating its framework's performance (e.g., 'Sample Addition Split (1: insert D)', 'Model Evaluation Split (2: insert M)'), but it does not provide traditional training/validation/test dataset splits for model training, as it primarily evaluates pre-trained models.
Hardware Specification | Yes | reducing compute cost from 180 GPU days to 5 GPU hours (~1000x reduction) on a single A100 GPU
Software Dependencies | No | The paper mentions software such as 'timm [98]' and provides Pythonic pseudocode, but it does not give specific version numbers for any software dependencies.
Experiment Setup | Yes | We run our S&S over 13 different sampling budgets: {8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768} on both Lifelong-ImageNet and Lifelong-CIFAR10.
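To make the Sort & Search idea summarized above concrete, the sketch below illustrates its two phases: sort test samples by difficulty using the correctness of previously evaluated models, then search for the threshold position that best explains a new model's answers on the sorted order. This is a minimal illustrative reconstruction, not the authors' released implementation (which is in the linked GitHub repository); the function names, the toy data, and the brute-force threshold search are our own simplifications.

```python
import numpy as np

def sort_samples(correct):
    """Rank test samples from easiest to hardest by counting how many
    previously evaluated models answered each sample correctly.
    `correct` is a binary (num_models x num_samples) matrix."""
    difficulty = correct.sum(axis=0)   # more correct answers = easier
    return np.argsort(-difficulty)     # sample indices, easiest first

def search_threshold(sorted_correct):
    """Given a new model's binary correctness over the *sorted* samples,
    find the cut point k minimizing disagreement with the pattern
    'correct on the k easiest samples, wrong on the rest'."""
    n = len(sorted_correct)
    best_k, best_err = 0, n + 1
    for k in range(n + 1):
        err = int(np.sum(sorted_correct[:k] == 0) +
                  np.sum(sorted_correct[k:] == 1))
        if err < best_err:
            best_k, best_err = k, err
    return best_k

# Toy usage: 4 previously evaluated models, 6 test samples.
past = np.array([[1, 1, 1, 0, 1, 0],
                 [1, 1, 0, 1, 0, 0],
                 [1, 0, 1, 0, 0, 0],
                 [1, 1, 1, 1, 0, 0]])
order = sort_samples(past)
new_model = np.array([1, 1, 1, 1, 0, 0])   # correctness, original sample order
k = search_threshold(new_model[order])
approx_acc = k / past.shape[1]             # estimated accuracy of the new model
```

In the paper the search phase operates on a small sampled subset of the sorted samples (hence the sampling budgets above), which is what yields the large compute savings; the sketch uses all samples only for clarity.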