Efficient Lifelong Model Evaluation in an Era of Rapid Progress

Authors: Ameya Prabhu, Vishaal Udandarao, Philip Torr, Matthias Bethge, Adel Bibi, Samuel Albanie

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive empirical evaluations across 31,000 models demonstrate that S&S achieves highly efficient approximate accuracy measurement, reducing compute cost from 180 GPU days to 5 GPU hours (~1000x reduction) on a single A100 GPU, with low approximation error and a memory cost of <100 MB.
Researcher Affiliation | Academia | 1. Tübingen AI Center, University of Tübingen; 2. University of Cambridge; 3. University of Oxford
Pseudocode | Yes | Here, we provide Pythonic pseudocode for the constituent algorithms of Sort & Search, which we described in detail in Section 3.
Open Source Code | Yes | https://github.com/bethgelab/sort-and-search
Open Datasets | Yes | For Lifelong-CIFAR10, we use 31,250 CIFAR-10 pre-trained models from the NATS-Bench-Topology search space [25]. For Lifelong-ImageNet, we use 167 ImageNet-1K and ImageNet-21K pre-trained models, sourced primarily from timm [98] and imagenet-testbed [84].
Dataset Splits | No | The paper describes splits for evaluating its framework's performance (e.g., 'Sample Addition Split (1: insert D)', 'Model Evaluation Split (2: insert M)'), but it does not provide traditional training/validation/test dataset splits for model training, as it primarily evaluates pre-trained models.
Hardware Specification | Yes | reducing compute cost from 180 GPU days to 5 GPU hours (~1000x reduction) on a single A100 GPU
Software Dependencies | No | The paper mentions software such as 'timm [98]' and provides Pythonic pseudocode, but it does not give specific version numbers for any software dependencies.
Experiment Setup | Yes | We run our S&S over 13 different sampling budgets: {8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768} on both Lifelong-ImageNet and Lifelong-CIFAR10.
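To make the Sort & Search idea summarized above concrete, the sketch below illustrates its two phases: sort test samples by difficulty using the correctness of previously evaluated models, then search for the threshold position that best explains a new model's answers on the sorted order. This is a minimal illustrative reconstruction, not the authors' released implementation (which is in the linked GitHub repository); the function names, the toy data, and the brute-force threshold search are our own simplifications.

```python
import numpy as np

def sort_samples(correct):
    """Rank test samples from easiest to hardest by counting how many
    previously evaluated models answered each sample correctly.
    `correct` is a binary (num_models x num_samples) matrix."""
    difficulty = correct.sum(axis=0)   # more correct answers = easier
    return np.argsort(-difficulty)     # sample indices, easiest first

def search_threshold(sorted_correct):
    """Given a new model's binary correctness over the *sorted* samples,
    find the cut point k minimizing disagreement with the pattern
    'correct on the k easiest samples, wrong on the rest'."""
    n = len(sorted_correct)
    best_k, best_err = 0, n + 1
    for k in range(n + 1):
        err = int(np.sum(sorted_correct[:k] == 0) +
                  np.sum(sorted_correct[k:] == 1))
        if err < best_err:
            best_k, best_err = k, err
    return best_k

# Toy usage: 4 previously evaluated models, 6 test samples.
past = np.array([[1, 1, 1, 0, 1, 0],
                 [1, 1, 0, 1, 0, 0],
                 [1, 0, 1, 0, 0, 0],
                 [1, 1, 1, 1, 0, 0]])
order = sort_samples(past)
new_model = np.array([1, 1, 1, 1, 0, 0])   # correctness, original sample order
k = search_threshold(new_model[order])
approx_acc = k / past.shape[1]             # estimated accuracy of the new model
```

In the paper the search phase operates on a small sampled subset of the sorted samples (hence the sampling budgets above), which is what yields the large compute savings; the sketch uses all samples only for clarity.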