Testing Semantic Importance via Betting

Authors: Jacopo Teneggi, Jeremias Sulam

NeurIPS 2024

| Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We showcase the effectiveness and flexibility of our framework on synthetic datasets as well as on image classification using several vision-language models." |
| Researcher Affiliation | Academia | Jacopo Teneggi (Johns Hopkins University, jtenegg1@jhu.edu); Jeremias Sulam (Johns Hopkins University, jsulam1@jhu.edu) |
| Pseudocode | Yes | Algorithm 1 (Level-α C-SKIT for concept j); Algorithm 2 (Level-α X-SKIT for concept j) |
| Open Source Code | Yes | Code to reproduce all experiments is available at https://github.com/Sulam-Group/IBYDMT. |
| Open Datasets | Yes | Animals with Attributes 2 (AwA2) [82], CUB-200-2011 (CUB) [77], and the Imagenette subset of ImageNet [22]. |
| Dataset Splits | Yes | "We sample a training dataset of 50,000 images and train a ResNet18 [30]... To evaluate the model, we round predictions to the nearest integer and compute accuracy on a held-out set of 10,000 images from the same distribution (we use the original train and test splits of the MNIST dataset to guarantee no digits shown during training are included in test images)..." |
| Hardware Specification | Yes | All experiments were run on a private server with one 24 GB NVIDIA RTX A5000 GPU and 96 CPU cores with 500 GB of RAM. |
| Software Dependencies | No | The paper mentions ResNet18 [30] and the Adam optimizer [35] but does not state the software libraries or versions (e.g., PyTorch 1.x, TensorFlow 2.x) they were implemented in, which are necessary for reproducible software dependencies. |
| Experiment Setup | Yes | "For each test, we estimate the rejection rate (i.e., how often a test rejects) and the expected rejection time (i.e., how many steps of the test it takes to reject) over 100 draws of τ_max = 1000 samples, with a significance level α = 0.05." |
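The evaluation protocol in the Experiment Setup row (rejection rate and expected rejection time over 100 draws of τ_max = 1000 samples at level α = 0.05) can be illustrated with a toy testing-by-betting loop. This is not the paper's C-SKIT/X-SKIT procedure, which uses sequential kernelized independence tests; it is a minimal sketch of the shared principle: a wealth process that is a nonnegative martingale under the null, rejecting when wealth reaches 1/α (Ville's inequality bounds the type I error by α). The Bernoulli null H0: E[x] = 0.5, the bet size `lam`, and all function names are illustrative assumptions, not from the paper.

```python
import random

def betting_test(xs, alpha=0.05, lam=0.5):
    """Toy sequential test of H0: E[x] = 0.5 via a betting wealth process.

    Under H0 the wealth W_t is a nonnegative martingale, so by Ville's
    inequality P(sup_t W_t >= 1/alpha) <= alpha. Returns (rejected, step),
    where step is the 1-based rejection time, or the total number of
    samples if the test never rejects within the given horizon.
    """
    wealth = 1.0
    for t, x in enumerate(xs, start=1):
        # Fixed fractional bet on the deviation of x from its null mean.
        wealth *= 1.0 + lam * (x - 0.5)
        if wealth >= 1.0 / alpha:
            return True, t
    return False, len(xs)

def rejection_stats(p, n_draws=100, tau_max=1000, alpha=0.05, seed=0):
    """Estimate rejection rate and mean rejection time over repeated draws,
    mirroring the evaluation protocol described in the experiment setup."""
    rng = random.Random(seed)
    n_rejections, times = 0, []
    for _ in range(n_draws):
        xs = [1.0 if rng.random() < p else 0.0 for _ in range(tau_max)]
        rejected, t = betting_test(xs, alpha=alpha)
        if rejected:
            n_rejections += 1
            times.append(t)
    rate = n_rejections / n_draws
    mean_time = sum(times) / len(times) if times else float("nan")
    return rate, mean_time
```

Under the null (p = 0.5) the empirical rejection rate stays near or below α, while under an alternative such as p = 0.7 the test rejects in almost every draw, typically well before τ_max, which is the trade-off the rejection-rate/rejection-time evaluation is designed to expose.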