Testing Semantic Importance via Betting

Authors: Jacopo Teneggi, Jeremias Sulam

NeurIPS 2024

| Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We showcase the effectiveness and flexibility of our framework on synthetic datasets as well as on image classification using several vision-language models." |
| Researcher Affiliation | Academia | Jacopo Teneggi (Johns Hopkins University, jtenegg1@jhu.edu); Jeremias Sulam (Johns Hopkins University, jsulam1@jhu.edu) |
| Pseudocode | Yes | Algorithm 1 (Level-α C-SKIT for concept j); Algorithm 2 (Level-α X-SKIT for concept j) |
| Open Source Code | Yes | Code to reproduce all experiments is available at https://github.com/Sulam-Group/IBYDMT. |
| Open Datasets | Yes | Animals with Attributes 2 (AwA2) [82], CUB-200-2011 (CUB) [77], and the Imagenette subset of ImageNet [22]. |
| Dataset Splits | Yes | "We sample a training dataset of 50,000 images and train a ResNet18 [30]... To evaluate the model, we round predictions to the nearest integer and compute accuracy on a held-out set of 10,000 images from the same distribution (we use the original train and test splits of the MNIST dataset to guarantee no digits shown during training are included in test images)..." |
| Hardware Specification | Yes | All experiments were run on a private server with one 24 GB NVIDIA RTX A5000 GPU and 96 CPU cores with 500 GB of RAM. |
| Software Dependencies | No | The paper mentions ResNet18 [30] and the Adam optimizer [35] but does not state the software libraries or versions (e.g., PyTorch 1.x, TensorFlow 2.x) they were implemented in, which are necessary for reproducible software dependencies. |
| Experiment Setup | Yes | "For each test, we estimate the rejection rate (i.e., how often a test rejects) and the expected rejection time (i.e., how many steps of the test it takes to reject) over 100 draws of τ_max = 1000 samples, with a significance level α = 0.05." |
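The evaluation protocol in the Experiment Setup row (rejection rate and expected rejection time over 100 draws of τ_max = 1000 samples at level α = 0.05) can be illustrated with a toy testing-by-betting loop. This is not the paper's C-SKIT/X-SKIT procedure, which uses sequential kernelized independence tests; it is a minimal sketch of the shared principle: a wealth process that is a nonnegative martingale under the null, rejecting when wealth reaches 1/α (Ville's inequality bounds the type I error by α). The Bernoulli null H0: E[x] = 0.5, the bet size `lam`, and all function names are illustrative assumptions, not from the paper.

```python
import random

def betting_test(xs, alpha=0.05, lam=0.5):
    """Toy sequential test of H0: E[x] = 0.5 via a betting wealth process.

    Under H0 the wealth W_t is a nonnegative martingale, so by Ville's
    inequality P(sup_t W_t >= 1/alpha) <= alpha. Returns (rejected, step),
    where step is the 1-based rejection time, or the total number of
    samples if the test never rejects within the given horizon.
    """
    wealth = 1.0
    for t, x in enumerate(xs, start=1):
        # Fixed fractional bet on the deviation of x from its null mean.
        wealth *= 1.0 + lam * (x - 0.5)
        if wealth >= 1.0 / alpha:
            return True, t
    return False, len(xs)

def rejection_stats(p, n_draws=100, tau_max=1000, alpha=0.05, seed=0):
    """Estimate rejection rate and mean rejection time over repeated draws,
    mirroring the evaluation protocol described in the experiment setup."""
    rng = random.Random(seed)
    n_rejections, times = 0, []
    for _ in range(n_draws):
        xs = [1.0 if rng.random() < p else 0.0 for _ in range(tau_max)]
        rejected, t = betting_test(xs, alpha=alpha)
        if rejected:
            n_rejections += 1
            times.append(t)
    rate = n_rejections / n_draws
    mean_time = sum(times) / len(times) if times else float("nan")
    return rate, mean_time
```

Under the null (p = 0.5) the empirical rejection rate stays near or below α, while under an alternative such as p = 0.7 the test rejects in almost every draw, typically well before τ_max, which is the trade-off the rejection-rate/rejection-time evaluation is designed to expose.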