Testing Semantic Importance via Betting
Authors: Jacopo Teneggi, Jeremias Sulam
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We showcase the effectiveness and flexibility of our framework on synthetic datasets as well as on image classification using several vision-language models. |
| Researcher Affiliation | Academia | Jacopo Teneggi, Johns Hopkins University, jtenegg1@jhu.edu; Jeremias Sulam, Johns Hopkins University, jsulam1@jhu.edu |
| Pseudocode | Yes | Algorithm 1 Level-α C-SKIT for concept j, Algorithm 2 Level-α X-SKIT for concept j |
| Open Source Code | Yes | Code to reproduce all experiments is available at https://github.com/Sulam-Group/IBYDMT. |
| Open Datasets | Yes | Animals with Attributes 2 (AwA2) [82], CUB-200-2011 (CUB) [77], and the Imagenette subset of ImageNet [22]. |
| Dataset Splits | Yes | We sample a training dataset of 50,000 images and train a ResNet18 [30]... To evaluate the model, we round predictions to the nearest integer and compute accuracy on a held-out set of 10,000 images from the same distribution (we use the original train and test splits of the MNIST dataset to guarantee no digits shown during training are included in test images)... |
| Hardware Specification | Yes | All experiments were run on a private server with one 24 GB NVIDIA RTX A5000 GPU and 96 CPU cores with 500 GB of RAM memory. |
| Software Dependencies | No | The paper mentions "ResNet18 [30]" and "Adam optimizer [35]" but does not specify the software library versions (e.g., PyTorch 1.x, TensorFlow 2.x) in which these components were implemented, which is necessary for reproducible software dependencies. |
| Experiment Setup | Yes | For each test, we estimate the rejection rate (i.e., how often a test rejects), and the expected rejection time (i.e., how many steps of the test it takes to reject) over 100 draws of τ_max = 1000 samples, and with a significance level α = 0.05. |
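
The Pseudocode row above cites the paper's level-α C-SKIT and X-SKIT procedures, which are sequential tests built on testing by betting. The snippet below is a minimal, generic sketch of that idea, not the authors' exact algorithms: it assumes a stream of payoffs in [-1, 1] with zero conditional mean under the null (in the paper these come from kernel-based witness functions) and rejects once the wealth process reaches 1/α, which controls the type-I error by Ville's inequality. The function name and the betting-fraction update rule are illustrative assumptions.

```python
import numpy as np

def sequential_betting_test(payoffs, alpha=0.05):
    """Generic level-alpha sequential test by betting (illustrative sketch).

    `payoffs` is a sequence of values in [-1, 1] whose conditional mean is zero
    under the null hypothesis. The wealth process below is then a nonnegative
    supermartingale, so rejecting when it reaches 1/alpha keeps the type-I
    error at most alpha (Ville's inequality).
    """
    wealth = 1.0
    bet = 0.0  # betting fraction in (-1, 1); starts neutral, updated from past payoffs only
    for t, payoff in enumerate(payoffs, start=1):
        wealth *= 1.0 + bet * payoff  # place the bet on the new observation
        if wealth >= 1.0 / alpha:
            return True, t            # reject the null at step t
        # Illustrative update: bet a clipped fraction of the running mean payoff.
        # (The paper's algorithms use principled online betting strategies instead.)
        bet = float(np.clip(np.mean(payoffs[:t]), -0.5, 0.5))
    return False, len(payoffs)        # wealth never reached 1/alpha within the budget
```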
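
The Experiment Setup row reports the rejection rate and expected rejection time estimated over 100 draws of τ_max = 1000 samples at α = 0.05. Assuming the `sequential_betting_test` sketch above, and with a purely hypothetical payoff stream standing in for the paper's kernel-based payoffs, those two metrics could be estimated roughly as follows:

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws, tau_max, alpha = 100, 1000, 0.05

rejections, rejection_times = [], []
for _ in range(n_draws):
    # Hypothetical payoff stream with a small positive drift, standing in for the
    # payoffs a real test would compute from a fresh draw of tau_max samples.
    payoffs = np.clip(rng.normal(loc=0.05, scale=0.3, size=tau_max), -1.0, 1.0)
    rejected, t = sequential_betting_test(payoffs, alpha=alpha)
    rejections.append(rejected)
    if rejected:
        rejection_times.append(t)

rejection_rate = float(np.mean(rejections))
expected_rejection_time = float(np.mean(rejection_times)) if rejection_times else float("nan")
print(f"rejection rate: {rejection_rate:.2f}, expected rejection time: {expected_rejection_time:.1f}")
```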