Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
OTTER: Effortless Label Distribution Adaptation of Zero-shot Models
Authors: Changho Shin, Jitian Zhao, Sonia Cromp, Harit Vishwakarma, Frederic Sala
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we validate our method in a wide array of zero-shot image and text classification tasks, improving accuracy by 4.8% and 15.9% on average, and beating baselines like prior matching often by significant margins in 17 out of 21 datasets. |
| Researcher Affiliation | Academia | Department of Computer Sciences, University of Wisconsin-Madison |
| Pseudocode | Yes | Algorithm 1 (OTTER). 1: Input: $X = \{x_1, \ldots, x_n\}$, label distribution specification $(p_1, \ldots, p_K)$, cost matrix $C \in \mathbb{R}^{n \times K}$. 2: Define input marginal $\mu = \frac{1}{n}\mathbf{1}$, prediction marginal $\nu = (p_1, \ldots, p_K)$. 3: Run optimal transport and obtain transport plan $\pi$ s.t. $\pi = \arg\min_{\gamma \in \Pi(\mu,\nu)} \langle \gamma, C \rangle$. 4: Get modified classification outputs $\hat{y}_i = \arg\max_{j \in [K]} \pi_{i,j}$. Return $\{\hat{y}_i\}_{i \in [n]}$. |
| Open Source Code | Yes | Our code is available at https://github.com/SprocketLab/OTTER. |
| Open Datasets | Yes | We used 17 image classification datasets and 4 text classification datasets. ... CIFAR10, CIFAR100 [33], Caltech101 [22], Caltech256 [25], Food101 [8], STL10 [16], SUN397 [67], Flower102 [42], EuroSAT [27], Oxford-IIIT Pet [44], Stanford Cars [32], DTD [14], CUB [61], ImageNet [18], ImageNet-r [29], and ImageNet-Sketch [63]. Zero-shot text classification datasets: We use Amazon [41], Gender [20], CivilComments [7], and HateXplain [39]. |
| Dataset Splits | Yes | We selected hyperparameters through grid search, by evaluating their performance on a validation set, consisting of 10 labeled examples per class. |
| Hardware Specification | Yes | Measurements were taken using a machine equipped with an Intel Core i7-11700K @ 3.60GHz processor, 64GB RAM, and NVIDIA GPU RTX-4090. |
| Software Dependencies | No | The paper mentions using CLIP [49] and BERT [19] models, but does not provide specific version numbers for these or any other software libraries, frameworks, or programming languages used in the experiments. |
| Experiment Setup | Yes | We selected hyperparameters through grid search, by evaluating their performance on a validation set, consisting of 10 labeled examples per class. ... Temperature: [1e-3, 1e-4, 1e-5, 1e-6, 1e-7] Learning rate: [1e-3, 1e-4, 1e-5, 1e-6, 1e-7] ... For zero-shot image classification, we employ CLIP [49] models. We used the 'a photo of a [CLASS]' prompt. Scores are computed by $s_\theta(x_i, j) = \frac{\exp(\cos(f(x_i), g(y_j))/\tau)}{\sum_{j'=1}^{K} \exp(\cos(f(x_i), g(y_{j'}))/\tau)}$ for image $x_i$ regarding the label $j$, given the image encoder $f$ and the text encoder $g$. The cost matrix is constructed by $C = [C_{ij}]_{i \in [n], j \in [K]}$, where $C_{ij} = -\log s_\theta(x_i, j)$. We run Algorithm 1 with the true class balance of the test dataset. |
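
The Pseudocode and Experiment Setup rows together describe a pipeline: softmax scores from cosine similarities, a cost matrix built from those scores, a transport plan with a uniform input marginal and the specified label distribution, then a row-wise argmax. The following is a minimal sketch of that pipeline in plain Python, not the authors' code from the linked repository. Several things here are assumptions made for illustration: the function names (`softmax_scores`, `cost_matrix`, `sinkhorn_plan`, `otter_predict`), the `eps`, `reg`, and `n_iters` parameters, and the toy similarity values. The cost is taken as the negative log-score, on the assumption that minimizing cost should favor high-scoring labels. Algorithm 1 calls for an optimal transport solve; this sketch swaps in entropic-regularized Sinkhorn iterations, which only approximate it.

```python
import math


def softmax_scores(cos_sims, tau):
    """Row-wise softmax of cosine similarities scaled by temperature tau."""
    scores = []
    for row in cos_sims:
        top = max(row)  # subtract the row max for numerical stability
        exps = [math.exp((c - top) / tau) for c in row]
        total = sum(exps)
        scores.append([e / total for e in exps])
    return scores


def cost_matrix(scores, eps=1e-12):
    """C_ij = -log s(x_i, j); eps (an assumption) guards against log(0)."""
    return [[-math.log(max(s, eps)) for s in row] for row in scores]


def sinkhorn_plan(C, nu, reg=1.0, n_iters=200):
    """Approximate plan with uniform row marginal and column marginal nu.

    Uses entropic-regularized Sinkhorn iterations, not an exact OT solver.
    """
    n, K = len(C), len(nu)
    mu = [1.0 / n] * n
    G = [[math.exp(-C[i][j] / reg) for j in range(K)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * K
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [mu[i] / sum(G[i][j] * v[j] for j in range(K)) for i in range(n)]
        v = [nu[j] / sum(G[i][j] * u[i] for i in range(n)) for j in range(K)]
    return [[u[i] * G[i][j] * v[j] for j in range(K)] for i in range(n)]


def otter_predict(C, nu, reg=1.0, n_iters=200):
    """Step 4 of Algorithm 1: label each point by the argmax of its plan row."""
    plan = sinkhorn_plan(C, nu, reg, n_iters)
    return [max(range(len(nu)), key=lambda j: plan[i][j]) for i in range(len(C))]


# Toy input (made up): four points, two classes, scores skewed toward class 0.
cos_sims = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.5], [0.55, 0.5]]
scores = softmax_scores(cos_sims, tau=0.1)
zero_shot = [max(range(2), key=lambda j: row[j]) for row in scores]
adapted = otter_predict(cost_matrix(scores), nu=[0.5, 0.5])
print(zero_shot, adapted)
```

On this toy input the plain argmax assigns every point to class 0, while the balanced label distribution `[0.5, 0.5]` moves the two lowest-margin points to class 1. For results closer to the exact transport plan in step 3, a dedicated OT solver would replace `sinkhorn_plan`.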