Beyond Confidence: Reliable Models Should Also Consider Atypicality

Authors: Mert Yuksekgonul, Linjun Zhang, James Y. Zou, Carlos Guestrin

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental Setup: We investigate three classification settings across a range of datasets: 1. Balanced Supervised Classification: We use ResNet18/50/152 [HZRS16], WideResNet28 [ZK16], RoBERTa [LOG+19] trained on ImageNet [DDS+09], CIFAR10/100 [Kri09], and MNLI [WNB18], respectively. 2. Imbalanced Supervised Classification: We use ResNet18, ResNeXt50, ResNet152 trained on CIFAR-LT, ImageNet-LT, and Places365-LT, where models and data are mostly from [ZCLJ21, MKS+20]. 3. Classification with LLMs: We use open-source Alpaca7B [TGZ+23] on IMDB [MDP+11], TREC [LR02], and AG News [ZZL15] datasets with the prompts from [ZWF+21].
Researcher Affiliation | Academia | Mert Yuksekgonul (Stanford University, merty@stanford.edu); Linjun Zhang (Rutgers University, lz412@stat.rutgers.edu); James Zou (Stanford University, jamesz@stanford.edu); Carlos Ernesto Guestrin (Stanford University, CZ Biohub; guestrin@stanford.edu)
Pseudocode | No | The paper describes methods in prose and mathematical formulations but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at https://github.com/mertyg/beyond-confidence-atypicality
Open Datasets | Yes | Experimental Setup: We investigate three classification settings across a range of datasets: 1. Balanced Supervised Classification: We use ResNet18/50/152 [HZRS16], WideResNet28 [ZK16], RoBERTa [LOG+19] trained on ImageNet [DDS+09], CIFAR10/100 [Kri09], and MNLI [WNB18], respectively. and Fitzpatrick17k [GHS+21] is a dataset of clinical images with Fitzpatrick skin-tone annotations from 1 to 6.
Dataset Splits | Yes | In the experiments, we randomly split the test sets into two equal halves to have a calibration split and a test split, and repeat the experiments over 10 random seeds. and We use the validation splits of these datasets as the calibration set, and report the results on the test set. and We split the dataset into 3 sets (Training (0.5), Validation (0.25), and Test (0.25)).
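The first quoted protocol (randomly splitting each test set into equal calibration and test halves, repeated over 10 random seeds) can be sketched as follows. This is a minimal illustration, not the paper's code; `calibration_test_split` is a hypothetical helper name.

```python
import random

def calibration_test_split(indices, seed):
    """Randomly split a test set's indices into two equal halves:
    a calibration split and a test split (hypothetical helper,
    not from the paper's released code)."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

# The experiments repeat this over 10 random seeds.
splits = [calibration_test_split(range(1000), seed) for seed in range(10)]
```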
Hardware Specification | Yes | Our experiments were run on a single NVIDIA A100-80GB GPU.
Software Dependencies | No | The paper mentions software like PyTorch, Huggingface Datasets, the Transformers library, and Torchvision but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | We train the models for 50 epochs, fixing the backbone and training only the probe on top of the penultimate layer. The probe consists of 2 layers: one layer of 256 units followed by ReLU and Dropout with probability 0.4, followed by the classifier layer with an output dimensionality of 9. We use an Adam optimizer with a 0.0001 learning rate. and We use a 0.1 learning rate and 3000 maximum iterations across all experiments and initialize the temperature value at 1.
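The second quoted setting (0.1 learning rate, 3000 maximum iterations, temperature initialized at 1) describes fitting a recalibration temperature. A minimal pure-Python sketch of temperature scaling under those hyperparameters is below; it performs gradient descent on the negative log-likelihood of softmax(logits / T). The positivity clamp is our own numerical safeguard, not something stated in the report.

```python
import math

def temperature_scale(logits, labels, lr=0.1, max_iters=3000, init_t=1.0):
    """Fit a single temperature T by gradient descent on the negative
    log-likelihood of softmax(logits / T).  Defaults follow the quoted
    setup: lr 0.1, 3000 max iterations, T initialized at 1."""
    t = init_t
    n = len(logits)
    for _ in range(max_iters):
        grad = 0.0
        for z, y in zip(logits, labels):
            # Numerically stable softmax of the scaled logits z / T.
            scaled = [zi / t for zi in z]
            m = max(scaled)
            exps = [math.exp(s - m) for s in scaled]
            total = sum(exps)
            probs = [e / total for e in exps]
            # d(NLL)/dT for one example: (z_y - E_p[z]) / T^2
            expected = sum(p * zi for p, zi in zip(probs, z))
            grad += (z[y] - expected) / (t * t)
        t = max(t - lr * grad / n, 1e-3)  # clamp keeps T positive (our safeguard)
    return t
```

On an overconfident model (high logits, some labels wrong), the fitted temperature comes out above 1, softening the predicted probabilities.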