Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
AHA: Human-Assisted Out-of-Distribution Generalization and Detection
Authors: Haoyue Bai, Jifan Zhang, Robert Nowak
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments validate the efficacy of our framework. We observed that with only a few hundred human annotations, our method significantly outperforms existing state-of-the-art methods that do not involve human assistance, in both OOD generalization and OOD detection. ... Extensive experiments and ablation studies demonstrate the effectiveness of our human-assisted method. |
| Researcher Affiliation | Academia | Haoyue Bai, Jifan Zhang, Robert Nowak University of Wisconsin-Madison EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1 AHA: Adaptive Human Assisted labeling for OOD learning |
| Open Source Code | Yes | Code is publicly available at https://github.com/HaoyueBaiZJU/aha. |
| Open Datasets | Yes | Following the benchmark in literature of [6], we use the CIFAR-10 [60] as $P_{\text{in}}$ and CIFAR-10-C [45] with Gaussian additive noise as the $P_{\text{out}}^{\text{covariate}}$ for our main experiments. ... For semantic OOD data ($P_{\text{out}}^{\text{semantic}}$), we utilize natural image datasets including SVHN [72], Textures [19], Places365 [113], LSUN-Crop [103], and LSUN-Resize [103]. Additionally, we provide results on the PACS dataset [64] from DomainBed. |
| Dataset Splits | Yes | To compile the wild data, we divide the ID set into 50% labeled as ID (in-distribution) and 50% unlabeled. We then mix unlabeled ID, covariate OOD, and semantic OOD data for our experiments. ... Within the training/validation split, 70% of the data is used for training, and the remaining 30% is used for validation. |
| Hardware Specification | Yes | Experiments are performed using Tesla V100. |
| Software Dependencies | Yes | Our framework was implemented using PyTorch 2.0.1. |
| Experiment Setup | Yes | For CIFAR experiments, we adopt a Wide ResNet [104] with 40 layers and a widen factor of 2. For optimization, we use stochastic gradient descent with Nesterov momentum [27], including a weight decay of 0.0005 and a momentum of 0.09. The batch size is set to 128, and the initial learning rate is 0.1, with cosine learning rate decay. The model is initialized with a pre-trained network on CIFAR-10 and trained for 100 epochs using our objective from Equation 4, with α = 10. We set a default labeling budget k of 1000 for the benchmarking results and provide an analysis of different labeling budgets 100, 500, 1000, 2000 in Section 5.3. |
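
The Dataset Splits and Experiment Setup rows describe procedures concrete enough to sketch. The Python below is a minimal illustration, not the authors' code. It shows one way the 50/50 labeled/unlabeled ID split, the wild-data mixture, and a 70/30 training/validation split could fit together, plus a cosine learning-rate curve starting at 0.1 over 100 epochs. Several things are assumptions: the function names (`split_fraction`, `build_wild_data`, `cosine_lr`) are hypothetical, the quoted text does not say which pool the 70/30 split applies to (the labeled ID half is assumed here), the OOD pool sizes in the usage lines are arbitrary, and the schedule is assumed to decay to zero with no warmup or restarts.

```python
import math
import random


def split_fraction(items, numerator, denominator, rng):
    """Shuffle a copy of items and cut it at numerator/denominator."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    # Integer arithmetic avoids float rounding at the cut point.
    cut = len(shuffled) * numerator // denominator
    return shuffled[:cut], shuffled[cut:]


def build_wild_data(id_items, covariate_ood, semantic_ood, seed=0):
    """Return (train, val, wild) following the splits quoted in the table."""
    rng = random.Random(seed)

    # 50% of the ID set keeps its labels; the other 50% is treated as unlabeled.
    labeled_id, unlabeled_id = split_fraction(id_items, 1, 2, rng)

    # Wild pool: unlabeled ID mixed with covariate and semantic OOD samples.
    # The source tags are kept only so the mixture can be inspected here;
    # a learner would not have access to them.
    wild = (
        [("id", x) for x in unlabeled_id]
        + [("covariate_ood", x) for x in covariate_ood]
        + [("semantic_ood", x) for x in semantic_ood]
    )
    rng.shuffle(wild)

    # ASSUMPTION: the 70/30 train/validation split is applied to the labeled
    # ID half; the quoted text does not specify the pool.
    train, val = split_fraction(labeled_id, 7, 10, rng)
    return train, val, wild


def cosine_lr(epoch, total_epochs=100, base_lr=0.1):
    """ASSUMPTION: plain cosine decay from base_lr to 0, no warmup or restarts."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))


if __name__ == "__main__":
    # Integer IDs stand in for images; the OOD pool sizes are arbitrary.
    train, val, wild = build_wild_data(
        range(1000), range(1000, 1300), range(1300, 1500)
    )
    print(len(train), len(val), len(wild))
    print([round(cosine_lr(e), 4) for e in (0, 25, 50, 75, 100)])
```

Seeding a local `random.Random` keeps the split reproducible without touching global random state, which matters when the same partition has to be reused across labeling budgets.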