Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Okapi: Generalising Better by Making Statistical Matches Match
Authors: Myles Bartlett, Sara Romiti, Viktoriia Sharmanska, Novi Quadrianto
NeurIPS 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experiment on the WILDS 2.0 datasets [63], which significantly expands the range of modalities, applications, and shifts available for studying and benchmarking real-world unsupervised adaptation. Our method outperforms the baseline methods in terms of out-of-distribution (OOD) generalisation on the iWildCam (a multi-class classification task) and PovertyMap (a regression task) image datasets as well as the CivilComments (a binary classification task) text dataset. |
| Researcher Affiliation | Academia | Myles Bartlett¹, Sara Romiti¹, Viktoriia Sharmanska¹,², Novi Quadrianto¹,³,⁴; ¹Predictive Analytics Lab, University of Sussex; ²Imperial College London; ³BCAM Severo Ochoa Strategic Lab on Trustworthy Machine Learning; ⁴Monash University, Indonesia |
| Pseudocode | Yes | See Fig 3.2 for a pictorial representation of these steps and Appendix G for reference pseudocode. These steps are illustrated pictorially in Fig 2 and as pseudocode in Appendix G. |
| Open Source Code | Yes | Code for our paper is publicly available at https://github.com/wearepal/okapi/. |
| Open Datasets | Yes | We evaluate Okapi on three datasets taken from the WILDS 2.0 benchmark [63]. |
| Dataset Splits | Yes | Following [63], we compute the mean and standard deviation (shown in parentheses) over multiple runs for both ID and OOD test sets, with these runs conducted with 3 different random seeds and 5 pre-defined cross-validation folds for iWildCam and PovertyMap, respectively. We attribute this partly to the high variance of the model-selection procedure (inherited from [63]) based on intermittently-computed validation performance (which does not consistently align with test performance) to determine the final model. |
| Hardware Specification | No | No, however do provide estimates of the carbon footprint for a single run of our method and of the ERM and FixMatch baselines for the iWildCam dataset. |
| Software Dependencies | No | PyTorch: An imperative style, high-performance deep learning library. with all models trained with a pre-trained DistilBERT [64] backbone. with us opting for a ConvNeXt [47] architecture over a ResNet one. |
| Experiment Setup | Yes | Yes; all implementation details, including those related to optimisation and hyperparameter-selection, are given in Appendix D. |