Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
SureMap: Simultaneous mean estimation for single-task and multi-task disaggregated evaluation
Authors: Misha Khodak, Lester Mackey, Alexandra Chouldechova, Miro Dudik
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate SureMap on disaggregated evaluation tasks in multiple domains, observing significant accuracy improvements over several strong competitors. |
| Researcher Affiliation | Collaboration | Mikhail Khodak (Princeton University); Lester Mackey, Alexandra Chouldechova, Miroslav Dudík (Microsoft Research) |
| Pseudocode | Yes | Algorithm 1: Single-task SureMap. (For multi-task SureMap see D.) |
| Open Source Code | Yes | Code for both generating the task data and reproducing the method evaluations is available at https://github.com/mkhodak/SureMap. |
| Open Datasets | Yes | Diabetes. This is a tabular dataset of Strack et al. [2014]... Adult. We use the classic Adult census dataset [Kohavi, 1996]... State-Level ACS (SLACS). ... assembled by Ding et al. [2021]... Common Voice (CV) dataset [Ardila et al., 2020] |
| Dataset Splits | Yes | Common Voice. This is a single-task dataset obtained by combining the validation and test partitions of the CV dataset. |
| Hardware Specification | Yes | By far the most computation was required to generate the Common Voice, CVC, and Adult tasks, which was done on a machine with two RTX-8000 GPUs and took about a week. |
| Software Dependencies | No | The paper mentions software like 'Whisper ASR model', 'llama-3-70b', and 'SciPy implementation' but does not provide specific version numbers for these or other key software dependencies. |
| Experiment Setup | Yes | Our main metric is MAE relative to a ground truth vector, which we take to be the mean of all available data for each subpopulation g ∈ [d], except those with fewer than 40 samples. In our main results we subsample with replacement from the entire dataset at different rates and track performance as a function of the sizes of the resulting datasets. To obtain 95% confidence intervals we conduct 200 and 40 random trials at each subsampling rate in the single-task and multi-task settings, respectively. |
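
The evaluation protocol quoted in the Experiment Setup row can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes the data is a list of hypothetical `(group, outcome)` pairs, uses the naive per-group sample mean as a stand-in estimator rather than SureMap, falls back to the pooled mean for groups missing from a subsample, and builds the 95% interval with a normal approximation over trials. The paper excerpt specifies none of these choices, and all function names are illustrative.

```python
import random
import statistics
from collections import defaultdict

MIN_GROUP_SIZE = 40  # groups with fewer samples are left out of the ground truth


def group_means(records):
    """Return {group: mean outcome} for a list of (group, outcome) pairs."""
    sums, counts = defaultdict(float), defaultdict(int)
    for group, outcome in records:
        sums[group] += outcome
        counts[group] += 1
    return {g: sums[g] / counts[g] for g in counts}


def ground_truth(records, min_size=MIN_GROUP_SIZE):
    """Full-data group means, keeping only groups with at least min_size samples."""
    counts = defaultdict(int)
    for group, _ in records:
        counts[group] += 1
    return {g: m for g, m in group_means(records).items() if counts[g] >= min_size}


def mae(estimates, truth, fallback):
    """Mean absolute error over ground-truth groups.

    Groups absent from `estimates` use `fallback` (an assumption of this sketch).
    """
    errors = [abs(estimates.get(g, fallback) - t) for g, t in truth.items()]
    return sum(errors) / len(errors)


def run_trials(records, rate, n_trials, seed=0):
    """Subsample with replacement at `rate`, n_trials times.

    Returns (mean MAE, (lower, upper)), where the interval is a
    normal-approximation 95% confidence interval (assumed, not from the paper).
    """
    rng = random.Random(seed)
    truth = ground_truth(records)
    k = max(1, round(rate * len(records)))
    maes = []
    for _ in range(n_trials):
        sample = rng.choices(records, k=k)  # sampling with replacement
        pooled = sum(y for _, y in sample) / len(sample)
        maes.append(mae(group_means(sample), truth, fallback=pooled))
    mean = statistics.fmean(maes)
    half_width = 1.96 * statistics.stdev(maes) / len(maes) ** 0.5
    return mean, (mean - half_width, mean + half_width)
```

Calling `run_trials(records, rate=0.1, n_trials=200)` would mirror the single-task trial count quoted above, and `n_trials=40` the multi-task one. Sweeping `rate` gives MAE as a function of subsample size.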