SureMap: Simultaneous mean estimation for single-task and multi-task disaggregated evaluation
Authors: Misha Khodak, Lester Mackey, Alexandra Chouldechova, Miro Dudik
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate SureMap on disaggregated evaluation tasks in multiple domains, observing significant accuracy improvements over several strong competitors. |
| Researcher Affiliation | Collaboration | Mikhail Khodak Princeton University mkhodak@cs.cmu.edu Lester Mackey, Alexandra Chouldechova, Miroslav Dudík Microsoft Research {lmackey,alexandrac,mdudik}@microsoft.com |
| Pseudocode | Yes | Algorithm 1: Single-task SureMap. (For multi-task SureMap see D.) |
| Open Source Code | Yes | Code for both generating the task data and reproducing the method evaluations is available at https://github.com/mkhodak/SureMap. |
| Open Datasets | Yes | Diabetes. This is a tabular dataset of Strack et al. [2014]... Adult. We use the classic Adult census dataset [Kohavi, 1996]... State-Level ACS (SLACS). ... assembled by Ding et al. [2021]... Common Voice (CV) dataset [Ardila et al., 2020] |
| Dataset Splits | Yes | Common Voice. This is a single-task dataset obtained by combining the validation and test partitions of the CV dataset. |
| Hardware Specification | Yes | By far the most computation was required to generate the Common Voice, CVC, and Adult tasks, which was done on a machine with two RTX-8000 GPUs and took about a week. |
| Software Dependencies | No | The paper mentions software like 'Whisper ASR model', 'llama-3-70b', and 'SciPy implementation' but does not provide specific version numbers for these or other key software dependencies. |
| Experiment Setup | Yes | Our main metric is MAE relative to a ground truth vector, which we take to be the mean of all available data for each subpopulation g ∈ [d], except those with fewer than 40 samples. In our main results we subsample with replacement from the entire dataset at different rates and track performance as a function of the sizes of the resulting datasets. To obtain 95% confidence intervals we conduct 200 and 40 random trials at each subsampling rate in the single-task and multi-task settings, respectively. |
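
The evaluation protocol quoted in the Experiment Setup row can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the estimator is a plain per-group sample mean standing in for SureMap, missing groups fall back to the pooled mean of the subsample, and the 95% interval uses a normal approximation across trials, all of which are assumptions not stated in the quoted text.

```python
import numpy as np

def ground_truth_means(values, groups, min_samples=40):
    """Mean of all available data per group, skipping groups with fewer than min_samples."""
    truth = {}
    for g in np.unique(groups):
        mask = groups == g
        if mask.sum() >= min_samples:
            truth[g] = values[mask].mean()
    return truth

def naive_estimate(values, groups):
    """Placeholder estimator (assumption): the per-group sample mean, not SureMap."""
    return {g: values[groups == g].mean() for g in np.unique(groups)}

def mae(estimate, truth, fallback):
    """Mean absolute error over ground-truth groups; groups missing from the
    estimate use the fallback value (assumption)."""
    errs = [abs(estimate.get(g, fallback) - mu) for g, mu in truth.items()]
    return float(np.mean(errs))

def evaluate(values, groups, rate, n_trials=200, seed=0):
    """Subsample with replacement at the given rate, repeat n_trials times, and
    return the mean MAE with a normal-approximation 95% interval (assumption)."""
    rng = np.random.default_rng(seed)
    truth = ground_truth_means(values, groups)
    n = len(values)
    m = max(1, int(rate * n))
    errors = np.empty(n_trials)
    for t in range(n_trials):
        idx = rng.integers(0, n, size=m)  # sampling with replacement
        sub_v, sub_g = values[idx], groups[idx]
        errors[t] = mae(naive_estimate(sub_v, sub_g), truth, fallback=sub_v.mean())
    mean = float(errors.mean())
    half = 1.96 * float(errors.std(ddof=1)) / np.sqrt(n_trials)
    return mean, (mean - half, mean + half)
```

Calling `evaluate(values, groups, rate=0.1)` with `n_trials=200` mirrors the single-task trial count quoted above, and `n_trials=40` mirrors the multi-task count. Sweeping `rate` gives error as a function of subsample size.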