Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Does It Pay to Optimize AUC?
Authors: Baojian Zhou, Steven Skiena
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that compared with other methods, AUC-opt achieves statistically significant improvements on between 17 to 40 in R2 and between 4 to 42 in R3 of 50 t-SNE training datasets. |
| Researcher Affiliation | Academia | ¹Fudan University, Shanghai, China; ²Stony Brook University, New York, USA |
| Pseudocode | Yes | Algorithm 1: [AUCopt, w] = AUC-opt(D) and Algorithm 2: [AUCopt, w] = AUC-opt(D, d) |
| Open Source Code | Yes | Our code can be found in https://github.com/baojian/auc-opt |
| Open Datasets | No | The paper states 'We collect 50 real-world datasets' but does not provide specific names, links, DOIs, or formal citations with author/year for public access to these datasets. |
| Dataset Splits | Yes | For each dataset, 50% samples are for training and the rest for testing. All parameters are tuned by 5-fold cross-validation. |
| Hardware Specification | Yes | All methods have been tested on servers with Intel(R) Xeon(R) CPU (2.30GHz) 64 cores and 187G memory. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., library names with explicit version details). |
| Experiment Setup | Yes | For each dataset, 50% samples are for training and the rest for testing. All parameters are tuned by 5-fold cross-validation. Each dataset is randomly shuffled 200 times, and the reported results are averaged on 200 trials. |
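
The evaluation protocol quoted in the Dataset Splits and Experiment Setup rows (random shuffling, a 50/50 train/test split, results averaged over 200 trials) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' code. The names `auc_score`, `repeated_holdout_auc`, and the `fit` callable are hypothetical, and the 5-fold cross-validation for parameter tuning is assumed to happen inside `fit`.

```python
import random
from statistics import mean


def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y != 1]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def repeated_holdout_auc(X, y, fit, n_trials=200, seed=0):
    """Average test AUC over n_trials random 50/50 train/test splits.

    `fit(X_train, y_train)` is a hypothetical callable that returns a
    function mapping one sample to a real-valued score. Any parameter
    tuning (e.g. 5-fold cross-validation) is assumed to live inside it.
    """
    rng = random.Random(seed)
    idx = list(range(len(y)))
    aucs = []
    for _ in range(n_trials):
        rng.shuffle(idx)  # reshuffle the dataset for each trial
        half = len(idx) // 2
        train, test = idx[:half], idx[half:]
        score = fit([X[i] for i in train], [y[i] for i in train])
        test_y = [y[i] for i in test]
        if len(set(test_y)) < 2:
            continue  # skip a degenerate split with a single class in the test half
        aucs.append(auc_score(test_y, [score(X[i]) for i in test]))
    return mean(aucs)
```

For example, `repeated_holdout_auc(X, y, fit)` with the default `n_trials=200` mirrors the trial count quoted above; the fixed `seed` is an added assumption to make the sketch repeatable.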