Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Formalizing Preferences Over Runtime Distributions
Authors: Devon R. Graham, Kevin Leyton-Brown, Tim Roughgarden
ICML 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "This paper aims to lay theoretical foundations for such choices by formalizing preferences over runtime distributions. ... Finally, in Section 5 we present some real-world examples where the choice of utility function really is important and changes our conclusions about which algorithm is considered best." Later, in Section 5: "Algorithm Configuration. We considered a dataset due to Weisz et al. (2018) which evaluated 972 randomly-sampled configurations of the minisat (Sorensson & Een, 2005) SAT solver... Our results (Figure 3) show that these differences were significant in practice: we often lost a substantial fraction of the available utility when we optimized for the wrong utility function. International SAT Competition. Figure 4 shows the ranking of the Parallel Track of the 2021 International SAT Competition." |
| Researcher Affiliation | Collaboration | ¹Department of Computer Science, University of British Columbia, Vancouver, BC; ²Department of Computer Science, Columbia University, New York, New York; ³a16z crypto. Correspondence to: Devon R. Graham <EMAIL>. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code to reproduce all figures can be found at https://github.com/drgrhm/formalizing-preferences |
| Open Datasets | Yes | We considered a dataset due to Weisz et al. (2018) which evaluated 972 randomly-sampled configurations of the minisat (Sorensson & Een, 2005) SAT solver on 20118 instances generated by CNFuzz DD. |
| Dataset Splits | No | The paper mentions evaluating configurations on '20118 instances generated by CNFuzz DD' but does not specify any training, validation, or test splits for these instances. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory, or cloud instance types) used for running its experiments. |
| Software Dependencies | No | The paper mentions the 'minisat SAT solver' and 'CNFuzz DD' but does not provide specific version numbers for these or any other software dependencies used in the experiments. |
| Experiment Setup | No | The paper mentions evaluating 'randomly-sampled configurations' and analyzing results from the SAT Competition, but it does not provide specific experimental setup details such as hyperparameter values, training configurations, or system-level settings used for its own analysis. |