Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Estimating Uncertainty Online Against an Adversary
Authors: Volodymyr Kuleshov, Stefano Ermon
AAAI 2017 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We establish formal guarantees for our methods, and we validate them on two real-world problems: question answering and medical diagnosis from genomic data. We now proceed to study Algorithm 1 empirically. |
| Researcher Affiliation | Academia | Volodymyr Kuleshov Stanford University Stanford, CA 94305 EMAIL Stefano Ermon Stanford University Stanford, CA 94305 EMAIL |
| Pseudocode | Yes | Algorithm 1: Online Recalibration. Require: Online calibration subroutine $F^{\mathrm{cal}}$ and number of buckets $M$. 1: Let $I = \{[0, \tfrac{1}{M}), \ldots, [\tfrac{M-1}{M}, 1]\}$ be a set of intervals that partition $[0, 1]$. 2: Let $F = \{F^{\mathrm{cal}}_j \mid j = 0, \ldots, M-1\}$ be a set of $M$ independent instances of $F^{\mathrm{cal}}$. 3: for $t = 1, 2, \ldots$ do 4: Observe uncalibrated forecast $p^F_t$. 5: Let $I_j \in I$ be the interval containing $p^F_t$. 6: Let $p_t$ be the forecast of $F^{\mathrm{cal}}_j$. 7: Output $p_t$. Observe $y_t$ and pass it to $F^{\mathrm{cal}}_j$. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code or provide a link to a code repository. |
| Open Datasets | Yes | Natural language understanding. We used Algorithm 1 to recalibrate a state-of-the-art question answering system (Berant and Liang 2014) on the popular Free917 dataset (641 training, 276 testing examples). Medical diagnosis. Our last task is predicting the risk of type 1 diabetes from genomic data. We use genotypes of 3,443 subjects (1,963 cases, 1,480 controls) over 447,221 SNPs (The Wellcome Trust Case Control Consortium 2007) |
| Dataset Splits | Yes | Natural language understanding. We used Algorithm 1 to recalibrate a state-of-the-art question answering system (Berant and Liang 2014) on the popular Free917 dataset (641 training, 276 testing examples). |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as CPU/GPU models or memory specifications. |
| Software Dependencies | No | The paper mentions using a 'linear support vector machine (SVM)' but does not provide specific software names with version numbers for its implementation or other dependencies. |
| Experiment Setup | Yes | We used an online $\ell_1$-regularized linear support vector machine (SVM) to predict outcomes one patient at a time, and report performance for each $t \in [T]$. Uncalibrated probabilities are normalized raw SVM scores $s_t$, i.e. $p^F_t = (s_t + m_t)/(2 m_t)$, where $m_t = \max_{1 \le r \le t} \lvert s_r \rvert$. |
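
The Pseudocode and Experiment Setup rows above can be read together as a small online pipeline: normalize a raw score into [0, 1], route it to one of M buckets, and let that bucket's calibration subroutine produce the forecast. The following is a minimal Python sketch of that reading, not the authors' implementation (the paper releases no code). The class names `ScoreNormalizer`, `OnlineRecalibrator`, and `RunningMeanCalibrator` are hypothetical, and the running-mean subroutine is only a simple stand-in for the calibration subroutine: it does not carry the adversarial guarantees the paper requires of that subroutine.

```python
class RunningMeanCalibrator:
    """Hypothetical stand-in for the calibration subroutine.

    It forecasts the running mean of the outcomes it has seen. This is an
    assumption made for illustration; it is NOT calibrated against an adversary.
    """

    def __init__(self) -> None:
        self.count = 0
        self.total = 0.0

    def predict(self) -> float:
        # With no observations yet, fall back to an uninformative 0.5.
        return self.total / self.count if self.count else 0.5

    def update(self, y: int) -> None:
        self.count += 1
        self.total += y


class ScoreNormalizer:
    """Maps raw scores s_t to (s_t + m_t) / (2 m_t), with m_t = max_{r<=t} |s_r|."""

    def __init__(self) -> None:
        self.max_abs = 0.0

    def __call__(self, score: float) -> float:
        self.max_abs = max(self.max_abs, abs(score))
        if self.max_abs == 0.0:
            # Assumption: the formula is undefined when every score so far is 0.
            return 0.5
        return (score + self.max_abs) / (2.0 * self.max_abs)


class OnlineRecalibrator:
    """Sketch of Algorithm 1: one independent subroutine per bucket of [0, 1]."""

    def __init__(self, num_buckets: int, make_subroutine=RunningMeanCalibrator):
        self.num_buckets = num_buckets
        self.subroutines = [make_subroutine() for _ in range(num_buckets)]
        self._active = None  # index of the bucket used for the pending forecast

    def _bucket(self, p: float) -> int:
        # Buckets are [j/M, (j+1)/M); the last one is closed so that p = 1 fits.
        return min(int(p * self.num_buckets), self.num_buckets - 1)

    def forecast(self, p_uncalibrated: float) -> float:
        self._active = self._bucket(p_uncalibrated)
        return self.subroutines[self._active].predict()

    def observe(self, y: int) -> None:
        # Pass the outcome only to the subroutine that made the forecast.
        self.subroutines[self._active].update(y)
```

In use, each round calls `forecast` on the normalized score, emits the returned probability, and then calls `observe` with the realized label. Swapping `make_subroutine` for a subroutine with formal calibration guarantees is the step this sketch leaves open.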