Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Active Measurement: Efficient Estimation at Scale

Authors: Max Hamilton, Jinlin Lai, Wenlong Zhao, Subhransu Maji, Daniel R. Sheldon

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We show empirically that our techniques provide accurate confidence intervals and reduce estimation error compared to prior methods on several scientific measurement tasks.
Researcher Affiliation	Academia	Max Hamilton, Jinlin Lai, Wenlong Zhao, Subhransu Maji, Daniel Sheldon; Manning College of Information & Computer Sciences, University of Massachusetts, Amherst
Pseudocode	Yes	Algorithm 1 Active measurement. Require: Initially labeled units $D_1 \subseteq \Omega$, acquisition distribution $q_1$, weight sequences $\alpha_\tau$, $\beta_\tau$. 1: for $t = 1, 2, \dots, T$ do 2: Sample $s_t \sim q_t(\cdot)$ and obtain $f(s_t)$ 3: Form IS estimate $\hat{F}_t = F(D_t) + \frac{f(s_t)}{q_t(s_t)}$, $s_t \sim q_t$. (1) 4: Combine estimates as $\hat{F}_{1:t} = \sum_{\tau=1}^{t} \alpha_\tau \hat{F}_\tau$ 5: Get variance estimates $\{\widehat{\mathrm{Var}}_\tau\}_{\tau=1}^{t}$ using Alg. 2 6: Combine variances as $\widehat{\mathrm{Var}}_{1:t} = \sum_{\tau=1}^{t} \alpha_\tau^2 \widehat{\mathrm{Var}}_\tau$ 7: Update $D_{t+1} = D_t \cup \{s_t\}$ 8: Update acquisition distribution $q_{t+1}$ over $\Omega \setminus D_{t+1}$, e.g., by updating an AI model using $D_{t+1}$ 9: end for
Open Source Code	Yes	Code for this paper is available at: https://github.com/cvl-umass/active-measurement.
Open Datasets	Yes	The Malaria Cell dataset (image set BBBC041v1, available from the Broad Bioimage Benchmark Collection [24]) comprises 1,364 images (about 80,000 cells). For damaged building detection we focus on the Palu Tsunami subset of xBD [15]. The NEXRAD radar data that we use is open to the public and can be used as desired.
Dataset Splits	Yes	We divide the sky and reeds images into tiles of size 200 × 200 and 160 × 160 pixels, respectively, and manually annotate the birds in each tile using the VGG annotator [11]. This results in 925 tiles for sky and 1,426 tiles for reeds... The detector performs reasonably well, with average error rates of 9.5% and 33.1% when trained on 50 randomly selected tiles. To reduce overfitting, we use a mix of 80% pretraining data and 20% station-specific labeled data. For our initial model we finetune the default Faster R-CNN network on three randomly selected cell images. The initial model is trained on 5 randomly selected images.
Hardware Specification	Yes	The model is fine-tuned with a single A16 GPU for 400 iterations and a learning rate of 0.001 on the annotated tiles. Each fine-tuning run takes about 2 hours on a single A16 GPU.
Software Dependencies	No	To detect birds, we train a Faster R-CNN [35] detector with a ResNet-50 [17] backbone pre-trained on ImageNet [9], using the Detectron2 [42] library. We use the Detectron2 library with modified configurations.
Experiment Setup	Yes	The model is fine-tuned with a single A16 GPU for 400 iterations and a learning rate of 0.001 on the annotated tiles. We adapt the detector to each station with learning rate $10^{-4}$ for 3000 iterations on a single A16 GPU. Specifically, we disable learning rate warmup, set the batch size to 8, and set FILTER_EMPTY_ANNOTATIONS to False. For the high-resolution image experiments, we use the faster_rcnn_R_50_FPN_3x.yaml configuration with the default ImageNet-pretrained weights.
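
Illustrative sketch: The loop quoted in the Pseudocode row (Algorithm 1) can be sketched in a few lines of Python. This is not the authors' implementation, which is in the repository linked above. The function name `active_measurement`, the fixed `proxy` scores standing in for an AI model, and the uniform weights (alpha = 1/T) are all assumptions made for illustration. The variance steps are omitted because Alg. 2 is not reproduced on this page.

```python
import random


def active_measurement(f, proxy, initial_labeled, num_steps, seed=0):
    """Estimate F = sum of f over all units by labeling num_steps extra units.

    f:      true measurements; f[s] is only read once unit s is labeled.
    proxy:  positive scores used to build the acquisition distribution q_t
            (hypothetical stand-in for model predictions, kept fixed here).
    """
    labeled = set(initial_labeled)             # D_t
    if num_steps < 1 or num_steps > len(f) - len(labeled):
        raise ValueError("num_steps must be between 1 and the number of unlabeled units")
    rng = random.Random(seed)
    labeled_sum = sum(f[s] for s in labeled)   # F(D_t)
    alpha = 1.0 / num_steps                    # assumed uniform weights
    combined = 0.0
    for _ in range(num_steps):
        # q_t: proxy scores renormalized over the unlabeled units.
        unlabeled = [s for s in range(len(f)) if s not in labeled]
        total = sum(proxy[s] for s in unlabeled)
        s_t = rng.choices(unlabeled, weights=[proxy[s] for s in unlabeled], k=1)[0]
        q_s = proxy[s_t] / total
        # Importance-sampling estimate: labeled sum plus reweighted sample.
        estimate_t = labeled_sum + f[s_t] / q_s
        combined += alpha * estimate_t
        # D_{t+1} = D_t with s_t added.
        labeled.add(s_t)
        labeled_sum += f[s_t]
    return combined, labeled


counts = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]   # made-up per-unit counts
noisy = [2.0, 2.0, 3.0, 1.0, 6.0, 7.0, 2.0, 5.0]    # made-up proxy scores
estimate, labeled_units = active_measurement(counts, noisy, [0], num_steps=4)
print(estimate, sorted(labeled_units))
```

Each per-step estimate is unbiased for the total as long as the proxy scores are positive on every unlabeled unit. When the proxy is exactly proportional to the true measurements, the importance weight cancels the sampled value and every step returns the exact total, which is why a better model reduces estimation error.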