Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Online Feedback Efficient Active Target Discovery in Partially Observable Environments

Authors: Anindya Sarkar, Binglin Ji, Yevgeniy Vorobeychik

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Through extensive experiments and ablation studies across diverse domains, including medical imaging, species discovery and remote sensing, we show that DiffATD performs significantly better than baselines and competitively with supervised methods that operate under full environmental observability.
Researcher Affiliation	Academia	Anindya Sarkar, Binglin Ji, Yevgeniy Vorobeychik EMAIL, Department of Computer Science and Engineering, Washington University in St. Louis, USA
Pseudocode	Yes	Algorithm 1 Diffusion Dynamics Guided by Measurements
Open Source Code	Yes	Our code and models are publicly available at this link. Our training and inference code will be made public.
Open Datasets	Yes	We consider various targets (e.g., truck, plane) from the DOTA dataset [29] and present the results in Table 1. We evaluate the efficacy of DiffATD in uncovering an unknown species from iNaturalist [30]. To this end, we compare the performance of DiffATD with the baselines in terms of SR using target classes from the skin imaging dataset [31]. Next, we evaluate DiffATD on the Chest X-Ray dataset [32]. We compare the exploration strategy of DiffATD with that of Greedy-Adaptive (GA) using CelebA samples. Here, we evaluate performance by comparing DiffATD with the baselines using the SR metric. We compare the performance across varying measurement budgets B. The results are presented in Table 9. We observe significant improvements in the performance of the proposed DiffATD approach compared to all baselines in each measurement budget setting, ranging from 16.30% to 45.23% improvement relative to the most competitive method. These empirical results are consistent with those from other datasets we explored, such as DOTA, and CelebA.
Dataset Splits	No	With objective 1 in focus, we aim to develop a search policy that efficiently explores the search area (x_test ∈ X_test) to identify as many target regions as possible within a measurement budget B, and to achieve this by utilizing a pre-trained diffusion model, trained in an unsupervised manner on samples x_train from the training set, X_train. The paper mentions training and test sets but does not specify the exact percentages or counts for training, validation, or test splits for any of the datasets used.
Hardware Specification	Yes	Finally, all experiments are implemented in Tensorflow and conducted on NVIDIA A100 40G GPUs.
Software Dependencies	No	Finally, all experiments are implemented in Tensorflow and conducted on NVIDIA A100 40G GPUs. The paper mentions Tensorflow but does not specify its version number or any other software dependencies with version numbers.
Experiment Setup	Yes	This section provides the training and inference hyperparameters for each dataset used in our experiments. We use DDIM [35] as the diffusion model across datasets. The diffusion models used in different experiments are based on widely adopted U-Net-style architecture. For the MNIST dataset, we use 32-dimensional diffusion time-step embeddings, with the diffusion model consisting of 2 residual blocks. We select the time-step embedding vector dimension to match the input feature size, ensuring the diffusion model can process it efficiently. The block widths are set to [32, 64, 128], and training involves 30 diffusion steps. DOTA, CelebA, and Skin imaging datasets share the same input feature size of [128, 128, 3] and architecture, featuring 128-dimensional time-step embeddings and a diffusion model with 2 residual blocks of width [64, 128, 256, 256, 512]. For these datasets, we perform training with 100 diffusion steps. We use 128-dimensional time-step embeddings for the Bone dataset and a diffusion model with 2 residual blocks (each block width: [64, 128, 256, 256]). We use 100 diffusion steps during training. We set the learning rate and weight decay factor to 1e-4 for all experimental settings. We set a measurement schedule (M) of 100 for a measurement budget (B) of 200, ensuring that B T M, where T is the total number of reverse diffusion steps, set to approximately 2000 for the DOTA, CelebA, and Skin datasets. Our proposed method, DiffATD, utilizes a parameterized reward model, r_ϕ, to steer the exploitation process. To this end, we employ a neural network consisting of two fully connected layers, with non-linear ReLU activations as the reward model (r_ϕ). The reward model's goal is to predict a score ranging from 0 to 1, where a higher score indicates a higher likelihood that the measurement location corresponds to the target, based on its semantic features.
Note that the size of the input semantic feature map for a given measurement location can vary depending on the downstream task. For instance, when working with the MNIST dataset, we use a 1 × 1 pixel as the input feature, while for other datasets like CelebA, DOTA, Bone, and Skin imaging, we use a 4 × 4 patch as the input feature size. After each measurement step, we update the model parameters (ϕ) using the objective function outlined in Equation 8. Additionally, the training dataset is updated with the newly observed data point, refining the model's predictions over time.
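
The hyperparameters quoted in the Experiment Setup row can be gathered into one structure for easier comparison across datasets. The following is a minimal sketch in Python, not the authors' code: the class name `DiffusionConfig`, its field names, and the `CONFIGS` mapping are hypothetical, and only values stated in the quoted text are filled in. Settings the text does not give per dataset (for example, input shapes for MNIST and Bone) are deliberately left out.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DiffusionConfig:
    """Per-dataset settings as quoted in the Experiment Setup row.

    All field names are hypothetical labels chosen for this sketch.
    """
    time_embedding_dim: int            # diffusion time-step embedding size
    residual_blocks: int               # number of residual blocks
    block_widths: Tuple[int, ...]      # U-Net block widths
    training_diffusion_steps: int      # diffusion steps used in training
    reward_patch_size: Tuple[int, int]  # input patch for the reward model
    learning_rate: float = 1e-4        # stated for all experimental settings
    weight_decay: float = 1e-4         # stated for all experimental settings


# DOTA, CelebA, and Skin are described as sharing one architecture.
_SHARED_128 = DiffusionConfig(
    time_embedding_dim=128,
    residual_blocks=2,
    block_widths=(64, 128, 256, 256, 512),
    training_diffusion_steps=100,
    reward_patch_size=(4, 4),
)

CONFIGS = {
    "MNIST": DiffusionConfig(
        time_embedding_dim=32,
        residual_blocks=2,
        block_widths=(32, 64, 128),
        training_diffusion_steps=30,
        reward_patch_size=(1, 1),
    ),
    "DOTA": _SHARED_128,
    "CelebA": _SHARED_128,
    "Skin": _SHARED_128,
    "Bone": DiffusionConfig(
        time_embedding_dim=128,
        residual_blocks=2,
        block_widths=(64, 128, 256, 256),
        training_diffusion_steps=100,
        reward_patch_size=(4, 4),
    ),
}

if __name__ == "__main__":
    for name, cfg in CONFIGS.items():
        print(name, cfg.block_widths, cfg.training_diffusion_steps)
```

Laid out this way, the gaps the index flags become visible: the structure has no field for a TensorFlow version or for train/validation/test split sizes, because the quoted text does not report them.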