Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Test Time Scaling for Neural Processes

Authors: Hyungi Lee, Moonseok Choi, Hyunsu Kim, Kyunghyun Cho, Rajesh Ranganath, Juho Lee

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this section, we conduct a series of experiments to empirically validate the effectiveness of TTSNPs across a variety of settings, with a particular focus on regression tasks. We utilize two representative models from the latent NP family: simple NP [19], which is the earliest model in this line of work, and DANP [33], a recent model that has demonstrated strong performance and broad applicability across various tasks.
Researcher Affiliation	Collaboration	Hyungi Lee, Kookmin University, EMAIL; Moonseok Choi, KAIST, EMAIL; Hyunsu Kim, KAIST, EMAIL; Kyunghyun Cho, New York University & Genentech, EMAIL; Rajesh Ranganath, New York University, EMAIL; Juho Lee, KAIST, EMAIL
Pseudocode	Yes	Algorithm 1: Multinomial Resampling; Algorithm 2: Overall TTSNP inference algorithm
Open Source Code	Yes	To support reproducibility, we provide our full experimental code as part of the supplementary material. Our implementation is based on the official codebase of DANP [33], and all experiments were performed using PyTorch [3]. Training and evaluation were carried out on either an
Open Datasets	Yes	For our experiments, we constructed a modified version of the EMNIST Balanced dataset [9], a widely used benchmark derived from the original NIST Special Database. For experiments involving natural image completion, we used the CelebA dataset [37], a large-scale face dataset commonly used in generative modeling benchmarks. To construct GP tasks for our experiments, we generate synthetic datasets using GPs equipped with one of three commonly used kernels: the RBF kernel, the Matérn 5/2 kernel, and the RQ kernel.
Dataset Splits	Yes	From this selection, we sampled 24,000 images for training and 4,000 for testing. During training, for each episode, we randomly chose the number of context points |c| from a uniform distribution over [5, 45], and the number of target points |t| was drawn from Unif(5, 50 − |c|), ensuring a total of at most 50 points per task. The dataset consists of 162,770 training images, 19,867 for validation, and 19,962 for testing. To simulate diverse conditioning scenarios, we sampled the number of context points |c| ∼ Unif(5, 45), and drew the number of target points |t| ∼ Unif(5, 50 − |c|), ensuring that each training episode contains a variable and realistic number of observed and queried pixels. And for the number of context points |c|, we used the sampled number from the range Unif(5n², 50n² − |c|) where n indicates the x dimension for the GP task. This quadratic scaling with n reflects the increased data requirements for higher-dimensional input spaces. Similarly, the number of target points |t| is sampled from Unif(5n², 50n² − |c|), maintaining a fixed upper limit on the total number of points per task.
Hardware Specification	Yes	Training and evaluation were carried out on either an NVIDIA GeForce RTX 3090 or an RTX A6000 GPU.
Software Dependencies	Yes	Our implementation is based on the official codebase of DANP [33], and all experiments were performed using PyTorch [3]. Training and evaluation were carried out on either an
Experiment Setup	Yes	Unless stated otherwise, we selected hyperparameters based on validation log-likelihood across tasks, using the following search spaces: learning rates from {5×10⁻⁵, 7×10⁻⁵, 9×10⁻⁵, 1×10⁻⁴, 3×10⁻⁴, 5×10⁻⁴}, weight decay values of {0, 1×10⁻⁵}, and batch sizes of {16, 32}. We optimized all models using the Adam optimizer [30], combined with a cosine annealing schedule for the learning rate. Unless otherwise specified, we fix the number of latent variable samples to 50 across both our method and all baselines to ensure fair comparison. Specifically, we compared performance across different numbers of SMC steps, rather than using the default setting of T = 10 from previous experiments.
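
As an illustration of two items quoted in the table, the sketch below shows one way to draw the per-episode context and target sizes from the Dataset Splits row, and a textbook multinomial resampling step of the kind named in the Pseudocode row. It is not taken from the authors' code. The function names, the use of inclusive integer bounds, and the reset to uniform weights after resampling are assumptions made for illustration.

```python
import random


def sample_task_sizes(rng):
    """Draw (num_context, num_target) for one image-completion episode.

    Follows |c| ~ Unif(5, 45) and |t| ~ Unif(5, 50 - |c|) as quoted above.
    Assumption: both bounds are inclusive integers.
    """
    num_context = rng.randint(5, 45)
    num_target = rng.randint(5, 50 - num_context)
    return num_context, num_target


def multinomial_resample(particles, weights, rng):
    """Standard multinomial resampling (hypothetical helper, not the paper's code).

    Draws len(particles) indices with replacement, with probability
    proportional to the weights, then returns the selected particles
    together with uniform weights.
    """
    n = len(particles)
    indices = rng.choices(range(n), weights=weights, k=n)
    resampled = [particles[i] for i in indices]
    uniform_weights = [1.0 / n] * n
    return resampled, uniform_weights


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_task_sizes(rng))
    print(multinomial_resample(["a", "b", "c"], [0.1, 0.8, 0.1], rng))
```

With these bounds, the context and target counts always sum to at most 50, matching the limit stated in the Dataset Splits row.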