Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

A Principled Path to Fitted Distributional Evaluation

Authors: Sungee Hong, Jiayi Wang, Zhengling Qi, Raymond K. W. Wong

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments, including simulations on linear quadratic regulators and Atari games, demonstrate the superior performance of the FDE methods.
Researcher Affiliation	Academia	Sungee Hong Texas A&M University Jiayi Wang University of Texas at Dallas Zhengling Qi George Washington University Raymond K. W. Wong Texas A&M University
Pseudocode	Yes	Algorithm 1: Fitted distributional evaluation. Input: functional Bregman divergence d (Tables 1 and 2), model M, initial Υ_0 ∈ M, data D. Output: Υ_T. For t = 1 to T: perform the minimization (5) to obtain Υ_t.
Open Source Code	Yes	Codes are available at https://github.com/hse1223/Fitted-Distributional-Evaluation.git
Open Datasets	Yes	Atari games are common testbeds for distributional RL methods [e.g., 5, 15, 40, 52, 65]. ... The data that we used are either simulated on our own, or already made public by Open AI.
Dataset Splits	Yes	We applied the same splitting rule of data D = ∪_{t=1}^{T} D_t by selecting the following T. ... 3. Allocate the same number of samples into each of the T subdatasets, and then put all the remaining samples into the last one, D_T.
Hardware Specification	No	Our simulations are extensive. We used a high-performance computing system. For the LQR simulation (Appendix E.1), we used a CPU with 2GB of memory. A single run of a single method under a fixed setting takes approximately 20 seconds. For the Atari games simulation (Appendix E.2), we used a GPU with 10GB of memory. A single run of a single method under a fixed setting takes approximately 700 seconds.
Software Dependencies	No	After acceptance of the paper, we disclosed the URL to the github repository that contains our Python codes (that we have created) which are used for simulations.
Experiment Setup	Yes	Our goal is to estimate Υ_π ∈ P^{S×A}. Here, the target policy is π = π^{ϵ_tar}, which is the epsilon-greedy variant (e.g., see Section 2.2 of [53]) of the DQN-trained policy [35] with ϵ_tar = 0.3. The behavior policy is π^{ϵ_beh} with ϵ_beh = 0.4, 0.8 (but ϵ_beh = 0.4, 0.5 for Enduro and Pong to prevent sparse observation of rewards). ... In all simulations, we applied FDE methods based on Algorithm 1 with T = 50. Minimization of (5) of Algorithm 1 is done stochastically by Adam. In each t-th iteration (t = 1, ..., T), we ran 1000 stochastic gradient updates, each based on a batch of 32 randomly selected samples.
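
The Dataset Splits entry quotes only step 3 of the paper's splitting rule: each of the T subdatasets receives the same number of samples, and any leftover samples go into the last one, D_T. A minimal Python sketch of that single step follows. The function name, argument names, and the choice to split in the original sample order are assumptions made for illustration; steps 1 and 2 of the rule are elided in the quoted text and are not reproduced here.

```python
from typing import List, Sequence, TypeVar

S = TypeVar("S")


def split_equal_with_remainder(samples: Sequence[S], num_splits: int) -> List[List[S]]:
    """Split `samples` into `num_splits` parts of equal size.

    Hypothetical helper: any samples left over after the equal allocation
    are appended to the last part, mirroring step 3 of the quoted rule.
    """
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    base = len(samples) // num_splits  # equal share per subdataset
    parts = [list(samples[i * base:(i + 1) * base]) for i in range(num_splits)]
    # Remaining samples (fewer than num_splits of them) go to the last part.
    parts[-1].extend(samples[num_splits * base:])
    return parts


# Usage: 10 samples split into T = 3 subdatasets.
parts = split_equal_with_remainder(list(range(10)), 3)
sizes = [len(p) for p in parts]
```

In this usage example the first two subdatasets hold three samples each and the last holds four, so every sample is assigned exactly once.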
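
The Experiment Setup entry describes the target and behavior policies as epsilon-greedy variants of a DQN-trained policy. As a rough illustration of what that standard construction means, the sketch below picks a uniformly random action with probability epsilon and the greedy action otherwise. It is not taken from the paper's repository: the function name, the use of a plain list of action values, and the tie-breaking behavior are all assumptions.

```python
import random
from typing import Optional, Sequence


def epsilon_greedy_action(
    action_values: Sequence[float],
    epsilon: float,
    rng: Optional[random.Random] = None,
) -> int:
    """Return an action index under a standard epsilon-greedy rule.

    Hypothetical helper: `action_values` stands in for the outputs of a
    trained value network (e.g. DQN) at one state.
    """
    if rng is None:
        rng = random.Random()
    if rng.random() < epsilon:
        # Explore: uniform over all actions, including the greedy one.
        return rng.randrange(len(action_values))
    # Exploit: highest-valued action (first index wins ties).
    return max(range(len(action_values)), key=lambda a: action_values[a])


# Usage: epsilon = 0.3 matches the target-policy setting quoted above.
rng = random.Random(0)
values = [0.1, 0.9, 0.3]
draws = [epsilon_greedy_action(values, 0.3, rng) for _ in range(10000)]
greedy_fraction = draws.count(1) / len(draws)
```

With three actions and epsilon = 0.3, the greedy action is expected to be chosen with probability 0.7 + 0.3/3 = 0.8, so `greedy_fraction` should land near that value.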