Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
A Principled Path to Fitted Distributional Evaluation
Authors: Sungee Hong, Jiayi Wang, Zhengling Qi, Raymond K. W. Wong
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments, including simulations on linear quadratic regulators and Atari games, demonstrate the superior performance of the FDE methods. |
| Researcher Affiliation | Academia | Sungee Hong Texas A&M University Jiayi Wang University of Texas at Dallas Zhengling Qi George Washington University Raymond K. W. Wong Texas A&M University |
| Pseudocode | Yes | Algorithm 1 Fitted distributional evaluation. Input: Functional Bregman divergence d (Tables 1 and 2), Model M, Initial $\Upsilon_0 \in M$, Data D. Output: $\Upsilon_T$. for t = 1 to T do: Perform the minimization (5) to obtain $\Upsilon_t$. end for |
| Open Source Code | Yes | Codes are available at https://github.com/hse1223/Fitted-Distributional-Evaluation.git |
| Open Datasets | Yes | Atari games are common testbeds for distributional RL methods [e.g., 5, 15, 40, 52, 65]. ... The data that we used are either simulated on our own, or already made public by Open AI. |
| Dataset Splits | Yes | We applied the same splitting rule of data $D = \bigcup_{t=1}^{T} D_t$ by selecting the following T. ... 3. Allocate the same number of samples into each of the T subdatasets, and then put all the remaining samples to the last one $D_T$. |
| Hardware Specification | No | Our simulations are extensive. We used high performance computing system. For LQR simulation (Appendix E.1), we have used CPU, 2GB memory. A single run of a single method under a fixed setting takes approximately 20 seconds. For Atari games simulation (Appendix E.2), we have used GPU, 10GB memory. A single run of a single method under a fixed setting takes approximately 700 seconds. |
| Software Dependencies | No | After acceptance of the paper, we disclosed the URL to the github repository that contains our Python codes (that we have created) which are used for simulations. |
| Experiment Setup | Yes | Our goal is to estimate $\Upsilon_\pi \in \mathcal{P}^{\mathcal{S} \times \mathcal{A}}$. Here, the target policy is $\pi = \pi^{\epsilon_{\mathrm{tar}}}$, which is the epsilon-greedy variant (e.g., see Section 2.2 of [53]) of DQN-trained policy π [35] with $\epsilon_{\mathrm{tar}} = 0.3$. The behavior policy is $\epsilon_{\mathrm{beh}}$ with $\epsilon_{\mathrm{beh}} = 0.4, 0.8$ (but $\epsilon_{\mathrm{beh}} = 0.4, 0.5$ for Enduro and Pong to prevent sparse observation of rewards). ... In all simulations, we applied FDE methods based on Algorithm 1 with T = 50. Minimization of (5) of Algorithm 1 is done stochastically by Adam. In each t-th iteration (t = 1, ..., T), we ran 1000 stochastic gradient updates, each based on a batch of 32 randomly selected samples. |
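
The splitting rule quoted in the Dataset Splits row (equal-sized subdatasets, with any remainder placed in the last one) can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `split_into_subdatasets` is hypothetical, and the sketch assumes samples are taken in their given order, since the earlier steps of the rule are elided in the quoted excerpt.

```python
from typing import List, Sequence, TypeVar

Sample = TypeVar("Sample")


def split_into_subdatasets(samples: Sequence[Sample], T: int) -> List[List[Sample]]:
    """Split `samples` into T subdatasets of equal size.

    Any samples left over after the equal allocation are appended to the
    last subdataset. Assumption (not from the source): samples are
    allocated in their given order, without shuffling.
    """
    if T < 1:
        raise ValueError("T must be a positive integer")
    if len(samples) < T:
        raise ValueError("need at least one sample per subdataset")

    base = len(samples) // T  # common size of every subdataset
    parts = [list(samples[t * base:(t + 1) * base]) for t in range(T)]
    parts[-1].extend(samples[T * base:])  # remainder goes to the last one
    return parts


if __name__ == "__main__":
    # 10 samples into T = 3 subdatasets: sizes 3, 3, and 3 + 1 remainder.
    for part in split_into_subdatasets(list(range(10)), T=3):
        print(part)
```

With T = 50, as in the quoted experiment setup, a dataset of 103 samples would yield 49 subdatasets of 2 samples each and a last subdataset of 5.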