Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

Reinforcement Learning with Segment Feedback

Authors: Yihan Du, Anna Winnicki, Gal Dalal, Shie Mannor, R. Srikant

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our theoretical and experimental results show that: under binary feedback, increasing the number of segments m decreases the regret at an exponential rate; in contrast, surprisingly, under sum feedback, increasing m does not reduce the regret significantly. ... We also present experiments to validate our theoretical results. ... 5. Experiments: Below we present experiments for RL with segment feedback to validate our theoretical results.
Researcher Affiliation	Collaboration	1 University of Illinois at Urbana-Champaign; 2 Stanford University; 3 NVIDIA Research; 4 Technion. Correspondence to: Yihan Du <EMAIL>, R. Srikant <EMAIL>.
Pseudocode	Yes	Algorithm 1 SegBiTS ... Algorithm 2 E-LinUCB ... Algorithm 3 SegBiTS-Tran ... Algorithm 4 LinUCB-Tran
Open Source Code	No	The paper does not contain any explicit statement about releasing source code, nor does it provide any links to a code repository in the main text, acknowledgements, or supplementary materials.
Open Datasets	No	The paper describes the construction of custom MDP instances for its experiments, rather than using or providing access to pre-existing public datasets. For example: "For the binary segment feedback setting, we consider an MDP as in Figure 2(a): There are 9 states and 5 actions. For any a ∈ A, we have r(s0, a) = 0, r(si, a) = rmax for any i ∈ {1, 3, 5, 7} (called good states), and r(si, a) = −rmax for any i ∈ {2, 4, 6, 8} (called bad states)."
Dataset Splits	No	The paper describes experiments conducted on custom-designed Markov Decision Processes (MDPs). These environments are defined by states, actions, rewards, and transitions, and the experiments involve simulating agent interactions within them. The concept of splitting a fixed dataset into training, validation, and test sets is not applicable here as data is generated through interaction.
Hardware Specification	No	The paper does not provide specific details about the hardware used for running the experiments, such as CPU or GPU models, or cloud computing specifications.
Software Dependencies	No	The paper does not mention any specific software dependencies, libraries, or their version numbers used for implementing the algorithms or running the experiments.
Experiment Setup	Yes	In both settings, we set rmax = 0.5, δ = 0.005, H = 100 and m ∈ {1, 2, 4, 5, 10, 20, 25, 50, 100}. For each algorithm, we perform 20 independent runs, and plot the average cumulative regret up to episode K across runs with a 95% confidence interval.
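
The reporting protocol in the Experiment Setup row (20 independent runs, average cumulative regret, 95% confidence interval) could be reproduced along the following lines. This is only a minimal sketch, not the authors' code, which the paper does not release. The helper names `segment_bounds` and `mean_and_ci95` are hypothetical. The equal-length split of the H = 100 steps into m segments is an assumption, as is the normal-approximation interval (mean ± 1.96 standard errors); the table does not say how the interval was computed. The regret curves in the demo are random placeholders, not results.

```python
import math
import random
import statistics

H = 100                                          # episode length from the setup row
SEGMENT_COUNTS = [1, 2, 4, 5, 10, 20, 25, 50, 100]  # values of m from the setup row
NUM_RUNS = 20                                    # independent runs per algorithm


def segment_bounds(horizon, m):
    """Split steps 0..horizon-1 into m equal-length segments.

    Assumption: segments have equal length, which requires m to divide horizon.
    Returns a list of (start, end) pairs with end exclusive.
    """
    if m <= 0 or horizon % m != 0:
        raise ValueError("m must be a positive divisor of the horizon")
    length = horizon // m
    return [(i * length, (i + 1) * length) for i in range(m)]


def mean_and_ci95(regret_runs):
    """Average per-episode cumulative regret across runs.

    regret_runs: list of equal-length curves, one per independent run.
    Returns (means, half_widths), where each half-width is 1.96 standard
    errors (normal approximation, an assumption of this sketch).
    """
    n = len(regret_runs)
    if n < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    means, half_widths = [], []
    for values in zip(*regret_runs):             # iterate over episodes
        means.append(statistics.fmean(values))
        sem = statistics.stdev(values) / math.sqrt(n)
        half_widths.append(1.96 * sem)
    return means, half_widths


if __name__ == "__main__":
    rng = random.Random(0)
    # Placeholder curves only: each run accumulates random nonnegative regret.
    runs = []
    for _ in range(NUM_RUNS):
        total, curve = 0.0, []
        for _ in range(50):
            total += rng.random()
            curve.append(total)
        runs.append(curve)
    means, half_widths = mean_and_ci95(runs)
    print(f"final mean regret: {means[-1]:.2f} +/- {half_widths[-1]:.2f}")
    print("segments for m=4:", segment_bounds(H, 4))
```

With only 20 runs, a Student-t critical value (about 2.09 for 19 degrees of freedom) would give a slightly wider interval than the 1.96 used here; either choice should be stated when reporting results.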