Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Reinforcement Learning with Segment Feedback
Authors: Yihan Du, Anna Winnicki, Gal Dalal, Shie Mannor, R. Srikant
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our theoretical and experimental results show that: under binary feedback, increasing the number of segments m decreases the regret at an exponential rate; in contrast, surprisingly, under sum feedback, increasing m does not reduce the regret significantly. ... We also present experiments to validate our theoretical results. ... 5. Experiments: Below we present experiments for RL with segment feedback to validate our theoretical results. |
| Researcher Affiliation | Collaboration | (1) University of Illinois at Urbana-Champaign, (2) Stanford University, (3) NVIDIA Research, (4) Technion. Correspondence to: Yihan Du <EMAIL>, R. Srikant <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 SegBiTS ... Algorithm 2 E-LinUCB ... Algorithm 3 SegBiTS-Tran ... Algorithm 4 LinUCB-Tran |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code, nor does it provide any links to a code repository in the main text, acknowledgements, or supplementary materials. |
| Open Datasets | No | The paper describes the construction of custom MDP instances for its experiments, rather than using or providing access to pre-existing public datasets. For example: "For the binary segment feedback setting, we consider an MDP as in Figure 2(a): There are 9 states and 5 actions. For any a ∈ A, we have r(s0, a) = 0, r(si, a) = rmax for any i ∈ {1, 3, 5, 7} (called good states), and r(si, a) = −rmax for any i ∈ {2, 4, 6, 8} (called bad states)." |
| Dataset Splits | No | The paper describes experiments conducted on custom-designed Markov Decision Processes (MDPs). These environments are defined by states, actions, rewards, and transitions, and the experiments involve simulating agent interactions within them. The concept of splitting a fixed dataset into training, validation, and test sets is not applicable here as data is generated through interaction. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as CPU or GPU models, or cloud computing specifications. |
| Software Dependencies | No | The paper does not mention any specific software dependencies, libraries, or their version numbers used for implementing the algorithms or running the experiments. |
| Experiment Setup | Yes | In both settings, we set rmax = 0.5, δ = 0.005, H = 100 and m ∈ {1, 2, 4, 5, 10, 20, 25, 50, 100}. For each algorithm, we perform 20 independent runs, and plot the average cumulative regret up to episode K across runs with a 95% confidence interval. |
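
The reporting protocol in the Experiment Setup row (20 independent runs, average cumulative regret with a 95% confidence interval) could be reproduced along the following lines. This is a minimal sketch, not the authors' code: the function name `mean_and_ci95` is hypothetical, and the normal-approximation interval (1.96 standard errors) is an assumption, since the paper excerpt does not state how the interval is computed. The sketch also assumes that the horizon H is split into m equal-length segments, which the listed values of m permit because each of them divides H = 100.

```python
import numpy as np

# Values quoted in the Experiment Setup row.
H = 100
M_VALUES = [1, 2, 4, 5, 10, 20, 25, 50, 100]

# Assumption: segments have equal length H / m, so m must divide H.
SEGMENT_LENGTHS = {m: H // m for m in M_VALUES if H % m == 0}


def mean_and_ci95(regret_runs):
    """Average cumulative regret across runs with a 95% confidence band.

    regret_runs: array-like of shape (n_runs, K), where entry [i, k] is the
    cumulative regret of run i after episode k + 1.
    Returns (mean, lower, upper), each of shape (K,).
    """
    runs = np.asarray(regret_runs, dtype=float)
    n_runs = runs.shape[0]
    mean = runs.mean(axis=0)
    # Standard error of the mean, using the sample standard deviation.
    sem = runs.std(axis=0, ddof=1) / np.sqrt(n_runs)
    # Assumption: normal approximation; a Student-t quantile is an alternative.
    half_width = 1.96 * sem
    return mean, mean - half_width, mean + half_width
```

With 20 runs, a Student-t quantile (about 2.09 for 19 degrees of freedom) would give a slightly wider band than the 1.96 used here; which of the two the paper uses is not stated in the quoted text.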