Cascading Reinforcement Learning

Authors: Yihan Du, R. Srikant, Wei Chen

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Furthermore, we present experiments to show the improved computational and sample efficiencies of our algorithms compared to straightforward adaptations of existing RL algorithms in practice."
Researcher Affiliation | Collaboration | Yihan Du, Electrical and Computer Engineering, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA (yihandu@illinois.edu); R. Srikant, Electrical and Computer Engineering, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA (rsrikant@illinois.edu); Wei Chen, Microsoft Research, Beijing 100080, China (weic@microsoft.com)
Pseudocode | Yes | Algorithm 1: BestPerm, which computes argmax_{A ∈ A} f(A, u, w) and max_{A ∈ A} f(A, u, w) ... Algorithm 2: CascadingVI ... Algorithm 3: CascadingBPI (a hedged sketch of the list-value objective f(A, u, w) follows the table)
Open Source Code | No | No explicit statement or link to open-source code for the described methodology was found.
Open Datasets | Yes | "In this section, we present experimental results on a real-world dataset MovieLens (Harper & Konstan, 2015), which contains millions of ratings for movies by users."
Dataset Splits | No | The paper does not explicitly provide details on training/test/validation dataset splits (e.g., percentages, counts, or specific predefined splits with citations).
Hardware Specification | No | The paper does not provide specific details about the hardware used for the experiments (e.g., GPU/CPU models, memory specifications).
Software Dependencies | No | The paper does not specify any software dependencies with version numbers.
Experiment Setup | Yes | "We set δ = 0.005, K = 100000, H = 3, m = 3, S = 20, N ∈ {10, 15, 20, 25} and |A| ∈ {820, 2955, 7240, 14425}. We defer the detailed setup and more results to Appendix A." ... "We set δ = 0.005, H = 5, S = 9 and m = 3. Each algorithm is performed for 20 independent runs. In the regret minimization setting, we let N ∈ {4, 8} and K = 10000, and show the average cumulative regrets and average running times (in the legend) across runs. In the best policy identification setting, we set ϵ = 0.5 and N ∈ {4, 5, 6, 7, 8}, and plot the average sample complexities and average running times across runs with 95% confidence intervals." (a hedged configuration sketch follows the table)
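
For the Pseudocode row above: the paper's BestPerm oracle maximizes a list-value function f(A, u, w) over ordered item lists A. Since the authors' code is not released, the following is only a minimal sketch of a standard cascading-model objective together with a brute-force maximizer used as a reference point. The attraction probabilities u, per-item weights w, and the exhaustive enumeration are our assumptions for illustration; this is not the paper's efficient BestPerm procedure, which avoids enumerating all lists.

```python
from itertools import permutations

def cascade_value(A, u, w):
    """Expected value of an ordered item list A under a simple cascading
    click model: the user scans A in order, clicks item a with probability
    u[a], and a click on a yields value w[a]. This is an assumed stand-in
    for the paper's f(A, u, w), which is defined over states and also
    accounts for the no-click outcome."""
    value, p_reach = 0.0, 1.0  # p_reach: probability the user examines the next item
    for a in A:
        value += p_reach * u[a] * w[a]
        p_reach *= 1.0 - u[a]
    return value

def brute_force_best_list(items, u, w, m):
    """Reference maximizer: enumerate every ordered list of length <= m and
    return the best one. This is NOT the paper's BestPerm algorithm, which
    exploits the structure of f to avoid this exponential enumeration."""
    best_A, best_val = None, float("-inf")
    for k in range(1, m + 1):
        for A in permutations(items, k):
            val = cascade_value(A, u, w)
            if val > best_val:
                best_A, best_val = A, val
    return best_A, best_val

# Example: 4 items with attraction probabilities u and values w, lists of length <= 3.
u = {0: 0.9, 1: 0.5, 2: 0.3, 3: 0.1}
w = {0: 0.2, 1: 1.0, 2: 0.7, 3: 0.4}
print(brute_force_best_list(list(u), u, w, m=3))
```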
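
For the Experiment Setup row: as a compact, hedged restatement of the reported hyperparameters, the dictionaries below collect the two parameter sets quoted above. The dictionary and key names are ours, not the authors'; the symbol meanings (δ as the confidence parameter, K as the number of episodes, H the horizon, S the number of states, N the number of items, m the list length) follow the paper's notation.

```python
# Hedged restatement of the reported experiment parameters; names are ours.
SETUP_1 = {
    "delta": 0.005,                        # confidence parameter δ
    "K": 100_000,                          # number of episodes
    "H": 3,                                # horizon
    "m": 3,                                # maximum recommendation-list length
    "S": 20,                               # number of states
    "N_values": [10, 15, 20, 25],          # numbers of items
    "A_sizes": [820, 2955, 7240, 14425],   # corresponding numbers of ordered lists |A|
}

SETUP_2 = {
    "delta": 0.005,
    "H": 5,
    "S": 9,
    "m": 3,
    "independent_runs": 20,
    "regret_minimization": {"N_values": [4, 8], "K": 10_000},
    "best_policy_identification": {
        "epsilon": 0.5,                    # target accuracy ϵ
        "N_values": [4, 5, 6, 7, 8],
        "confidence_level": 0.95,          # confidence intervals on plots
    },
}
```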