Cascading Reinforcement Learning

Authors: Yihan Du, R. Srikant, Wei Chen

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Furthermore, we present experiments to show the improved computational and sample efficiencies of our algorithms compared to straightforward adaptations of existing RL algorithms in practice."
Researcher Affiliation | Collaboration | Yihan Du, Electrical and Computer Engineering, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA (yihandu@illinois.edu); R. Srikant, Electrical and Computer Engineering, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA (rsrikant@illinois.edu); Wei Chen, Microsoft Research, Beijing 100080, China (weic@microsoft.com)
Pseudocode | Yes | Algorithm 1: BestPerm, which computes argmax_{A ∈ A} f(A, u, w) and max_{A ∈ A} f(A, u, w) ... Algorithm 2: CascadingVI ... Algorithm 3: CascadingBPI (a hedged sketch of the list-value objective f(A, u, w) follows the table)
Open Source Code | No | No explicit statement or link to open-source code for the described methodology was found.
Open Datasets | Yes | "In this section, we present experimental results on a real-world dataset MovieLens (Harper & Konstan, 2015), which contains millions of ratings for movies by users."
Dataset Splits | No | The paper does not explicitly provide details on training/test/validation dataset splits (e.g., percentages, counts, or specific predefined splits with citations).
Hardware Specification | No | The paper does not provide specific details about the hardware used for the experiments (e.g., GPU/CPU models, memory specifications).
Software Dependencies | No | The paper does not specify any software dependencies with version numbers.
Experiment Setup | Yes | "We set δ = 0.005, K = 100000, H = 3, m = 3, S = 20, N ∈ {10, 15, 20, 25} and |A| ∈ {820, 2955, 7240, 14425}. We defer the detailed setup and more results to Appendix A." ... "We set δ = 0.005, H = 5, S = 9 and m = 3. Each algorithm is performed for 20 independent runs. In the regret minimization setting, we let N ∈ {4, 8} and K = 10000, and show the average cumulative regrets and average running times (in the legend) across runs. In the best policy identification setting, we set ϵ = 0.5 and N ∈ {4, 5, 6, 7, 8}, and plot the average sample complexities and average running times across runs with 95% confidence intervals." (a hedged configuration sketch follows the table)
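
For the Pseudocode row above: the paper's BestPerm oracle maximizes a list-value function f(A, u, w) over ordered item lists A. Since the authors' code is not released, the following is only a minimal sketch of a standard cascading-model objective together with a brute-force maximizer used as a reference point. The attraction probabilities u, per-item weights w, and the exhaustive enumeration are our assumptions for illustration; this is not the paper's efficient BestPerm procedure, which avoids enumerating all lists.

```python
from itertools import permutations

def cascade_value(A, u, w):
    """Expected value of an ordered item list A under a simple cascading
    click model: the user scans A in order, clicks item a with probability
    u[a], and a click on a yields value w[a]. This is an assumed stand-in
    for the paper's f(A, u, w), which is defined over states and also
    accounts for the no-click outcome."""
    value, p_reach = 0.0, 1.0  # p_reach: probability the user examines the next item
    for a in A:
        value += p_reach * u[a] * w[a]
        p_reach *= 1.0 - u[a]
    return value

def brute_force_best_list(items, u, w, m):
    """Reference maximizer: enumerate every ordered list of length <= m and
    return the best one. This is NOT the paper's BestPerm algorithm, which
    exploits the structure of f to avoid this exponential enumeration."""
    best_A, best_val = None, float("-inf")
    for k in range(1, m + 1):
        for A in permutations(items, k):
            val = cascade_value(A, u, w)
            if val > best_val:
                best_A, best_val = A, val
    return best_A, best_val

# Example: 4 items with attraction probabilities u and values w, lists of length <= 3.
u = {0: 0.9, 1: 0.5, 2: 0.3, 3: 0.1}
w = {0: 0.2, 1: 1.0, 2: 0.7, 3: 0.4}
print(brute_force_best_list(list(u), u, w, m=3))
```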
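
For the Experiment Setup row: as a compact, hedged restatement of the reported hyperparameters, the dictionaries below collect the two parameter sets quoted above. The dictionary and key names are ours, not the authors'; the symbol meanings (δ as the confidence parameter, K as the number of episodes, H the horizon, S the number of states, N the number of items, m the list length) follow the paper's notation.

```python
# Hedged restatement of the reported experiment parameters; names are ours.
SETUP_1 = {
    "delta": 0.005,                        # confidence parameter δ
    "K": 100_000,                          # number of episodes
    "H": 3,                                # horizon
    "m": 3,                                # maximum recommendation-list length
    "S": 20,                               # number of states
    "N_values": [10, 15, 20, 25],          # numbers of items
    "A_sizes": [820, 2955, 7240, 14425],   # corresponding numbers of ordered lists |A|
}

SETUP_2 = {
    "delta": 0.005,
    "H": 5,
    "S": 9,
    "m": 3,
    "independent_runs": 20,
    "regret_minimization": {"N_values": [4, 8], "K": 10_000},
    "best_policy_identification": {
        "epsilon": 0.5,                    # target accuracy ϵ
        "N_values": [4, 5, 6, 7, 8],
        "confidence_level": 0.95,          # confidence intervals on plots
    },
}
```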