Cascading Reinforcement Learning
Authors: Yihan Du, R. Srikant, Wei Chen
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Furthermore, we present experiments to show the improved computational and sample efficiencies of our algorithms compared to straightforward adaptations of existing RL algorithms in practice. |
| Researcher Affiliation | Collaboration | Yihan Du, Electrical and Computer Engineering, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA (yihandu@illinois.edu); R. Srikant, Electrical and Computer Engineering, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA (rsrikant@illinois.edu); Wei Chen, Microsoft Research, Beijing 100080, China (weic@microsoft.com) |
| Pseudocode | Yes | Algorithm 1: BestPerm: find argmax_{A ∈ A} f(A, u, w) and max_{A ∈ A} f(A, u, w) ... Algorithm 2: CascadingVI ... Algorithm 3: CascadingBPI (a brute-force sketch of the BestPerm objective follows this table) |
| Open Source Code | No | No explicit statement or link to open-source code for the described methodology was found. |
| Open Datasets | Yes | In this section, we present experimental results on a real-world dataset MovieLens (Harper & Konstan, 2015), which contains millions of ratings for movies by users. |
| Dataset Splits | No | The paper does not explicitly provide details on training/test/validation dataset splits (e.g., percentages, counts, or specific predefined splits with citations). |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for the experiments (e.g., GPU/CPU models, memory specifications). |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers. |
| Experiment Setup | Yes | We set δ = 0.005, K = 100000, H = 3, m = 3, S = 20, N ∈ {10, 15, 20, 25} and |A| ∈ {820, 2955, 7240, 14425}. We defer the detailed setup and more results to Appendix A. ... We set δ = 0.005, H = 5, S = 9 and m = 3. Each algorithm is performed for 20 independent runs. In the regret minimization setting, we let N ∈ {4, 8} and K = 10000, and show the average cumulative regrets and average running times (in the legend) across runs. In the best policy identification setting, we set ϵ = 0.5 and N ∈ {4, 5, 6, 7, 8}, and plot the average sample complexities and average running times across runs with 95% confidence intervals. |
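For quick reference, the two settings quoted in the Experiment Setup row can be collected into configuration dictionaries. This is only a restatement of the numbers above; the dictionary and field names (`setting_1`, `N_values`, `num_actions`, and so on) are illustrative and do not come from the paper's code.

```python
# Hypothetical Python dictionaries restating the quoted experiment settings;
# the variable and field names are ours, not the authors'.

# First quoted setting.
setting_1 = dict(
    delta=0.005, K=100_000, H=3, m=3, S=20,
    N_values=[10, 15, 20, 25],
    num_actions=[820, 2955, 7240, 14425],  # |A| for each N
)

# Second quoted setting (shared parameters), 20 independent runs per algorithm.
setting_2_common = dict(delta=0.005, H=5, S=9, m=3, num_runs=20)

# Regret minimization: average cumulative regret and running time across runs.
setting_2_regret = dict(**setting_2_common, N_values=[4, 8], K=10_000)

# Best policy identification: average sample complexity and running time,
# reported with 95% confidence intervals.
setting_2_bpi = dict(**setting_2_common, epsilon=0.5, N_values=[4, 5, 6, 7, 8])
```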
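The BestPerm objective in the Pseudocode row maximizes f(A, u, w) over ordered item lists. Below is a minimal sketch, assuming f takes the standard cascading form Σ_i Π_{j<i} (1 − u(a_j)) · u(a_i) · w(a_i) (the paper's definition may include an additional no-click term), and using brute-force enumeration rather than the paper's efficient BestPerm oracle; all names here are illustrative.

```python
import itertools

def f(A, u, w):
    """Expected weighted return of an ordered item list A under the standard
    cascading model: the user scans A in order and clicks item a with
    probability u[a]; w[a] is the weight collected on a click.
    (Assumed form; the paper's f(A, u, w) may also include a no-click term.)"""
    value, p_reach = 0.0, 1.0  # p_reach = prob. the user examines the current slot
    for a in A:
        value += p_reach * u[a] * w[a]
        p_reach *= 1.0 - u[a]
    return value

def best_list_bruteforce(items, m, u, w):
    """Naive stand-in for the paper's BestPerm oracle: enumerate every ordered
    list of at most m items and return the maximizer of f(A, u, w).
    Cost grows combinatorially with m and the number of items; the paper's
    BestPerm is designed to avoid exactly this enumeration."""
    best_A, best_val = (), float("-inf")
    for k in range(1, m + 1):
        for A in itertools.permutations(items, k):
            val = f(A, u, w)
            if val > best_val:
                best_A, best_val = A, val
    return best_A, best_val

# Tiny usage example with made-up attraction probabilities and weights.
u = {0: 0.9, 1: 0.5, 2: 0.2}
w = {0: 0.1, 1: 1.0, 2: 0.7}
print(best_list_bruteforce([0, 1, 2], m=2, u=u, w=w))
```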