A General Framework for Sequential Decision-Making under Adaptivity Constraints

Authors: Nuoya Xiong, Zhaoran Wang, Zhuoran Yang

ICML 2024

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | We experimented in the linear mixture MDP with the same setting as Appendix H in (Chen et al., 2022). We compare our ℓ2-EC-RS algorithm with OPERA (Chen et al., 2022), the optimal policy, and the random policy. The cumulative reward curves show that our algorithm converges to the optimal value slightly more slowly than OPERA. However, the average number of strategy transitions and calls to the optimization tool decreases from 2000 to 92.8 over 10 simulations, reducing the average execution time from 321.6 seconds to 15.9 seconds. We execute two algorithms on three different tasks, Hopper-v3, HalfCheetah-v2, and Walker2d-v3, for 100,000 episodes, with the same setting as Section 7 of (Liu et al., 2024). The comparisons of the rewards and the number of policy switches are shown in Figure 2 and Table 2.
Researcher Affiliation | Academia | ¹IIIS, Tsinghua University, China; ²Department of Industrial Engineering and Management Sciences, Northwestern University, USA; ³Department of Statistics and Data Science, Yale University, USA.
Pseudocode | Yes | Algorithm 1 ℓ2-EC-RS; Algorithm 2 ℓ2-EC-Batch; Algorithm 3 Modified ℓ1 ABC-Rare switch; Algorithm 4 ℓ2-EC-Adaptive Batch
Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the proposed methods is open-source or publicly available.
Open Datasets | Yes | We experimented in the linear mixture MDP with the same setting as Appendix H in (Chen et al., 2022). We execute two algorithms on three different tasks, Hopper-v3, HalfCheetah-v2, and Walker2d-v3, for 100,000 episodes, with the same setting as Section 7 of (Liu et al., 2024).
Dataset Splits | No | The paper mentions total episodes (K or T) and some experimental parameters but does not provide specific percentages or counts for training, validation, or test dataset splits.
Hardware Specification | No | The paper does not explicitly describe any specific hardware components (e.g., GPU models, CPU types, memory specifications) used for running its experiments.
Software Dependencies | No | The paper does not list any specific software dependencies with version numbers.
Experiment Setup | Yes | We choose T = 2000 and β = 0.3 log T in the experiment. We execute two algorithms on three different tasks, Hopper-v3, HalfCheetah-v2, and Walker2d-v3, for 100,000 episodes, with the same setting as Section 7 of (Liu et al., 2024).
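The reported linear mixture MDP setup fixes the horizon and the exploration parameter as T = 2000 and β = 0.3 log T. A minimal sketch of computing this parameter, assuming the natural logarithm (the base is not stated in the quoted setup):

```python
import math

# Reported setting: T = 2000 episodes, beta = 0.3 * log T.
# Assumption: "log" is the natural logarithm; the quoted setup
# does not specify the base.
T = 2000
beta = 0.3 * math.log(T)

print(f"T = {T}, beta = {beta:.4f}")
```

With base-10 logarithm instead, β would be about 0.99 rather than about 2.28, so the choice of base materially changes the amount of exploration.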