Pausing Policy Learning in Non-stationary Reinforcement Learning

Authors: Hyunin Lee, Ming Jin, Javad Lavaei, Somayeh Sojoudi

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Our experimental evaluations on three different environments also reveal that a nonzero policy hold duration yields higher rewards compared to continuous decision updates. |
| Researcher Affiliation | Academia | University of California, Berkeley; Virginia Tech. |
| Pseudocode | Yes | Algorithm 1: Forecasting Online Reinforcement Learning. |
| Open Source Code | No | The paper lists existing official codebases used for comparison (e.g., 'Official codes distributed from https://github.com/pranz24/pytorch-soft-actor-critic') but does not state that the authors release open-source code for their own proposed method (FSAC). |
| Open Datasets | No | The paper describes custom environments such as the switching-goal cliffworld and non-stationary modifications of MuJoCo environments. While MuJoCo is a well-known simulator, the specific non-stationary modifications (e.g., a time-varying target velocity vd(t) = a sin(wt)) are not publicly linked or formally cited, so the exact environment setup cannot be accessed without a custom implementation (a minimal wrapper sketch is given after the table). |
| Dataset Splits | No | The paper does not provide dataset split information (exact percentages, sample counts, citations to predefined splits, or a detailed splitting methodology) needed to reproduce the partitioning into train, validation, and test sets. |
| Hardware Specification | Yes | All experiments are conducted on 12 Intel Xeon CPU E5-2690 v4 and 2 Tesla V100 GPUs. |
| Software Dependencies | No | The paper lists the software libraries used (PyTorch, OpenAI Gym, NumPy) but does not provide version numbers for these dependencies, which are needed for reproducibility. |
| Experiment Setup | Yes | For our experiments, we varied hyperparameters such as learning rates λπ ∈ {0.0001, 0.0003, 0.0005, 0.0007}, soft update parameters τs ∈ {0.001, 0.005, 0.003}, entropy regularization parameters ∈ {0.01, 0.03, 0.1}, and also experimented with different prediction lengths lf ∈ {5, 15, 20} (a grid-enumeration sketch is given after the table). |
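
The non-stationary MuJoCo setup noted in the Open Datasets row (a target velocity drifting as vd(t) = a sin(wt)) is not released, but it can be approximated with a thin environment wrapper. The sketch below is a minimal illustration, not the authors' code: the wrapper class name, the tracking-error reward, the use of the `x_velocity` info key, and the default values of `a` and `omega` are all assumptions.

```python
import numpy as np
import gymnasium as gym


class SinusoidalTargetVelocityWrapper(gym.Wrapper):
    """Reward-level non-stationarity: the desired forward velocity drifts as
    v_d(t) = a * sin(omega * t), where t counts environment steps.

    Illustrative sketch only; the reward shaping below (negative absolute
    tracking error plus the control cost) is an assumption, not the paper's
    released implementation.
    """

    def __init__(self, env, a=1.0, omega=0.001):
        super().__init__(env)
        self.a = a          # amplitude of the target-velocity oscillation
        self.omega = omega  # angular frequency (per environment step)
        self.t = 0          # global step counter driving the non-stationarity

    def reset(self, **kwargs):
        # `t` is deliberately NOT reset, so the drift continues across episodes.
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self.t += 1
        v_desired = self.a * np.sin(self.omega * self.t)
        # MuJoCo locomotion tasks (e.g., HalfCheetah-v4) expose the forward
        # velocity and control cost through `info`.
        v_actual = info.get("x_velocity", 0.0)
        reward = -abs(v_actual - v_desired) + info.get("reward_ctrl", 0.0)
        info["target_velocity"] = v_desired
        return obs, reward, terminated, truncated, info


# Example usage (hypothetical parameter values):
# env = SinusoidalTargetVelocityWrapper(gym.make("HalfCheetah-v4"),
#                                       a=1.5, omega=2 * np.pi / 10000)
```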
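
The hyperparameter grids quoted in the Experiment Setup row can be enumerated directly. The snippet below is a sketch under the assumption of a full Cartesian sweep; the configuration field names (`policy_lr`, `soft_update_tau`, `entropy_coef`, `prediction_length`) are hypothetical and not taken from the paper.

```python
from itertools import product

# Grids quoted in the Experiment Setup row; the full Cartesian product is an
# assumption about how the authors combined them.
GRID = {
    "policy_lr": [0.0001, 0.0003, 0.0005, 0.0007],   # learning rates λπ
    "soft_update_tau": [0.001, 0.005, 0.003],        # soft update parameters τs
    "entropy_coef": [0.01, 0.03, 0.1],               # entropy regularization
    "prediction_length": [5, 15, 20],                # forecast lengths lf
}


def iter_configs(grid):
    """Yield one dict per hyperparameter combination (4 * 3 * 3 * 3 = 108 runs)."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


if __name__ == "__main__":
    for i, cfg in enumerate(iter_configs(GRID)):
        print(i, cfg)
```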