Tempo Adaptation in Non-stationary Reinforcement Learning
Authors: Hyunin Lee, Yuhao Ding, Jongmin Lee, Ming Jin, Javad Lavaei, Somayeh Sojoudi
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental evaluation on various high-dimensional non-stationary environments shows that the ProST framework achieves a higher online return at suboptimal {t}_{1:K} than the existing methods. We evaluate ProST-G with four baselines in three MuJoCo environments, each with five different non-stationary speeds and two non-stationary datasets. |
| Researcher Affiliation | Academia | UC Berkeley, Berkeley, CA 94709; Virginia Tech, Blacksburg, VA 24061 |
| Pseudocode | Yes | We elaborate on the above procedure in Algorithm 1 given in Appendix F.1. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository for the methodology described. |
| Open Datasets | No | The paper mentions using “three MuJoCo environments” and “real data A and B” but does not provide specific links, DOIs, repository names, or formal citations with authors/years for public access to these environments or datasets. |
| Dataset Splits | No | The paper does not explicitly provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning into training, validation, and test sets. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions using the “soft actor-critic (SAC) algorithm [23]” but does not specify its version or any other software dependencies with version numbers. |
| Experiment Setup | Yes | The experiments are performed over five different policy training times t^π ∈ {1, 2, 3, 4, 5}, aligned with SAC's numbers of gradient steps G ∈ {38, 76, 114, 152, 190}, under a fixed environment speed. We generate o_k = sin(2πk/37), which satisfies Assumption 1 (see Appendix E.1). The shaded areas of Figures 3 (a), (b), and (c) are the 95% confidence areas across three different noise bounds of 0.01, 0.02, and 0.03 in o_k. We elaborate on the training details in Appendix E.2. (A minimal sketch of this setup is given after the table.) |
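
For concreteness, the quoted experiment setup can be illustrated with a short Python sketch. This is not the authors' code: the additive uniform noise model, the episode count `K`, the random seed, and the helper name `nonstationary_schedule` are assumptions made here for illustration; only the sine schedule o_k = sin(2πk/37), the gradient-step settings {38, 76, 114, 152, 190}, and the noise bounds {0.01, 0.02, 0.03} are taken from the quoted text.

```python
import numpy as np

# Minimal sketch of the non-stationary parameter schedule quoted above.
# Assumption (not stated in the excerpt): the noise is additive and drawn
# uniformly within the stated bound; episodes are indexed k = 1..K.

def nonstationary_schedule(K, noise_bound=0.01, period=37, seed=0):
    """Return o_k = sin(2*pi*k / period) + bounded uniform noise for k = 1..K."""
    rng = np.random.default_rng(seed)
    k = np.arange(1, K + 1)
    base = np.sin(2 * np.pi * k / period)
    noise = rng.uniform(-noise_bound, noise_bound, size=K)
    return base + noise

# Settings quoted in the experiment-setup row: gradient steps per policy
# training time and the three noise bounds behind the 95% confidence areas.
gradient_steps = {1: 38, 2: 76, 3: 114, 4: 152, 5: 190}
for bound in (0.01, 0.02, 0.03):
    o = nonstationary_schedule(K=100, noise_bound=bound)
    print(f"noise bound {bound}: first three o_k = {np.round(o[:3], 3)}")
```

How the noise bound actually enters o_k is not specified in the excerpt, so the additive uniform form above is only one plausible reading.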