Sample Efficient Reinforcement Learning via Low-Rank Matrix Estimation
Authors: Devavrat Shah, Dogyoon Song, Zhi Xu, Yuzhe Yang
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results on several stochastic control tasks confirm the efficacy of our low-rank algorithms. |
| Researcher Affiliation | Academia | Devavrat Shah (EECS, MIT) devavrat@mit.edu; Dogyoon Song (EECS, MIT) dgsong@mit.edu; Zhi Xu (EECS, MIT) zhixu@mit.edu; Yuzhe Yang (EECS, MIT) yuzhe@mit.edu |
| Pseudocode | Yes | We provide a narrative overview of the algorithm; the pseudo-code can be found in Appendix A. |
| Open Source Code | No | The paper does not provide a direct link or explicit statement about the availability of open-source code for the described methodology. |
| Open Datasets | No | The paper mentions using 'several stochastic control tasks' and that they 'first discretize the spaces into very fine grid and run standard value iteration to obtain a proxy of Q'. However, it does not provide concrete access information (links, DOIs, formal citations) to publicly available datasets used for training. |
| Dataset Splits | No | The paper does not explicitly provide specific details about training, validation, or test dataset splits (e.g., percentages, sample counts, or citations to predefined splits) for reproducibility. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | The detailed setup can be found in Appendix H. In short, we first discretize the spaces into a very fine grid and run standard value iteration to obtain a proxy of Q. The proxy has a very small approximate rank in all tasks; we hence use r = 10 for our experiments. As mentioned, we simply select r states and r actions that are far from each other in their respective metric. |
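
The authors' code is not released, so the following is only a minimal NumPy sketch of the two steps summarized in the Experiment Setup row: estimating the approximate rank of a value-iteration Q proxy on a discretized grid, and picking r = 10 anchor states and actions that are far apart in their respective metrics. The toy grids, the synthetic Q proxy, the 99% energy threshold, and the greedy farthest-point heuristic are all illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def approximate_rank(Q, energy=0.99):
    """Smallest k whose top-k singular values capture `energy` of the total
    spectral energy (one common notion of approximate rank; threshold assumed)."""
    s = np.linalg.svd(Q, compute_uv=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(cum, energy) + 1)

def farthest_point_anchors(points, r, seed=0):
    """Greedy farthest-point sampling: pick r points that are mutually far
    apart under Euclidean distance. Returns their row indices."""
    rng = np.random.default_rng(seed)
    anchors = [int(rng.integers(len(points)))]
    dists = np.linalg.norm(points - points[anchors[0]], axis=1)
    for _ in range(r - 1):
        nxt = int(np.argmax(dists))
        anchors.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(points - points[nxt], axis=1))
    return anchors

# Illustrative usage on a synthetic low-rank Q proxy (random toy data standing
# in for the value-iteration proxy computed on the fine grid).
rng = np.random.default_rng(0)
states = rng.uniform(-1, 1, size=(500, 2))    # discretized state grid (toy)
actions = rng.uniform(-1, 1, size=(100, 2))   # discretized action grid (toy)
Q_proxy = states @ rng.normal(size=(2, 2)) @ actions.T  # rank <= 2 toy matrix

print("approximate rank:", approximate_rank(Q_proxy))
r = 10
anchor_states = farthest_point_anchors(states, r, seed=0)
anchor_actions = farthest_point_anchors(actions, r, seed=1)
```

In this sketch the anchor rows and columns of the Q proxy would then feed the low-rank completion step; that step is described in the paper's Appendix A pseudocode and is not reproduced here.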