Model-free Low-Rank Reinforcement Learning via Leveraged Entry-wise Matrix Estimation
Authors: Stefan Stojanovic, Yassir Jedra, Alexandre Proutiere
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Appendix A (Numerical Experiments): "All experiments in this section were performed on an HP EliteBook 830 G8 with an Intel i7 core and 16 GB of RAM. Each experiment's runtime for individual realizations took at most 2-3 hours, and reproducing all results is feasible within a day." |
| Researcher Affiliation | Academia | Stefan Stojanovic (KTH, Stockholm, Sweden, stesto@kth.se); Yassir Jedra (MIT, Cambridge, USA, jedra@mit.edu); Alexandre Proutiere (KTH, Digital Futures, Stockholm, Sweden, alepro@kth.se) |
| Pseudocode | Yes | Algorithm 1: Low-Rank Policy Iteration (LoRa-PI) |
| Open Source Code | Yes | Please refer to Appendix A and provided code in the supplementary material. |
| Open Datasets | No | The paper mentions using “synthetically generated low-rank MDPs” for numerical experiments (Appendix A). It does not provide concrete access information (specific link, DOI, repository name, formal citation with authors/year, or reference to established benchmark datasets) for a publicly available or open dataset. |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning for training, validation, and testing. It describes using synthetically generated MDPs without explicit split details. |
| Hardware Specification | Yes | All experiments in this section were performed on an HP EliteBook 830 G8 with an Intel i7 core and 16 GB of RAM. |
| Software Dependencies | No | The paper does not explicitly state specific software dependencies with version numbers (e.g., library names like PyTorch or TensorFlow with their versions). |
| Experiment Setup | Yes | We considered an MDP with S = A = 2, γ = 0.87, and a reward matrix given by... We initialized VI with V^(0) = [2.86, 2.98]. For LoRa-VI: S = A = 1000, γ = 0.1. We used K = 10 anchors, V^(0) = 0, and rewards are noisy with Gaussian noise σ = 0.01. (A hedged code sketch of this setup follows the table.) |
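
The following is a minimal sketch of the small value-iteration experiment quoted in the "Experiment Setup" row (S = A = 2, γ = 0.87, V^(0) = [2.86, 2.98], noisy rewards with σ = 0.01). The paper's exact reward matrix and transition kernel are not reproduced in the extract, so the ones below are hypothetical placeholders; plain value iteration is shown as a stand-in, not the paper's LoRa-PI/LoRa-VI procedures.

```python
import numpy as np

# Hypothetical setup: the paper's actual reward matrix ("given by...") and
# transitions are not available here, so random placeholders are used.
rng = np.random.default_rng(0)

S, A = 2, 2          # state/action space sizes quoted in the setup
gamma = 0.87         # discount factor quoted in the setup
sigma = 0.01         # Gaussian reward-noise level quoted for the LoRa-VI runs

R = rng.random((S, A))                # placeholder reward matrix (not the paper's)
P = rng.random((S, A, S))             # placeholder transition kernel
P /= P.sum(axis=2, keepdims=True)     # make each P[s, a, :] a probability distribution

def noisy_reward(s: int, a: int) -> float:
    """One noisy reward sample R(s, a) + N(0, sigma^2), as in the noisy-reward setting."""
    return R[s, a] + sigma * rng.standard_normal()

# Plain value iteration from the quoted initialization V^(0) = [2.86, 2.98].
V = np.array([2.86, 2.98])
for _ in range(500):
    Q = R + gamma * (P @ V)           # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] V[s']
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new

print("Value-iteration fixed point estimate:", V)
```

Under this sketch, scaling the same construction to the quoted LoRa-VI configuration (S = A = 1000, γ = 0.1, K = 10 anchors, V^(0) = 0) would additionally require the paper's leveraged entry-wise matrix estimation step, which is not reproduced here.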