Multi-step Greedy Reinforcement Learning Algorithms
Authors: Manan Tomar, Yonathan Efroni, Mohammad Ghavamzadeh
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | When evaluated on a range of Atari and MuJoCo benchmark tasks, our results indicate that for the right range of κ, our algorithms outperform DQN and TRPO. |
| Researcher Affiliation | Collaboration | ¹Facebook AI Research, Menlo Park, USA; ²Technion, Haifa, Israel; ³Google Research, Mountain View, USA. |
| Pseudocode | Yes | Algorithm 1: κ-Policy Iteration (κ-PI); Algorithm 2: κ-Value Iteration (κ-VI); Algorithm 3: κ-PI-DQN; Algorithm 4: κ-PI-TRPO (a tabular κ-PI sketch appears below the table) |
| Open Source Code | No | The paper cites external codebases such as OpenAI Baselines but does not provide access to its own source code. |
| Open Datasets | Yes | We choose to test our κ-DQN and κ-TRPO algorithms on the Atari and MuJoCo benchmarks, respectively. |
| Dataset Splits | No | The paper reports total sample counts and iteration budgets for training but does not provide conventional train/validation/test dataset splits (percentages or counts). |
| Hardware Specification | No | The paper mentions using 'standard setups' but does not provide specific hardware details (e.g., exact GPU/CPU models or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions optimization algorithms like 'Adam optimizer' and components like 'target Q value networks' but does not list specific software libraries or solvers with version numbers (e.g., 'PyTorch 1.9', 'Python 3.8'). |
| Experiment Setup | Yes | Both of these algorithms use standard setups, including the use of the Adam optimizer for performing gradient descent, a discount factor of 0.99 across all tasks, target Q-value networks in the case of κ-DQN, and an entropy regularizer with a coefficient of 0.01 in the case of κ-TRPO. ... we set the total number of iterations to 2000, with each iteration consisting of 1000 samples. ... C_FA is set to 0.05 for all our experiments with other Atari domains. ... we set C_FA = 0.2 in our experiments with other MuJoCo domains. (These hyperparameters are collected in the configuration sketch below the table.) |
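
For readers skimming the Pseudocode row, here is a minimal tabular sketch of κ-policy iteration (Algorithm 1), assuming the κ-greedy formulation the paper builds on: the κ-greedy policy with respect to a value function v is the optimal policy of a surrogate MDP with discount κγ and shaped reward r(s, a) + (1 − κ)γ E[v(s′)]. The transition tensor `P`, reward matrix `R`, and solver choices below are illustrative stand-ins, not the authors' code.

```python
import numpy as np

def kappa_greedy_policy(P, R, v, gamma, kappa, n_sweeps=500):
    """kappa-greedy step: solve the surrogate MDP (discount kappa*gamma,
    reward r + (1 - kappa) * gamma * E[v(s')]) by value iteration."""
    # P: (S, A, S) transition tensor, R: (S, A) reward matrix, v: (S,) values.
    R_kappa = R + (1.0 - kappa) * gamma * (P @ v)      # shaped reward, (S, A)
    q = np.zeros_like(R)
    for _ in range(n_sweeps):
        q = R_kappa + kappa * gamma * (P @ q.max(axis=1))
    return q.argmax(axis=1)                            # greedy action per state

def evaluate_policy(P, R, pi, gamma):
    """Exact evaluation of pi: v = (I - gamma * P_pi)^(-1) r_pi."""
    s_idx = np.arange(R.shape[0])
    P_pi, r_pi = P[s_idx, pi], R[s_idx, pi]            # (S, S) and (S,)
    return np.linalg.solve(np.eye(R.shape[0]) - gamma * P_pi, r_pi)

def kappa_policy_iteration(P, R, gamma=0.99, kappa=0.5, n_iterations=20):
    """Algorithm 1 in spirit: alternate kappa-greedy improvement and evaluation."""
    v = np.zeros(R.shape[0])
    pi = np.zeros(R.shape[0], dtype=int)
    for _ in range(n_iterations):
        pi = kappa_greedy_policy(P, R, v, gamma, kappa)
        v = evaluate_policy(P, R, pi, gamma)
    return pi, v
```

Setting κ = 0 reduces the improvement step to the standard one-step greedy update of policy iteration, while κ = 1 makes the surrogate MDP coincide with the original problem; the paper's experiments study the intermediate regime.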
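
For convenience, here is a hypothetical configuration sketch collecting the hyperparameters quoted in the Experiment Setup row; the dataclass and its field names are ours for illustration, since the paper does not release its own code.

```python
from dataclasses import dataclass

@dataclass
class KappaExperimentConfig:
    """Hypothetical container for the setup quoted above; field names are ours."""
    optimizer: str = "adam"            # gradient descent performed with Adam
    gamma: float = 0.99                # discount factor, shared across all tasks
    n_iterations: int = 2000           # total number of iterations
    samples_per_iteration: int = 1000  # samples collected per iteration
    use_target_network: bool = False   # target Q-value networks (kappa-DQN only)
    entropy_coef: float = 0.0          # entropy regularizer (kappa-TRPO only)
    c_fa: float = 0.05                 # final-accuracy coefficient C_FA

# Settings implied by the quotes: C_FA = 0.05 on Atari, C_FA = 0.2 on MuJoCo.
atari_dqn = KappaExperimentConfig(use_target_network=True, c_fa=0.05)
mujoco_trpo = KappaExperimentConfig(entropy_coef=0.01, c_fa=0.2)
```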