Multi-step Greedy Reinforcement Learning Algorithms

Authors: Manan Tomar, Yonathan Efroni, Mohammad Ghavamzadeh

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | When evaluated on a range of Atari and MuJoCo benchmark tasks, our results indicate that for the right range of κ, our algorithms outperform DQN and TRPO.
Researcher Affiliation | Collaboration | (1) Facebook AI Research, Menlo Park, USA; (2) Technion, Haifa, Israel; (3) Google Research, Mountain View, USA.
Pseudocode | Yes | Algorithm 1: κ-Policy Iteration; Algorithm 2: κ-Value Iteration; Algorithm 3: κ-PI-DQN; Algorithm 4: κ-PI-TRPO
Open Source Code | No | The paper cites external codebases such as OpenAI Baselines but does not provide concrete access to its own source code.
Open Datasets | Yes | We choose to test our κ-DQN and κ-TRPO algorithms on the Atari and MuJoCo benchmarks, respectively.
Dataset Splits | No | The paper describes total sample counts for training and iterations but does not provide specific train/validation/test dataset splits (percentages or counts) in the conventional sense.
Hardware Specification | No | The paper mentions using 'standard setups' but does not provide specific hardware details (e.g., exact GPU/CPU models or memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions optimization algorithms like 'Adam optimizer' and components like 'target Q value networks' but does not list specific software libraries or solvers with version numbers (e.g., 'PyTorch 1.9', 'Python 3.8').
Experiment Setup | Yes | Both of these algorithms use standard setups, including the use of the Adam optimizer for performing gradient descent, a discount factor of 0.99 across all tasks, target Q-value networks in the case of κ-DQN, and an entropy regularizer with a coefficient of 0.01 in the case of κ-TRPO. ... we set the total number of iterations to 2000, with each iteration consisting of 1000 samples. ... C_FA is set to 0.05 for all our experiments with other Atari domains. ... we set C_FA = 0.2 in our experiments with other MuJoCo domains.
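
For readers who want the quoted settings in one place, the values in the Experiment Setup row could be collected roughly as follows. This is only a sketch: the class name, field names, and algorithm labels are hypothetical and do not come from the paper or any released code. The row does not say which algorithm the iteration and sample counts apply to, so treating them as shared defaults is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExperimentSetup:
    """Hypothetical container for the settings quoted in the Experiment Setup row."""

    algorithm: str                        # label chosen for this sketch, not an official name
    c_fa: float                           # C_FA value quoted for the benchmark family
    discount: float = 0.99                # discount factor, quoted as used across all tasks
    optimizer: str = "adam"               # Adam is quoted as the optimizer
    num_iterations: int = 2000            # assumption: treated as a shared default
    samples_per_iteration: int = 1000     # assumption: treated as a shared default
    use_target_network: bool = False      # quoted only for the DQN-based variant
    entropy_coef: Optional[float] = None  # quoted only for the TRPO-based variant

    @property
    def total_samples(self) -> int:
        # Total samples implied by the quoted iteration and per-iteration counts.
        return self.num_iterations * self.samples_per_iteration


# C_FA = 0.05 is quoted for "other Atari domains", C_FA = 0.2 for "other MuJoCo domains".
KAPPA_DQN_ATARI = ExperimentSetup(
    algorithm="kappa-DQN", c_fa=0.05, use_target_network=True
)
KAPPA_TRPO_MUJOCO = ExperimentSetup(
    algorithm="kappa-TRPO", c_fa=0.2, entropy_coef=0.01
)

if __name__ == "__main__":
    for setup in (KAPPA_DQN_ATARI, KAPPA_TRPO_MUJOCO):
        print(setup.algorithm, setup.c_fa, setup.total_samples)
```

The elided passages ("...") in the row may qualify these values for specific domains, so the defaults above should be checked against the paper before reuse.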