Parametrized Quantum Policies for Reinforcement Learning

Authors: Sofiene Jerbi, Casper Gyurik, Simon Marshall, Hans Briegel, Vedran Dunjko

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose a hybrid quantum-classical reinforcement learning model using very few qubits, which we show can be effectively trained to solve several standard benchmarking environments. In this section, we evaluate the influence of these design choices on learning performance through numerical simulations. We consider three classical benchmarking environments from the OpenAI Gym library [24]: CartPole, MountainCar and Acrobot.
Researcher Affiliation | Academia | Sofiene Jerbi (Institute for Theoretical Physics, University of Innsbruck; sofiene.jerbi@uibk.ac.at), Casper Gyurik (LIACS, Leiden University), Simon C. Marshall (LIACS, Leiden University), Hans J. Briegel (Institute for Theoretical Physics, University of Innsbruck), Vedran Dunjko (LIACS, Leiden University)
Pseudocode | Yes | Algorithm 1: REINFORCE with PQC policies and value-function baselines
Open Source Code | Yes | An accompanying tutorial [36], implemented as part of the quantum machine learning library TensorFlow Quantum [37], provides the code required to reproduce our numerical results and explore different settings.
Open Datasets | Yes | We consider three classical benchmarking environments from the OpenAI Gym library [24]: CartPole, MountainCar and Acrobot.
Dataset Splits | No | The paper describes training and evaluating agents in reinforcement learning environments by generating episodes, but it does not specify explicit validation dataset splits (e.g., percentages or counts for a separate validation set) as typically found in supervised learning.
Hardware Specification | No | The paper mentions: "The computational results presented here have been achieved in part using the LEO HPC infrastructure of the University of Innsbruck." and in Appendix D.1: "Our numerical simulations were performed on a CPU-based cluster." These statements indicate the type of computing environment but do not provide specific details such as CPU/GPU models, memory, or number of cores.
Software Dependencies | No | The paper mentions the use of "TensorFlow Quantum [37]" but does not specify its version number or list other software dependencies with their specific versions.
Experiment Setup | Yes | Apart from the PQC depth, the shared hyperparameters of these two models were jointly picked so as to give the best overall performance for both; the hyperparameters specific to each model were optimized independently. In our hyperparameter search, we evaluated the performance of DNNs with a wide range of depths (number of hidden layers from 2 to 10) and widths (number of units per hidden layer from 8 to 64), and kept the architecture with the best average performance (depth 4, width 16).
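The table names Algorithm 1 (REINFORCE with PQC policies and value-function baselines) without reproducing it. As a rough illustration of the REINFORCE update that algorithm builds on, the sketch below trains a classical softmax policy on a toy three-armed bandit; the PQC policy, the multi-step environments, and the learned value-function baseline of the actual algorithm are all replaced here by deliberately simplified stand-ins (a logit vector, one-step episodes, and a running-mean baseline).

```python
import math
import random

def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_bandit(arm_rewards, episodes=2000, lr=0.1, seed=0):
    """REINFORCE on a toy multi-armed bandit (one-step episodes).

    Stand-ins relative to the paper's Algorithm 1: the policy is a
    classical softmax over per-arm logits rather than a PQC, and the
    baseline is a running mean of rewards rather than a learned
    value function.
    """
    rng = random.Random(seed)
    theta = [0.0] * len(arm_rewards)
    baseline = 0.0
    for t in range(episodes):
        probs = softmax(theta)
        a = rng.choices(range(len(theta)), weights=probs)[0]
        r = arm_rewards[a] + rng.gauss(0.0, 0.1)   # noisy reward
        baseline += (r - baseline) / (t + 1)       # running-mean baseline
        advantage = r - baseline
        # grad_theta log pi(a) for a softmax policy: one-hot(a) - probs
        for i in range(len(theta)):
            grad = (1.0 if i == a else 0.0) - probs[i]
            theta[i] += lr * advantage * grad
    return softmax(theta)

# Arm 1 pays the most on average, so the trained policy should favour it.
probs = reinforce_bandit([0.2, 1.0, 0.5])
```

The one-hot-minus-probabilities gradient is the standard log-likelihood gradient of a softmax policy; swapping the logits for PQC expectation values is precisely the modelling choice the paper studies.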
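The benchmarks (CartPole, MountainCar, Acrobot) come from OpenAI Gym, whose episodic reset/step interface the agents interact with. The toy class below mimics only the shape of that interface so the rollout loop runs self-contained; in the real library one would call gym.make("CartPole-v1"), and the random-walk dynamics here are an invented stand-in, not the actual cart-pole physics.

```python
import random

class TinyCartPole:
    """Toy environment exposing a Gym-style reset/step interface.

    The dynamics are a bounded random walk invented for illustration;
    only the method signatures mirror the OpenAI Gym convention of
    step(action) -> (observation, reward, done, info).
    """

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.position = 0.0
        self.steps = 0

    def reset(self):
        self.position = self.rng.uniform(-0.05, 0.05)
        self.steps = 0
        return self.position

    def step(self, action):
        # action 0 pushes left, action 1 pushes right
        self.position += -0.1 if action == 0 else 0.1
        self.steps += 1
        done = abs(self.position) > 1.0 or self.steps >= 200
        return self.position, 1.0, done, {}   # +1 reward per surviving step

# A standard episodic control loop, identical in shape to a Gym rollout.
env = TinyCartPole()
obs = env.reset()
total_reward, done = 0.0, False
while not done:
    action = 0 if obs > 0 else 1       # always push back toward the center
    obs, reward, done, _ = env.step(action)
    total_reward += reward
```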
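The hyperparameter search quoted in the last row (hidden layers from 2 to 10, widths from 8 to 64, best average performance kept) is a plain grid search over architectures. A minimal sketch, in which the hypothetical score_fn stands in for a full training-and-evaluation run:

```python
import itertools

def grid_search(score_fn, depths=range(2, 11), widths=(8, 16, 32, 64), seeds=range(5)):
    """Evaluate every (depth, width) pair over several seeds and keep
    the architecture with the best average score, mirroring the search
    protocol quoted above. score_fn(depth, width, seed) is a hypothetical
    placeholder for training a DNN and measuring its performance."""
    best_avg, best_arch = None, None
    for depth, width in itertools.product(depths, widths):
        avg = sum(score_fn(depth, width, s) for s in seeds) / len(seeds)
        if best_avg is None or avg > best_avg:
            best_avg, best_arch = avg, (depth, width)
    return best_arch

# Toy score peaking at depth 4, width 16 (the architecture the paper reports):
toy_score = lambda depth, width, seed: -abs(depth - 4) - abs(width - 16) / 8
depth, width = grid_search(toy_score)
```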