Quantum Policy Gradient Algorithm with Optimized Action Decoding
Authors: Nico Meyer, Daniel Scherer, Axel Plinge, Christopher Mutschler, Michael Hartmann
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The resulting algorithm demonstrates a significant performance improvement in several benchmark environments. With this technique, we successfully execute a full training routine on a 5-qubit hardware device. |
| Researcher Affiliation | Collaboration | Fraunhofer IIS, Fraunhofer Institute for Integrated Circuits IIS, Nuremberg, Germany; Department of Physics, Friedrich-Alexander University Erlangen-Nuremberg (FAU), Erlangen, Germany. |
| Pseudocode | No | The paper describes methods through textual descriptions, mathematical equations, and circuit diagrams (e.g., Figure 2), but does not include any explicit pseudocode blocks or algorithm listings. |
| Open Source Code | Yes | A repository with the framework to reproduce the main results of this paper is available at https://gitlab.com/NicoMeyer/qpg_classicalpp. |
| Open Datasets | Yes | The experiments in Sections 5.1 and 5.2 focus on the CartPole environment, while Section 5.3 and Appendix F also consider Contextual Bandits and Frozen Lake, respectively. |
| Dataset Splits | No | The paper mentions environments like CartPole-v0, Frozen Lake, and Contextual Bandits, but does not provide specific details on how these datasets were split into training, validation, and test sets. It mentions that 'results are usually averaged over ten independent runs', which refers to experiment aggregation, not data splitting. |
| Hardware Specification | Yes | The computations were executed on a CPU-cluster with 64 nodes, each equipped with 4 cores and 32 GB of working memory. |
| Software Dependencies | No | The implementation is based upon the qiskit and qiskit machine learning libraries (no versions are specified). If not stated differently, all experiments use the Statevector Simulator, which assumes the absence of noise and also eliminates sampling errors. The employed hardware backend is the 5-qubit device ibmq_manila v1.1.4 (IBM Quantum, 2023). A hedged Qiskit sketch of such a noiseless setup follows the table. |
| Experiment Setup | Yes | All experiments on the CartPole-v0 environment use a learning rate of α_θ = 0.01 for the variational and α_λ = 0.1 for the state-scaling parameters. In all other environments, a value of α = 0.1 is used for all parameter sets. A similar distinction is made w.r.t. parameter initialization, where CartPole-v0 setups select θ ~ N(0, 0.1), while the base option is always to draw the variational parameters uniformly at random from (−π, π]. The state-scaling parameters are all initialized to the constant value 1.0. The parameter update is performed using the Adam optimizer (Kingma & Ba, 2015), modified with the AMSGrad adjustment (Reddi et al., 2018). A discount factor of γ = 0.99 is used in all cases. No baseline function is used in any of the environments, as performance was found to be sophisticated even without. If not stated differently, the architecture from Figure 2 with a depth of d = 1 is used, where the number of qubits is adjusted to match the state dimensionality. To make RL training curves more stable, the results are usually averaged over ten independent runs. Additionally, the performance is averaged over the last 20 episodes (displayed in darker colors). Some plots also denote the performance of a random agent with a black dashed line and the optimal expected reward with a solid black one. A hedged sketch of these hyperparameters and the Adam/AMSGrad update follows the table. |
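
The following minimal sketch illustrates, under stated assumptions, how a depth-1 variational policy circuit with trainable state-scaling parameters could be evaluated noiselessly with Qiskit's `Statevector` class, in line with the dependencies listed above. The gate layout, the parameter names (`lam_*`, `theta_*`, `s_*`), and the `build_circuit` helper are illustrative assumptions rather than the authors' exact implementation; only the initialization values (λ = 1.0, θ ~ N(0, 0.1)) and the CartPole-v0 qubit count follow the table.

```python
# Hypothetical sketch: noiseless evaluation of a depth-1 variational policy circuit.
# The exact ansatz of Figure 2 is not reproduced here; gate choices are assumptions.
import numpy as np
from qiskit import QuantumCircuit
from qiskit.circuit import Parameter
from qiskit.quantum_info import Statevector

n_qubits, depth = 4, 1  # CartPole-v0: 4 state dimensions -> 4 qubits, d = 1

def build_circuit():
    qc = QuantumCircuit(n_qubits)
    lam = [Parameter(f"lam_{q}") for q in range(n_qubits)]           # state-scaling parameters
    theta = [Parameter(f"theta_{i}") for i in range(n_qubits * depth)]  # variational parameters
    s = [Parameter(f"s_{q}") for q in range(n_qubits)]               # classical state input
    for q in range(n_qubits):
        qc.ry(lam[q] * s[q], q)                                       # scaled state encoding
    for layer in range(depth):
        for q in range(n_qubits):
            qc.ry(theta[layer * n_qubits + q], q)                     # variational layer
        for q in range(n_qubits - 1):
            qc.cz(q, q + 1)                                           # entangling layer
    return qc, lam, theta, s

qc, lam, theta, s = build_circuit()

# Bind illustrative values: lambda initialised to 1.0, theta ~ N(0, 0.1), a dummy state.
rng = np.random.default_rng(0)
binding = {p: 1.0 for p in lam}
binding.update({p: rng.normal(0.0, 0.1) for p in theta})
binding.update(dict(zip(s, [0.1, -0.2, 0.05, 0.0])))

probs = Statevector(qc.assign_parameters(binding)).probabilities()
print(probs.shape)  # 2**n_qubits basis-state probabilities feeding the action decoding
```

Evaluating the bound circuit with `Statevector` reproduces the noise- and sampling-error-free setting described in the Software Dependencies row; a hardware run would instead submit the circuit to a backend such as ibmq_manila and estimate the probabilities from measurement shots.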
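
The sketch below collects the quoted hyperparameters in one place and shows a REINFORCE-style parameter update with Adam plus the AMSGrad correction. It is an assumption-laden illustration only: the `AdamAMSGrad` class, the placeholder gradients, and the Adam constants β1 = 0.9, β2 = 0.999, ε = 1e-8 are not specified in the paper.

```python
# Hypothetical sketch: hyperparameters and a REINFORCE-style update with Adam + AMSGrad,
# as described in the Experiment Setup row. The policy-gradient estimator itself
# (parameter-shift evaluation of the quantum circuit) is stubbed out with placeholders.
import numpy as np

# Hyperparameters quoted from the paper (CartPole-v0 settings).
ALPHA_THETA = 0.01                     # learning rate, variational parameters
ALPHA_LAMBDA = 0.1                     # learning rate, state-scaling parameters
GAMMA = 0.99                           # discount factor
BETA1, BETA2, EPS = 0.9, 0.999, 1e-8   # standard Adam constants (assumed, not from the paper)

rng = np.random.default_rng(0)
theta = rng.normal(0.0, 0.1, size=4)   # CartPole-v0 init: theta ~ N(0, 0.1)
lam = np.ones(4)                       # state-scaling parameters start at 1.0

def discounted_returns(rewards, gamma=GAMMA):
    """G_t = sum_k gamma^k * r_{t+k}, computed backwards over one episode."""
    g, out = 0.0, np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        out[t] = g
    return out

class AdamAMSGrad:
    """Adam with the AMSGrad correction: the second-moment estimate is replaced by its
    running maximum before the update (variants differ in bias correction details)."""
    def __init__(self, lr, shape):
        self.lr = lr
        self.m = np.zeros(shape)
        self.v = np.zeros(shape)
        self.v_hat = np.zeros(shape)
        self.t = 0

    def step(self, params, grad):
        self.t += 1
        self.m = BETA1 * self.m + (1 - BETA1) * grad
        self.v = BETA2 * self.v + (1 - BETA2) * grad**2
        self.v_hat = np.maximum(self.v_hat, self.v)
        m_hat = self.m / (1 - BETA1**self.t)
        return params + self.lr * m_hat / (np.sqrt(self.v_hat) + EPS)  # gradient ascent

opt_theta = AdamAMSGrad(ALPHA_THETA, theta.shape)
opt_lambda = AdamAMSGrad(ALPHA_LAMBDA, lam.shape)

# One illustrative update: in REINFORCE the gradient is sum_t G_t * grad log pi(a_t|s_t);
# here the per-step log-probability gradients are random placeholders standing in for
# parameter-shift evaluations of the quantum policy.
returns = discounted_returns([1.0] * 20)                      # CartPole-v0 gives +1 per step
grad_log_theta = rng.normal(size=(len(returns),) + theta.shape)
grad_log_lam = rng.normal(size=(len(returns),) + lam.shape)
theta = opt_theta.step(theta, (returns[:, None] * grad_log_theta).sum(axis=0))
lam = opt_lambda.step(lam, (returns[:, None] * grad_log_lam).sum(axis=0))
```

In the paper's setting the placeholder gradients would come from parameter-shift evaluations of the quantum policy weighted by the discounted returns; only the optimizer mechanics and the quoted constants are shown here.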