Offline Quantum Reinforcement Learning in a Conservative Manner
Authors: Zhihao Cheng, Kaining Zhang, Li Shen, Dacheng Tao
AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct abundant experiments to demonstrate that the proposed method CQ2L can successfully solve offline QRL tasks that the online counterpart could not. |
| Researcher Affiliation | Collaboration | The University of Sydney, Australia; JD Explore Academy, China |
| Pseudocode | Yes | Algorithm 1: Conservative Quantum Q-learning (CQ2L) |
| Open Source Code | No | The paper mentions using open-source frameworks like TensorFlow Quantum and Cirq, and refers to d3rlpy for offline data creation, but does not provide a specific link or explicit statement about the availability of *their* implementation code for CQ2L. |
| Open Datasets | Yes | We select three OpenAI classic control tasks CartPole-v0, Acrobot-v1, and MountainCar-v0. ... we create offline data for CartPole-v0, Acrobot-v1, and MountainCar-v0 in a similar way as d3rlpy (Seno and Imai 2021). ... We refer readers to Seno and Imai (2021) and their codes for more details. (A hedged sketch of this style of data collection is given below the table.) |
| Dataset Splits | No | The paper mentions creating offline data and evaluating algorithms, but it does not specify a separate validation dataset or split. It focuses on training and testing/evaluation. |
| Hardware Specification | No | The paper states that Tensorflow Quantum and Cirq were used to simulate quantum states, but it does not specify any details about the underlying classical hardware (e.g., CPU, GPU models, memory) used for these simulations or experiments. |
| Software Dependencies | No | We implement the CQ2L algorithm according to Skolik, Jerbi, and Dunjko (2022); Jerbi et al. (2021); Seno and Imai (2021), in which Tensorflow Quantum (Broughton et al. 2020) and Cirq (Hancock et al. 2019) are used to simulate quantum states. ... updated Q(s, a) utilizing an Adam optimizer (Kingma and Ba 2014). The paper cites the frameworks but does not provide specific version numbers for these software dependencies (e.g., Tensorflow Quantum version, Cirq version, Python version). |
| Experiment Setup | Yes | In experiments, we use VQCs with 5 layers to represent Q-value functions. There are 4, 6, and 2 qubits of VQCs for CartPole-v0, Acrobot-v1, and MountainCar-v0, respectively. ... The learning rates for the VQC parameters ξ_θ = [ξ_λ, ξ_ϕ, ξ_ν] are 0.001, 0.001, and 0.1, respectively. The target Q-value Y_k^DoubleQ is calculated with the discount factor γ = 0.99 and then forwarded into a Huber loss (Akkaya and Pınar 2020). For every iteration, we sample data from D with a batch size of 16 and update Q(s, a) utilizing an Adam optimizer (Kingma and Ba 2014). (A hedged training-step sketch using these hyperparameters is given below the table.) |
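
The dataset row above reports that offline data is created from OpenAI Gym classic control tasks in a d3rlpy-like way. Below is a minimal sketch of what collecting such transition data from CartPole-v0 could look like, assuming the pre-0.26 Gym API; the random behavior policy, episode count, and dataset field names are illustrative assumptions, not details from the paper.

```python
import gym
import numpy as np

# Assumes the classic Gym API (gym < 0.26): reset() -> obs, step() -> (obs, reward, done, info).
env = gym.make("CartPole-v0")
transitions = []        # (state, action, reward, next_state, done) tuples
num_episodes = 100      # illustrative dataset size, not from the paper

for _ in range(num_episodes):
    state, done = env.reset(), False
    while not done:
        action = env.action_space.sample()  # placeholder behavior policy
        next_state, reward, done, _ = env.step(action)
        transitions.append((state, action, reward, next_state, float(done)))
        state = next_state

# Pack the buffer into arrays, roughly mirroring d3rlpy-style dataset fields.
states, actions, rewards, next_states, dones = map(np.array, zip(*transitions))
```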
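
Algorithm 1 (CQ2L) applies a conservative Q-learning update to a VQC-based Q-function. The sketch below shows what one such update step could look like under the quoted hyperparameters (γ = 0.99, batch size 16, Double-Q target, Huber loss, Adam); a plain Keras network stands in for the 5-layer VQC built with TensorFlow Quantum/Cirq, and the names `make_q_net`, `cql_update`, and the penalty weight `alpha` are assumptions for illustration, not taken from the paper.

```python
import tensorflow as tf

gamma = 0.99   # discount factor quoted in the table above
alpha = 1.0    # conservative-penalty weight (assumed; not reported in the table)

def make_q_net(state_dim: int, num_actions: int) -> tf.keras.Model:
    # Stand-in for the 5-layer VQC Q-function: maps a state to one Q-value per action.
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(state_dim,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(num_actions),
    ])

state_dim, num_actions = 4, 2                  # CartPole-v0 dimensions
q_net = make_q_net(state_dim, num_actions)
q_target = make_q_net(state_dim, num_actions)  # target network, synced periodically (omitted)
q_target.set_weights(q_net.get_weights())
# Single learning rate as a simplification; the paper uses per-parameter-group rates 0.001/0.001/0.1.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
huber = tf.keras.losses.Huber()

def cql_update(s, a, r, s_next, done):
    """One conservative Q-learning step on a batch of offline transitions.

    s: [B, state_dim] float, a: [B] int, r: [B] float,
    s_next: [B, state_dim] float, done: [B] float in {0, 1}.
    """
    # Double-Q target: the online net selects the next action, the target net evaluates it.
    next_a = tf.argmax(q_net(s_next), axis=1, output_type=tf.int32)
    next_q = tf.gather(q_target(s_next), next_a, axis=1, batch_dims=1)
    y = r + gamma * (1.0 - done) * next_q

    with tf.GradientTape() as tape:
        q_all = q_net(s)                                    # Q(s, ·) for every action
        q_data = tf.gather(q_all, a, axis=1, batch_dims=1)  # Q(s, a) on dataset actions
        bellman = huber(y, q_data)                          # Huber Bellman error
        # Conservative term: push down Q-values over all actions relative to dataset actions.
        conservative = tf.reduce_mean(tf.reduce_logsumexp(q_all, axis=1) - q_data)
        loss = bellman + alpha * conservative
    grads = tape.gradient(loss, q_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_net.trainable_variables))
    return loss
```

In the paper's setting, `make_q_net` would be replaced by the data re-uploading VQC of Skolik, Jerbi, and Dunjko (2022) built with TensorFlow Quantum, and `cql_update` would be iterated over batches of 16 transitions sampled from the offline dataset D.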