Offline Quantum Reinforcement Learning in a Conservative Manner

Authors: Zhihao Cheng, Kaining Zhang, Li Shen, Dacheng Tao

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct abundant experiments to demonstrate that the proposed method CQ2L can successfully solve offline QRL tasks that the online counterpart could not."
Researcher Affiliation | Collaboration | The University of Sydney, Australia; JD Explore Academy, China
Pseudocode | Yes | Algorithm 1: Conservative Quantum Q-learning (CQ2L). A hedged sketch of the corresponding update step is given after this table.
Open Source Code | No | The paper mentions using open-source frameworks such as TensorFlow Quantum and Cirq, and refers to d3rlpy for offline data creation, but it does not provide a link to, or an explicit statement about the availability of, the authors' own CQ2L implementation.
Open Datasets | Yes | "We select three OpenAI classic control tasks CartPole-v0, Acrobot-v1, and MountainCar-v0. ... we create offline data for CartPole-v0, Acrobot-v1, and MountainCar-v0 in a similar way as d3rlpy (Seno and Imai 2021). ... We refer readers to Seno and Imai (2021) and their codes for more details."
Dataset Splits | No | The paper describes creating offline data and evaluating the algorithms, but it does not specify a separate validation dataset or split; it covers only training and testing/evaluation.
Hardware Specification | No | The paper states that TensorFlow Quantum and Cirq were used to simulate quantum states, but it does not specify the underlying classical hardware (e.g., CPU or GPU models, memory) used for the simulations or experiments.
Software Dependencies | No | "We implement the CQ2L algorithm according to Skolik, Jerbi, and Dunjko (2022); Jerbi et al. (2021); Seno and Imai (2021), in which Tensorflow Quantum (Broughton et al. 2020) and Cirq (Hancock et al. 2019) are used to simulate quantum states. ... updated Q(s, a) utilizing an Adam optimizer (Kingma and Ba 2014)." The paper cites these frameworks but does not give version numbers for its software dependencies (e.g., the TensorFlow Quantum, Cirq, or Python versions).
Experiment Setup | Yes | "In experiments, we use VQCs with 5 layers to represent Q-value functions. There are 4, 6, and 2 qubits of VQCs for CartPole-v0, Acrobot-v1, and MountainCar-v0, respectively. ... The learning rates for VQC parameters ξθ = [ξλ, ξϕ, ξν] are 0.001, 0.001, and 0.1, respectively. The target Q-value Y_k^DoubleQ is calculated with the discount factor γ = 0.99 and then forwarded into a Huber loss (Akkaya and Pınar 2020). For every iteration, we sample data from D with a batch size of 16 and update Q(s, a) utilizing an Adam optimizer (Kingma and Ba 2014)." Hedged sketches of a matching VQC ansatz and of a conservative update step follow this table.
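
To make the reported architecture concrete, the following is a minimal Cirq sketch of a 5-layer, 4-qubit variational circuit of the kind the setup describes for CartPole-v0. Only the layer and qubit counts come from the paper; the specific ansatz, gate set, state-encoding scheme, and the split of parameters into the λ, φ, ν groups used by CQ2L are not given in this section, so those choices below are assumptions.

```python
import cirq
import sympy

N_QUBITS = 4   # CartPole-v0 setting reported in the paper (6 for Acrobot-v1, 2 for MountainCar-v0)
N_LAYERS = 5   # number of variational layers reported in the paper


def build_vqc(n_qubits: int = N_QUBITS, n_layers: int = N_LAYERS):
    """Hardware-efficient ansatz (assumed): per-layer single-qubit rotations plus a CZ entangling ring."""
    qubits = cirq.GridQubit.rect(1, n_qubits)
    # Two trainable angles per qubit per layer; the paper's lambda/phi/nu grouping is not reproduced here.
    params = sympy.symbols(f"theta0:{2 * n_qubits * n_layers}")
    circuit = cirq.Circuit()
    it = iter(params)
    for _ in range(n_layers):
        for q in qubits:                      # trainable rotations on every wire
            circuit.append(cirq.ry(next(it))(q))
            circuit.append(cirq.rz(next(it))(q))
        for i in range(n_qubits):             # entangle neighbouring qubits in a ring
            circuit.append(cirq.CZ(qubits[i], qubits[(i + 1) % n_qubits]))
    # Q-values would typically be read out as expectation values of Pauli-Z observables.
    readouts = [cirq.Z(q) for q in qubits]
    return circuit, list(params), readouts


circuit, params, readouts = build_vqc()
print(len(params), "trainable parameters")    # 40 for the 4-qubit, 5-layer case
```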
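Algorithm 1 (CQ2L) is described as conservative quantum Q-learning with a double-Q target, a Huber loss, γ = 0.99, a batch size of 16, and an Adam optimizer. The sketch below shows how one such update step could look in TensorFlow; the conservative penalty follows the standard CQL logsumexp form, and the weight ALPHA, the batch interface, and the single shared learning rate are assumptions rather than details taken from the paper (which reports separate learning rates 0.001, 0.001, and 0.1 for its parameter groups).

```python
import tensorflow as tf

GAMMA = 0.99       # discount factor reported in the paper
BATCH_SIZE = 16    # batch size reported in the paper
ALPHA = 1.0        # hypothetical conservative-penalty weight (not given in this section)

huber = tf.keras.losses.Huber()
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)  # paper uses per-group rates (0.001, 0.001, 0.1)


def train_step(q_net, target_q_net, batch):
    """One conservative double-Q update on an offline minibatch.

    q_net / target_q_net map states -> Q-values over all actions (e.g. VQC-based networks).
    batch holds (states, actions, rewards, next_states, dones); actions are integer indices,
    dones are 0/1 floats.
    """
    states, actions, rewards, next_states, dones = batch

    # Double-Q target: action chosen by the online network, evaluated by the target network.
    next_actions = tf.argmax(q_net(next_states), axis=1)
    next_q = tf.gather(target_q_net(next_states), next_actions, axis=1, batch_dims=1)
    targets = rewards + GAMMA * (1.0 - dones) * next_q

    with tf.GradientTape() as tape:
        q_all = q_net(states)                                       # Q(s, .) for every action
        q_taken = tf.gather(q_all, actions, axis=1, batch_dims=1)   # Q(s, a) for dataset actions

        # TD term: Huber loss against the stopped-gradient double-Q target.
        td_loss = huber(tf.stop_gradient(targets), q_taken)

        # CQL-style conservative term: push Q down on all actions, up on dataset actions.
        conservative = tf.reduce_mean(tf.reduce_logsumexp(q_all, axis=1) - q_taken)

        loss = td_loss + ALPHA * conservative

    grads = tape.gradient(loss, q_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_net.trainable_variables))
    return loss
```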