A new convergent variant of Q-learning with linear function approximation

Authors: Diogo Carvalho, Francisco S. Melo, Pedro Santos

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluated the CQL algorithm on three domains with increasing complexity. The first was the θ → 2θ example [21] and the second was the 7-star version of the star counterexample [2]. Both problems are known to cause divergence of Q-learning with linear function approximation. We also tested the algorithm on the mountain car problem [15]. On each domain, we compare CQL with standard Q-learning and GGQ [12].
Researcher Affiliation | Academia | INESC-ID & Instituto Superior Técnico, University of Lisbon, Lisbon, Portugal
Pseudocode | No | The paper presents the algorithm's updates using mathematical equations (4a) and (4b) but does not provide a separate pseudocode block or algorithm listing. (A generic sketch of this coupled-update pattern is included after the table below.)
Open Source Code | No | The paper does not provide any statement or link for open-sourcing the code.
Open Datasets | Yes | We evaluated the CQL algorithm on three domains with increasing complexity. The first was the θ → 2θ example [21] and the second was the 7-star version of the star counterexample [2]. Both problems are known to cause divergence of Q-learning with linear function approximation. We also tested the algorithm on the mountain car problem [15].
Dataset Splits | No | The paper mentions running experiments for a certain number of episodes and runs (e.g., "averaged over 30 runs of 10^3 episodes"), but does not specify traditional training/validation/test splits: the experiments are evaluations within an RL environment rather than on a static dataset.
Hardware Specification | No | The paper does not provide any specific hardware details such as GPU/CPU models or other hardware specifications used for running the experiments.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions, or other libraries/solvers).
Experiment Setup | Yes | On the first two domains, results were averaged over 30 runs of 10^3 episodes, considered γ = 0.99 and constant learning rates: α = 0.1 for the original algorithm; α = 0.05, β = 0.25 for CQL and GGQ. ... In the most simple example [21] there are only two states and one action, and the reward is always zero. The only feature has value 1 for the first state and 2 for the second state. We scaled the feature by a factor of 1/2, set the initial weight as 1, and randomly initialized every episode with equal probability for each state. Each episode consisted of a transition and the update. ... The dotted action and the solid action were then chosen with probability 5/6 and 1/6, respectively, on every episode. ... The basis functions used were bi-dimensional Gaussians. Each of the Gaussians had mean in the center of an ℓ × ℓ grid over the state space X of position and velocity pairs. The standard deviations were σ_p and σ_v on the position and velocity dimensions, respectively. ... On each run, the initial vectors were 0. The learning parameters used were pairs (α, β) ∈ {(10^-i, 10^-j) : i = 1, ..., 4, j = i, ..., 4}. ... Each episode ended when the car successfully climbed the hill or 200 transitions were made. An ϵ-greedy policy was used to learn, with ϵ = 0.3. (Illustrative sketches of the θ → 2θ setup and of the Gaussian feature grid follow the table.)
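
The "Pseudocode" row notes that the updates appear only as equations (4a) and (4b). For orientation, here is a minimal sketch of the coupled, two-timescale pattern that such an algorithm follows with linear function approximation: one weight vector takes a Q-learning-style TD step that bootstraps off a second vector, and the second vector tracks the first at a different rate (the experiments use two learning rates, α and β). The function name, variable names, and the exact roles of the two vectors are assumptions; the precise updates should be read from equations (4a) and (4b) in the paper.

```python
import numpy as np

def coupled_q_update(u, w, phi_sa, phi_next, r, gamma, alpha, beta):
    """One illustrative coupled update for Q-learning with linear features.

    u        : weights updated with a Q-learning-style TD step (rate beta).
    w        : weights used for the bootstrap target; slowly tracks u (rate alpha).
    phi_sa   : feature vector of the visited state-action pair, shape (d,).
    phi_next : feature vectors for all actions at the next state, shape (n_actions, d).

    NOTE: this is the generic two-timescale pattern only, not a transcription
    of the paper's equations (4a) and (4b).
    """
    # Q-learning-style temporal-difference error, bootstrapping with w.
    target = r + gamma * np.max(phi_next @ w)
    u = u + beta * (target - phi_sa @ u) * phi_sa
    # w tracks u along the visited feature direction, on its own learning rate.
    w = w + alpha * (phi_sa @ u - phi_sa @ w) * phi_sa
    return u, w
```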
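The experiment-setup row fully specifies the θ → 2θ domain except for its transition structure. The sketch below uses the quoted parameters (zero reward, feature values 1 and 2 scaled by 1/2, initial weight 1, γ = 0.99, α = 0.1, one transition per episode, uniformly random start state) and assumes the standard dynamics in which both states move to the second state; under that assumption, vanilla Q-learning with the single linear feature diverges, which is the failure mode the paper's convergent variant is meant to avoid.

```python
import numpy as np

# theta -> 2*theta example: two states, one action, zero reward.
# Features 1 and 2, scaled by 1/2 as in the quoted setup.
phi = np.array([0.5, 1.0])        # phi[s] for s in {0, 1}
gamma, alpha = 0.99, 0.1
w = 1.0                           # initial weight, as quoted
rng = np.random.default_rng(0)

for episode in range(10**3):
    s = rng.integers(2)           # start state drawn uniformly each episode
    s_next = 1                    # ASSUMPTION: both states transition to state 1
    r = 0.0
    # With a single action, the Q-learning update reduces to a TD(0) step.
    td_error = r + gamma * phi[s_next] * w - phi[s] * w
    w += alpha * td_error * phi[s]

print(w)   # |w| grows exponentially instead of shrinking toward the true value 0
```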
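For mountain car, the setup row describes bi-dimensional Gaussian basis functions centered on an ℓ × ℓ grid over the position-velocity state space, with standard deviations σ_p and σ_v. A minimal feature construction consistent with that description is sketched below; the grid size, widths, and state bounds used here are placeholders (standard mountain-car values), since the paper's exact choices are not quoted.

```python
import numpy as np

# Standard mountain-car state bounds; the paper's exact state space X,
# grid size l, and widths sigma_p, sigma_v are not quoted.
P_MIN, P_MAX = -1.2, 0.6
V_MIN, V_MAX = -0.07, 0.07

def gaussian_features(pos, vel, l=8, sigma_p=0.2, sigma_v=0.02):
    """Bi-dimensional Gaussian features with means on an l x l grid."""
    centers_p = np.linspace(P_MIN, P_MAX, l)
    centers_v = np.linspace(V_MIN, V_MAX, l)
    cp, cv = np.meshgrid(centers_p, centers_v)
    feats = np.exp(-((pos - cp) ** 2) / (2 * sigma_p ** 2)
                   - ((vel - cv) ** 2) / (2 * sigma_v ** 2))
    return feats.ravel()          # length l*l feature vector for state (pos, vel)

phi = gaussian_features(-0.5, 0.0)   # example: features of the usual start state
```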