Optimal Goal-Reaching Reinforcement Learning via Quasimetric Learning

Authors: Tongzhou Wang, Antonio Torralba, Phillip Isola, Amy Zhang

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we conduct thorough analyses on a discretized Mountain Car environment, identifying properties of QRL and its advantages over alternatives. On offline and online goal-reaching benchmarks, QRL also demonstrates improved sample efficiency and performance, across both state-based and image-based observations.
Researcher Affiliation | Collaboration | ¹MIT, ²UT Austin, ³Meta AI. Correspondence to: Tongzhou Wang <tongzhou@mit.edu>.
Pseudocode | No | The paper presents equations for its objective function (e.g., Equation 12) but does not include a clearly labeled 'Pseudocode' or 'Algorithm' block. (A hedged sketch of the constrained objective is given below the table.)
Open Source Code | Yes | Code: github.com/quasimetric-learning/quasimetric-rl
Open Datasets | Yes | On offline maze2d tasks, QRL performs well in single-goal and multi-goal evaluations, improving > 37% over the best baseline and > 46% over the D4RL hand-coded reference controller (Fu et al., 2020). ... we use the Fetch robot environments from the GCRL benchmark (Plappert et al., 2018).
Dataset Splits | No | The paper does not provide specific details on train/validation/test dataset splits (e.g., percentages, sample counts, or an explicit splitting methodology) beyond naming the datasets used for training and evaluation.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used to run its experiments.
Software Dependencies | No | The paper mentions using Adam (Kingma & Ba, 2014) as its optimizer but does not give version numbers for key software components such as the programming language (e.g., Python), the deep learning framework (e.g., PyTorch, TensorFlow), or CUDA.
Experiment Setup | Yes | All our results are aggregated from 5 runs with different seeds. QRL. Across all experiments, we use ε = 0.25, initialize the Lagrange multiplier λ = 0.01, and use Adam (Kingma & Ba, 2014) to optimize all parameters. ... Our learning rates are 0.01 for λ, 1 × 10⁻⁴ for the model parameters, and 3 × 10⁻⁵ for the policy parameters. We use a batch size of 256 in training. We prefill the replay buffer with 200 episodes from a random actor, and then iteratively perform (1) generating 10 rollouts and (2) optimizing the QRL objective for 500 gradient steps. We use N(0, 0.3²)-perturbed action noise in exploration. (See the configuration sketch below the table.)
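
Since the paper provides only equations rather than an algorithm block, here is a minimal PyTorch-style sketch of how the constrained objective behind Equation 12 could be implemented with the quantities quoted in the table: the learned quasimetric distance between sampled state-goal pairs is pushed up, while observed one-step distances are constrained not to exceed the transition cost, with slack ε = 0.25 and a Lagrange multiplier λ (initialized to 0.01). The function name `qrl_critic_losses` and the tensors `states`, `goals`, `next_states`, and `step_costs` are placeholders rather than names from the released code, and any monotone transform Equation 12 applies to the maximized distances is omitted.

```python
import torch
import torch.nn.functional as F


def qrl_critic_losses(quasimetric, lam, states, goals, next_states, step_costs, eps=0.25):
    """Sketch of a Lagrangian relaxation of the constrained quasimetric objective.

    `quasimetric(x, y)` is assumed to return non-negative distances d(x, y);
    `lam` is a scalar Lagrange-multiplier parameter (initialized to 0.01 in the paper).
    """
    # Push apart random state-goal pairs: larger d(s, g) is better, so negate for a loss.
    # (The exact transform applied to d(s, g) in Equation 12 is omitted in this sketch.)
    global_loss = -quasimetric(states, goals).mean()

    # Local constraint: one-step distances should not exceed the observed transition
    # cost; violations are penalized quadratically, with slack eps (0.25 in the paper).
    violation = F.relu(quasimetric(states, next_states) - step_costs).pow(2).mean()
    constraint_gap = violation - eps ** 2

    # The distance model minimizes the Lagrangian; lam is updated by dual ascent on the
    # same gap (hence the negated, detached term) and should be kept non-negative.
    critic_loss = global_loss + lam.detach() * constraint_gap
    lam_loss = -lam * constraint_gap.detach()
    return critic_loss, lam_loss
```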
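
The Experiment Setup row also translates directly into a training configuration. The sketch below collects the quoted hyperparameters and the reported online loop (prefill the buffer with 200 random-actor episodes, then alternate 10 rollouts with 500 gradient steps); `make_env`, `QRLAgent`, `ReplayBuffer`, `rollout`, and `random_policy` are hypothetical helpers, not part of the released code.

```python
import numpy as np

# Hyperparameters quoted in the Experiment Setup row; everything else is a placeholder.
CONFIG = dict(
    eps=0.25,                # constraint slack epsilon
    lambda_init=0.01,        # initial Lagrange multiplier
    lr_lambda=1e-2,          # Adam learning rate for lambda
    lr_model=1e-4,           # Adam learning rate for the model parameters
    lr_policy=3e-5,          # Adam learning rate for the policy parameters
    batch_size=256,
    prefill_episodes=200,    # random-actor episodes collected before training
    rollouts_per_iter=10,
    grad_steps_per_iter=500,
    action_noise_std=0.3,    # exploration noise ~ N(0, 0.3^2)
    num_seeds=5,             # results aggregated over 5 seeds
)


def train_one_seed(seed, cfg=CONFIG):
    """Sketch of the reported online loop for a single seed (helpers are hypothetical)."""
    env = make_env(seed)          # hypothetical environment constructor
    agent = QRLAgent(env, cfg)    # hypothetical agent wrapping the quasimetric model and policy
    buffer = ReplayBuffer()       # hypothetical replay buffer

    def noisy_policy(obs):
        # Exploration: Gaussian perturbation of the policy's action.
        action = agent.act(obs)
        return action + np.random.normal(0.0, cfg["action_noise_std"], size=np.shape(action))

    # 1) Prefill the replay buffer with episodes from a random actor.
    for _ in range(cfg["prefill_episodes"]):
        buffer.add_episode(rollout(env, random_policy))

    # 2) Alternate data collection and optimization until the training budget is used.
    while not agent.done_training():
        for _ in range(cfg["rollouts_per_iter"]):
            buffer.add_episode(rollout(env, noisy_policy))
        for _ in range(cfg["grad_steps_per_iter"]):
            agent.update(buffer.sample(cfg["batch_size"]))
```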