Optimal Goal-Reaching Reinforcement Learning via Quasimetric Learning
Authors: Tongzhou Wang, Antonio Torralba, Phillip Isola, Amy Zhang
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we conduct thorough analyses on a discretized Mountain Car environment, identifying properties of QRL and its advantages over alternatives. On offline and online goal-reaching benchmarks, QRL also demonstrates improved sample efficiency and performance, across both state-based and image-based observations. |
| Researcher Affiliation | Collaboration | ¹MIT, ²UT Austin, ³Meta AI. Correspondence to: Tongzhou Wang <tongzhou@mit.edu>. |
| Pseudocode | No | The paper presents equations for its objective function (e.g., Equation 12) but does not include any clearly labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | Yes | Code: github.com/quasimetric-learning/quasimetric-rl |
| Open Datasets | Yes | On offline maze2d tasks, QRL performs well in single-goal and multi-goal evaluations, improving > 37% over the best baseline and > 46% over the D4RL hand-coded reference controller (Fu et al., 2020). ... we use the Fetch robot environments from the GCRL benchmark (Plappert et al., 2018). |
| Dataset Splits | No | The paper does not provide specific details on train/validation/test dataset splits (e.g., percentages, sample counts, or explicit methodology for splitting) beyond mentioning the datasets used for training and evaluation. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running its experiments. |
| Software Dependencies | No | The paper mentions using 'Adam (Kingma & Ba, 2014)' as an optimizer but does not provide specific version numbers for other key software components like programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or CUDA. |
| Experiment Setup | Yes | All our results are aggregated from 5 runs with different seeds. QRL. Across all experiments, we use ϵ = 0.25, initialize the Lagrange multiplier λ = 0.01, and use Adam (Kingma & Ba, 2014) to optimize all parameters. ... Our learning rates are 0.01 for λ, 1 × 10⁻⁴ for the model parameters, and 3 × 10⁻⁵ for the policy parameters. We use a batch size of 256 in training. We prefill the replay buffer with 200 episodes from a random actor, and then iteratively perform (1) generating 10 rollouts and (2) optimizing the QRL objective for 500 gradient steps. We use N(0, 0.3²)-perturbed action noise in exploration. |
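
The setup quoted in the last row pins down the online training schedule fairly concretely. Below is a minimal Python sketch of that loop, assuming hypothetical helpers (`QRLAgent`, `ReplayBuffer`, `collect_episode`, `make_env`, `compute_losses`) that are not the authors' released API; only the hyperparameters (ϵ = 0.25, λ initialized to 0.01, the Adam learning rates, batch size 256, the 200-episode random prefill, 10 rollouts per 500 gradient steps, and N(0, 0.3²) action noise) come from the paper as quoted above.

```python
# Minimal sketch of the reported online training schedule. The names
# QRLAgent, ReplayBuffer, collect_episode, make_env, and compute_losses
# are hypothetical placeholders, not the authors' API.
import torch

EPSILON = 0.25           # constraint relaxation ϵ in the QRL objective
LAMBDA_INIT = 0.01       # initial Lagrange multiplier λ
BATCH_SIZE = 256
ACTION_NOISE_STD = 0.3   # exploration noise N(0, 0.3²) on actions
NUM_ITERATIONS = 1000    # placeholder; not specified in the quoted setup

agent = QRLAgent()       # hypothetical: quasimetric model + goal-conditioned policy
buffer = ReplayBuffer()  # hypothetical episode/transition storage
env = make_env()         # hypothetical environment constructor

# λ is kept positive by optimizing its log (an implementation assumption).
log_lambda = torch.nn.Parameter(torch.log(torch.tensor(LAMBDA_INIT)))

# Adam optimizers with the learning rates quoted above.
opt_lambda = torch.optim.Adam([log_lambda], lr=1e-2)
opt_model = torch.optim.Adam(agent.model_parameters(), lr=1e-4)
opt_policy = torch.optim.Adam(agent.policy_parameters(), lr=3e-5)
optimizers = (opt_model, opt_policy, opt_lambda)

# Prefill the replay buffer with 200 episodes from a random actor.
for _ in range(200):
    buffer.add_episode(collect_episode(env, policy=None))

for _ in range(NUM_ITERATIONS):
    # (1) Generate 10 rollouts with Gaussian-perturbed actions.
    for _ in range(10):
        episode = collect_episode(env, policy=agent.policy,
                                  action_noise_std=ACTION_NOISE_STD)
        buffer.add_episode(episode)

    # (2) Optimize the QRL objective for 500 gradient steps.
    for _ in range(500):
        batch = buffer.sample(BATCH_SIZE)
        # Hypothetical helper returning the Lagrangian losses for the
        # quasimetric model, the policy, and the multiplier λ.
        losses = agent.compute_losses(batch, log_lambda.exp(), EPSILON)
        total = losses["model"] + losses["policy"] + losses["lambda"]
        for opt in optimizers:
            opt.zero_grad()
        total.backward()
        for opt in optimizers:
            opt.step()
```

Parameterizing log λ and combining the three losses into a single backward pass are assumptions made for brevity; the released code at github.com/quasimetric-learning/quasimetric-rl is the authoritative reference for how the objective and updates are actually implemented.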