Mirror Learning: A Unifying Framework of Policy Optimisation

Authors: Jakub Grudzien, Christian A Schroeder De Witt, Jakob Foerster

ICML 2022

Reproducibility Variable Result LLM Response
Research Type Experimental We verify the correctness of our theory with numerical experiments. Their purpose is not to establish a new state-of-the-art performance in the most challenging deep RL benchmarks. It is, instead, to demonstrate that algorithms that fall into the mirror learning framework obey Theorem 3.6 and Inequality (9). Hence, to enable a close connection between the theory and experiments, we choose simple environments, and for drift functionals we selected: KL-divergence, squared L2 distance, squared total variation distance, and the trivial (zero) drift.
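For illustration, the four drift functionals named above can be sketched for discrete action distributions as follows. This is a minimal NumPy sketch; the function names and implementation details are our own assumptions, not taken from the paper's code.

```python
import numpy as np

def kl_drift(pi_new, pi_old):
    """KL divergence D_KL(pi_new || pi_old) between discrete distributions."""
    p = np.asarray(pi_new, dtype=float)
    q = np.asarray(pi_old, dtype=float)
    mask = p > 0  # terms with p = 0 contribute zero to the sum
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def squared_l2_drift(pi_new, pi_old):
    """Squared L2 distance between the two probability vectors."""
    diff = np.asarray(pi_new, dtype=float) - np.asarray(pi_old, dtype=float)
    return float(np.sum(diff ** 2))

def squared_tv_drift(pi_new, pi_old):
    """Squared total variation distance, TV = 0.5 * sum |p - q|."""
    diff = np.asarray(pi_new, dtype=float) - np.asarray(pi_old, dtype=float)
    return float((0.5 * np.sum(np.abs(diff))) ** 2)

def zero_drift(pi_new, pi_old):
    """Trivial (zero) drift: imposes no penalty on the policy update."""
    return 0.0
```

All four are non-negative and vanish when the two policies coincide, which is the defining property a mirror-learning drift must satisfy.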
Researcher Affiliation Academia University of Oxford. Correspondence to: Jakub Grudzien Kuba <jakub.grudzien@new.ox.ac.uk>.
Pseudocode No The paper does not contain structured pseudocode or algorithm blocks. It provides mathematical equations and descriptions of processes.
Open Source Code Yes The code is available at https://github.com/znowu/mirror-learning.
Open Datasets No The paper describes custom environments ("Single-step Game", "Tabular Game", "Grid World") used for experiments but does not provide concrete access information (link, DOI, repository) for these environments or any datasets used.
Dataset Splits No The paper describes the custom environments used for experiments but does not provide specific dataset split information (e.g., percentages, sample counts, or predefined splits) for training, validation, or testing.
Hardware Specification No The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. It only mentions running numerical experiments.
Software Dependencies No The paper does not provide specific ancillary software details with version numbers (e.g., library names with versions, solver names with versions) needed to replicate the experiment.
Experiment Setup No The paper states: "In all experiments, we set the initial-state and sampling distributions to uniform." and "For an exact verification of the theoretical results, we test each algorithm over only one random seed." While these provide some setup details, the paper omits the specific hyperparameters (e.g., learning rate, batch size, number of epochs) of the policy optimization algorithms used in the experiments, which are crucial for reproducibility.