A State-Distribution Matching Approach to Non-Episodic Reinforcement Learning

Authors: Archit Sharma, Rehaan Ahmad, Chelsea Finn

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that MEDAL matches or outperforms prior methods on three sparse-reward continuous control tasks from the EARL benchmark, with 40% gains on the hardest task, while making fewer assumptions than prior works. In Section 5, we empirically analyze the performance of MEDAL on the Environments for Autonomous RL (EARL) benchmark (Sharma et al., 2022).
Researcher Affiliation | Academia | Stanford University, CA, USA. Correspondence to: Archit Sharma <architsh@stanford.edu>, Rehaan Ahmad <rehaan@stanford.edu>.
Pseudocode | Yes | The pseudocode for MEDAL is provided in Algorithm 1, and further implementation details can be found in Appendix A. Algorithm 1: Matching Expert Distributions for Autonomous Learning (MEDAL). (A hedged sketch of the core training step appears after this table.)
Open Source Code | No | The paper provides a link to a project homepage ("https://sites.google.com/view/medal-arl/home") under "Code and videos are at:", but this is not a direct link to a source-code repository and does not unambiguously state that source code for the methodology is provided at that URL.
Open Datasets | Yes | We consider three sparse-reward continuous-control environments from the EARL benchmark (Sharma et al., 2022). The environment details can be found in (Sharma et al., 2022).
Dataset Splits | No | The paper does not explicitly provide details about validation dataset splits (e.g., percentages or specific counts for a validation set).
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running experiments.
Software Dependencies | No | The paper mentions software components like "TF-Agents" and "SAC" but does not specify their version numbers, which are necessary for reproducibility.
Experiment Setup | Yes | Hyperparameters follow the default values: initial collect steps: 10,000; batch size sampled from the replay buffer for updating the policy and critic: 256; steps collected per iteration: 1; training steps per iteration: 1; discount factor: 0.99; learning rate: 3e-4 (for critics, actors, and discriminator). The actor and critic networks were parameterized as neural networks with two hidden layers each of size 256. The discriminator was parameterized as a neural network with one hidden layer of size 128. The batch size for the discriminator is set to 800 for all environments. (These values are gathered into a config sketch below the table.)
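
The Pseudocode row refers to Algorithm 1 (MEDAL), whose core component is a classifier trained to distinguish demonstration states from states visited by the backward policy, with the backward policy rewarded for visiting demonstration-like states. The following is a minimal sketch of that component only, assuming a logistic-regression discriminator over synthetic states in place of the paper's neural-network classifier and omitting the SAC forward/backward policies entirely; the variable names, toy data, and log-odds reward form are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM = 4

# Stand-ins for the two state sources MEDAL compares: states from forward-task
# demonstrations, and states visited by the backward policy (synthetic here).
demo_states = rng.normal(loc=1.0, scale=1.0, size=(800, STATE_DIM))
backward_states = rng.normal(loc=0.0, scale=1.0, size=(800, STATE_DIM))

# Logistic discriminator C(s) ~ P(s is a demonstration state). The paper uses a
# small neural network (one hidden layer of 128 units); a linear model keeps the
# sketch short.
w = np.zeros(STATE_DIM)
b = 0.0

def predict(states: np.ndarray) -> np.ndarray:
    """Probability that each state comes from the demonstration distribution."""
    return 1.0 / (1.0 + np.exp(-(states @ w + b)))

def discriminator_step(lr: float = 0.1) -> None:
    """One binary cross-entropy gradient step: demo states labeled 1, backward states 0."""
    global w, b
    states = np.concatenate([demo_states, backward_states])
    labels = np.concatenate([np.ones(len(demo_states)), np.zeros(len(backward_states))])
    grad_logits = (predict(states) - labels) / len(states)  # d(BCE)/d(logit)
    w -= lr * states.T @ grad_logits
    b -= lr * grad_logits.sum()

def backward_reward(states: np.ndarray) -> np.ndarray:
    """Reward for the backward policy: larger when the discriminator believes the
    state is demonstration-like. The log-odds form used here is an assumption;
    the key property is that maximizing it pushes the backward policy's state
    distribution toward the demonstration state distribution."""
    c = np.clip(predict(states), 1e-6, 1 - 1e-6)
    return np.log(c) - np.log(1.0 - c)

# In the full algorithm, this discriminator update is interleaved with SAC updates
# for the forward policy (task reward) and the backward policy (backward_reward).
for _ in range(2000):
    discriminator_step()

print("mean backward reward on demo states:    ", backward_reward(demo_states).mean())
print("mean backward reward on backward states:", backward_reward(backward_states).mean())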
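
The hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch. Only the values come from the paper; the key names below are illustrative assumptions and do not necessarily match the authors' TF-Agents code.

```python
# Hyperparameters reported in the paper, gathered into one dict.
# Key names are assumptions; values are quoted from the Experiment Setup row above.
MEDAL_CONFIG = {
    "initial_collect_steps": 10_000,
    "rl_batch_size": 256,                # replay-buffer batch for policy/critic updates
    "collect_steps_per_iteration": 1,
    "train_steps_per_iteration": 1,
    "discount": 0.99,
    "learning_rate": 3e-4,               # critics, actors, and discriminator
    "actor_hidden_sizes": (256, 256),
    "critic_hidden_sizes": (256, 256),
    "discriminator_hidden_sizes": (128,),
    "discriminator_batch_size": 800,     # same for all environments
}
```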