Recurrent Model-Free RL Can Be a Strong Baseline for Many POMDPs

Authors: Tianwei Ni, Benjamin Eysenbach, Ruslan Salakhutdinov

ICML 2022

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments aim to answer two questions. First, how does a well-tuned implementation of recurrent model-free RL compare to specialized POMDP methods? To give these prior methods the strongest possible footing, we perform the comparison on the benchmarks used by these prior methods. Second, which design decisions are essential for recurrent model-free RL? We put the environment details in App. D.
Researcher Affiliation Academia Tianwei Ni 1, Benjamin Eysenbach 2, Ruslan Salakhutdinov 2. 1 Université de Montréal & Mila - Quebec AI Institute, 2 Carnegie Mellon University. Correspondence to: Tianwei Ni <tianwei.ni@mila.quebec>, Benjamin Eysenbach <beysenba@cs.cmu.edu>.
Pseudocode No The paper does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes We also release a simple and efficient implementation of recurrent model-free RL for future work to use as a baseline for POMDPs.
Open Datasets Yes We adopt the occlusion benchmark proposed by VRM, replace the deprecated roboschool with PyBullet (Coumans & Bai, 2016) as suggested by the official GitHub repository. We follow the practice in VRM (Han et al., 2020) in the other aspects of environment design, i.e. we remove all the position/angle-related entries in the observation space for -V environments and velocity-related entries for -P environments, to transform the original MDP into POMDP.
Dataset Splits No The paper mentions training and testing tasks and environments but does not provide explicit details on data splitting (e.g., percentages or counts for training, validation, and test sets).
Hardware Specification Yes The computer system we used during the experiments includes a GeForce RTX 2080 Ti graphics card (with 11GB memory) and Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz (with 250GB RAM and 80 cores).
Software Dependencies No The paper mentions software like PyBullet and Stable-Baselines3, but it does not specify version numbers for these or other core software dependencies (e.g., Python, PyTorch/TensorFlow) that are crucial for reproducibility.
Experiment Setup Yes Table 5: Hyperparameter summary in our implementation of model-free recurrent RL. For each benchmark, we report the hidden layer size of each module, and the RL and training hyperparameters. For meta-RL, we take the model on Cheetah-Vel as an example, which follows the architecture design of off-policy variBAD (Dorfman et al., 2020). The hidden size of the observation-action embedder is the sum of that of the observation embedder, previous action embedder (if it exists), and reward embedder (if it exists).
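The occlusion benchmark described above turns an MDP into a POMDP by hiding part of the observation vector (position/angle entries for -V environments, velocity entries for -P environments). A minimal sketch of such a wrapper is shown below; the class name, the gym-style `reset`/`step` interface, and the index choices are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np


class OcclusionWrapper:
    """Sketch of an observation-occlusion wrapper (hypothetical, gym-style).

    visible_idx lists the observation dimensions the agent may still see;
    hiding the complement (e.g. all position/angle entries for a -V task,
    or all velocity entries for a -P task) makes the task a POMDP.
    """

    def __init__(self, env, visible_idx):
        self.env = env
        self.visible_idx = np.asarray(visible_idx, dtype=int)

    def _mask(self, obs):
        # Keep only the entries the agent is allowed to observe.
        return np.asarray(obs)[self.visible_idx]

    def reset(self, **kwargs):
        return self._mask(self.env.reset(**kwargs))

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return self._mask(obs), reward, done, info
```

Because the agent no longer sees the full state, a recurrent policy must infer the hidden entries from the history of masked observations.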
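The embedder convention in Table 5 (the observation-action embedder's hidden size equals the sum of the observation, previous-action, and reward embedder sizes) follows from concatenating the per-input embeddings before the recurrent core. The toy NumPy sketch below illustrates that arithmetic; the dimensions (32 + 16 + 16 = 64) are made-up values, not the paper's hyperparameters, and the linear maps stand in for full embedder networks.

```python
import numpy as np

rng = np.random.default_rng(0)


def linear(in_dim, out_dim):
    """Toy linear 'embedder' (weights only; nonlinearity omitted)."""
    W = rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)
    return lambda x: x @ W


# Hypothetical sizes for illustration only.
obs_dim, act_dim = 17, 6
obs_embed = linear(obs_dim, 32)   # observation embedder
act_embed = linear(act_dim, 16)   # previous-action embedder
rew_embed = linear(1, 16)         # reward embedder


def embed_step(obs, prev_action, reward):
    """Concatenate the per-input embeddings; the joint embedding width is
    the sum of the individual widths (32 + 16 + 16 = 64 here), matching
    the convention stated for Table 5."""
    return np.concatenate([
        obs_embed(obs),
        act_embed(prev_action),
        rew_embed(np.array([reward])),
    ])


z = embed_step(rng.standard_normal(obs_dim),
               rng.standard_normal(act_dim), 0.5)
assert z.shape == (64,)
```

In the full architecture, this concatenated embedding would be fed to the recurrent module at each timestep; if the previous-action or reward embedder is absent, its width simply drops out of the sum.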