Learning to Act without Actions

Authors: Dominik Schmidt, Minqi Jiang

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments provide strong evidence that such latent policies accurately capture the observed expert's behavior, by showing that they can be efficiently fine-tuned into expert-level policies in the true action space. In our experiments in Section 6, we train a latent IDM via LAPO on large expert-level action-free offline datasets for each of the 16 games of the Procgen Benchmark (Cobbe et al., 2019; 2020). As can be seen in Figure 3, using LAPO's latent policy we are able to fully recover expert performance within only 4M frames, while PPO from scratch reaches only 44% of expert performance in the same period. We also provide results for two ablations.
Researcher Affiliation | Collaboration | Dominik Schmidt, Weco AI; Minqi Jiang, FAIR at Meta AI. Correspondence to dominik.schmidt.22@ucl.ac.uk.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. The methods are described textually and visually (e.g., Figure 2).
Open Source Code | Yes | Our code is available here: https://github.com/schmidtdominik/LAPO.
Open Datasets | Yes | Our experiments center on the Procgen Benchmark (Cobbe et al., 2020), as it features a wide variety of tasks that present different challenges for our method. Our observation-only dataset consists of approximately 8M frames sampled from an expert policy that was trained with PPO for 50M frames.
Dataset Splits | No | The paper reports the overall dataset size and the number of interaction frames used for training and fine-tuning, but it does not give percentages or counts for train/validation/test splits, either for the initial LAPO training data or for the offline decoding phase. For example, it mentions 'approximately 8M frames' and '4M steps' for PPO, but no explicit splits.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. It only implies that computations are performed: 'By interacting with the online environment...'
Software Dependencies | No | The paper mentions software components like IMPALA-CNN, U-Net, and PPO, but it does not specify exact version numbers for these or any other software libraries or frameworks. For example: 'PPO hyperparameters As in (Cobbe et al., 2020).'
Experiment Setup | Yes | We use the IMPALA-CNN (Espeholt et al., 2018) to implement both our policy and IDM with a 4x channel multiplier as used by Cobbe et al. (2020), and a U-Net (Ronneberger et al., 2015) based on a ResNet backbone (He et al., 2016) with approximately 8M parameters for the FDM. The latent action decoder is a fully-connected network with hidden sizes (128, 128). We use a single observation of additional pre-transition context, i.e. k = 1. We keep all convolutional layers frozen and found that a much larger learning rate of 0.01 can be stably used when only training these final few layers. Other hyperparameters are given in Appendix A.4. (Table 1 in Appendix A.4 lists specific hyperparameters like learning rate, batch size, VQ parameters, etc.)
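
Since the paper itself contains no pseudocode (see the Pseudocode row), the following is a minimal PyTorch sketch of the components named in the Experiment Setup row: an IDM that infers a latent action from consecutive observations, an FDM that reconstructs the next observation from the current observation and that latent action, and the fully connected latent action decoder with hidden sizes (128, 128). The network bodies, channel widths, latent dimensionality, and the omission of vector quantization and of the k = 1 pre-transition context are simplifying assumptions; this does not reproduce the authors' exact architecture.

```python
# Hedged sketch of the LAPO components described above. The encoder/decoder
# bodies are simple stand-ins (the paper uses an IMPALA-CNN for the IDM and a
# ResNet-backed U-Net for the FDM); latent_dim, num_actions, and the absence
# of a VQ bottleneck are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentIDM(nn.Module):
    """Inverse dynamics model: predicts a latent action z_t from (o_t, o_{t+1})."""
    def __init__(self, obs_channels=3, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(2 * obs_channels, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )

    def forward(self, obs, next_obs):
        return self.encoder(torch.cat([obs, next_obs], dim=1))


class FDM(nn.Module):
    """Forward dynamics model: predicts o_{t+1} from (o_t, z_t)."""
    def __init__(self, obs_channels=3, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(obs_channels + latent_dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, obs_channels, 3, padding=1),
        )

    def forward(self, obs, latent_action):
        # Broadcast the latent action over the spatial dimensions of the observation.
        b, _, h, w = obs.shape
        z = latent_action[:, :, None, None].expand(b, -1, h, w)
        return self.net(torch.cat([obs, z], dim=1))


class LatentActionDecoder(nn.Module):
    """Maps latent actions to true actions; fully connected, hidden sizes (128, 128)."""
    def __init__(self, latent_dim=64, num_actions=15):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, num_actions),
        )

    def forward(self, z):
        return self.net(z)


def lapo_reconstruction_loss(idm, fdm, obs, next_obs):
    """One action-free training step: infer z_t with the IDM, then ask the FDM
    to reconstruct o_{t+1}; the latent action is shaped purely by this objective."""
    z = idm(obs, next_obs)
    pred_next = fdm(obs, z)
    return F.mse_loss(pred_next, next_obs)


if __name__ == "__main__":
    idm, fdm = LatentIDM(), FDM()
    obs = torch.randn(8, 3, 64, 64)        # Procgen-sized frames (assumed shape)
    next_obs = torch.randn(8, 3, 64, 64)
    loss = lapo_reconstruction_loss(idm, fdm, obs, next_obs)
    loss.backward()
    print(f"reconstruction loss: {loss.item():.4f}")
```

In this reading, the decoder is trained afterwards on top of the (frozen) latent policy to map latent actions to true environment actions, which is where the quoted learning rate of 0.01 over the final non-convolutional layers would apply; consult the paper and the released code for the authors' actual training procedure.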