Reinforcement Learning from Passive Data via Latent Intentions

Authors: Dibya Ghosh, Chethan Anand Bhateja, Sergey Levine

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate the ability to learn useful features for downstream RL from a number of passive data sources, including learning from videos of different embodiments in the XMagical benchmark, and learning from raw YouTube videos for the offline Atari benchmark.
Researcher Affiliation | Academia | Dibya Ghosh, Chethan Bhateja, Sergey Levine (UC Berkeley). Correspondence to: Dibya Ghosh <dibya@berkeley.edu>.
Pseudocode | Yes | Algorithm 1: Learning Intent-Conditioned Value Functions from Passive Data
Open Source Code | Yes | Accompanying code at https://github.com/dibyaghosh/icvf_release.
Open Datasets | Yes | To encompass the many forms of passive data that we may wish to pre-train on, we evaluate on passive data from the D4RL benchmark (Fu et al., 2020), videos of agents with different embodiments in the XMagical benchmark (Toyer et al., 2020; Zakka et al., 2021), and scraped YouTube videos of Atari 2600 games (Bellemare et al., 2013). ... During pre-training, the agent receives a passive dataset of state-observation sequences of the agent moving to different locations in a maze (1×10^6 frames), which we construct by stripping all annotations from the publicly available dataset. ... We build on the dataset of video released by Zakka et al. (2021)
Dataset Splits | No | The paper describes training and testing procedures and their dataset sizes, but does not explicitly mention a separate validation split with specific percentages or counts. The term 'finetuning' is used for downstream tasks without specifying a validation split for hyperparameter tuning during that phase.
Hardware Specification | Yes | The research was supported by the TPU Research Cloud.
Software Dependencies | No | The paper mentions specific algorithms and models used (e.g., IQL, CQL, QR-DQN, the Impala visual encoder) but does not provide version numbers for any software libraries, frameworks, or programming languages used for implementation.
Experiment Setup | Yes | We perform a coarse hyperparameter sweep α ∈ {1, 10, 100} for each representation and domain to choose α, since the scales of the downstream losses can vary between domains, and the scales of the representation loss can vary between different methods. For our method, we set the expectile parameter to α = 0.9 for all tasks, and train using temporal difference learning with a target network lagging via Polyak averaging with rate λ = 0.005. ... Pre-training proceeds for 250k timesteps for all methods... running for 1M steps
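The setup quote above names two standard mechanisms: an asymmetric (expectile) loss on the TD error, and a target network that slowly tracks the online network via Polyak averaging. The following is a minimal illustrative sketch of those two mechanisms only, not the authors' implementation; the function names and the dict-of-floats parameter representation are hypothetical simplifications.

```python
def expectile_loss(td_error, alpha=0.9):
    # Asymmetric squared loss: positive TD errors are weighted by alpha,
    # negative ones by (1 - alpha). With alpha = 0.9 (as in the quote),
    # the regression target skews toward an upper expectile of the return.
    weight = alpha if td_error > 0 else 1.0 - alpha
    return weight * td_error ** 2

def polyak_update(target_params, online_params, rate=0.005):
    # Target network lags the online network:
    # theta_target <- (1 - rate) * theta_target + rate * theta_online,
    # with rate = lambda = 0.005 as stated in the paper's quote.
    return {k: (1.0 - rate) * target_params[k] + rate * online_params[k]
            for k in target_params}

# Example: a +2 TD error is penalized 9x more than a -2 TD error at alpha=0.9.
up = expectile_loss(2.0)    # 0.9 * 4 = 3.6
down = expectile_loss(-2.0)  # 0.1 * 4 = 0.4
```

With rate λ = 0.005, the target parameters move only 0.5% of the way toward the online parameters per update, which is what keeps the bootstrapping targets stable during TD learning.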