GenRL: Multimodal-foundation world models for generalization in embodied agents

Authors: Pietro Mazzaglia, Tim Verbelen, Bart Dhoedt, Aaron C. Courville, Sai Rajeswar Mudumba

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | As assessed through large-scale multi-task benchmarking in locomotion and manipulation domains, GenRL enables multi-task generalization from language and visual prompts.
Experiments | 4 | Overall, we employ a set of 4 locomotion environments (Walker, Cheetah, Quadruped, and a newly introduced Stickman environment) [54] and one manipulation environment (Kitchen) [22], for a total of 35 tasks where the agent is trained without rewards, using only visual or language prompts.
Researcher Affiliation | Collaboration | Pietro Mazzaglia (IDLab, Ghent University); Tim Verbelen (VERSES AI Research Lab); Bart Dhoedt (IDLab, Ghent University); Aaron Courville (Mila, University of Montreal); Sai Rajeswar (ServiceNow Research)
Pseudocode | No | The paper describes algorithms verbally and with equations but does not include a formally labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | Yes | Website, code and data: mazpie.github.io/genrl
Open Datasets | Yes | Overall, we employ a set of 4 locomotion environments (Walker, Cheetah, Quadruped, and a newly introduced Stickman environment) [54] and one manipulation environment (Kitchen) [22], for a total of 35 tasks where the agent is trained without rewards, using only visual or language prompts.
Dataset Splits | No | The paper discusses training on large datasets and evaluating on a set number of episodes, and distinguishes between 'in-distribution' and 'generalization' tasks, but does not explicitly provide numerical train/validation/test dataset splits.
Hardware Specification | Yes | We use a cluster of V100 GPUs with 16GB of VRAM for all our experiments.
Software Dependencies | No | The paper mentions models and architectures such as 'DreamerV3', 'InternVideo2', and 'SigLIP-B', and adapters like the 'DrQ-v2 encoder', but does not provide specific version numbers for software dependencies (e.g., programming languages, libraries, or frameworks).
Experiment Setup | Yes | Hyperparameters. For the hyperparameters, we follow DreamerV3 [26] (version 1 of the paper, dated January 2023). Differences from the default hyperparameters or model size choices are illustrated in Table 2. Table 2: World model and actor-critic hyperparameters. Multimodal Foundation World Model: batch size 48, sequence length 48, GRU recurrent units 1024, CNN multiplier 48, dense hidden units 1024, MLP layers 4. Actor-Critic: batch size 32, sequence length 32.
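For quick reference, the Table 2 settings quoted above can be collected into a plain Python dictionary. This is only a minimal sketch of the reported values; the dictionary and key names are illustrative assumptions, not the authors' actual configuration format.

```python
# Hypothetical summary of the Table 2 hyperparameters reported in the paper.
# Names are illustrative; only the numeric values come from the quoted text.
GENRL_HYPERPARAMETERS = {
    "multimodal_foundation_world_model": {
        "batch_size": 48,
        "sequence_length": 48,
        "gru_recurrent_units": 1024,
        "cnn_multiplier": 48,
        "dense_hidden_units": 1024,
        "mlp_layers": 4,
    },
    "actor_critic": {
        "batch_size": 32,
        "sequence_length": 32,
    },
}

if __name__ == "__main__":
    # Print the settings grouped by component, mirroring the Table 2 layout.
    for component, params in GENRL_HYPERPARAMETERS.items():
        print(component)
        for name, value in params.items():
            print(f"  {name}: {value}")
```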