GenRL: Multimodal-foundation world models for generalization in embodied agents
Authors: Pietro Mazzaglia, Tim Verbelen, Bart Dhoedt, Aaron C. Courville, Sai Rajeswar Mudumba
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | As assessed through large-scale multi-task benchmarking in locomotion and manipulation domains, GenRL enables multi-task generalization from language and visual prompts. Overall, we employ a set of 4 locomotion environments (Walker, Cheetah, Quadruped, and a newly introduced Stickman environment) [54] and one manipulation environment (Kitchen) [22], for a total of 35 tasks where the agent is trained without rewards, using only visual or language prompts. |
| Researcher Affiliation | Collaboration | Pietro Mazzaglia, IDLab, Ghent University; Tim Verbelen, VERSES AI Research Lab; Bart Dhoedt, IDLab, Ghent University; Aaron Courville, Mila, University of Montreal; Sai Rajeswar, ServiceNow Research |
| Pseudocode | No | The paper describes algorithms verbally and with equations but does not include a formally labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | Yes | Website, code and data: mazpie.github.io/genrl |
| Open Datasets | Yes | Overall, we employ a set of 4 locomotion environments (Walker, Cheetah, Quadruped, and a newly introduced Stickman environment) [54] and one manipulation environment (Kitchen) [22], for a total of 35 tasks where the agent is trained without rewards, using only visual or language prompts. |
| Dataset Splits | No | The paper discusses training on large datasets and evaluating on a set number of episodes, and distinguishes between 'in-distribution' and 'generalization' tasks, but does not explicitly provide numerical train/validation/test dataset splits. |
| Hardware Specification | Yes | We use a cluster of V100 GPUs with 16GB of VRAM for all our experiments. |
| Software Dependencies | No | The paper mentions models and architectures such as 'DreamerV3', 'InternVideo2', and 'SigLIP-B', and adapters such as the 'DrQ-v2 encoder', but does not provide specific version numbers for software dependencies (e.g., programming languages, libraries, or frameworks with their versions). |
| Experiment Setup | Yes | Hyperparameters. For the hyperparameters, we follow DreamerV3 [26] (version 1 of the paper, dated January 2023). Differences from the default hyperparameters or model size choices are illustrated in Table 2. Table 2 (World model and actor-critic hyperparameters): Multimodal Foundation World Model: batch size 48, sequence length 48, GRU recurrent units 1024, CNN multiplier 48, dense hidden units 1024, MLP layers 4. Actor-Critic: batch size 32, sequence length 32. (See the configuration sketch below this table.) |
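
For concreteness, the hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration object. The following is a minimal sketch assuming a plain-Python dictionary; the key names (`world_model`, `gru_recurrent_units`, `cnn_multiplier`, etc.) are illustrative and are not taken from the authors' released code.

```python
# Hedged sketch: the GenRL hyperparameters quoted above, organized as a plain
# Python config. Key names are illustrative, not the authors' actual identifiers.
genrl_config = {
    "world_model": {            # Multimodal Foundation World Model
        "batch_size": 48,
        "sequence_length": 48,
        "gru_recurrent_units": 1024,
        "cnn_multiplier": 48,
        "dense_hidden_units": 1024,
        "mlp_layers": 4,
    },
    "actor_critic": {
        "batch_size": 32,
        "sequence_length": 32,
    },
}

if __name__ == "__main__":
    # Print the configuration in a readable component -> parameter layout.
    for component, params in genrl_config.items():
        print(component)
        for name, value in params.items():
            print(f"  {name}: {value}")
```

Any hyperparameter not listed in Table 2 should be taken from the DreamerV3 defaults that the paper says it follows.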