Learning to Model the World With Language

Authors: Jessy Lin, Yuqing Du, Olivia Watkins, Danijar Hafner, Pieter Abbeel, Dan Klein, Anca Dragan

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental (4 experiments) | Our experiments test the following hypotheses: H1) Aligning image and language as single (image, token) pairs per timestep outperforms other methods for incorporating language into Dreamer V3 (Section 4.1). H2) Dynalang can better utilize diverse types of language to improve task performance over language-conditioned policies.
Researcher Affiliation | Academia | Jessy Lin¹, Yuqing Du¹, Olivia Watkins¹, Danijar Hafner¹, Pieter Abbeel¹, Dan Klein¹, Anca Dragan¹ (¹UC Berkeley). Website: dynalang.github.io
Pseudocode | Yes | Algorithm 1: Dynalang (a Python sketch of this loop structure follows the table).
    Define rewards r_t, episode continue flag c_t, images x_t, language tokens l_t, actions a_t, model state (h_t, z_t).
    while acting do
        Step environment: r_t, c_t, x_t, l_t ← env(a_{t-1}).
        Encode observations: z_t ∼ enc(x_t, l_t, h_t).
        Execute action: a_t ∼ π(a_t | h_t, z_t).
        Add transition (r_t, c_t, x_t, l_t, a_t) to replay buffer.
    while training do
        Draw batch {(r_t, c_t, x_t, l_t, a_t)} from replay buffer.
        Use world model to compute multimodal representations z_t, future predictions ẑ_{t+1}, and decode x̂_t, l̂_t, r̂_t, ĉ_t.
        Update world model to minimize L_pred + L_repr.
        Imagine rollouts from all z_t using π.
        Update actor to minimize L_π.
        Update critic to minimize L_V.
    while text pretraining do
        Sample text batch {l_t} from dataset.
        Create zero images x_t and actions a_t.
        Use world model to compute representations z_t, future predictions ẑ_{t+1}, and decode l̂_t.
        Update world model to minimize L_pred + L_l.
Open Source Code | No | The paper provides a website link (dynalang.github.io) in the author affiliations, but does not contain an explicit statement that the code for the described methodology is publicly available there or elsewhere.
Open Datasets | Yes | On the Messenger benchmark (Hanjie et al., 2021), we show that Dynalang can read game manuals... In vision-language navigation (Krantz et al., 2020), we show that Dynalang can also follow instructions... Agents must navigate Matterport3D panoramas captured in real homes (Chang et al., 2017)... We evaluate this capability... with TinyStories (Eldan & Li, 2023), a dataset of 2M short stories (∼500M tokens) generated by GPT-4. (A dataset-loading sketch follows the table.)
Dataset Splits | No | The paper mentions using a 'training dataset' for VLN-CE and refers to train and test phases implicitly through its experiments, but does not provide explicit training/validation/test split details (e.g., percentages, sample counts, or specific split names) needed for reproducibility.
Hardware Specification | Yes | All models were trained on NVIDIA A100 GPUs.
Software Dependencies | No | The paper mentions the use of the T5 tokenizer, T5-small, Seed RL repository implementations, RSSM, and GRU, but does not provide specific version numbers for these software components. (A loading sketch with assumed tooling follows the table.)
Experiment Setup | Yes | Table I.3: Dynalang hyperparameters and training information for each environment. We use the default model hyperparameters for the XL Dreamer V3 model unless otherwise specified below.
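
To make the control flow of Algorithm 1 (Pseudocode row above) concrete, the following is a minimal sketch of the acting, training, and text-pretraining loops. It is not the authors' implementation: ToyEnv, WorldModel, and ActorCritic are hypothetical stand-ins with random outputs and placeholder losses, and method names such as encode, imagine, act, and train_step are assumptions made for illustration.

    import numpy as np

    class ToyEnv:
        """Hypothetical environment returning (reward, continue flag, image, language token)."""
        def step(self, action):
            reward = float(np.random.randn())
            cont = 1.0
            image = np.zeros((64, 64, 3), dtype=np.float32)
            token = int(np.random.randint(0, 100))   # one language token per timestep
            return reward, cont, image, token

    class WorldModel:
        """Hypothetical multimodal world model standing in for the paper's RSSM."""
        def encode(self, image, token, h):
            return np.random.randn(32)               # latent state z_t
        def train_step(self, batch):
            return {"L_pred": 0.0, "L_repr": 0.0}    # placeholder losses
        def imagine(self, z_batch, policy, horizon=15):
            return [np.random.randn(len(z_batch), 32) for _ in range(horizon)]

    class ActorCritic:
        """Hypothetical policy and value heads trained on imagined rollouts."""
        def act(self, h, z):
            return int(np.random.randint(0, 4))
        def train_step(self, rollouts):
            return {"L_pi": 0.0, "L_V": 0.0}

    env, wm, ac = ToyEnv(), WorldModel(), ActorCritic()
    replay, h, action = [], np.zeros(32), 0

    # Acting: collect one (image, token) pair per timestep into the replay buffer.
    for t in range(100):
        r, c, x, l = env.step(action)
        z = wm.encode(x, l, h)
        action = ac.act(h, z)
        replay.append((r, c, x, l, action))

    # Training: update the world model, then the actor-critic on imagined rollouts.
    for step in range(10):
        idx = np.random.choice(len(replay), size=16)
        batch = [replay[i] for i in idx]
        wm.train_step(batch)                         # minimize L_pred + L_repr
        z_batch = np.stack([wm.encode(x, l, h) for (_, _, x, l, _) in batch])
        rollouts = wm.imagine(z_batch, ac)
        ac.train_step(rollouts)                      # minimize L_pi and L_V

    # Text pretraining: language-only batches with zeroed images and actions.
    text_dataset = [np.random.randint(0, 100, size=64) for _ in range(8)]
    for tokens in text_dataset:
        batch = [(0.0, 1.0, np.zeros((64, 64, 3), np.float32), int(tok), 0) for tok in tokens]
        wm.train_step(batch)                         # minimize L_pred + L_l (decode tokens only)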
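
For the text-pretraining corpus listed in the Open Datasets row, the sketch below shows one way to obtain TinyStories. The Hugging Face datasets library and the dataset identifier "roneneldan/TinyStories" are assumptions about tooling; the paper does not state how the corpus was downloaded or preprocessed.

    from datasets import load_dataset

    # Assumed source for TinyStories (Eldan & Li, 2023); not specified in the paper.
    stories = load_dataset("roneneldan/TinyStories", split="train")
    print(len(stories))                  # roughly 2M short stories
    print(stories[0]["text"][:200])      # first 200 characters of the first story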
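
Because the Software Dependencies row notes that no versions are reported, a reproduction has to pick its own tooling for the language components. The sketch below loads the T5 tokenizer and T5-small via Hugging Face transformers; the library choice, the "t5-small" checkpoint identifier, and any pinned versions are assumptions, not choices stated in the paper.

    # Assumed toolchain: Hugging Face transformers (plus torch and sentencepiece).
    from transformers import T5Tokenizer, T5EncoderModel

    tokenizer = T5Tokenizer.from_pretrained("t5-small")   # vocabulary for tokenizing language observations
    encoder = T5EncoderModel.from_pretrained("t5-small")  # pretrained embeddings for the T5-embedding variant

    ids = tokenizer("pick up the key and open the door", return_tensors="pt").input_ids
    embeddings = encoder(ids).last_hidden_state           # shape (1, num_tokens, 512) for t5-small
    print(ids.shape, embeddings.shape)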