Learning to Model the World With Language
Authors: Jessy Lin, Yuqing Du, Olivia Watkins, Danijar Hafner, Pieter Abbeel, Dan Klein, Anca Dragan
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments test the following hypotheses: H1) Aligning image and language as single (image, token) pairs per timestep outperforms other methods for incorporating language into DreamerV3 (Section 4.1). H2) Dynalang can better utilize diverse types of language to improve task performance over language-conditioned policies. |
| Researcher Affiliation | Academia | Jessy Lin 1 Yuqing Du 1 Olivia Watkins 1 Danijar Hafner 1 Pieter Abbeel 1 Dan Klein 1 Anca Dragan 1 1UC Berkeley. Website: dynalang.github.io |
| Pseudocode | Yes | Algorithm 1 (Dynalang). Define rewards r_t, episode continue flag c_t, images x_t, language tokens l_t, actions a_t, model state (h_t, z_t). While acting: step environment, (r_t, c_t, x_t, l_t) ← env(a_{t-1}); encode observations, z_t ← enc(x_t, l_t, h_t); execute action a_t ∼ π(a_t \| h_t, z_t); add transition (r_t, c_t, x_t, l_t, a_t) to replay buffer. While training: draw batch {(r_t, c_t, x_t, l_t, a_t)} from replay buffer; use world model to compute multimodal representations z_t and future predictions ẑ_{t+1}, and decode x̂_t, l̂_t, r̂_t, ĉ_t; update world model to minimize L_pred + L_repr; imagine rollouts from all z_t using π; update actor to minimize L_π; update critic to minimize L_V. While text pretraining: sample text batch {l_t} from dataset; create zero images x_t and actions a_t; use world model to compute representations z_t and future predictions ẑ_{t+1}, and decode l̂_t; update world model to minimize L_pred + L_l. (A hedged Python sketch of these loops appears below the table.) |
| Open Source Code | No | The paper provides a website link (dynalang.github.io) in the author affiliations, but does not contain an explicit statement that the code for the described methodology is publicly available there or elsewhere. |
| Open Datasets | Yes | On the Messenger benchmark (Hanjie et al., 2021), we show that Dynalang can read game manuals... In vision-language navigation (Krantz et al., 2020), we show that Dynalang can also follow instructions... Agents must navigate Matterport3D panoramas captured in real homes (Chang et al., 2017)... We evaluate this capability... with TinyStories (Eldan & Li, 2023), a dataset of 2M short stories (∼500M tokens) generated by GPT-4. |
| Dataset Splits | No | The paper mentions a 'training dataset' for VLN-CE and implicitly distinguishes train and test sets through its experiments, but does not provide explicit training/validation/test split details (e.g., percentages, sample counts, or named splits) needed for reproducibility. |
| Hardware Specification | Yes | All models were trained on NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions the use of the T5 tokenizer, T5-small, Seed RL repository implementations, RSSM, and GRU, but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | Table I.3. Dynalang hyperparameters and training information for each environment. We use the default model hyperparameters for the XL DreamerV3 model unless otherwise specified below. |
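For readers who find the algorithm row easier to parse as code, below is a minimal Python sketch of Algorithm 1's three loops (acting, training, and text pretraining). All class and method names (`Env`, `WorldModel.encode`, `ActorCritic.train_step`, etc.) are hypothetical stand-ins rather than the authors' implementation, and the losses L_pred, L_repr, L_π, and L_V are only named in comments, since the paper defines them on top of DreamerV3's components.

```python
import random
from collections import deque

class Env:
    """Toy stand-in environment returning (r_t, c_t, x_t, l_t)."""
    def step(self, action):
        reward, cont = 0.0, True
        image, token = [0.0], 0      # placeholder image and one language token
        return reward, cont, image, token

class WorldModel:
    """Stub for the multimodal world model (an RSSM in the paper)."""
    def encode(self, image, token, h):
        return 0.0                   # would fuse image + token into z_t
    def train_step(self, batch):
        # Would compute z_t, predict ẑ_{t+1}, decode x̂_t, l̂_t, r̂_t, ĉ_t,
        # and minimize L_pred + L_repr.
        pass
    def text_pretrain_step(self, tokens):
        # Would use zero images/actions, decode l̂_t, minimize L_pred + L_l.
        pass

class ActorCritic:
    """Stub actor-critic trained on imagined rollouts."""
    def act(self, h, z):
        return 0                     # would sample a_t ~ pi(a_t | h_t, z_t)
    def train_step(self, world_model, starts):
        # Would imagine rollouts from z_t with pi, then minimize
        # L_pi (actor) and L_V (critic).
        pass

env, wm, ac = Env(), WorldModel(), ActorCritic()
replay = deque(maxlen=10_000)
h, action = None, 0                  # recurrent state h_t and previous action

for t in range(100):                             # --- acting loop ---
    r, c, x, l = env.step(action)                # step env with a_{t-1}
    z = wm.encode(x, l, h)                       # encode observations
    action = ac.act(h, z)                        # pick a_t
    replay.append((r, c, x, l, action))          # store transition

for _ in range(10):                              # --- training loop ---
    batch = random.sample(list(replay), k=16)
    wm.train_step(batch)                         # world-model update
    ac.train_step(wm, starts=batch)              # actor and critic updates

text_dataset = [[3, 1, 4], [1, 5, 9]]            # e.g. TinyStories token ids
for tokens in text_dataset:                      # --- text pretraining ---
    wm.text_pretrain_step(tokens)
```

The structural point the sketch preserves is that one world model consumes (image, token) pairs during acting and training, and token-only batches with zeroed images and actions during text pretraining.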