Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

World Model on Million-Length Video And Language With Blockwise RingAttention

Authors: Hao Liu, Wilson Yan, Matei Zaharia, Pieter Abbeel

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | (a) We train one of the largest context size transformers to date on long text documents and videos and achieved competitive results on long video understanding and long context fact retrieval. (b) We discover a range of challenges associated with training on long sequences and propose solutions for them: masked sequence packing to effectively train with different sequence lengths and synthetic model-generated question-answering for effective attention. (c) We provide an open-source and optimized implementation for training with millions of tokens in context, as well as a family of Llama-based 1M context models capable of processing long documents (LWM-Text, LWM-Text-Chat) and videos (LWM, LWM-Chat).
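The row above cites masked sequence packing as one of the paper's proposed solutions. As a minimal illustrative sketch (not the paper's implementation; function names and shapes here are invented for clarity), packing concatenates variable-length sequences into one fixed-length row and builds a block-diagonal causal mask so attention never crosses sequence boundaries:

```python
import numpy as np

def pack_sequences(seqs, max_len, pad_id=0):
    """Pack variable-length token sequences into one fixed-length row.

    Returns packed tokens, per-token segment ids (0 = padding), and a
    block-diagonal causal attention mask that keeps attention from
    crossing sequence boundaries -- the core idea of masked packing.
    """
    tokens = np.full(max_len, pad_id, dtype=np.int32)
    segment_ids = np.zeros(max_len, dtype=np.int32)
    pos = 0
    for seg, seq in enumerate(seqs, start=1):
        n = len(seq)
        if pos + n > max_len:
            break  # a real packer would start a new packed row here
        tokens[pos:pos + n] = seq
        segment_ids[pos:pos + n] = seg
        pos += n
    # a position may attend only within its own non-padding segment,
    # and only to earlier or equal positions (causal)
    same_seg = segment_ids[:, None] == segment_ids[None, :]
    not_pad = (segment_ids != 0)[:, None] & (segment_ids != 0)[None, :]
    causal = np.tril(np.ones((max_len, max_len), dtype=bool))
    attn_mask = same_seg & not_pad & causal
    return tokens, segment_ids, attn_mask
```

With two sequences `[5, 6, 7]` and `[8, 9]` packed into a row of length 8, token 4 (in the second sequence) can attend to token 3 but not to token 2, which belongs to the first sequence.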
Researcher Affiliation | Academia | Hao Liu, Wilson Yan, Matei Zaharia, Pieter Abbeel (UC Berkeley). Equal contribution. Email: EMAIL, EMAIL
Pseudocode | No | The paper describes methods and architectural designs using figures (e.g., Figure 3 for model architecture, Figure 4 for the training process) but does not contain explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Code and models of Large World Model (LWM) are available at https://largeworldmodel.github.io/lwm/.
Open Datasets | Yes | We curate an extensive dataset of long-form videos and books from public sources... For each stage, we train on different filtered versions of the Books3 dataset from The Pile (Gao et al., 2020)... For chat fine-tuning, we train each model on a mix of the UltraChat conversation dataset (Ding et al., 2023) and our custom question-answering dataset... We train on a large text-image dataset comprising a mix of LAION-2B-en (Schuhmann et al., 2022) and COYO-700M (Byeon et al., 2022)... We train on a text-video dataset mix of WebVid10M (Bain et al., 2021) and 3M InternVid10M (Wang et al., 2023) examples... we additionally mix 16% of the batch to be pure text data from OpenLLaMA (Geng and Liu, 2023).
Dataset Splits | No | The paper describes how training data is curated and mixed (e.g., UltraChat and custom QA data at a 7:3 ratio) and used in progressive training stages, but it does not specify explicit train/validation/test splits for its own experiments or for newly generated data. Evaluation is performed on established benchmarks, which typically have their own predefined splits.
Hardware Specification | Yes | We trained our models using TPUv4-1024, which is approximately equivalent to 450 A100s... Compute (TPU): v4-512... Compute (TPU): v4-1024... Inference for such long sequences requires a minimum of v4-128 with a TPU mesh sharding of 32-way tensor parallelism and 4-way sequence parallelism (ring dimension).
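The quoted inference requirement factors a 128-device slice into a 32 x 4 device mesh, with the 4-way axis serving as the ring dimension of Blockwise RingAttention. The arithmetic can be sketched in plain NumPy (a toy illustration of the mesh layout, not the paper's JAX sharding code; function names are invented here):

```python
import numpy as np

def build_mesh(n_devices, tensor_parallel, seq_parallel):
    """Arrange device ids into a (ring, tensor) grid, mirroring the
    quoted v4-128 mesh: 32-way tensor parallelism x 4-way sequence
    parallelism (the ring dimension)."""
    assert n_devices == tensor_parallel * seq_parallel
    return np.arange(n_devices).reshape(seq_parallel, tensor_parallel)

def ring_neighbors(mesh, ring_idx, tensor_idx):
    """Previous/next device along the ring (sequence) axis; in ring
    attention, key/value blocks rotate between these neighbors while
    each device computes blockwise attention locally."""
    n_ring = mesh.shape[0]
    prev_dev = mesh[(ring_idx - 1) % n_ring, tensor_idx]
    next_dev = mesh[(ring_idx + 1) % n_ring, tensor_idx]
    return int(prev_dev), int(next_dev)
```

For example, on the 4 x 32 mesh, device 0 sits on a 4-device ring with devices 32, 64, and 96 (its column of the grid), so each ring step hands key/value blocks to the next row.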
Software Dependencies | No | The paper mentions several tools and methods such as Blockwise RingAttention (Liu et al., 2024; Liu and Abbeel, 2023), FlashAttention (Dao et al., 2022) implemented using Pallas (Bradbury et al., 2018), VQGAN (Esser et al., 2021), aMUSEd (Patil et al., 2024), and Llama 2 7B (Touvron et al., 2023b), but it does not provide specific version numbers for general ancillary software such as programming languages, libraries (e.g., PyTorch, TensorFlow), or CUDA.
Experiment Setup | Yes | Table 6 details information about each training stage, such as the number of tokens, total time, and the Books3 dataset filtering constraints. Each successive run is initialized from the prior sequence length. Table 7 provides further training details for each run. Table 8 shows details for each training stage... Appendix J contains Tables 12, 13, and 14, which provide extensive training hyperparameters including Parameters, Precision, Sequence Length, RoPE θ, Tokens per Batch, Total Tokens, Total Steps, LR Schedule, LR Warmup Steps, Max LR, Min LR, and Mesh Sharding.
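The hyperparameter fields listed above (LR Schedule, LR Warmup Steps, Max LR, Min LR) fit the common linear-warmup-plus-cosine-decay pattern. The sketch below assumes that schedule purely for illustration; the schedules actually used are specified in the paper's Appendix J tables:

```python
import math

def lr_at_step(step, warmup_steps, total_steps, max_lr, min_lr):
    """Linear warmup from 0 to max_lr, then cosine decay to min_lr.

    An assumed, generic schedule matching the fields named in the
    hyperparameter tables -- not necessarily the paper's exact curve.
    """
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

With `warmup_steps=100`, `total_steps=1000`, `max_lr=1e-3`, and `min_lr=1e-5`, the rate rises linearly to 1e-3 at step 100 and decays smoothly to 1e-5 by step 1000.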