Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

The Matrix: Infinite-Horizon World Generation with Real-Time Moving Control

Authors: Ruili Feng, Han Zhang, Zhilei Shu, Zhantao Yang, Longxiang Tang, Zhicai Wang, Andy Zheng, Jie Xiao, Zhiheng Liu, Ruihang Chu, Yukun Huang, Yu Liu, Hongyang Zhang

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate performance using metrics for both general visual quality and movement control precision. For general visual quality, we use Fréchet Inception Distance (FID) [33], Fréchet Video Distance (FVD) [34], and CLIP Score [35] to assess text alignment. All metrics are evaluated on 2,048 seconds of randomly generated videos. To evaluate movement control precision, we generate 2,048 seconds of video based on keyboard inputs and text prompts from a fixed test set, then measure the Peak Signal-to-Noise Ratio (Move-PSNR) [36] and Learned Perceptual Image Patch Similarity (Move-LPIPS) [37] between the generated videos and real videos with ground truth movements. In this section, we evaluate the effectiveness of the Interactive Module by testing its performance in three distinct scenarios: the Forza Horizon 5 car driving scenario, the Cyberpunk 2077 city walking scenario, and a robotic arm task from the DROID dataset [38].
Researcher Affiliation Collaboration Ruili Feng1,5*, Han Zhang1,5*, Zhilei Shu1,5*, Zhantao Yang1,5*, Longxiang Tang1,5*, Zhicai Wang1,5, Andy Zheng3,5, Jie Xiao1,5, Zhiheng Liu1,5, Ruihang Chu1, Yukun Huang2,5, Yu Liu1, Hongyang Zhang3,4 1Tongyi Lab, 2University of Hong Kong, 3University of Waterloo, 4Vector Institute, 5Matrix Team
Pseudocode Yes Algorithm 1 Control Signal Balancing Algorithm
Open Source Code Yes See https://github.com/MatrixTeam-AI/matrix, https://matrixteamai.github.io/pages/TheMatrix/ for code, data, and project page.
Open Datasets Yes The third scenario is specifically designed to assess the effectiveness of The Matrix in embodied AI tasks. For all scenarios, we follow the same training strategy: starting with a pre-trained DiT model, we first perform a warm-up using unlabeled data, followed by fine-tuning the Interactive Module with labeled data. We select 50,000 6-second clips from the DROID dataset, along with per-frame action labels of joint angles for seven joints, to form the training dataset. More details can be found in Appendix Section B.3.
Dataset Splits No The paper mentions training on labeled and unlabeled data, a
Hardware Specification Yes The Matrix is trained on 32x A100 GPUs in one week. All inference processes are conducted on 8x A100 GPUs.
Software Dependencies No The paper mentions several software tools, including Cheat Engine [29], the Reshade plugin [30], OBS recording software [31], FFmpeg [48], InternVL [49], and Grounding DINOv2 [50], but does not provide version numbers for these or for core components of the machine learning stack such as Python, PyTorch, or CUDA.
Experiment Setup Yes All training procedures were executed with an overall batch size of 32 and a learning rate of 1e-5. Mixed-precision training was employed using bfloat16 to enhance computational efficiency. During preprocessing, all video inputs were resized to a resolution of 1280x720 pixels and set to 16 FPS. For sequences exceeding 25,200 frames in length, we used the DeepSpeed Ulysses sequence parallelism strategy [46], distributing the sequence across 8 GPUs to manage memory and computational demands effectively. In the initial warm-up stage, we fine-tuned all linear layers of the base DiT model using Low-Rank Adaptation (LoRA) to tailor the model to the source data distribution [22]. The LoRA rank was set to 128, and the model was trained for 20,000 steps. This adaptation ensures that the model parameters are suitably adjusted to the characteristics of the unlabeled source dataset before advancing to subsequent training phases. The second stage focuses on training the Interactive Module... This stage was conducted over 20,000 training steps... The third stage involves comprehensive fine-tuning of all model parameters... This extensive fine-tuning was carried out over 60,000 steps... In the final stage, consistency model distillation... we employed a one-stage guided distillation technique [47], incorporating Classifier-Free Guidance (CFG) into the student model. For the Ordinary Differential Equation (ODE) solver within the consistency distillation framework, we utilized the Euler solver with a single-step size of 25/1000. This distillation process was conducted over 10,000 training steps.
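
The hyperparameters quoted in the Experiment Setup row can be gathered into a small Python sketch to make the overall training budget easier to read. This is only an illustrative summary, not the authors' configuration: the stage labels, the `Stage` class, and the helper functions are hypothetical names. The numbers are copied from the row above. The sketch assumes each step consumes one full batch of 32 and that 6-second clips are resampled to the stated 16 FPS.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class Stage:
    name: str   # hypothetical label, not taken from the paper
    steps: int  # number of training steps reported for the stage


# Step counts as quoted in the Experiment Setup row.
STAGES = (
    Stage("lora_warmup", 20_000),
    Stage("interactive_module", 20_000),
    Stage("full_finetune", 60_000),
    Stage("consistency_distillation", 10_000),
)

GLOBAL_BATCH_SIZE = 32            # "overall batch size of 32"
LEARNING_RATE = 1e-5              # shared by all training procedures
LORA_RANK = 128                   # warm-up stage only
FPS = 16                          # preprocessing frame rate
EULER_STEP = Fraction(25, 1000)   # reported Euler solver step size


def total_steps(stages=STAGES) -> int:
    """Sum the reported step counts over all stages."""
    return sum(stage.steps for stage in stages)


def samples_seen(stages=STAGES, batch_size=GLOBAL_BATCH_SIZE) -> int:
    """Samples processed, assuming every step uses one full batch."""
    return total_steps(stages) * batch_size


def frames_in_clip(seconds: int, fps: int = FPS) -> int:
    """Frames in a clip of the given length at the preprocessing rate."""
    return seconds * fps


if __name__ == "__main__":
    print("total steps:", total_steps())
    print("samples seen:", samples_seen())
    print("frames per 6 s clip:", frames_in_clip(6))
    print("Euler step as a reduced fraction:", EULER_STEP)
```

Under these assumptions the four stages add up to 110,000 steps, and a 6-second clip at 16 FPS contains 96 frames. The step size 25/1000 reduces to 1/40; whether this corresponds to 40 solver intervals over the full diffusion trajectory depends on how the authors parameterize time, which the quoted text does not state.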