Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
4DGT: Learning a 4D Gaussian Transformer Using Real-World Monocular Videos
Authors: Zhen Xu, Zhengqin Li, Zhao Dong, Xiaowei Zhou, Richard Newcombe, Zhaoyang Lv
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations across datasets and domains show that 4DGT achieves comparable reconstruction quality to optimization-based methods while being three orders of magnitude faster, making it practical for long video reconstruction. Compared to prior methods that only train on synthetic object-level data [43], 4DGT generalizes better to complex real-world dynamics. Compared to the per-frame prediction pipeline [25], it also exhibits emergent motion properties. In summary, we make the following technical contributions: We introduce 4DGT, a novel 4DGS transformer trained on posed monocular videos at scale, which produces consistent 4D video reconstructions in seconds at inference. We propose a training strategy to densify and prune space-time pixel-aligned Gaussians, reducing 80% of predictions, achieving 16× higher sampling rate during training and a 5× speed-up in rendering. We design a multi-level attention module to efficiently fuse space-time tokens, further reducing training time by half. Our experiments demonstrate strong scalability of 4DGT across real-world domains using mixed training datasets and can outperform the previous Gaussian network significantly. The performance of 4DGT is on par with optimization-based methods in accuracy in cross-domain videos recorded by similar devices used in training, while being 3 orders of magnitude faster. |
| Researcher Affiliation | Collaboration | Zhen Xu1,2,* Zhengqin Li1 Zhao Dong1 Xiaowei Zhou2 Richard Newcombe1 Zhaoyang Lv1 1Reality Labs Research, Meta 2Zhejiang University |
| Pseudocode | No | The paper describes methods and architectures in detail using prose, mathematical formulas, and diagrams (e.g., Figure 2 for an overview of the method). However, it does not contain a specific section or block explicitly labeled "Pseudocode" or "Algorithm" with structured, code-like steps. |
| Open Source Code | No | We cannot provide the code implementation at the submission time due to legal process. We intend to release our model upon the acceptance of the paper and legal review. |
| Open Datasets | Yes | Training Datasets. We use the following real-world monocular videos with high-quality calibrations: Project Aria datasets with closed-loop trajectories: the EgoExo4D [12], Nymeria [29], Hot3D [2] and Aria Everyday Activities (AEA) [28]. Video data with COLMAP [44] camera parameters: Epic-Fields [50, 5] and Cop3D [46]. Phone videos with ARKit camera poses: ARKitTrack [71]. Evaluation datasets. We use the synthetic rendering provided in ADT [34] datasets, which provides metric ground truth depth. To evaluate cross-domain generalization, we use DyCheck [11] (DyC) datasets and the dynamic scene in TUM-SLAM [48] (TUM) to evaluate novel view synthesis. |
| Dataset Splits | Yes | We train 4DGT using segments of W = 128 consecutive frames from the monocular video and subsample every 8 frames as input, resulting in N = 16 input frames. Notably, for the second stage training where we apply techniques mentioned in section 3.2 and section 3.2, we increase the number of input frames to N = 64. After obtaining all Gaussian parameters {Gi,j} from each of the N input frames, we render them to all W = 128 images for self-supervision and compute the MSE loss. Additionally, we add the perceptual LPIPS loss [17] Llpips and SSIM loss [56] Lssim for better perceptual quality. For each dataset used in training [28, 12, 2, 71, 29, 46, 50], we select 99.15% of the sequences as the training set and hold out the rest. For the datasets used in evaluation: ADT [34]: We select 4 subsequences for validating the reconstruction performance: Apartment_release_multiuser_cook_seq141_M1292, Apartment_release_multiskeleton_party_seq114_M1292, Apartment_release_meal_skeleton_seq135_M1292, Apartment_release_work_skeleton_seq137_M1292. DyCheck [11]: We use all 6 sequences with 3 views, and follow [53, 11] to apply the covisibility mask before computing metrics on novel views: apple, block, space-out, spin, paper-windmill, teddy. TUM [48]: We select 3 subsequences for evaluation: rgbd_dataset_freiburg2_desk_with_person, rgbd_dataset_freiburg3_walking_halfsphere, rgbd_dataset_freiburg3_sitting_halfsphere. EgoExo4D [12]: We select 3 subsequences from the hold-out sequences: cmu_bike01_2, sfu_cooking015_2, uniandes_bouldering_003_10. Nymeria [29]: We select 2 sequences from the hold-out set: 20230607_s0_james_johnson_act1_7xwm28, 20230612_s1_christina_jones_act0_u2r0z8. AEA [28]: We select the loc5_script5_seq7_rec1 sequence from the hold-out set. Hot3D [2]: We select the P0020_ff537251 sequence from the hold-out set. The testing sequences from EgoExo4D, AEA, and Hot3D are denoted as Aria in all comparisons. All comparison experiments are conducted on 128-frame subsequences of the monocular videos, with 64 frames used as input and the remaining 64 frames used for testing. |
| Hardware Specification | Yes | With 64 Nvidia H100 GPUs, the first stage training takes roughly 9 days and the second stage training takes roughly 6 days. For all other experiments on inference speed, we use a single 80 GB A100 GPU. |
| Software Dependencies | No | We implement 4DGT in PyTorch framework [38]. We employ FlashAttention V3 [45] and the GSplat Rasterizer [67] for efficient attention and Gaussian optimization respectively. |
| Experiment Setup | Yes | We use the AdamW optimizer [27] with a learning rate of 5e-4 and a weight decay of 0.05. For the second stage training, the learning rate is set to 1e-5. Additionally, we linearly warm-up the learning rate of each stage in the first 2500 steps and then apply the cosine decaying schedule [26] for the remaining steps. During the second stage training, we additionally augment the input and output to the network by varying the aspect ratio and field of view of the images. Specifically, we randomly sample an aspect ratio from the uniform distribution on [ 1 1] and a field of view ratio on the original image on [30%, 100%]. We train our reconstruction model 100k iterations for the first stage and 30k iterations for the second stage, using a total batch size of 64. We set λlpips = 2.0, λssim = 0.2, λv = 1.0, λω = 1.0, λl = 1.0, λD = 0.1 and λN = 0.01 for all experiments. All weights for the regularization losses are warmed up linearly from 0 to their final values during the first 2500 iterations of training. |
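
The learning-rate schedule quoted in the Experiment Setup row (linear warm-up over the first 2500 steps of each stage, then cosine decay) can be sketched as follows. This is a minimal illustration, not the authors' implementation, which has not been released. The function name `lr_at_step`, the `STAGES` table, the choice to start the warm-up at `peak_lr / warmup_steps`, and the decay floor `min_lr = 0.0` are all assumptions; the quoted text fixes only the peak rates, the warm-up length, and the iteration counts.

```python
import math


def lr_at_step(step, total_steps, peak_lr, warmup_steps=2500, min_lr=0.0):
    """Learning rate at a 0-indexed training step (hypothetical helper).

    Linear warm-up to peak_lr over warmup_steps, then cosine decay to min_lr.
    The warm-up start value and min_lr are assumptions, not from the paper.
    """
    if step < warmup_steps:
        # Linear ramp: reaches peak_lr on the last warm-up step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warm-up phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Iteration counts and peak learning rates as quoted in the table above.
STAGES = {
    "stage1": {"total_steps": 100_000, "peak_lr": 5e-4},
    "stage2": {"total_steps": 30_000, "peak_lr": 1e-5},
}

if __name__ == "__main__":
    for name, cfg in STAGES.items():
        for step in (0, 2_499, cfg["total_steps"] // 2, cfg["total_steps"] - 1):
            print(name, step, lr_at_step(step, **cfg))
```

The quoted setup applies the same 2500-iteration linear warm-up to the regularization loss weights, so a ramp of the same shape would cover them too, again as an assumption about details the text does not spell out.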