Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Track, Inpaint, Resplat: Subject-driven 3D and 4D Generation with Progressive Texture Infilling
Authors: Shuhong Zheng, Ashkan Mirzaei, Igor Gilitschenski
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our approach significantly improves identity preservation in 3D/4D generation compared to state-of-the-art methods. Our project website is available at https://zsh2000.github.io/track-inpaint-resplat.github.io/. |
| Researcher Affiliation | Collaboration | ¹University of Toronto, ²Vector Institute, ³Snap Inc. |
| Pseudocode | No | The paper describes the methodology in detailed steps across sections 3.2, 3.3, and 3.4, and provides a pipeline diagram in Figure 2, but it does not include a distinct block labeled 'Pseudocode' or 'Algorithm'. |
| Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: While the current submission does not contain code implementation, we have included detailed instructions in the appendix to provide guidance for reproducing the main experimental results. Also, we promise that we will open-source the data and code after paper acceptance. |
| Open Datasets | No | Starting from the original DreamBooth [75] dataset that focuses on subject-driven image generation, we construct a DreamBooth-Dynamic dataset. It is based on animatable subjects in the original DreamBooth dataset and will be used for subject-driven image-to-3D and video-to-4D generation. More details about the dataset are provided in App. B in the appendix. [...] Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA] Justification: At the submission time, we have not released the new assets used in the paper. We will well document our data and model at the time we release our data after acceptance. |
| Dataset Splits | No | We randomly select 10 samples from the DreamBooth-Dynamic dataset. Note that we do not explicitly tell the participants that we are working on improving subject-driven generation, which fosters the fairness in user study and reduces the underlying bias of the subjective preference towards our method. |
| Hardware Specification | Yes | Our method takes around 100 mins on a single NVIDIA A100 GPU (when considering the memory consumption, an NVIDIA Quadro RTX 6000 or RTX 4090 with 24GB will suffice) |
| Software Dependencies | No | With the pretrained stable diffusion inpainting model in hand, we inject LoRA [22] weights in the pretrained model and finetune with randomly generated binary mask m_i for inpainting. The loss calculation will only be conducted on the valid regions m_v as [...] Then, we leverage the video tracking model CoTracker [34] to find the correspondence between the source view and the target views. |
| Experiment Setup | Yes | More specifically, when inpainting the farther viewpoints within 90°, the similar operation of backward tracking in Sec. 3.2 will be applied to track from the queried viewpoints to this anchor viewpoint. It helps further reduce the area that needs to be infilled, therefore lowering the difficulty for the model to inpaint the far away viewpoints. The reasons for choosing 20° as the sweet spot for the anchor viewpoint in tracking are (1) compared to the source viewpoint at 0°, it has certain exploration on unseen regions that can provide larger known regions for the farther viewpoints; (2) when inpainting the 20° viewpoint, since it does not have significant shift from the original training image, the inpainted results are relatively decent and reliable. Therefore, this sweet spot strikes a balance between "exploration and exploitation" of the given source view observation. After the 90° viewpoint is inpainted, it serves as the next anchor point for inpainting the rest of the viewpoints within 90° to 180°. As we do not want significant change on the original structure, we perform denoising with the first 30% of the denoising schedule following similar practice as [33]. Also in App. D.1: "batch size is 16, LoRA rank is 8, learning rate is 2e-4 for U-Net and 4e-5 for the text encoder" and "Each training stage consists of 300 training steps before advancing to the next stage." |
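
The Experiment Setup excerpt can be summarized in a short Python sketch. It is illustrative only and is not the authors' code, which was unreleased at submission time. The numeric values are the ones quoted above. All names (`FinetuneConfig`, `anchor_for`, `partial_denoise_steps`) are hypothetical. The boundary handling at exactly 20° and 90°, and the reading of "the first 30% of the denoising schedule" as a rounded step count, are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FinetuneConfig:
    """Hyperparameters quoted from App. D.1; field names are hypothetical."""
    batch_size: int = 16
    lora_rank: int = 8
    unet_lr: float = 2e-4
    text_encoder_lr: float = 4e-5
    steps_per_stage: int = 300
    denoise_fraction: float = 0.30  # "first 30% of the denoising schedule"


def anchor_for(angle_deg: float) -> float:
    """Return the anchor viewpoint (in degrees) used when inpainting a view.

    Assumed reading of the excerpt: views up to 20 degrees rely on the
    0-degree source view, views up to 90 degrees track back to the
    20-degree anchor, and views beyond 90 degrees track back to the
    90-degree anchor. Boundary choices are assumptions.
    """
    if not 0.0 <= angle_deg <= 180.0:
        raise ValueError("angle_deg must lie in [0, 180]")
    if angle_deg <= 20.0:
        return 0.0
    if angle_deg <= 90.0:
        return 20.0
    return 90.0


def partial_denoise_steps(total_steps: int, fraction: float = 0.30) -> int:
    """Number of denoising steps covered by the given schedule fraction (assumed rounding)."""
    return max(1, round(total_steps * fraction))


if __name__ == "__main__":
    cfg = FinetuneConfig()
    for angle in (10, 45, 135):
        print(f"view {angle} deg -> anchor {anchor_for(angle)} deg")
    print(partial_denoise_steps(50, cfg.denoise_fraction))
```

The sketch covers only the scheduling logic and settings; tracking with CoTracker and the inpainting model itself are out of scope here.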