Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
EchoShot: Multi-Shot Portrait Video Generation
Authors: Jiahao Wang, Hualian Sheng, Sijia Cai, Weizhan Zhang, Caixia Yan, Yachuang Feng, Bing Deng, Jieping Ye
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations demonstrate that EchoShot achieves superior identity consistency as well as attribute-level controllability in multi-shot portrait video generation. Notably, the proposed framework demonstrates potential as a foundational paradigm for general multi-shot video modeling. Project page: https://johnneywang.github.io/EchoShot-webpage. |
| Researcher Affiliation | Collaboration | Jiahao Wang¹, Hualian Sheng², Sijia Cai², Weizhan Zhang¹, Caixia Yan¹, Yachuang Feng², Bing Deng², Jieping Ye². ¹School of Computer Science and Technology, MOE KLINNS, Xi'an Jiaotong University; ²Alibaba Cloud Computing. |
| Pseudocode | No | The paper describes mathematical formulations and mechanisms like TcRoPE and TaRoPE, but it does not include a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | We submitted codes in the supplementary materials. We will release the model and the dataset officially online with detailed instructions after preparations. |
| Open Datasets | No | To facilitate model training within multi-shot scenarios, we construct PortraitGala, a large-scale and high-fidelity human-centric video dataset featuring cross-shot identity consistency and fine-grained captions such as facial attributes, outfits, and dynamic motions. |
| Dataset Splits | No | The training dataset consists of one-third one-to-one data and two-thirds many-to-one data, and the shot number S varies from 1 to 4. The length of each shot is randomly sampled. |
| Hardware Specification | Yes | All the training is carried out on NVIDIA A100 80GB GPUs. The MT2V pretraining takes 3,500 GPU hours while the PMT2V and InfT2V take additional 2,000 and 1,000 GPU hours, respectively. |
| Software Dependencies | No | The paper does not explicitly list specific software dependencies with version numbers. |
| Experiment Setup | Yes | Throughout the three tasks, we set the resolution to 832×480 and use a fixed 125 frames, which is equivalent to 7.8 seconds in the real world. We set the phase shift scale j to 4 and the mismatch suppression scale k to 6. The training dataset consists of one-third one-to-one data and two-thirds many-to-one data, and the shot number S varies from 1 to 4. The length of each shot is randomly sampled. All the training is driven by the standard RF loss. The MT2V pretraining takes about 3,500 NVIDIA A100 GPU hours. The PMT2V model and InfT2V model are fine-tuned based on the MT2V weights. |
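
For readers who want the reported setup in one place, the sketch below collects the values quoted in the Hardware Specification and Experiment Setup rows and derives two figures from them. It is an illustrative summary, not the authors' configuration: the class name and all field names are hypothetical, the resolution is assumed to be width by height, and the frame rate is inferred from the reported 125 frames and 7.8 seconds rather than stated in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EchoShotSetup:
    """Reported values from the table above; every field name is hypothetical."""

    width: int = 832                       # assumed to be the horizontal size
    height: int = 480                      # assumed to be the vertical size
    num_frames: int = 125                  # fixed clip length in frames
    clip_seconds: float = 7.8              # duration as reported (rounded)
    phase_shift_scale_j: int = 4           # "phase shift scale j"
    mismatch_suppression_scale_k: int = 6  # "mismatch suppression scale k"
    min_shots: int = 1                     # shot number S ranges from 1 ...
    max_shots: int = 4                     # ... to 4
    one_to_one_fraction: float = 1 / 3     # share of one-to-one training data
    many_to_one_fraction: float = 2 / 3    # share of many-to-one training data
    gpu_hours_mt2v: int = 3500             # MT2V pretraining
    gpu_hours_pmt2v: int = 2000            # additional hours for PMT2V
    gpu_hours_inft2v: int = 1000           # additional hours for InfT2V

    def implied_fps(self) -> float:
        # Derived, not reported: frames divided by the rounded duration.
        return self.num_frames / self.clip_seconds

    def total_gpu_hours(self) -> int:
        # Sum of the three reported A100 GPU-hour budgets.
        return self.gpu_hours_mt2v + self.gpu_hours_pmt2v + self.gpu_hours_inft2v


setup = EchoShotSetup()
print(f"implied frame rate: {setup.implied_fps():.1f} fps")
print(f"total A100 GPU hours: {setup.total_gpu_hours()}")
```

Because the 7.8-second duration is rounded, the derived frame rate is only approximate. The distribution used to sample shot lengths is not reported, so it is left out of the sketch.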