Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Frame In-N-Out: Unbounded Controllable Image-to-Video Generation

Authors: Boyang Wang, Xuweiyi Chen, Matheus Gadelha, Zezhou Cheng

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluation shows that our proposed approach significantly outperforms existing baselines.
Researcher Affiliation | Collaboration | 1 University of Virginia, 2 Adobe Research
Pseudocode | No | The paper describes methods and architectures but does not include an explicit pseudocode or algorithm block.
Open Source Code | Yes | The pre-processing code and metadata will be released. We will release data curation instructions and code.
Open Datasets | Yes | The training dataset we use includes OpenVid-1M [38], VidGen-1M [51], and subset of Webvid10M [4]. The specific data and filtering statistics can be found in the appendix. We use the reserved subset of the OpenVid-1M dataset as our evaluation test set for Frame In-N-Out curation. Our evaluation is composed of two parts: Frame Out and Frame In with identity (ID) reference. Though we don't require perfect Frame In and Frame Out patterns in our training scenarios, for the expression of a fair intention, we set the setting to the hardest level in the curation of the evaluation test set. In this way, we curate 183 and 189 cases for ideal Frame In and Frame Out as evaluation datasets. All Frame In evaluation datasets will come with one and only one ID reference image. The benchmark will be released for future study.
Dataset Splits | Yes | The training dataset we use includes OpenVid-1M [38], VidGen-1M [51], and subset of Webvid10M [4]. The specific data and filtering statistics can be found in the appendix. We use the reserved subset of the OpenVid-1M dataset as our evaluation test set for Frame In-N-Out curation.
Hardware Specification | Yes | The authors acknowledge the Adobe Research Gift, the University of Virginia Research Computing and Data Analytics Center, Advanced Micro Devices AI and HPC Cluster Program, Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, and National Artificial Intelligence Research Resource (NAIRR) Pilot for computational resources, including the Anvil supercomputer (National Science Foundation award OAC 2005632) at Purdue University and the Delta and DeltaAI advanced computing resources (National Science Foundation award OAC 2005572).
Software Dependencies | Yes | Automatic Captioning: to obtain high-quality paired text prompts, we discard dataset-provided captions and generate new ones by QWen2.5-32B-VL-Instruct [73]... We apply tracking from CoTracker3 [26]... For valid Frame In cases, we employ SAM2 [42] to extract the object mask... In the Panoptic Segmentation, we consider 22 objects of COCO dataset [34] detected by OneFormer [21] as the identity of interest...
Experiment Setup | Yes | Our training is on a total batch size of 8 for 32K and 50K iterations in two stages, respectively. The training resolution, which is also the canvas resolution, is 384 x 480 for two stages. All the video is curated, processed, and fetched at 12 FPS standards. We apply the learning rate warmup for each stage of training in the first 400 steps. The learning rate is 2e-5. Our inference step is 50 with classifier-free guidance [19]. The first frame and text dropout ratio is 5% each in the training to augment the classifier-free guidance in the inference.
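The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. The Python below is illustrative only and is not the authors' code: the names `TrainConfig`, `warmup_lr`, and `drop_conditions` are hypothetical, the linear shape of the warmup is an assumption (the quoted text only says a warmup covers the first 400 steps of each stage), and sampling the two dropouts independently is likewise an assumption.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Values quoted in the Experiment Setup row (field names are hypothetical)."""
    total_batch_size: int = 8
    stage_iterations: tuple = (32_000, 50_000)  # stage 1, stage 2
    resolution: tuple = (384, 480)  # as quoted; axis order is not stated
    fps: int = 12
    learning_rate: float = 2e-5
    warmup_steps: int = 400  # applied at the start of each stage
    inference_steps: int = 50
    first_frame_dropout: float = 0.05
    text_dropout: float = 0.05


def warmup_lr(step: int, cfg: TrainConfig) -> float:
    """Learning rate at a 0-indexed step within one stage.

    Assumption: the warmup ramps linearly up to the base rate.
    """
    if step < cfg.warmup_steps:
        return cfg.learning_rate * (step + 1) / cfg.warmup_steps
    return cfg.learning_rate


def drop_conditions(rng: random.Random, cfg: TrainConfig) -> tuple:
    """Return (drop_first_frame, drop_text) for one training sample.

    Assumption: each condition is dropped independently with its own ratio.
    """
    drop_first_frame = rng.random() < cfg.first_frame_dropout
    drop_text = rng.random() < cfg.text_dropout
    return drop_first_frame, drop_text
```

Under the linear assumption, `warmup_lr(199, TrainConfig())` is half of the base rate, and every step from 399 onward in a stage uses the full 2e-5.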