GenRec: Unifying Video Generation and Recognition with Diffusion Models

Authors: Zejia Weng, Xitong Yang, Zhen Xing, Zuxuan Wu, Yu-Gang Jiang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate the efficacy of GenRec for both recognition and generation. In particular, GenRec achieves competitive recognition performance, offering 75.8% and 87.2% accuracy on SSV2 and K400, respectively. GenRec also performs best on class-conditioned image-to-video generation, achieving FVD scores of 46.5 and 49.3 on the SSV2 and EK-100 datasets. Furthermore, GenRec demonstrates extraordinary robustness in scenarios where only limited frames can be observed.
Researcher Affiliation | Academia | Zejia Weng (1,2), Xitong Yang (3), Zhen Xing (1,2), Zuxuan Wu (1,2), Yu-Gang Jiang (1,2). Affiliations: (1) Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University; (2) Shanghai Collaborative Innovation Center of Intelligent Visual Computing; (3) Department of Computer Science, University of Maryland
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | No | Code will be available at https://github.com/wengzejia1/GenRec.
Open Datasets | Yes | In our experiments, we use the following four datasets: Something-Something V2 (SSV2) [17], Kinetics-400 (K400) [24], UCF-101 [35] and Epic-Kitchen-100 (EK-100) [10].
Dataset Splits | No | The paper mentions training steps and epochs, but does not specify explicit training, validation, and test splits (e.g., percentages or sample counts for each split).
Hardware Specification | Yes | The training is executed on 8 A100 GPUs, each holding a batch of 8 samples.
Software Dependencies | No | The paper describes the software components used but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | We initially set the learning rate to 1.0 × 10^-5 and the total batch size to 32. Only the generation loss is retained for model adaptation on specific datasets. We train for 200k steps on EK-100 and UCF-101, and 300k steps on SSV2 and K400, respectively. Subsequently, we finetune GenRec with both generation and recognition losses. The learning rate is set to 1.25 × 10^-5 and decayed to 2.5 × 10^-7 using a cosine decay scheduler. We warm up the model for 5 epochs, during which the learning rate starts at 2.5 × 10^-7 and increases linearly to the initial learning rate of 1.25 × 10^-5. The loss balance ratio γ is set to 10, and the learning rate for the classifier head is ten times the base learning rate. We drop the conditions 10% of the time to support classifier-free guidance [20], and we finetune for 40 epochs on K400 and 30 epochs on the other datasets. The training is executed on 8 A100 GPUs, each holding a batch of 8 samples. We sample 16 frames for each video.
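
The Experiment Setup row quotes a warmup-plus-cosine learning-rate schedule and a classifier head trained at ten times the base rate. Below is a minimal sketch of that schedule, assuming a PyTorch-style setup; the module names (`backbone`, `classifier_head`), the `lr_scale` parameter-group trick, and the exact warmup/decay formula are our assumptions, since the paper only states the endpoint values and durations.

```python
import math
import torch

# Values quoted in the Experiment Setup row; everything else is assumed.
BASE_LR = 1.25e-5    # finetuning learning rate
MIN_LR = 2.5e-7      # warmup start and cosine-decay floor
WARMUP_EPOCHS = 5
TOTAL_EPOCHS = 40    # K400; the paper uses 30 epochs for the other datasets

def lr_at_epoch(epoch: float) -> float:
    """Linear warmup from MIN_LR to BASE_LR, then cosine decay back to MIN_LR."""
    if epoch < WARMUP_EPOCHS:
        return MIN_LR + (BASE_LR - MIN_LR) * epoch / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / (TOTAL_EPOCHS - WARMUP_EPOCHS)
    return MIN_LR + 0.5 * (BASE_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

# Stand-in modules; the real model is a video diffusion backbone plus a
# recognition head (these names are hypothetical).
backbone = torch.nn.Linear(512, 512)
classifier_head = torch.nn.Linear(512, 400)

# Two parameter groups: the classifier head trains at 10x the base rate.
optimizer = torch.optim.AdamW([
    {"params": backbone.parameters(), "lr_scale": 1.0},
    {"params": classifier_head.parameters(), "lr_scale": 10.0},
], lr=BASE_LR)

for epoch in range(TOTAL_EPOCHS):
    lr = lr_at_epoch(epoch)
    for group in optimizer.param_groups:
        group["lr"] = lr * group["lr_scale"]
    # ... one epoch of training (8 GPUs x 8 samples = effective batch of 64) ...
```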
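
The same row gives a loss balance ratio γ = 10 and a 10% condition dropout for classifier-free guidance [20]. The sketch below shows one plausible reading of the joint objective; the model interface (returning a noise prediction and class logits), the zeroed null condition, and the weighting direction L = L_gen + γ · L_rec are assumptions rather than the paper's documented implementation.

```python
import torch
import torch.nn.functional as F

GAMMA = 10.0     # loss balance ratio quoted in the paper
P_UNCOND = 0.10  # fraction of steps trained without conditioning

def joint_loss(model, noisy_latents, timesteps, cond, noise_target, labels):
    """One training step combining the generation and recognition losses.

    `model` is assumed to return (noise_pred, logits); GenRec's real
    interface may differ.
    """
    # Drop the condition 10% of the time so the model also learns the
    # unconditional score, which classifier-free guidance needs at sampling
    # time. Zeroing is one common null-condition choice; a learned null
    # embedding is another.
    if torch.rand(()) < P_UNCOND:
        cond = torch.zeros_like(cond)

    noise_pred, logits = model(noisy_latents, timesteps, cond)
    loss_gen = F.mse_loss(noise_pred, noise_target)  # denoising objective
    loss_rec = F.cross_entropy(logits, labels)       # recognition objective
    return loss_gen + GAMMA * loss_rec
```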