GenRec: Unifying Video Generation and Recognition with Diffusion Models
Authors: Zejia Weng, Xitong Yang, Zhen Xing, Zuxuan Wu, Yu-Gang Jiang
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate the efficacy of GenRec for both recognition and generation. In particular, GenRec achieves competitive recognition performance, offering 75.8% and 87.2% accuracy on SSV2 and K400, respectively. GenRec also performs the best on class-conditioned image-to-video generation, achieving 46.5 and 49.3 FVD scores on SSV2 and EK-100 datasets. Furthermore, GenRec demonstrates extraordinary robustness in scenarios where only limited frames can be observed. |
| Researcher Affiliation | Academia | Zejia Weng (1,2), Xitong Yang (3), Zhen Xing (1,2), Zuxuan Wu (1,2), Yu-Gang Jiang (1,2). 1: Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University; 2: Shanghai Collaborative Innovation Center of Intelligent Visual Computing; 3: Department of Computer Science, University of Maryland |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | No | Code will be available at https://github.com/wengzejia1/GenRec. |
| Open Datasets | Yes | In our experiments, we use the following four datasets: Something-Something V2 (SSV2) [17], Kinetics-400 (K400) [24], UCF-101 [35] and Epic-Kitchen-100 (EK-100) [10]. |
| Dataset Splits | No | The paper mentions training steps and epochs, but does not specify the explicit training, validation, and test dataset splits (e.g., percentages or sample counts for each split). |
| Hardware Specification | Yes | The training is executed on 8 A100s and each contains a batch of 8 samples. |
| Software Dependencies | No | The paper describes the software components used but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | We initially set the learning rate to 1.0 × 10⁻⁵ and set the total batch size as 32. Only generation loss will be retained for model adaptation on specific datasets. We train 200k steps on EK-100 and UCF, and 300k steps on SSV2 and K400, respectively. Subsequently, we finetune GenRec with both generation and recognition losses. The learning rate is set to 1.25 × 10⁻⁵ and decayed to 2.5 × 10⁻⁷ using a cosine decay scheduler. We warm up models with 5 epochs, during which the learning rate is initially set as 2.5 × 10⁻⁷ and linearly increases to the initial learning rate 1.25 × 10⁻⁵. The loss balance ratio γ is set to 10, and the learning rate for the classifier head is ten times higher than the base learning rate. We drop out the conditions 10% of the time for supporting classifier-free guidance [20], and we finetune on K400 for 40 epochs and 30 epochs on other datasets. The training is executed on 8 A100s and each contains a batch of 8 samples. We sample 16 frames for each video. |
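
The fine-tuning schedule quoted above (linear warmup, cosine decay, a 10× classifier-head learning rate, the loss balance ratio γ, and 10% condition dropout for classifier-free guidance) can be sketched in code. The following is a minimal, hypothetical sketch, not the authors' implementation: the module names `backbone` and `cls_head` are placeholders, and the optimizer choice (AdamW) and the direction of the γ weighting between the generation and recognition losses are assumptions not stated in the paper.

```python
import math
import torch
import torch.nn as nn

# Hyperparameters quoted from the paper's fine-tuning setup.
base_lr = 1.25e-5      # base learning rate reached after warmup
min_lr = 2.5e-7        # warmup starting rate and cosine-decay floor
warmup_epochs = 5
total_epochs = 40      # 40 epochs on K400, 30 on the other datasets
gamma = 10.0           # loss balance ratio between generation and recognition terms
cond_drop_prob = 0.1   # drop conditions 10% of the time for classifier-free guidance


def lr_at_epoch(epoch: int) -> float:
    """Linear warmup from min_lr to base_lr, then cosine decay back to min_lr."""
    if epoch < warmup_epochs:
        return min_lr + (base_lr - min_lr) * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


# Placeholder modules standing in for the diffusion backbone and classifier head.
backbone, cls_head = nn.Linear(8, 8), nn.Linear(8, 4)

# The classifier head's learning rate is ten times the base learning rate.
optimizer = torch.optim.AdamW([
    {"params": backbone.parameters(), "lr": base_lr},
    {"params": cls_head.parameters(), "lr": base_lr * 10},
])


def joint_loss(gen_loss: torch.Tensor, rec_loss: torch.Tensor) -> torch.Tensor:
    # Assumed weighting: the recognition term is scaled by γ; the paper only
    # states that the balance ratio γ is set to 10.
    return gen_loss + gamma * rec_loss
```

With 8 A100s and 8 samples per GPU, the effective batch size during this joint fine-tuning stage is 64; the per-epoch learning rate from `lr_at_epoch` would be applied to both parameter groups (scaled by 10 for the classifier head).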