Exocentric-to-Egocentric Video Generation
Authors: Jia-Wei Liu, Weijia Mao, Zhongcong Xu, Jussi Keppo, Mike Zheng Shou
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that Exo2Ego-V significantly outperforms SOTA approaches on 5 categories from the Ego-Exo4D dataset with an average of 35% in terms of LPIPS. |
| Researcher Affiliation | Academia | Jia-Wei Liu¹, Weijia Mao¹, Zhongcong Xu¹, Jussi Keppo², Mike Zheng Shou¹ (¹Show Lab, ²National University of Singapore) |
| Pseudocode | No | The paper describes the model architecture and optimization process, but it does not include any formal pseudocode or algorithm blocks. |
| Open Source Code | No | Our code and model will be made available on https://github.com/showlab/Exo2Ego-V. |
| Open Datasets | Yes | We extensively evaluate our Exo2Ego-V on 5 categories of skilled human activities from the challenging Ego-Exo4D [15] dataset and H2O dataset [26]. |
| Dataset Splits | No | The paper specifies training and testing splits (e.g., '80% of action clips as our train set and the remaining 20% unseen action clips as test set'), but it does not explicitly mention a separate validation split or how it was derived (a clip-level split sketch follows the table). |
| Hardware Specification | Yes | We first train the translation prior with 500K iterations on a single A100 GPU for 36 hours, and then optimize our Exo2Ego spatial appearance translation with 500K iterations on 8 A100 GPUs for 48 hours, and finally finetune our temporal motion module with 100K iterations on 8 A100 GPUs for 40 hours, all using the PyTorch [40] deep learning framework. |
| Software Dependencies | No | The paper mentions 'all using the PyTorch [40] deep learning framework,' but does not specify a version number for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | We set the learning rate of the multi-view exocentric encoder and the egocentric diffusion model as 0.00001, and we set the learning rate of the view translation prior as 0.0001. We set the number of temporal frames to 8 and the spatial resolution to 480×270 and 256×256 for exocentric and egocentric videos, respectively (a configuration sketch follows below). |
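The 'Dataset Splits' row quotes an 80%/20% partition over action clips with no stated validation set. A minimal sketch of such a clip-level split, assuming clips are keyed by clip ID and shuffled with a fixed seed (the seed and any per-category stratification are not specified by the paper):

```python
import random

def split_action_clips(clip_ids, train_frac=0.8, seed=0):
    """Partition action-clip IDs into train/test sets (80/20 by default).

    The paper reports an 80%/20% clip-level split; the random seed and
    any per-category stratification here are assumptions.
    """
    ids = list(clip_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * train_frac)
    return ids[:cut], ids[cut:]

# Example: 100 hypothetical clip IDs -> 80 train, 20 test.
train_ids, test_ids = split_action_clips([f"clip_{i:03d}" for i in range(100)])
print(len(train_ids), len(test_ids))  # 80 20
```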
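The 'Experiment Setup' row reports two learning rates, 8 temporal frames, and 480×270 / 256×256 spatial resolutions. A hedged sketch of how these hyperparameters might be wired into a PyTorch training configuration; the module names (exo_encoder, ego_unet, view_prior) are placeholders rather than the authors' identifiers, and the optimizer choice (AdamW) is an assumption since the paper does not state it:

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the paper's components; the real
# architectures (multi-view exocentric encoder, egocentric diffusion model,
# view translation prior) have not been released at the time of writing.
exo_encoder = nn.Conv2d(3, 64, 3, padding=1)
ego_unet = nn.Conv2d(64, 3, 3, padding=1)
view_prior = nn.Linear(64, 64)

# Learning rates quoted in the paper: 1e-5 for the exocentric encoder and
# egocentric diffusion model, 1e-4 for the view translation prior.
optimizer = torch.optim.AdamW([
    {"params": exo_encoder.parameters(), "lr": 1e-5},
    {"params": ego_unet.parameters(), "lr": 1e-5},
    {"params": view_prior.parameters(), "lr": 1e-4},
])

# Clip and resolution settings quoted in the paper.
NUM_FRAMES = 8                # temporal frames per clip
EXO_RESOLUTION = (480, 270)   # exocentric videos (width x height assumed)
EGO_RESOLUTION = (256, 256)   # egocentric videos
```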