Exocentric-to-Egocentric Video Generation

Authors: Jia-Wei Liu, Weijia Mao, Zhongcong Xu, Jussi Keppo, Mike Zheng Shou

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that Exo2Ego-V significantly outperforms SOTA approaches on 5 categories from the Ego-Exo4D dataset with an average of 35% in terms of LPIPS.
Researcher Affiliation | Academia | Jia-Wei Liu¹, Weijia Mao¹, Zhongcong Xu¹, Jussi Keppo², Mike Zheng Shou¹ (corresponding author); ¹Show Lab, ²National University of Singapore
Pseudocode | No | The paper describes the model architecture and optimization process, but it does not include any formal pseudocode or algorithm blocks.
Open Source Code | No | Our code and model will be made available on https://github.com/showlab/Exo2Ego-V.
Open Datasets | Yes | We extensively evaluate our Exo2Ego-V on 5 categories of skilled human activities from the challenging Ego-Exo4D [15] dataset and H2O dataset [26].
Dataset Splits | No | The paper specifies training and testing splits (e.g., '80% of action clips as our train set and the remaining 20% unseen action clips as test set'), but it does not explicitly mention a separate validation split or how it was derived.
Hardware Specification | Yes | We first train the translation prior with 500K iterations on a single A100 GPU for 36 hours, and then optimize our Exo2Ego spatial appearance translation with 500K iterations on 8 A100 GPUs for 48 hours, and finally finetune our temporal motion module with 100K iterations on 8 A100 GPUs for 40 hours, all using the PyTorch [40] deep learning framework.
Software Dependencies | No | The paper mentions 'all using the PyTorch [40] deep learning framework,' but does not specify a version number for PyTorch or any other software dependencies.
Experiment Setup | Yes | We set the learning rate of the multi-view exocentric encoder and the egocentric diffusion model as 0.00001, and we set the learning rate of view translation prior as 0.0001. We set the number of temporal frames to 8 and spatial resolution to 480×270 and 256×256 for exocentric and egocentric videos, respectively.
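The hyperparameters reported in the Experiment Setup and Hardware Specification rows can be collected into a single configuration sketch. This is a minimal illustration only: the module names (`exo_encoder`, `ego_diffusion`, `view_prior`) and the config layout are assumptions for readability, not identifiers from the authors' unreleased code.

```python
# Hedged sketch of the training setup reported in the paper.
# Module names below are illustrative placeholders, not the
# authors' actual identifiers.

EXO2EGO_CONFIG = {
    # Learning rates from the Experiment Setup row
    "lr_exo_encoder": 1e-5,    # multi-view exocentric encoder
    "lr_ego_diffusion": 1e-5,  # egocentric diffusion model
    "lr_view_prior": 1e-4,     # view translation prior
    # Temporal / spatial settings
    "num_frames": 8,
    "exo_resolution": (480, 270),  # exocentric videos
    "ego_resolution": (256, 256),  # egocentric videos
    # Three-stage schedule from the Hardware Specification row
    "stages": [
        {"name": "translation_prior", "iters": 500_000, "gpus": 1},
        {"name": "spatial_appearance", "iters": 500_000, "gpus": 8},
        {"name": "temporal_motion_finetune", "iters": 100_000, "gpus": 8},
    ],
}

def param_groups(config):
    """Return (module, lr) pairs, e.g. for building PyTorch optimizer
    parameter groups with per-module learning rates."""
    return [
        ("exo_encoder", config["lr_exo_encoder"]),
        ("ego_diffusion", config["lr_ego_diffusion"]),
        ("view_prior", config["lr_view_prior"]),
    ]
```

In PyTorch, `param_groups` would map naturally onto the optimizer's per-parameter-group `lr` option, so the view translation prior can train at 1e-4 while the encoder and diffusion model train at 1e-5 under one optimizer.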