Latent Image Animator: Learning to Animate Images via Latent Space Navigation

Authors: Yaohui Wang, Di Yang, François Brémond, Antitza Dantcheva

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive quantitative and qualitative analysis suggests that our model systematically and significantly outperforms state-of-art methods on VoxCeleb, Taichi and TED-talk datasets w.r.t. generated quality."
Researcher Affiliation | Academia | "Inria, Université Côte d'Azur. {yaohui.wang,di.yang,francois.bremond,antitza.dantcheva}@inria.fr"
Pseudocode | No | The paper includes architectural diagrams but does not contain any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | "Source code and pre-trained models are publicly available." (https://wyhsirius.github.io/LIA-project/)
Open Datasets | Yes | "Our model is trained on the datasets VoxCeleb, TaichiHD and TED-talk. We follow the pre-processing method in (Siarohin et al., 2019) to crop frames into 256×256 resolution for quantitative evaluation. ... VoxCeleb (Nagrani et al., 2019)... TaichiHD (Siarohin et al., 2019)... TED-talk is a new dataset proposed in MRAA (Siarohin et al., 2021)." (A preprocessing sketch follows the table.)
Dataset Splits | No | The paper specifies training and test sets (e.g., "VoxCeleb contains a training set of 17928 videos and a test set of 495 videos.") but does not explicitly describe a separate validation split or its size/percentage.
Hardware Specification | Yes | "All models are trained on four 16G NVIDIA V100 GPUs."
Software Dependencies | No | The paper states "Our model is implemented in PyTorch (Paszke et al., 2019)" but does not provide version numbers for PyTorch or any other software dependencies, such as CUDA or other libraries.
Experiment Setup | Yes | "The total batch size is 32 with 8 images per GPU. We use a learning rate of 0.002 to train our model with the Adam optimizer (Kingma & Ba, 2014). The dimension of all latent codes, as well as directions in Dm, is set to be 512. In our loss function, we use λ = 10 in order to penalize more on the perceptual loss."
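
The Open Datasets row reports that frames are cropped to 256×256 following the pre-processing of Siarohin et al. (2019). Below is a minimal sketch of a generic square-crop-and-resize step in Python; the function name `preprocess_frame` is our own, and the actual pipeline uses the bounding-box-based cropping from Siarohin et al. (2019), which is not reproduced here.

```python
from PIL import Image

def preprocess_frame(path: str, size: int = 256) -> Image.Image:
    """Center-crop a frame to a square, then resize to size x size.

    Generic stand-in only: the paper follows the cropping procedure of
    Siarohin et al. (2019) rather than a plain center crop.
    """
    img = Image.open(path).convert("RGB")
    w, h = img.size
    side = min(w, h)
    left, top = (w - side) // 2, (h - side) // 2
    img = img.crop((left, top, left + side, top + side))
    return img.resize((size, size), Image.BILINEAR)
```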
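
For reference, the hyperparameters quoted in the Experiment Setup row could be wired up in PyTorch roughly as follows. This is a sketch under stated assumptions, not the authors' implementation: `Generator` and `total_loss` are hypothetical placeholders, and only the reported values (latent dimension 512, learning rate 0.002, Adam, batch 8 per GPU over 4 GPUs, λ = 10 on the perceptual loss) come from the paper.

```python
import torch
from torch import nn, optim

# Values reported in the paper's experiment setup.
LATENT_DIM = 512       # dimension of latent codes and of directions in Dm
LEARNING_RATE = 2e-3   # Adam learning rate
BATCH_PER_GPU = 8      # 8 images per GPU x 4 GPUs = total batch size 32
LAMBDA_PERC = 10.0     # lambda weighting the perceptual loss term

class Generator(nn.Module):
    """Hypothetical stand-in; the paper's generator navigates a learned
    latent space and is far larger than this placeholder."""
    def __init__(self, latent_dim: int = LATENT_DIM):
        super().__init__()
        self.net = nn.Linear(latent_dim, latent_dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)

model = Generator()
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

def total_loss(reconstruction: torch.Tensor, perceptual: torch.Tensor) -> torch.Tensor:
    # lambda = 10 penalizes the perceptual term more heavily, as stated
    # in the paper's loss description; the exact loss terms are assumed.
    return reconstruction + LAMBDA_PERC * perceptual
```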