Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

INR-V: A Continuous Representation Space for Video-based Generative Tasks

Authors: Bipasha Sen, Aditya Agarwal, Vinay P Namboodiri, C.V. Jawahar

TMLR 2022 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we evaluate the space learned by INR-V on diverse generative tasks such as video interpolation, novel video generation, video inversion, and video inpainting against the existing baselines. INR-V significantly outperforms the baselines on several of these demonstrated tasks, clearly showcasing the potential of the proposed representation space. Experimental Setup: We perform our experiments on (1) How2Sign-Faces (Duarte et al., 2020), (2) Sky Timelapse (Xiong et al., 2017), (3) Moving-MNIST (Srivastava et al., 2015), and (4) Rainbow Jelly (Skorokhodov et al., 2021). Table 1: Quantitative metrics on reconstruction quality.
Researcher Affiliation | Academia | Bipasha Sen, EMAIL, IIIT Hyderabad; Aditya Agarwal, EMAIL, IIIT Hyderabad; Vinay P Namboodiri, EMAIL, University of Bath; C. V. Jawahar, EMAIL, IIIT Hyderabad
Pseudocode | No | The paper describes the methodology using mathematical equations and descriptive text, but it does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | Yes | The codebase, dataset, and pretrained models can be found at https://skymanaditya1.github.io/INRV
Open Datasets | Yes | Experimental Setup: We perform our experiments on (1) How2Sign-Faces (Duarte et al., 2020), (2) Sky Timelapse (Xiong et al., 2017), (3) Moving-MNIST (Srivastava et al., 2015), and (4) Rainbow Jelly (Skorokhodov et al., 2021).
Dataset Splits | Yes | We modify How2Sign to How2Sign-Faces by cropping the face region out of all the videos and randomly sampling 10,000 talking-head videos, each of at least 25 frames, of dimension 128 × 128. Rainbow Jelly is a single underwater video capturing colorful jellyfish. The video is first extracted into frames, which are then divided into videos of 25 frames each, making a total of 34,526 videos. To quantify the performance of INR-V, we prepare a comparison set by randomly sampling 256 videos outside of the training set.
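The Rainbow Jelly preprocessing above cuts one long video into consecutive, non-overlapping 25-frame clips. A minimal sketch of that chunking step (the function name and signature are illustrative, not taken from the paper's codebase):

```python
def split_into_clips(num_frames, clip_len=25):
    """Return (start, end) frame index ranges for consecutive,
    non-overlapping clips of clip_len frames; any trailing frames
    shorter than clip_len are dropped. Illustrative helper, not
    from the INR-V codebase."""
    return [(start, start + clip_len)
            for start in range(0, num_frames - clip_len + 1, clip_len)]

# 34,526 clips of 25 frames correspond to 863,150 source frames
clips = split_into_clips(34526 * 25)
print(len(clips))  # 34526
```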
Hardware Specification | Yes | All experiments are performed on 2 NVIDIA-GTX 2080-ti GPUs with 12 GB memory each.
Software Dependencies | No | The paper mentions 'Adam optimizer is used' but does not specify version numbers for any software, libraries, or programming languages.
Experiment Setup | Yes | Adam optimizer is used with a learning rate of 1e-4 during training and 1e-2 during inversion tasks. No scheduler is used. Progressive training is done at a power of 10, where the ith stage is made of min(10^i, N) examples, i = 0 … K such that 10^(K+1) < N + 1, where N is the total number of training samples. Each stage except the last is trained until the reconstruction error reaches a threshold of 1e-3. The implicit neural representation fθ is an MLP with three 256-dimensional hidden layers. The hypernetwork dΩ is a set of MLPs; each MLP predicts the weights for a single hidden layer and the output layer of fθ, and has three 256-dimensional hidden layers.
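The progressive schedule above grows the active training subset by a power of 10 per stage until the full set of N samples is in play. A short sketch of one plausible reading of that schedule (the function itself is illustrative, not the paper's implementation):

```python
def progressive_stage_sizes(n_train):
    """Stage sizes for a power-of-10 progressive training schedule:
    the i-th stage uses min(10**i, n_train) examples, growing until
    the full training set is reached. One plausible reading of the
    schedule described in the paper; illustrative only."""
    sizes = []
    i = 0
    while True:
        size = min(10 ** i, n_train)
        sizes.append(size)
        if size == n_train:
            return sizes
        i += 1

print(progressive_stage_sizes(10000))  # [1, 10, 100, 1000, 10000]
print(progressive_stage_sizes(34526))  # [1, 10, 100, 1000, 10000, 34526]
```

Under this reading, every stage but the last is trained to the 1e-3 reconstruction threshold before the subset grows, and the final stage covers the whole dataset (e.g. all 34,526 Rainbow Jelly clips).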