ViSER: Video-Specific Surface Embeddings for Articulated 3D Shape Reconstruction

Authors: Gengshan Yang, Deqing Sun, Varun Jampani, Daniel Vlasic, Forrester Cole, Ce Liu, Deva Ramanan

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that ViSER compares favorably against prior work on challenging videos of humans with loose clothing and unusual poses, as well as animal videos from DAVIS and YTVOS.
Researcher Affiliation | Collaboration | Carnegie Mellon University, Google Research, Argo AI, Microsoft Azure AI
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions a 'ViSER-webpage' link on the title page, which may host a demo or further information, but it neither states unambiguously that the code for the described method is released nor links directly to a code repository.
Open Datasets | Yes | "To evaluate ViSER on long videos, we construct an athletic video dataset that is challenging due to loose clothing and unusual body poses. It consists of four videos from DAVIS [31] and three ballet videos. All videos are segmented and manually annotated with keypoints following the MSCOCO format [24]." "We use BADJA [2] to evaluate ViSER on animal videos including camel, cow, dog, bear and horse." "We curate a set of seven videos of different elephants from YTVOS [47] for multi-video shape and correspondence recovery." (A sketch of the MSCOCO keypoint format appears after this table.)
Dataset Splits | No | The paper evaluates on established datasets (DAVIS, BADJA, YTVOS) with MSCOCO-format keypoint annotations, but it does not explicitly describe how these datasets were split into training, validation, and test sets for its experiments.
Hardware Specification | Yes | "ViSER is only suitable for offline applications as it takes several hours to process an 80-frame video on one NVIDIA P100 GPU."
Software Dependencies | No | The paper mentions several software components and models, such as U-Net, AdamW, ResNet-18, ImageNet, and AlexNet, but does not provide version numbers for any of them or for any underlying programming languages or libraries.
Experiment Setup | Yes | "We use the AdamW [27] optimizer with a batch of 4 consecutive image pairs. We reconstruct a long video sequence in an incremental manner, similar to classic SfM. First, we use an initial set of around 20 consecutive frames to initialize the shape and pixel surface embeddings. The initial set is selected such that the viewpoint coverage is large enough. Then we gradually add in new frames. When a new frame is added, we first apply the 2D cycle loss L_reproj to optimize its articulations, and then jointly optimize all frames with all losses." (A code sketch of this schedule follows the table.)