ViSt3D: Video Stylization with 3D CNN
Authors: Ayush Pande, Gaurav Sharma
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide results on this test dataset comparing the proposed method with 2D stylization methods applied frame by frame. We show successful stylization with a 3D CNN for the first time, and obtain better stylization in terms of texture compared to the existing 2D methods. We also give quantitative results based on optical flow errors, comparing the results of the proposed method to those of the state-of-the-art 2D stylization methods. We now describe the experimental setup, and the qualitative and quantitative results comparing the proposed method with existing state-of-the-art approaches. |
| Researcher Affiliation | Collaboration | Ayush Pande (IIT Kanpur, ayushp@cse.iitk.ac.in); Gaurav Sharma (TensorTour & IIT Kanpur, gaurav@tensortour.com) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (clearly labeled algorithm sections or code-like formatted procedures). |
| Open Source Code | No | The paper provides a "Project page: https://ayush202.github.io/projects/ViSt3D.html" but does not explicitly state that this page contains the source code for the methodology. Per the evaluation criteria, a project page, personal homepage, or high-level overview page does not count as concrete access to source code unless it is explicitly stated to contain it. |
| Open Datasets | Yes | Since a dataset was not available for the task of studying video stylization, we also propose a large-scale dataset with 10,000 content clips curated from the public benchmark Sports1M [14], paired with the train set of style images from the WikiArt dataset [20] which are used to train 2D stylization methods. We downloaded 45,574 videos from the first 100,000 URLs in the Sports1M dataset [14] (as many links were now defunct, and some downloads failed). In addition to the content clips, we use images from WikiArt [20] as the style images, as has been done by the image stylization methods as well. |
| Dataset Splits | No | The paper mentions using a dataset for "training" and "testing", but it does not specify explicit training/validation/test dataset splits with percentages or sample counts for a validation set. It only states "10,000 content clips generated... as part of the dataset for training" without detailing how these are further split for validation. |
| Hardware Specification | Yes | AdaAttN took 14 seconds and 8GB of GPU memory, while our proposed method took 60 seconds and 16GB of GPU memory on a machine with an Intel Core i9-10900X processor and an Nvidia RTX A4000 GPU. |
| Software Dependencies | No | The paper mentions software components like "FlowNet2.0 [12]" and the "VGG-19 network" but does not provide specific version numbers for these or any other libraries, frameworks, or programming languages used (e.g., Python version, PyTorch/TensorFlow version). |
| Experiment Setup | Yes | Parameters. In the first phase of training, we train the encoder and decoder for 10k iterations. In the second phase of training, we train the decoder and appearance subnets for 5k iterations. In the third phase of training, we train the decoder and entangle subnet for 100k iterations. In each of these phases we used the Adam optimizer with a learning rate of 10^-4 and a decay rate of 5×10^-5 after every iteration. In the final phase of training, we train the decoder for 40k iterations with content and style losses, where λ_c and λ_s are 1 and 2, respectively. After this, we train the decoder with content, style and temporal losses for 160k iterations, with λ_content, λ_style, λ_temporal being 1, 2 and 10, respectively. And finally, we fine-tune the decoder for 40k iterations with content, style, temporal and intra-clip losses, with λ_content, λ_style, λ_temporal, λ_intra being 1, 2, 10 and 10, respectively. Input. We set the clip size K to 16. While training, we extract the same random 128×128 patch from consecutive frames of a video clip for the content clip. For the style clip, we extract a random 128×128 patch from the style image and repeat it 16 times, which equals the length of the content clip. (A training-schedule sketch and an input-pipeline sketch follow the table.) |
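
The multi-phase training schedule quoted above is concrete enough to write down as configuration. Below is a minimal sketch, assuming a PyTorch implementation; the module names (encoder, decoder, appearance subnets, entangle subnet), the `make_optimizer` helper, and the reading of the per-iteration decay as an `lr / (1 + decay * iteration)` schedule are assumptions for illustration, not the authors' released code.

```python
# Minimal sketch of the multi-phase training schedule quoted above,
# assuming a PyTorch implementation. Module names and make_optimizer
# are illustrative placeholders; the per-iteration decay is read as
# the common lr / (1 + decay * iteration) schedule (an assumption).
import torch

LR = 1e-4        # Adam learning rate reported in the paper
LR_DECAY = 5e-5  # per-iteration decay rate reported in the paper

def make_optimizer(params):
    # Adam with the reported learning rate and a multiplicative
    # per-iteration decay applied through a LambdaLR schedule.
    opt = torch.optim.Adam(params, lr=LR)
    sched = torch.optim.lr_scheduler.LambdaLR(
        opt, lr_lambda=lambda it: 1.0 / (1.0 + LR_DECAY * it))
    return opt, sched

# (phase, modules being trained, iterations, loss weights)
PHASES = [
    ("phase 1: encoder + decoder",    ["encoder", "decoder"],            10_000, {}),
    ("phase 2: decoder + appearance", ["decoder", "appearance_subnets"],  5_000, {}),
    ("phase 3: decoder + entangle",   ["decoder", "entangle_subnet"],   100_000, {}),
    ("phase 4a: content + style",     ["decoder"],  40_000, {"content": 1, "style": 2}),
    ("phase 4b: + temporal",          ["decoder"], 160_000, {"content": 1, "style": 2, "temporal": 10}),
    ("phase 4c: + intra-clip",        ["decoder"],  40_000, {"content": 1, "style": 2, "temporal": 10, "intra": 10}),
]

# Usage (hypothetical): opt, sched = make_optimizer(decoder.parameters()),
# then call sched.step() once per training iteration.
```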
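
Likewise, the input preparation (clip size K = 16, the same random 128×128 crop across consecutive content frames, and a style patch repeated 16 times) can be sketched as below. This assumes the frames and style image are already loaded as PyTorch tensors; `sample_content_clip` and `sample_style_clip` are hypothetical helper names, not from the paper.

```python
# Minimal sketch of the input preparation quoted above: clip size
# K = 16, the same random 128x128 crop taken from K consecutive
# content frames, and one random 128x128 style patch repeated K
# times. Function names are illustrative, not the authors' code.
import torch

K = 16       # clip length
PATCH = 128  # spatial patch size

def sample_content_clip(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, C, H, W) video tensor -> (K, C, PATCH, PATCH) clip."""
    T, _, H, W = frames.shape
    t0 = torch.randint(0, T - K + 1, (1,)).item()
    y = torch.randint(0, H - PATCH + 1, (1,)).item()
    x = torch.randint(0, W - PATCH + 1, (1,)).item()
    # the *same* spatial crop is used for all K consecutive frames
    return frames[t0:t0 + K, :, y:y + PATCH, x:x + PATCH]

def sample_style_clip(style_img: torch.Tensor) -> torch.Tensor:
    """style_img: (C, H, W) image tensor -> (K, C, PATCH, PATCH) clip."""
    _, H, W = style_img.shape
    y = torch.randint(0, H - PATCH + 1, (1,)).item()
    x = torch.randint(0, W - PATCH + 1, (1,)).item()
    patch = style_img[:, y:y + PATCH, x:x + PATCH]
    # repeat the style patch to match the content clip length
    return patch.unsqueeze(0).repeat(K, 1, 1, 1)
```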