BiMatting: Efficient Video Matting via Binarization

Authors: Haotong Qin, Lei Ke, Xudong Ma, Martin Danelljan, Yu-Wing Tai, Chi-Keung Tang, Xianglong Liu, Fisher Yu

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments, we demonstrate that BiMatting outperforms other binarized video matting models, including state-of-the-art (SOTA) binarization methods, by a significant margin. Our approach even performs comparably to the full-precision counterpart in visual quality.
Researcher Affiliation | Academia | Beihang University, ETH Zürich, HKUST, Dartmouth College
Pseudocode | No | The paper describes processes and operations using mathematical equations and textual explanations, but it does not include explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code and models are released at https://github.com/htqin/BiMatting.
Open Datasets | Yes | Our extensive experiments on fundamental tasks across VideoMatte240K (VM) [10], Distinctions-646 (D646) [29], and Adobe Image Matting (AIM) [30] datasets demonstrate that the advantages of BiMatting are task-independent.
Dataset Splits | No | The paper mentions training on the VM dataset and evaluating on VM, D646, and AIM, and discusses batch sizes and epoch counts, but it does not state explicit percentages or counts for training, validation, and test splits, nor does it describe how any validation data was partitioned from these datasets beyond the general training stages.
Hardware Specification | Yes | All stages of our experiments use batch size 4 split across 4 Nvidia A100 GPUs.
Software Dependencies | No | The complete network is constructed and trained using PyTorch [50]. However, no specific version number for PyTorch or other software dependencies is provided.
Experiment Setup | Yes | Stage 1 involves training on the low-resolution VM dataset for 20 epochs without DGF, with T = 15 frames for quick updates. The SBB backbone's learning rate is set to 1e-4 and the rest to 2e-4. Additionally, the input resolution (h, w) is sampled independently from 256-512 px to improve robustness. In Stage 2, the network is trained with T = 50, with halved learning rates and 2 more epochs to enable learning of long-term dependencies. In Stage 3, the DGF module is attached, and 1 epoch is trained on both low-resolution long and high-resolution short sequences from the VM dataset. The low-resolution pass uses T = 40 with h and w the same as in Stage 1 without DGF, while the high-resolution pass employs DGF with downsample factor s = 0.25, T̂ = 6, and ĥ, ŵ ∈ (1024, 2048). The learning rate of DGF is 2e-4, and that of the rest is 1e-5. In Stage 4, the network is trained for 5 epochs on D646 and AIM, increasing the decoder's learning rate to 5e-5.
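
To make the Experiment Setup row easier to scan, the following is a minimal sketch that restates the quoted hyperparameters as a plain Python configuration dictionary. The key names and structure are shorthand chosen here for readability; they are not identifiers from the released BiMatting code.

    # A minimal sketch restating the quoted four-stage schedule as a Python dict.
    # All key names are our own shorthand, not taken from the BiMatting repository.
    TRAINING_STAGES = {
        "stage1": {
            "data": "VM (low resolution)",
            "epochs": 20,
            "frames_T": 15,
            "use_dgf": False,
            "lr": {"sbb_backbone": 1e-4, "rest": 2e-4},
            "input_hw_px": (256, 512),          # h and w sampled independently in this range
        },
        "stage2": {
            "data": "VM (low resolution)",
            "epochs": 2,                         # 2 additional epochs
            "frames_T": 50,
            "use_dgf": False,
            "lr": {"sbb_backbone": 5e-5, "rest": 1e-4},   # Stage 1 rates halved
        },
        "stage3": {
            "data": "VM (low-res long + high-res short sequences)",
            "epochs": 1,
            "low_res_pass": {"frames_T": 40, "use_dgf": False},   # h, w as in Stage 1
            "high_res_pass": {"frames_T": 6, "use_dgf": True,
                              "downsample_s": 0.25, "hw_px": (1024, 2048)},
            "lr": {"dgf": 2e-4, "rest": 1e-5},
        },
        "stage4": {
            "data": "D646 + AIM",
            "epochs": 5,
            "lr": {"decoder": 5e-5},             # only the decoder rate is quoted for this stage
        },
    }

Laid out this way, it is easier to see that longer sequences and higher resolutions are introduced progressively, and that DGF is only attached from Stage 3 onward.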