Flow-Guided Sparse Transformer for Video Deblurring

Authors: Jing Lin, Yuanhao Cai, Xiaowan Hu, Haoqian Wang, Youliang Yan, Xueyi Zou, Henghui Ding, Yulun Zhang, Radu Timofte, Luc Van Gool

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments demonstrate that our proposed FGST outperforms state-of-the-art (SOTA) methods on both DVD and GOPRO datasets and yields visually pleasant results in real video deblurring.
Researcher Affiliation | Collaboration | 1 Shenzhen International Graduate School, Tsinghua University; 2 Huawei Noah's Ark Lab; 3 ETH Zürich.
Pseudocode | No | The paper describes the proposed method in text and figures but does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | https://github.com/linjing7/VR-Baseline
Open Datasets | Yes | DVD. The DVD (Su et al., 2017) dataset consists of 71 videos with 6,708 blurry-sharp image pairs.
Dataset Splits | No | The paper explicitly states train/test splits for the datasets (DVD: 61 training videos, 10 test videos; GOPRO: 2:1 train/test ratio) but does not mention a distinct validation split.
Hardware Specification | Yes | The models are trained with 8 V100 GPUs.
Software Dependencies | No | The paper mentions 'PyTorch' and 'SPyNet' but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | We implement FGST in PyTorch. We adopt a pre-trained SPyNet (Ranjan et al., 2017) as the optical flow estimator. All the modules are trained with the Adam (Kingma & Ba, 2015) optimizer (β₁ = 0.9 and β₂ = 0.999) for 600 epochs. The initial learning rate is set to 2 × 10⁻⁴ and 2.5 × 10⁻⁵ for the deblurring model and the optical flow estimator, respectively. The learning rate is halved every 200 epochs during training. Patches of size 256 × 256 cropped from the training frames are fed into the models. The batch size is 8. The temporal radius r of the neighboring frames is set to 1. The sequence length T is set to 9 in training and to the whole video length in testing. Horizontal and vertical flips are performed for data augmentation. Peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) (Wang et al., 2004) are adopted as the evaluation metrics. The models are trained with 8 V100 GPUs. The L1 loss between the restored and ground-truth videos is used for supervision.
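As a hedged illustration, the optimizer, learning-rate schedule, and loss quoted above can be wired together in PyTorch roughly as follows. The stand-in modules, the dummy batch, and the way the flow estimator is handled are assumptions for this sketch only, not the authors' interfaces; the released code at https://github.com/linjing7/VR-Baseline is the authoritative implementation.

```python
# Minimal sketch of the quoted training configuration (assumptions noted inline).
import torch
import torch.nn as nn

# Stand-ins for the FGST deblurring model and the pre-trained SPyNet flow
# estimator; the real networks are far larger and take multi-frame inputs.
deblur_model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.Conv2d(64, 3, 3, padding=1)
)
flow_estimator = nn.Conv2d(6, 2, 3, padding=1)

# Adam with beta1 = 0.9, beta2 = 0.999; initial learning rates of 2e-4 for the
# deblurring model and 2.5e-5 for the fine-tuned flow estimator, as quoted.
optimizer = torch.optim.Adam(
    [
        {"params": deblur_model.parameters(), "lr": 2e-4},
        {"params": flow_estimator.parameters(), "lr": 2.5e-5},
    ],
    betas=(0.9, 0.999),
)

# 600 epochs in total, with the learning rate halved every 200 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.5)

# L1 loss between restored and ground-truth frames.
criterion = nn.L1Loss()

# Dummy batch standing in for 256x256 crops from training clips (batch size 8).
blurry = torch.randn(8, 3, 256, 256)
sharp = torch.randn(8, 3, 256, 256)

for epoch in range(600):
    restored = deblur_model(blurry)  # real training also consumes estimated flows
    loss = criterion(restored, sharp)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # halves both learning rates at epochs 200 and 400
```

The temporal radius r = 1, sequence length T = 9, and horizontal/vertical flip augmentations from the quoted setup belong to the data pipeline, which this sketch leaves out.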