Recurrent Video Restoration Transformer with Guided Deformable Attention

Authors: Jingyun Liang, Yuchen Fan, Xiaoyu Xiang, Rakesh Ranjan, Eddy Ilg, Simon Green, Jiezhang Cao, Kai Zhang, Radu Timofte, Luc Van Gool

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on video super-resolution, deblurring, and denoising show that the proposed RVRT achieves state-of-the-art performance on benchmark datasets with balanced model size, testing memory and runtime. The codes are available at https://github.com/JingyunLiang/RVRT.
Researcher Affiliation | Collaboration | (1) Computer Vision Lab, ETH Zurich, Switzerland; (2) Meta Inc.; (3) University of Würzburg, Germany
Pseudocode | No | The paper describes the methodology using text and diagrams but does not include explicit pseudocode or algorithm blocks.
Open Source Code | Yes | The codes are available at https://github.com/JingyunLiang/RVRT.
Open Datasets | Yes | For video SR, we consider two settings: bicubic (BI) and blur-downsampling (BD) degradation. For BI degradation, we train the model on two different datasets: REDS [53] and Vimeo-90K [87]... (a BI/BD degradation sketch follows the table)
Dataset Splits | No | The paper mentions training on REDS and Vimeo-90K, and testing on their respective test sets (REDS4, Vimeo-90K-T), but does not explicitly specify a separate validation dataset split.
Hardware Specification | No | The paper reports model size, testing memory, and runtime, but does not specify the hardware (e.g., GPU/CPU models, memory) on which the experiments were conducted.
Software Dependencies | No | The paper mentions using specific components such as the Charbonnier loss [12], the Adam optimizer [33], the Cosine Annealing scheme [52], and SpyNet [58, 56], but does not provide version numbers for any software libraries or frameworks used in the implementation.
Experiment Setup | Yes | For shallow feature extraction and HQ frame reconstruction, we use 1 RSTB that has 2 swin transformer layers. For recurrent feature refinement, we use 4 refinement modules with a clip size of 2, each of which has 2 MRSTBs with 2 modified swin transformer layers. For both RSTB and MRSTB, the spatial attention window size and head number are 8×8 and 6, respectively. We use 144 channels for video SR and 192 channels for deblurring and denoising. In GDA, we use 12 deformable groups and 12 deformable heads with 9 candidate locations... In training, we randomly crop 256×256 HQ patches and use different video lengths for different datasets... The Adam optimizer [33] with default settings is used to train the model for 600,000 iterations with a batch size of 8. The learning rate is initialized as 4×10⁻⁴ and decreased with the Cosine Annealing scheme [52]. To stabilize training, we initialize SpyNet [58, 56] with pretrained weights, fix it for the first 30,000 iterations, and reduce its learning rate by 75%. (A configuration and training-setup sketch follows the table.)
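
The two degradation settings quoted in the Open Datasets row (BI and BD) can be illustrated with a short sketch. The code below is a minimal PyTorch example of how such low-quality inputs are commonly synthesized in the video SR literature, not the authors' exact data pipeline; the 4× scale factor, the 13×13 kernel size, and the Gaussian blur width (sigma = 1.6) are assumptions based on common practice rather than values stated in the quoted text.

```python
import torch
import torch.nn.functional as F

def bicubic_degradation(frames: torch.Tensor, scale: int = 4) -> torch.Tensor:
    """BI degradation: direct bicubic downsampling of HQ frames (N, C, H, W)."""
    return F.interpolate(frames, scale_factor=1 / scale, mode="bicubic",
                         align_corners=False)

def gaussian_kernel(size: int = 13, sigma: float = 1.6) -> torch.Tensor:
    """Normalized 2D Gaussian kernel (assumed size and sigma)."""
    coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
    kernel2d = torch.outer(g, g)
    return kernel2d / kernel2d.sum()

def blur_downsample_degradation(frames: torch.Tensor, scale: int = 4,
                                sigma: float = 1.6) -> torch.Tensor:
    """BD degradation: Gaussian blur followed by s-fold subsampling."""
    n, c, h, w = frames.shape
    k = gaussian_kernel(sigma=sigma).to(frames)
    weight = k.repeat(c, 1, 1, 1)                      # depthwise kernel, (C, 1, k, k)
    padded = F.pad(frames, (6, 6, 6, 6), mode="replicate")
    blurred = F.conv2d(padded, weight, groups=c)
    return blurred[..., ::scale, ::scale]              # keep every `scale`-th pixel
```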
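
The Experiment Setup row reports both architecture and optimization hyperparameters. The sketch below collects them into a configuration dictionary and a plausible PyTorch optimization setup. It is a minimal sketch under assumptions: RVRT-specific module classes are not shown, and the Charbonnier epsilon, the "spynet" parameter-name matching, and the scheduler's minimum learning rate are illustrative choices. Only the quoted numbers (144/192 channels, 8×8 windows, 6 heads, 12 deformable groups/heads, 9 candidate locations, learning rate 4×10⁻⁴, 600,000 iterations, batch size 8, cosine annealing, SpyNet frozen for 30,000 iterations and trained at a 75%-reduced learning rate) come from the paper.

```python
import torch

# Architecture hyperparameters reported for video SR
# (192 channels are used instead of 144 for deblurring and denoising).
VIDEO_SR_CONFIG = dict(
    clip_size=2,                 # frames processed jointly per clip
    refinement_modules=4,        # recurrent feature refinement modules
    mrstb_per_module=2,          # MRSTBs per refinement module
    layers_per_block=2,          # (modified) swin transformer layers per (M)RSTB
    window_size=(8, 8),          # spatial attention window
    num_heads=6,                 # attention heads
    channels=144,                # 192 for deblurring / denoising
    gda_deformable_groups=12,    # guided deformable attention groups
    gda_deformable_heads=12,
    gda_candidate_locations=9,
)

def charbonnier_loss(pred, target, eps=1e-9):
    """Charbonnier loss: sqrt((x - y)^2 + eps); eps is an assumed value."""
    return torch.sqrt((pred - target) ** 2 + eps).mean()

def build_optimizer_and_scheduler(model, total_iters=600_000, base_lr=4e-4):
    """Adam + cosine annealing; the flow estimator (SpyNet) gets 25% of the
    base learning rate, i.e. the 75% reduction reported in the paper."""
    flow_params, main_params = [], []
    for name, p in model.named_parameters():
        (flow_params if "spynet" in name else main_params).append(p)
    optimizer = torch.optim.Adam([
        {"params": main_params, "lr": base_lr},
        {"params": flow_params, "lr": base_lr * 0.25},
    ])
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=total_iters, eta_min=1e-7)  # eta_min is an assumption
    return optimizer, scheduler

def set_spynet_trainable(model, trainable: bool):
    """SpyNet is initialized from pretrained weights, kept frozen for the
    first 30,000 iterations, then unfrozen."""
    for name, p in model.named_parameters():
        if "spynet" in name:
            p.requires_grad = trainable
```

Expressing the 75% reduction as a separate Adam parameter group at 0.25× the base rate is one common way to implement a lower learning rate for the flow network; the authors' released code may handle it differently.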