Recurrent Video Restoration Transformer with Guided Deformable Attention

Authors: Jingyun Liang, Yuchen Fan, Xiaoyu Xiang, Rakesh Ranjan, Eddy Ilg, Simon Green, Jiezhang Cao, Kai Zhang, Radu Timofte, Luc Van Gool

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on video super-resolution, deblurring, and denoising show that the proposed RVRT achieves state-of-the-art performance on benchmark datasets with balanced model size, testing memory and runtime. The codes are available at https://github.com/JingyunLiang/RVRT.
Researcher Affiliation | Collaboration | (1) Computer Vision Lab, ETH Zurich, Switzerland; (2) Meta Inc.; (3) University of Würzburg, Germany
Pseudocode | No | The paper describes the methodology using text and diagrams but does not include explicit pseudocode or algorithm blocks.
Open Source Code | Yes | The codes are available at https://github.com/JingyunLiang/RVRT.
Open Datasets | Yes | For video SR, we consider two settings: bicubic (BI) and blur-downsampling (BD) degradation. For BI degradation, we train the model on two different datasets: REDS [53] and Vimeo-90K [87]... (a BI/BD degradation sketch follows the table)
Dataset Splits | No | The paper mentions training on REDS and Vimeo-90K, and testing on their respective test sets (REDS4, Vimeo-90K-T), but does not explicitly specify a separate validation dataset split.
Hardware Specification | No | The paper reports model size, testing memory, and runtime, but does not specify the hardware (e.g., GPU/CPU models, memory) on which the experiments were conducted.
Software Dependencies | No | The paper mentions using specific components such as the Charbonnier loss [12], the Adam optimizer [33], the Cosine Annealing scheme [52], and SpyNet [58, 56], but does not provide version numbers for any software libraries or frameworks used in the implementation.
Experiment Setup | Yes | For shallow feature extraction and HQ frame reconstruction, we use 1 RSTB that has 2 swin transformer layers. For recurrent feature refinement, we use 4 refinement modules with a clip size of 2, each of which has 2 MRSTBs with 2 modified swin transformer layers. For both RSTB and MRSTB, the spatial attention window size and head number are 8×8 and 6, respectively. We use 144 channels for video SR and 192 channels for deblurring and denoising. In GDA, we use 12 deformable groups and 12 deformable heads with 9 candidate locations... In training, we randomly crop 256×256 HQ patches and use different video lengths for different datasets... The Adam optimizer [33] with default settings is used to train the model for 600,000 iterations with a batch size of 8. The learning rate is initialized as 4×10⁻⁴ and decreased with the Cosine Annealing scheme [52]. To stabilize training, we initialize SpyNet [58, 56] with pretrained weights, fix it for the first 30,000 iterations, and reduce its learning rate by 75%. (A configuration and training-setup sketch follows the table.)
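
The two degradation settings quoted in the Open Datasets row (BI and BD) can be illustrated with a short sketch. The code below is a minimal PyTorch example of how such low-quality inputs are commonly synthesized in the video SR literature, not the authors' exact data pipeline; the 4× scale factor, the 13×13 kernel size, and the Gaussian blur width (sigma = 1.6) are assumptions based on common practice rather than values stated in the quoted text.

```python
import torch
import torch.nn.functional as F

def bicubic_degradation(frames: torch.Tensor, scale: int = 4) -> torch.Tensor:
    """BI degradation: direct bicubic downsampling of HQ frames (N, C, H, W)."""
    return F.interpolate(frames, scale_factor=1 / scale, mode="bicubic",
                         align_corners=False)

def gaussian_kernel(size: int = 13, sigma: float = 1.6) -> torch.Tensor:
    """Normalized 2D Gaussian kernel (assumed size and sigma)."""
    coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
    kernel2d = torch.outer(g, g)
    return kernel2d / kernel2d.sum()

def blur_downsample_degradation(frames: torch.Tensor, scale: int = 4,
                                sigma: float = 1.6) -> torch.Tensor:
    """BD degradation: Gaussian blur followed by s-fold subsampling."""
    n, c, h, w = frames.shape
    k = gaussian_kernel(sigma=sigma).to(frames)
    weight = k.repeat(c, 1, 1, 1)                      # depthwise kernel, (C, 1, k, k)
    padded = F.pad(frames, (6, 6, 6, 6), mode="replicate")
    blurred = F.conv2d(padded, weight, groups=c)
    return blurred[..., ::scale, ::scale]              # keep every `scale`-th pixel
```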
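
The Experiment Setup row reports both architecture and optimization hyperparameters. The sketch below collects them into a configuration dictionary and a plausible PyTorch optimization setup. It is a minimal sketch under assumptions: RVRT-specific module classes are not shown, and the Charbonnier epsilon, the "spynet" parameter-name matching, and the scheduler's minimum learning rate are illustrative choices. Only the quoted numbers (144/192 channels, 8×8 windows, 6 heads, 12 deformable groups/heads, 9 candidate locations, learning rate 4×10⁻⁴, 600,000 iterations, batch size 8, cosine annealing, SpyNet frozen for 30,000 iterations and trained at a 75%-reduced learning rate) come from the paper.

```python
import torch

# Architecture hyperparameters reported for video SR
# (192 channels are used instead of 144 for deblurring and denoising).
VIDEO_SR_CONFIG = dict(
    clip_size=2,                 # frames processed jointly per clip
    refinement_modules=4,        # recurrent feature refinement modules
    mrstb_per_module=2,          # MRSTBs per refinement module
    layers_per_block=2,          # (modified) swin transformer layers per (M)RSTB
    window_size=(8, 8),          # spatial attention window
    num_heads=6,                 # attention heads
    channels=144,                # 192 for deblurring / denoising
    gda_deformable_groups=12,    # guided deformable attention groups
    gda_deformable_heads=12,
    gda_candidate_locations=9,
)

def charbonnier_loss(pred, target, eps=1e-9):
    """Charbonnier loss: sqrt((x - y)^2 + eps); eps is an assumed value."""
    return torch.sqrt((pred - target) ** 2 + eps).mean()

def build_optimizer_and_scheduler(model, total_iters=600_000, base_lr=4e-4):
    """Adam + cosine annealing; the flow estimator (SpyNet) gets 25% of the
    base learning rate, i.e. the 75% reduction reported in the paper."""
    flow_params, main_params = [], []
    for name, p in model.named_parameters():
        (flow_params if "spynet" in name else main_params).append(p)
    optimizer = torch.optim.Adam([
        {"params": main_params, "lr": base_lr},
        {"params": flow_params, "lr": base_lr * 0.25},
    ])
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=total_iters, eta_min=1e-7)  # eta_min is an assumption
    return optimizer, scheduler

def set_spynet_trainable(model, trainable: bool):
    """SpyNet is initialized from pretrained weights, kept frozen for the
    first 30,000 iterations, then unfrozen."""
    for name, p in model.named_parameters():
        if "spynet" in name:
            p.requires_grad = trainable
```

Expressing the 75% reduction as a separate Adam parameter group at 0.25× the base rate is one common way to implement a lower learning rate for the flow network; the authors' released code may handle it differently.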