Spatio-Temporal Interactive Learning for Efficient Image Reconstruction of Spiking Cameras

Authors: Bin Fan, Jiaoyang Yin, Yuchao Dai, Chao Xu, Tiejun Huang, Boxin Shi

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments on synthetic and real-captured data show that our approach exhibits excellent performance while maintaining low model complexity."
Researcher Affiliation | Academia | "Bin Fan¹, Jiaoyang Yin²,³, Yuchao Dai⁴, Chao Xu¹, Tiejun Huang²,³, Boxin Shi²,³. ¹Nat'l Key Lab of General AI, School of Intelligence Science and Technology, Peking University; ²State Key Lab of Multimedia Info. Processing, School of Computer Science, Peking University; ³Nat'l Eng. Research Ctr. of Visual Technology, School of Computer Science, Peking University; ⁴School of Electronics and Information, Northwestern Polytechnical University"
Pseudocode | No | The paper describes the network architecture and components in detail using prose and diagrams (e.g., Figure 3), but it does not include formal pseudocode blocks or algorithms.
Open Source Code | Yes | "The code is available at https://github.com/GitCVfb/STIR."
Open Datasets | Yes | "We adopt the recently released SREDS dataset [57], which is synthesized based on the REDS dataset [38], for network training."
Dataset Splits | No | The paper mentions training and testing scenes from the SREDS dataset ("240 training scenes and 30 testing scenes") but does not explicitly specify a separate validation split or how validation was performed; one possible split is sketched after the table.
Hardware Specification | Yes | "All models are trained and tested on a single NVIDIA RTX 3090 GPU."
Software Dependencies | No | The paper mentions the "Adam optimizer [28]" but does not report version numbers for software components such as Python, PyTorch, or CUDA, which would be needed to fully reproduce the software environment; a way to record them is sketched after the table.
Experiment Setup | Yes | "Our model is trained using the Adam optimizer [28] for 150 epochs with a batch size of 8. The initial learning rate is 0.0001 and decays by a factor of 0.7 every 50 epochs. The temporal length of the input spike stream is 60, i.e., N = 20. The number of pyramid levels is set to 5, i.e., L = 5. In our HSER module, we construct a 5-channel TFP-based explicit representation... as well as an 11-channel ResNet-based implicit representation... Thus, the number of feature channels is 16, 24, 32, 64, and 96, respectively. Besides, 3 groups of motion fields are estimated at the bottom-level pyramid, i.e., G = 3. Spikes and ground truth images are randomly flipped vertically as well as rotated 90°, 180°, or 270° during training."
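
Since the paper defines only training and testing scenes, anyone re-running the code must improvise a validation protocol. The snippet below is a hypothetical sketch: the 240/30 scene counts come from the quote in the Dataset Splits row, while the 10% hold-out ratio and the fixed seed are assumptions for illustration, not the authors' procedure.

```python
import random

# Scene counts quoted from the SREDS setup: 240 training, 30 testing.
train_scenes = list(range(240))
test_scenes = list(range(240, 270))

# Hypothetical validation carve-out; the paper specifies none, so the
# 10% ratio and the seed below are assumptions, not the authors' choice.
rng = random.Random(42)
rng.shuffle(train_scenes)
num_val = len(train_scenes) // 10
val_scenes = sorted(train_scenes[:num_val])
train_scenes = sorted(train_scenes[num_val:])

print(len(train_scenes), len(val_scenes), len(test_scenes))  # 216 24 30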
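
Because the paper omits software versions, anyone reproducing the results has to pin them down from the released repository. A minimal sketch for recording the environment at run time, using only standard-library and PyTorch attributes, is:

```python
import platform

import torch

# Log the software environment alongside experiment outputs so the
# exact dependency versions are preserved for later reproduction.
env = {
    "python": platform.python_version(),
    "pytorch": torch.__version__,
    "cuda": torch.version.cuda,  # None on CPU-only builds
    "cudnn": torch.backends.cudnn.version(),
    "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else "cpu",
}
print(env)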
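
The hyperparameters quoted in the Experiment Setup row map directly onto a standard PyTorch training loop. The sketch below wires up the stated optimizer, step decay, and geometric augmentation; `STIRModel`, `SREDSTrainSet`, and `reconstruction_loss` are hypothetical placeholders for the released implementation's actual classes, and the use of `torch.optim.lr_scheduler.StepLR` is an assumption consistent with "decays by a factor of 0.7 every 50 epochs".

```python
import random

import torch
from torch.utils.data import DataLoader

def augment(spikes, gt, rng):
    # Random vertical flip plus a 0/90/180/270-degree rotation, applied
    # identically to the spike stream and the ground-truth image.
    if rng.random() < 0.5:
        spikes = torch.flip(spikes, dims=[-2])
        gt = torch.flip(gt, dims=[-2])
    k = rng.randint(0, 3)  # number of 90-degree rotations
    return torch.rot90(spikes, k, dims=[-2, -1]), torch.rot90(gt, k, dims=[-2, -1])

# Hypothetical placeholders for the repository's real model and dataset.
model = STIRModel().cuda()
loader = DataLoader(SREDSTrainSet(), batch_size=8, shuffle=True)

# Hyperparameters quoted above: Adam, lr 1e-4, x0.7 decay every 50 epochs.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.7)

rng = random.Random(0)
for epoch in range(150):
    for spikes, gt in loader:  # spikes: (B, 60, H, W) binary spike stream
        spikes, gt = augment(spikes.cuda(), gt.cuda(), rng)
        loss = reconstruction_loss(model(spikes), gt)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```

Stepping the scheduler once per epoch yields learning rates of 1e-4, 7e-5, and 4.9e-5 across the three 50-epoch stages, matching the quoted schedule.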