Image Fusion via Vision-Language Model

Authors: Zixiang Zhao, Lilun Deng, Haowen Bai, Yukun Cui, Zhipeng Zhang, Yulun Zhang, Haotong Qin, Dongdong Chen, Jiangshe Zhang, Peng Wang, Luc Van Gool

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | FILM has shown promising results in four image fusion tasks: infrared-visible, medical, multi-exposure, and multi-focus image fusion. We also propose a vision-language dataset containing ChatGPT-generated paragraph descriptions for the eight image fusion datasets across four fusion tasks, facilitating future research in vision-language model-based image fusion.
Researcher Affiliation | Academia | 1 Xi'an Jiaotong University, China; 2 ETH Zürich, Switzerland; 3 Northwestern Polytechnical University, China; 4 MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, China; 5 Heriot-Watt University, United Kingdom; 6 KU Leuven, Belgium; 7 INSAIT, Bulgaria.
Pseudocode | No | The paper describes the workflow and components of the FILM algorithm verbally and visually through figures, but it does not include a formally labeled pseudocode block or algorithm.
Open Source Code | Yes | Code and dataset are available at https://github.com/Zhaozixiang1228/IF-FILM.
Open Datasets | Yes | MSRS (Tang et al., 2022c), M3FD (Liu et al., 2022a) and RoadScene (Xu et al., 2020a) datasets for the infrared-visible image fusion (IVF) task, the Harvard medical dataset (Johnson & Becker) for the medical image fusion (MIF) task, the Real-MFF (Zhang et al., 2020a) and Lytro (Nejati et al., 2015) datasets for the multi-focus image fusion (MFF) task, and the SICE (Cai et al., 2018) and MEFB (Zhang, 2021a) datasets for the multi-exposure image fusion (MEF) task. ... Code and dataset are available at https://github.com/Zhaozixiang1228/IF-FILM.
Dataset Splits | Yes | MSRS dataset: 1,083 pairs for IVF training and 361 pairs for IVF testing. RoadScene dataset: 70 pairs for IVF validation and 70 pairs for IVF testing. SICE dataset: 499 pairs for MEF training and 90 pairs for MEF testing. Real-MFF dataset: 639 pairs for MFF training and 71 pairs for MFF testing.
Hardware Specification | Yes | A machine with eight NVIDIA GeForce RTX 3090 GPUs is utilized for our experiments.
Software Dependencies | No | The paper mentions the Adam optimizer, Restormer blocks, and specific model architectures, but it does not provide version numbers for software dependencies such as deep learning frameworks (e.g., PyTorch, TensorFlow) or programming languages.
Experiment Setup | Yes | We train the network for 300 epochs using the Adam optimizer, with an initial learning rate of 1e-4 decreasing by a factor of 0.5 every 50 epochs. The Adam optimization strategy is employed with the batch size set to 16. We incorporate Restormer blocks (Zamir et al., 2022) in both the language-guided vision encoder V(·) and the vision feature decoder D(·), with each block having 8 attention heads and a dimensionality of 64. M and N, representing the number of blocks in V(·) and D(·), are set to 2 and 3, respectively.
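The step-decay learning-rate schedule in the Experiment Setup row (initial rate 1e-4, halved every 50 epochs over 300 epochs) can be sketched in plain Python. This is a minimal illustration of the schedule as described; the helper name is ours, not from the paper:

```python
def film_lr(epoch, base_lr=1e-4, decay=0.5, step=50):
    """Step-decay schedule: multiply the learning rate by `decay`
    once per completed `step`-epoch stage (halving every 50 epochs)."""
    return base_lr * decay ** (epoch // step)

# Rate at the start of each 50-epoch stage across the 300-epoch run:
# epoch 0 -> 1e-4, epoch 50 -> 5e-5, ..., epoch 250 -> 1e-4 * 0.5**5
stage_rates = [film_lr(e) for e in range(0, 300, 50)]
```

In a PyTorch training loop this corresponds to `torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.5)` wrapped around an Adam optimizer with `lr=1e-4` (framework and versions are not specified in the paper).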