GeminiFusion: Efficient Pixel-wise Multimodal Fusion for Vision Transformer
Authors: Ding Jia, Jianyuan Guo, Kai Han, Han Wu, Chao Zhang, Chang Xu, Xinghao Chen
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive evaluations across multimodal image-to-image translation, 3D object detection and arbitrary-modal semantic segmentation tasks, including RGB, depth, LiDAR, event data, etc., demonstrate the superior performance of our GeminiFusion against leading-edge techniques. |
| Researcher Affiliation | Collaboration | ¹Peking University, ²The University of Sydney, ³Huawei Noah's Ark Lab. |
| Pseudocode | No | The paper describes its method using textual descriptions and mathematical equations but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The PyTorch code is available here. |
| Open Datasets | Yes | For multimodal semantic segmentation experiments, we use the following datasets: NYUDv2 (Silberman et al., 2012) dataset, DeLiVER (Zhang et al., 2023b) dataset, SUN RGB-D (Song et al., 2015) dataset. For the image-to-image translation task, we use Taskonomy (Zamir et al., 2018) dataset. For the 3D object detection task, we use KITTI 3D object detection (Geiger et al., 2012) dataset. |
| Dataset Splits | Yes | KITTI 3D object detection (Geiger et al., 2012) dataset contains 7481 training samples and 7518 test samples. ... the training set is further split into a training set and a validation set. After splitting, the training set consists of 3712 samples and the validation set consists of 3769 samples. (A split-loading sketch follows the table.) |
| Hardware Specification | Yes | For model training, we employ NVIDIA V100 GPUs in configurations of 3, 4, and 8 units for the NYUDv2, SUN RGB-D, and DeLiVER datasets, respectively. |
| Software Dependencies | No | The paper mentions 'PyTorch code' and uses frameworks like 'SegFormer' and 'Swin Transformer' but does not specify version numbers for PyTorch or any other software libraries required for replication. |
| Experiment Setup | Yes | Our training hyper-parameters are developed by following the methodologies from the TokenFusion (Wang et al., 2022b) and CMNeXt (Zhang et al., 2023b) codebases. ... Within our proposed GeminiFusion model, we configure the number of attention heads to 8. To mitigate the risk of overfitting, we set the drop path rate to 0.4, while the drop rate remains at 0.0. Conversely, for the DeLiVER dataset, our foundation training hyper-parameters are the same as those of CMNeXt, which necessitates a smaller backbone. Consequently, we reduce the drop path rate to 0.2. All other parameters, including batch size, optimizer, weight decay, and learning rate scheduler, remain in line with CMNeXt's original configuration, except for the learning rate, which is modified to 2e-4. (A configuration sketch collecting these values follows the table.) |
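
For the Dataset Splits row, the 3712/3769 partition of the 7481 KITTI training samples can be loaded and sanity-checked in a few lines. This is a minimal sketch under assumptions: the `ImageSets/train.txt` and `ImageSets/val.txt` file names follow the common community convention for the standard split, and `read_split` is a helper introduced here for illustration; the paper does not describe how its split files are organized.

```python
# Hypothetical loader for the standard KITTI train/val partition.
# File names follow the community convention, not the paper itself.
from pathlib import Path

def read_split(split_file: str) -> list[str]:
    """Return the list of sample IDs (e.g. '000000') in a split file."""
    return Path(split_file).read_text().split()

train_ids = read_split("ImageSets/train.txt")  # expected: 3712 samples
val_ids = read_split("ImageSets/val.txt")      # expected: 3769 samples

# The two splits should jointly cover all original training samples.
assert len(train_ids) + len(val_ids) == 7481
```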
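
The hyper-parameters quoted in the Experiment Setup row can be gathered into a single configuration object for reference. This is an illustrative sketch, not the authors' code: the class `GeminiFusionTrainConfig` and its field names are hypothetical; only the values (8 attention heads, drop path rate 0.4 or 0.2, drop rate 0.0, learning rate 2e-4) come from the quoted text.

```python
# Hypothetical configuration object summarizing the quoted training setup.
# Names are illustrative; values are taken from the Experiment Setup row.
from dataclasses import dataclass

@dataclass
class GeminiFusionTrainConfig:
    num_attention_heads: int = 8   # attention heads in the fusion module
    drop_path_rate: float = 0.4    # 0.4 for NYUDv2 / SUN RGB-D setups
    drop_rate: float = 0.0         # dropout kept at 0.0
    learning_rate: float = 2e-4    # modified from the CMNeXt default
    # Batch size, optimizer, weight decay, and LR scheduler follow the
    # TokenFusion / CMNeXt codebases and are not restated in the paper.

nyudv2_cfg = GeminiFusionTrainConfig()
# DeLiVER uses a smaller backbone, so the drop path rate is lowered.
deliver_cfg = GeminiFusionTrainConfig(drop_path_rate=0.2)
```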