Predictive Dynamic Fusion
Authors: Bing Cao, Yinan Xia, Yi Ding, Changqing Zhang, Qinghua Hu
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on multiple benchmarks confirm our superiority. |
| Researcher Affiliation | Academia | (1) College of Intelligence and Computing, Tianjin University, Tianjin, China; (2) Tianjin Key Lab of Machine Learning, Tianjin, China. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks (i.e., clearly labeled algorithm sections or code-like formatted procedures). |
| Open Source Code | Yes | Our code is available at https://github.com/Yinan-Xia/PDF. |
| Open Datasets | Yes | Datasets. We evaluate the proposed method across various multimodal classification tasks. Image-text classification: the UPMC FOOD101 dataset (Wang et al., 2015) contains noisy image-text pairs obtained in uncontrolled environments, covering about 100,000 recipes across 101 food categories; MVSA (Niu et al., 2016) is a sentiment analysis dataset collecting sentiment-labeled matched pairs of user texts and images. Scene recognition: NYU Depth V2 (Silberman et al., 2012) is an indoor-scene dataset with image pairs recorded by both RGB and depth cameras. Emotion recognition: CREMA-D (Cao et al., 2014) is an audio-visual dataset for multimodal emotion recognition, covering six basic emotional states (happy, sad, anger, fear, disgust, and neutral) expressed through spoken sentences. Face recognition: PIE (Sim et al., 2003) is a pose, illumination, and expression database of over 40,000 facial images of 68 people. |
| Dataset Splits | No | The paper mentions the use of a "validation set" in Appendix D.2 ("across the entire validation set") and discusses training epochs and batch size, implying data splits, but it does not specify the exact percentages or sample counts for the training, validation, and test splits. |
| Hardware Specification | Yes | All the experiments were conducted on an NVIDIA A6000 GPU, using PyTorch with default parameters for all methods. |
| Software Dependencies | No | The paper mentions "PyTorch" as the software used but does not specify a version number or any other software dependencies with version details. |
| Experiment Setup | Yes | The network was trained for 100 epochs utilizing the Adam optimizer with β1 = 0.9, β2 = 0.999, weight decay of 0.01, dropout rate of 0.1, and a batch size of 16. The initial learning rate was chosen from the set {1e-8, 5e-5, 1e-4}. Specifically, for image-text classification, the initial learning rate was 5e-5; for scene recognition, it was 1e-8 for the second layer of the confidence predictor and 1e-4 for all others; for emotion recognition, it was set to 1e-3. |
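
For reference, a minimal sketch of the reported optimizer configuration in PyTorch. The `model` and `confidence_predictor` objects are hypothetical placeholders, not the authors' actual classes; only the hyperparameter values (Adam betas, weight decay, dropout, learning rates) come from the paper:

```python
import torch

# Placeholder network; layer sizes are illustrative, not from the paper.
# Dropout rate 0.1 and 101 output classes (UPMC FOOD101) match the report.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 256),
    torch.nn.ReLU(),
    torch.nn.Dropout(p=0.1),
    torch.nn.Linear(256, 101),
)

# Image-text classification: single learning rate of 5e-5, with the
# reported Adam settings (beta1=0.9, beta2=0.999, weight decay 0.01).
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=5e-5,
    betas=(0.9, 0.999),
    weight_decay=0.01,
)

# Scene recognition reportedly uses 1e-8 for the second layer of the
# confidence predictor and 1e-4 for all other parameters; parameter
# groups would express that split (confidence_predictor is hypothetical):
# optimizer = torch.optim.Adam(
#     [
#         {"params": confidence_predictor[1].parameters(), "lr": 1e-8},
#         {"params": model.parameters(), "lr": 1e-4},
#     ],
#     betas=(0.9, 0.999),
#     weight_decay=0.01,
# )
```

Training would then proceed for the reported 100 epochs with batch size 16; the per-task learning rates above follow the paper's stated choices.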