Predictive Dynamic Fusion

Authors: Bing Cao, Yinan Xia, Yi Ding, Changqing Zhang, Qinghua Hu

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments on multiple benchmarks confirm our superiority."
Researcher Affiliation | Academia | "¹College of Intelligence and Computing, Tianjin University, Tianjin, China; ²Tianjin Key Lab of Machine Learning, Tianjin, China."
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks (i.e., clearly labeled algorithm sections or code-like formatted procedures).
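
Since no pseudocode is given, the following is a minimal sketch, assuming a generic confidence-weighted late-fusion pattern in PyTorch; the module name, the per-modality heads, and the softmax weighting are illustrative assumptions, not the authors' PDF implementation.

    import torch
    import torch.nn as nn

    class ConfidenceWeightedFusion(nn.Module):
        """Illustrative sketch only, not the authors' method: each modality
        gets a classifier head plus a scalar confidence head, and the fused
        logits are a confidence-weighted sum across modalities."""

        def __init__(self, feat_dim, num_classes, num_modalities=2):
            super().__init__()
            self.heads = nn.ModuleList(
                [nn.Linear(feat_dim, num_classes) for _ in range(num_modalities)])
            self.confidence = nn.ModuleList(
                [nn.Linear(feat_dim, 1) for _ in range(num_modalities)])

        def forward(self, feats):
            # feats: list of (batch, feat_dim) tensors, one per modality.
            logits = torch.stack(
                [head(f) for head, f in zip(self.heads, feats)], dim=1)
            conf = torch.cat(
                [c(f) for c, f in zip(self.confidence, feats)], dim=1)
            weights = torch.softmax(conf, dim=1)   # (batch, num_modalities)
            return (weights.unsqueeze(-1) * logits).sum(dim=1)
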
Open Source Code | Yes | "Our code is available at https://github.com/Yinan-Xia/PDF."
Open Datasets | Yes | "Datasets. We evaluate the proposed method across various multimodal classification tasks," including:
- Image-text classification: UPMC FOOD101 (Wang et al., 2015) contains noisy images and texts collected in uncontrolled environments, covering about 100,000 recipes across 101 food categories; MVSA (Niu et al., 2016) is a sentiment analysis dataset of matched pairs of users' texts and images.
- Scene recognition: NYU Depth V2 (Silberman et al., 2012) is an indoor-scenes dataset with image pairs recorded by both RGB and depth cameras.
- Emotion recognition: CREMA-D (Cao et al., 2014) is an audio-visual dataset for multimodal emotion recognition, covering six basic emotional states (happy, sad, anger, fear, disgust, and neutral) expressed through spoken sentences.
- Face recognition: PIE (Sim et al., 2003) is a pose, illumination, and expression database of over 40,000 facial images of 68 people.
Dataset Splits | No | The paper mentions a "validation set" in Appendix D.2 ("across the entire validation set") and discusses training epochs and batch size, implying data splits, but it does not specify exact percentages or sample counts for the training, validation, and test splits.
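
Because the split protocol is unreported, any reproduction must pick its own; the sketch below uses a hypothetical 80/10/10 split with a fixed seed, and both the ratios and the seed are assumptions rather than values from the paper.

    import torch
    from torch.utils.data import random_split

    def make_splits(dataset, seed=42):
        # Hypothetical 80/10/10 split; the paper does not report its ratios.
        n = len(dataset)
        n_train, n_val = int(0.8 * n), int(0.1 * n)
        n_test = n - n_train - n_val
        gen = torch.Generator().manual_seed(seed)  # fixed seed for reproducibility
        return random_split(dataset, [n_train, n_val, n_test], generator=gen)
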
Hardware Specification | Yes | "All the experiments were conducted on an NVIDIA A6000 GPU, using PyTorch with default parameters for all methods."
Software Dependencies | No | The paper names PyTorch as the software used but does not specify a version number or any other software dependencies with version details.
Experiment Setup | Yes | "The network was trained for 100 epochs utilizing the Adam optimizer with β1 = 0.9, β2 = 0.999, weight decay of 0.01, dropout rate of 0.1, and a batch size of 16. The initial learning rate was chosen from the set {1e-8, 5e-5, 1e-4}. Specifically, for image-text classification, the initial learning rate was 5e-5; for scene recognition, it was 1e-8 for the second layer of the confidence predictor and 1e-4 for all others; for emotion recognition, it was set to 1e-3."
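
The reported optimizer settings map directly onto torch.optim.Adam; in the sketch below the stand-in network and the choice of the image-text learning rate are placeholders, assuming standard PyTorch APIs.

    import torch
    import torch.nn as nn

    # Toy stand-in network; the paper's actual backbones are task-specific.
    # Dropout rate 0.1 as reported.
    model = nn.Sequential(nn.Linear(512, 101), nn.Dropout(p=0.1))

    # Reported: Adam with betas=(0.9, 0.999), weight decay 0.01, batch size 16,
    # 100 epochs; the lr shown is the reported image-text value (5e-5).
    optimizer = torch.optim.Adam(
        model.parameters(), lr=5e-5, betas=(0.9, 0.999), weight_decay=0.01)

    # The per-layer rates reported for scene recognition correspond to PyTorch
    # parameter groups; layer names here are placeholders:
    # torch.optim.Adam(
    #     [{"params": confidence_layer2.parameters(), "lr": 1e-8},
    #      {"params": other_params, "lr": 1e-4}],
    #     betas=(0.9, 0.999), weight_decay=0.01)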