Balancing Multimodal Learning via Online Logit Modulation
Authors: Daoming Zong, Chaoyue Ding, Baoxiang Li, Jiakui Li, Ken Zheng
IJCAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evidence shows that our approach brings significant enhancements over baselines on a wide range of multimodal tasks, covering video, audio, text, image, and depth modalities. (Section 4: Experiment) |
| Researcher Affiliation | Industry | Daoming Zong, Chaoyue Ding, Baoxiang Li, Jiakui Li and Ken Zheng, SenseTime Research, {ecnuzdm, cydingcs}@gmail.com, {libaoxiang, lijiakui, zhengken}@sensetime.com |
| Pseudocode | Yes | Algorithm 1: Online Logit Modulation (OLM). (A hedged, illustrative sketch of the general idea, not the paper's algorithm, appears below the table.) |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for their methodology is publicly available. |
| Open Datasets | Yes | Kinetics-Sounds (KS) [Arandjelovic and Zisserman, 2017] is a subset of 36 human action classes selected from the Kinetics dataset [Kay et al., 2017]... VGGSound [Chen et al., 2020a] is a large-scale video dataset... AVE [Tian et al., 2018] is a subset of the Audio Set dataset [Gemmeke et al., 2017]... CREMA-D [Cao et al., 2014] is a multimodal dataset... MER-MULTI is a subchallenge of the MER2023 [Lian et al., 2023]... SUNRGBD V1 [Song et al., 2015] comprises 10,335 RGBD images... |
| Dataset Splits | Yes | This dataset contains 19,000 10-second video clips, with 15,000 clips used for training, 1,900 for validation, and 1,900 for testing. (Kinetics-Sounds) After filtering out unavailable videos, we obtained 168,618 videos for training and validation, and 13,954 for testing. (VGGSound) The training, validation, and test sets are divided into 3,339, 402, and 402 samples, respectively. (AVE) It consists of 6,698 samples for training and validation, with 744 samples for testing. (CREMA-D) (These quoted split sizes are collected into a small configuration sketch below the table.) |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as GPU or CPU models, memory, or specific computing environments. |
| Software Dependencies | No | The paper mentions software like 'librosa' and specific models/toolkits (e.g., 'ResNet18', 'OpenFace toolkit', 'MANet', 'HuBERT', 'MacBERT'), but it does not specify any version numbers for these software components, which is required for reproducible dependency descriptions. |
| Experiment Setup | Yes | Specifically, for OLM-Conv, we use ResNet18 [He et al., 2016] as the encoders following previous works [Zhao et al., 2018; Peng et al., 2022]. AVE, Kinetics-Sounds, and VGGSound datasets consist of videos with a duration of 10 seconds each. To process these videos, we extract frames at a rate of 1 fps and uniformly sample 3 frames from each clip, which serve as the visual input for our model. For the audio data, we utilize a window of length 512 with an overlap of 353 to transform the raw audio data into spectrograms of size 257 × 1004 using the librosa [McFee et al., 2015] library. As for SUNRGBD, we adopt PlacesCNN [Zhou et al., 2014]... Regarding CREMA-D, its video clips last from 2 to 3 seconds. From each clip in CREMA-D, we extract 1 frame and use a window of length 512 with an overlap of 353 to convert the audio data into spectrograms of size 257 × 299. For MER-MULTI, we first extract human face images using the OpenFace toolkit. The pre-trained MANet [Zhao et al., 2021], HuBERT [Hsu et al., 2021], and MacBERT [Cui et al., 2020] models were employed for the extraction of visual, audio, and textual features, respectively. For OLM-Trans, we stack six standard transformer blocks and IRFB blocks (cf. Sup. D for implementation details). (A hedged librosa sketch reproducing the quoted spectrogram shapes follows the table.) |
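The Pseudocode row names Algorithm 1 (Online Logit Modulation), but the extracted section does not reproduce its steps. The sketch below is therefore only a hypothetical illustration of the general idea the title suggests: scale each modality's logits by coefficients that are updated online so that a dominant modality does not drown out the others. The function names, the softmax-over-losses update rule, and the `temperature` parameter are all assumptions and should not be read as the paper's Algorithm 1.

```python
# Hypothetical sketch of per-modality logit modulation (NOT the paper's Algorithm 1).
# Idea: scale each modality's logits by an online coefficient so that no single
# modality dominates the fused prediction during training.
import torch
import torch.nn.functional as F

def modulated_fusion(logits_per_modality, coeffs):
    """Fuse modality logits after scaling each by its modulation coefficient."""
    return sum(c * z for c, z in zip(coeffs, logits_per_modality))

def update_coeffs(losses, temperature=1.0):
    """Toy online update (assumption): give a larger coefficient to the currently
    weaker (higher-loss) modality so it contributes more to the fused logits."""
    losses = torch.tensor(losses)
    return torch.softmax(losses / temperature, dim=0).tolist()

# Usage with two modalities (audio, visual), batch of 4, 10 classes.
za, zv = torch.randn(4, 10), torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
coeffs = [0.5, 0.5]                              # start balanced
fused = modulated_fusion([za, zv], coeffs)       # fused logits for this step
loss_a = F.cross_entropy(za, labels)
loss_v = F.cross_entropy(zv, labels)
coeffs = update_coeffs([loss_a.item(), loss_v.item()])  # refreshed every step
```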
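The split sizes quoted in the Dataset Splits row can be restated as a small configuration mapping, which is sometimes convenient for sanity-checking data loaders. Only the numbers come from the paper; the `DATASET_SPLITS` name and the dictionary layout are illustrative.

```python
# Split sizes as quoted above (counts of clips/samples). The dict layout and
# the name DATASET_SPLITS are illustrative, not from the paper.
DATASET_SPLITS = {
    "Kinetics-Sounds": {"train": 15_000, "val": 1_900, "test": 1_900},
    "VGGSound": {"train+val": 168_618, "test": 13_954},
    "AVE": {"train": 3_339, "val": 402, "test": 402},
    "CREMA-D": {"train+val": 6_698, "test": 744},
}

# e.g., report the overall number of clips covered by these splits
total = sum(sum(splits.values()) for splits in DATASET_SPLITS.values())
print(total)
```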
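The Experiment Setup row specifies a 512-sample window with a 353-sample overlap (i.e., a 159-sample hop), yielding 257 × 1004 spectrograms for the 10-second clips and 257 × 299 for CREMA-D. A minimal librosa sketch that reproduces those shapes is given below; the 16 kHz sampling rate and `center=False` framing are assumptions chosen because they match the reported dimensions, and are not stated in the paper.

```python
import librosa
import numpy as np

def audio_to_spectrogram(path, sr=16_000, win=512, overlap=353):
    """Magnitude spectrogram with a 512-sample window and 353-sample overlap.

    Assumptions (not stated in the paper): 16 kHz sampling rate and
    center=False framing; with these, a 10 s clip gives a 257 x 1004 array.
    """
    y, _ = librosa.load(path, sr=sr)
    hop = win - overlap                      # 512 - 353 = 159-sample hop
    spec = librosa.stft(y, n_fft=win, hop_length=hop,
                        win_length=win, center=False)
    return np.abs(spec)                      # shape: (257, n_frames)

# Sanity check on a synthetic 10 s "clip": expect (257, 1004).
y = np.zeros(10 * 16_000, dtype=np.float32)
spec = np.abs(librosa.stft(y, n_fft=512, hop_length=159,
                           win_length=512, center=False))
print(spec.shape)  # (257, 1004)
```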