Active Contrastive Set Mining for Robust Audio-Visual Instance Discrimination

Authors: Hanyu Xuan, Yihong Xu, Shuo Chen, Zhiliang Wu, Jian Yang, Yan Yan, Xavier Alameda-Pineda

IJCAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments conducted on both action and sound recognition on multiple datasets show the remarkably improved performance of our method."
Researcher Affiliation | Academia | (1) School of Computer Science and Engineering, Nanjing University of Science and Technology, China; (2) Inria, University Grenoble Alpes, CNRS, Grenoble INP, LJK, Grenoble, France; (3) RIKEN Center for Advanced Intelligence Project, Japan; (4) Department of Computer Science, Illinois Institute of Technology, USA
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about the release of source code or a link to a code repository.
Open Datasets | Yes | "We separately utilize Kinetics-400 dataset containing 240K videos (Sect.5.2) and AudioSet dataset with a random sample of 100K videos (Sect.5.3) to pre-train our model. ... To evaluate the visual representations, we compare the transfer performance of action recognition with previous self-supervised methods on UCF-101 and HMDB-51 datasets. ... We evaluate the audio representations on ESC-50 and DCASE datasets for sound recognition."
Dataset Splits | No | The paper mentions evaluating performance on test samples and training classifiers, but does not provide specific train/validation/test split percentages or counts for any dataset used in their experiments.
Hardware Specification | No | The paper does not provide specific details about the hardware used, such as GPU or CPU models.
Software Dependencies | No | The paper does not list specific version numbers for software dependencies (e.g., programming languages, libraries, or frameworks).
Experiment Setup | Yes | "For fair comparisons, we follow the same settings of [Morgado et al., 2021b]. Respectively, the input of video encoder is set as 8 frames of size 224 × 224 (Sect.5.2) and 16 frames of size 122 × 122 (Sect.5.3). The spectrogram size is set as 200 × 257 (Sect.5.2) and 100 × 129 (Sect.5.3). ... The temperature parameter τ in Eq.1 and Eq.5 is set to 0.07. The momentum coefficient m is set to 0.9. The number of semantic libraries C is set to 50. The size of the contrastive set K is set to 8192. The model is trained with the Adam optimizer for 400 epochs with a learning rate of 1e-4, weight decay of 1e-5, and batch size of 256."
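
Since the paper provides neither pseudocode nor a software-dependency list (see the rows above), the following is a minimal sketch, assuming PyTorch, of how the reported hyperparameters plug into a generic temperature-scaled InfoNCE objective and the quoted Adam settings. The projection head, the function names, and the way negatives are supplied are all illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn.functional as F

    # Hyperparameters quoted from the paper (Sect. 5); the framework (PyTorch)
    # and everything else in this sketch are assumptions.
    TAU = 0.07            # temperature τ in Eq. 1 and Eq. 5
    MOMENTUM_M = 0.9      # momentum coefficient m
    NUM_LIBRARIES_C = 50  # number of semantic libraries C (not used in this sketch)
    SET_SIZE_K = 8192     # contrastive set size K

    def info_nce(query, positive, negatives, tau=TAU):
        """Generic temperature-scaled InfoNCE loss. The paper's contribution
        is *which* K negatives populate the contrastive set (actively mined
        via the C semantic libraries); this sketch takes them as given."""
        query = F.normalize(query, dim=-1)           # (B, D)
        positive = F.normalize(positive, dim=-1)     # (B, D)
        negatives = F.normalize(negatives, dim=-1)   # (K, D)
        l_pos = (query * positive).sum(dim=-1, keepdim=True)  # (B, 1)
        l_neg = query @ negatives.t()                          # (B, K)
        logits = torch.cat([l_pos, l_neg], dim=1) / tau
        # The positive sits at index 0 of each row of logits.
        labels = torch.zeros(query.size(0), dtype=torch.long, device=query.device)
        return F.cross_entropy(logits, labels)

    # Hypothetical projection head standing in for the audio/video encoders,
    # which follow [Morgado et al., 2021b] per the Experiment Setup row.
    model = torch.nn.Linear(512, 128)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
    # Reported schedule: 400 epochs at batch size 256.

The active mining step itself, i.e. how the contrastive set of size K is selected and how the momentum coefficient m updates the key pathway, is the paper's core method and cannot be reconstructed from the reported settings alone.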