Towards Effective Multi-Modal Interchanges in Zero-Resource Sounding Object Localization

Authors: Yang Zhao, Chen Zhang, Haifeng Huang, Haoyuan Li, Zhou Zhao

NeurIPS 2022

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | The extensive experiments on various publicly-available benchmarks demonstrate that TURN can achieve competitive performance compared with the state-of-the-art approaches without using any data in this field, which verifies the feasibility of our proposed mechanisms and strategies. |
| Researcher Affiliation | Collaboration | Yang Zhao, Chen Zhang, Haifeng Huang, Haoyuan Li, and Zhou Zhao; Zhejiang University; Alibaba-Zhejiang University Joint Research Institute of Frontier Technologies; {awalk, zc99, huanghaifeng, lihaoyuan, zhaozhou}@zju.edu.cn |
| Pseudocode | No | The paper describes the model architecture and training process using textual descriptions and mathematical equations (e.g., Equations 1-14), but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' block, nor does it present structured steps in a code-like format. |
| Open Source Code | Yes | The code is available at https://github.com/AwalkZY/TURN. |
| Open Datasets | Yes | Image Grounding Datasets: We choose RefCOCO [65]/RefCOCO+ [65]/RefCOCOg [37] as the image grounding datasets and conduct in-depth studies on the RefCOCOg dataset... Audio Retrieval Datasets: As for the audio-text datasets, we use Clotho [19] and AudioCaps [29]... Sounding Object Localization Datasets: For the dev set, we use the annotated subset of Flickr SoundNet [48]... For the test set, we choose VGGSS [14] and MUSIC [70]. |
| Dataset Splits | Yes | We follow the data partitions from Yu et al. [65] and only use the training set in the overall learning process. (Section 4.1) and For the dev set, we use the annotated subset of Flickr SoundNet [48]... For the test set, we choose VGGSS [14] and MUSIC [70]. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU models (e.g., NVIDIA A100), CPU models, or memory specifications. |
| Software Dependencies | No | The paper describes the use of various components like 'pre-trained DETR backbone [10]' and 'transformer encoders', but it does not specify software dependencies such as programming languages (e.g., Python), frameworks (e.g., PyTorch, TensorFlow), or library versions (e.g., CUDA 11.1). |
| Experiment Setup | Yes | Given the annotation $b$ from the source domain, we follow the learning strategy adopted in [17] and apply a combination of regression-based and IoU-based loss functions to optimize the localization stream, formulated as $\mathcal{L}_{loc} = \lambda_1 \mathcal{L}_{reg}(\hat{b}, b) + \lambda_2 \mathcal{L}_{iou}(\hat{b}, b)$... $\mathcal{L}_{contra} = -\frac{1}{b} \sum_{i=1}^{b} \log \frac{\exp(D(t_i^m, a_i^m)/\tau_c)}{\sum_{j=1}^{b} \exp(D(t_i^m, a_j^m)/\tau_c)}$... $\mathcal{L} = \mathcal{L}_{loc} + \mathcal{L}_{aln}$ to learn our proposed architecture, where $\lambda_1, \ldots, \lambda_5$ are the balancing factors to control the magnitude of the corresponding terms. (Section 3.5) and when the codebook number is set to 2 and the total codebook size is set to 512, TURN gets the highest cIoU and AUC. |
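
As a concrete illustration of the objective quoted in the last row, the sketch below implements the two loss terms in PyTorch. It is a minimal reading of the quote, not the authors' code: the smooth-L1 form of $\mathcal{L}_{reg}$, the generalized-IoU form of $\mathcal{L}_{iou}$, cosine similarity for $D$, and the default values of `lam1`, `lam2`, and `tau_c` are all assumptions, since the paper excerpt only names "regression-based and IoU-based" losses and an InfoNCE-style contrastive term.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou_loss  # requires torchvision >= 0.13

def localization_loss(pred_box, gt_box, lam1=1.0, lam2=1.0):
    """L_loc = lam1 * L_reg(b_hat, b) + lam2 * L_iou(b_hat, b).

    Boxes are (batch, 4) tensors in (x1, y1, x2, y2) format. Smooth-L1 for
    L_reg and generalized IoU for L_iou are assumptions; the source only
    says "regression-based and IoU-based" losses.
    """
    l_reg = F.smooth_l1_loss(pred_box, gt_box)
    l_iou = generalized_box_iou_loss(pred_box, gt_box, reduction="mean")
    return lam1 * l_reg + lam2 * l_iou

def contrastive_loss(text_emb, audio_emb, tau_c=0.07):
    """InfoNCE over a batch of b matched (t_i, a_i) embedding pairs.

    Each matched pair is a positive; every other audio embedding in the
    batch serves as a negative. D(., .) is taken to be cosine similarity.
    """
    t = F.normalize(text_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    logits = t @ a.t() / tau_c                        # (b, b) similarity matrix
    targets = torch.arange(t.size(0), device=t.device)
    # cross_entropy(logits, targets) = -1/b * sum_i log softmax(logits)_{i,i}
    return F.cross_entropy(logits, targets)
```

In the overall objective $\mathcal{L} = \mathcal{L}_{loc} + \mathcal{L}_{aln}$, a contrastive term like the one above would enter the alignment loss $\mathcal{L}_{aln}$, weighted by the remaining balancing factors among $\lambda_1, \ldots, \lambda_5$.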