Towards Effective Multi-Modal Interchanges in Zero-Resource Sounding Object Localization
Authors: Yang Zhao, Chen Zhang, Haifeng Huang, Haoyuan Li, Zhou Zhao
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The extensive experiments on various publicly-available benchmarks demonstrate that TURN can achieve competitive performance compared with the state-of-the-art approaches without using any data in this field, which verifies the feasibility of our proposed mechanisms and strategies. |
| Researcher Affiliation | Collaboration | Yang Zhao, Chen Zhang, Haifeng Huang, Haoyuan Li, and Zhou Zhao; Zhejiang University; Alibaba-Zhejiang University Joint Research Institute of Frontier Technologies; {awalk, zc99, huanghaifeng, lihaoyuan, zhaozhou}@zju.edu.cn |
| Pseudocode | No | The paper describes the model architecture and training process using textual descriptions and mathematical equations (e.g., Equations 1-14), but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' block, nor does it present structured steps in a code-like format. |
| Open Source Code | Yes | The code is available at https://github.com/AwalkZY/TURN. |
| Open Datasets | Yes | Image Grounding Datasets We choose RefCOCO [65]/RefCOCO+ [65]/RefCOCOg [37] as the image grounding datasets and conduct in-depth studies on the RefCOCOg dataset... Audio Retrieval Datasets As for the audio-text datasets, we use Clotho [19] and AudioCaps [29]... Sounding Object Localization Datasets For the dev set, we use the annotated subset of Flickr SoundNet [48]... For the test set, we choose VGGSS [14] and MUSIC [70]. |
| Dataset Splits | Yes | We follow the data partitions from Yu et al. [65] and only use the training set in the overall learning process. (Section 4.1) and For the dev set, we use the annotated subset of Flickr SoundNet [48]... For the test set, we choose VGGSS [14] and MUSIC [70]. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU models (e.g., NVIDIA A100), CPU models, or memory specifications. |
| Software Dependencies | No | The paper describes the use of various components like 'pre-trained DETR backbone [10]' and 'transformer encoders', but it does not specify software dependencies like programming languages (e.g., Python), frameworks (e.g., PyTorch, TensorFlow), or library versions (e.g., CUDA 11.1). |
| Experiment Setup | Yes | Given the annotation $b$ from the source domain, we follow the learning strategy adopted in [17] and apply a combination of regression-based and IoU-based loss functions to optimize the localization stream, formulated as $\mathcal{L}_{loc} = \lambda_1 \mathcal{L}_{reg}(\hat{b}, b) + \lambda_2 \mathcal{L}_{iou}(\hat{b}, b)$... $\mathcal{L}_{contra} = -\frac{1}{b}\sum_{i=1}^{b} \log\frac{\exp(D(t_i^m, a_i^m)/\tau_c)}{\sum_{j=1}^{b} \exp(D(t_i^m, a_j^m)/\tau_c)}$... $\mathcal{L} = \mathcal{L}_{loc} + \mathcal{L}_{aln}$ to learn our proposed architecture, where $\lambda_1, \ldots, \lambda_5$ are the balancing factors to control the magnitude of corresponding terms. (Section 3.5) and when the codebook number is set to be 2 and the total codebook size is set to be 512, TURN gets the highest cIoU and AUC. (Code sketches of the quoted loss terms and the codebook setting follow the table.) |
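
For readers tracing the quoted objective, the sketch below gives one plausible PyTorch rendering of $\mathcal{L}_{loc}$ and $\mathcal{L}_{contra}$. It is a minimal sketch under stated assumptions, not the authors' released implementation: the L1 choice for $\mathcal{L}_{reg}$ and the GIoU choice for $\mathcal{L}_{iou}$ follow common practice for the strategy of [17], and `tau_c`, the `lambda` defaults, and the cosine form of the similarity $D(\cdot,\cdot)$ are all assumptions.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou_loss


def localization_loss(b_hat, b, lambda1=1.0, lambda2=1.0):
    # L_loc = lambda1 * L_reg(b_hat, b) + lambda2 * L_iou(b_hat, b).
    # L1 for L_reg and GIoU for L_iou are assumptions in the spirit of [17];
    # the lambda defaults are placeholders, not values reported in the paper.
    # Boxes are (N, 4) tensors in (x1, y1, x2, y2) format.
    l_reg = F.l1_loss(b_hat, b)
    l_iou = generalized_box_iou_loss(b_hat, b, reduction="mean")
    return lambda1 * l_reg + lambda2 * l_iou


def contrastive_loss(t, a, tau_c=0.07):
    # L_contra over a batch of b matched text/audio embeddings t, a: (b, d);
    # pair (t_i, a_i) is the positive. tau_c = 0.07 is an assumed temperature.
    # Cross-entropy against the diagonal reproduces
    # -1/b * sum_i log( exp(D(t_i, a_i)/tau) / sum_j exp(D(t_i, a_j)/tau) )
    # with D taken as cosine similarity (an assumption).
    t = F.normalize(t, dim=-1)
    a = F.normalize(a, dim=-1)
    logits = t @ a.T / tau_c  # D(t_i, a_j) for every text/audio pair
    targets = torch.arange(t.size(0), device=t.device)
    return F.cross_entropy(logits, targets)
```

Writing the batch-wise contrastive term as a cross-entropy over the similarity matrix is numerically stable and matches the quoted $-\frac{1}{b}\sum_i \log(\cdot)$ form row for row.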
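Likewise, the quoted ablation setting (codebook number 2, total codebook size 512) can be read as the following configuration; the even split of entries across codebooks and the embedding dimension are assumptions, since the table row reports only the counts 2 and 512.

```python
import torch.nn as nn

# Hypothetical rendering of the reported best ablation setting: 2 codebooks
# whose sizes sum to 512. The even split and the embedding dimension (256)
# are assumptions; the quoted text reports only the counts 2 and 512.
num_codebooks, total_size, dim = 2, 512, 256
codebooks = nn.ModuleList(
    nn.Embedding(total_size // num_codebooks, dim)
    for _ in range(num_codebooks)
)
```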