AND: Audio Network Dissection for Interpreting Deep Acoustic Models
Authors: Tung-Yu Wu, Yu-Xiang Lin, Tsui-Wei Weng
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments are conducted to verify AND's precise and informative descriptions. |
| Researcher Affiliation | Academia | National Taiwan University, Taipei, Taiwan; HDSI, UC San Diego, CA, USA. |
| Pseudocode | Yes | Algorithm 1 GET-UNINTERPRETABLE-NEURONS |
| Open Source Code | Yes | Our source code is available at https://github.com/Trustworthy-ML-Lab/Audio_Network_Dissection |
| Open Datasets | Yes | For experiments in Sections 4.1–4.5, we utilize the ESC50 (Piczak, 2015) as Dp and consider all its 50 audio classes as Dc. |
| Dataset Splits | No | The paper mentions using a 'testing set' (e.g., 'ESC50 testing set') and 'testing accuracies' but does not provide explicit details about the train/validation/test splits (e.g., percentages, counts, or a clear methodology for partitioning the data). |
| Hardware Specification | Yes | Captioning the entire ESC50 dataset takes approximately 40 hours on an NVIDIA A6000 GPU, with batch size set to 1. and The summarization process for all linear layers in AST and BEATs takes around 12 and 10 hours respectively on an NVIDIA RTX A6000 GPU. |
| Software Dependencies | Yes | We adopt Llama-2-chat-13B (Touvron et al., 2023) for all LLM-related experiments. The vllm package (Kwon et al., 2023) is employed to boost the inference efficiency, which integrates an efficient attention mechanism and other speed-up techniques to create a memory-efficient LLM inference engine. The summarization process for all linear layers in AST and BEATs takes around 12 and 10 hours respectively on an NVIDIA RTX A6000 GPU. and For the CLIP model, we employ ViT-B/32. For the CLAP model, we utilize the 630k-audioset-best version. To capture the representation of textual artifacts in AND, we use the all-MiniLM-L12-v2 pre-trained model provided by Reimers & Gurevych (2019). |
| Experiment Setup | Yes | For all experiments, we use K = 5 to select top-K highly/lowly activated samples, t = 0.7 to remove similar sentences in the summary calibration module. and To obtain captions Dd of audio clips in Dp, we adopt SALMONN (Tang et al., 2024) as the open-domain audio captioning model. We feed each audio ai into SALMONN to acquire di, forming Dd = {d1, ..., dN}. and For the CLIP model, we employ ViT-B/32. For the CLAP model, we utilize the 630k-audioset-best version. To capture the representation of textual artifacts in AND, we use the all-MiniLM-L12-v2 pre-trained model provided by Reimers & Gurevych (2019). and We adopt Llama-2-chat-13B (Touvron et al., 2023) for all LLM-related experiments. |
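
The Software Dependencies row above lists the reported stack: Llama-2-chat-13B served through the vllm package for summarization prompts, and an all-MiniLM-L12-v2 sentence embedder for representing textual artifacts. The sketch below shows how that stack could be loaded; the Hugging Face model identifiers and the example prompt are assumptions for illustration, not taken from the paper's released code.

```python
# Minimal sketch of the reported dependency stack (not the authors' code).
from vllm import LLM, SamplingParams
from sentence_transformers import SentenceTransformer

# Llama-2-chat-13B served through vllm for the LLM-related steps (assumed HF identifier).
llm = LLM(model="meta-llama/Llama-2-13b-chat-hf")
sampling = SamplingParams(temperature=0.0, max_tokens=256)

# Sentence embedder used to represent textual artifacts (summaries, concepts).
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")

# Illustrative prompt; the paper's actual prompts are documented in its appendix/repo.
prompt = "Summarize the common concept shared by these audio captions: ..."
outputs = llm.generate([prompt], sampling)
summary = outputs[0].outputs[0].text
summary_vec = embedder.encode(summary)
```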
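
The Experiment Setup row fixes K = 5 for selecting the top-K highly/lowly activated samples per neuron and t = 0.7 as the similarity threshold for removing near-duplicate sentences during summary calibration. A minimal sketch of those two steps follows, assuming per-neuron activations and sentence embeddings have already been computed; all function and variable names are illustrative rather than the authors'.

```python
import numpy as np

def topk_samples(activations: np.ndarray, k: int = 5):
    """Return indices of the K most and K least activated samples for one neuron."""
    order = np.argsort(activations)            # ascending activation order
    return order[-k:][::-1], order[:k]         # (highly activated, lowly activated)

def calibrate_summary(sentences, embeddings: np.ndarray, t: float = 0.7):
    """Greedily drop sentences whose cosine similarity to an already-kept sentence exceeds t."""
    kept, kept_vecs = [], []
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    for sent, vec in zip(sentences, unit):
        if all(float(vec @ kv) <= t for kv in kept_vecs):
            kept.append(sent)
            kept_vecs.append(vec)
    return kept
```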