AND: Audio Network Dissection for Interpreting Deep Acoustic Models
Authors: Tung-Yu Wu, Yu-Xiang Lin, Tsui-Wei Weng
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments are conducted to verify AND's precise and informative descriptions. |
| Researcher Affiliation | Academia | National Taiwan University, Taipei, Taiwan; HDSI, UC San Diego, CA, USA. |
| Pseudocode | Yes | Algorithm 1 GET-UNINTERPRETABLE-NEURONS |
| Open Source Code | Yes | Our source code is available at https://github.com/Trustworthy-ML-Lab/Audio_Network_Dissection |
| Open Datasets | Yes | For experiments in Sections 4.1–4.5, we utilize the ESC50 (Piczak, 2015) as Dp and consider all its 50 audio classes as Dc. |
| Dataset Splits | No | The paper mentions using a 'testing set' (e.g., 'ESC50 testing set') and 'testing accuracies' but does not provide explicit details about the train/validation/test splits (e.g., percentages, counts, or a clear methodology for partitioning the data). |
| Hardware Specification | Yes | Captioning the entire ESC50 dataset takes approximately 40 hours on an NVIDIA A6000 GPU, with batch size set to 1. and The summarization process for all linear layers in AST and BEATs takes around 12 and 10 hours respectively on an NVIDIA RTX A6000 GPU. |
| Software Dependencies | Yes | We adopt Llama-2-chat-13B (Touvron et al., 2023) for all LLM-related experiments. The vllm package (Kwon et al., 2023) is employed to boost the inference efficiency, which integrates an efficient attention mechanism and other speed-up techniques to create a memory-efficient LLM inference engine. The summarization process for all linear layers in AST and BEATs takes around 12 and 10 hours respectively on an NVIDIA RTX A6000 GPU. and For the CLIP model, we employ ViT-B/32. For the CLAP model, we utilize the 630k-audioset-best version. To capture the representation of textual artifacts in AND, we use the all-MiniLM-L12-v2 pre-trained model provided by Reimers & Gurevych (2019). |
| Experiment Setup | Yes | For all experiments, we use K = 5 to select top-K highly/lowly activated samples, t = 0.7 to remove similar sentences in the summary calibration module. and To obtain captions Dd of audio clips in Dp, we adopt SALMONN (Tang et al., 2024) as the open-domain audio captioning model. We feed each audio ai into SALMONN to acquire di, forming Dd = {d1, ..., dN}. and For the CLIP model, we employ ViT-B/32. For the CLAP model, we utilize the 630k-audioset-best version. To capture the representation of textual artifacts in AND, we use the all-MiniLM-L12-v2 pre-trained model provided by Reimers & Gurevych (2019). and We adopt Llama-2-chat-13B (Touvron et al., 2023) for all LLM-related experiments. |
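
The Software Dependencies row above lists the reported stack: Llama-2-chat-13B served through the vllm package for summarization prompts, and an all-MiniLM-L12-v2 sentence embedder for representing textual artifacts. The sketch below shows how that stack could be loaded; the Hugging Face model identifiers and the example prompt are assumptions for illustration, not taken from the paper's released code.

```python
# Minimal sketch of the reported dependency stack (not the authors' code).
from vllm import LLM, SamplingParams
from sentence_transformers import SentenceTransformer

# Llama-2-chat-13B served through vllm for the LLM-related steps (assumed HF identifier).
llm = LLM(model="meta-llama/Llama-2-13b-chat-hf")
sampling = SamplingParams(temperature=0.0, max_tokens=256)

# Sentence embedder used to represent textual artifacts (summaries, concepts).
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")

# Illustrative prompt; the paper's actual prompts are documented in its appendix/repo.
prompt = "Summarize the common concept shared by these audio captions: ..."
outputs = llm.generate([prompt], sampling)
summary = outputs[0].outputs[0].text
summary_vec = embedder.encode(summary)
```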
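
The Experiment Setup row fixes K = 5 for selecting the top-K highly/lowly activated samples per neuron and t = 0.7 as the similarity threshold for removing near-duplicate sentences during summary calibration. A minimal sketch of those two steps follows, assuming per-neuron activations and sentence embeddings have already been computed; all function and variable names are illustrative rather than the authors'.

```python
import numpy as np

def topk_samples(activations: np.ndarray, k: int = 5):
    """Return indices of the K most and K least activated samples for one neuron."""
    order = np.argsort(activations)            # ascending activation order
    return order[-k:][::-1], order[:k]         # (highly activated, lowly activated)

def calibrate_summary(sentences, embeddings: np.ndarray, t: float = 0.7):
    """Greedily drop sentences whose cosine similarity to an already-kept sentence exceeds t."""
    kept, kept_vecs = [], []
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    for sent, vec in zip(sentences, unit):
        if all(float(vec @ kv) <= t for kv in kept_vecs):
            kept.append(sent)
            kept_vecs.append(vec)
    return kept
```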