Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Teaching Physical Awareness to LLMs through Sounds
Authors: Weiguo Wang, Andy Nie, Wenrui Zhou, Yi Kai, Chengchen Hu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate reasonable results in both simulated and real-world tasks, such as line-of-sight detection, Doppler effect estimation, and Direction-of-Arrival estimation, paving the way for enabling LLMs to understand the physical world. [...] We train the model using AQA-PHY. The model achieves strong results across all tasks: 0.924 accuracy in line-of-sight detection, 0.181 MAE in Doppler effect estimation, 0.907 MAE in Direction-of-Arrival estimation, 0.903 accuracy in multipath analysis, and 1.599 relative error percentage in range estimation. These demonstrate the feasibility of teaching LLMs to understand physical phenomena through sound. |
| Researcher Affiliation | Collaboration | 1NIO 2Peking University. |
| Pseudocode | No | The paper describes the model architecture, audio encoder, channel simulator, and training process in detail, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | Sound Source. The synthesis workflow begins by sampling sound sources from an existing dataset. For this, we utilize Audio Set (Gemmeke et al., 2017), which offers approximately 2 million 10-second sound clips annotated with over 500 labels. |
| Dataset Splits | No | For each task, we generate 200,000 closed-form datapoints and 10,000 open-form datapoints. The paper mentions training and validation loss in Figure 5, implying splits were used, but does not specify explicit percentages or sample counts for training, validation, or test sets in the text. |
| Hardware Specification | Yes | The model is trained on 4 NVIDIA A100 GPUs with batch size 32 and completes after 7 epochs. [...] In line with standard practices in vehicle audio systems, four omni-directional microphones are deployed throughout a NIO ES6 vehicle cabin. |
| Software Dependencies | No | Our audio encoder is initialized from Whisper-large-v2 (Radford et al., 2023) to leverage pretrained magnitude representations, while the phase-related subnetwork is trained from scratch to capture fine-grained physical cues critical for physical awareness. LLMs are fine-tuned using LoRA (Hu et al., 2022) to reduce the training workload and to leverage its linguistic capabilities. We train the models upon MS-SWIFT (Zhao et al., 2024) with modifications to accommodate our specific model architecture. The paper mentions several models and frameworks but does not specify version numbers for general software dependencies such as Python or PyTorch. |
| Experiment Setup | Yes | The model is trained on 4 NVIDIA A100 GPUs with batch size 32 and completes after 7 epochs. The total training time is about 61 hours. For response generation, the decoding parameters are set with temperature 1, top-p 1, and top-k 50. Appendix C lists more training hyperparameters. Table 8 (Training Hyperparameters): GPUs: 4× NVIDIA A100; Global Batch Size: 32; Epochs: 7; Optimizer: AdamW (β1 = 0.9, β2 = 0.95, ϵ = 1e-8); Learning Rate Schedule: Warmup Decay LR; Weight Decay: 0.1; Warm-up Min Learning Rate: 0; Warm-up Max Learning Rate: 0.0001; Warm-up Ratio: 0.05; LoRA Rank: 8; LoRA Alpha: 32; LoRA Dropout: 0.05. |
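For reference, the training and decoding settings reported in the Experiment Setup row can be collected into a single configuration sketch. This is a hedged illustration in plain Python: the dictionary structure and key names are our own, not the authors' config format (the paper states training was done with a modified MS-SWIFT), but every value is taken from the quoted text and Table 8.

```python
# Sketch of the reported experiment setup. Keys/structure are illustrative;
# values come from the paper's Table 8 and experiment-setup text.

training_config = {
    "gpus": "4x NVIDIA A100",
    "global_batch_size": 32,
    "epochs": 7,
    "optimizer": "AdamW",
    "optimizer_params": {"beta1": 0.9, "beta2": 0.95, "eps": 1e-8},
    "lr_schedule": "warmup_decay_lr",
    "weight_decay": 0.1,
    "warmup_min_lr": 0.0,
    "warmup_max_lr": 1e-4,
    "warmup_ratio": 0.05,
    # LoRA fine-tuning settings (Hu et al., 2022)
    "lora": {"rank": 8, "alpha": 32, "dropout": 0.05},
}

# Decoding parameters used for response generation
decoding_config = {
    "temperature": 1.0,
    "top_p": 1.0,
    "top_k": 50,
}
```

Note that with temperature 1 and top-p 1, sampling is constrained only by the top-k 50 cutoff, i.e. generation is close to unmodified sampling from the model's distribution.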