Reference-Based Speech Enhancement via Feature Alignment and Fusion Network
Authors: Huanjing Yue, Wenxin Duo, Xiulian Peng, Jingyu Yang (pp. 11648–11656)
AAAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that the performance of the proposed FAF-Net is close to the state-of-the-art speech enhancement methods on both DNS and Voice Bank+DEMAND datasets. |
| Researcher Affiliation | Academia | ¹School of Electrical and Information Engineering, Tianjin University, China; ²Individual. {huanjing.yue, hiedean, yjy}@tju.edu.cn |
| Pseudocode | No | The paper describes methods through text and diagrams (Figure 2, Figure 3), but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/HieDean/FAF-Net. |
| Open Datasets | Yes | Two benchmark datasets, i.e., DNS Challenge and Voice Bank+DEMAND, are used in our experiments. Voice Bank+DEMAND: "Although Voice Bank + DEMAND dataset (Valentini-Botinhao et al. 2016)..." DNS Challenge: "The Interspeech 2020 DNS challenge dataset (Reddy et al. 2020)..." |
| Dataset Splits | No | Voice Bank+DEMAND: "For training, 40 different noise conditions are mixed with the clean speech from the training set, generating 11572 noisy-clean speech pairs. For testing, 20 noise conditions are mixed with the clean speech from the test set, generating 824 noisy-clean speech pairs." DNS Challenge: "Following (Zheng et al. 2020), we synthesize 500 hours of noisy clips... for training. For evaluation, we use another 150 noisy clips from the test set..." No explicit validation split is mentioned. |
| Hardware Specification | No | The paper states 'The proposed method is implemented in Pytorch', but does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions 'The proposed method is implemented in Pytorch' but does not specify the version of PyTorch or any other software dependencies with their version numbers. |
| Experiment Setup | Yes | The proposed method is implemented in PyTorch. The batch size is set to 32 and an Adam optimizer is used. The frequency bins F_i (i = 1, 2, 3) are set to 128, 64, and 32, respectively. The channel numbers C_i (i = 1, 2, 3) are set to 32, 64, and 128, respectively. All audios are resampled to 16 kHz. The STFT is calculated using a Hann window with a window length of 512 samples and a hop length of 128 samples; the FFT size is also 512 samples. The MFCC features are calculated with the same window length and hop length, so the MFCC features and STFT spectrograms are frame-level matched along the time dimension. For Voice Bank+DEMAND, we train the first stage for 100 epochs and then train the second stage for another 100 epochs with the first-stage model fixed. The learning rate is initially set to 1e-4 and reduced to 1e-5 after 80 epochs for both training stages. Given the training cost of the two-stage model, for DNS we only present its one-stage result, which is generated by training for 50 epochs. Sketches of this configuration follow the table. |
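
To make the quoted feature-extraction settings concrete, below is a minimal PyTorch/torchaudio sketch of the STFT and MFCC configuration (16 kHz audio, Hann window, 512-sample window and FFT size, 128-sample hop). The random waveform and the coefficient count `n_mfcc=40` are illustrative assumptions; the paper does not specify how many MFCC coefficients are used.

```python
import torch
import torchaudio

# Parameters quoted from the paper's setup: 16 kHz audio, Hann window,
# window length 512, hop length 128, FFT size 512.
SR, N_FFT, WIN, HOP = 16000, 512, 512, 128

waveform = torch.randn(1, SR)  # assumption: one second of placeholder audio

# Complex STFT spectrogram; frames lie along the last dimension.
window = torch.hann_window(WIN)
spec = torch.stft(waveform, n_fft=N_FFT, hop_length=HOP,
                  win_length=WIN, window=window, return_complex=True)

# MFCC with the same window and hop, so both features share one frame grid.
# n_mfcc=40 is an assumption; the paper does not state the coefficient count.
mfcc_transform = torchaudio.transforms.MFCC(
    sample_rate=SR, n_mfcc=40,
    melkwargs={"n_fft": N_FFT, "hop_length": HOP,
               "win_length": WIN, "window_fn": torch.hann_window},
)
mfcc = mfcc_transform(waveform)

# Frame-level match along the time dimension, as described in the paper.
assert spec.shape[-1] == mfcc.shape[-1]
```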
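
The two-stage training schedule can be sketched in the same spirit: Adam at a learning rate of 1e-4 reduced to 1e-5 after 80 epochs, 100 epochs per stage, and the first-stage model frozen while the second stage trains. The placeholder modules and the elided loop body are assumptions; only the optimizer, schedule, and freezing pattern come from the quoted setup.

```python
import torch

# Assumption: trivial placeholder modules standing in for the two FAF-Net stages.
stage1 = torch.nn.Linear(257, 257)
stage2 = torch.nn.Linear(257, 257)

def train_stage(model, epochs=100):
    # Adam at 1e-4, dropped by 10x to 1e-5 after epoch 80, per the quoted setup.
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[80], gamma=0.1)
    for _ in range(epochs):
        ...  # assumption: iterate batches of size 32, compute loss, opt.step()
        sched.step()

train_stage(stage1)            # stage one: 100 epochs
for p in stage1.parameters():  # fix the first-stage model
    p.requires_grad = False
train_stage(stage2)            # stage two: another 100 epochs
```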