Separate and Reconstruct: Asymmetric Encoder-Decoder for Speech Separation
Authors: Ui-Hyeop Shin, Sangyoun Lee, Taehan Kim, Hyung-Min Park
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experimental results demonstrated that this asymmetric structure is effective and that the combination of proposed global and local Transformer can sufficiently replace the role of inter- and intra-chunk processing in dual-path structure. |
| Researcher Affiliation | Academia | Department of Electronic Engineering, Sogang University, Seoul, Republic of Korea {dmlguq123,leesy0882,taehank,hpark}@sogang.ac.kr |
| Pseudocode | No | The paper includes several block diagrams (e.g., Figure 1, Figure 2, Figure 4) illustrating the architecture and components, but no formal pseudocode or algorithm blocks are provided. |
| Open Source Code | Yes | Not only did we include the implementation code in the supplemental material, but we are also planning to release our code for our main experiments in public soon. |
| Open Datasets | Yes | We evaluated our proposed SepReformer on WSJ0-2Mix [30], WHAM! [82], WHAMR! [50], and LibriMix [18], which are popular datasets for monaural speech separation. |
| Dataset Splits | Yes | WSJ0-2Mix is the most popular dataset to benchmark the monaural speech separation task. It contains 30, 10, and 5 hours for training, validation, and evaluation sets, respectively. |
| Hardware Specification | Yes | All experiments were conducted on a server with a GeForce RTX 3090. |
| Software Dependencies | No | The paper mentions optimizers (AdamW), activation functions (GELU, GLU), and various networks, but does not provide specific version numbers for software dependencies like Python, PyTorch, TensorFlow, or CUDA. |
| Experiment Setup | Yes | We trained the proposed SepReformer for a maximum of 200 epochs with an initial learning rate of 1.0e-3. We used a warm-up training scheduler for the first epoch, and then the learning rate decayed by a factor of 0.8 if the validation loss did not improve in three consecutive epochs. As optimizer, AdamW [43] was used with a weight decay of 0.01, and gradient clipping with a maximum L2-norm of 5 was applied for stable training. All models were trained with Permutation Invariant Training (PIT) [36]. When the multi-loss in Subsection 3.4 was applied, α was set to 0.4, and after 100 epochs it decayed by a factor of 0.8 every five epochs. τ was set to 30 as in [89]. SI-SNRi and SDRi [73] were used as evaluation metrics. We also compared the parameter size and the number of multiply-accumulate operations (MACs) for 16000 samples. The number of heads in MHSA was commonly set to 8, and the kernel size K in the local block was set to 65. We evaluated our model at various model sizes as follows: SepReformer-T/B/L: F = 64/128/256, F_o = 256, L = 16, H = 4, R = 4; SepReformer-S/M: F = 64/128, F_o = 256, L = 8, H = 2, R = 5. We used a longer encoder length of L = 32 in the Large model when evaluating the WHAMR! dataset to account for reverberation. |
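
The reported training setup (AdamW with weight decay 0.01, a one-epoch warm-up, plateau-based learning-rate decay by 0.8 with a patience of three epochs, and gradient clipping at an L2-norm of 5) maps onto a short training loop. The following is a minimal sketch assuming PyTorch; the placeholder network, dummy data, and `dummy_loss` stand in for the SepReformer model and its PIT/SI-SNR objective, which are not reproduced here. Only the optimizer, scheduler, warm-up, and clipping settings follow the reported configuration.

```python
# Minimal sketch of the reported optimization setup, assuming PyTorch.
# The network, data, and loss below are dummy placeholders; only the AdamW
# settings, warm-up, plateau decay, and gradient clipping follow the text.
import torch
import torch.nn as nn

model = nn.Conv1d(1, 2, kernel_size=16, stride=8)              # placeholder "separator"
train_batches = [torch.randn(4, 1, 16000) for _ in range(10)]  # dummy mixtures
val_batches = [torch.randn(4, 1, 16000) for _ in range(2)]

optimizer = torch.optim.AdamW(model.parameters(), lr=1.0e-3, weight_decay=0.01)
# Decay the learning rate by 0.8 when the validation loss stalls for 3 epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.8, patience=3
)

max_epochs = 200
warmup_steps = len(train_batches)  # warm up over the first epoch

def dummy_loss(est):
    # Stand-in for the PIT / SI-SNR objective used in the paper.
    return est.pow(2).mean()

for epoch in range(max_epochs):
    model.train()
    for step, mixture in enumerate(train_batches):
        if epoch == 0:  # linear warm-up during the first epoch
            for group in optimizer.param_groups:
                group["lr"] = 1.0e-3 * (step + 1) / warmup_steps

        loss = dummy_loss(model(mixture))
        optimizer.zero_grad()
        loss.backward()
        # Clip gradients to a maximum L2-norm of 5 for stable training.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
        optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = sum(dummy_loss(model(m)).item() for m in val_batches) / len(val_batches)
    scheduler.step(val_loss)
```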