Separate and Reconstruct: Asymmetric Encoder-Decoder for Speech Separation

Authors: Ui-Hyeop Shin, Sangyoun Lee, Taehan Kim, Hyung-Min Park

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experimental results demonstrated that this asymmetric structure is effective and that the combination of the proposed global and local Transformers can sufficiently replace the role of inter- and intra-chunk processing in the dual-path structure.
Researcher Affiliation | Academia | Department of Electronic Engineering, Sogang University, Seoul, Republic of Korea ({dmlguq123,leesy0882,taehank,hpark}@sogang.ac.kr)
Pseudocode | No | The paper includes several block diagrams (e.g., Figure 1, Figure 2, and Figure 4) illustrating the architecture and its components, but no formal pseudocode or algorithm blocks are provided.
Open Source Code | Yes | Not only did we include the implementation code in the supplemental material, but we also plan to release the code for our main experiments publicly soon.
Open Datasets | Yes | We evaluated our proposed SepReformer on WSJ0-2Mix [30], WHAM! [82], WHAMR! [50], and LibriMix [18], which are popular datasets for monaural speech separation.
Dataset Splits | Yes | WSJ0-2Mix is the most popular dataset to benchmark the monaural speech separation task. It contains 30, 10, and 5 hours for the training, validation, and evaluation sets, respectively.
Hardware Specification | Yes | All experiments were conducted on a server with a GeForce RTX 3090.
Software Dependencies | No | The paper mentions optimizers (AdamW), activation functions (GELU, GLU), and various networks, but does not provide specific version numbers for software dependencies such as Python, PyTorch, TensorFlow, or CUDA.
Experiment Setup | Yes | We trained the proposed SepReformer for a maximum of 200 epochs with an initial learning rate of 1.0e-3. We used a warm-up training scheduler for the first epoch, and then the learning rate decayed by a factor of 0.8 if the validation loss did not improve for three consecutive epochs. As optimizer, AdamW [43] was used with a weight decay of 0.01, and gradient clipping with a maximum L2-norm of 5 was applied for stable training. All models were trained with Permutation Invariant Training (PIT) [36]. When the multi-loss in Subsection 3.4 was applied, α was set to 0.4, and after 100 epochs it decayed by a factor of 0.8 every five epochs. τ was set to 30 as in [89]. SI-SNRi and SDRi [73] were used as evaluation metrics. We also compared the parameter size and the number of multiply-accumulate operations (MACs) for 16,000 samples. The number of heads in MHSA was commonly set to 8, and the kernel size K in the local block was set to 65. We also evaluated our model at various sizes as follows: SepReformer-T/B/L: F = 64/128/256, Fo = 256, L = 16, H = 4, R = 4; SepReformer-S/M: F = 64/128, Fo = 256, L = 8, H = 2, R = 5. We used a longer encoder length of L = 32 in the Large model when evaluating the WHAMR! dataset to account for reverberation.
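To make the reported optimization setup concrete, the following is a minimal PyTorch sketch (the paper does not name its framework, so PyTorch is an assumption) combining AdamW with weight decay 0.01, plateau-based learning-rate decay by 0.8 with a patience of three epochs, gradient clipping at an L2-norm of 5, and utterance-level PIT over negative SI-SNR. The names `model`, `train_loader`, and `validate` are hypothetical placeholders, and the first-epoch warm-up and the multi-loss/α schedule are omitted.

```python
import itertools
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import ReduceLROnPlateau


def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB for tensors shaped (..., time)."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    proj = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
    noise = est - proj
    return 10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)


def pit_neg_si_snr(est, ref):
    """Permutation-invariant negative SI-SNR for (batch, n_src, time) tensors."""
    n_src = est.size(1)
    per_perm = [(-si_snr(est[:, list(p), :], ref)).mean(dim=1)
                for p in itertools.permutations(range(n_src))]
    return torch.stack(per_perm, dim=1).min(dim=1).values.mean()


def train(model, train_loader, validate, max_epochs=200):
    # Reported setup: AdamW, initial lr 1.0e-3, weight decay 0.01; lr decays by 0.8
    # when validation loss stalls for 3 epochs; gradients clipped at L2-norm 5.
    # (First-epoch warm-up and the multi-loss weighting are left out of this sketch.)
    optimizer = AdamW(model.parameters(), lr=1.0e-3, weight_decay=0.01)
    scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.8, patience=3)
    for _ in range(max_epochs):
        model.train()
        for mixture, sources in train_loader:      # (batch, time), (batch, n_src, time)
            loss = pit_neg_si_snr(model(mixture), sources)
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
            optimizer.step()
        scheduler.step(validate(model))            # validate() returns the validation loss
```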
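For quick reference, the reported model-size variants can also be collected into a small lookup structure. This is a hypothetical summary, not the authors' configuration format; see the paper for the exact meaning of F, Fo, L, H, and R.

```python
# Reported SepReformer variants (values quoted from the experiment-setup row above).
SEPREFORMER_VARIANTS = {
    "T": {"F": 64,  "Fo": 256, "L": 16, "H": 4, "R": 4},
    "B": {"F": 128, "Fo": 256, "L": 16, "H": 4, "R": 4},
    "L": {"F": 256, "Fo": 256, "L": 16, "H": 4, "R": 4},
    "S": {"F": 64,  "Fo": 256, "L": 8,  "H": 2, "R": 5},
    "M": {"F": 128, "Fo": 256, "L": 8,  "H": 2, "R": 5},
}
# The paper additionally uses a longer encoder length (L = 32) for the Large model
# on WHAMR! to account for reverberation.
```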