Latent Autoregressive Source Separation

Authors: Emilian Postolache, Giorgio Mariani, Michele Mancusi, Andrea Santilli, Luca Cosmo, Emanuele Rodolà

AAAI 2023

Reproducibility Variable Result LLM Response
Research Type Experimental We test our method on images and audio with several sampling strategies (e.g., ancestral, beam search) showing competitive results with existing approaches in terms of separation quality while offering at the same time significant speedups in terms of inference time and scalability to higher dimensional data. We perform quantitative and qualitative experiments on various datasets to demonstrate the efficacy and scalability of LASS. In the image domain, we evaluate on MNIST (LeCun et al. 1998) and CelebA (32 × 32) (Liu et al. 2015) and present qualitative results on the higher resolution datasets CelebA-HQ (256 × 256) (Karras et al. 2018) and ImageNet (256 × 256) (Deng et al. 2009). In the audio domain, we test on Slakh2100 (Manilow et al. 2019).
Researcher Affiliation Academia 1 Sapienza University of Rome, Italy 2 Ca' Foscari University of Venice, Italy 3 University of Lugano, Switzerland postolache@di.uniroma1.it, mariani@di.uniroma1.it, mancusi@di.uniroma1.it
Pseudocode Yes Algorithm 1: LASS inference. Input: y. Output: x1, x2.
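The pseudocode row above only records the signature of Algorithm 1 (mixture tokens y in, two source token sequences x1, x2 out). A minimal, hypothetical sketch of that inference idea — autoregressively sampling token pairs from a prior reweighted by a λ-scaled mixture likelihood — is given below. All model components here (the prior and likelihood tables, the toy sizes) are stand-ins invented for illustration, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

K = 8      # toy codebook size (the paper uses 256 to 16384 depending on dataset)
T = 5      # toy sequence length
lam = 3.0  # likelihood scaling, as reported in the paper (lambda = 3)

# Hypothetical stand-in for a learned autoregressive prior over the next
# token pair (x1_t, x2_t); a real model would condition on the history.
def toy_prior(hist1, hist2):
    logits = rng.normal(size=(K, K))
    p = np.exp(logits)
    return p / p.sum()

# Hypothetical likelihood table: likelihood[y, i, j] = p(y_t | x1_t=i, x2_t=j).
likelihood = rng.dirichlet(np.ones(K * K), size=K).reshape(K, K, K)

def lass_ancestral(y_tokens):
    """Ancestral sampling over token pairs: at each step draw (x1_t, x2_t)
    from the prior reweighted by the lambda-scaled mixture likelihood."""
    x1, x2 = [], []
    for y_t in y_tokens:
        post = toy_prior(x1, x2) * likelihood[y_t] ** lam
        post /= post.sum()
        idx = rng.choice(K * K, p=post.ravel())
        i, j = divmod(idx, K)
        x1.append(i)
        x2.append(j)
    return x1, x2

y = rng.integers(0, K, size=T).tolist()
s1, s2 = lass_ancestral(y)
```

The same loop supports the other samplers the paper mentions (e.g., keeping the B highest-scoring pair histories instead of one sample gives beam search).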
Open Source Code Yes Implementation details for all the models are listed on the companion website: github.com/gladia-research-group/latent-autoregressive-source-separation
Open Datasets Yes In the image domain, we evaluate on MNIST (LeCun et al. 1998) and CelebA (32 × 32) (Liu et al. 2015) and present qualitative results on the higher resolution datasets CelebA-HQ (256 × 256) (Karras et al. 2018) and ImageNet (256 × 256) (Deng et al. 2009). In the audio domain, we test on Slakh2100 (Manilow et al. 2019), a large dataset for music source separation suitable for generative modeling.
Dataset Splits Yes In order to choose the best sampler for this dataset, we validate the set of samplers in Table 3 on 1,000 mixtures constructed from the test split. The validation dataset is constructed similarly (with different music chunks).
Hardware Specification Yes We conducted all our experiments on a single Nvidia RTX 3090 GPU with 24 GB of VRAM.
Software Dependencies No The paper mentions common frameworks like "Transformer architecture" and implicitly deep learning libraries, but it does not provide specific version numbers for any software dependencies (e.g., Python, PyTorch, TensorFlow, etc.). It mentions "Implementation details for all the models are listed on the companion website" but these details are not present in the paper itself.
Experiment Setup Yes We use K = 256 codes on MNIST and K = 512 on CelebA... On CelebA-HQ the VQ-GAN has K = 1024 codes, while on ImageNet it has K = 16384. We scale the likelihood term by multiplying it by λ = 3. For each mixture in the test set we sample a candidate batch of 512 separations, select the separation whose sum best matches the mixture (w.r.t. the L2 distance), and finally perform the refinement procedure in Eqs. (5), (6) with T = 500 and α = 0.1. As a sampling strategy, we use beam search since it shows the best results on a validation of 50 mixtures (Table 3), using B = 100 beams.
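One concrete step in the setup above — sampling a batch of candidate separations and keeping the one whose sum is closest in L2 distance to the mixture — can be sketched as follows. This is an illustrative reconstruction under assumed shapes (decoded candidates as arrays), not the authors' code; the helper name and data are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def select_best(candidates, mixture):
    """Pick the (x1, x2) pair whose decoded sum is closest (L2) to the mixture.

    candidates: array of shape (B, 2, D) -- B candidate separations,
                each holding two decoded source estimates of dimension D.
    mixture:    array of shape (D,).
    """
    sums = candidates.sum(axis=1)                    # (B, D): x1 + x2 per candidate
    dists = np.linalg.norm(sums - mixture, axis=1)   # (B,): L2 distance to mixture
    return candidates[np.argmin(dists)]

B, D = 512, 16       # the paper samples a batch of 512 candidates per mixture
mixture = rng.normal(size=D)
cands = rng.normal(size=(B, 2, D))
best = select_best(cands, mixture)   # shape (2, D): the selected separation
```

In the paper this selection is followed by the refinement procedure of Eqs. (5), (6) (T = 500 steps, α = 0.1), which is not reproduced here.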