It’s Raw! Audio Generation with State-Space Models

Authors: Karan Goel, Albert Gu, Chris Donahue, Christopher Ré

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 5. Experiments. We evaluate SASHIMI on several benchmark audio generation and unconditional speech generation tasks in both AR and non-AR settings, validating that SASHIMI generates more globally coherent waveforms than baselines while having higher computational and sample efficiency.
Researcher Affiliation | Academia | Karan Goel 1, Albert Gu 1, Chris Donahue 1, Christopher Ré 1 ... 1 Department of Computer Science, Stanford University. Correspondence to: Karan Goel <kgoel@cs.stanford.edu>, Albert Gu <albertgu@stanford.edu>.
Pseudocode | No | The paper describes the architecture and processes in text and diagrams but does not include explicit pseudocode or algorithm blocks.
Open Source Code | No | The paper does not include an unambiguous statement that the authors are releasing the source code for the SASHIMI methodology, nor does it provide a direct link to a code repository for SASHIMI.
Open Datasets | Yes | The datasets we used can be found on Huggingface datasets: Beethoven, YouTube Mix, SC09.
Dataset Splits | Yes | Table 1. Summary of music and speech datasets used for unconditional AR generation experiments. ... MUSIC (YOUTUBEMIX) ... 88% / 6% / 6% (see the split sketch below the table)
Hardware Specification | Yes | All methods in the AR setting were trained on single V100 GPU machines. All diffusion models were trained on 8-GPU A100 machines.
Software Dependencies | No | The paper mentions adapting a 'PyTorch implementation' for models but does not provide specific version numbers for PyTorch or any other software libraries or dependencies.
Experiment Setup | Yes | For all datasets, we use a feature expansion of 2 when pooling, and use a feedforward dimension of 2× the model dimension in all inverted bottlenecks in the model. We use a model dimension of 64. For S4 parameters, we only train Λ and C with the recommended learning rate of 0.001, and freeze all other parameters for simplicity (including p, B, dt). We train with (4, 4) pooling for all datasets, with 8 S4 blocks per tier. (See the configuration sketch below the table.)
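The Open Datasets and Dataset Splits rows quote the paper's pointers to Hugging Face datasets and an 88%/6%/6% split. Below is a minimal sketch of how such a split could be reproduced with the Hugging Face `datasets` library; the repository id `"some-namespace/beethoven"` is a hypothetical placeholder, since the exact dataset identifiers are not given in the excerpts above.

```python
# Hedged sketch: reproduce an 88% / 6% / 6% train/validation/test split with the
# Hugging Face `datasets` library. The dataset id below is a hypothetical
# placeholder, not an identifier confirmed by the paper or this page.
from datasets import load_dataset

ds = load_dataset("some-namespace/beethoven", split="train")  # hypothetical id

# First carve off 12% for validation + test, then split that portion in half.
rest = ds.train_test_split(test_size=0.12, seed=0)
val_test = rest["test"].train_test_split(test_size=0.5, seed=0)

splits = {
    "train": rest["train"],           # ~88%
    "validation": val_test["train"],  # ~6%
    "test": val_test["test"],         # ~6%
}
```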
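The Experiment Setup row reports its hyperparameters only in prose. The following is a minimal PyTorch sketch of how those numbers could be collected into a configuration and an optimizer that trains only the S4 parameters Λ and C at learning rate 0.001 while freezing p, B, and dt. The parameter-name matching, the `build_optimizer` helper, the choice of AdamW, and the base learning rate argument are all assumptions of this sketch, not the authors' code (which, per the Open Source Code row, is not released).

```python
# Hedged sketch of the reported SaShiMi hyperparameters and S4 learning-rate
# handling. Parameter names ("Lambda", ".C", ".B", ".p", "log_dt") are assumed
# naming conventions for illustration, not the authors' actual module layout.
import torch

config = {
    "model_dim": 64,          # model dimension
    "ff_expand": 2,           # feedforward dim = 2x model dim (inverted bottleneck)
    "pool_expand": 2,         # feature expansion of 2 at each pooling step
    "pooling": (4, 4),        # two tiers with pooling factor 4 each
    "s4_blocks_per_tier": 8,  # 8 S4 blocks per tier
    "s4_lr": 1e-3,            # learning rate for trainable S4 parameters (Lambda, C)
}

def build_optimizer(model: torch.nn.Module, base_lr: float) -> torch.optim.Optimizer:
    """Train Lambda and C at lr=1e-3, freeze the other S4 parameters (p, B, dt),
    and train everything else at `base_lr` (a value not specified in the excerpt)."""
    s4_trainable, other = [], []
    for name, param in model.named_parameters():
        if "Lambda" in name or name.endswith(".C"):
            s4_trainable.append(param)
        elif any(key in name for key in (".p", ".B", "log_dt")):
            param.requires_grad_(False)  # frozen for simplicity, per the paper
        else:
            other.append(param)
    return torch.optim.AdamW(
        [
            {"params": other, "lr": base_lr},
            {"params": s4_trainable, "lr": config["s4_lr"], "weight_decay": 0.0},
        ]
    )
```

Keeping the S4 state parameters in their own parameter group mirrors the quoted setup: only Λ and C receive gradient updates, at the recommended 0.001, while the remaining state-space parameters stay fixed.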