It’s Raw! Audio Generation with State-Space Models
Authors: Karan Goel, Albert Gu, Chris Donahue, Christopher Ré
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5. Experiments We evaluate SASHIMI on several benchmark audio generation and unconditional speech generation tasks in both AR and non-AR settings, validating that SASHIMI generates more globally coherent waveforms than baselines while having higher computational and sample efficiency. |
| Researcher Affiliation | Academia | Karan Goel 1 Albert Gu 1 Chris Donahue 1 Christopher Ré 1 ... 1Department of Computer Science, Stanford University. Correspondence to: Karan Goel <kgoel@cs.stanford.edu>, Albert Gu <albertgu@stanford.edu>. |
| Pseudocode | No | The paper describes the architecture and processes in text and diagrams but does not include explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not include an unambiguous statement that the authors are releasing the source code for the SASHIMI methodology, nor does it provide a direct link to a code repository for SASHIMI. |
| Open Datasets | Yes | The datasets we used can be found on Huggingface datasets: Beethoven, YouTubeMix, SC09. (A hedged loading and splitting sketch follows the table.) |
| Dataset Splits | Yes | Table 1. Summary of music and speech datasets used for unconditional AR generation experiments. ... MUSIC YOUTUBEMIX ... 88% / 6% / 6% (train / validation / test) |
| Hardware Specification | Yes | All methods in the AR setting were trained on single V100 GPU machines. All diffusion models were trained on 8-GPU A100 machines. |
| Software Dependencies | No | The paper mentions adapting a 'PyTorch implementation' for models but does not provide specific version numbers for PyTorch or any other software libraries or dependencies. |
| Experiment Setup | Yes | For all datasets, we use feature expansion of 2 when pooling, and use a feedforward dimension of 2× the model dimension in all inverted bottlenecks in the model. We use a model dimension of 64. For S4 parameters, we only train Λ and C with the recommended learning rate of 0.001, and freeze all other parameters for simplicity (including p, B, dt). We train with 4× 4× pooling for all datasets, with 8 S4 blocks per tier. (A hedged configuration sketch follows the table.) |
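
The Open Datasets and Dataset Splits rows can be made concrete with a short sketch. This is a minimal sketch assuming the Hugging Face `datasets` library; the repository ID is a placeholder (the quotes do not give exact Hub IDs), and only the 88% / 6% / 6% proportions come from the quoted Table 1.

```python
from datasets import load_dataset, DatasetDict

REPO_ID = "someuser/youtubemix"  # hypothetical Hub ID -- replace with the real repository

# Load whatever single split the repository ships (assumed here to be "train").
raw = load_dataset(REPO_ID, split="train")

# Reproduce the 88% / 6% / 6% proportions from the quoted Table 1:
# hold out 12%, then cut the held-out portion in half for validation and test.
holdout = raw.train_test_split(test_size=0.12, seed=0)
val_test = holdout["test"].train_test_split(test_size=0.5, seed=0)

splits = DatasetDict(
    train=holdout["train"],        # ~88%
    validation=val_test["train"],  # ~6%
    test=val_test["test"],         # ~6%
)
print({name: len(split) for name, split in splits.items()})
```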
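
The Experiment Setup row can likewise be summarized as a hedged configuration sketch. Only the numeric values are taken from the quoted text; the dictionary keys, parameter names (`Lambda`, `C`, `p`, `B`, `log_dt`), and the optimizer-group logic are assumptions about how an S4/SaShiMi-style PyTorch implementation might be organized, not the authors' actual code.

```python
import torch

# Hyperparameters quoted in the Experiment Setup row.
config = {
    "model_dim": 64,          # model dimension
    "ff_expansion": 2,        # inverted bottleneck: 2x the model dimension
    "pool_expansion": 2,      # feature expansion of 2 when pooling
    "pooling": [4, 4],        # two pooling stages of factor 4 (as read from "4x 4x pooling")
    "s4_blocks_per_tier": 8,  # 8 S4 blocks per tier
    "s4_lr": 1e-3,            # recommended learning rate for the trained S4 parameters
}

def build_optimizer(model: torch.nn.Module, s4_lr: float = 1e-3) -> torch.optim.Optimizer:
    """Train Lambda and C at the S4 learning rate; freeze the other state-space parameters."""
    s4_params, other_params = [], []
    for name, param in model.named_parameters():
        leaf = name.split(".")[-1]        # hypothetical parameter naming convention
        if leaf in ("Lambda", "C"):
            s4_params.append(param)       # trained with the 0.001 S4 rate
        elif leaf in ("p", "B", "log_dt"):
            param.requires_grad_(False)   # frozen "for simplicity"
        else:
            other_params.append(param)    # remaining network weights train normally
    return torch.optim.AdamW(
        [{"params": s4_params, "lr": s4_lr}, {"params": other_params}],
        lr=s4_lr,  # base rate for the non-S4 group (an assumption, not stated in the quote)
    )
```

Grouping the parameters at optimizer-construction time keeps the sketch consistent with the quote's "train Λ and C, freeze all other parameters" description.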