Adversarial Generation of Time-Frequency Features with application in audio synthesis
Authors: Andrés Marafioti, Nathanaël Perraudin, Nicki Holighaus, Piotr Majdak
ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the potential of deliberate generative TF modeling by training a generative adversarial network (GAN) on short-time Fourier features. We show that by applying our guidelines, our TF-based network was able to outperform a state-of-the-art GAN generating waveforms directly, despite the similar architecture in the two networks. ... To evaluate the performance of TiFGAN, we trained TiFGAN-M and TiFGAN-MTF using the procedure outlined above on two datasets from (Donahue et al., 2019): (a) Speech, a subset of spoken digits zero through nine (sc09) from the Speech Commands Dataset (Warden, 2018). (b) Music, a dataset of 25 minutes of piano recordings of Bach compositions, segmented into approximately 19,000 overlapping samples of 1 s duration. |
| Researcher Affiliation | Academia | 1Acoustics Research Institute, Austrian Academy of Sciences, Wohllebengasse 12-14, 1040 Vienna, Austria. 2Swiss Data Science Center, ETH Zürich, Universitätstrasse 25, 8006 Zürich. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It includes diagrams of the network architecture, but no text-based algorithmic steps. |
| Open Source Code | Yes | Our software, complemented by instructive examples, is available at http://tifgan.github.io. |
| Open Datasets | Yes | To evaluate the performance of TiFGAN, we trained TiFGAN-M and TiFGAN-MTF using the procedure outlined above on two datasets from (Donahue et al., 2019): (a) Speech, a subset of spoken digits zero through nine (sc09) from the Speech Commands Dataset (Warden, 2018). |
| Dataset Splits | No | The paper mentions using two datasets but does not provide specific details on how these datasets were split into training, validation, or test sets (e.g., percentages or exact counts), nor does it reference predefined splits with citations for reproducibility. |
| Hardware Specification | Yes | We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research. |
| Software Dependencies | No | The paper mentions software components like 'SciPy and Tensorflow', 'Large Time-Frequency Analysis Toolbox (LTFAT, Průša et al., 2014)', and 'ADAM optimizer (Kingma & Ba, 2015)', but it does not specify version numbers for these libraries or frameworks, which are crucial for reproducibility. |
| Experiment Setup | Yes | For the short-time Fourier transform, we fix the minimal redundancy that we consider reliable, i.e., M/a = 4, and select a = 128, M = 512, such that M_R = 257, N = L/a = 128 and the STFT matrix S is of size C^(M_R × N). ... The dynamic range of the log-magnitude is limited by clipping at −r (in our experiments r = 10), before scaling and shifting to the range of the generator output [−1, 1]... Our networks were trained for 200k steps... We optimized the Wasserstein loss (Gulrajani et al., 2017) with the gradient penalty hyperparameter set to 10 using the ADAM optimizer (Kingma & Ba, 2015) with α = 10⁻⁴, β₁ = 0.5, β₂ = 0.9 and performed 5 updates of the discriminator for every update of the generator. (Minimal code sketches of this preprocessing and of the training objective follow the table.) |
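
The quoted setup fixes a = 128, M = 512 (so M_R = 257 one-sided bins over N time frames) and maps clipped log-magnitudes into the generator's output range. The sketch below illustrates that preprocessing under stated assumptions: the paper computes the STFT with LTFAT (Gaussian window), whereas here SciPy's default Hann window and boundary padding stand in, so the frame count may differ slightly from N = L/a = 128; the normalization to a maximum magnitude of 1 is also an assumption about how the clipping at −r is applied.

```python
import numpy as np
from scipy.signal import stft

a, M, L, r = 128, 512, 16384, 10   # hop size, channels, 1 s at 16 kHz, dynamic range

def log_magnitude_features(signal):
    # M = 512 channels with hop a = 128 gives M_R = M/2 + 1 = 257 frequency bins
    _, _, S = stft(signal, nperseg=M, noverlap=M - a, nfft=M)
    mag = np.abs(S)
    mag /= mag.max()                              # normalize so log-magnitude <= 0
    logmag = np.log(np.maximum(mag, np.exp(-r)))  # clip dynamic range at -r
    return 2.0 * (logmag / r) + 1.0               # scale and shift to [-1, 1]

feats = log_magnitude_features(np.random.randn(L))  # stand-in for a 1 s clip
print(feats.shape)  # (257, N): M_R frequency bins by N time frames
```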
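The training objective quoted in the last row is the standard Wasserstein loss with gradient penalty (Gulrajani et al., 2017). The following TensorFlow sketch wires up only the hyperparameters the paper states (penalty weight 10; ADAM with α = 10⁻⁴, β₁ = 0.5, β₂ = 0.9; 5 discriminator updates per generator update); the tiny generator and discriminator are placeholders, not the paper's architecture, and the latent size of 100 is an assumption.

```python
import tensorflow as tf

# Placeholder models; the paper's networks follow a DCGAN-style architecture.
generator = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(100,)),
    tf.keras.layers.Dense(257 * 128), tf.keras.layers.Reshape((257, 128, 1))])
discriminator = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(257, 128, 1)), tf.keras.layers.Dense(1)])

LAMBDA, N_CRITIC = 10.0, 5  # penalty weight and critic updates, as quoted
opt_d = tf.keras.optimizers.Adam(1e-4, beta_1=0.5, beta_2=0.9)
opt_g = tf.keras.optimizers.Adam(1e-4, beta_1=0.5, beta_2=0.9)

def gradient_penalty(real, fake):
    # Penalize deviation of the critic's gradient norm from 1 at random
    # interpolates between real and generated samples.
    eps = tf.random.uniform([tf.shape(real)[0], 1, 1, 1])
    interp = eps * real + (1.0 - eps) * fake
    with tf.GradientTape() as tape:
        tape.watch(interp)
        score = discriminator(interp)
    g = tape.gradient(score, interp)
    norm = tf.sqrt(tf.reduce_sum(tf.square(g), axis=[1, 2, 3]) + 1e-12)
    return tf.reduce_mean((norm - 1.0) ** 2)

def train_step(real):
    for _ in range(N_CRITIC):  # 5 discriminator updates per generator update
        z = tf.random.normal([tf.shape(real)[0], 100])
        with tf.GradientTape() as tape:
            fake = generator(z)
            d_loss = (tf.reduce_mean(discriminator(fake))
                      - tf.reduce_mean(discriminator(real))
                      + LAMBDA * gradient_penalty(real, fake))
        opt_d.apply_gradients(zip(tape.gradient(d_loss, discriminator.trainable_variables),
                                  discriminator.trainable_variables))
    z = tf.random.normal([tf.shape(real)[0], 100])
    with tf.GradientTape() as tape:
        g_loss = -tf.reduce_mean(discriminator(generator(z)))
    opt_g.apply_gradients(zip(tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
```

The 5:1 update schedule keeps the critic near-optimal between generator steps, which is what makes its output a usable estimate of the Wasserstein distance.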