Chunked Autoregressive GAN for Conditional Waveform Synthesis
Authors: Max Morrison, Rithesh Kumar, Kundan Kumar, Prem Seetharaman, Aaron Courville, Yoshua Bengio
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train HiFi-GAN (V1) on the VCTK dataset (Yamagishi et al., 2019) and evaluate pitch accuracy on 256 randomly selected sentences from a validation set containing speakers seen during training. We perform spectrogram-to-waveform inversion on speech. All models are trained with a batch size of 64. We use the AdamW optimizer (Loshchilov & Hutter, 2017) with a learning rate of 2 × 10⁻⁴ and β = (.8, .99). |
| Researcher Affiliation | Collaboration | Max Morrison (Northwestern University, morrimax@u.northwestern.edu); Rithesh Kumar, Kundan Kumar (1) & Prem Seetharaman (Descript, Inc., {rithesh, kundan, prem}@descript.com); Aaron Courville (1,2) & Yoshua Bengio (1,3); (1) Mila, Québec Artificial Intelligence Institute, Université de Montréal |
| Pseudocode | No | The paper describes the model architecture and experimental procedures in narrative text and figures, but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | We also make our code available under an open-source license. Code is available at https://github.com/descriptinc/cargan. In order to facilitate reproduction of our research, we provide documented, open-source code that permits reproducing and evaluating all experiments in our paper... |
| Open Datasets | Yes | We train HiFi-GAN (V1) on the VCTK dataset (Yamagishi et al., 2019) and evaluate on both VCTK and DAPS (Mysore, 2014). For evaluation on DAPS, we use the segmented dataset of the first script of the clean partition available on Zenodo (Morrison et al., 2021). |
| Dataset Splits | Yes | For training on VCTK, we randomly select 100 speakers. We train on a random 95% of the data from these 100 speakers, using data from both microphones. We evaluate pitch accuracy on 256 randomly selected sentences from a validation set containing speakers seen during training. (A minimal sketch of this split follows the table.) |
| Hardware Specification | Yes | We use a single RTX A6000 for training and generation on a GPU, and two cores of an AMD EPYC 7742 with one thread per core and a 2.25 GHz maximum clock speed for CPU benchmarking. |
| Software Dependencies | No | The paper mentions 'torchcrepe' and 'PyTorch Hub' but does not provide specific version numbers for all key software dependencies needed for reproducibility, such as PyTorch or Python. (A hedged torchcrepe usage sketch follows the table.) |
| Experiment Setup | Yes | All models are trained with a batch size of 64. We use the AdamW optimizer (Loshchilov & Hutter, 2017) with a learning rate of 2 × 10⁻⁴ and β = (.8, .99). We use an exponential learning rate schedule that multiplies the learning rate by .999 after each epoch. All models are trained for 500,000 steps. (A minimal sketch of this configuration follows the table.) |
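
The Dataset Splits row describes a concrete recipe: randomly select 100 VCTK speakers, then train on a random 95% of their utterances (using both microphones) and validate on the held-out 5%. Below is a minimal sketch of that split; the helper names, data structures, and seed are illustrative assumptions and do not come from the CARGAN codebase.

```python
import random

def make_vctk_split(speakers, utterances_by_speaker, num_speakers=100,
                    train_fraction=0.95, seed=0):
    """Randomly select speakers, then split their utterances 95/5.

    `speakers` is a list of speaker IDs; `utterances_by_speaker` maps a
    speaker ID to its utterance paths (both microphones included).
    All names here are illustrative, not from the authors' repository.
    """
    rng = random.Random(seed)

    # Randomly select 100 training speakers, per the paper.
    selected = rng.sample(speakers, num_speakers)

    # Pool all utterances from the selected speakers and shuffle.
    utterances = [u for s in selected for u in utterances_by_speaker[s]]
    rng.shuffle(utterances)

    # Train on a random 95%; the remainder forms a validation set
    # containing speakers seen during training.
    split = int(train_fraction * len(utterances))
    return utterances[:split], utterances[split:]
```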
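The Experiment Setup row pins down enough hyperparameters to reconstruct the optimizer and schedule. The PyTorch sketch below wires them together; the placeholder model, empty loss step, and steps-per-epoch value are assumptions, while the batch size, AdamW settings, 0.999 per-epoch decay, and 500,000-step budget come from the quoted text.

```python
import torch

# Placeholder model; the actual CARGAN generator and discriminators are
# defined in the authors' repository.
model = torch.nn.Linear(80, 1)

# AdamW with the paper's learning rate and betas.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, betas=(0.8, 0.99))

# Multiply the learning rate by .999 after each epoch.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.999)

BATCH_SIZE = 64        # per the paper
TOTAL_STEPS = 500_000  # per the paper
STEPS_PER_EPOCH = 1_000  # assumption; depends on dataset size

step = 0
while step < TOTAL_STEPS:
    for _ in range(STEPS_PER_EPOCH):
        # ... forward pass, GAN losses, loss.backward() ...
        optimizer.step()
        optimizer.zero_grad()
        step += 1
        if step == TOTAL_STEPS:
            break
    scheduler.step()  # decay once per epoch, as described
```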
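Pitch accuracy is evaluated with torchcrepe (unversioned, per the Software Dependencies row). As a sketch of what such an evaluation could look like, the function below extracts pitch with torchcrepe's public `predict` API and reports the mean absolute difference in cents; the hop length, frequency range, model capacity, and the cents metric itself are illustrative assumptions, not necessarily the paper's exact procedure.

```python
import torch
import torchcrepe

def pitch_error_cents(true_audio, fake_audio, sample_rate=22050):
    """Mean absolute pitch difference (in cents) between two waveforms.

    `true_audio` and `fake_audio` are float tensors of shape (1, samples).
    Hop length, frequency range, and model capacity below are illustrative
    choices, not the paper's documented settings.
    """
    hop_length = sample_rate // 100  # 10 ms frames (assumption)
    kwargs = dict(fmin=50., fmax=550., model='full',
                  batch_size=512, device='cpu')
    true_pitch = torchcrepe.predict(
        true_audio, sample_rate, hop_length, **kwargs)
    fake_pitch = torchcrepe.predict(
        fake_audio, sample_rate, hop_length, **kwargs)

    # 1200 cents per octave: compare frame-wise log-frequencies.
    cents = 1200. * torch.abs(torch.log2(fake_pitch / true_pitch))
    return cents.mean().item()
```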