Unsupervised Audiovisual Synthesis via Exemplar Autoencoders

Authors: Kangle Deng, Aayush Bansal, Deva Ramanan

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We now quantitatively evaluate the proposed method for audio conversion in Sec. 4.1 and audiovisual synthesis in Sec. 4.2. We attempted to find all public codebases that could serve as baselines for audio conversion and audiovisual generation. We use existing datasets for a fair evaluation with prior art. We also introduce a new challenging dataset that consists of recordings in non-studio environments. Many of our motivating applications, such as education and assistive technology, require processing unstructured real-world data collected outside a studio. We urge the reader to see our summary video to see/hear our results, as well as the appendix for additional ablative analysis.
Researcher Affiliation | Academia | Kangle Deng, Aayush Bansal, Deva Ramanan; Carnegie Mellon University, Pittsburgh, PA 15213, USA; {kangled,aayushb,deva}@cs.cmu.edu
Pseudocode | No | The paper provides network architecture diagrams (Figure 4) and describes the components in text, but it does not include pseudocode or a clearly labeled algorithm block.
Open Source Code | Yes | Because access to code for generating fake training examples will be crucial for learning to identify fake content, we commit to making our code freely available. Appendix B provides extensive analysis of audio detection of Exemplar Autoencoder fakes.
Open Datasets | Yes | Datasets: We use the publicly available VCTK dataset (Veaux et al., 2016), which contains 44 hours of utterances from 109 native speakers of English with various accents. Each speaker reads a different set of sentences, except for two paragraphs.
Dataset Splits | No | The paper specifies training and test data amounts for some experiments (e.g., Table 4 mentions '200 sentences' for training a discriminator and '100 real sentences and 100 fake ones' for testing it), but it does not provide explicit validation-split information or percentages for the main model training.
Hardware Specification | No | The paper does not specify the hardware used for running experiments, such as GPU models, CPU types, or memory details.
Software Dependencies | No | The paper mentions software components like 'WaveNet vocoder' (Oord et al., 2016), 'batch normalization' (Ioffe & Szegedy, 2015), 'ReLU activation' (Krizhevsky et al., 2012), and 'bidirectional LSTM' (Hochreiter & Schmidhuber, 1997), but it does not provide specific version numbers for any of these software dependencies.
Experiment Setup | Yes | Training Details: Our model is trained at a learning rate of 0.001 and a batch size of 8. To train a model from scratch, it needs about 30 minutes of the target speaker's speech data and around 10k iterations to converge. Although our main structure is straightforward, the vocoder is usually a large and complicated network, which needs another 50k iterations to train. However, transfer learning can be beneficial in reducing the number of iterations and necessary data for training purposes. When fine-tuning a new speaker's autoencoder from a pre-trained model, we only need about 3 minutes of speech from a new speaker. The entire model, including the vocoder, converges around 10k iterations.
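
The Pseudocode and Software Dependencies rows above note that the paper describes its components only in diagrams and text. As a rough illustration of how the named components (batch normalization, ReLU, and a bidirectional LSTM, with a WaveNet-style vocoder downstream) could fit together in a per-speaker exemplar autoencoder trained purely on reconstruction of the target speaker's spectrograms, the following PyTorch sketch is provided; the layer sizes, spectrogram dimensions, and L1 reconstruction loss are assumptions, not the authors' implementation.

import torch
import torch.nn as nn

N_MELS = 80      # assumed number of mel-spectrogram bins
HIDDEN = 256     # assumed hidden width

class ExemplarAutoencoder(nn.Module):
    """Minimal per-speaker autoencoder sketch; all sizes are illustrative guesses."""
    def __init__(self):
        super().__init__()
        # Convolutional front end with batch normalization and ReLU,
        # the normalization/activation choices named in the paper.
        self.encoder = nn.Sequential(
            nn.Conv1d(N_MELS, HIDDEN, kernel_size=5, padding=2),
            nn.BatchNorm1d(HIDDEN),
            nn.ReLU(),
            nn.Conv1d(HIDDEN, HIDDEN, kernel_size=5, padding=2),
            nn.BatchNorm1d(HIDDEN),
            nn.ReLU(),
        )
        # Bidirectional LSTM bottleneck, also named in the paper.
        self.bottleneck = nn.LSTM(HIDDEN, HIDDEN // 2,
                                  batch_first=True, bidirectional=True)
        # Decoder back to mel frames; a WaveNet-style vocoder (trained
        # separately) would convert these frames to a waveform.
        self.decoder = nn.Sequential(
            nn.Conv1d(HIDDEN, HIDDEN, kernel_size=5, padding=2),
            nn.BatchNorm1d(HIDDEN),
            nn.ReLU(),
            nn.Conv1d(HIDDEN, N_MELS, kernel_size=5, padding=2),
        )

    def forward(self, mel):                      # mel: (batch, N_MELS, time)
        h = self.encoder(mel)                    # (batch, HIDDEN, time)
        h, _ = self.bottleneck(h.transpose(1, 2))
        return self.decoder(h.transpose(1, 2))   # reconstructed mel

if __name__ == "__main__":
    model = ExemplarAutoencoder()
    mel = torch.randn(8, N_MELS, 128)              # a batch of target-speaker frames
    loss = nn.functional.l1_loss(model(mel), mel)  # reconstruction-only objective
    loss.backward()
    print(f"reconstruction loss: {loss.item():.4f}")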
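
The Experiment Setup row quotes the reported schedule: learning rate 0.001, batch size 8, roughly 10k iterations from about 30 minutes of target speech, another ~50k iterations for the vocoder, and fine-tuning from a pre-trained speaker with about 3 minutes of new speech. Below is a minimal sketch of that loop, assuming an Adam optimizer and a hypothetical target_speaker_mels() loader and checkpoint path, none of which are specified in the paper.

import itertools
import torch

def target_speaker_mels(batch_size=8, n_mels=80, frames=128):
    """Hypothetical loader yielding batches of the target speaker's mel frames."""
    while True:
        yield torch.randn(batch_size, n_mels, frames)

def train(model, iterations=10_000, lr=1e-3, pretrained_ckpt=None):
    # Fine-tuning path: start from another speaker's autoencoder so that
    # roughly 3 minutes of new speech suffices (per the reported setup).
    if pretrained_ckpt is not None:
        model.load_state_dict(torch.load(pretrained_ckpt))
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # optimizer choice assumed
    for step, mel in enumerate(itertools.islice(target_speaker_mels(), iterations)):
        loss = torch.nn.functional.l1_loss(model(mel), mel)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % 1000 == 0:
            print(f"iter {step}: loss {loss.item():.4f}")
    # The WaveNet-style vocoder is reported to need ~50k further iterations;
    # that separate stage is not sketched here.
    return model

# Example: train the sketch model above from scratch (~30 minutes of target
# speech in practice), or pass pretrained_ckpt="some_speaker.pt" to fine-tune:
# train(ExemplarAutoencoder())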