Learning Signal-Agnostic Manifolds of Neural Fields

Authors: Yilun Du, Katie Collins, Josh Tenenbaum, Vincent Sitzmann

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate the generality of GEM by first showing that our model is capable of fitting diverse signal modalities. Next, we demonstrate that our approach captures the underlying structure across these signals; we are not only able to cluster and perceptually interpolate between signals, but inpaint to complete partial ones. Finally, we show that we can draw samples from the learned manifold of each signal type, illustrating the power of GEM to be used as a signal-agnostic generative model. We re-iterate that nearly identical architectures and training losses are used across separate modalities. ... Quantitative Comparisons. Next, we provide quantitative evaluations of generations in Table 2 on image and shape modalities.
Researcher Affiliation | Academia | Yilun Du (yilundu@mit.edu), Katherine Collins (katiemc@mit.edu), Joshua B. Tenenbaum (jbt@mit.edu), Vincent Sitzmann (sitzmann@mit.edu); MIT CSAIL, MIT BCS, MIT CBMM.
Pseudocode | No | The paper describes algorithms but does not include a dedicated pseudocode block or a clearly labeled 'Algorithm' figure.
Open Source Code | Yes | Code and additional results are available at https://yilundu.github.io/gem/.
Open Datasets | Yes | Datasets. We evaluate GEM on four signal modalities: image, audio, 3D shape, and cross-modal image and audio signals, respectively. For the image modality, we investigate performance on the CelebA-HQ dataset [38], fit on 29000 64×64 training celebrity images, and test on 1000 64×64 test images. To study GEM behavior on audio signals, we use the NSynth dataset [39], and fit on a training set of 10000 one-second 16 kHz sound clips of different instruments playing, and test on 5000 one-second 16 kHz sound clips. ... For the 3D shape domain, we work with the ShapeNet dataset from [40]. ... Finally, for the cross-modal image and audio modality, we utilize the cello image and audio recordings from the Sub-URMP dataset [41].
Dataset Splits | No | The paper specifies training and test set sizes but does not explicitly mention a validation set or provide details on how a validation split was created or used (e.g., percentages or specific counts for validation).
Hardware Specification | No | The paper does not specify any particular hardware used for running the experiments (e.g., specific GPU or CPU models).
Software Dependencies | No | The paper mentions software such as the 'PyTorch VAE library' and 'StyleGAN2' but does not provide specific version numbers for these libraries or for the underlying framework (e.g., PyTorch version).
Experiment Setup | Yes | We train all approaches with a latent dimension of 1024, and re-scale the size of the VAE to ensure parameter counts are similar. We report model architecture details in the appendix. ... We use a three-layer multilayer perceptron (MLP) with hidden dimension 512 as our Φ ... our hypernetwork is parameterized as a three-layer MLP.
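
The Experiment Setup row fixes only a few architectural facts: a 1024-dimensional latent code, a three-layer MLP Φ with hidden dimension 512, and a three-layer MLP hypernetwork that produces Φ's weights. The PyTorch sketch below wires those pieces together for orientation; the hypernetwork width, Φ's input/output dimensions (2D pixel coordinates to RGB), and the ReLU activations are assumptions rather than details confirmed by the paper.

```python
import math

import torch
import torch.nn as nn

# Sketch of the quoted setup: a 1024-d latent is mapped by a three-layer MLP
# hypernetwork to the weights of a three-layer MLP neural field Phi with
# hidden dimension 512. Phi's input/output sizes (2-D coordinates -> RGB),
# the hypernetwork width, and the ReLU activations are assumptions.

LATENT_DIM = 1024        # stated in the paper
HIDDEN_DIM = 512         # hidden width of Phi, stated in the paper
IN_DIM, OUT_DIM = 2, 3   # assumed: pixel coordinates -> RGB values

# Shapes of Phi's three weight matrices and bias vectors, in forward order.
PHI_SHAPES = [
    (HIDDEN_DIM, IN_DIM), (HIDDEN_DIM,),
    (HIDDEN_DIM, HIDDEN_DIM), (HIDDEN_DIM,),
    (OUT_DIM, HIDDEN_DIM), (OUT_DIM,),
]
N_PHI_PARAMS = sum(math.prod(s) for s in PHI_SHAPES)


class HyperNetwork(nn.Module):
    """Three-layer MLP mapping a latent code to the flattened parameters of Phi."""

    def __init__(self, latent_dim=LATENT_DIM, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, N_PHI_PARAMS),
        )

    def forward(self, z):
        return self.net(z)  # (batch, N_PHI_PARAMS)


def phi_forward(params, coords):
    """Evaluate the generated field Phi at query coordinates.

    params: flattened weights for one signal, shape (N_PHI_PARAMS,)
    coords: query coordinates, shape (num_points, IN_DIM)
    """
    out, offset = coords, 0
    for i in range(0, len(PHI_SHAPES), 2):
        w_shape, b_shape = PHI_SHAPES[i], PHI_SHAPES[i + 1]
        w_numel = math.prod(w_shape)
        W = params[offset:offset + w_numel].view(*w_shape)
        offset += w_numel
        b = params[offset:offset + b_shape[0]]
        offset += b_shape[0]
        out = out @ W.t() + b
        if i < len(PHI_SHAPES) - 2:  # no activation after the output layer
            out = torch.relu(out)
    return out


# Usage: decode one latent into a field and query it on a 64x64 pixel grid.
hyper = HyperNetwork()
z = torch.randn(1, LATENT_DIM)
params = hyper(z)[0]
ys, xs = torch.meshgrid(
    torch.linspace(-1, 1, 64), torch.linspace(-1, 1, 64), indexing="ij"
)
coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)
rgb = phi_forward(params, coords).reshape(64, 64, OUT_DIM)
```

The quote also notes that baseline models such as the VAE are re-scaled so that parameter counts are comparable; nothing in the sketch above enforces that constraint.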
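
The Research Type row additionally quotes that GEM can "perceptually interpolate between signals" and draw samples from the learned manifold. The quoted text does not state the interpolation rule, so the helper below, which reuses HyperNetwork and phi_forward from the sketch above, simply blends two latent codes linearly and decodes each intermediate field; treat it as an illustration, not the paper's procedure.

```python
import torch

# Illustration only: linear blending of two latent codes, decoded with the
# HyperNetwork / phi_forward sketch above. GEM's actual interpolation along
# the learned manifold may differ from a straight line in latent space.
def interpolate_fields(hyper, z_a, z_b, coords, steps=8):
    """Decode fields along a straight line between two latent codes."""
    frames = []
    for t in torch.linspace(0.0, 1.0, steps):
        z = (1.0 - t) * z_a + t * z_b   # assumed: linear latent blend
        params = hyper(z)[0]
        frames.append(phi_forward(params, coords))
    return torch.stack(frames)          # (steps, num_points, OUT_DIM)


# Example: interpolate between two random latents on the 64x64 grid above.
# frames = interpolate_fields(hyper, torch.randn(1, LATENT_DIM),
#                             torch.randn(1, LATENT_DIM), coords)
```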