Learning Large-scale Neural Fields via Context Pruned Meta-Learning

Authors: Jihoon Tack, Subin Kim, Sihyun Yu, Jaeho Lee, Jinwoo Shin, Jonathan Richard Schwarz

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide an extensive empirical evaluation on nine datasets across multiple modalities, demonstrating state-of-the-art results while providing additional insight through careful analysis of the algorithmic components constituting our method.
Researcher Affiliation | Academia | Jihoon Tack¹, Subin Kim¹, Sihyun Yu¹, Jaeho Lee², Jinwoo Shin¹, Jonathan Schwarz³; ¹Korea Advanced Institute of Science and Technology, ²Pohang University of Science and Technology, ³University College London
Pseudocode | Yes | Algorithm 1: Meta-training of GradNCP; Algorithm 2: Meta-testing of GradNCP
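The quoted algorithms are not reproduced here, but the overall structure they describe (a MAML-style inner loop run on a pruned context set, with the meta loss taken on the full signal) can be sketched as follows. This is a hedged illustration in PyTorch, not the authors' pseudocode: the function names (`inner_adapt`, `prune_context`, `meta_train_step`), the error-based pruning score, and the placement of pruning before the inner loop are simplifying assumptions, and the bootstrap target correction (the L and λ hyperparameters mentioned below) is omitted.

```python
# Hedged sketch of a GradNCP-style meta-training step: adapt a neural field
# on a pruned context set, then backprop the full-signal loss into the
# meta-initialisation. Names and the pruning rule are illustrative.
import torch
from torch.func import functional_call


def prune_context(model, params, coords, targets, gamma):
    """Keep the top-gamma fraction of coordinates, ranked here by current
    per-point reconstruction error (an illustrative score; the paper's
    Algorithm 1 defines the actual criterion)."""
    with torch.no_grad():
        err = ((functional_call(model, params, (coords,)) - targets) ** 2).mean(dim=-1)
    k = max(1, int(gamma * coords.shape[0]))
    idx = err.topk(k).indices
    return coords[idx], targets[idx]


def inner_adapt(model, params, coords, targets, steps, alpha):
    """K gradient-descent updates on the (pruned) context, keeping the graph
    so the outer optimiser can differentiate through the adaptation."""
    for _ in range(steps):
        pred = functional_call(model, params, (coords,))
        loss = ((pred - targets) ** 2).mean()
        grads = torch.autograd.grad(loss, list(params.values()), create_graph=True)
        params = {k: p - alpha * g for (k, p), g in zip(params.items(), grads)}
    return params


def meta_train_step(model, meta_opt, tasks, K, alpha, gamma):
    """One outer step over a batch of signals (one task per signal)."""
    meta_opt.zero_grad()
    for coords, targets in tasks:
        params = dict(model.named_parameters())          # meta-initialisation
        c, t = prune_context(model, params, coords, targets, gamma)
        params = inner_adapt(model, params, c, t, K, alpha)
        pred = functional_call(model, params, (coords,))
        loss = ((pred - targets) ** 2).mean() / len(tasks)  # meta loss on full signal
        loss.backward()                                   # accumulates into meta-init
    meta_opt.step()
```

Meta-testing (Algorithm 2) would then amount to running `inner_adapt` from the learned initialisation on a new signal, without the outer update.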
Open Source Code | Yes | Code is available at https://github.com/jihoontack/GradNCP
Open Datasets | Yes | Datasets. Highlighting the versatility of NFs, we consider four different modalities, namely images, videos, audio, and manifolds. For images, we follow Learnit and use CelebA [36], Imagenette [23], and pictures including Text [61], while additionally considering the high-resolution fine-grained datasets CelebA-HQ [26] and AFHQ [7]. To test under high visual diversity we also use all 1,000 classes of ImageNet [8]. Pushing memory requirements past current limits, we consider the video dataset UCF-101 [57] at resolutions 128×128×16 and 256×256×32 to demonstrate the scalability of GradNCP. Finally, we also test on audio using the speech dataset Librispeech [42] and on manifolds by considering climate data on the globe using ERA5 [22].
Dataset Splits | Yes | CelebA. The dataset comprises 202,599 images, where we use 162K for training, 20K for validation, and 20K for testing.
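For reference, the reported counts correspond to a simple index partition along the following lines. This is only an illustrative sketch: the quoted text does not state the exact assignment (e.g., whether the official CelebA evaluation partition file is used), so the boundaries below are assumptions based on the rounded counts.

```python
# Hedged sketch: partition the 202,599 CelebA images into ~162K / 20K / 20K
# train / val / test splits by index. The exact split used by the authors
# may differ (e.g., the official list_eval_partition.txt).
n_total = 202_599
n_train, n_val = 162_000, 20_000

indices = list(range(n_total))
train_idx = indices[:n_train]
val_idx = indices[n_train:n_train + n_val]
test_idx = indices[n_train + n_val:]   # remaining ~20K images

print(len(train_idx), len(val_idx), len(test_idx))  # 162000 20000 20599
```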
Hardware Specification | Yes | For the main development, we mainly use Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz and a single RTX 3090 24GB GPU, except for high-resolution signals including CelebA-HQ at 1024×1024 and UCF-101 at 256×256×32, where we use an AMD EPYC 7542 32-Core Processor and a single NVIDIA A100 SXM4 40GB.
Software Dependencies | No | The paper mentions using specific models/architectures like SIREN [54] and NeRV [5] and optimizers like Adam [32]. However, it does not list specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, CUDA x.x), which are necessary for reproducible software setup.
Experiment Setup | Yes | Network architecture details. For the main experiment, we mainly use SIREN, a multi-layer perceptron (MLP) with sinusoidal activation functions [54], i.e., x ↦ sin(ω₀(Wx + b)), where W, b are the weights and biases of the MLP layer and ω₀ is a fixed hyperparameter. For the image, audio, and manifold datasets, we use SIREN with 5 layers and 256 hidden dimensions, and for video, we use 7 layers with the same hidden dimension. We use ω₀ = 50 for the manifold dataset and ω₀ = 30 for the rest. We additionally consider NeRV [5] for the video dataset. For UCF-101 at 128×128×16, we use 4 NeRV blocks, and for 256×256×32, we use 5 NeRV blocks. Finally, we consider the Fourier feature network (FFN) in Appendix D.3, where we use the same network size as SIREN and the Fourier feature scale σ = 10.
Training details. For all datasets, we use the Adam optimizer [32] for the outer loop. We use 150,000 outer steps, except when learning the ImageNet dataset, where we use 500,000 steps. For SIREN, we use an outer learning rate of β = 3.0 × 10⁻⁶ for Librispeech and β = 1.0 × 10⁻⁵ for the rest. For NeRV, we use an outer learning rate of β = 1.0 × 10⁻⁴. For the inner-loop learning rate, we use α = 1.0 × 10⁻² for SIREN and α = 1.0 × 10⁻¹ for NeRV. For the number of inner steps K, GradNCP is trained on a horizon longer than Learnit's by a factor of 1/γ (which uses the same amount of memory). For Learnit, we mainly use K = 4 for the main table, with K = 1 for CelebA-HQ (1024×1024), K = 5 on UCF-101 (256×256×32) with NeRV, and K = 20 on UCF-101 (128×128×16) with NeRV. We use the same batch size for Learnit and GradNCP to use memory fairly, where the size was selected separately for each dataset (e.g., under the given GPU memory budget).
Hyperparameter details for GradNCP. We find that the hyperparameters introduced by GradNCP are not sensitive across datasets and architectures. For the context set selection ratio γ, i.e., the ratio of retained coordinates, we use 0.25 for most datasets, except for Librispeech and ERA5, where we use 0.5 (as we do not need to prune the context much for these low-resolution signals), and 0.5 and 0.2 when training NeRV on UCF-101 at 128 and 256 resolution, respectively. We found that for most datasets the performance does not significantly decrease until γ = 0.2, while memory is reduced significantly, i.e., by about 5 times. For the bootstrap target correction hyperparameters, we use L = 5 and λ = 100; we believe tuning these hyperparameters further would improve performance even more (we did not tune them extensively).
Evaluation details. For the evaluation, to compare fairly with the baseline, we use the same number of test-time adaptation steps for Learnit and GradNCP (e.g., for the CelebA experiments we use K_test = 16), which is the same number of steps used by GradNCP during meta-training.
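The SIREN configuration quoted above (x ↦ sin(ω₀(Wx + b)), 5 layers, 256 hidden units, ω₀ = 30 for images) corresponds to a small module along the following lines. This is a generic SIREN sketch for orientation, not the authors' implementation; the layer-count convention, the output dimensionality, and the weight initialisation (taken from the original SIREN paper) are assumptions.

```python
# Hedged sketch of the quoted SIREN: an MLP with sine activations,
# x -> sin(omega_0 * (W x + b)). Generic illustration, not the authors' code.
import math
import torch
import torch.nn as nn


class SirenLayer(nn.Module):
    def __init__(self, in_dim, out_dim, omega_0=30.0, is_first=False, is_last=False):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.omega_0, self.is_last = omega_0, is_last
        with torch.no_grad():  # weight init following Sitzmann et al. (2020)
            bound = 1.0 / in_dim if is_first else math.sqrt(6.0 / in_dim) / omega_0
            self.linear.weight.uniform_(-bound, bound)

    def forward(self, x):
        y = self.linear(x)
        return y if self.is_last else torch.sin(self.omega_0 * y)


def make_siren(in_dim=2, hidden=256, out_dim=3, n_layers=5, omega_0=30.0):
    """5-layer SIREN mapping 2-D coordinates to RGB; per the quoted setup,
    omega_0 = 50 would be used for the manifold (ERA5) data instead."""
    dims = [in_dim] + [hidden] * (n_layers - 1) + [out_dim]
    layers = [SirenLayer(dims[i], dims[i + 1], omega_0,
                         is_first=(i == 0), is_last=(i == n_layers - 1))
              for i in range(n_layers)]
    return nn.Sequential(*layers)
```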
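For quick reference, the quoted training and GradNCP hyperparameters for the SIREN/CelebA setting can be collected into a single configuration. The key names below are ours; only the values come from the text above.

```python
# Illustrative config collecting the quoted hyperparameters (SIREN / CelebA).
config = dict(
    outer_optimizer="Adam",
    outer_steps=150_000,       # 500,000 for ImageNet
    outer_lr=1e-5,             # 3e-6 for Librispeech; 1e-4 for NeRV
    inner_lr=1e-2,             # 1e-1 for NeRV
    test_inner_steps=16,       # K_test used for the CelebA evaluation
    context_ratio=0.25,        # gamma; 0.5 for Librispeech/ERA5, 0.5/0.2 for NeRV on UCF-101
    bootstrap_steps=5,         # L
    bootstrap_weight=100.0,    # lambda
)
```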