Toward Large Kernel Models

Authors: Amirhesam Abedsoltan, Mikhail Belkin, Parthe Pandit

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our numerical experiments, we train models with up to 1 million centers on 5 million samples. To the best of our knowledge, this was not achievable with any other existing method. Our work provides a path forward to scaling both the model size and size of the training dataset, independently.
Researcher Affiliation | Academia | Department of Computer Science and Engineering, and Halicioglu Data Science Institute, UC San Diego, USA. Correspondence to: <aabedsoltan, parthepandit@ucsd.edu>.
Pseudocode | Yes | Algorithm 1 EigenPro 3.0 Exact-Projection; Algorithm 2 EigenPro 3.0; Algorithm 3 EigenPro 2.0(X, y), which solves the linear system K(X, X)θ = y (a minimal sketch of this system appears after the table).
Open Source Code | Yes | A Python package is available at github.com/EigenPro3
Open Datasets | Yes | We perform experiments on these datasets: (1) CIFAR10, CIFAR10* (Krizhevsky et al., 2009), (2) CIFAR5M, CIFAR5M* (Nakkiran et al., 2021), (3) ImageNet (Deng et al., 2009), (4) MNIST (LeCun, 1998), (5) MNIST8M (Loosli et al., 2007), (6) Fashion MNIST (Xiao et al., 2017), (7) Webvision (Li et al., 2017), and (8) Librispeech (Panayotov et al., 2015).
Dataset Splits | No | The paper mentions training and testing data (e.g., 'train-clean-100 and train-clean-360 (5M samples) as our training data, test-clean as our test set' for Librispeech), but it does not explicitly provide validation splits or detailed split percentages for all datasets used, nor does it specify split methodology such as random seeds or stratification.
Hardware Specification | Yes | We used machines with 2x NVIDIA V100 and 8x NVIDIA A100 GPUs, with VRAM of 32 GB and 40 GB respectively, and 8 cores of an Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz with 100 GB of RAM.
Software Dependencies | No | The paper mentions software such as Scikit-learn, timm, Scipy, and ESPnet, but it does not provide version numbers for these components, only citations to their respective papers.
Experiment Setup | Yes | The only hyperparameters that we need to set are s, q for the outer gradient step, and σ, ξ for the projection sub-problem. For σ, ξ, we used the same criteria as Ma & Belkin (2019) to make optimal use of the GPU. For s, q, we prefer larger q because, as explained in Ma et al. (2018), larger q allows for a larger learning rate and a better condition number. We set the batch size and learning rate automatically using the new top eigenvalue, as explained in Ma & Belkin (2019) and Ma et al. (2018) (see the step-size sketch after the table).
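
To make the pseudocode entry concrete, here is a minimal sketch of the linear system K(X, X)θ = y that Algorithm 3 (EigenPro 2.0) targets. The Laplacian kernel, bandwidth, ridge term, and toy data below are illustrative assumptions, not the paper's settings. A direct solve like this needs O(n²) memory and O(n³) time, which is exactly the cost the paper's preconditioned stochastic iteration avoids while solving the same system.

```python
import numpy as np

def laplacian_kernel(X, Z, bandwidth=5.0):
    """Laplacian kernel K(x, z) = exp(-||x - z|| / bandwidth)."""
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                - 2.0 * X @ Z.T
                + np.sum(Z**2, axis=1)[None, :])
    return np.exp(-np.sqrt(np.maximum(sq_dists, 0.0)) / bandwidth)

# Toy problem: n samples, d features, c classes with one-hot labels.
rng = np.random.default_rng(0)
n, d, c = 500, 10, 3
X = rng.standard_normal((n, d))
y = np.eye(c)[rng.integers(0, c, size=n)]

# Direct solve of K(X, X) theta = y; the tiny ridge term keeps the
# system numerically well posed.
K = laplacian_kernel(X, X)
theta = np.linalg.solve(K + 1e-8 * np.eye(n), y)

# Predictions on new points: f(x) = K(x, X) theta.
X_test = rng.standard_normal((20, d))
preds = laplacian_kernel(X_test, X) @ theta
```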
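
The step-size rule quoted in the Experiment Setup row can likewise be sketched. The snippet below captures only the shape of the heuristic: it estimates the top kernel eigenvalues from a subsample and lets the learning rate scale with the inverse of eigenvalue q+1, reflecting why larger q admits a larger learning rate under a rank-q preconditioner. The kernel choice, subsample size m, batch size, and the constant-free formula lr = batch_size / λ_{q+1} are assumptions; the exact expressions in Ma & Belkin (2019) and Ma et al. (2018) involve additional constants.

```python
import numpy as np

def laplacian_kernel(X, Z, bandwidth=5.0):
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                - 2.0 * X @ Z.T
                + np.sum(Z**2, axis=1)[None, :])
    return np.exp(-np.sqrt(np.maximum(sq_dists, 0.0)) / bandwidth)

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 10))

# Estimate the kernel spectrum from a subsample of m rows; the 1/m
# scaling approximates eigenvalues of the kernel integral operator.
m, q = 200, 10
idx = rng.choice(len(X), size=m, replace=False)
eigvals = np.sort(np.linalg.eigvalsh(laplacian_kernel(X[idx], X[idx]) / m))[::-1]

# A rank-q preconditioner damps the top q eigendirections, so the
# stable step size is governed by eigenvalue q+1 rather than
# eigenvalue 1: larger q permits a larger learning rate.
batch_size = 256
lr = batch_size / eigvals[q]
print(f"lambda_1 = {eigvals[0]:.3g}, lambda_{q + 1} = {eigvals[q]:.3g}, lr = {lr:.3g}")
```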