Dense-Exponential Random Features: Sharp Positive Estimators of the Gaussian Kernel
Authors: Valerii Likhosherstov, Krzysztof M Choromanski, Kumar Avinava Dubey, Frederick Liu, Tamas Sarlos, Adrian Weller
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we evaluate DERFs experimentally in various machine learning applications. |
| Researcher Affiliation | Collaboration | Valerii Likhosherstov (University of Cambridge, v.lihosherstov@gmail.com); Krzysztof Choromanski (Google DeepMind & Columbia University, kchoro@google.com); Avinava Dubey (Google Research); Frederick Liu (Google Research); Tamas Sarlos (Google Research); Adrian Weller (University of Cambridge & The Alan Turing Institute) |
| Pseudocode | No | The paper describes algorithms and methods in prose but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper does not contain any statement about releasing open-source code for the methodology described, nor does it provide a link to a code repository. |
| Open Datasets | Yes | MNIST and CIFAR10, where x, y are random images from MNIST [19] or CIFAR10 [27]... on 8 benchmarks from UCI [21]... We train Performer-encoders and test them on the LibriSpeech corpus [35]... The General Language Understanding Evaluation (GLUE) benchmark [46]... We pretrained the BERT model on two publicly available datasets (see Table 3): Books [54] and Wikipedia. |
| Dataset Splits | Yes | As in [31], we obtain training, validation and test splits by shuffling the raw dataset and taking 90%, 5%, 5% of objects respectively. The splits are fixed for all RF methods. (A sketch of this split procedure is given below the table.) |
| Hardware Specification | Yes | For the Transformer setups, we use a TPU cluster and the JAX [7] library. All tested Transformer variants were trained and tested on TPU pods containing 4 TPU v3 chips with JAX, and on GPUs (V100). |
| Software Dependencies | No | The paper mentions using 'NumPy [25]' and 'JAX [7]' but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | Table 2: Hyperparameters of the base models for pre-training, shared by all methods (e.g., # of heads: 12, hidden layer size: 768, batch size: 256, learning rate: 10^-4). Our Conformer-Transducer variant was characterized by 20 conformer layers, model_dim = 512, relative position embedding dimensionality rped = 512 and h = 8 heads. We used batch size bs = 2048 and trained with the Adam optimizer on TPUs. |
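
The 90% / 5% / 5% split protocol quoted in the Dataset Splits row can be reproduced in a few lines of NumPy. The sketch below is illustrative only and assumes the raw dataset is held as arrays `X`, `y` and that a fixed seed is used so the same splits are reused for every RF method; the function name `split_dataset` and the seed value are our own choices, not taken from the paper.

```python
import numpy as np

def split_dataset(X, y, seed=0):
    """Shuffle the raw dataset and take 90% / 5% / 5% for train / val / test.

    Minimal sketch of the split protocol described above; fixing the seed
    keeps the splits identical across all random-feature methods.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    perm = rng.permutation(n)

    n_train = int(0.90 * n)
    n_val = int(0.05 * n)

    train_idx = perm[:n_train]
    val_idx = perm[n_train:n_train + n_val]
    test_idx = perm[n_train + n_val:]

    return ((X[train_idx], y[train_idx]),
            (X[val_idx], y[val_idx]),
            (X[test_idx], y[test_idx]))


# Example usage on a synthetic stand-in for one of the UCI benchmarks.
X = np.random.randn(1000, 16)
y = np.random.randn(1000)
(train, val, test) = split_dataset(X, y, seed=42)
```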