Adaptive Sampling for Efficient Softmax Approximation
Authors: Tavor Baharav, Ryan Kang, Colin Sullivan, Mo Tiwari, Eric Luxenberg, David Tse, Mert Pilanci
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the sample efficiency improvements afforded by Adaptive Softmax on real and synthetic data to corroborate our theoretical results. In Section 5, we demonstrate the empirical advantages of our algorithm in several real-world applications, including in a multiclass classification setting and in large language models. |
| Researcher Affiliation | Collaboration | Tavor Z. Baharav, Eric and Wendy Schmidt Center, Broad Institute, Cambridge, MA 02142, baharav@broadinstitute.org; Eric Luxenberg, Gridmatic, Cupertino, CA 95014, eric@gridmatic.com |
| Pseudocode | Yes | Algorithm 1 Adaptive Softmax, Algorithm 2 Normalization Estimation, Algorithm 3 Best Arm Id, Algorithm 4 Adaptive Softmax (implementation details) |
| Open Source Code | Yes | All of our results are reproducible via a 1-line script, publicly available on GitHub at github.com/ThrunGroup/adaptiveSoftmax. |
| Open Datasets | Yes | The MNIST dataset, containing black and white images of handwritten digits as input and ten output classes representing all ten possible digits; the EuroSAT dataset, containing RGB satellite imagery as input and ten output classes representing possible land types (e.g., river, residential, etc.); our task is text generation, and we generate our queries x by using two datasets (Wikitext and Penn Treebank) with a sliding window of a certain stride. |
| Dataset Splits | No | The paper mentions training models and evaluating on a 'test set', and tuning parameters on 'initial training data', but it does not provide specific details on training/validation/test dataset splits (e.g., percentages or exact counts for each split). |
| Hardware Specification | No | The paper discusses hardware implications and optimization strategies (e.g., 'SRAM on a GPU', 'tiling of our matrix') but does not specify the exact GPU/CPU models or other detailed hardware specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using Hugging Face's 'AutoModelForCausalLM' module but does not specify its version number or any other software dependencies with version details. |
| Experiment Setup | Yes | For the MNIST dataset, we train a shallow CNN from scratch with two convolutional blocks (Conv2d, ReLU, MaxPool, BatchNorm); constant multiples were applied to the variance estimates within Algorithm 3 and Algorithm 2; tuning is performed, generally, via bisection to discover the minimal factor which still satisfies our provided failure probability parameter δ. |
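
The Experiment Setup row quotes a shallow two-block CNN trained from scratch on MNIST, but the paper excerpt does not report layer widths or kernel sizes. A minimal PyTorch sketch of such an architecture, with hypothetical channel counts and kernel sizes chosen only for illustration, might look like the following.

```python
import torch
import torch.nn as nn

class ShallowCNN(nn.Module):
    """Two convolutional blocks (Conv2d, ReLU, MaxPool, BatchNorm) plus a linear classifier.

    Channel counts, kernel sizes, and the flattened dimension are illustrative
    assumptions; the paper excerpt does not state these hyperparameters.
    """

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: 1 input channel (grayscale MNIST) -> 16 feature maps
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),          # 28x28 -> 14x14
            nn.BatchNorm2d(16),
            # Block 2: 16 -> 32 feature maps
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),          # 14x14 -> 7x7
            nn.BatchNorm2d(32),
        )
        self.classifier = nn.Linear(32 * 7 * 7, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        return self.classifier(x.flatten(start_dim=1))

# Example: a batch of four 28x28 grayscale images -> ten class logits
logits = ShallowCNN()(torch.randn(4, 1, 28, 28))
print(logits.shape)  # torch.Size([4, 10])
```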
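
The Open Datasets row notes that language-model queries are generated from Wikitext and Penn Treebank with a sliding window of a certain stride, but the exact window length and stride are not given. A small sketch of that style of query generation, using hypothetical `window` and `stride` values, could look like this.

```python
from typing import Iterator, List

def sliding_window_queries(token_ids: List[int], window: int = 128, stride: int = 64) -> Iterator[List[int]]:
    """Yield fixed-length contexts from a token stream via a sliding window.

    `window` and `stride` are illustrative assumptions; the paper excerpt only
    says a sliding window of a certain stride was used over Wikitext and
    Penn Treebank.
    """
    for start in range(0, max(len(token_ids) - window, 0) + 1, stride):
        yield token_ids[start:start + window]

# Example with a toy token stream; in practice token_ids would come from a
# tokenizer applied to the Wikitext or Penn Treebank text.
toy_tokens = list(range(300))
queries = list(sliding_window_queries(toy_tokens, window=128, stride=64))
print(len(queries), len(queries[0]))  # number of queries, context length
```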