Hash Layers For Large Sparse Models
Authors: Stephen Roller, Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We investigate the training of sparse layers... We show that this procedure either outperforms or is competitive with learning-to-route mixture-of-expert methods... We study the performance of different hashing techniques... We show our approach works well both on large language modeling and dialogue tasks, and on downstream fine-tuning tasks. |
| Researcher Affiliation | Industry | Stephen Roller, Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston (Facebook AI Research) |
| Pseudocode | No | The paper describes methods in text and uses mathematical formulas but does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide a concrete access link (e.g., GitHub URL) or an explicit statement confirming the release of source code for the methodology described in this paper. |
| Open Datasets | Yes | Pushshift.io Reddit We use a variant of Reddit discussions... made available on pushshift.io [33]... RoBERTa+cc100en Data We use the same data used to train BASE [10]... Wikitext-103 Wikitext-103 is a smaller language modeling benchmark [37]... |
| Dataset Splits | Yes | The load balancing for Switch is optimized on the validation set. Results tables report both splits per model, with columns: Model Configuration, Params, Valid PPL, Test PPL. |
| Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or detailed computer specifications used for running its experiments. |
| Software Dependencies | No | The paper mentions using the 'ParlAI' platform and 'Fairseq' codebase, but it does not specify concrete version numbers for these or other software dependencies required for replication. |
| Experiment Setup | Yes | The majority of our experiments are carried out in the ParlAI platform using an encoder-decoder Transformer framework... We refer to the one with 11 layers and embedding size of d = 1024 and FFN hidden layer size of D = 4096 as our Baseline Transformer... All experiments are run for 100k updates; a table of hyperparameters is provided in subsection B.1. |
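The routing rule summarized in the Research Type row above, replacing a learned mixture-of-experts router with a fixed hash of the input token, can be sketched briefly. The sketch below is a minimal illustration under assumed names (`HashFFNLayer`, `token_to_expert`) and uses a random token-to-expert lookup table as a stand-in for the hashing schemes the paper compares; it is not the authors' implementation.

```python
import torch
import torch.nn as nn


class HashFFNLayer(nn.Module):
    """Sparse FFN layer with a fixed, hash-based token-to-expert assignment."""

    def __init__(self, vocab_size, num_experts, d_model, d_hidden):
        super().__init__()
        # Fixed assignment drawn once at init; no routing parameters are learned.
        self.register_buffer(
            "token_to_expert", torch.randint(num_experts, (vocab_size,))
        )
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.ReLU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(num_experts)
        )

    def forward(self, hidden, token_ids):
        # hidden: (batch, seq, d_model); token_ids: (batch, seq)
        expert_ids = self.token_to_expert[token_ids]
        out = torch.zeros_like(hidden)
        for e, expert in enumerate(self.experts):
            mask = expert_ids == e
            if mask.any():
                # Each token is processed only by the expert its hash selects.
                out[mask] = expert(hidden[mask])
        return out
```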
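As a usage note tied to the Experiment Setup row, the baseline dimensions quoted there (embedding size d = 1024, FFN hidden size D = 4096) can be plugged into the sketch directly; the vocabulary size and expert count below are placeholders, not values taken from the paper.

```python
# d_model/d_hidden follow the quoted baseline; other sizes are hypothetical.
layer = HashFFNLayer(vocab_size=50_000, num_experts=16, d_model=1024, d_hidden=4096)

tokens = torch.randint(50_000, (2, 8))   # dummy token ids, (batch=2, seq=8)
hidden = torch.randn(2, 8, 1024)         # dummy hidden states
output = layer(hidden, tokens)           # -> (2, 8, 1024)
```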