Dynamic Layer Tying for Parameter-Efficient Transformers

Authors: Tamir David Hay, Lior Wolf

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experimental evaluations validate that our model modestly outperforms the baseline transformer model with regard to perplexity and drastically reduces the number of trainable parameters. In particular, the memory consumption during training is up to one order of magnitude less than the conventional training method."
Researcher Affiliation | Academia | "Tamir David Hay & Lior Wolf, Blavatnik School of Computer Science, Tel Aviv University, {davidhay,wolf}@mail.tau.ac.il"
Pseudocode | Yes | "Algorithm 1: Q-learning driven dynamic layer tying" (a hedged sketch of the selection step follows the table)
Open Source Code | No | No explicit statement about releasing source code, and no direct link to a code repository for the methodology described in the paper, was found.
Open Datasets | Yes | "WikiText-2 (Wiki2) is a large language modeling corpus that consists of over 2 million tokens. It is derived from a snapshot of verified Good and Featured articles on Wikipedia. The dataset is widely used for training language models and serves as a standard benchmark for evaluating various NLP algorithms. WikiText-103 (Wiki103) is an extension of the WikiText-2 dataset, containing more than 100 million tokens. It is also sourced from Wikipedia articles and is considered to be one of the most comprehensive datasets for training large-scale language models. LAMBADA is designed to test the capabilities of language models in predicting the final word of a sentence, given all the preceding words in that sentence. The dataset contains approximately 10,000 examples, each a sequence of sentences extracted from books. The task is challenging as it often requires understanding the broader context provided by the preceding sentences. The 1 Billion Words dataset is a corpus of text containing approximately 1 billion tokens, sourced from news articles. It provides a diverse range of vocabulary and sentence structures, making it ideal for training robust language models." (a loading sketch follows the table)
Dataset Splits | Yes | "We ran all experiments for K = 300 epochs, a batch size of 16, and k = 15 with a separate validation set used to select the best model."
Hardware Specification | Yes | "Our experiments ran on 2-4 A100 GPUs for the GPT-2 based architecture and 1-4 A6000/A5000 GPUs for the BERT architecture."
Software Dependencies | No | The paper mentions "Q is an MLP" and "GPT-2's tokenizer" but does not specify version numbers for any software dependencies such as PyTorch, TensorFlow, CUDA, or specific Python libraries.
Experiment Setup | Yes | "We ran all experiments for K = 300 epochs, a batch size of 16, and k = 15 with a separate validation set used to select the best model. The hyper-parameters used were: the transformer learning rate is set to 0.0001 and Q's learning rate was set to 0.001, γ = 0.99, the initial exploration probability is set to ϵ = 1.0 (explore), and as depicted in Alg. 1, the ϵ-decay factor: 0.95, and the minimal ϵ value is set to 0.1." (these values are collected in the configuration sketch below)
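The Pseudocode row refers to the paper's Algorithm 1, a Q-learning procedure for dynamically tying (sharing) transformer layer weights during training. The minimal PyTorch sketch below illustrates only the epsilon-greedy selection step such a procedure involves; the names `QNetwork`, `select_tying_action`, and `decay_epsilon`, the state representation, and the reading that each layer either ties to an earlier layer or keeps its own weights are my assumptions. Only the remark that "Q is an MLP" and the epsilon schedule (start 1.0, decay 0.95, floor 0.1) come from the paper.

```python
# Hedged sketch: epsilon-greedy action selection for Q-learning driven layer tying.
# All class and function names here are illustrative, not the authors' code.
import random
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Small MLP Q-function (the paper states only that 'Q is an MLP')."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions)
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

def select_tying_action(q_net: QNetwork, state: torch.Tensor,
                        layer_idx: int, epsilon: float) -> int:
    """For layer `layer_idx`, pick which earlier layer (0..layer_idx-1) to tie its
    weights to, or `layer_idx` itself to keep independent weights (assumed action space).
    Uses standard epsilon-greedy exploration."""
    valid_actions = list(range(layer_idx + 1))
    if random.random() < epsilon:
        return random.choice(valid_actions)          # explore
    with torch.no_grad():
        q_values = q_net(state)[:layer_idx + 1]      # mask out invalid (later) layers
    return int(torch.argmax(q_values).item())        # exploit

def decay_epsilon(epsilon: float, decay: float = 0.95, eps_min: float = 0.1) -> float:
    """Epsilon schedule reported in the experiment setup: start 1.0, decay 0.95, floor 0.1."""
    return max(eps_min, epsilon * decay)
```

A full implementation would also update Q from a reward signal and retie the transformer's layer parameters after each decision; those details belong to the paper's Algorithm 1 and are not reproduced here.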
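All four corpora in the Open Datasets row are publicly available. Below is a minimal loading sketch using the Hugging Face `datasets` library; the hub identifiers ("wikitext", "lambada", "lm1b") are commonly used mirrors and are my assumption, not references taken from the paper. Depending on your `datasets` version, some identifiers may differ or require `trust_remote_code=True`.

```python
# Hedged sketch: loading the benchmark corpora named in the paper via the
# Hugging Face `datasets` library. Hub identifiers are assumed mirrors.
from datasets import load_dataset

wiki2 = load_dataset("wikitext", "wikitext-2-raw-v1")      # over 2M tokens
wiki103 = load_dataset("wikitext", "wikitext-103-raw-v1")  # over 100M tokens
lambada = load_dataset("lambada")                          # final-word prediction task
lm1b = load_dataset("lm1b")                                # 1 Billion Words benchmark

print(wiki2)                          # DatasetDict with train/validation/test splits
print(wiki2["train"][0]["text"])      # raw article text
```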
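The Dataset Splits and Experiment Setup rows report the same training recipe. For convenience, the stated values are collected in the configuration sketch below; the field names are illustrative labels, not the authors' variable names, and the values are exactly those quoted above.

```python
# Hedged sketch: the reported hyper-parameters gathered into one configuration object.
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    epochs: int = 300            # K = 300
    batch_size: int = 16
    k: int = 15                  # the paper's k = 15 (its exact role is not restated in this summary)
    transformer_lr: float = 1e-4
    q_lr: float = 1e-3           # learning rate of the Q-network
    gamma: float = 0.99          # discount factor
    epsilon_start: float = 1.0   # initial exploration probability
    epsilon_decay: float = 0.95
    epsilon_min: float = 0.1

config = TrainingConfig()
```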