Dynamic Layer Tying for Parameter-Efficient Transformers

Authors: Tamir David Hay, Lior Wolf

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experimental evaluations validate that our model modestly outperforms the baseline transformer model with regard to perplexity and drastically reduces the number of trainable parameters. In particular, the memory consumption during training is up to one order of magnitude less than the conventional training method."
Researcher Affiliation | Academia | "Tamir David Hay & Lior Wolf, Blavatnik School of Computer Science, Tel Aviv University, {davidhay,wolf}@mail.tau.ac.il"
Pseudocode | Yes | "Algorithm 1: Q-learning driven dynamic layer tying" (a hedged sketch of the selection step follows the table)
Open Source Code | No | No explicit statement about releasing source code, and no direct link to a code repository for the methodology described in the paper, was found.
Open Datasets | Yes | "WikiText-2 (Wiki2) is a large language modeling corpus that consists of over 2 million tokens. It is derived from a snapshot of verified Good and Featured articles on Wikipedia. The dataset is widely used for training language models and serves as a standard benchmark for evaluating various NLP algorithms. WikiText-103 (Wiki103) is an extension of the WikiText-2 dataset, containing more than 100 million tokens. It is also sourced from Wikipedia articles and is considered to be one of the most comprehensive datasets for training large-scale language models. LAMBADA is designed to test the capabilities of language models in predicting the final word of a sentence, given all the preceding words in that sentence. The dataset contains approximately 10,000 examples, each a sequence of sentences extracted from books. The task is challenging as it often requires understanding the broader context provided by the preceding sentences. The 1 Billion Words dataset is a corpus of text containing approximately 1 billion tokens, sourced from news articles. It provides a diverse range of vocabulary and sentence structures, making it ideal for training robust language models." (a loading sketch follows the table)
Dataset Splits | Yes | "We ran all experiments for K = 300 epochs, a batch size of 16, and k = 15 with a separate validation set used to select the best model."
Hardware Specification | Yes | "Our experiments ran on 2-4 A100 GPUs for the GPT-2 based architecture and 1-4 A6000/A5000 GPUs for the BERT architecture."
Software Dependencies | No | The paper mentions "Q is an MLP" and "GPT-2's tokenizer" but does not specify version numbers for any software dependencies such as PyTorch, TensorFlow, CUDA, or specific Python libraries.
Experiment Setup | Yes | "We ran all experiments for K = 300 epochs, a batch size of 16, and k = 15 with a separate validation set used to select the best model. The hyper-parameters used were: the transformer learning rate is set to 0.0001 and Q's learning rate was set to 0.001, γ = 0.99, the initial exploration probability is set to ϵ = 1.0 (explore), and as depicted in Alg. 1, the ϵ-decay factor: 0.95, and the minimal ϵ value is set to 0.1." (these values are collected in the configuration sketch below)
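The Pseudocode row refers to the paper's Algorithm 1, a Q-learning procedure for dynamically tying (sharing) transformer layer weights during training. The minimal PyTorch sketch below illustrates only the epsilon-greedy selection step such a procedure involves; the names `QNetwork`, `select_tying_action`, and `decay_epsilon`, the state representation, and the reading that each layer either ties to an earlier layer or keeps its own weights are my assumptions. Only the remark that "Q is an MLP" and the epsilon schedule (start 1.0, decay 0.95, floor 0.1) come from the paper.

```python
# Hedged sketch: epsilon-greedy action selection for Q-learning driven layer tying.
# All class and function names here are illustrative, not the authors' code.
import random
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Small MLP Q-function (the paper states only that 'Q is an MLP')."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions)
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

def select_tying_action(q_net: QNetwork, state: torch.Tensor,
                        layer_idx: int, epsilon: float) -> int:
    """For layer `layer_idx`, pick which earlier layer (0..layer_idx-1) to tie its
    weights to, or `layer_idx` itself to keep independent weights (assumed action space).
    Uses standard epsilon-greedy exploration."""
    valid_actions = list(range(layer_idx + 1))
    if random.random() < epsilon:
        return random.choice(valid_actions)          # explore
    with torch.no_grad():
        q_values = q_net(state)[:layer_idx + 1]      # mask out invalid (later) layers
    return int(torch.argmax(q_values).item())        # exploit

def decay_epsilon(epsilon: float, decay: float = 0.95, eps_min: float = 0.1) -> float:
    """Epsilon schedule reported in the experiment setup: start 1.0, decay 0.95, floor 0.1."""
    return max(eps_min, epsilon * decay)
```

A full implementation would also update Q from a reward signal and retie the transformer's layer parameters after each decision; those details belong to the paper's Algorithm 1 and are not reproduced here.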
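All four corpora in the Open Datasets row are publicly available. Below is a minimal loading sketch using the Hugging Face `datasets` library; the hub identifiers ("wikitext", "lambada", "lm1b") are commonly used mirrors and are my assumption, not references taken from the paper. Depending on your `datasets` version, some identifiers may differ or require `trust_remote_code=True`.

```python
# Hedged sketch: loading the benchmark corpora named in the paper via the
# Hugging Face `datasets` library. Hub identifiers are assumed mirrors.
from datasets import load_dataset

wiki2 = load_dataset("wikitext", "wikitext-2-raw-v1")      # over 2M tokens
wiki103 = load_dataset("wikitext", "wikitext-103-raw-v1")  # over 100M tokens
lambada = load_dataset("lambada")                          # final-word prediction task
lm1b = load_dataset("lm1b")                                # 1 Billion Words benchmark

print(wiki2)                          # DatasetDict with train/validation/test splits
print(wiki2["train"][0]["text"])      # raw article text
```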
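The Dataset Splits and Experiment Setup rows report the same training recipe. For convenience, the stated values are collected in the configuration sketch below; the field names are illustrative labels, not the authors' variable names, and the values are exactly those quoted above.

```python
# Hedged sketch: the reported hyper-parameters gathered into one configuration object.
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    epochs: int = 300            # K = 300
    batch_size: int = 16
    k: int = 15                  # the paper's k = 15 (its exact role is not restated in this summary)
    transformer_lr: float = 1e-4
    q_lr: float = 1e-3           # learning rate of the Q-network
    gamma: float = 0.99          # discount factor
    epsilon_start: float = 1.0   # initial exploration probability
    epsilon_decay: float = 0.95
    epsilon_min: float = 0.1

config = TrainingConfig()
```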