By Tying Embeddings You Are Assuming the Distributional Hypothesis
Authors: Francesco Bertolotti, Walter Cazzola
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we analyze both theoretically and empirically the effect of tied input-output embeddings, a popular technique that reduces the model size while often improving training... Further, we complement the theoretical findings with several experiments supporting the claims. |
| Researcher Affiliation | Academia | Department of Computer Science, Università degli Studi di Milano, Milan, Italy. |
| Pseudocode | No | The paper presents theoretical proofs and experimental descriptions but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code for reproducing the experiments is available at: https://zenodo.org/records/11103163 |
| Open Datasets | Yes | We chose the Bookcorpus dataset (Zhu et al., 2015), a 5GB collection of English sentences extracted from existing books. |
| Dataset Splits | No | The training split of the EXor problem is generated by considering 90% of all possible binary strings of size N (sketched below the table). The paper defines training and test sets but does not explicitly mention a separate validation split for either the EXor problem or the Bookcorpus dataset. |
| Hardware Specification | No | The paper describes the model architectures and training setups, but does not specify any particular hardware components like CPU, GPU models, or memory. |
| Software Dependencies | No | We used AdamW (Loshchilov & Hutter, 2018) (PyTorch implementation) optimizer with 5e-4 learning rate, 1e-1 weight decay. Further, we employ a cosine learning rate scheduler (PyTorch implementation) with 1e-5 minimum learning rate and 1e4 iteration cycle. |
| Experiment Setup | Yes | The model architecture is similar to the one assumed in Theorem 4.2. We used a single layer, single head, with gelu (Hendrycks & Gimpel, 2016) activation, Transformer Encoder (Vaswani et al., 2017) architecture from the PyTorch API. ... We used AdamW (Loshchilov & Hutter, 2018) (PyTorch implementation) optimizer with 5e-4 learning rate, 1e-1 weight decay. Further, we employ a cosine learning rate scheduler (PyTorch implementation) with 1e-5 minimum learning rate and 1e4 iteration cycle. The batch size is 114 (size of the training split). We train for 1.5e5 iterations. (A sketch of this setup appears below the table.) |
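
The Dataset Splits row describes taking 90% of all binary strings of length N as the EXor training split. Below is a minimal sketch of that split; the value of N, the random seed, and the use of a uniformly random selection are assumptions, since the quoted text only states that 90% of all possible strings form the training set.

```python
from itertools import product
import random

# Hedged sketch of the EXor 90/10 split; N and the random selection are assumptions.
N = 7  # hypothetical string length, not stated in the quoted text
all_strings = ["".join(bits) for bits in product("01", repeat=N)]  # 2**N strings

random.seed(0)
random.shuffle(all_strings)
cut = int(0.9 * len(all_strings))          # 90% for training
train_split, test_split = all_strings[:cut], all_strings[cut:]
```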
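
The Software Dependencies and Experiment Setup rows together describe the model and optimizer configuration. The following is a minimal PyTorch sketch of that setup; the embedding width, feed-forward size, vocabulary size, and the exact cosine-scheduler variant (CosineAnnealingLR here) are assumptions, while the learning rate, weight decay, scheduler floor and cycle length, batch size, and iteration count are taken from the quoted text.

```python
import torch
import torch.nn as nn

d_model = 64     # assumed embedding width (not stated in the quoted text)
vocab_size = 3   # assumed token inventory for the EXor problem

embedding = nn.Embedding(vocab_size, d_model)
encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model,
    nhead=1,                      # single head, as described
    dim_feedforward=4 * d_model,  # assumed
    activation="gelu",            # gelu activation, as described
    batch_first=True,
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=1)  # single layer

# A tied output projection would reuse embedding.weight; omitted here.
params = list(embedding.parameters()) + list(encoder.parameters())
optimizer = torch.optim.AdamW(params, lr=5e-4, weight_decay=1e-1)

# Cosine schedule with a 1e4-iteration cycle and a 1e-5 floor; the paper does not
# name the scheduler class, so CosineAnnealingLR is one plausible reading.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=10_000, eta_min=1e-5
)

batch_size = 114          # size of the training split, per the quoted text
num_iterations = 150_000  # 1.5e5 training iterations
```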