By Tying Embeddings You Are Assuming the Distributional Hypothesis

Authors: Francesco Bertolotti, Walter Cazzola

Venue: ICML 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "In this work, we analyze both theoretically and empirically the effect of tied input-output embeddings, a popular technique that reduces the model size while often improving training..." "Further, we complement the theoretical findings with several experiments supporting the claims." (The tying technique itself is sketched after this table.) |
| Researcher Affiliation | Academia | "Department of Computer Science, Università degli Studi di Milano, Milan, Italy." |
| Pseudocode | No | The paper presents theoretical proofs and experimental descriptions but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | "The code for reproducing the experiments is available at: https://zenodo.org/records/11103163" |
| Open Datasets | Yes | "We chose the Bookcorpus dataset (Zhu et al., 2015), a 5GB collection of English sentences extracted from existing books." |
| Dataset Splits | No | "The training split of the EXor problem is generated by considering 90% of all possible binary strings of size N." The paper defines training and test sets but does not explicitly mention a separate validation split for either the EXor problem or the Bookcorpus dataset. (A split-generation sketch follows the table.) |
| Hardware Specification | No | The paper describes the model architectures and training setups but does not specify any particular hardware, such as CPU or GPU models or memory. |
| Software Dependencies | No | "We used AdamW (Loshchilov & Hutter, 2018) (PyTorch implementation) optimizer with 5e-4 learning rate, 1e-1 weight decay. Further, we employ a cosine learning rate scheduler (PyTorch implementation) with 1e-5 minimum learning rate and 1e4 iteration cycle." |
| Experiment Setup | Yes | "The model architecture is similar to the one assumed in Theorem 4.2. We used a single layer, single head, with gelu (Hendrycks & Gimpel, 2016) activation, Transformer Encoder (Vaswani et al., 2017) architecture from the PyTorch API. ... We used AdamW (Loshchilov & Hutter, 2018) (PyTorch implementation) optimizer with 5e-4 learning rate, 1e-1 weight decay. Further, we employ a cosine learning rate scheduler (PyTorch implementation) with 1e-5 minimum learning rate and 1e4 iteration cycle. The batch size is 114 (size of the training split). We train for 1.5e5 iterations." (A hedged training-setup sketch follows the table.) |
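For readers unfamiliar with the technique named in the Research Type row, here is a minimal sketch of tied input-output embeddings in PyTorch. It is illustrative only: the class name, vocabulary size, and model width are placeholders rather than the authors' code; the only part taken from the paper is the idea of reusing the input embedding matrix as the output projection.

```python
import torch
import torch.nn as nn

class TiedLM(nn.Module):
    """Minimal encoder LM with tied input/output embeddings (placeholder sizes)."""

    def __init__(self, vocab_size: int = 1000, d_model: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)             # input embedding E
        self.encoder = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=1, activation="gelu", batch_first=True
        )
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)  # output projection
        # Tying: reuse the input embedding matrix as the output projection, so each
        # logit is the dot product between a hidden state and a token embedding.
        self.lm_head.weight = self.embed.weight

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        h = self.encoder(self.embed(token_ids))                    # (batch, seq, d_model)
        return self.lm_head(h)                                     # (batch, seq, vocab)

model = TiedLM()
logits = model(torch.randint(0, 1000, (2, 16)))                    # torch.Size([2, 16, 1000])
```

Because the same matrix represents tokens on both the input and output side, the model scores a token by the similarity between the context representation and that token's embedding, which is roughly the property the paper connects to the distributional hypothesis.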
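The Dataset Splits row quotes the EXor training split as 90% of all possible binary strings of size N. The sketch below shows one way such a split could be generated; the value of N, the parity labelling, the random seed, and the function name are all assumptions, since the quoted text does not fix them.

```python
import itertools
import random

def exor_splits(n: int = 7, train_frac: float = 0.9, seed: int = 0):
    """Enumerate all binary strings of length n, label each with the XOR (parity) of its
    bits, and split them 90%/10% into train/test. n, the labelling, and the seed are
    illustrative assumptions, not values taken from the paper."""
    strings = ["".join(bits) for bits in itertools.product("01", repeat=n)]
    pairs = [(s, str(sum(map(int, s)) % 2)) for s in strings]
    random.Random(seed).shuffle(pairs)
    cut = int(train_frac * len(pairs))
    return pairs[:cut], pairs[cut:]

train, test = exor_splits()
print(len(train), len(test))  # 115 and 13 here; purely a consequence of the placeholder n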
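The Experiment Setup row lists concrete hyperparameters, so a hedged PyTorch sketch can assemble them into one place. The optimizer and scheduler values (5e-4 learning rate, 1e-1 weight decay, cosine schedule with 1e-5 minimum and a 1e4-iteration cycle), the single-layer single-head gelu encoder, the batch size of 114, and the 1.5e5 iterations come from the quoted text; the vocabulary size, model width, data loader, and the choice of CosineAnnealingWarmRestarts as the specific cosine scheduler are assumptions.

```python
import torch
import torch.nn as nn

# Placeholder sizes: the quoted setup does not fix the vocabulary or model width here.
VOCAB_SIZE, D_MODEL = 1000, 64

embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=1,           # single layer, single head
                               activation="gelu", batch_first=True),
    num_layers=1,
)
lm_head = nn.Linear(D_MODEL, VOCAB_SIZE, bias=False)
# Weight tying, as in the earlier sketch, would be: lm_head.weight = embed.weight

params = list(embed.parameters()) + list(encoder.parameters()) + list(lm_head.parameters())

# Quoted settings: AdamW, 5e-4 learning rate, 1e-1 weight decay.
optimizer = torch.optim.AdamW(params, lr=5e-4, weight_decay=1e-1)
# Quoted settings: cosine schedule, 1e-5 minimum LR, 1e4-iteration cycle
# (warm restarts is one possible reading of "cycle").
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=int(1e4), eta_min=1e-5
)

# Skeleton of the quoted loop: batch size 114 (the training split), 1.5e5 iterations.
# `get_batch` is a hypothetical loader, not something taken from the paper's code.
# for step in range(int(1.5e5)):
#     tokens, targets = get_batch(batch_size=114)
#     logits = lm_head(encoder(embed(tokens)))
#     loss = nn.functional.cross_entropy(logits.view(-1, VOCAB_SIZE), targets.view(-1))
#     optimizer.zero_grad(); loss.backward(); optimizer.step(); scheduler.step()
```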