By Tying Embeddings You Are Assuming the Distributional Hypothesis
Authors: Francesco Bertolotti, Walter Cazzola
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we analyze both theoretically and empirically the effect of tied input-output embeddings, a popular technique that reduces the model size while often improving training... Further, we complement the theoretical findings with several experiments supporting the claims. |
| Researcher Affiliation | Academia | Department of Computer Science, Università degli Studi di Milano, Milan, Italy. |
| Pseudocode | No | The paper presents theoretical proofs and experimental descriptions but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code for reproducing the experiments is available at: https://zenodo.org/records/11103163 |
| Open Datasets | Yes | We chose the Bookcorpus dataset (Zhu et al., 2015), a 5GB collection of English sentences extracted from existing books. |
| Dataset Splits | No | The training split of the EXor problem is generated by considering 90% of all possible binary strings of size N (sketched below the table). The paper defines training and test sets but does not explicitly mention a separate validation split for either the EXor problem or the Bookcorpus dataset. |
| Hardware Specification | No | The paper describes the model architectures and training setups, but does not specify any particular hardware components like CPU, GPU models, or memory. |
| Software Dependencies | No | We used AdamW (Loshchilov & Hutter, 2018) (PyTorch implementation) optimizer with 5e-4 learning rate, 1e-1 weight decay. Further, we employ a cosine learning rate scheduler (PyTorch implementation) with 1e-5 minimum learning rate and 1e4 iteration cycle. |
| Experiment Setup | Yes | The model architecture is similar to the one assumed in Theorem 4.2. We used a single layer, single head, with gelu (Hendrycks & Gimpel, 2016) activation, Transformer Encoder (Vaswani et al., 2017) architecture from the PyTorch API. ... We used AdamW (Loshchilov & Hutter, 2018) (PyTorch implementation) optimizer with 5e-4 learning rate, 1e-1 weight decay. Further, we employ a cosine learning rate scheduler (PyTorch implementation) with 1e-5 minimum learning rate and 1e4 iteration cycle. The batch size is 114 (size of the training split). We train for 1.5e5 iterations. (A sketch of this setup appears below the table.) |
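
The Dataset Splits row describes taking 90% of all binary strings of length N as the EXor training split. Below is a minimal sketch of that split; the value of N, the random seed, and the use of a uniformly random selection are assumptions, since the quoted text only states that 90% of all possible strings form the training set.

```python
from itertools import product
import random

# Hedged sketch of the EXor 90/10 split; N and the random selection are assumptions.
N = 7  # hypothetical string length, not stated in the quoted text
all_strings = ["".join(bits) for bits in product("01", repeat=N)]  # 2**N strings

random.seed(0)
random.shuffle(all_strings)
cut = int(0.9 * len(all_strings))          # 90% for training
train_split, test_split = all_strings[:cut], all_strings[cut:]
```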
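
The Software Dependencies and Experiment Setup rows together describe the model and optimizer configuration. The following is a minimal PyTorch sketch of that setup; the embedding width, feed-forward size, vocabulary size, and the exact cosine-scheduler variant (CosineAnnealingLR here) are assumptions, while the learning rate, weight decay, scheduler floor and cycle length, batch size, and iteration count are taken from the quoted text.

```python
import torch
import torch.nn as nn

d_model = 64     # assumed embedding width (not stated in the quoted text)
vocab_size = 3   # assumed token inventory for the EXor problem

embedding = nn.Embedding(vocab_size, d_model)
encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model,
    nhead=1,                      # single head, as described
    dim_feedforward=4 * d_model,  # assumed
    activation="gelu",            # gelu activation, as described
    batch_first=True,
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=1)  # single layer

# A tied output projection would reuse embedding.weight; omitted here.
params = list(embedding.parameters()) + list(encoder.parameters())
optimizer = torch.optim.AdamW(params, lr=5e-4, weight_decay=1e-1)

# Cosine schedule with a 1e4-iteration cycle and a 1e-5 floor; the paper does not
# name the scheduler class, so CosineAnnealingLR is one plausible reading.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=10_000, eta_min=1e-5
)

batch_size = 114          # size of the training split, per the quoted text
num_iterations = 150_000  # 1.5e5 training iterations
```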