Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
By Tying Embeddings You Are Assuming the Distributional Hypothesis
Authors: Francesco Bertolotti, Walter Cazzola
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we analyze both theoretically and empirically the effect of tied input-output embeddings, a popular technique that reduces the model size while often improving training... Further, we complement the theoretical findings with several experiments supporting the claims. |
| Researcher Affiliation | Academia | Department of Computer Science, Università degli Studi di Milano, Milan, Italy. |
| Pseudocode | No | The paper presents theoretical proofs and experimental descriptions but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code for reproducing the experiments is available at: https://zenodo.org/records/11103163 |
| Open Datasets | Yes | We chose the Bookcorpus dataset (Zhu et al., 2015), a 5GB collection of English sentences extracted from existing books. |
| Dataset Splits | No | The training split of the EXor problem is generated by considering 90% of all possible binary strings of size N. The paper defines training and test sets but does not explicitly mention a separate validation split for either the EXor problem or the Bookcorpus dataset. |
| Hardware Specification | No | The paper describes the model architectures and training setups, but does not specify any particular hardware components like CPU, GPU models, or memory. |
| Software Dependencies | No | We used AdamW (Loshchilov & Hutter, 2018) (PyTorch implementation) optimizer with 5e-4 learning rate, 1e-1 weight decay. Further, we employ a cosine learning rate scheduler (PyTorch implementation) with 1e-5 minimum learning rate and 1e4 iteration cycle. |
| Experiment Setup | Yes | The model architecture is similar to the one assumed in Theorem 4.2. We used a single layer, single head, with gelu (Hendrycks & Gimpel, 2016) activation, Transformer Encoder (Vaswani et al., 2017) architecture from the PyTorch API. ... We used AdamW (Loshchilov & Hutter, 2018) (PyTorch implementation) optimizer with 5e-4 learning rate, 1e-1 weight decay. Further, we employ a cosine learning rate scheduler (PyTorch implementation) with 1e-5 minimum learning rate and 1e4 iteration cycle. The batch size is 114 (size of the training split). We train for 1.5e5 iterations. |
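
The learning-rate schedule quoted in the Experiment Setup row can be illustrated with a small sketch. It uses only the quoted values (5e-4 learning rate, 1e-5 minimum learning rate, 1e4 iteration cycle) and the standard closed-form cosine annealing formula. The function name `cosine_lr` and the reading of "1e4 iteration cycle" as the number of steps needed to decay from the base rate to the minimum are assumptions made for illustration; the paper's released code may configure the PyTorch scheduler differently.

```python
import math


def cosine_lr(step, base_lr=5e-4, min_lr=1e-5, cycle=10_000):
    """Closed-form cosine annealing from base_lr to min_lr.

    Assumption: `cycle` (1e4 in the quoted setup) is the number of steps
    taken to decay from base_lr to min_lr. This is one reading of the
    paper's "1e4 iteration cycle", not a confirmed setting.
    """
    progress = step / cycle
    # Cosine term goes from 1 (step 0) to -1 (step == cycle).
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Print the learning rate at a few points of the first cycle.
    for step in (0, 2_500, 5_000, 7_500, 10_000):
        print(step, cosine_lr(step))
```

Under this reading, the rate starts at the base value, passes through the midpoint of the two rates halfway through the cycle, and reaches the minimum at step 1e4. How the schedule behaves over the remaining iterations of the 1.5e5-iteration run depends on the scheduler actually used, which the quoted text does not specify.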