Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Linearity of Relation Decoding in Transformer Language Models
Authors: Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, David Bau
ICLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We now empirically evaluate how well LREs, estimated using the approach from Section 3, can approximate relation decoding in LMs for a variety of different relations. In all of our experiments, we study autoregressive language models. |
| Researcher Affiliation | Academia | 1Massachusetts Institute of Technology, 2Northeastern University, 3Technion IIT, 4Harvard University. |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and dataset are available at lre.baulab.info. |
| Open Datasets | Yes | To support our evaluation, we manually curate a dataset of 47 relations spanning four categories: factual associations, commonsense knowledge, implicit biases, and linguistic knowledge. Each relation is associated with a number of example subject-object pairs (s_i, o_i), as well as a prompt template that leads the language model to predict o when s is filled in (e.g., [s] plays the). When evaluating each model, we filter the dataset to examples where the language model correctly predicts the object o given the prompt. Table 1 summarizes the dataset and filtering results. Further details on dataset construction are in Appendix A. The code and dataset are available at lre.baulab.info. |
| Dataset Splits | No | The paper mentions evaluating on "new subjects s" and selecting hyperparameters using "grid-search," which implies an internal data split for validation. However, it does not explicitly state the proportions or counts for train/validation/test splits of the dataset. |
| Hardware Specification | Yes | We ran all experiments on workstations with 80GB NVIDIA A100 GPUs or 48GB A6000 GPUs using Hugging Face Transformers (Wolf et al., 2019) implemented in PyTorch (Paszke et al., 2019). |
| Software Dependencies | No | The paper mentions "Hugging Face Transformers (Wolf et al., 2019) implemented in PyTorch (Paszke et al., 2019)". However, it does not specify version numbers for these software components, which is necessary for reproducibility. |
| Experiment Setup | Yes | We estimate LREs for each relation using the method discussed in Section 3 with n = 8. While calculating W and b for an individual example, we prepend the remaining n - 1 training examples as few-shot examples so that the LM is more likely to generate the answer o given a subject s under the relation r over other plausible tokens. We fix the scalar term β (from Equation (4)) once per LM. We also have two hyperparameters specific to each relation r: ℓ_r, the layer after which s is to be extracted; and ρ_r, the rank of the inverse of W (to check causality as in Equation (7)). We select these hyperparameters with grid-search; see Appendix E for details. For each relation, we report average results over 24 trials with distinct sets of n examples randomly drawn from the dataset. |
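The estimation procedure quoted above (an affine map o ≈ β W s + b, with W and b averaged over n example subjects) can be sketched in minimal form. Everything below is illustrative, not the authors' released code: a toy smooth function `f` stands in for the LM's relation decoding, a finite-difference Jacobian replaces autograd, and β is fixed to 1; the names `lre`, `jacobian`, and the constants are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
A = 0.1 * rng.standard_normal((d, d))  # small weights -> near-linear regime
c = rng.standard_normal(d)

def f(s):
    """Toy stand-in for the LM's mapping from subject to object states."""
    return np.tanh(A @ s) + c

def jacobian(func, s, eps=1e-5):
    """Central finite-difference Jacobian of func at s."""
    out_dim = func(s).shape[0]
    J = np.zeros((out_dim, s.shape[0]))
    for i in range(s.shape[0]):
        e = np.zeros(s.shape[0])
        e[i] = eps
        J[:, i] = (func(s + e) - func(s - e)) / (2 * eps)
    return J

# Estimate the LRE from n = 8 example subjects: W is the mean Jacobian,
# b the mean first-order residual (a simplification of the paper's
# Section 3 estimator; beta is fixed to 1 for the sketch).
n = 8
subjects = [rng.standard_normal(d) for _ in range(n)]
W = np.mean([jacobian(f, s) for s in subjects], axis=0)
b = np.mean([f(s) - W @ s for s in subjects], axis=0)

def lre(s, beta=1.0):
    """Affine approximation of the relation decoder."""
    return beta * (W @ s) + b

# Faithfulness check on a held-out subject: the affine map should track
# the nonlinear decoder closely when the decoder is near-linear.
s_new = rng.standard_normal(d)
err = np.linalg.norm(lre(s_new) - f(s_new)) / np.linalg.norm(f(s_new))
```

Here `err` plays the role of a faithfulness measure on an unseen subject; in the paper this is evaluated over 24 trials per relation with distinct training sets, whereas the sketch uses a single draw.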