Connecting Pre-trained Language Model and Downstream Task via Properties of Representation
Authors: Chenwei Wu, Holden Lee, Rong Ge
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose and empirically validate the existence of an anchor vector in the representation space, and show that this assumption, together with properties of the downstream task, guarantees performance transfer. ... In Sections 4.2 and F, this is not true for recent large-scale pre-trained language models. ... Figure 1a plots the mean squared approximation error of the log bulk partition function. ... More experiments and discussions are provided in Section F. |
| Researcher Affiliation | Academia | Chenwei Wu (Duke University, cwwu@cs.duke.edu); Holden Lee (Johns Hopkins University, hlee283@jhu.edu); Rong Ge (Duke University, rongge@cs.duke.edu) |
| Pseudocode | No | The paper includes mathematical derivations and proofs (e.g., Theorem 1, Theorem 2, Lemma 1), but it does not contain any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any statements about releasing open-source code or provide links to a code repository. |
| Open Datasets | Yes | We use WikiText-2 [Merity et al., 2016] as the text corpus. |
| Dataset Splits | No | The paper mentions using 'the first 1/4 of WikiText-2 [Merity et al., 2016] as the input text' and computing perplexities, but it does not specify explicit train/validation/test splits for reproducibility. |
| Hardware Specification | No | The paper mentions using 'large-scale language models' like GPT-2 and OPT, but it does not specify any particular hardware (e.g., GPU models, CPU types, or memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'GPT-2 [Radford et al., 2019]' and 'OPT [Zhang et al., 2022]' models. However, it does not specify any programming languages, libraries, or software dependencies with version numbers used for the experimental setup. |
| Experiment Setup | Yes | The hidden representations we use in this experiment are the last hidden states of these models, i.e., the output of the penultimate layer. The dimension of the hidden representations ranges from 768 to 2048, and the number of tokens is about 70k. We choose the bulk words to be all the words except those having top-k probabilities and compute the optimal anchor vector using the closed-form least-squares solution. In our experiments, we use the mean squared error (MSE) to measure the approximation quality. (Hedged sketches of this setup follow the table.) |
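
The Experiment Setup row quotes the paper's use of the last hidden states of GPT-2 and OPT. A minimal sketch of how such representations and next-token logits could be pulled from a pre-trained model with the Hugging Face transformers library is below; the model name, input text, and batching are assumptions for illustration, not the authors' code or corpus pipeline.

```python
# Sketch (assumed pipeline, not the authors' code): extract last hidden states
# and next-token logits from a pre-trained GPT-2 using Hugging Face transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # the paper also evaluates larger GPT-2 and OPT variants
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "Some passage drawn from the WikiText-2 corpus ..."  # placeholder input
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Last hidden states: output of the final transformer block (the layer feeding
# the softmax head), shape (batch, sequence_length, hidden_dim).
hidden_states = outputs.hidden_states[-1]
# Next-token logits over the vocabulary, shape (batch, sequence_length, vocab_size).
logits = outputs.logits
```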
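The same row describes computing the optimal anchor vector with a closed-form least-squares solution and reporting MSE, the quantity Figure 1a (quoted in the Research Type row) plots for the log bulk partition function. The sketch below assumes the target is the log-sum-exp of the next-token logits over all words outside the top-k (the "bulk") and that it is fit as an affine function of the hidden representation; the bias term, the default value of `k`, and all identifiers are assumptions rather than the paper's exact formulation.

```python
# Sketch of the anchor-vector approximation check, under the assumptions stated above.
import numpy as np

def anchor_vector_mse(hidden_states: np.ndarray, logits: np.ndarray, k: int = 100):
    """hidden_states: (num_tokens, dim); logits: (num_tokens, vocab_size)."""
    num_tokens, _ = logits.shape

    # "Bulk" words: every word except the k with the highest probability at each position.
    topk_idx = np.argpartition(-logits, k, axis=1)[:, :k]
    bulk_mask = np.ones_like(logits, dtype=bool)
    bulk_mask[np.arange(num_tokens)[:, None], topk_idx] = False

    # Log bulk partition function per token: log sum_{w in bulk} exp(logit_w),
    # computed with the usual max-shift for numerical stability.
    bulk_logits = np.where(bulk_mask, logits, -np.inf)
    m = bulk_logits.max(axis=1, keepdims=True)
    log_bulk = (m + np.log(np.exp(bulk_logits - m).sum(axis=1, keepdims=True))).ravel()

    # Closed-form least-squares fit of log_bulk as an affine function of the
    # representation; the fitted weight vector plays the role of the anchor vector.
    X = np.hstack([hidden_states, np.ones((num_tokens, 1))])  # append a bias column
    coef, *_ = np.linalg.lstsq(X, log_bulk, rcond=None)
    anchor, bias = coef[:-1], coef[-1]

    # Mean squared approximation error of the affine fit.
    mse = float(np.mean((X @ coef - log_bulk) ** 2))
    return anchor, bias, mse
```

Given `hidden_states` and `logits` from the previous sketch, `anchor_vector_mse(hidden_states[0].numpy(), logits[0].numpy())` would return the fitted anchor vector, the bias, and the approximation MSE for that input.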