Connecting Pre-trained Language Model and Downstream Task via Properties of Representation

Authors: Chenwei Wu, Holden Lee, Rong Ge

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose and empirically validate the existence of an anchor vector in the representation space, and show that this assumption, together with properties of the downstream task, guarantees performance transfer. ... In Sections 4.2 and F, this is not true for recent large-scale pre-trained language models. ... Figure 1a plots the mean squared approximation error of the log bulk partition function. ... More experiments and discussions are provided in Section F.
Researcher Affiliation | Academia | Chenwei Wu (Duke University, cwwu@cs.duke.edu), Holden Lee (Johns Hopkins University, hlee283@jhu.edu), Rong Ge (Duke University, rongge@cs.duke.edu)
Pseudocode | No | The paper includes mathematical derivations and proofs (e.g., Theorem 1, Theorem 2, Lemma 1), but it does not contain any pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any statement about releasing open-source code, nor does it provide a link to a code repository.
Open Datasets | Yes | We use WikiText-2 [Merity et al., 2016] as the text corpus.
Dataset Splits | No | The paper mentions using 'the first 1/4 of WikiText-2 [Merity et al., 2016] as the input text' and calculating perplexities, but it does not specify explicit train/validation/test splits for reproducibility. (A perplexity sketch appears after the table.)
Hardware Specification | No | The paper mentions using 'large-scale language models' like GPT-2 and OPT, but it does not specify any particular hardware (e.g., GPU models, CPU types, or memory) used for running the experiments.
Software Dependencies | No | The paper mentions using 'GPT-2 [Radford et al., 2019]' and 'OPT [Zhang et al., 2022]' models. However, it does not specify any programming languages, libraries, or software dependencies with version numbers used for the experimental setup.
Experiment Setup | Yes | The hidden representations we use in this experiment are the last hidden states of these models, i.e., the output of the penultimate layer. The dimension of the hidden representations ranges from 768 to 2048, and the number of tokens is about 70k. We choose the bulk words to be all the words except those having top-k probabilities and compute the optimal anchor vector using the closed-form least-squares solution. In our experiments, we use the mean squared error (MSE) to measure the approximation quality. (A code sketch of this computation appears after the table.)
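
The Experiment Setup row describes a concrete computation: take the last hidden states of a pre-trained causal language model, define the bulk as all words outside the top-k, fit an anchor vector to the log bulk partition function with the closed-form least-squares solution, and score the fit with MSE. Below is a minimal sketch of that computation, assuming a Hugging Face GPT-2 checkpoint (the paper also uses OPT), an illustrative value of k, and a single placeholder input string rather than the roughly 70k WikiText-2 tokens reported in the paper; these choices are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch of the anchor-vector measurement described in the
# "Experiment Setup" row. Model choice, k, and the placeholder input are
# assumptions for illustration; they are not specified like this in the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"   # paper uses GPT-2 and OPT models (hidden size 768-2048)
k = 100               # bulk = all words except the top-k most probable (k assumed)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "Placeholder input; the paper uses WikiText-2 (about 70k tokens)."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Last hidden states: the output of the penultimate layer, i.e. the
# representations fed to the LM head. Shape: (seq_len, d).
hidden = out.hidden_states[-1].squeeze(0)
logits = out.logits.squeeze(0)            # (seq_len, vocab_size)

# Log bulk partition function per position: log-sum-exp of the logits over
# every word except the top-k.
bulk_logits = logits.clone()
topk_idx = logits.topk(k, dim=-1).indices
bulk_logits.scatter_(1, topk_idx, float("-inf"))
log_z_bulk = torch.logsumexp(bulk_logits, dim=-1)   # (seq_len,)

# Closed-form least-squares anchor vector v minimizing ||H v - log Z_bulk||^2.
# In the paper's setting, hidden states from ~70k tokens are stacked into H,
# so the system is overdetermined (unlike this toy single-sentence example).
H = hidden.double()
y = log_z_bulk.double().unsqueeze(1)
v = torch.linalg.lstsq(H, y).solution               # (d, 1)

# Mean squared approximation error, the quantity reported in Figure 1a.
mse = ((H @ v - y) ** 2).mean().item()
print(f"anchor-vector approximation MSE: {mse:.6f}")
```

Whether the anchor vector includes an intercept term, or whether output biases enter the bulk partition function, is not stated in the rows above, so the sketch omits both.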
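
The Dataset Splits row quotes the paper's use of the first 1/4 of WikiText-2 as input text together with perplexity measurements. The following sketch shows one way such a perplexity could be computed; the choice of the train split, the character-level truncation to one quarter, and the non-overlapping chunking are assumptions made here, since (per the row above) the paper does not specify them.

```python
# Sketch of a perplexity measurement on the first 1/4 of WikiText-2.
# Split choice, truncation by characters, and chunking are assumptions.
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
text = "\n\n".join(raw["text"])
text = text[: len(text) // 4]          # "first 1/4" interpreted by characters

ids = tokenizer(text, return_tensors="pt").input_ids[0]
max_len = model.config.n_positions     # 1024 for GPT-2; OPT uses max_position_embeddings
nlls, n_tokens = [], 0

with torch.no_grad():
    for start in range(0, ids.size(0) - 1, max_len):
        chunk = ids[start : start + max_len + 1]   # inputs plus next-token targets
        if chunk.size(0) < 2:
            break
        out = model(chunk[:-1].unsqueeze(0))
        loss = torch.nn.functional.cross_entropy(
            out.logits[0], chunk[1:], reduction="sum"
        )
        nlls.append(loss)
        n_tokens += chunk.size(0) - 1

print("perplexity:", math.exp(torch.stack(nlls).sum().item() / n_tokens))
```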