Language Through a Prism: A Spectral Approach for Multiscale Language Representations

Authors: Alex Tamkin, Dan Jurafsky, Noah Goodman

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that signal processing provides a natural framework for separating structure across scales, enabling us to 1) disentangle scale-specific information in existing embeddings and 2) train models to learn more about particular scales. Concretely, we apply spectral filters to the activations of a neuron across an input, producing filtered embeddings that perform well on part of speech tagging (word-level), dialog speech acts classification (utterance-level), or topic classification (document-level), while performing poorly on the other tasks. We evaluate the content of these filtered representations through probing experiments [39, 40, 41]. For each dataset below, we encode each training example with a fixed, pretrained BERT-Base cased model [26]. We then apply a spectral filter along each dimension and train a softmax classifier to perform a particular task using each filtered representation. (Encoding and filtering sketches follow the table.)
Researcher Affiliation | Academia | Alex Tamkin (Stanford University), Dan Jurafsky (Stanford University), Noah Goodman (Stanford University)
Pseudocode | Yes | Figure 3(b): "Spectral filters are simple to incorporate into existing models. Python-style code for a low-pass filter over representations." (A hedged re-creation of this filter follows the table.)
Open Source Code | No | The paper references an external PyTorch library, "https://github.com/zh217/torch-dct" (footnote 8). However, it does not state that the authors' own implementation of the described methodology is open source, nor does it provide a link to one.
Open Datasets | Yes | We use the Penn Treebank dataset [42]. We use the Switchboard Dialog Speech Acts corpus [43, 44, 45]. We use the 20 Newsgroups dataset [46]. We train on the WikiText-103 dataset [48] for 50k steps at a batch size of 8 with default parameters for Adam. (A dataset-loading sketch follows the table.)
Dataset Splits | No | The paper mentions using a validation set for early stopping: "We use early stopping with a patience of one, decaying the learning rate by a factor of 2 when successive epochs do not produce a decrease in validation loss." However, it does not provide specific details about the size or percentage of this validation split for any of the datasets used.
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as GPU models, CPU types, or cloud computing instance specifications.
Software Dependencies | No | The paper mentions using "an external PyTorch library for computing and backpropagating through the DCT and IDCT" (footnote 8, referencing https://github.com/zh217/torch-dct). However, it does not specify version numbers for PyTorch or any other key software components used in the experiments.
Experiment Setup | Yes | We train our probing models for a maximum of 30 epochs, using the Adam optimizer [47] with default parameters. We train on the WikiText-103 dataset [48] for 50k steps at a batch size of 8 with default parameters for Adam. (A probe-training sketch follows the table.)
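
The probing pipeline quoted in the Research Type row starts from frozen BERT-Base cased activations. Below is a minimal sketch of that encoding step, assuming the Hugging Face transformers package; the paper specifies the pretrained model but not the library used to load it.

```python
# Minimal sketch of encoding an example with a fixed, pretrained BERT-Base
# cased model. Using the Hugging Face `transformers` package is an assumption;
# the paper names the model, not the loading library.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoder = AutoModel.from_pretrained("bert-base-cased")
encoder.eval()  # the encoder stays fixed; only the probe is trained

@torch.no_grad()
def encode(text: str) -> torch.Tensor:
    """Return per-token activations of shape (seq_len, hidden_dim=768)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    return encoder(**inputs).last_hidden_state.squeeze(0)
```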
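
The Pseudocode row points to Figure 3(b), which gives Python-style code for a low-pass filter over representations, and footnote 8 cites torch-dct (https://github.com/zh217/torch-dct) for differentiable DCTs. The sketch below is a hedged re-creation using that library, not the authors' released code; the keep_fraction cutoff and the (seq_len, hidden_dim) layout are illustrative assumptions.

```python
# Hedged re-creation of a low-pass spectral filter over token representations,
# applied independently to each hidden dimension along the sequence axis.
import torch
import torch_dct  # pip install torch-dct

def low_pass_filter(embeddings: torch.Tensor, keep_fraction: float = 0.1) -> torch.Tensor:
    """Low-pass filter (seq_len, hidden_dim) activations along the sequence dimension."""
    # torch_dct transforms operate on the last dimension, so put the
    # sequence dimension last: (hidden_dim, seq_len).
    x = embeddings.transpose(0, 1)
    freq = torch_dct.dct(x, norm="ortho")           # DCT-II per hidden dimension
    cutoff = max(1, int(keep_fraction * freq.shape[-1]))
    freq[..., cutoff:] = 0.0                        # zero out the high-frequency bins
    filtered = torch_dct.idct(freq, norm="ortho")   # back to the token domain
    return filtered.transpose(0, 1)                 # (seq_len, hidden_dim)
```

A band-pass variant for intermediate scales would instead zero out DCT coefficients outside a chosen frequency band rather than only the high frequencies.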
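
The Dataset Splits and Experiment Setup rows together describe the probe-training recipe: a softmax classifier over the filtered features, Adam with default parameters, at most 30 epochs, and early stopping with a patience of one plus a factor-of-2 learning-rate decay when validation loss fails to improve. The sketch below is one plausible reading of that description; the data loaders, the _val_loss helper, and the exact interplay of patience and decay are assumptions.

```python
import torch
import torch.nn as nn

def _val_loss(probe, val_loader, loss_fn):
    """Average validation loss of the probe (hypothetical helper)."""
    probe.eval()
    total, count = 0.0, 0
    with torch.no_grad():
        for features, labels in val_loader:
            total += loss_fn(probe(features), labels).item() * labels.size(0)
            count += labels.size(0)
    return total / max(count, 1)

def train_probe(train_loader, val_loader, feature_dim, num_classes,
                max_epochs=30, patience=1, decay_factor=2.0):
    """Linear softmax probe trained with Adam at PyTorch defaults."""
    probe = nn.Linear(feature_dim, num_classes)
    optimizer = torch.optim.Adam(probe.parameters())  # defaults: lr=1e-3, betas=(0.9, 0.999)
    loss_fn = nn.CrossEntropyLoss()

    best_val, bad_epochs = float("inf"), 0
    for epoch in range(max_epochs):
        probe.train()
        for features, labels in train_loader:
            optimizer.zero_grad()
            loss_fn(probe(features), labels).backward()
            optimizer.step()

        val_loss = _val_loss(probe, val_loader, loss_fn)
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
        else:
            bad_epochs += 1
            for group in optimizer.param_groups:   # decay the learning rate by a factor of 2
                group["lr"] /= decay_factor
            if bad_epochs > patience:              # one reading of "patience of one":
                break                              # decay on the first bad epoch, stop on the second
    return probe
```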
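
Two of the corpora in the Open Datasets row are freely downloadable; a hedged way to fetch them is sketched below. Using scikit-learn for 20 Newsgroups and the Hugging Face datasets package for WikiText-103 are assumptions, since the paper does not say how the data were obtained. Penn Treebank and Switchboard are typically licensed through the LDC and are omitted here.

```python
# Hedged loaders for the two publicly hosted corpora; the paper does not
# specify how the datasets were obtained or preprocessed.
from sklearn.datasets import fetch_20newsgroups  # topic classification (document-level)
from datasets import load_dataset                # Hugging Face `datasets` package

# 20 Newsgroups: documents paired with one of 20 topic labels.
newsgroups_train = fetch_20newsgroups(subset="train")
print(len(newsgroups_train.data), "training documents")

# WikiText-103: raw text used for the 50k-step training runs quoted above.
wikitext = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
print(wikitext[0]["text"][:80])
```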