Language Through a Prism: A Spectral Approach for Multiscale Language Representations
Authors: Alex Tamkin, Dan Jurafsky, Noah Goodman
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that signal processing provides a natural framework for separating structure across scales, enabling us to 1) disentangle scale-specific information in existing embeddings and 2) train models to learn more about particular scales. Concretely, we apply spectral filters to the activations of a neuron across an input, producing filtered embeddings that perform well on part of speech tagging (word-level), dialog speech acts classification (utterance-level), or topic classification (document-level), while performing poorly on the other tasks. We evaluate the content of these filtered representations through probing experiments [39, 40, 41]. For each dataset below, we encode each training example with a fixed, pretrained BERT-Base cased model [26]. We then apply a spectral filter along each dimension and train a softmax classifier to perform a particular task using each filtered representation. |
| Researcher Affiliation | Academia | Alex Tamkin Stanford University Dan Jurafsky Stanford University Noah Goodman Stanford University |
| Pseudocode | Yes | Figure 3: (b) Spectral filters are simple to incorporate into existing models. Python-style code for a low-pass filter over representations. (An illustrative sketch of such a filter appears below the table.) |
| Open Source Code | No | The paper references an external PyTorch library: "https://github.com/zh217/torch-dct" (footnote 8). However, it does not state that the authors' own implementation of the described method is open source, nor does it provide a link to such code. |
| Open Datasets | Yes | We use the Penn Treebank dataset [42]. We use the Switchboard Dialog Speech Acts corpus [43, 44, 45]. We use the 20 Newsgroups dataset [46]. We train on the WikiText-103 dataset [48] for 50k steps at a batch size of 8 with default parameters for Adam. |
| Dataset Splits | No | The paper mentions using a validation set for early stopping: "We use early stopping with a patience of one, decaying the learning rate by a factor of 2 when successive epochs do not produce a decrease in validation loss." However, it does not provide specific details about the size or percentage of this validation split for any of the datasets used. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as GPU models, CPU types, or cloud computing instance specifications. |
| Software Dependencies | No | The paper mentions using "an external PyTorch library for computing and backpropagating through the DCT and IDCT" (footnote 8, referencing https://github.com/zh217/torch-dct). However, it does not specify version numbers for PyTorch or any other key software components used in the experiments. |
| Experiment Setup | Yes | We train our probing models for a maximum of 30 epochs, using the Adam optimizer [47] with default parameters. We train on the WikiText-103 dataset [48] for 50k steps at a batch size of 8 with default parameters for Adam. (A sketch of this probing setup appears below the table.) |
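
The low-pass filter referenced in the Pseudocode row (Figure 3(b)) can be reconstructed from the paper's description: take a DCT of each neuron's activations across the tokens of an input, zero out the high-frequency components, and invert the transform. The sketch below is a minimal illustration assuming the torch-dct library cited in footnote 8; the function and variable names (`low_pass_filter`, `keep`, `reps`) are illustrative and not the authors' released code.

```python
# Hedged sketch of a low-pass spectral filter over token representations,
# in the spirit of the paper's Figure 3(b). Uses the torch-dct library the
# paper cites (https://github.com/zh217/torch-dct).
import torch
import torch_dct  # pip install torch-dct


def low_pass_filter(embeddings: torch.Tensor, keep: int) -> torch.Tensor:
    """Keep only the `keep` lowest-frequency DCT components along the sequence axis.

    embeddings: (seq_len, hidden_dim) activations for one input, e.g. from BERT-Base.
    keep:       number of low-frequency components retained per hidden dimension.
    """
    # torch_dct.dct operates over the last dimension, so move the sequence
    # axis there: (hidden_dim, seq_len).
    x = embeddings.transpose(0, 1)
    freq = torch_dct.dct(x, norm='ortho')                  # DCT per neuron across tokens
    mask = torch.zeros_like(freq)
    mask[:, :keep] = 1.0                                   # zero out high-frequency bands
    filtered = torch_dct.idct(freq * mask, norm='ortho')   # back to token space
    return filtered.transpose(0, 1)                        # (seq_len, hidden_dim)


# Example: a BERT-style sequence of 128 tokens with 768-dimensional activations,
# keeping only the 4 lowest-frequency components (roughly document-scale content).
reps = torch.randn(128, 768)
doc_scale = low_pass_filter(reps, keep=4)
```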
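
The Experiment Setup and Dataset Splits rows together describe the probe training recipe: a softmax classifier over frozen, filtered features, Adam with default parameters, at most 30 epochs, early stopping with a patience of one, and the learning rate halved when validation loss fails to decrease. Below is a minimal sketch of that loop under the assumption that the filtered embeddings are already computed and wrapped in standard PyTorch data loaders; the function name and loader variables are hypothetical, not taken from the paper.

```python
# Hedged sketch of the probing setup: a softmax (linear) classifier trained on
# frozen, filtered features with default Adam, at most 30 epochs, patience-1
# early stopping, and a 2x learning-rate decay when validation loss stalls.
import torch
import torch.nn as nn


def train_probe(train_loader, val_loader, feat_dim, n_classes, max_epochs=30):
    probe = nn.Linear(feat_dim, n_classes)       # softmax probe via CrossEntropyLoss
    opt = torch.optim.Adam(probe.parameters())   # default Adam hyperparameters
    loss_fn = nn.CrossEntropyLoss()
    best_val, patience = float('inf'), 1

    for epoch in range(max_epochs):
        probe.train()
        for feats, labels in train_loader:       # feats: filtered embeddings
            opt.zero_grad()
            loss_fn(probe(feats), labels).backward()
            opt.step()

        probe.eval()
        with torch.no_grad():
            val = sum(loss_fn(probe(f), y).item() for f, y in val_loader)

        if val < best_val:
            best_val, patience = val, 1          # improvement: reset patience
        elif patience > 0:
            patience -= 1
            for g in opt.param_groups:           # decay learning rate by a factor of 2
                g['lr'] /= 2
        else:
            break                                # early stop after repeated stalls
    return probe
```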