Mapping the Timescale Organization of Neural Language Models
Authors: Hsiang-Yun Sherry Chien, Jinhan Zhang, Christopher Honey
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We applied tools developed in neuroscience to map the processing timescales of individual units within a word-level LSTM language model. This timescale-mapping method assigned long timescales to units previously found to track long-range syntactic dependencies. Additionally, the mapping revealed a small subset of the network (less than 15% of units) with long timescales and whose function had not previously been explored. We next probed the functional organization of the network by examining the relationship between the processing timescale of units and their network connectivity. We identified two classes of long-timescale units: controller units composed a densely interconnected subnetwork and strongly projected to the rest of the network, while integrator units showed the longest timescales in the network, and expressed projection profiles closer to the mean projection profile. Ablating integrator and controller units affected model performance at different positions within a sentence, suggesting distinctive functions of these two sets of units. Finally, we tested the generalization of these results to a character-level LSTM model and models with different architectures. In summary, we demonstrated a model-free technique for mapping the timescale organization in recurrent neural networks, and we applied this method to reveal the timescale and functional organization of neural language models. |
| Researcher Affiliation | Academia | Hsiang-Yun Sherry Chien, Jinhan Zhang & Christopher J. Honey, Department of Psychological and Brain Sciences, Johns Hopkins University, Baltimore, MD 21218, USA. {sherry.chien,jzhan205,chris.honey}@jhu.edu |
| Pseudocode | No | The paper only describes steps in regular paragraph text without structured formatting. |
| Open Source Code | Yes | The code and dataset to reproduce the experiment can be found at https://github.com/sherrychien/LSTM_timescales |
| Open Datasets | Yes | We evaluated the internal representations generated by a pre-trained word-level LSTM language model (WLSTM, Gulordava et al., 2018) as well as a pre-trained character-level LSTM model (CLSTM, Hahn & Baroni, 2019) as they processed sentences sampled from the 427804-word (1965719-character) novel corpus: Anna Karenina by Leo Tolstoy (Tolstoy, 2016), translated from Russian to English by Constance Garnett. |
| Dataset Splits | No | The paper uses pre-trained models and evaluates on specified corpora and test sets, but it does not provide explicit training/validation/test dataset splits (percentages or counts) for the data used in their experiments, nor does it specify how the 'Anna Karenina' corpus was split by them for their specific analysis. |
| Hardware Specification | No | The paper does not specify any hardware details such as GPU models, CPU types, or memory used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies, libraries, or frameworks used in the experiments. |
| Experiment Setup | Yes | As far as possible, we applied similar parameters in the GRU as were used for the LSTM by Gulordava et al. (2018): the same Wikipedia training corpus, the same loss function (i.e. cross-entropy loss), and the same hyperparameters, except for a learning rate initialized to 0.1, which we found better suited for training the GRU. The GRU model also had two layers, with 650 hidden units in each layer. |
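
The Research Type row above summarizes the core procedure: assign each recurrent unit a processing timescale, then probe long-timescale unit groups (e.g. by ablation). The sketch below illustrates only the per-unit timescale assignment, and it substitutes a simple autocorrelation criterion for the paper's own context-manipulation procedure; the `hidden_states` array, the 1/e threshold, and the synthetic toy data are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the paper's exact method): assign each hidden unit a
# "timescale", defined here as the first lag at which its autocorrelation
# over a long text drops below 1/e.
import numpy as np

def unit_timescales(hidden_states, max_lag=200):
    """hidden_states: array of shape (T, n_units) collected while a network
    processes a long text. Returns one timescale (in timesteps) per unit."""
    x = hidden_states - hidden_states.mean(axis=0)
    var = (x ** 2).mean(axis=0) + 1e-12
    acf = np.stack([(x[:-k] * x[k:]).mean(axis=0) / var
                    for k in range(1, max_lag + 1)])     # (max_lag, n_units)
    below = acf < np.exp(-1.0)
    # argmax finds the first True along the lag axis; fall back to max_lag
    return np.where(below.any(axis=0), below.argmax(axis=0) + 1, max_lag)

# Toy check: synthetic "units" with known timescales of 2, 10, and 40 steps
rng = np.random.default_rng(0)
taus_true = np.array([2.0, 10.0, 40.0])
T = 20000
h = np.zeros((T, taus_true.size))
for t in range(1, T):
    h[t] = np.exp(-1.0 / taus_true) * h[t - 1] + rng.standard_normal(taus_true.size)
print(unit_timescales(h))   # lags close to the true timescales
```

In the paper, the resulting per-unit timescales are then related to network connectivity and to ablation effects at different sentence positions; those steps are not sketched here.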
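The Experiment Setup row specifies a two-layer GRU with 650 hidden units per layer, trained with cross-entropy loss and a learning rate initialized to 0.1. Below is a minimal PyTorch sketch of such a configuration; the embedding size, vocabulary size, batch and sequence shapes, and the choice of plain SGD are assumptions not stated in the excerpt.

```python
# Hedged sketch of a two-layer, 650-unit GRU language model; hyperparameters
# not given in the excerpt (embedding size, vocab size, optimizer) are guesses.
import torch
import torch.nn as nn

class GRULanguageModel(nn.Module):
    def __init__(self, vocab_size, emb_size=650, hidden_size=650, n_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.gru = nn.GRU(emb_size, hidden_size, n_layers, batch_first=True)
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens, hidden=None):
        emb = self.embedding(tokens)               # (batch, seq, emb)
        out, hidden = self.gru(emb, hidden)        # (batch, seq, hidden)
        return self.decoder(out), hidden           # next-word logits

vocab_size = 10000                                 # placeholder vocabulary size
model = GRULanguageModel(vocab_size)               # 2 layers x 650 hidden units
criterion = nn.CrossEntropyLoss()                  # same loss as the LSTM
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # lr initialized to 0.1

# One illustrative training step on random token ids
tokens = torch.randint(0, vocab_size, (8, 35))     # (batch, seq_len)
logits, _ = model(tokens[:, :-1])                  # predict each next token
loss = criterion(logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
loss.backward()
optimizer.step()
```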