InversionView: A General-Purpose Method for Reading Information from Neural Activations

Authors: Xinting Huang, Madhur Panwar, Navin Goyal, Michael Hahn

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present four case studies where we investigate models ranging from small transformers to GPT-2. In these studies, we show that InversionView can reveal clear information contained in activations, including basic information about tokens appearing in the context, as well as more complex information, such as the count of certain tokens, their relative positions, and abstract knowledge about the subject. We also provide causally verified circuits to confirm the decoded information.
Researcher Affiliation | Collaboration | Xinting Huang (Saarland University, xhuang@lst.uni-saarland.de); Madhur Panwar (EPFL, madhur.panwar@epfl.ch); Navin Goyal (Microsoft Research India, navingo@microsoft.com); Michael Hahn (Saarland University, mhahn@lst.uni-saarland.de)
Pseudocode | No | Does the paper contain structured pseudocode or algorithm blocks (clearly labeled algorithm sections or code-like formatted procedures)? Answer: No
Open Source Code | Yes | Code is available at https://github.com/huangxt39/InversionView
Open Datasets | Yes | To train the decoder model, we collect text from 3 datasets, including the factual statements from COUNTERFACT [37] and BEAR [56], as well as general text from Mini Pile [30].
Dataset Splits | Yes | We created 1.56M instances and applied a 75%-25% train-test split; test set accuracy is 99.53% (Details in Appendix D).
Hardware Specification | Yes | We ran all experiments on NVIDIA A100 cards.
Software Dependencies | No | Does the paper provide specific ancillary software details (e.g., library or solver names with version numbers like Python 3.8, CPLEX 12.4) needed to replicate the experiment? Answer: No. Justification: No version information for software dependencies (e.g., PyTorch, CUDA) was found.
Experiment Setup | Yes | The model is trained with a batch size of 128 for 100 epochs, using a constant learning rate of 0.0005, weight decay of 0.01, and the AdamW [36] optimizer.
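The experiment-setup row above quotes concrete hyperparameters. A minimal PyTorch sketch of that configuration is shown below; the `nn.Linear` model is a hypothetical stand-in (the paper's decoder architecture is not specified in this report), while the batch size, epoch count, learning rate, and weight decay are taken from the quoted setup.

```python
import torch
from torch import nn

# Hypothetical stand-in model; the paper's decoder architecture is not given here.
model = nn.Linear(16, 16)

# Hyperparameters quoted from the paper's experiment setup.
BATCH_SIZE = 128
EPOCHS = 100
LR = 0.0005          # constant learning rate, i.e. no scheduler
WEIGHT_DECAY = 0.01

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=LR,
    weight_decay=WEIGHT_DECAY,
)
```

Note that because the learning rate is constant, no `torch.optim.lr_scheduler` is attached; the optimizer's initial `lr` is used for all 100 epochs.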