Thinking Like Transformers

Authors: Gail Weiss, Yoav Goldberg, Eran Yahav

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In Section 5, we show how a compiled RASP program can indeed be realised in a neural transformer (as in Figure 1), and occasionally is even the solution found by a transformer trained on the task using gradient descent (Figs 5 and 4). We evaluate the relation of RASP to transformers on three fronts: 1. its ability to upper bound the number of heads and layers required to solve a task, 2. the tightness of that bound, 3. its feasibility in a transformer, i.e., whether a sufficiently large transformer can encode a given RASP solution. The evaluation involves training several transformers.
Researcher Affiliation | Collaboration | (1) Technion, Haifa, Israel; (2) Bar Ilan University, Ramat Gan, Israel; (3) Allen Institute for AI. Correspondence to: Gail Weiss <sgailw@cs.technion.ac.il>.
Pseudocode | Yes | Figure 1: We consider double-histogram, the task of counting for each input token how many unique input tokens have the same frequency as itself... (a) shows a RASP program for this task... Figure 3: RASP program for the task shuffle-dyck-2 (balance 2 parenthesis pairs, independently of each other)... (A plain-Python illustration of the double-histogram task appears after this table.)
Open Source Code | Yes | Code: We provide a RASP read-evaluate-print-loop (REPL) in http://github.com/tech-srl/RASP, along with a RASP cheat sheet and link to replication code for our work.
Open Datasets | No | The paper defines specific tasks like "Reverse" or "Histograms" with examples, implying that data was generated for these tasks, but it does not provide concrete access information (link, DOI, formal citation) for any publicly available or open dataset used for training. It does not name standard benchmark datasets with citations.
Dataset Splits | No | The paper refers to "test accuracy" but does not provide specific training/test/validation dataset splits (e.g., percentages, sample counts, or references to predefined splits with citations).
Hardware Specification | No | The provided text excerpt does not contain any specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments. It mentions that details are relegated to Appendix A, which is not included in the provided text.
Software Dependencies | No | The provided text excerpt does not mention specific software dependencies with version numbers. It mentions that details are relegated to Appendix A, which is not included in the provided text.
Experiment Setup | No | The paper states, "We relegate the exact details of the transformers and their training to Appendix A." The provided text excerpt, which ends before Appendix A, therefore does not contain specific experimental setup details such as hyperparameters or training configurations.
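
For concreteness, here is a plain-Python sketch of what the double-histogram program quoted in the Pseudocode row computes. This is not the authors' RASP code or their compiled transformer: the helper names select, selector_width and double_histogram are hypothetical and merely mimic the RASP primitives described in the paper (select builds a boolean attention pattern; selector_width counts how many positions each query selects).

    from typing import Callable, List, Sequence

    def select(keys: Sequence, queries: Sequence, pred: Callable) -> List[List[bool]]:
        # One row per query position: which key positions does it select?
        return [[pred(k, q) for k in keys] for q in queries]

    def selector_width(sel: List[List[bool]]) -> List[int]:
        # Number of selected positions per query (analogue of RASP's selector_width).
        return [sum(row) for row in sel]

    def double_histogram(tokens: Sequence[str]) -> List[int]:
        # hist: how often each position's token occurs in the input.
        hist = selector_width(select(tokens, tokens, lambda k, q: k == q))
        # first: is this position the first occurrence of its token?
        first = [tokens.index(t) == i for i, t in enumerate(tokens)]
        # For each position, count the first occurrences whose token has the same
        # frequency, i.e. the number of unique tokens sharing that frequency.
        same_freq_first = select(list(zip(hist, first)), hist,
                                 lambda k, q: k[0] == q and k[1])
        return selector_width(same_freq_first)

    print(double_histogram(list("hello")))  # [3, 3, 1, 1, 3]

In "hello", three distinct tokens (h, e, o) occur once and one (l) occurs twice, giving the output above. In the paper, compositions of such select/aggregate operations are what RASP compiles into attention heads and layers; this sketch only reproduces the task's input/output behaviour.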