Trainable Transformer in Transformer

Authors: Abhishek Panigrahi, Sadhika Malladi, Mengzhou Xia, Sanjeev Arora

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct end-to-end experiments to validate the internal fine-tuning procedure of TINT on various language modeling and downstream tasks. We evaluate the performance of the TINTs constructed using GPT2 and OPT-125M as auxiliary models. The findings from our experiments in the language modeling and in-context learning settings confirm that fine-tuning with the simulated gradients (Section 4) still allows for effective learning in the auxiliary model.
Researcher Affiliation | Academia | Department of Computer Science, Princeton University. Correspondence to: Abhishek Panigrahi <ap34@cs.princeton.edu>, Sadhika Malladi <smalladi@cs.princeton.edu>.
Pseudocode | No | The paper describes algorithms and operations but does not present them in a formalized pseudocode or algorithm block format.
Open Source Code | No | To facilitate further work, a modular and extensible codebase for TINT is included. Notably, we instantiate TINT in a highly extensible codebase, making TINT the first such construction to undergo end-to-end evaluation.
Open Datasets | Yes | We perform language modeling experiments on WIKITEXT-103 (Merity et al., 2016). We evaluate 7 classification tasks for zero-shot and few-shot settings: SST-2 (Socher et al., 2013), MR (Pang and Lee, 2005), CR (Hu and Liu, 2004), MPQA (Wiebe et al., 2005), Amazon Polarity (Zhang et al., 2015), AGNews (Zhang et al., 2015), and Subj (Pang and Lee, 2004).
Dataset Splits | Yes | Given a batch of training datapoints ξ1, ..., ξB and a validation input ξ′, we compute and apply gradient updates on the auxiliary model θaux for timesteps t = 0, ..., N − 1. For example, given an input "Machine learning is a useful tool for solving problems.", we use the red part as the training data ξ1 and the brown part as the validation data ξ′ (a sketch of this split-and-update procedure follows the table).
Hardware Specification | Yes | All the experiments are conducted on a single A100 80G GPU.
Software Dependencies | No | The paper does not provide specific software names with version numbers for reproducibility.
Experiment Setup | Yes | Grid search is performed for each seed to determine the optimal learning rate for both the constructed models and dynamic evaluation. The learning rates considered for the descent update operations in TINT are 1e-3, 1e-4, and 1e-5. Additionally, we explore various layer-step combinations that allocate a fixed budget of one full forward pass: updating the top 3 layers for 4 steps, the top 6 layers for 2 steps, or all 12 layers for 1 step (a sketch of this grid appears after the table).
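
The Dataset Splits row describes splitting a single input into a training prefix and a validation suffix, applying a few descent updates to the auxiliary model, and then scoring the validation part. The sketch below illustrates that outer loop only; ordinary PyTorch gradients stand in for TINT's internally simulated gradient updates, and the GPT-2 checkpoint, midpoint split, SGD optimizer, and step count N are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of the split-and-update procedure from the Dataset Splits row.
# Assumption: standard PyTorch backprop is used as a stand-in for TINT's
# internally simulated gradients on the auxiliary model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # auxiliary-model family used in the paper
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Machine learning is a useful tool for solving problems."
ids = tokenizer(text, return_tensors="pt").input_ids[0]

split = len(ids) // 2                                # hypothetical split point
train_ids, valid_ids = ids[:split], ids[split:]      # training prefix / validation suffix

optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)  # one value from the learning-rate grid

N = 3                                                # descent steps t = 0, ..., N-1
for t in range(N):
    out = model(train_ids.unsqueeze(0), labels=train_ids.unsqueeze(0))
    out.loss.backward()                              # gradient computed on the training prefix only
    optimizer.step()
    optimizer.zero_grad()

with torch.no_grad():                                # score the updated model on the validation suffix
    full = torch.cat([train_ids, valid_ids]).unsqueeze(0)
    labels = full.clone()
    labels[0, :split] = -100                         # ignore the training prefix in the loss
    val_loss = model(full, labels=labels).loss
print(f"validation loss after {N} updates: {val_loss.item():.3f}")
```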
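
The Experiment Setup row describes a grid search over learning rates and over layer-step allocations whose total cost matches one full forward pass of the 12-layer auxiliary model. The snippet below enumerates that search space; the `evaluate` callable is a hypothetical placeholder for running TINT with a given configuration and returning a validation metric, and the higher-is-better convention is our assumption.

```python
# Sketch of the hyperparameter grid from the Experiment Setup row.
# evaluate(...) is a hypothetical placeholder, not the paper's code.
from itertools import product

learning_rates = [1e-3, 1e-4, 1e-5]
# (top layers updated, descent steps); each pair costs roughly one full
# forward pass through the 12-layer auxiliary model: 3*4 = 6*2 = 12*1 = 12
layer_step_budgets = [(3, 4), (6, 2), (12, 1)]

def grid_search(evaluate, seed):
    best = None
    for lr, (top_layers, steps) in product(learning_rates, layer_step_budgets):
        score = evaluate(lr=lr, top_layers=top_layers, steps=steps, seed=seed)
        if best is None or score > best[0]:
            best = (score, lr, top_layers, steps)
    return best

# Example usage with a user-supplied evaluate function:
# best_score, best_lr, best_layers, best_steps = grid_search(evaluate, seed=0)
```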