Breaking through the learning plateaus of in-context learning in Transformer

Authors: Jingwen Fu, Tao Yang, Yuwang Wang, Yan Lu, Nanning Zheng

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | By conducting meticulous and controlled experiments on synthetic tasks, we note that the persistence of learning plateaus correlates with compromised functionality of the weights component. The effectiveness of these strategies is further confirmed in natural language processing tasks.
Researcher Affiliation | Collaboration | 1 National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, and Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University; 2 Tsinghua University; 3 Microsoft Research Asia.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions leveraging the open-source Pythia 13B checkpoints, which is a third-party resource. It does not provide any statement or link indicating that the authors' own code for the described methodology is publicly available.
Open Datasets | Yes | We propose a task using the Shapes3D (Kim & Mnih, 2018) dataset for a more controllable study. The dataset is constructed based on the SST (Socher et al., 2013) dataset.
Dataset Splits | No | The paper specifies a 'training image set (80%)' and a 'test image set (20%)' for Shapes3D but does not explicitly provide percentages or counts for a distinct validation split.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, processor types, or memory amounts used for its experiments. It mentions 'computational resources' in general terms but names no specific hardware.
Software Dependencies | No | The paper mentions software components such as 'VAE', 'Adam optimizer', and 'GPT2 model' but does not specify their version numbers or any other versioned software dependencies required for replication.
Experiment Setup | Yes | We utilize a batch size of 128 and set the learning rate to 0.0001. For the SST-ICL task and the Word Selection task, the models are both trained using the AdamW optimizer with learning rate 2e-5. We choose the batch size as 64.
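
For concreteness, the split and hyperparameters reported in the Dataset Splits and Experiment Setup rows can be wired together roughly as follows. This is a minimal sketch assuming PyTorch; the data, model, and loss are hypothetical placeholders rather than the authors' code, and only the 80/20 split, batch sizes, optimizers, and learning rates are taken from the table.

```python
# Minimal sketch of the reported setup, assuming PyTorch.
# Only the split ratio, batch sizes, optimizers, and learning rates come from
# the table above; the data, model, and loss are hypothetical placeholders.
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Hypothetical stand-in data: 1,000 samples with 16 features and 4 classes.
data = TensorDataset(torch.randn(1000, 16), torch.randint(0, 4, (1000,)))

# Dataset Splits row: 80% training set, 20% test set, no separate validation set.
train_set, test_set = random_split(data, [800, 200])

# Experiment Setup row (synthetic task): batch size 128, learning rate 0.0001.
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)
model = torch.nn.Sequential(
    torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 4)
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Experiment Setup row (SST-ICL / Word Selection): AdamW, lr 2e-5, batch size 64.
# nlp_loader = DataLoader(nlp_dataset, batch_size=64, shuffle=True)
# nlp_optimizer = torch.optim.AdamW(gpt2_model.parameters(), lr=2e-5)

criterion = torch.nn.CrossEntropyLoss()
for inputs, targets in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
```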