Sub-Task Decomposition Enables Learning in Sequence to Sequence Tasks

Authors: Noam Wies, Yoav Levine, Amnon Shashua

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Figure 3 clearly shows that in a practical setting, using common Transformer networks, a very large gap quickly opens between the settings with and without intermediate supervision. The employed BERT base sized Transformer architecture is a strong network that pushed the envelope on very challenging NLP tasks, and is much stronger than the theoretically analyzed RNN. Still, learning even the 32 bit subset parity task without supervision proved to be too challenging for this network (no learning after over 2M steps), while it easily learned the task in the presence of intermediate supervision. Overall, this experiment, performed on the same task on which we prove our theoretical results, reinforces their relevance to common Transformer architectures.
Researcher Affiliation | Academia | Noam Wies, Yoav Levine & Amnon Shashua, The Hebrew University of Jerusalem, {noam.wies,yoav.levine,shashua}@cs.huji.ac.il
Pseudocode | Yes | Algorithm 1 below describes the analyzed training procedure of our sequence-to-sequence model. This algorithm describes a straightforward SGD training procedure where, for simplicity, we analyze a variant that updates only the hidden W weights while keeping A, B, M0 frozen at initialization. (A hedged sketch of this frozen-weight variant appears below the table.)
Open Source Code | Yes | A complete proof of all the theoretical claims was included in the appendix. We also provide the source code for the bit-subset parity experiment in https://github.com/HUJIDeep/sub_task_decomposition.
Open Datasets | No | The paper describes a synthetic task (bit-subset parity) for which data is generated based on specific rules, rather than using a pre-existing, publicly available dataset with a direct link or citation. It states: 'we randomly sampled a subset of d/2 predefined unique indices and then we randomly sampled non-overlapping training, validation and test datasets.' (See the data-generation sketch below the table.)
Dataset Splits | Yes | Table 2: Hyper-parameters and the random seeds examined in the experiment of learning bit-subset parity with Transformers. Validation Size: min(1024, 12.5% of the data); Test Size: min(1024, 12.5% of the data).
Hardware Specification | No | The paper does not provide specific details about the hardware used for the experiments, such as GPU models, CPU types, or memory specifications. It only mentions 'common Transformer networks' and a 'BERT base sized Transformer model'.
Software Dependencies | No | The paper mentions using 'the standard implementation of GPT-2 from the transformers framework (Wolf et al., 2020)' but does not specify a version number for the 'transformers' library or any other software dependencies.
Experiment Setup | Yes | See full technical details of the training apparatus in appendix G. Table 2: Hyper-parameters and the random seeds examined in the experiment of learning bit-subset parity with Transformers. Learning Rate: 10^-6, 10^-5, 10^-4; Weight Decay: 0, 10^-6, 10^-4, 10^-2; Dropout: 0; Batch Size: 32; Warmup Steps: 1k; Total Steps: 100k; Number of Layers: 12; Hidden Size: 768; FFN Inner Hidden Size: 3072; Initializer Range: 0.02; Adam β1: 0.9; Adam β2: 0.999; Adam ε: 10^-8; Validation Size: min(1024, 12.5% of the data); Test Size: min(1024, 12.5% of the data); Dataset Seed: 27, 67, 93. (See the configuration sketch below the table.)
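For orientation, here is a minimal PyTorch sketch of the training variant quoted in the Pseudocode row: plain SGD that updates only the hidden weights W while A, B, and M0 stay frozen at their random initialization. The recurrent cell, dimensions, nonlinearity, and readout below are illustrative assumptions, not the paper's exact construction.

```python
# Minimal sketch of the Algorithm 1 variant: SGD over W only, with A, B, M0
# frozen at initialization. Shapes and the toy recurrence are assumptions.
import torch

d_in, d_hidden, d_out = 32, 128, 2  # hypothetical dimensions

# Frozen-at-init matrices (requires_grad=False keeps them out of the update).
A = torch.randn(d_hidden, d_in) * 0.02
B = torch.randn(d_out, d_hidden) * 0.02
M0 = torch.randn(d_hidden) * 0.02
for frozen in (A, B, M0):
    frozen.requires_grad_(False)

# Only W is trained.
W = (torch.randn(d_hidden, d_hidden) * 0.02).requires_grad_(True)
optimizer = torch.optim.SGD([W], lr=1e-2)

def forward(x_seq):
    """Toy recurrence h <- relu(W h + A x_t), read out with B."""
    h = M0
    for x_t in x_seq:              # x_seq: list of (d_in,) tensors
        h = torch.relu(W @ h + A @ x_t)
    return B @ h                   # (d_out,) logits

# One SGD step on a single random example (label is a placeholder).
x_seq = [torch.randn(d_in) for _ in range(8)]
y = torch.tensor(1)
loss = torch.nn.functional.cross_entropy(forward(x_seq).unsqueeze(0), y.unsqueeze(0))
optimizer.zero_grad()
loss.backward()
optimizer.step()   # updates W; A, B, M0 are untouched
```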
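The Open Datasets and Dataset Splits rows describe how the synthetic bit-subset parity data is produced. The NumPy sketch below shows one way to realize that description under stated assumptions: the function name, total dataset size, and exact sampling scheme are hypothetical, while the d/2 subset, the non-overlapping splits, and the min(1024, 12.5% of the data) validation/test sizes follow the quoted text.

```python
# Hedged sketch of bit-subset parity data generation and splitting.
import numpy as np

def make_bit_subset_parity(d=32, n_total=65536, seed=27):
    rng = np.random.default_rng(seed)
    subset = rng.choice(d, size=d // 2, replace=False)        # d/2 predefined unique indices

    x = rng.integers(0, 2, size=(n_total, d), dtype=np.int64)  # random d-bit strings
    y = x[:, subset].sum(axis=1) % 2                           # parity over the chosen subset

    # Non-overlapping splits; validation/test each min(1024, 12.5% of the data).
    n_eval = min(1024, int(0.125 * n_total))
    perm = rng.permutation(n_total)
    val_idx, test_idx, train_idx = np.split(perm, [n_eval, 2 * n_eval])
    return (x[train_idx], y[train_idx]), (x[val_idx], y[val_idx]), (x[test_idx], y[test_idx])

train, val, test = make_bit_subset_parity()
print(len(train[0]), len(val[0]), len(test[0]))  # e.g. 63488 1024 1024
```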
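The Experiment Setup row lists the Table 2 hyper-parameters, and the Software Dependencies row notes that the model is the GPT-2 implementation from the transformers framework. A hedged configuration sketch combining the two follows; vocabulary size, number of attention heads, and sequence length are assumptions, and the learning rate and weight decay shown are single points from the swept grids.

```python
# Configuration sketch matching Table 2, using transformers + PyTorch.
import torch
from transformers import GPT2Config, GPT2LMHeadModel, get_linear_schedule_with_warmup

config = GPT2Config(
    vocab_size=16,            # assumption: small vocabulary for bit tokens
    n_positions=128,          # assumption: maximum sequence length
    n_layer=12,               # Number of Layers
    n_embd=768,               # Hidden Size
    n_inner=3072,             # FFN Inner Hidden Size
    n_head=12,                # assumption (standard choice for hidden size 768)
    resid_pdrop=0.0,          # Dropout 0
    embd_pdrop=0.0,
    attn_pdrop=0.0,
    initializer_range=0.02,   # Initializer Range
)
model = GPT2LMHeadModel(config)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,                  # one of {1e-6, 1e-5, 1e-4}
    betas=(0.9, 0.999),       # Adam beta1, beta2
    eps=1e-8,                 # Adam epsilon
    weight_decay=0.0,         # one of {0, 1e-6, 1e-4, 1e-2}
)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=1_000, num_training_steps=100_000
)
# Training would then iterate over batches of size 32, calling optimizer.step()
# and scheduler.step() after each backward pass.
```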