Learning to Rehearse in Long Sequence Memorization

Authors: Zhu Zhang, Chang Zhou, Jianxin Ma, Zhijie Lin, Jingren Zhou, Hongxia Yang, Zhou Zhao

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the performance of our rehearsal memory by the synthetic bAbI task and several downstream tasks, including text/video question answering and recommendation on long sequences. In this section, we first verify our rehearsal memory on the widely-used short-sequence reasoning task bAbI. Next, we mainly compare our approach with diverse baselines on several long-sequence reasoning tasks. We then perform ablation studies on the memory rehearsal techniques and analyze the impact of crucial hyper-parameters.
Researcher Affiliation | Collaboration | Zhejiang University, China; DAMO Academy, Alibaba Group, China.
Pseudocode | No | The paper describes methods in text and uses figures to illustrate components (e.g., Figures 1 and 2), but it does not contain a formal 'Pseudocode' or 'Algorithm' block.
Open Source Code | No | The paper does not include any explicit statement about releasing source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets | Yes | The bAbI dataset (Weston et al., 2015) is a synthetic text question answering benchmark and widely applied to evaluate the memorization and reasoning performance of MANNs. We apply the NarrativeQA dataset (Kočiský et al., 2018) with long input contents for long-sequence text question answering. The ActivityNet-QA dataset (Yu et al., 2019) contains 5,800 videos from the ActivityNet (Caba Heilbron et al., 2015). The XLong dataset (Ren et al., 2019) is sampled from the click logs on Alibaba.
Dataset Splits | Yes | Table 2. Performance Comparisons for Long-Sequence Text Question Answering on NarrativeQA (columns: Method, Setting, Val MRR, Test MRR). During the training stage, we simultaneously develop self-supervised rehearsal training and task-specific reasoning training based on the memory M.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using an 'Adam optimizer' and components like 'Transformer encoder' and 'GRU unit', but does not specify any software versions for programming languages, libraries, or frameworks (e.g., Python version, TensorFlow/PyTorch versions).
Experiment Setup | Yes | We set the layer number of the Transformer encoder and bi-directional Transformer decoder to 3. The head number in Multi-Head Attention is set to 4. We set λ1, λ2 and λ3 to 1.0, 0.5 and 1.0, respectively. The number B of history fragments is set to 6. During training, we apply an Adam optimizer (Duchi et al., 2011) to minimize the multi-task loss L_rm, where the initial learning rate is set to 0.001. We set the d_x and d_model to 128. The number K of memory slots is set to 20. And we naturally take each sentence in input texts as a segment and the maximum length N of segments is set to 15. For our rehearsal memory, we set the d_x and d_model to 256. The number K of memory slots is set to 20. We naturally take each sentence in summaries as a segment and the maximum length N of segments is set to 20. For our rehearsal memory, we set the d_x and d_model to 256. The number K of memory slots and length N of segments are both set to 20. For our rehearsal memory, we set the d_x and d_model to 64. The number K of memory slots and length N of segments are both set to 20.
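
To make the quoted settings easier to scan, the sketch below collects them into a single configuration object. It is a minimal illustration only: the class name, field names, and the QUOTED_VARIANTS list are hypothetical and do not appear in the paper; the numeric values are taken verbatim from the setup excerpt, and the excerpt does not state which per-task variant corresponds to which dataset.

# Minimal configuration sketch of the hyper-parameters quoted above.
# Class and field names are hypothetical; only the numeric values
# come from the quoted experiment-setup text.
from dataclasses import dataclass

@dataclass
class RehearsalMemoryConfig:
    encoder_layers: int = 3       # Transformer encoder / bi-directional decoder layers
    attention_heads: int = 4      # heads in Multi-Head Attention
    lambda_1: float = 1.0         # weights of the multi-task loss L_rm
    lambda_2: float = 0.5
    lambda_3: float = 1.0
    history_fragments: int = 6    # number B of history fragments
    learning_rate: float = 1e-3   # initial learning rate for the Adam optimizer
    memory_slots: int = 20        # number K of memory slots
    d_x: int = 128                # input feature size (task dependent)
    d_model: int = 128            # model size (task dependent)
    max_segment_len: int = 15     # maximum length N of a segment (task dependent)

# The four task-dependent variants quoted in the setup text, in the order
# they appear; the excerpt does not say which dataset each one belongs to.
QUOTED_VARIANTS = [
    dict(d_x=128, d_model=128, memory_slots=20, max_segment_len=15),
    dict(d_x=256, d_model=256, memory_slots=20, max_segment_len=20),
    dict(d_x=256, d_model=256, memory_slots=20, max_segment_len=20),
    dict(d_x=64,  d_model=64,  memory_slots=20, max_segment_len=20),
]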