Hierarchically Gated Recurrent Neural Network for Sequence Modeling

Authors: Zhen Qin, Songlin Yang, Yiran Zhong

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on language modeling, image classification, and long-range arena benchmarks showcase the efficiency and effectiveness of our proposed model. We conduct a comparative analysis between our proposed HGRN and four widely adopted sequence modeling structures, i.e., attention-based, MLP-based, FFT-based, and state-space-based. We evaluate HGRN on the WikiText-103 dataset [50] and the Pile [15] dataset for autoregressive language modeling, as well as its length extrapolation ability. To assess the accuracy and efficiency of our model in handling long-term dependencies, we utilize the LRA benchmark [78]. Additionally, we showcase the robustness of HGRN in a computer vision task on the ImageNet-1k dataset.
Researcher Affiliation | Collaboration | Zhen Qin (1), Songlin Yang (2), Yiran Zhong (1); 1: OpenNLPLab, Shanghai Artificial Intelligence Laboratory; 2: MIT CSAIL
Pseudocode | Yes | Algorithm 1: Recurrent Computing (an illustrative sketch of such a gated recurrence follows the table)
Open Source Code | Yes | The source code is available at https://github.com/OpenNLPLab/HGRN.
Open Datasets | Yes | We evaluate HGRN on the WikiText-103 dataset [50] and the Pile [15] dataset for autoregressive language modeling.
Dataset Splits | Yes | Table 1: Results on WikiText-103 (TNN [59]'s setting); ↓ means lower is better; columns: Model, PPL (val), PPL (test), Params (M). For the autoregressive language modeling, we conducted three sets of experiments. Firstly, we validated the performance of two different-scale models on the WikiText-103 dataset.
Hardware Specification | Yes | We implement our models in PyTorch [54] and train them on 8 Nvidia A100 GPUs.
Software Dependencies | No | The paper mentions PyTorch [54] as the implementation framework but does not provide a specific version number for it or other software dependencies.
Experiment Setup | Yes | We adopt the same training configuration for all competitors, including batch size, learning rate, training epochs or iterations, etc. We list detailed hyper-parameters in the Appendix. Table 16: Detailed training configurations used in our experiments (values given for the two reported settings): Total batch size: 128 / 2048; Number of updates/epochs: 50k updates / 300 epochs; Warmup steps/epochs: 4k steps / 5 epochs; Peak learning rate: 5e-4 / 2.5e-4; Learning rate scheduler: inverse sqrt / cosine; Optimizer: Adam / AdamW; Adam ε: 1e-8 / 1e-8; Adam (β1, β2): (0.9, 0.98) / (0.9, 0.98); Weight decay: 0.2 / 0.1; Gradient clipping: 1.0. (An illustrative configuration sketch follows the table.)
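The Pseudocode row above refers to Algorithm 1 (Recurrent Computing). As an illustration only, the following is a minimal PyTorch sketch of a generic element-wise gated linear recurrence of the form h_t = f_t * h_{t-1} + (1 - f_t) * c_t with an output gate. It is not the authors' code and omits HGRN-specific details such as complex-valued hidden states and the hierarchical lower bound on the forget gate; all class and variable names here are our own.

```python
import torch
import torch.nn as nn


class GatedLinearRecurrence(nn.Module):
    """Minimal sketch of an element-wise gated linear recurrence with an output gate.

    h_t = f_t * h_{t-1} + (1 - f_t) * c_t,   y_t = o_t * h_t

    Illustration only: it omits HGRN specifics such as complex-valued hidden
    states and the layer-dependent lower bound on the forget gate.
    """

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.input_proj = nn.Linear(d_model, d_hidden)   # candidate input c_t
        self.forget_proj = nn.Linear(d_model, d_hidden)  # forget gate f_t
        self.output_proj = nn.Linear(d_model, d_hidden)  # output gate o_t
        self.out = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, n, _ = x.shape
        c = self.input_proj(x)
        f = torch.sigmoid(self.forget_proj(x))   # gates lie in (0, 1)
        o = torch.sigmoid(self.output_proj(x))
        h = x.new_zeros(b, c.shape[-1])
        ys = []
        for t in range(n):                       # sequential recurrent computing
            h = f[:, t] * h + (1.0 - f[:, t]) * c[:, t]
            ys.append(o[:, t] * h)
        return self.out(torch.stack(ys, dim=1))  # (batch, seq_len, d_model)


if __name__ == "__main__":
    layer = GatedLinearRecurrence(d_model=64, d_hidden=64)
    y = layer(torch.randn(2, 16, 64))
    print(y.shape)  # torch.Size([2, 16, 64])
```

Because the recurrence is linear and element-wise in h, the same computation can also be evaluated with a parallel scan at training time, which is the usual efficiency argument for this family of models.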
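For the Experiment Setup row, the sketch below shows one way the first of the two reported configurations in Table 16 could be assembled in PyTorch: Adam with (β1, β2) = (0.9, 0.98), ε = 1e-8, weight decay 0.2, peak learning rate 5e-4, 4k warmup steps with an inverse-sqrt schedule, gradient clipping at 1.0, and 50k updates at total batch size 128. The model, data, and loss are placeholders, and the exact inverse-sqrt formula is an assumption since the paper's appendix is not quoted in full here.

```python
import torch
from torch import nn, optim

# Values taken from the first column of Table 16; the mapping of columns to
# tasks is not stated in the quoted text, so this grouping is an assumption.
PEAK_LR = 5e-4
WARMUP_STEPS = 4_000
TOTAL_UPDATES = 50_000
BATCH_SIZE = 128
GRAD_CLIP = 1.0

model = nn.Linear(512, 512)  # placeholder model, not the HGRN architecture

optimizer = optim.Adam(
    model.parameters(),
    lr=PEAK_LR,
    betas=(0.9, 0.98),
    eps=1e-8,
    weight_decay=0.2,
)

def inverse_sqrt(step: int) -> float:
    """Linear warmup for WARMUP_STEPS, then decay proportional to 1/sqrt(step)."""
    step = max(step, 1)
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    return (WARMUP_STEPS / step) ** 0.5

scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=inverse_sqrt)

for step in range(TOTAL_UPDATES):
    x = torch.randn(BATCH_SIZE, 512)       # dummy batch in place of real data
    loss = model(x).pow(2).mean()          # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP)
    optimizer.step()
    scheduler.step()
```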