Hierarchically Gated Recurrent Neural Network for Sequence Modeling
Authors: Zhen Qin, Songlin Yang, Yiran Zhong
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on language modeling, image classification, and long-range arena benchmarks showcase the efficiency and effectiveness of our proposed model. We conduct a comparative analysis between our proposed HGRN and four widely adopted sequence modeling structures, i.e., attention-based, MLP-based, FFT-based, and state-space-based. We evaluate HGRN on the WikiText-103 dataset [50] and the Pile [15] dataset for autoregressive language modeling, as well as its length extrapolation ability. To assess the accuracy and efficiency of our model in handling long-term dependencies, we utilize the LRA benchmark [78]. Additionally, we showcase the robustness of HGRN on the computer vision task using the ImageNet-1k dataset. |
| Researcher Affiliation | Collaboration | Zhen Qin¹, Songlin Yang², Yiran Zhong¹ (¹OpenNLPLab, Shanghai Artificial Intelligence Laboratory; ²MIT CSAIL) |
| Pseudocode | Yes | Algorithm 1: Recurrent Computing (an illustrative sketch of this recurrence is given after the table). |
| Open Source Code | Yes | The source code is available at https://github.com/OpenNLPLab/HGRN. |
| Open Datasets | Yes | We evaluate HGRN on the WikiText-103 dataset [50] and the Pile [15] dataset for autoregressive language modeling |
| Dataset Splits | Yes | Table 1: Results on WikiText-103 (TNN [59]'s setting); ↓ means lower is better. Columns: Model, PPL (val), PPL (test), Params (M). For the autoregressive language modeling, we conducted three sets of experiments. Firstly, we validated the performance of two different-scale models on the WikiText-103 dataset. |
| Hardware Specification | Yes | We implement our models in Pytorch [54] and train them on 8 Nvidia A100 GPUs. |
| Software Dependencies | No | The paper mentions 'Pytorch [54]' as the implementation framework but does not provide a specific version number for it or other software dependencies. |
| Experiment Setup | Yes | We adopt the same training configuration for all competitors, including batch size, learning rate, training epochs or iterations, etc. We list detailed hyper-parameters in the Appendix. Table 16: Detailed training configurations used in our experiments. Total batch size 128 2048 Number of updates/epochs 50k updates 300 epochs Warmup steps/epochs 4k steps 5 epochs Peak learning rate 5e-4 2.5e-4 Learning rate scheduler Inverse sqrt cosine Optimizer Adam Adamw Adam ϵ 1e-8 1e-8 Adam (β1, β2) (0.9, 0.98) (0.9, 0.98) Weight decay 0.2 0.1 Gradient clipping 1.0 |