Augmenting Transformers with Recursively Composed Multi-grained Representations
Authors: Xiang Hu, Qingyang Zhu, Kewei Tu, Wei Wu
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on various sentence-level and span-level tasks. Evaluation results indicate that ReCAT can significantly outperform vanilla Transformer models on all span-level tasks and recursive models on natural language inference tasks. |
| Researcher Affiliation | Collaboration | Xiang Hu1, Qingyang Zhu2, Kewei Tu2, Wei Wu1; 1Ant Group, 2ShanghaiTech University; {aaron.hx; congyue.ww}@antgroup.com; {zhuqy; tukw}@shanghaitech.edu.cn |
| Pseudocode | Yes | Algorithm 1: Pre-training ReCAT; Algorithm 2: Build cell batches |
| Open Source Code | Yes | Code released at https://github.com/ant-research/Structured_LM_RTDT |
| Open Datasets | Yes | Data for Pre-training. For English, we pre-train our model and baselines on WikiText103 (Merity et al., 2017). WikiText103 is split at the sentence level, and sentences longer than 200 tokens after tokenization are discarded (about 0.04% of the original data). The total number of tokens left is 110M. |
| Dataset Splits | Yes | Table 2: Dev/Test performance for four span-level tasks on OntoNotes 5.0. All tasks are evaluated using F1 score. All models except BERT are pre-trained on wiki103 with the same setup. As the test dataset of GLUE is not published, we fine-tune all models on the training set and report the best performance (acc. by default) on the validation set. |
| Hardware Specification | Yes | We pre-train all models on 8 A100 GPUs. The total training time is evaluated on 5,000 samples with different batch sizes on 1 A100 GPU with 80GB memory. |
| Software Dependencies | No | The paper mentions software components like "Transformer", "Adam optimization", and "Hugging Face" but does not provide specific version numbers for any software, libraries, or dependencies used for the experiments. |
| Experiment Setup | Yes | Our model and the Transformers are all pre-trained on Wiki-103 for 30 epochs and fine-tuned on the four tasks respectively for 20 epochs. We feed span representations through a two-layer MLP using the same setting as in Toshniwal et al. (2020). In particular, during the first 10 epochs of fine-tuning, inputs are also masked by 15% and the final loss is the summation of the downstream task loss, the parser feedback loss, and the MLM loss. For the last 10 epochs, we switch to the fast encoding mode, which is described in Appendix A.3, during which inputs are not masked anymore and the top-down parser is frozen. We use a learning rate of 5e-4 to tune the classification MLP, a learning rate of 5e-5 to tune the backbone span encoders, and a learning rate of 1e-3 to tune the top-down parser. We follow the setting in Devlin et al. (2019), using 768-dimensional embeddings, a vocabulary size of 30522, 3072-dimensional hidden layer representations, and 12 attention heads as the default setting for the Transformer. The top-down parser of our model uses a 4-layer bidirectional LSTM with 128-dimensional embeddings and a 256-dimensional hidden layer. The pruning threshold m is set to 2. Training is conducted using Adam optimization with weight decay, using a learning rate of 5×10^-5 for the tree encoder and 1×10^-3 for the top-down parser. Input tokens are batched by length, which is set to 10240. |
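
The Open Datasets row quotes the pre-training data preparation: WikiText-103 is split at the sentence level and sentences longer than 200 tokens after tokenization are discarded. A minimal sketch of that filtering step is shown below, assuming a BERT-style WordPiece tokenizer (the reported vocabulary size of 30522 matches bert-base-uncased); the tokenizer choice and the function name `filter_sentences` are assumptions, not details from the released code.

```python
# Sketch of the sentence-length filtering described for WikiText-103.
# Assumption: a BERT-style WordPiece tokenizer (vocab size 30522).
from transformers import AutoTokenizer

MAX_LEN = 200  # per-sentence token limit quoted from the paper

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def filter_sentences(sentences):
    """Keep only sentences whose tokenized length is at most MAX_LEN tokens."""
    return [s for s in sentences if len(tokenizer.tokenize(s)) <= MAX_LEN]
```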
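
The Experiment Setup row lists Adam with weight decay and three separate learning rates: 5e-4 for the classification MLP, 5e-5 for the backbone span encoders, and 1e-3 for the top-down parser. The sketch below shows one way to express that split with PyTorch parameter groups; the module attribute names (`classifier`, `encoder`, `parser`) and the weight-decay value are assumptions rather than details taken from the paper or its code.

```python
# Sketch of a three-way learning-rate split via optimizer parameter groups.
# Attribute names and the weight-decay value are assumptions.
import torch

def build_optimizer(model, weight_decay=0.01):
    param_groups = [
        {"params": model.classifier.parameters(), "lr": 5e-4},  # task-specific MLP
        {"params": model.encoder.parameters(),    "lr": 5e-5},  # backbone span encoders
        {"params": model.parser.parameters(),     "lr": 1e-3},  # top-down parser
    ]
    # AdamW implements Adam with decoupled weight decay.
    return torch.optim.AdamW(param_groups, weight_decay=weight_decay)
```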