Efficient Beam Tree Recursion

Authors: Jishnu Ray Chowdhury, Cornelia Caragea

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In Table 1, we compare the empirical time-memory trade-offs of the most relevant Tree-RvNN models (particularly those that are competent in ListOps and logical inference). We use CRvNN in the no-halting mode, as in [66], because otherwise it can start to halt trivially because of limited training data. For the splits of lengths 200-1000 we use the data shared by Havrylov et al. [31]; for the 1500-2000 split we sample from the training set of LRA ListOps [81]. We can observe from the table that EBT-GRC achieves better memory efficiency than all the strong RvNN contenders (GT-GRC and EGT-GRC fail on ListOps/logical inference) except for OM. However, OM's memory efficiency comes with a massive cost in time, being nearly 8 times slower than EBT-GRC. Compared to BT-GRC OS's 43GB peak memory consumption on the 900-1000 sequence-length split from [66], the memory consumption of EBT-GRC is reduced to only 2.8GB. Even compared to BT-GRC, the reduction is nearly ten times. EBT-GRC even outperforms the original greedy GT-GRC used in Choi et al. [9]. Removing the slicing from the full EBT-GRC model (i.e., -slice) can substantially increase the memory cost. This becomes most apparent when training with a higher hidden state size (compare (512) vs. (512, -slice)). This shows the effectiveness of slicing.
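For context on how a peak-memory/runtime comparison like the one quoted above is typically collected on a single GPU, the helper below is a hedged sketch using PyTorch's standard CUDA memory counters. It is not the authors' benchmarking script; the function name and the placeholder objective are assumptions.

```python
import time
import torch

def profile_forward_backward(model, batch, device="cuda"):
    """Measure wall-clock time and peak GPU memory for one forward+backward pass.
    A minimal sketch, assuming a model whose output can be summed into a scalar."""
    model.to(device)
    batch = batch.to(device)
    torch.cuda.reset_peak_memory_stats(device)
    torch.cuda.synchronize(device)
    start = time.time()
    loss = model(batch).sum()        # placeholder objective, not the paper's loss
    loss.backward()
    torch.cuda.synchronize(device)
    elapsed = time.time() - start
    peak_gb = torch.cuda.max_memory_allocated(device) / 1024 ** 3
    return elapsed, peak_gb
```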
Researcher Affiliation | Academia | Jishnu Ray Chowdhury, Cornelia Caragea; Computer Science, University of Illinois Chicago; jraych2@uic.edu, cornelia@uic.edu
Pseudocode | Yes | (Appendix E, Pseudocode) We present the pseudocode of EBT-RvNN in Algorithm 1. Note that the algorithms are written as they are for the sake of illustration: in practice, many of the nested loops are made parallel through batched operations.
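Since Algorithm 1 in the paper is written with explicit nested loops for illustration, a stripped-down sketch may help convey the general control flow of beam-tree recursion (score all adjacent merges, keep the top-k merges per beam, compose the chosen pair with a recursive cell). The module below is only an assumed illustration: the class name, scorer, and composition cell are placeholders and do not reproduce EBT-GRC, its GRC cell, or the slicing component described in the paper.

```python
import torch
import torch.nn as nn

class TinyBeamTreeEncoder(nn.Module):
    """Simplified beam-tree recursion sketch for a single sequence.
    Not the paper's EBT-GRC; scorer and cell are toy placeholders."""
    def __init__(self, d, beam=2):
        super().__init__()
        self.beam = beam
        self.scorer = nn.Linear(2 * d, 1)                          # scores each candidate adjacent merge
        self.cell = nn.Sequential(nn.Linear(2 * d, d), nn.Tanh())  # toy recursive composition cell

    def forward(self, x):
        # x: (seq_len, d); a real implementation batches sequences and
        # parallelizes the loops below with tensor operations.
        beams = [(x, 0.0)]                                          # (node sequence, accumulated log-score)
        while beams[0][0].size(0) > 1:
            candidates = []
            for nodes, score in beams:
                pairs = torch.cat([nodes[:-1], nodes[1:]], dim=-1)  # all adjacent (left, right) pairs
                logp = torch.log_softmax(self.scorer(pairs).squeeze(-1), dim=0)
                top = torch.topk(logp, min(self.beam, logp.size(0))).indices.tolist()
                for i in top:
                    merged = self.cell(pairs[i]).unsqueeze(0)       # compose the chosen pair into one node
                    new_nodes = torch.cat([nodes[:i], merged, nodes[i + 2:]], dim=0)
                    candidates.append((new_nodes, score + logp[i].item()))
            candidates.sort(key=lambda c: c[1], reverse=True)       # keep the `beam` best partial trees
            beams = candidates[: self.beam]
        return beams[0][0].squeeze(0)                               # root vector of the highest-scoring beam

# Usage: TinyBeamTreeEncoder(d=64, beam=2)(torch.randn(10, 64)) returns a (64,) root vector.
```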
Open Source Code | Yes | Our code is available at the link: https://github.com/JRC1995/BeamRecursionFamily.
Open Datasets | Yes | List Operations (ListOps): The ListOps task consists of hierarchical nested operations that neural networks generally struggle to solve, particularly in length-generalizable settings. There are only a few known contenders that achieve decent performance on the task [31, 10, 71, 66]. For this task we use the original training set [57] with the length generalization splits from Havrylov et al. [31], the argument generalization splits from Ray Chowdhury and Caragea [66], and the LRA test set from Tay et al. [81]. The different splits test the model in different out-of-distribution settings (one with unseen lengths, another with an unseen number of arguments, and another with both unseen lengths and arguments). Remarkably, as can be seen from Table 2, EBT-GRC outperforms most of the previous models in accuracy, only falling slightly behind OM on some argument generalization splits.
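To make the task concrete, the snippet below sketches a tiny evaluator for ListOps-style expressions (operators MAX, MIN, MED for median, and SM for sum modulo 10, with labels in 0-9). The tokenization and helper function are illustrative assumptions and are not taken from the paper's code or the dataset release scripts.

```python
from statistics import median

def eval_listops(tokens):
    """Evaluate a tokenized ListOps expression, e.g.
    '[MAX 2 9 [MIN 4 7 ] 0 ]'.split() -> 9 (the classification label)."""
    def parse(pos):
        tok = tokens[pos]
        if tok.startswith('['):                 # operator token such as '[MAX'
            op, args, pos = tok[1:], [], pos + 1
            while tokens[pos] != ']':
                val, pos = parse(pos)           # recursively evaluate nested arguments
                args.append(val)
            result = {'MAX': max, 'MIN': min,
                      'MED': lambda a: int(median(a)),
                      'SM': lambda a: sum(a) % 10}[op](args)
            return result, pos + 1              # skip the closing ']'
        return int(tok), pos + 1                # digit operand

    value, _ = parse(0)
    return value                                # label in {0, ..., 9}

assert eval_listops('[MAX 2 9 [MIN 4 7 ] 0 ]'.split()) == 9
```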
Dataset Splits | Yes | ListOps: ListOps was introduced by Nangia and Bowman [57] and is a task of solving nested lists of mathematical operations. It is a 10-way classification task. Similar to Chowdhury and Caragea [10], we train our models on the original training set with all samples of ≥ 100 sequence length filtered out. We use the original development set for validation. We test on the following sets: the original test set (near-IID split); the length generalization splits from Havrylov et al. [31] that include samples of much higher lengths; the argument generalization splits from Ray Chowdhury and Caragea [66] that involve an unseen maximum number of arguments for each operator; and the LRA split (which has both higher sequence lengths and higher argument numbers) from Tay et al. [81].
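As a concrete illustration of the training-length filter mentioned above, the snippet below keeps only samples below a length cutoff. The file path, the tab-separated "label, expression" layout, and the exact boundary handling are assumptions for illustration, not the repository's actual preprocessing code.

```python
def filter_by_length(path_in, path_out, max_len=100):
    """Write only the training samples whose token length is below max_len.
    Assumes one 'label<TAB>expression' pair per line (illustrative format)."""
    with open(path_in) as f_in, open(path_out, "w") as f_out:
        for line in f_in:
            label, expr = line.rstrip("\n").split("\t")
            if len(expr.split()) < max_len:     # sequence length counted in tokens
                f_out.write(line)
```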
Hardware Specification | Yes | All models were trained on a single Nvidia RTX A6000.
Software Dependencies | No | The paper does not list specific versions of software dependencies (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | Hyperparameters are presented in Appendix G, architecture details are presented in Appendix F, task details are provided in Appendix B, and additional results (besides what is presented below) on logical inference and text classification are provided in Appendix C. For sentence-encoder models, we use the same hyperparameters as [66] (the preprint of the paper is available in the supplementary in anonymized form) for all the datasets. The only new hyperparameter for EBT-GRC is ds, which we set to 64; otherwise the hyperparameters are the same as those of BT-GRC or BT-GRC OS. We discuss the hyperparameters of the sequence-interaction models next. For EBT-GAU/EGT-GAU, we use two-layered weight-shared GAU-Blocks for NGAUStack1/GAUStack1 and three-layered weight-shared GAU-Blocks for GAUStack2 (for parameter efficiency and regularization). GAU uses five-layered GAU-Blocks (weights unshared) for GAUStack so that the parameter count is similar to that of EBT-GAU or EGT-GAU. We use a dropout of 0.1 after the multiplication with Wo in each GAU-Block layer and a head size dh of 128 (similar to Hua et al. [36]). For relative position, we set k = 5 (k here corresponds to the receptive field for relative attention in Shaw et al. [70]) for normal GAU-Blocks and k = 10 for parent attention (since parent attention is only applied to higher heights, we do not need to initialize weights for negative relative distances). Other hyperparameters are kept the same as for the sentence-encoder models. The hyperparameters of MNLI, SNLI, and QQP are shared. Note that all the natural language tasks are trained with fixed 840B GloVe embeddings [62], as in Ray Chowdhury and Caragea [66].
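For readability, the hyperparameters quoted above can be collected into a single configuration sketch. The key names below are illustrative only and do not reflect the repository's actual config schema.

```python
# Hedged summary of the hyperparameters listed in the quoted experiment setup.
ebt_gau_config = {
    "slice_dim_ds": 64,               # new slicing hyperparameter for EBT-GRC
    "gau_head_size_dh": 128,          # as in Hua et al. [36]
    "dropout_after_Wo": 0.1,          # applied after multiplication with Wo in each GAU-Block
    "rel_pos_k_normal": 5,            # receptive field for relative attention (Shaw et al. [70])
    "rel_pos_k_parent_attention": 10, # parent attention needs no negative relative distances
    "gau_stack1_layers": 2,           # weight-shared GAU-Blocks (NGAUStack1/GAUStack1)
    "gau_stack2_layers": 3,           # weight-shared GAU-Blocks (GAUStack2)
    "gau_baseline_layers": 5,         # unshared, to roughly match the parameter count
    "embeddings": "GloVe 840B (fixed)",
}
```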