Shapeshifter: a Parameter-efficient Transformer using Factorized Reshaped Matrices

Authors: Aliakbar Panahi, Seyran Saeedi, Tom Arodz

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validated the approach on machine translation using sequence-to-sequence models. The experiments were performed on a single V100 (longest run: 6 days) or A100 GPU (longest run: 1 day). Time overhead experiments were performed on a dedicated workstation with a single NVIDIA RTX 3090 GPU. As the basis for Shapeshifter, we used the Transformer [1], with embedding and encoder/decoder layers replaced with compact representations. In Shapeshifter, we used a small rank for all matrices in the multi-head self-attention blocks. For the embedding matrices, which are much larger and can be reduced more effectively, we used a higher rank, while aiming to keep the total size of the factorized embeddings below the total size of the rest of the factorized encoder/decoder. The time overhead introduced by the representation (see Table 1) is modest: up to a 17% increase in running time during training and up to 13% during prediction with a trained model. (An illustrative low-rank factorization sketch appears after the table.)
Researcher Affiliation | Collaboration | Aliakbar Panahi (1,2), Seyran Saeedi (1,3), Tom Arodz (1). Affiliations: 1) Department of Computer Science, Virginia Commonwealth University, Richmond, VA; 2) C3 AI, Redwood City, CA; 3) Dept. of Electrical and Computer Engineering, University of California, Santa Barbara, CA.
Pseudocode | No | The paper does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | Code available at: https://github.com/tarodz/shapeshifter.
Open Datasets | Yes | We evaluate the approach on the IWSLT 14 German-to-English (De-En) [37] and WMT 18 English-to-Romanian (En-Ro) [36] datasets. For the first dataset we use a learning rate of 2e-3 with a batch size of 128, while for the larger En-Ro dataset we use 3e-3 with a batch size of 192.
Dataset Splits | No | While Table 2 mentions 'dev set' (development set, often used interchangeably with validation set) for performance measurement, the paper does not specify the exact percentages or sample counts for the training, validation, and test data splits.
Hardware Specification | Yes | The experiments were performed on a single V100 (longest run: 6 days) or A100 GPU (longest run: 1 day). Time overhead experiments were performed on a dedicated workstation with a single NVIDIA RTX 3090 GPU.
Software Dependencies | No | The paper mentions using 'Huggingface Transformers [40] for PyTorch [41] implementation' and 'fairseq [45]', but does not provide specific version numbers for these software dependencies, which are necessary for full reproducibility.
Experiment Setup | Yes | For the first dataset we use a learning rate of 2e-3 with a batch size of 128, while for the larger En-Ro dataset we use 3e-3 with a batch size of 192. We used the LAMB optimizer [42] with an inverse square root scheduler, dropout 0.1, no weight decay, and 0.1 label-smoothed cross entropy loss. For the training setup we used the Adam optimizer, a learning rate of 5e-4, an inverse square root scheduler with 15K warmup steps, dropout 0.3, weight decay of 1e-4, and 0.1 label-smoothed cross entropy loss. (A hedged training-configuration sketch appears after the table.)
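
The Research Type row describes replacing dense embedding and encoder/decoder matrices with compact, rank-limited representations. Below is a minimal, illustrative sketch of that general parameter-saving idea: a rank-r factorized drop-in for a dense linear layer. This is not the paper's actual factorized-reshaped-matrix construction (Shapeshifter reshapes matrices before factorizing them); the class name, initialization, and dimensions here are hypothetical.

```python
# Illustrative sketch only: a rank-r factorized replacement for a dense
# linear layer's weight. Shapeshifter's actual layers factorize *reshaped*
# matrices and differ in detail; this only shows the parameter-saving idea.
import torch
import torch.nn as nn


class LowRankLinear(nn.Module):
    """Computes y = x V^T U^T + b with U: (out, r) and V: (r, in)."""

    def __init__(self, in_features: int, out_features: int, rank: int, bias: bool = True):
        super().__init__()
        # Simple scaled random init (not the paper's scheme).
        self.U = nn.Parameter(torch.randn(out_features, rank) * rank ** -0.5)
        self.V = nn.Parameter(torch.randn(rank, in_features) * in_features ** -0.5)
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Apply V, then U, so the dense (out x in) matrix is never materialized.
        y = x @ self.V.t() @ self.U.t()
        return y if self.bias is None else y + self.bias


# Weight parameters alone: dense 512*512 = 262,144 vs. rank-16 factorization
# 2*16*512 = 16,384 (plus a bias vector in both cases).
dense = nn.Linear(512, 512)
compact = LowRankLinear(512, 512, rank=16)
print(sum(p.numel() for p in dense.parameters()), sum(p.numel() for p in compact.parameters()))
```

Lower rank for the small attention matrices and higher rank for the large embeddings, as quoted above, trades a modest time overhead for a large reduction in parameter count.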
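
The Experiment Setup row quotes two configurations. The following is a hedged sketch of the second one (Adam, learning rate 5e-4, inverse square root schedule with 15K warmup steps, weight decay 1e-4, 0.1 label-smoothed cross entropy), assuming a standard PyTorch training loop. The LAMB optimizer used in the other configuration is not part of core PyTorch and is omitted here; `model` is a placeholder, and the warmup/decay formula is one common inverse-square-root variant, not necessarily the paper's exact implementation.

```python
# Hedged sketch of the quoted Adam-based training configuration.
import torch
import torch.nn as nn

model = nn.Linear(512, 512)  # placeholder for the actual Transformer model

optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, weight_decay=1e-4)

warmup_steps = 15_000


def inv_sqrt(step: int) -> float:
    # Linear warmup to the base lr, then decay proportional to 1/sqrt(step).
    step = max(step, 1)
    if step < warmup_steps:
        return step / warmup_steps
    return (warmup_steps / step) ** 0.5


scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=inv_sqrt)

# 0.1 label-smoothed cross entropy (built into PyTorch >= 1.10).
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```

Dropout (0.1 or 0.3 depending on the configuration) would be set inside the Transformer modules themselves rather than in this loop.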