Orchid: Flexible and Data-Dependent Convolution for Sequence Modeling

Authors: Mahdi Karami, Ali Ghodsi

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the proposed model across multiple domains, including language modeling and image classification, to highlight its performance and generality. Our experiments demonstrate that this architecture not only outperforms traditional attention-based architectures such as BERT and Vision Transformers with smaller model sizes, but also extends the feasible sequence length beyond the limitations of the dense attention layers.
Researcher Affiliation | Collaboration | Mahdi Karami, Google Research (mahdika@google.com); Ali Ghodsi, School of Computer Science, University of Waterloo, ON, Canada (ali.ghodsi@uwaterloo.ca)
Pseudocode | Yes | Listing 1: A basic implementation of the Orchid layer.
Open Source Code | Yes | We provide a basic Python implementation of the Orchid model in Appendix D. (A hedged, illustrative sketch of a data-dependent convolution layer appears after this table.)
Open Datasets | Yes | Orchid models are pre-trained using masked language modeling over the C4 dataset [Raffel et al., 2019] with the bert-base-uncased tokenizer. ... We evaluated the models on two widely used image classification datasets: CIFAR-10 and ImageNet-1K. ... we conducted experiments on the speech classification task using the SC10 subset of the Speech Commands dataset, which contains 10 classes.
Dataset Splits | Yes | Orchid models are pre-trained using masked language modeling with 30% masking over the C4 dataset [52] with sequence length of 128 and the bert-base-uncased tokenizer. ... The fine-tuning process was executed in accordance with the methodology described by Izsak et al. [53]. ... For CIFAR-10, images are transformed into sequences of 4×4 pixel patches... In the case of ImageNet-1K, we segmented images into patches of 16×16 pixels... (See the patch-extraction sketch after this table.)
Hardware Specification | Yes | The Orchid models were trained on a single P100 GPU for small to medium sequence lengths and on a single V100 GPU for long sequences. ... Models were pre-trained on a node of 4x A100 GPUs... Models were fine-tuned on a node of 4x A100 GPUs. ... Orchid was trained on a single P100 GPU... We trained Orchid on 4x A100 GPUs... The evaluation was conducted on an NVIDIA A100-40GB GPU...
Software Dependencies | No | The paper mentions PyTorch and the Adam optimizer but does not specify their version numbers, nor the version of the bert-base-uncased tokenizer or its library.
Experiment Setup | Yes | For training, we used the Adam optimizer [63] with its standard settings (β1 = .9, β2 = .999), and a learning rate of 5e-4 with a linear warmup schedule over 1000 steps. A weight decay of 0.1 was used as a regularizer. ... Our BERT-style model, called Orchid-BERT-base, has 12 layers with a hidden size of 768... Models were pre-trained on a node of 4x A100 GPUs for 70k steps with a batch size of 4096. ... For CIFAR-10, images are transformed into sequences of 4×4 pixel patches... For training, we used the Adam optimizer with its standard settings (β1 = .9, β2 = .999), and a base learning rate of 1e-3 with a linear warmup schedule within the first 10 epochs and then decay with a cosine schedule. (See the optimizer and learning-rate schedule sketch after this table.)
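The paper's Listing 1 and the Appendix D implementation are not reproduced here. As a rough illustration of the kind of layer the table refers to, the following is a minimal PyTorch sketch, assuming the core operation is a global convolution over the sequence whose kernel is produced from the input by a small conditioning network and applied in the frequency domain. The class name `DataDependentConvSketch`, the `kernel_net` conditioning network, and all sizes are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn


class DataDependentConvSketch(nn.Module):
    """Illustrative sketch only (not the paper's Listing 1 or Appendix D code).

    Assumes a global convolution over the sequence whose kernel is generated
    from the input itself, evaluated via FFT so the cost scales as O(L log L).
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.in_proj = nn.Linear(d_model, d_model)
        # Hypothetical conditioning network that maps the input sequence to a
        # per-channel, per-position convolution kernel.
        self.kernel_net = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        seq_len = x.shape[1]
        u = self.in_proj(x)
        k = self.kernel_net(x)  # data-dependent kernel, same shape as u
        # Long convolution over the sequence dimension via FFT, zero-padded
        # to 2*seq_len to avoid circular wrap-around.
        u_f = torch.fft.rfft(u, n=2 * seq_len, dim=1)
        k_f = torch.fft.rfft(k, n=2 * seq_len, dim=1)
        y = torch.fft.irfft(u_f * k_f, n=2 * seq_len, dim=1)[:, :seq_len]
        return self.out_proj(y)


# Tiny smoke test: batch of 2, sequence length 128, model width 64.
layer = DataDependentConvSketch(d_model=64)
out = layer(torch.randn(2, 128, 64))
assert out.shape == (2, 128, 64)
```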
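The image experiments convert each image into a sequence of non-overlapping patches (4×4 for CIFAR-10, 16×16 for ImageNet-1K). Below is a minimal sketch of that patch-extraction step, assuming patches are flattened in row-major order; the helper name `patchify` is illustrative and not taken from the paper.

```python
import torch


def patchify(images: torch.Tensor, patch_size: int) -> torch.Tensor:
    """Turn a batch of images into a sequence of flattened patches.

    images: (batch, channels, H, W); patch_size: 4 for CIFAR-10, 16 for ImageNet-1K
    returns: (batch, num_patches, channels * patch_size * patch_size)
    """
    b, c, h, w = images.shape
    p = patch_size
    x = images.unfold(2, p, p).unfold(3, p, p)      # (b, c, h//p, w//p, p, p)
    x = x.permute(0, 2, 3, 1, 4, 5).contiguous()    # (b, h//p, w//p, c, p, p)
    return x.view(b, (h // p) * (w // p), c * p * p)


# CIFAR-10: 32x32 RGB images with 4x4 patches -> sequence length 64.
seq = patchify(torch.randn(8, 3, 32, 32), patch_size=4)
assert seq.shape == (8, 64, 48)
```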
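The reported optimizer settings (Adam with β1 = 0.9, β2 = 0.999, weight decay 0.1, learning rate 5e-4 with a 1000-step linear warmup for language pre-training over 70k steps, and cosine decay as described for the image experiments) could be wired up as below. This is a sketch under assumptions: it combines linear warmup and cosine decay into one `SequentialLR`, uses plain Adam's L2-style weight decay rather than a decoupled variant, and applies cosine decay to the pre-training schedule even though the paper states it only for the image setup.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR, LambdaLR, SequentialLR

# Placeholder module standing in for an Orchid model.
model = torch.nn.Linear(768, 768)

# Adam with the reported betas and weight decay (decoupled vs. L2 decay is an assumption).
optimizer = Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.999), weight_decay=0.1)

warmup_steps = 1_000   # linear warmup length reported for language pre-training
total_steps = 70_000   # pre-training budget reported for Orchid-BERT-base

scheduler = SequentialLR(
    optimizer,
    schedulers=[
        # Linear warmup from near zero up to the base learning rate.
        LambdaLR(optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)),
        # Cosine decay for the remaining steps.
        CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps),
    ],
    milestones=[warmup_steps],
)

# A training loop would call optimizer.step() followed by scheduler.step() each step.
```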