Orchid: Flexible and Data-Dependent Convolution for Sequence Modeling
Authors: Mahdi Karami, Ali Ghodsi
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the proposed model across multiple domains, including language modeling and image classification, to highlight its performance and generality. Our experiments demonstrate that this architecture not only outperforms traditional attention-based architectures such as BERT and Vision Transformers with smaller model sizes, but also extends the feasible sequence length beyond the limitations of the dense attention layers. |
| Researcher Affiliation | Collaboration | Mahdi Karami, Google Research, mahdika@google.com; Ali Ghodsi, School of Computer Science, University of Waterloo, ON, Canada, ali.ghodsi@uwaterloo.ca |
| Pseudocode | Yes | Listing 1: A basic implementation of the Orchid layer. |
| Open Source Code | Yes | We provide a basic Python implementation of the Orchid model in Appendix D. (An illustrative, hedged sketch of a data-dependent convolution follows this table.) |
| Open Datasets | Yes | Orchid models are pre-trained using masked language modeling over the C4 dataset [Raffel et al., 2019] with the bert-base-uncased tokenizer. ... We evaluated the models on two widely used image classification datasets: CIFAR-10 and ImageNet-1K. ... we conducted experiments on the speech classification task using the SC10 subset of the Speech Commands dataset, which contains 10 classes. |
| Dataset Splits | Yes | Orchid models are pre-trained using masked language modeling with 30% masking over the C4 dataset [52] with a sequence length of 128 and the bert-base-uncased tokenizer. ... The fine-tuning process was executed in accordance with the methodology described by Izsak et al. [53]. ... For CIFAR-10, images are transformed into sequences of 4×4 pixel patches... In the case of ImageNet-1K, we segmented images into patches of 16×16 pixels... (A patchification sketch follows this table.) |
| Hardware Specification | Yes | The Orchid models were trained on a single P100 GPU for small to medium sequence lengths and on a single V100 GPU for long sequences. ... Models were pre-trained on a node of 4x A100 GPUs... Models were fine-tuned on a node of 4x A100 GPUs. ... Orchid was trained on a single P100 GPU... We trained Orchid on 4x A100 GPUs... The evaluation was conducted on an NVIDIA A100-40GB GPU... |
| Software Dependencies | No | The paper mentions PyTorch and the Adam optimizer but does not specify the PyTorch version, nor the version of the bert-base-uncased tokenizer or the library that provides it. |
| Experiment Setup | Yes | For training, we used the Adam optimizer [63] with its standard settings (β1 = 0.9, β2 = 0.999) and a learning rate of 5e-4 with a linear warmup schedule over the first 1000 steps. A weight decay of 0.1 was used as a regularizer. ... Our BERT-style model, called Orchid-BERT-base, has 12 layers with a hidden size of 768... Models were pre-trained on a node of 4x A100 GPUs for 70k steps with a batch size of 4096. ... For CIFAR-10, images are transformed into sequences of 4×4 pixel patches... For training, we used the Adam optimizer with its standard settings (β1 = 0.9, β2 = 0.999) and a base learning rate of 1e-3 with a linear warmup schedule over the first 10 epochs, followed by cosine decay. (An optimizer and scheduler sketch follows this table.) |
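The Pseudocode and Open Source Code rows refer to the authors' Listing 1 and Appendix D, which are not reproduced here. Purely as a rough illustration of the idea the paper describes, the sketch below shows one way a data-dependent global convolution could be computed in the frequency domain, with the kernel modulated by the input. The class name, the pooled gating, and all shapes are assumptions, not the Orchid layer itself.

```python
# Hypothetical sketch of a data-dependent global (circular) convolution.
# NOT the Orchid layer from the paper; the kernel-conditioning scheme
# below is an illustrative assumption.
import torch
import torch.nn as nn
import torch.fft


class DataDependentConv(nn.Module):
    def __init__(self, dim: int, seq_len: int):
        super().__init__()
        # Static per-channel kernel in the time domain.
        self.kernel = nn.Parameter(torch.randn(dim, seq_len) * 0.02)
        # Small network producing an input-conditioned kernel modulation.
        self.cond = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        b, l, _ = x.shape
        # Condition the kernel on a pooled summary of the input sequence.
        gate = torch.sigmoid(self.cond(x.mean(dim=1)))  # (b, dim)
        k = self.kernel[None] * gate[:, :, None]        # (b, dim, l)
        # Circular convolution via FFT: O(L log L) rather than O(L^2).
        x_f = torch.fft.rfft(x.transpose(1, 2), n=l)
        k_f = torch.fft.rfft(k, n=l)
        y = torch.fft.irfft(x_f * k_f, n=l)             # (b, dim, l)
        return y.transpose(1, 2)                        # (b, l, dim)


# Example: a batch of 2 sequences of length 128 with 64 channels.
y = DataDependentConv(dim=64, seq_len=128)(torch.randn(2, 128, 64))
```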
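The Dataset Splits row notes that images are turned into sequences of non-overlapping patches (4×4 for CIFAR-10, 16×16 for ImageNet-1K). A common, ViT-style way to do this, assumed here rather than taken from the paper, is a strided convolution used as a patch embedding.

```python
# Illustrative patchification sketch: a strided Conv2d maps an image to a
# sequence of patch embeddings (4x4 patches for CIFAR-10 in the paper).
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    def __init__(self, patch_size: int = 4, in_chans: int = 3, dim: int = 128):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width)
        x = self.proj(x)                     # (batch, dim, H/p, W/p)
        return x.flatten(2).transpose(1, 2)  # (batch, num_patches, dim)


# CIFAR-10: 32x32 images become 64 patches of size 4x4.
tokens = PatchEmbed()(torch.randn(2, 3, 32, 32))
print(tokens.shape)  # torch.Size([2, 64, 128])
```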
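The Experiment Setup row quotes concrete optimizer settings: Adam with β1 = 0.9, β2 = 0.999, a peak learning rate of 5e-4, weight decay 0.1, and a 1000-step linear warmup. The sketch below is an assumption about how those quoted settings map onto standard PyTorch utilities, not the authors' training script; the image-classification runs instead warm up over 10 epochs and then decay with a cosine schedule.

```python
# Illustrative sketch (not the authors' code): Adam with the quoted
# hyperparameters and a linear-warmup learning-rate schedule.
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR


def build_optimizer(model: torch.nn.Module,
                    lr: float = 5e-4,
                    weight_decay: float = 0.1,
                    warmup_steps: int = 1000):
    # Standard Adam settings quoted in the paper: beta1 = 0.9, beta2 = 0.999.
    # The paper says "Adam"; decoupled weight decay (AdamW) would be a
    # common alternative reading of "weight decay of 0.1".
    optimizer = Adam(model.parameters(), lr=lr,
                     betas=(0.9, 0.999), weight_decay=weight_decay)

    # Linear warmup over the first `warmup_steps`, then a constant rate.
    def warmup(step: int) -> float:
        return min(1.0, (step + 1) / warmup_steps)

    scheduler = LambdaLR(optimizer, lr_lambda=warmup)
    return optimizer, scheduler
```

In a training loop, one would call `optimizer.step()` followed by `scheduler.step()` after each batch so the warmup advances per step rather than per epoch.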