Mixture of Experts Meets Prompt-Based Continual Learning

Authors: Minh Le, An Nguyen The, Huy Nguyen, Trang Nguyen, Trang Pham, Linh Ngo, Nhat Ho

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments across various continual learning benchmarks and pre-training settings demonstrate that our approach achieves state-of-the-art performance compared to existing methods." (Introduction; see also Section 5, Experiments)
Researcher Affiliation | Collaboration | 1 The University of Texas at Austin; 2 Hanoi University of Science and Technology; 3 VinAI Research
Pseudocode | Yes | Algorithm 1: HiDe-Prompt's training algorithm (Appendix D)
Open Source Code | Yes | "Our code is publicly available at https://github.com/Minhchuyentoancbn/MoE_PromptCL."
Open Datasets | Yes | "We evaluate various continual learning methods on widely used CIL benchmarks, including Split CIFAR-100 [23] and Split ImageNet-R [23], consistent with prior work [49]. We further explore the model's performance on fine-grained classification tasks with Split CUB-200 [48] and large inter-task differences with 5-Datasets [9]."
Dataset Splits | No | The paper mentions 'Split CIFAR-100', 'Split ImageNet-R', and 'Split CUB-200', which are common continual learning benchmarks, and shows validation loss in Figure 3. However, it does not explicitly state the exact train/validation/test split percentages, sample counts, or the methodology used to construct these splits. (A task-split sketch follows the table.)
Hardware Specification | Yes | "We train and test on one NVIDIA A100 GPU for baselines and our method."
Software Dependencies | No | The paper states that 'Training employs an Adam optimizer (β1 = 0.9, β2 = 0.999)' and 'We leverage a pre-trained ViT-B/16 model as the backbone', but it does not specify software dependencies with version numbers (e.g., Python, PyTorch, or CUDA versions). (A backbone-loading sketch follows the table.)
Experiment Setup | Yes | "Training employs an Adam optimizer (β1 = 0.9, β2 = 0.999), a batch size of 128, and a constant learning rate of 0.005 for all methods except CODA-Prompt. CODA-Prompt utilizes a cosine decaying learning rate starting at 0.001. Additionally, a grid search technique was implemented to determine the most appropriate number of epochs for effective training." (A configuration sketch follows the table.)
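On the Dataset Splits row: the Split CIFAR-100 benchmark is conventionally built by partitioning the 100 classes into 10 disjoint tasks of 10 classes each. Since the paper does not spell out how its splits were constructed, the sketch below only illustrates that common convention; the function name make_split_cifar100 and the seeded class ordering are illustrative assumptions, not the authors' code.

```python
import numpy as np
from torch.utils.data import Subset
from torchvision import datasets, transforms

# Hypothetical helper: partition CIFAR-100 into disjoint class-incremental tasks.
# The 10-task / 10-classes-per-task layout is the usual Split CIFAR-100 convention;
# the paper does not report split percentages or per-task sample counts.
def make_split_cifar100(root="./data", num_tasks=10, seed=0, train=True):
    tfm = transforms.Compose([
        transforms.Resize(224),          # ViT-B/16 expects 224x224 inputs
        transforms.ToTensor(),
    ])
    base = datasets.CIFAR100(root=root, train=train, download=True, transform=tfm)

    rng = np.random.default_rng(seed)
    class_order = rng.permutation(100)   # fixed random class order
    classes_per_task = 100 // num_tasks

    targets = np.array(base.targets)
    tasks = []
    for t in range(num_tasks):
        task_classes = class_order[t * classes_per_task:(t + 1) * classes_per_task]
        idx = np.where(np.isin(targets, task_classes))[0]
        tasks.append(Subset(base, idx.tolist()))
    return tasks

# Usage: ten class-incremental tasks drawn from the CIFAR-100 train split.
# train_tasks = make_split_cifar100(train=True)
```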
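On the Software Dependencies row: the paper names a pre-trained ViT-B/16 backbone but not the library used to load it. A minimal sketch, assuming the timm library (an assumption, since no dependency or checkpoint details are given):

```python
import timm

# Illustrative only: an ImageNet pre-trained ViT-B/16 backbone at 224x224 resolution.
# num_classes=0 strips the classification head so the model returns features.
backbone = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
```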
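On the Experiment Setup row: the reported optimization settings translate into a small configuration sketch. Here prompt_params, use_coda_prompt, and epochs_per_task are placeholder names; the number of epochs was selected by grid search in the paper and is not fixed here.

```python
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

# Sketch of the reported optimization settings (batch size 128 for all methods).
def build_optimizer(prompt_params, use_coda_prompt=False, epochs_per_task=20):
    if use_coda_prompt:
        # CODA-Prompt: cosine-decaying learning rate starting at 0.001.
        optimizer = Adam(prompt_params, lr=1e-3, betas=(0.9, 0.999))
        scheduler = CosineAnnealingLR(optimizer, T_max=epochs_per_task)
    else:
        # All other methods: constant learning rate of 0.005.
        optimizer = Adam(prompt_params, lr=5e-3, betas=(0.9, 0.999))
        scheduler = None
    return optimizer, scheduler
```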