Customizable Combination of Parameter-Efficient Modules for Multi-Task Learning

Authors: Haowen Wang, Tao Sun, Congyun Jin, Yingbo Wang, Yibo Fan, Yunqi Xu, Yuliang Du, Cong Fan

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To evaluate our approach, we conducted extensive experiments on the Super Natural-Instructions and the SuperGLUE benchmarks. Our findings demonstrate that C-Poly outperforms fully-shared, task-specific, and skill-indistinguishable baselines, significantly enhancing the sample efficiency in multi-task learning scenarios.
Researcher Affiliation | Industry | Ant Group, Shanghai, China {wanghaowen.whw,suntao.sun,jincongyun.jcy,wangyingbo.wyb,fanyibo.fyb,xuyunqi.xyq,duyuliang.dyl,fancong.fan}@antgroup.com
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any statement or link indicating that the code for their method is open-source or publicly available.
Open Datasets | Yes | we conducted experiments on two publicly available multitasking benchmarks: SuperGLUE (Wang et al., 2019) and Super Natural-Instructions (SuperNI) (Wang et al., 2022).
Dataset Splits | Yes | During the experiments, 100 tasks were randomly selected, and for each task, 1000 samples were randomly selected for training and another 100 were selected for evaluation purposes. (A sketch of this split protocol appears after the table.)
Hardware Specification | Yes | All experiments were conducted on a single NVIDIA Tesla A100 graphics card.
Software Dependencies | No | The paper mentions using the AdamW optimizer but does not provide version numbers for any software components or libraries such as Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | In the case of vanilla LoRA, we set the rank of the low-rank approximation, r = 8. For all MoE-like tuning methods, we utilized in total 4 parallel LoRAs (experts) with r = 2. In C-Poly, we set A = 3 LoRA for task-common skills and B = 1 LoRA for task-specific skills. ... We trained our model with cross-entropy loss for only 1 epoch, and set a batch size of 4 on both SuperNI and SuperGLUE datasets during training. The AdamW optimizer (Loshchilov & Hutter, 2017) was used, with a learning rate of 5e-5. We also employed the linear decay strategy (Loshchilov & Hutter, 2016) as the learning rate scheduler with a weight decay of 0.01 and a warmup ratio of 0.06. (A configuration sketch appears after the table.)
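The dataset-splits row describes the sampling protocol (100 tasks, with 1000 training and 100 evaluation examples per task) in prose only. Below is a minimal Python sketch of that protocol, assuming the Super Natural-Instructions tasks have already been loaded into a mapping from task name to a list of examples; the loading step, the random seed, and the helper name `sample_task_splits` are assumptions, not the authors' implementation.

```python
import random

def sample_task_splits(tasks, n_tasks=100, n_train=1000, n_eval=100, seed=0):
    """Randomly pick `n_tasks` tasks and, per task, disjoint train/eval subsets.

    `tasks` is assumed to map task name -> list of examples (e.g. dicts with
    "input"/"output" fields), as in the Super Natural-Instructions release.
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(tasks), k=n_tasks)
    splits = {}
    for name in chosen:
        examples = list(tasks[name])
        rng.shuffle(examples)
        splits[name] = {
            "train": examples[:n_train],                  # 1000 training samples
            "eval": examples[n_train:n_train + n_eval],   # 100 held-out samples
        }
    return splits

# Toy usage with synthetic data (the real benchmark has far more tasks/examples).
toy_tasks = {f"task{i:03d}": [{"input": f"x{j}", "output": f"y{j}"} for j in range(1200)]
             for i in range(150)}
splits = sample_task_splits(toy_tasks)
print(len(splits), len(splits[next(iter(splits))]["train"]))  # -> 100 1000
```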
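The experiment-setup row likewise reports the tuning hyperparameters only in prose. The sketch below shows how the vanilla-LoRA baseline (r = 8) and the reported optimizer/scheduler settings (AdamW, learning rate 5e-5, linear decay, weight decay 0.01, warmup ratio 0.06, batch size 4, 1 epoch) could be expressed with Hugging Face `peft` and `transformers`. The backbone checkpoint, `target_modules`, `lora_alpha`, and dropout are assumed values; the C-Poly skill modules (A = 3 task-common plus B = 1 task-specific LoRA experts with r = 2) are not available in `peft` and are not reproduced here.

```python
from transformers import AutoModelForSeq2SeqLM, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType

# Backbone is an assumption; the paper's exact checkpoint is not quoted above.
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

# Vanilla LoRA baseline: rank 8 (alpha, dropout, and target modules are assumed).
lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q", "v"],
)
model = get_peft_model(model, lora_cfg)

# Optimizer/scheduler settings as reported: AdamW, lr 5e-5, linear decay,
# weight decay 0.01, warmup ratio 0.06, batch size 4, 1 epoch.
args = TrainingArguments(
    output_dir="c_poly_repro",          # hypothetical output path
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    weight_decay=0.01,
    warmup_ratio=0.06,
    optim="adamw_torch",
)
```

A `Trainer` would then be built from these arguments and the per-task splits sketched above; the 4-expert MoE-style configuration used for C-Poly would require custom adapter modules on top of this baseline.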