Customizable Combination of Parameter-Efficient Modules for Multi-Task Learning
Authors: Haowen Wang, Tao Sun, Congyun Jin, Yingbo Wang, Yibo Fan, Yunqi Xu, Yuliang Du, Cong Fan
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate our approach, we conducted extensive experiments on the Super-NaturalInstructions and the SuperGLUE benchmarks. Our findings demonstrate that C-Poly outperforms fully-shared, task-specific, and skill-indistinguishable baselines, significantly enhancing the sample efficiency in multi-task learning scenarios. |
| Researcher Affiliation | Industry | Ant Group, Shanghai, China {wanghaowen.whw,suntao.sun,jincongyun.jcy,wangyingbo.wyb, fanyibo.fyb,xuyunqi.xyq,duyuliang.dyl,fancong.fan}@antgroup.com |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any statement or link indicating that the code for their method is open-source or publicly available. |
| Open Datasets | Yes | we conducted experiments on two publicly available multitasking benchmarks: SuperGLUE (Wang et al., 2019) and Super-NaturalInstructions (SuperNI) (Wang et al., 2022). |
| Dataset Splits | Yes | During the experiments, 100 tasks were randomly selected, and for each task, 1000 samples were randomly selected for training and another 100 were selected for evaluation purpose. |
| Hardware Specification | Yes | All experiments were conducted on a single NVIDIA Tesla A100 graphics card. |
| Software Dependencies | No | The paper mentions using the 'AdamW optimizer' but does not provide specific version numbers for any software components or libraries like Python, PyTorch, or TensorFlow. |
| Experiment Setup | Yes | In the case of vanilla LoRA, we set the rank of the low-rank approximation, r = 8. For all MoE-like tuning methods, we utilized in total 4 parallel LoRAs (experts) with r = 2. In C-Poly, we set A = 3 LoRAs for task-common skills and B = 1 LoRA for task-specific skills. ... We trained our model with cross entropy loss for only 1 epoch, and set batch size of 4 on both SuperNI and SuperGLUE datasets during training. The AdamW optimizer (Loshchilov & Hutter, 2017) was used, with a learning rate of 5e-5. We also employed the linear decay strategy (Loshchilov & Hutter, 2016) as the learning rate scheduler with a weight decay of 0.01 and a warmup ratio of 0.06. |
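For reproduction purposes, the reported schedule (base learning rate 5e-5, linear warmup over the first 6% of steps, then linear decay to zero) can be sketched as a plain step-to-rate function. This is a minimal sketch of the schedule as described, not the authors' code; the function name and the exact warmup/decay endpoint conventions are our assumptions.

```python
def linear_schedule_lr(step, total_steps, base_lr=5e-5, warmup_ratio=0.06):
    """Return the learning rate at `step` (0-indexed) out of `total_steps`.

    Implements linear warmup from 0 to `base_lr` over the first
    `warmup_ratio` fraction of steps, then linear decay back to 0.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup: rate grows proportionally with the step count.
        return base_lr * (step + 1) / warmup_steps
    # Linear decay over the remaining steps, reaching 0 at the final step.
    remaining = total_steps - warmup_steps
    return base_lr * max(0.0, (total_steps - step - 1) / remaining)
```

In practice this corresponds to the linear scheduler with warmup commonly paired with AdamW (weight decay 0.01 per the quoted setup); the AdamW update itself is applied separately by the optimizer.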