Polyhistor: Parameter-Efficient Multi-Task Adaptation for Dense Vision Tasks
Authors: Yen-Cheng Liu, Chih-Yao Ma, Junjiao Tian, Zijian He, Zsolt Kira
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results on four different dense vision tasks showed that existing methods cannot be efficiently integrated due to the hierarchical nature of Hierarchical Vision Transformers. To overcome this issue, we propose Polyhistor and Polyhistor-Lite, consisting of Decomposed Hyper Networks and Layer-wise Scaling Kernels, to share information across different tasks with a few trainable parameters. This leads to favorable performance improvements against existing parameter-efficient methods while using fewer trainable parameters. Specifically, Polyhistor achieves competitive accuracy compared to the state-of-the-art while only using 10% of their trainable parameters. (An illustrative sketch of this adapter-generation idea appears after the table.) |
| Researcher Affiliation | Collaboration | Yen-Cheng Liu (Georgia Tech, ycliu@gatech.edu); Chih-Yao Ma (Meta, cyma@meta.com); Junjiao Tian (Georgia Tech, jtian73@gatech.edu); Zijian He (Meta, zijian@meta.com); Zsolt Kira (Georgia Tech, zkira@gatech.edu) |
| Pseudocode | No | The paper does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | We will publicly release our code to facilitate future research. |
| Open Datasets | Yes | Dataset. We follow prior works [30, 31] on multi-task learning for dense prediction tasks and consider PASCAL-Context [32] to construct our multi-task efficient adaptation for per-pixel benchmark. |
| Dataset Splits | No | The paper mentions training data and evaluation metrics, but it does not explicitly state the specific train/validation/test dataset splits (e.g., percentages, absolute counts, or references to predefined splits) used for the experiments. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU models, CPU models, or cloud computing instances. |
| Software Dependencies | No | The paper does not list specific software dependencies with their version numbers required to reproduce the experiments. |
| Experiment Setup | Yes | Training. To train our model, we use the commonly-used losses for each task. Specifically, we use the standard per-pixel cross-entropy for semantic segmentation and human part segmentation, L1 loss for surface normals estimation, and balanced cross-entropy for saliency detection. For a fair comparison, we experiment on a unified codebase implementation with the same loss functions and training iterations for all baselines and our method. (...) VPT [9] inserts tunable embeddings in the first input layer (VPT-shallow) and all layers (VPT-deep), and we select the best hyper-parameter (i.e., 50 embeddings per layer) for all results. (...) LoRA [10] applies low-rank decomposition to the attention layers, and we select rank r = 4 and an adapter output scale of 4, which performs best. (Sketches of the per-task losses and the LoRA configuration follow the table.) |
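For intuition about how the components named in the abstract might fit together, the following is a minimal PyTorch sketch of a task-conditioned hypernetwork that generates low-rank adapter weights and applies a learnable per-layer scale. It is written under our own assumptions; the class and parameter names (`TaskHyperAdapter`, `task_emb`, `layer_scale`, `to_down`, `to_up`) are hypothetical and do not reproduce the authors' actual Decomposed Hyper Networks or Layer-wise Scaling Kernels.

```python
# Illustrative sketch only (not the authors' implementation): a task-conditioned
# hypernetwork that generates rank-r adapter weights from a small task embedding,
# plus a learnable per-layer scaling factor. All names are hypothetical.
import torch
import torch.nn as nn


class TaskHyperAdapter(nn.Module):
    def __init__(self, d_model: int, rank: int, emb_dim: int,
                 num_tasks: int, num_layers: int):
        super().__init__()
        self.task_emb = nn.Embedding(num_tasks, emb_dim)               # shared across layers
        self.layer_scale = nn.Parameter(torch.ones(num_layers, 1, 1))  # layer-wise scale
        # Hypernetwork heads mapping a task embedding to adapter weights.
        self.to_down = nn.Linear(emb_dim, d_model * rank)
        self.to_up = nn.Linear(emb_dim, rank * d_model)
        self.d_model, self.rank = d_model, rank

    def forward(self, x: torch.Tensor, task_id: int, layer_id: int) -> torch.Tensor:
        # x: [batch, tokens, d_model] features from a (frozen) transformer layer.
        e = self.task_emb(torch.tensor(task_id, device=x.device))
        down = self.to_down(e).view(self.d_model, self.rank)
        up = self.to_up(e).view(self.rank, self.d_model)
        # Residual low-rank adapter, scaled per layer.
        return x + self.layer_scale[layer_id] * (x @ down @ up)
```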
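The training description in the Experiment Setup row can be made concrete with a small loss sketch, assuming PyTorch. The tensor shapes, the `ignore_index` of 255 for segmentation labels, and the equal task weighting are assumptions following common dense-prediction conventions; none of these specifics are stated in the excerpt.

```python
# Minimal multi-task loss sketch for the four PASCAL-Context tasks, assuming
# PyTorch. Tensor shapes, ignore_index=255, and equal task weighting are
# assumptions, not details stated in the paper excerpt.
import torch
import torch.nn.functional as F


def multitask_loss(preds: dict, targets: dict) -> torch.Tensor:
    # Per-pixel cross-entropy for semantic segmentation and human part segmentation
    # (logits: [B, C, H, W], labels: [B, H, W] with 255 as the ignore label).
    loss_semseg = F.cross_entropy(preds["semseg"], targets["semseg"], ignore_index=255)
    loss_parts = F.cross_entropy(preds["parts"], targets["parts"], ignore_index=255)
    # L1 loss for surface-normal estimation ([B, 3, H, W] targets).
    loss_normals = F.l1_loss(preds["normals"], targets["normals"])
    # Balanced binary cross-entropy for saliency detection ([B, 1, H, W] masks in {0, 1}).
    sal = targets["saliency"].float()
    pos_weight = (sal == 0).sum() / sal.sum().clamp(min=1.0)
    loss_saliency = F.binary_cross_entropy_with_logits(
        preds["saliency"], sal, pos_weight=pos_weight)
    # Equal weighting across tasks (an assumption).
    return loss_semseg + loss_parts + loss_normals + loss_saliency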
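The quoted LoRA baseline setting (rank r = 4, output scale 4) corresponds to a standard low-rank residual update on a frozen linear projection. The sketch below is illustrative; the wrapper name `LoRALinear` and the choice of which attention projections to wrap are assumptions, not the paper's code.

```python
# Sketch of the LoRA baseline setting quoted above: rank r = 4 and output
# scale 4 on attention projections. The wrapper below is illustrative
# (the class name and wrapping pattern are not from the paper's code).
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, scale: float = 4.0):
        super().__init__()
        self.base = base                                   # frozen pretrained projection
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)   # down-projection
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)  # up-projection
        nn.init.zeros_(self.lora_b.weight)                 # update starts at zero
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```

In practice a wrapper like this would replace the attention projections of each transformer block, so only the small `lora_a`/`lora_b` matrices are trained while the pretrained weights stay frozen.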