Exploiting a Zoo of Checkpoints for Unseen Tasks
Authors: Jiaji Huang, Qiang Qiu, Kenneth Church
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5 (Experiments): In the previous sections, we have presented two key components: estimation of κ and MMI-based selection of checkpoints. In this section, we experiment with the two components combined. First, we apply Algorithm 1 to the κ estimated in Section 3.3 and show its effectiveness on multiple linguistic tasks. The baselines we compare against are random selection of checkpoints and a single commonly adopted checkpoint, e.g., bert-base-uncased. We then extend to image classification tasks. Again we observe consistent improvements over random picks and other straightforward alternatives. |
| Researcher Affiliation | Collaboration | Jiaji Huang, Baidu Research, Sunnyvale, CA 94089, huangjiaji@baidu.com; Qiang Qiu, School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907, qqiu@purdue.edu; Kenneth Church, Baidu Research, Sunnyvale, CA 94089, kennethchurch@baidu.com |
| Pseudocode | Yes | Algorithm 1: Maximum Mutual Information (MMI) based Selection of Checkpoints (a hedged sketch of greedy MMI selection appears after the table) |
| Open Source Code | Yes | All results can be reproduced using code at https://github.com/baidu-research/task_space |
| Open Datasets | Yes | We input the training set of wikitext2 as probing data and extract the contextualized word embeddings from the penultimate layer (see the extraction sketch after the table). In this section, we simulate an example using the cifar100 dataset. |
| Dataset Splits | Yes | Each task has 480 (= 500 − 20) training samples per class; 20 training samples are held out for each class. ... Finally, the remaining 10 holdouts per class are used as probing data to estimate κ (see the split sketch after the table). The performance of this task is measured by accuracy on the standard validation data (excluding classes not handled in this task), denoted as acc_t. |
| Hardware Specification | No | The paper does not specify any particular hardware components such as GPU or CPU models used for running the experiments. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers, such as Python or library versions. |
| Experiment Setup | Yes | Following [28], we train a softmax on top of the combined word representations (⊕_{i∈S} f_i) for each task; the gradients are not back-propagated through the checkpoints (see the linear-probe sketch after the table). Another design choice is that f_i is taken to be the feature at the top layer of the checkpoint. Each task has 480 (= 500 − 20) training samples per class. A resnet-50 is trained for each of these "seen" tasks and stored as a checkpoint. |
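The paper's Algorithm 1 selects checkpoints by maximizing mutual information under the estimated task kernel κ. As a hedged illustration only, the sketch below follows the classic greedy MMI rule for Gaussian processes (variance-ratio gain); the function names and the exact gain formula are assumptions, not the authors' released code.

```python
import numpy as np

def conditional_var(K, y, S, jitter=1e-8):
    """Posterior variance of index y given index set S under kernel K."""
    if not S:
        return K[y, y]
    K_SS = K[np.ix_(S, S)] + jitter * np.eye(len(S))  # jitter for stability
    K_yS = K[y, S]
    return K[y, y] - K_yS @ np.linalg.solve(K_SS, K_yS)

def mmi_select(K, k):
    """Greedily pick k task indices with maximal MMI gain.

    K: (n, n) task-similarity kernel (the estimated kappa matrix).
    The gain for a candidate y is the ratio of its variance given the
    already-selected set A to its variance given the unselected rest,
    i.e. the greedy surrogate for H(y|A) - H(y|rest).
    """
    n = K.shape[0]
    selected, remaining = [], list(range(n))
    for _ in range(k):
        def gain(y):
            rest = [j for j in remaining if j != y]
            return conditional_var(K, y, selected) / conditional_var(K, y, rest)
        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected
```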
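For the linguistic tasks, the probing features are contextualized word embeddings taken from a checkpoint's penultimate layer on wikitext2 text. A minimal extraction sketch, assuming a Hugging Face checkpoint (the model name and the single-sentence batch are illustrative, not from the released code):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint; the paper probes a zoo of such models.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

with torch.no_grad():
    batch = tok("some wikitext-2 sentence", return_tensors="pt")
    hidden = model(**batch).hidden_states  # tuple: embeddings + one per layer
    penultimate = hidden[-2]               # contextualized word embeddings
```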
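The per-class CIFAR-100 bookkeeping above (480 training samples after holding out 20, of which 10 serve as probing data for κ) can be sketched as follows; the random seed and split ordering are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is an assumption

def split_class(class_indices, n_holdout=20, n_probe=10):
    """Split the 500 per-class training indices into 480 train,
    10 probing, and 10 remaining holdout samples."""
    idx = rng.permutation(class_indices)
    holdout, train = idx[:n_holdout], idx[n_holdout:]    # 20 held out / 480 train
    probe, other = holdout[:n_probe], holdout[n_probe:]  # 10 probing / 10 other
    return train, probe, other
```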
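The transfer setup freezes the selected checkpoints and trains only a softmax over their combined features. A minimal linear-probe sketch, assuming PyTorch feature extractors (class and parameter names are illustrative):

```python
import torch
import torch.nn as nn

class FrozenProbe(nn.Module):
    """Softmax classifier over features concatenated from frozen checkpoints."""
    def __init__(self, extractors, feat_dims, num_classes):
        super().__init__()
        self.extractors = nn.ModuleList(extractors)
        for p in self.extractors.parameters():
            p.requires_grad = False  # no gradients through the checkpoints
        self.fc = nn.Linear(sum(feat_dims), num_classes)

    def forward(self, x):
        with torch.no_grad():  # checkpoints stay frozen
            feats = [f(x) for f in self.extractors]
        return self.fc(torch.cat(feats, dim=-1))
```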