Task-Robust Pre-Training for Worst-Case Downstream Adaptation
Authors: Jianghui Wang, Yang Chen, Xingyu Xie, Cong Fang, Zhouchen Lin
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In the experiments, we show, on both large-scale natural language processing and computer vision datasets, that our method improves the metrics on worst-case downstream tasks. In this section, we subject our methods to rigorous testing through two experiments, each encompassing tasks germane to the fields of Natural Language Processing (NLP) and Computer Vision (CV). |
| Researcher Affiliation | Academia | Jianghui Wang, Yang Chen, Xingyu Xie, Cong Fang, Zhouchen Lin; School of Intelligence Science and Technology, Peking University; jianghuiwang.ai@gmail.com, {yangchen, xyxie, fangcong, zlin}@pku.edu.cn |
| Pseudocode | Yes | Algorithm 1 Softmax Weighted Gradient Descent (a hedged sketch follows the table) |
| Open Source Code | No | The paper does not contain any explicit statement about making the source code for the described methodology publicly available, nor does it provide a link to a code repository. |
| Open Datasets | Yes | Following previous work [18, 48, 40], we evaluate our pre-trained models on downstream tasks using the GLUE [64] benchmarks. We select two datasets with different scales, ImageNet-1K [53] and ImageNet-S50 [25], to conduct unsupervised training upstream to see whether the minimax pre-training method can help the downstream tasks with poor performance. The classification task is evaluated on the validation part of the original dataset, while the semantic segmentation and depth estimation tasks are validated on the NYUv2 dataset [56] by fine-tuning. |
| Dataset Splits | Yes | The classification task is evaluated on the validation part of the original dataset. We fine-tune our models for 100 epochs and report the top-1 validation accuracy. |
| Hardware Specification | Yes | We set the batch size to 2048 and trained the models using 8 A100 GPUs with automatic mixed precision enabled. (A mixed-precision training sketch follows the table.) |
| Software Dependencies | No | The paper mentions using the AdamW optimizer and the Natural Language Toolkit (NLTK), but does not provide version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | The model is trained with AdamW [42] by setting β1 = 0.9, β2 = 0.999, ϵ = 1e-6, and L2 weight decay of 0.01. The learning rate is warmed up over the first 10K steps to a peak value of 1e-4, then linearly decayed. Table 4 lists the hyperparameters for pre-training the Part-of-Speech Mask BERT, and Table 6 lists those for pre-training the Multi-Modal Mask MAE. (An optimizer and schedule sketch follows the table.) |
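
The pseudocode row refers to the paper's Algorithm 1, "Softmax Weighted Gradient Descent". The snippet below is a minimal, hypothetical PyTorch sketch of what such an update could look like, assuming the per-task losses are combined with softmax weights as a smooth surrogate for the worst-case task loss; the temperature `tau`, the name `softmax_weighted_step`, and the detaching of the weights are illustrative assumptions, not details taken from the paper.

```python
import torch

def softmax_weighted_step(task_losses, optimizer, tau=1.0):
    """One softmax-weighted gradient step over several task losses (hypothetical sketch).

    task_losses: list of scalar loss tensors, one per pre-training task.
    tau: assumed temperature; a smaller tau puts more weight on the currently
         worst task, approximating a minimax (worst-case) objective.
    """
    losses = torch.stack(task_losses)                      # shape: (num_tasks,)
    weights = torch.softmax(losses.detach() / tau, dim=0)  # worse task -> larger weight
    weighted_loss = (weights * losses).sum()               # smooth surrogate for max_i loss_i
    optimizer.zero_grad()
    weighted_loss.backward()
    optimizer.step()
    return weighted_loss.item(), weights.tolist()
```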
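For the hardware row, the paper reports training with automatic mixed precision on 8 A100 GPUs. The sketch below shows one mixed-precision training step with `torch.cuda.amp`; it is not the authors' code, the `model(**batch).loss` interface is an assumption for illustration, and multi-GPU/distributed wrapping is omitted.

```python
import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()  # scales the loss to avoid fp16 gradient underflow

def amp_train_step(model, batch, optimizer):
    optimizer.zero_grad()
    with autocast():                  # forward pass runs in mixed precision
        loss = model(**batch).loss    # assumes a HuggingFace-style output object
    scaler.scale(loss).backward()     # backward on the scaled loss
    scaler.step(optimizer)            # unscales gradients, then steps the optimizer
    scaler.update()                   # adjusts the loss scale for the next step
    return loss.item()
```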
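The experiment-setup row reports AdamW with β1 = 0.9, β2 = 0.999, ϵ = 1e-6, weight decay 0.01, and a learning rate warmed up over the first 10K steps to a peak of 1e-4 and then linearly decayed. The following minimal PyTorch sketch matches those reported values; the `build_optimizer` helper and the end point of the linear decay (reaching 0 at `total_steps`) are assumptions.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(model, total_steps, warmup_steps=10_000, peak_lr=1e-4):
    # Optimizer hyperparameters as reported in the paper's experiment setup.
    optimizer = AdamW(model.parameters(), lr=peak_lr,
                      betas=(0.9, 0.999), eps=1e-6, weight_decay=0.01)

    def lr_lambda(step):
        if step < warmup_steps:
            # linear warm-up from 0 to peak_lr over the first 10K steps
            return step / max(1, warmup_steps)
        # linear decay from peak_lr toward 0 afterwards (end point assumed)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    scheduler = LambdaLR(optimizer, lr_lambda)  # call scheduler.step() every training step
    return optimizer, scheduler
```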