Task-Robust Pre-Training for Worst-Case Downstream Adaptation

Authors: Jianghui Wang, Yang Chen, Xingyu Xie, Cong Fang, Zhouchen Lin

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In the experiments, we show both on large-scale natural language processing and computer vision datasets our method increases the metrics on worst-case downstream tasks. In this section, we subject our methods to rigorous testing through two experiments, each encompassing tasks germane to the fields of Natural Language Processing (NLP) and Computer Vision (CV)."
Researcher Affiliation | Academia | "Jianghui Wang, Yang Chen, Xingyu Xie, Cong Fang, Zhouchen Lin; School of Intelligence Science and Technology, Peking University; jianghuiwang.ai@gmail.com, {yangchen, xyxie, fangcong, zlin}@pku.edu.cn"
Pseudocode | Yes | "Algorithm 1: Softmax Weighted Gradient Descent". A hedged code sketch of this style of update appears after the table.
Open Source Code | No | The paper does not contain any explicit statement about making the source code for the described methodology publicly available, nor does it provide a link to a code repository.
Open Datasets | Yes | "Following previous work [18, 48, 40], we evaluate our pre-trained models on downstream tasks using the GLUE [64] benchmarks. We select two datasets with different scales, ImageNet-1K [53] and ImageNet-S50 [25], to conduct unsupervised training upstream to see whether the minimax pre-training method can help the downstream tasks with poor performance. The classification task is evaluated on the validation part of the original dataset, while the semantic segmentation and depth estimation tasks are validated on the NYUv2 dataset [56] by fine-tuning."
Dataset Splits | Yes | "The classification task is evaluated on the validation part of the original dataset" and "We fine-tune our models for 100 epochs and report the top-1 validation accuracy."
Hardware Specification | Yes | "We set the batch size to 2048 and trained the models using 8 A100 GPUs with automatic mixed precision enabled." A sketch of an equivalent mixed-precision training step follows the table.
Software Dependencies | No | The paper mentions using AdamW as an optimizer and the Natural Language Toolkit (NLTK), but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | "The model is trained with AdamW [42] by setting β1 = 0.9, β2 = 0.999, ϵ = 1e-6, and L2 weight decay of 0.01. The learning rate is warmed up over the first 10K steps to a peak value of 1e-4, then linearly decayed." See also "Hyperparameters for pre-training Part-of-Speech Mask BERT" (Table 4) and "Hyperparameters for pre-training Multi-Modal Mask MAE" (Table 6). A hedged sketch of this optimizer and schedule appears below.
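
The paper's Algorithm 1 is only named in the extracted text, so the following is a minimal PyTorch sketch of what a softmax-weighted gradient step generally looks like: per-task losses are mixed with softmax weights so the worst-performing tasks dominate the update direction, a smooth stand-in for a worst-case (minimax) objective. The temperature `tau`, the toy encoder and heads, and all variable names are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of a softmax-weighted gradient step for task-robust pre-training.
# Assumptions (not from the paper): weights = softmax(loss / tau), toy linear
# encoder/heads, synthetic regression tasks.
import torch

def softmax_weighted_step(task_losses, optimizer, tau=1.0):
    """One update in which per-task gradients are mixed with weights
    softmax(loss / tau), so poorly performing tasks drive the descent."""
    losses = torch.stack(task_losses)                      # shape: [num_tasks]
    weights = torch.softmax(losses.detach() / tau, dim=0)  # no gradient through the weights
    weighted_loss = (weights * losses).sum()
    optimizer.zero_grad()
    weighted_loss.backward()
    optimizer.step()
    return weights

# Toy usage: one shared encoder, two synthetic regression "tasks".
encoder = torch.nn.Linear(16, 4)
heads = [torch.nn.Linear(4, 1) for _ in range(2)]
params = list(encoder.parameters()) + [p for h in heads for p in h.parameters()]
optimizer = torch.optim.SGD(params, lr=1e-2)

x = torch.randn(32, 16)
targets = [torch.randn(32, 1) for _ in range(2)]
task_losses = [torch.nn.functional.mse_loss(h(encoder(x)), t)
               for h, t in zip(heads, targets)]
print(softmax_weighted_step(task_losses, optimizer, tau=0.5))
```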
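The hardware row reports a global batch size of 2048 on 8 A100 GPUs with automatic mixed precision. The snippet below is a minimal single-process sketch of one AMP training step with `torch.cuda.amp.GradScaler`; the per-device batch of 256 (2048 split over 8 GPUs), the dummy model, and the omission of the data-parallel wrapper are assumptions for illustration.

```python
# Hedged single-process sketch of an automatic mixed precision (AMP) step,
# mirroring the reported setup (global batch 2048 across 8 GPUs, AMP enabled).
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16

model = torch.nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

per_device_batch = 256  # assumed split: 2048 global batch / 8 GPUs
x = torch.randn(per_device_batch, 1024, device=device)
y = torch.randn(per_device_batch, 1024, device=device)

optimizer.zero_grad()
with torch.autocast(device_type=device, dtype=amp_dtype, enabled=(device == "cuda")):
    loss = torch.nn.functional.mse_loss(model(x), y)
scaler.scale(loss).backward()   # scale the loss to avoid fp16 gradient underflow
scaler.step(optimizer)          # unscales gradients, then steps the optimizer
scaler.update()
```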
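For the experiment-setup row, this is a minimal sketch of the reported recipe: AdamW with β1 = 0.9, β2 = 0.999, ε = 1e-6, weight decay 0.01, the learning rate warmed up linearly to a peak of 1e-4 over the first 10K steps and then decayed linearly. The total step count and the dummy model are assumed values, not taken from the paper.

```python
# Hedged sketch of the reported optimizer and learning-rate schedule.
# Assumptions: total_steps = 100K and the toy model are illustrative only.
import torch
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(768, 768)
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-4, betas=(0.9, 0.999), eps=1e-6, weight_decay=0.01
)

warmup_steps, total_steps = 10_000, 100_000  # total_steps is an assumed value

def lr_lambda(step):
    # Linear warmup to the peak LR over the first 10K steps, then linear decay to zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda)

# Usage: call scheduler.step() once per optimizer step.
for _ in range(3):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 768)).pow(2).mean()
    loss.backward()
    optimizer.step()
    scheduler.step()
```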