Task-Robust Pre-Training for Worst-Case Downstream Adaptation

Authors: Jianghui Wang, Yang Chen, Xingyu Xie, Cong Fang, Zhouchen Lin

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In the experiments, we show both on large-scale natural language processing and computer vision datasets our method increases the metrics on worst-case downstream tasks. In this section, we subject our methods to rigorous testing through two experiments, each encompassing tasks germane to the fields of Natural Language Processing (NLP) and Computer Vision (CV)."
Researcher Affiliation | Academia | "Jianghui Wang, Yang Chen, Xingyu Xie, Cong Fang, Zhouchen Lin; School of Intelligence Science and Technology, Peking University; jianghuiwang.ai@gmail.com, {yangchen, xyxie, fangcong, zlin}@pku.edu.cn"
Pseudocode | Yes | "Algorithm 1: Softmax Weighted Gradient Descent". A hedged code sketch of this style of update appears after the table.
Open Source Code | No | The paper does not contain any explicit statement about making the source code for the described methodology publicly available, nor does it provide a link to a code repository.
Open Datasets | Yes | "Following previous work [18, 48, 40], we evaluate our pre-trained models on downstream tasks using the GLUE [64] benchmarks. We select two datasets with different scales, ImageNet-1K [53] and ImageNet-S50 [25], to conduct unsupervised training upstream to see whether the minimax pre-training method can help the downstream tasks with poor performance. The classification task is evaluated on the validation part of the original dataset, while the semantic segmentation and depth estimation tasks are validated on the NYUv2 dataset [56] by fine-tuning."
Dataset Splits | Yes | "The classification task is evaluated on the validation part of the original dataset" and "We fine-tune our models for 100 epochs and report the top-1 validation accuracy."
Hardware Specification | Yes | "We set the batch size to 2048 and trained the models using 8 A100 GPUs with automatic mixed precision enabled." A sketch of an equivalent mixed-precision training step follows the table.
Software Dependencies | No | The paper mentions using AdamW as an optimizer and the Natural Language Toolkit (NLTK), but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | "The model is trained with AdamW [42] by setting β1 = 0.9, β2 = 0.999, ϵ = 1e-6, and L2 weight decay of 0.01. The learning rate is warmed up over the first 10K steps to a peak value of 1e-4, then linearly decayed." See also "Hyperparameters for pre-training Part-of-Speech Mask BERT" (Table 4) and "Hyperparameters for pre-training Multi-Modal Mask MAE" (Table 6). A hedged sketch of this optimizer and schedule appears below.
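
The paper's Algorithm 1 is only named in the extracted text, so the following is a minimal PyTorch sketch of what a softmax-weighted gradient step generally looks like: per-task losses are mixed with softmax weights so the worst-performing tasks dominate the update direction, a smooth stand-in for a worst-case (minimax) objective. The temperature `tau`, the toy encoder and heads, and all variable names are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of a softmax-weighted gradient step for task-robust pre-training.
# Assumptions (not from the paper): weights = softmax(loss / tau), toy linear
# encoder/heads, synthetic regression tasks.
import torch

def softmax_weighted_step(task_losses, optimizer, tau=1.0):
    """One update in which per-task gradients are mixed with weights
    softmax(loss / tau), so poorly performing tasks drive the descent."""
    losses = torch.stack(task_losses)                      # shape: [num_tasks]
    weights = torch.softmax(losses.detach() / tau, dim=0)  # no gradient through the weights
    weighted_loss = (weights * losses).sum()
    optimizer.zero_grad()
    weighted_loss.backward()
    optimizer.step()
    return weights

# Toy usage: one shared encoder, two synthetic regression "tasks".
encoder = torch.nn.Linear(16, 4)
heads = [torch.nn.Linear(4, 1) for _ in range(2)]
params = list(encoder.parameters()) + [p for h in heads for p in h.parameters()]
optimizer = torch.optim.SGD(params, lr=1e-2)

x = torch.randn(32, 16)
targets = [torch.randn(32, 1) for _ in range(2)]
task_losses = [torch.nn.functional.mse_loss(h(encoder(x)), t)
               for h, t in zip(heads, targets)]
print(softmax_weighted_step(task_losses, optimizer, tau=0.5))
```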
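The hardware row reports a global batch size of 2048 on 8 A100 GPUs with automatic mixed precision. The snippet below is a minimal single-process sketch of one AMP training step with `torch.cuda.amp.GradScaler`; the per-device batch of 256 (2048 split over 8 GPUs), the dummy model, and the omission of the data-parallel wrapper are assumptions for illustration.

```python
# Hedged single-process sketch of an automatic mixed precision (AMP) step,
# mirroring the reported setup (global batch 2048 across 8 GPUs, AMP enabled).
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16

model = torch.nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

per_device_batch = 256  # assumed split: 2048 global batch / 8 GPUs
x = torch.randn(per_device_batch, 1024, device=device)
y = torch.randn(per_device_batch, 1024, device=device)

optimizer.zero_grad()
with torch.autocast(device_type=device, dtype=amp_dtype, enabled=(device == "cuda")):
    loss = torch.nn.functional.mse_loss(model(x), y)
scaler.scale(loss).backward()   # scale the loss to avoid fp16 gradient underflow
scaler.step(optimizer)          # unscales gradients, then steps the optimizer
scaler.update()
```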
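For the experiment-setup row, this is a minimal sketch of the reported recipe: AdamW with β1 = 0.9, β2 = 0.999, ε = 1e-6, weight decay 0.01, the learning rate warmed up linearly to a peak of 1e-4 over the first 10K steps and then decayed linearly. The total step count and the dummy model are assumed values, not taken from the paper.

```python
# Hedged sketch of the reported optimizer and learning-rate schedule.
# Assumptions: total_steps = 100K and the toy model are illustrative only.
import torch
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(768, 768)
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-4, betas=(0.9, 0.999), eps=1e-6, weight_decay=0.01
)

warmup_steps, total_steps = 10_000, 100_000  # total_steps is an assumed value

def lr_lambda(step):
    # Linear warmup to the peak LR over the first 10K steps, then linear decay to zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda)

# Usage: call scheduler.step() once per optimizer step.
for _ in range(3):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 768)).pow(2).mean()
    loss.backward()
    optimizer.step()
    scheduler.step()
```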