TAIA: Large Language Models are Out-of-Distribution Data Learners
Authors: Shuyang Jiang, Yusheng Liao, Ya Zhang, Yanfeng Wang, Yu Wang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate TAIA using two general instruction-tuning datasets and evaluate it on seven downstream tasks involving math, reasoning, and knowledge understanding across LLMs of different parameter sizes and fine-tuning techniques. Our comprehensive experiments demonstrate that TAIA achieves superior improvements compared to both the fully fine-tuned model and the base model in most scenarios, with significant performance gains. |
| Researcher Affiliation | Academia | Shuyang Jiang (Fudan University; Shanghai Artificial Intelligence Laboratory) shuyangjiang23@m.fudan.edu.cn; Yusheng Liao (Cooperative Medianet Innovation Center, Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory) liao20160907@sjtu.edu.cn; Ya Zhang, Yanfeng Wang, Yu Wang (Cooperative Medianet Innovation Center, Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory) {ya_zhang, wangyanfeng622, yuwangsjtu}@sjtu.edu.cn |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/pixas/TAIA_LLM. |
| Open Datasets | Yes | We choose two instruction-tuning corpora to further demonstrate the high generalization of TAIA under PEFT methods. We choose Alpaca-GPT4-bilingual, mixed from Alpaca-GPT4 and Alpaca-GPT4-zh [49]. Apart from this, we also adopt CoT-Collection [24], which is a mixture of various tasks presented in the Chain-of-Thought [72] format. Open Maths [64] is a math instruction tuning dataset... Medical Collection is a collection of bilingual medical multiple-choice question answering data... XSum [43] is a dataset for the evaluation of abstractive single-document summarization systems. SQuAD v2.0 [54] is a collection of question-answer pairs derived from Wikipedia articles. |
| Dataset Splits | No | The paper describes the datasets used for training and testing, but it does not explicitly provide train/validation/test splits (e.g., percentages or sample counts) for any individual dataset. |
| Hardware Specification | Yes | All experiments were conducted on 4 NVIDIA A100 GPUs with ZeRO-3 [53] optimization. A hypothetical ZeRO-3 configuration consistent with this description is sketched after the table. |
| Software Dependencies | No | The paper mentions ZeRO-3 optimization but does not provide specific version numbers for software dependencies such as PyTorch, CUDA, or other libraries used in the experiments. |
| Experiment Setup | Yes | We train 1 epoch for each dataset with the maximum context set to 3072 and the batch size set to 128. We set the learning rate to 2e-4 for all runs and adopt LoRA [21] and Mixture-of-LoRA (MoLoRA) [30, 76] as representative PEFT methods. The LoRA rank is set to 16 and the LoRA alpha to 32. In MoLoRA, we set the expert count to 4 and activate 1 during inference for all settings. A minimal configuration sketch reproducing these hyperparameters appears after the table. |
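
The hardware row reports only that training used 4 NVIDIA A100 GPUs with ZeRO-3; no DeepSpeed configuration or library versions are given. The dict below is a minimal, hypothetical ZeRO-3 configuration consistent with that description: the precision choice (bf16) and the micro-batch / accumulation split are assumptions, while the effective batch size of 128 comes from the experiment setup row.

```python
# Hypothetical DeepSpeed ZeRO-3 configuration matching the reported setup
# (4x A100, ZeRO-3, effective batch size 128). bf16 and the micro-batch /
# accumulation split are assumptions, not values stated in the paper.
zero3_config = {
    "zero_optimization": {
        "stage": 3,                       # ZeRO stage 3: shard optimizer states,
                                          # gradients, and parameters across GPUs
        "overlap_comm": True,             # overlap communication with compute
        "contiguous_gradients": True,
    },
    "bf16": {"enabled": True},            # assumed mixed-precision setting
    "train_micro_batch_size_per_gpu": 8,  # 8 per GPU x 4 GPUs x 4 accumulation = 128
    "gradient_accumulation_steps": 4,
}
```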
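
For the experiment setup row, a minimal sketch of the reported hyperparameters using Hugging Face `peft` and `transformers` is given below. Only the rank (16), alpha (32), learning rate (2e-4), context length (3072), epoch count (1), and effective batch size (128) come from the paper; the model name, target modules, dropout, and precision flag are illustrative assumptions, and the MoLoRA variant (4 experts, 1 active at inference) is omitted because it is not part of standard `peft`.

```python
# Sketch of the reported LoRA fine-tuning hyperparameters. Model name, target
# modules, dropout, and bf16 are placeholders/assumptions for illustration.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

model_name = "meta-llama/Llama-2-7b-hf"   # placeholder; the paper evaluates several base LLMs
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=3072)  # max context 3072

lora_config = LoraConfig(
    r=16,                                  # LoRA rank (paper)
    lora_alpha=32,                         # LoRA alpha (paper)
    lora_dropout=0.05,                     # assumption; not reported
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="taia_lora_run",            # placeholder output path
    num_train_epochs=1,                    # paper: 1 epoch per dataset
    learning_rate=2e-4,                    # paper
    per_device_train_batch_size=8,         # 8 x 4 GPUs x 4 accumulation steps = 128
    gradient_accumulation_steps=4,
    bf16=True,                             # assumption
    deepspeed="zero3.json",                # placeholder path to a ZeRO-3 config like the dict above
)
```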