Alignment at Pre-training! Towards Native Alignment for Arabic LLMs
Authors: Juhao Liang, Zhenyang Cai, Jianqing Zhu, Huang Huang, Kewei Zong, Bang An, Mosen Alharthi, Juncai He, Lian Zhang, Haizhou Li, Benyou Wang, Jinchao Xu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct comprehensive experiments and ablation studies to evaluate the impact of native alignment on model performance and alignment stability. |
| Researcher Affiliation | Academia | (1) Shenzhen Research Institute of Big Data, Shenzhen, China; (2) The Chinese University of Hong Kong, Shenzhen, China; (3) King Abdullah University of Science and Technology, Thuwal, Saudi Arabia; (4) Shenzhen International Center for Industrial and Applied Mathematics, Shenzhen Research Institute of Big Data |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/FreedomIntelligence/AceGPT-v2 |
| Open Datasets | Yes | For language datasets, we select ArabicText 2022 from BAAI for Arabic, SlimPajama [26] for English, MAP-CC [27] for Chinese, and various other language datasets from Wikipedia [28]. For mathematics and code, we choose Proof-Pile-2 [29]. |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits for the main pre-training data used to train the models (e.g., 100 billion tokens of mixed-source data or 10 billion tokens of native-alignment data). |
| Hardware Specification | No | The paper states that 2048 GPUs were used for data processing and model training, but it does not specify the GPU model, memory, or any other hardware details. |
| Software Dependencies | No | The paper mentions software components and frameworks like Llama-3, Qwen1.5, GPT-4, LLaMA-Factory, and OpenCompass, but does not provide specific version numbers for these or other key software dependencies. |
| Experiment Setup | Yes | The batch size was set to 128 for both instruction tuning and DPO, with epochs set to 3. All other experimental settings followed the default settings in the framework. |
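The Experiment Setup row above reports only a total batch size of 128 and 3 epochs for both instruction tuning and DPO, with everything else left at the framework defaults. The sketch below is a minimal, hedged illustration of how those two reported values could be encoded as Hugging Face `TrainingArguments`; the per-device batch size, gradient accumulation split, and precision flag are assumptions for illustration and are not taken from the paper.

```python
# Hedged sketch: encode the two reported hyperparameters (effective batch size 128,
# 3 epochs) for the SFT and DPO stages. All other values are assumptions, since the
# paper defers remaining settings to the framework (LLaMA-Factory) defaults.
from transformers import TrainingArguments


def make_training_args(stage: str, output_dir: str) -> TrainingArguments:
    """Build TrainingArguments for 'sft' or 'dpo' matching the reported setup."""
    assert stage in {"sft", "dpo"}
    return TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=3,              # reported: epochs set to 3
        per_device_train_batch_size=8,   # assumption: 8 x 16 accumulation = 128 on one device
        gradient_accumulation_steps=16,  # reported: effective batch size of 128
        bf16=True,                       # assumption: not stated in the paper
        logging_steps=10,                # assumption: arbitrary logging interval
    )


sft_args = make_training_args("sft", "outputs/sft")
dpo_args = make_training_args("dpo", "outputs/dpo")
```

On a multi-GPU run the effective batch size also scales with the number of devices, so the per-device and accumulation values above would need to be adjusted to keep the reported total of 128.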