Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Exploring the Translation Mechanism of Large Language Models

Authors: Hongbin Zhang, Kehai Chen, Xuefeng Bai, Xiucheng Li, Yang Xiang, Min Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Comprehensive experiments reveal that translation is predominantly driven by a sparse subset of components: specialized attention heads serve critical roles in extracting source language, translation indicators, and positional features, which are then integrated and processed by specific multi-layer perceptrons (MLPs) into intermediary English-centric latent representations before ultimately yielding the final translation. The significance of these findings is underscored by the empirical demonstration that targeted fine-tuning a minimal parameter subset (< 5%) enhances translation performance while preserving general capabilities.
Researcher Affiliation	Academia	(1) Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China; (2) Peng Cheng Laboratory, Shenzhen, China. EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode	Yes	Algorithm 1 Task Steering Subspace Identification
Open Source Code	Yes	Code is available at this URL.
Open Datasets	Yes	To explore LLM translation mechanisms, we begin with word-level translation, which offers a more tractable, interpretable approach and provides a foundational first step to understanding core translation processes. Taking inspiration from the prompt design and word selection in Wendler et al. (2024), we construct a word translation dataset across five widely used languages (e.g., English (En), Chinese (Zh), Russian (Ru), German (De), and French (Fr)).
Dataset Splits	Yes	For training, we leverage human-parallel corpora (WMT17–WMT22, Flores-200 (Guzmán et al., 2019)) following Xu et al. (2024a), evaluating translation accuracy on WMT23/24 and general-domain benchmarks (MMLU (Hendrycks et al., 2021), ARC (Clark et al., 2018), SIQA (Sap et al., 2019)).
Hardware Specification	Yes	All experiments are conducted on a cluster of 8 NVIDIA A100 80 GB GPUs.
Software Dependencies	No	The paper mentions using 'Llama2-7B' and 'Llama2-13B' models and refers to 'gradient rescaling method proposed by (Yu et al., 2025)', but does not specify software dependencies like Python, PyTorch, or CUDA versions.
Experiment Setup	Yes	For model fine-tuning, we use Llama2-7B and Llama2-13B with a learning rate of 2e-5, a batch size of 128, and train for 2 epochs. The warm-up ratio is set to 0.02, and weight decay is configured at 0.1.
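
The fine-tuning hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object, which makes it easier to see how the warm-up ratio translates into a step count. The following is a minimal sketch, not the authors' code: the class name `FineTuneConfig`, its methods, and the example dataset size of 64,000 are hypothetical, and the learning rate of 2e-5 is an assumed reading of the reported value.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class FineTuneConfig:
    """Hypothetical container for the hyperparameters quoted in the table."""

    model_name: str
    learning_rate: float = 2e-5  # assumption: read as 2e-5
    batch_size: int = 128
    num_epochs: int = 2
    warmup_ratio: float = 0.02
    weight_decay: float = 0.1

    def total_steps(self, num_examples: int) -> int:
        # One optimizer step per batch; the last partial batch still counts.
        steps_per_epoch = math.ceil(num_examples / self.batch_size)
        return steps_per_epoch * self.num_epochs

    def warmup_steps(self, num_examples: int) -> int:
        # Warm-up length as a fraction of all steps, rounded to the nearest step.
        return round(self.total_steps(num_examples) * self.warmup_ratio)


# The two models named in the row share the same settings.
configs = [FineTuneConfig("Llama2-7B"), FineTuneConfig("Llama2-13B")]

# Illustrative dataset size only; the source does not state the corpus size.
example_size = 64_000
for cfg in configs:
    print(cfg.model_name, cfg.total_steps(example_size), cfg.warmup_steps(example_size))
```

With the illustrative 64,000 examples, each epoch has 500 steps, so training runs for 1,000 steps, of which 2% (20 steps) are warm-up. The actual step counts depend on the real training-set size, which this entry does not report.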