Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Exploring the Translation Mechanism of Large Language Models
Authors: Hongbin Zhang, Kehai Chen, Xuefeng Bai, Xiucheng Li, Yang Xiang, Min Zhang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments reveal that translation is predominantly driven by a sparse subset of components: specialized attention heads serve critical roles in extracting source language, translation indicators, and positional features, which are then integrated and processed by specific multi-layer perceptrons (MLPs) into intermediary English-centric latent representations before ultimately yielding the final translation. The significance of these findings is underscored by the empirical demonstration that targeted fine-tuning a minimal parameter subset (< 5%) enhances translation performance while preserving general capabilities. |
| Researcher Affiliation | Academia | 1Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China 2Peng Cheng Laboratory, Shenzhen, China EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1 Task Steering Subspace Identification |
| Open Source Code | Yes | Code is available at this URL. |
| Open Datasets | Yes | To explore LLM translation mechanisms, we begin with word-level translation, which offers a more tractable, interpretable approach and provides a foundational first step to understanding core translation processes. Taking inspiration from the prompt design and word selection in Wendler et al. (2024), we construct a word translation dataset across five widely used languages (e.g., English (En), Chinese (Zh), Russian (Ru), German (De), and French (Fr)). |
| Dataset Splits | Yes | For training, we leverage human-parallel corpora (WMT17-WMT22, Flores-200 (Guzmán et al., 2019)) following Xu et al. (2024a), evaluating translation accuracy on WMT23/24 and general-domain benchmarks (MMLU (Hendrycks et al., 2021), ARC (Clark et al., 2018), SIQA (Sap et al., 2019)). |
| Hardware Specification | Yes | All experiments are conducted on a cluster of 8 NVIDIA A100 80 GB GPUs. |
| Software Dependencies | No | The paper mentions using 'Llama2-7B' and 'Llama2-13B' models and refers to 'gradient rescaling method proposed by (Yu et al., 2025)', but does not specify software dependencies like Python, PyTorch, or CUDA versions. |
| Experiment Setup | Yes | For model fine-tuning, we use Llama2-7B and Llama2-13B with a learning rate of 2e-5, a batch size of 128, and train for 2 epochs. The warm-up ratio is set to 0.02, and weight decay is configured at 0.1. |
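
The fine-tuning hyperparameters reported under Experiment Setup can be collected into a small configuration object, which makes it easy to see how the warm-up ratio translates into a number of warm-up steps. The following is a minimal sketch, not the authors' code: the names `FineTuneConfig` and `schedule_lengths` are hypothetical, the learning rate of 2e-5 is an assumed reading of the reported value, and the rounding of warm-up steps and the example corpus size are illustrative assumptions.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters as reported in the Experiment Setup row (hypothetical container)."""
    learning_rate: float = 2e-5   # assumption: reported value read as 2e-5
    batch_size: int = 128
    num_epochs: int = 2
    warmup_ratio: float = 0.02
    weight_decay: float = 0.1


def schedule_lengths(cfg: FineTuneConfig, num_examples: int) -> tuple[int, int]:
    """Return (total optimizer steps, warm-up steps) for a dataset of num_examples.

    Assumes one optimizer step per batch and that the last partial batch
    is kept; rounding the warm-up step count is also an assumption.
    """
    steps_per_epoch = math.ceil(num_examples / cfg.batch_size)
    total_steps = steps_per_epoch * cfg.num_epochs
    warmup_steps = round(total_steps * cfg.warmup_ratio)
    return total_steps, warmup_steps


if __name__ == "__main__":
    cfg = FineTuneConfig()
    # 12,800 examples is an illustrative size, not a figure from the paper.
    total, warmup = schedule_lengths(cfg, num_examples=12_800)
    print(f"total steps: {total}, warm-up steps: {warmup}")
```

With a warm-up ratio of 0.02, only about one step in fifty is spent ramping up the learning rate, so the exact rounding convention matters mainly for very short runs.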