Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Function-to-Style Guidance of LLMs for Code Translation
Authors: Longhui Zhang, Bin Wang, Jiahao Wang, Xiaofeng Zhao, Min Zhang, Hao Yang, Meishan Zhang, Yu Li, Jing Li, Jun Yu, Min Zhang
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on both our new benchmark and existing datasets demonstrate that our approach significantly improves code translation performance. Notably, our approach enables Qwen-1.5B to outperform prompt-enhanced Qwen-32B and GPT-4 on average across 20 diverse code translation scenarios. |
| Researcher Affiliation | Collaboration | 1Harbin Institute of Technology, Shenzhen, China. 2Huawei Translation Services Center, Beijing, China. 3Zhejiang University, Hangzhou, China. |
| Pseudocode | No | The paper describes its methodology in natural language and block diagrams (Figure 2) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statement about releasing code or a link to a source-code repository for the described methodology. |
| Open Datasets | Yes | Experiments on both our new benchmark and existing datasets demonstrate that our approach significantly improves code translation performance. ... We further evaluate F2STRANS on xCodeEval (Khan et al., 2024), as shown in Table 5. ... The latest data for the CodeNet benchmark comes from 2020 (Puri et al., 2021). |
| Dataset Splits | No | In the function-oriented training, we construct approximately 5,000 code pairs for each translation scenario, such as translating from C++ to Python, with a corresponding scale of 10,000 in the style-oriented training. ... The paper does not provide specific train/test/validation splits for the datasets used in evaluation. |
| Hardware Specification | Yes | All our experiments are carried out on a machine equipped with eight NVIDIA A800-SXM4-80GB GPUs. |
| Software Dependencies | No | The paper mentions using LLMs (Qwen, GPT-4) and general concepts like Instruction Fine-tuning, but does not provide specific version numbers for any software libraries, frameworks, or environments used for implementation or experimentation (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | In the function-oriented guidance, we set the maximum algorithmic consistency label K in Eq. 1 to 5. In the style-oriented guidance, we set both the numbers of positive translations T⁺ and negative translations T⁻, namely m and n, to 10, with the value of α in negative translation collection construction set to 0.8 and the trade-off hyperparameter β in Eq. 5 fixed at 0.6. ... Throughout both training stages, we maintain consistent hyperparameters, employing 2 epochs and a learning rate of 1 × 10⁻⁵. During inference, we set the temperature of the LLMs to 0.7. |
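
For anyone attempting a reproduction, the values quoted in the Experiment Setup and Dataset Splits rows can be gathered into one configuration object. The following is only a minimal sketch: the class name and every field name are hypothetical (the paper releases no code), and the learning rate is read as 1e-5. Settings the paper does not report, such as batch size, optimizer, and library versions, are deliberately left out.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class F2STransSetup:
    """Reported hyperparameters; all field names are hypothetical."""

    # Function-oriented guidance
    max_consistency_label_k: int = 5      # K in Eq. 1
    function_pairs_per_scenario: int = 5_000   # "approximately 5,000"

    # Style-oriented guidance
    num_positive_m: int = 10              # m, number of positive translations
    num_negative_n: int = 10              # n, number of negative translations
    alpha: float = 0.8                    # negative-collection construction
    beta: float = 0.6                     # trade-off hyperparameter in Eq. 5
    style_pairs_per_scenario: int = 10_000     # reported scale, approximate

    # Shared across both training stages
    epochs: int = 2
    learning_rate: float = 1e-5           # assumed reading of the reported value

    # Inference
    temperature: float = 0.7


SETUP = F2STransSetup()

if __name__ == "__main__":
    # Print the configuration so it can be logged alongside a run.
    for name, value in asdict(SETUP).items():
        print(f"{name} = {value}")
```

Freezing the dataclass keeps the reported values from being changed by accident during a run; a differing setting is better expressed as a new instance, for example `F2STransSetup(temperature=0.2)`.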