Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Multilingual LLMs Inherently Reward In-Language Time-Sensitive Semantic Alignment for Low-Resource Languages

Authors: Ashutosh Bajpai, Tanmoy Chakraborty

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our empirical evidence underscores the superior performance of CLiTSSA compared to established baselines across three languages (Romanian, German, and French), encompassing three temporal tasks and including a diverse set of four contemporaneous LLMs. This marks a significant step forward in addressing resource disparity in the context of temporal reasoning across languages.
Researcher Affiliation	Collaboration	Ashutosh Bajpai (1, 2), Tanmoy Chakraborty (1); (1) Indian Institute of Technology Delhi, India; (2) Wipro Research, India; EMAIL, EMAIL
Pseudocode	No	The paper describes methods and procedures in narrative text, without presenting any structured pseudocode or algorithm blocks.
Open Source Code	Yes	Source code and dataset are available at https://github.com/abiitd/clitssa.
Open Datasets	Yes	Source code and dataset are available at https://github.com/abiitd/clitssa.
Dataset Splits	Yes	Table 2: Dataset statistics for mTEMPREASON (Train / Dev / Test). Time Range: 1014-2022 / 634-2023 / 998-2023. L1: 400,000 / 4,000 / 4,000. L2: 16,017 / 5,521 / 5,397. L3: 13,014 / 4,437 / 4,426.
Hardware Specification	No	The paper mentions various LLMs used (LLaMA3-8B, Mistral-v1, Vicuna-7b-v1.5, Bloomz-7b1) but does not provide any specific details about the hardware (GPUs, CPUs, memory, etc.) on which these models were run or fine-tuned.
Software Dependencies	No	The paper mentions using the T5 model, multilingual Sentence-BERT, and distiluse-base-multilingual-cased-v1 as foundational models, but does not specify their version numbers or other software dependencies with versions.
Experiment Setup	Yes	A three-shot ICL approach is used throughout the experimental setting, demonstrating superior outcomes compared to both one-shot and two-shot configurations. The value of h and w is set empirically at 30 and 10, respectively. To fine-tune the CLiTSSA retriever model, the distiluse-base-multilingual-cased-v1 serves as the foundational model. This method is systematically applied to each low-resource language across temporal tasks L1, L2 and L3, to ensure optimum performance. Additionally, an integrated CLiTSSA retriever is fine-tuned across languages and temporal tasks. The Train and Dev datasets from mTEMPREASON are used to construct the parallel corpus to fine-tune the CLiTSSA retriever, with a separate held-out test set employed to benchmark all outcomes. We use word level F1 scores and exact match (EM) standards to quantify the LLM's responses. Please refer to the technical appendix for ablations on few-shots, parameters h & w, along with hyperparameters in detail.
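
The Experiment Setup entry names word-level F1 and exact match (EM) as the evaluation metrics but does not spell out how they are computed. The following is a minimal sketch of one common way to compute them, assuming lowercasing and whitespace tokenization; the function names (`tokenize`, `exact_match`, `word_f1`) and the normalization are illustrative assumptions, not taken from the paper or its repository.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase and split on whitespace; the paper's own
    # normalization is not described in this entry.
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> float:
    # 1.0 when the normalized token sequences are identical, else 0.0.
    return float(tokenize(prediction) == tokenize(reference))


def word_f1(prediction: str, reference: str) -> float:
    # Harmonic mean of word-level precision and recall over shared tokens.
    pred_tokens = tokenize(prediction)
    ref_tokens = tokenize(reference)
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared word at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, a prediction that contains the reference plus extra words scores 0 on EM but still receives partial credit on F1, which is why the two metrics are typically reported together.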