Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

CALM: Culturally Self-Aware Language Models

Authors: Lingzhi Shen, Xiaohao Cai, Yunfei Long, Imran Razzak, Guanming Chen, Shoaib Jameel

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments conducted on multiple cross-cultural benchmark datasets demonstrate that CALM consistently outperforms state-of-the-art methods.
Researcher Affiliation	Academia	University of Southampton, Southampton, United Kingdom; Queen Mary University of London, London, United Kingdom; Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates
Pseudocode	Yes	Algorithm 1: Pseudocode of CALM
Open Source Code	Yes	The source code is available at https://github.com/slz0925/CALM.
Open Datasets	Yes	Tasks: Following prior survey works [46, 47], we categorize cultural awareness evaluation into two domains: (i) knowledge-oriented, focusing on culturally grounded commonsense reasoning and value reasoning; and (ii) toxicity-sensitive, targeting the detection of culturally harmful content such as hate speech and social bias. Datasets: For commonsense reasoning, we adopt CultureAtlas [48], a fine-grained benchmark spanning over 2,500 ethnolinguistic groups, 193 countries, and 10,000 cities, containing cultural statements labeled as true or false across domains such as festivals, marriage, clothing, food, education, and social behaviors. It further distinguishes between general facts and context-specific assertions (e.g., age, gender, religion), enabling nuanced assessment across different resource levels. For value reasoning, we use the multilingual UniVaR dataset [49], comprising approximately 1 million QA pairs generated by 15 LLMs in 25 languages, covering 87 human values derived from several foundational theories of cultural values [50, 51]. Paraphrased and translated prompts enhance cultural diversity, while answers are back-translated to English to support language-neutral embeddings. For hate speech detection, we employ CREHate [52], a cross-cultural English benchmark consisting of 1,580 social media posts annotated by raters from five English-speaking regions with distinct cultural backgrounds, namely Australia (AU), the United Kingdom (GB), the United States (US), South Africa (ZA), and Singapore (SG). The dataset integrates re-annotated samples from the SBIC corpus [53] and newly curated Reddit and YouTube posts collected using culture-specific hate-related keywords. For social bias detection, we use the EMGSD dataset [54], which contains 57,201 instances labeled for binary and multi-class classification across six demographic dimensions: gender, race, nationality, religion, profession, and LGBTQ+. EMGSD extends the MGSD dataset [55] with subsets from WinoQueer [56] and SeeGULL [57], using GPT-4 and Mistral for additional sentence generation while maintaining human-validated stereotype annotations.
Dataset Splits	No	The paper mentions test sets for EMGSD and CREHate, but does not explicitly provide percentages or counts for training/validation/test splits within the main text or appendix.
Hardware Specification	Yes	All experiments were conducted on an NVIDIA H200 GPU cluster.
Software Dependencies	No	The paper mentions using Qwen3-32B as the backbone and AdamW for optimization, but does not specify version numbers for these or other software libraries/frameworks.
Experiment Setup	Yes	We use Qwen3-32B [83] as the backbone. Unless otherwise specified, all MLP-based projection heads are implemented as 2-layer networks with hidden size 512, ReLU activation, and a dropout rate of 0.1. ... The contrastive window applies contrastive learning separately to the explicit and latent channels using independently parameterized MLP projection heads. Positive pairs are constructed by sampling semantically similar sentences from the same cultural label (e.g., country or language group), while negative pairs are drawn from culturally mismatched examples within the same batch. We use the NT-Xent loss with temperature τ = 0.07 and batch size 64. ... In the identity alignment pool, Gumbel-softmax clustering (K = 5) is applied to both cultural streams, with temperature cosine-decayed from 1.0 to 0.2. The cluster projection head's hidden size is 256. Multi-head cross-attention (h = 8) is computed from latent to explicit clusters. Each communicative dimension contains four experts, each implemented as a 2-layer transformer block with hidden size 512 and FFN size 2048. To avoid expert collapse and promote balanced usage, we apply sparse dispatch loss with top-k activation (k = 2) and a load balancing regularization term [84]. ... All trainable modules are optimized using AdamW with a learning rate of 3 × 10^-5, weight decay of 0.01, and linear warmup over the first 10% of training steps. Reported results are averaged over ten runs.
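
Several settings quoted in the Experiment Setup row (the NT-Xent temperature, the cosine-decayed Gumbel-softmax temperature, and the linear learning-rate warmup) may be easier to check against a concrete sketch. The following is a minimal, dependency-free illustration, not the authors' implementation. The function names are hypothetical. The NT-Xent function is the standard pairwise form rather than the paper's label-based pair sampling. The learning rate is read as 3e-5, and holding it constant after warmup is an assumption, since the quoted text does not say what happens after the first 10% of steps.

```python
import math


def cosine_decay(step, total_steps, start=1.0, end=0.2):
    """Cosine-decay a value from `start` to `end` over `total_steps`.

    Defaults mirror the reported Gumbel-softmax temperature range (1.0 to 0.2).
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * progress))


def warmup_lr(step, total_steps, base_lr=3e-5, warmup_frac=0.1):
    """Linear warmup over the first `warmup_frac` of training steps.

    Assumption: the rate is held at `base_lr` after warmup; the source
    does not describe the post-warmup schedule.
    """
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr


def _normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def nt_xent(z_a, z_b, tau=0.07):
    """Standard NT-Xent loss where (z_a[i], z_b[i]) are the positive pairs.

    Every other embedding in the combined batch acts as a negative.
    """
    z = [_normalize(v) for v in list(z_a) + list(z_b)]
    n = len(z_a)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the positive partner of i
        logits = [_dot(z[i], z[j]) / tau for j in range(2 * n) if j != i]
        peak = max(logits)  # subtract the max for numerical stability
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - _dot(z[i], z[pos]) / tau
    return total / (2 * n)
```

With these defaults, `cosine_decay(step, total_steps)` starts at 1.0 and ends at 0.2, and `warmup_lr` reaches the base rate once 10% of the steps have elapsed. In a PyTorch training loop the same schedules would typically be expressed through a learning-rate scheduler and a per-step temperature argument, but that wiring is not described in the quoted text.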