Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

X-Fi: A Modality-Invariant Foundation Model for Multimodal Human Sensing

Authors: Xinyan Chen, Jianfei Yang

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments conducted on the MM-Fi and XRF55 datasets, employing six distinct modalities, demonstrate that X-Fi achieves state-of-the-art performance in human pose estimation (HPE) and human activity recognition (HAR) tasks.
Researcher Affiliation | Academia | Xinyan Chen, Jianfei Yang — Nanyang Technological University; Corresponding Author (EMAIL)
Pseudocode | No | The paper describes the model architecture and methods in prose, complemented by architectural diagrams (Figure 2, Figure 7), but does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Codes are available at: https://xyanchen.github.io/X-Fi
Open Datasets | Yes | We train and evaluate our proposed X-Fi on the two largest human sensing multimodal public datasets, MM-Fi (Yang et al., 2024) and XRF55 (Wang et al., 2024).
Dataset Splits | Yes | We follow the S1 Random Split for MM-Fi and the original split setting for XRF55 as outlined in their respective papers.
Hardware Specification | Yes | The training process is performed with a batch size of 16 on an NVIDIA GeForce RTX 4090 GPU.
Software Dependencies | No | The paper mentions using the 'AdamW optimizer' but does not specify version numbers for any software dependencies such as programming languages, libraries, or frameworks.
Experiment Setup | Yes | To standardize feature representations obtained by various modality-specific feature extractors, we apply linear projection units to map each modality feature representation to nf = 32 feature vectors, each with a feature dimension of df = 512. [...] The backbone for both the cross-modal transformer and each modality-specific cross-attention module consists of a 1-layer decoder-only transformer structure with 8 multi-head attention heads and a scaling factor of 0.125. [...] The number of iterations of the X-Fusion block is set to a default of 4 in our experiments. The AdamW optimizer, with an initial learning rate of 1e-3 for HPE and 1e-4 for HAR, is chosen for model optimization. The training process is performed with a batch size of 16 on an NVIDIA GeForce RTX 4090 GPU.
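The quoted setup pins down the attention geometry: splitting df = 512 across 8 heads gives 64 dimensions per head, and 1/sqrt(64) = 0.125, which matches the reported scaling factor. Below is a minimal NumPy sketch of one such multi-head attention pass with those hyperparameters. It is illustrative only, not the authors' X-Fi implementation, and all function and variable names here are hypothetical.

```python
import numpy as np

DF, HEADS = 512, 8            # feature dim and head count from the quoted setup
D_HEAD = DF // HEADS          # 64 dims per head
SCALE = D_HEAD ** -0.5        # 1/sqrt(64) = 0.125, the reported scaling factor

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(q, k, v, wq, wk, wv, wo):
    """q, k, v: (tokens, DF) arrays; wq/wk/wv/wo: (DF, DF) projections."""
    def heads(x, w):
        # project, then split the feature dim into (HEADS, tokens, D_HEAD)
        return (x @ w).reshape(x.shape[0], HEADS, D_HEAD).transpose(1, 0, 2)
    qh, kh, vh = heads(q, wq), heads(k, wk), heads(v, wv)
    attn = softmax(qh @ kh.transpose(0, 2, 1) * SCALE)   # (HEADS, nq, nk)
    out = (attn @ vh).transpose(1, 0, 2).reshape(q.shape[0], DF)
    return out @ wo

rng = np.random.default_rng(0)
nf = 32                       # feature vectors per modality after projection
x = rng.standard_normal((nf, DF))
wq, wk, wv, wo = (rng.standard_normal((DF, DF)) * DF ** -0.5 for _ in range(4))
y = multi_head_attention(x, x, x, wq, wk, wv, wo)  # self-attention over 32 tokens
```

Each modality thus contributes a (32, 512) token block, and the output keeps that shape, which is what allows the cross-modal transformer and the per-modality cross-attention modules to share one backbone configuration.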