Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
X-Fi: A Modality-Invariant Foundation Model for Multimodal Human Sensing
Authors: Xinyan Chen, Jianfei Yang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments conducted on the MM-Fi and XRF55 datasets, employing six distinct modalities, demonstrate that X-Fi achieves state-of-the-art performance in human pose estimation (HPE) and human activity recognition (HAR) tasks. |
| Researcher Affiliation | Academia | Xinyan Chen, Jianfei Yang. Nanyang Technological University. Corresponding Author (EMAIL) |
| Pseudocode | No | The paper describes the model architecture and methods in prose, complemented by architectural diagrams (Figure 2, Figure 7), but does not contain explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Codes are available at: https://xyanchen.github.io/X-Fi |
| Open Datasets | Yes | We train and evaluate our proposed X-Fi on the two largest human sensing multimodal public datasets, MM-Fi (Yang et al., 2024) and XRF55 (Wang et al., 2024) |
| Dataset Splits | Yes | We follow the S1 Random Split for MM-Fi and the original split setting for XRF55 as outlined in their respective papers. |
| Hardware Specification | Yes | The training process is performed with a batch size of 16 on an NVIDIA GeForce RTX 4090 GPU. |
| Software Dependencies | No | The paper mentions using the 'AdamW optimizer' but does not specify version numbers for any software dependencies such as programming languages, libraries, or frameworks. |
| Experiment Setup | Yes | To standardize feature representations obtained by various modality-specific feature extractors, we apply linear projection units to map each modality feature representation to n_f = 32, each with a feature dimension of d_f = 512. [...] The backbone for both the cross-modal transformer and each modality-specific cross-attention module consists of a 1-layer decoder-only transformer structure with 8 multi-head attention heads and a scaling factor of 0.125. [...] The number of iterations on the X-Fusion block is set to a default of 4 in our experiments. The AdamW optimizer, with an initial learning rate of 1e-3 for HPE and 1e-4 for HAR, is chosen for model optimization. The training process is performed with a batch size of 16 on an NVIDIA GeForce RTX 4090 GPU. |
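
For readers who want to reuse the reported setup, the hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration object. The sketch below is illustrative only: the class name `XFiSetup` and its field names are hypothetical and do not come from the authors' code. The learning rates are read as 1e-3 (HPE) and 1e-4 (HAR), which is an assumption about the extracted text. The check that 0.125 equals 1/sqrt(d_f / heads) assumes the reported scaling factor is the usual scaled dot-product attention factor, which the quoted text does not state.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class XFiSetup:
    """Hypothetical container for the hyperparameters quoted in the table."""

    num_features: int = 32       # n_f: features per modality after projection
    feature_dim: int = 512       # d_f: dimension of each feature
    num_layers: int = 1          # 1-layer transformer backbone
    num_heads: int = 8           # multi-head attention heads
    attn_scale: float = 0.125    # reported scaling factor
    fusion_iterations: int = 4   # default X-Fusion block iterations
    batch_size: int = 16
    lr_hpe: float = 1e-3         # assumption: exponent sign inferred from context
    lr_har: float = 1e-4         # assumption: exponent sign inferred from context

    @property
    def head_dim(self) -> int:
        # Per-head dimension if d_f is split evenly across the heads.
        return self.feature_dim // self.num_heads

    def scale_matches_head_dim(self) -> bool:
        # Assumption: the scaling factor is 1 / sqrt(head_dim), as in
        # standard scaled dot-product attention.
        return math.isclose(self.attn_scale, 1.0 / math.sqrt(self.head_dim))


setup = XFiSetup()
print(setup.head_dim, setup.scale_matches_head_dim())
```

Under these assumptions the reported values are mutually consistent: 512 / 8 gives a per-head dimension of 64, and 1/sqrt(64) = 0.125. The optimizer and framework versions are not included because the paper does not specify them.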