Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

X-Fi: A Modality-Invariant Foundation Model for Multimodal Human Sensing

Authors: Xinyan Chen, Jianfei Yang

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments conducted on the MM-Fi and XRF55 datasets, employing six distinct modalities, demonstrate that X-Fi achieves state-of-the-art performance in human pose estimation (HPE) and human activity recognition (HAR) tasks.
Researcher Affiliation | Academia | Xinyan Chen, Jianfei Yang — Nanyang Technological University; Corresponding Author (EMAIL)
Pseudocode | No | The paper describes the model architecture and methods in prose, complemented by architectural diagrams (Figure 2, Figure 7), but does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Codes are available at: https://xyanchen.github.io/X-Fi
Open Datasets | Yes | We train and evaluate our proposed X-Fi on the two largest human sensing multimodal public datasets, MM-Fi (Yang et al., 2024) and XRF55 (Wang et al., 2024).
Dataset Splits | Yes | We follow the S1 Random Split for MM-Fi and the original split setting for XRF55 as outlined in their respective papers.
Hardware Specification | Yes | The training process is performed with a batch size of 16 on an NVIDIA GeForce RTX 4090 GPU.
Software Dependencies | No | The paper mentions using the 'AdamW optimizer' but does not specify version numbers for any software dependencies such as programming languages, libraries, or frameworks.
Experiment Setup | Yes | To standardize feature representations obtained by various modality-specific feature extractors, we apply linear projection units to map each modality feature representation to nf = 32 feature vectors, each with a feature dimension of df = 512. [...] The backbone for both the cross-modal transformer and each modality-specific cross-attention module consists of a 1-layer decoder-only transformer structure with 8 multi-head attention heads and a scaling factor of 0.125. [...] The number of iterations of the X-Fusion block is set to a default of 4 in our experiments. The AdamW optimizer, with an initial learning rate of 1e-3 for HPE and 1e-4 for HAR, is chosen for model optimization. The training process is performed with a batch size of 16 on an NVIDIA GeForce RTX 4090 GPU.
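The quoted setup pins down the attention geometry: splitting df = 512 across 8 heads gives 64 dimensions per head, and 1/sqrt(64) = 0.125, which matches the reported scaling factor. Below is a minimal NumPy sketch of one such multi-head attention pass with those hyperparameters. It is illustrative only, not the authors' X-Fi implementation, and all function and variable names here are hypothetical.

```python
import numpy as np

DF, HEADS = 512, 8            # feature dim and head count from the quoted setup
D_HEAD = DF // HEADS          # 64 dims per head
SCALE = D_HEAD ** -0.5        # 1/sqrt(64) = 0.125, the reported scaling factor

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(q, k, v, wq, wk, wv, wo):
    """q, k, v: (tokens, DF) arrays; wq/wk/wv/wo: (DF, DF) projections."""
    def heads(x, w):
        # project, then split the feature dim into (HEADS, tokens, D_HEAD)
        return (x @ w).reshape(x.shape[0], HEADS, D_HEAD).transpose(1, 0, 2)
    qh, kh, vh = heads(q, wq), heads(k, wk), heads(v, wv)
    attn = softmax(qh @ kh.transpose(0, 2, 1) * SCALE)   # (HEADS, nq, nk)
    out = (attn @ vh).transpose(1, 0, 2).reshape(q.shape[0], DF)
    return out @ wo

rng = np.random.default_rng(0)
nf = 32                       # feature vectors per modality after projection
x = rng.standard_normal((nf, DF))
wq, wk, wv, wo = (rng.standard_normal((DF, DF)) * DF ** -0.5 for _ in range(4))
y = multi_head_attention(x, x, x, wq, wk, wv, wo)  # self-attention over 32 tokens
```

Each modality thus contributes a (32, 512) token block, and the output keeps that shape, which is what allows the cross-modal transformer and the per-modality cross-attention modules to share one backbone configuration.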