Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
X-Fi: A Modality-Invariant Foundation Model for Multimodal Human Sensing
Authors: Xinyan Chen, Jianfei Yang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments conducted on the MM-Fi and XRF55 datasets, employing six distinct modalities, demonstrate that X-Fi achieves state-of-the-art performance in human pose estimation (HPE) and human activity recognition (HAR) tasks. |
| Researcher Affiliation | Academia | Xinyan Chen, Jianfei Yang, Nanyang Technological University; Corresponding Author (EMAIL) |
| Pseudocode | No | The paper describes the model architecture and methods in prose, complemented by architectural diagrams (Figure 2, Figure 7), but does not contain explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Codes are available at: https://xyanchen.github.io/X-Fi |
| Open Datasets | Yes | We train and evaluate our proposed X-Fi on the two largest human sensing multimodal public datasets, MM-Fi (Yang et al., 2024) and XRF55 (Wang et al., 2024) |
| Dataset Splits | Yes | We follow the S1 Random Split for MM-Fi and the original split setting for XRF55 as outlined in their respective papers. |
| Hardware Specification | Yes | The training process is performed with a batch size of 16 on an NVIDIA GeForce RTX 4090 GPU. |
| Software Dependencies | No | The paper mentions using the 'AdamW optimizer' but does not specify version numbers for any software dependencies such as programming languages, libraries, or frameworks. |
| Experiment Setup | Yes | To standardize feature representations obtained by various modality-specific feature extractors, we apply linear projection units to map each modality feature representation to n_f = 32 feature vectors, each with a feature dimension of d_f = 512. [...] The backbone for both the cross-modal transformer and each modality-specific cross-attention module consists of a 1-layer decoder-only transformer structure with 8 multi-head attention heads and a scaling factor of 0.125. [...] The number of iterations on the X-Fusion block is set to a default of 4 in our experiments. The AdamW optimizer, with an initial learning rate of 1e-3 for HPE and 1e-4 for HAR, is chosen for model optimization. The training process is performed with a batch size of 16 on an NVIDIA GeForce RTX 4090 GPU. |
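To make the reported hyperparameters concrete, here is a minimal configuration sketch in plain Python. The dictionary keys are our own naming, not identifiers from the paper's code; the values are taken directly from the experiment-setup quote above. The sanity check at the end notes that the reported scaling factor of 0.125 matches the standard 1/sqrt(d_head) attention scaling when d_f = 512 is split across 8 heads.

```python
import math

# Hyperparameters quoted from the paper's experiment setup.
# Key names are illustrative, not taken from the authors' code.
config = {
    "n_f": 32,            # feature vectors per modality after linear projection
    "d_f": 512,           # feature dimension of each vector
    "n_heads": 8,         # multi-head attention heads
    "attn_scale": 0.125,  # reported attention scaling factor
    "x_fusion_iters": 4,  # default iterations of the X-Fusion block
    "lr_hpe": 1e-3,       # AdamW initial learning rate for pose estimation
    "lr_har": 1e-4,       # AdamW initial learning rate for activity recognition
    "batch_size": 16,
}

# Sanity check: the per-head dimension is d_f / n_heads = 512 / 8 = 64,
# and the standard transformer scaling 1 / sqrt(64) = 0.125 matches the
# scaling factor reported in the paper.
d_head = config["d_f"] // config["n_heads"]
assert math.isclose(config["attn_scale"], 1.0 / math.sqrt(d_head))
print(d_head)  # 64
```

This suggests the reported 0.125 is the conventional scaled-dot-product attention factor rather than an unusual design choice, though the paper does not state this explicitly.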