Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

MoMa: Modulating Mamba for Adapting Image Foundation Models to Video Recognition

Authors: Yuhuan Yang, Chaofan Ma, Zhenjie Mao, Jiangchao Yao, Ya Zhang, Yanfeng Wang

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments on multiple video benchmarks demonstrate the effectiveness of MoMa, achieving superior performance with reduced computational cost. We performed a thorough evaluation of our model across several benchmarks: Section 4.1 for standard CLIP-adapter baselines, Section 4.2 for long video baselines, and Section 4.3 for zero-shot transfer. Section 4.4 provides ablation studies to analyze our model from multiple perspectives.
Researcher Affiliation Academia 1Cooperative Medianet Innovation Center, Shanghai Jiao Tong University 2Shanghai Artificial Intelligence Laboratory 3School of Artificial Intelligence, Shanghai Jiao Tong University. Correspondence to: Jiangchao Yao <EMAIL>, Yanfeng Wang <EMAIL>.
Pseudocode No The paper describes the methodology using prose and mathematical equations (e.g., Equation (1), (2), (3), (4), (6), (7), (8), (9), (10), (11), (12)), but does not include any structured pseudocode or algorithm blocks.
Open Source Code No The paper does not contain an explicit statement about the release of source code or a link to a code repository.
Open Datasets Yes We first evaluate our method on Kinetics-400 (K400) (Kay et al., 2017) and Something-Something V2 (SSv2) (Goyal et al., 2017). ... To further demonstrate the effectiveness of our method in capturing long video sequences, we evaluate our method on Breakfast (Kuehne et al., 2014) and COIN (Tang et al., 2019). ... we use the model trained on K400, and evaluate it on two relatively small video recognition datasets: HMDB51 (Kuehne et al., 2011) and UCF101 (Soomro et al., 2012).
Dataset Splits Yes Kinetics-400 (K400) ... containing 240K training videos and 20K validation videos for 400 human action categories. Something-Something V2 (SSv2) ... contains about 168.9K training videos and 24.7K validation videos for 174 classes. ... Following VideoMamba's (Li et al., 2024b) setting, we further PEFT our models trained on K400 from Table 1.
Hardware Specification Yes We use 8 Tesla V100 GPUs and fp16 precision for training.
Software Dependencies No The paper mentions the 'AdamW optimizer' and 'CLIP' as tools/models used, and 'fp16 precision' for training, but it does not specify any software dependencies with their version numbers (e.g., Python 3.x, PyTorch 1.x, CUDA x.x).
Experiment Setup Yes Implementation Details. We use the pre-trained CLIP as our base model. We set the split window size w = 8. For SSM hyper-parameters, we set its hidden state to 16 and hidden dimension to 384, and use a GELU activation layer similar to CLIP. We adopt the same prompt as ActionCLIP (Wang et al., 2021). We use the AdamW optimizer with learning rate 3e-4 and weight decay 0.05. Training a model on the K400 dataset for 30 epochs takes about 12 hours to converge.
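For reproduction attempts, the hyper-parameters quoted above can be collected into a single configuration record. This is a minimal sketch: the dict keys and structure are illustrative (the paper releases no code), and only the values are taken from the Experiment Setup entry.

```python
# Hypothetical training config assembled from the paper's reported
# Implementation Details; key names are assumptions, values are from the text.
train_config = {
    "base_model": "CLIP",        # pre-trained CLIP backbone
    "window_size": 8,            # split window size w
    "ssm_hidden_state": 16,      # SSM hidden state size
    "ssm_hidden_dim": 384,       # SSM hidden dimension
    "activation": "gelu",        # activation, matching CLIP
    "optimizer": "AdamW",
    "lr": 3e-4,
    "weight_decay": 0.05,
    "precision": "fp16",         # trained on 8 Tesla V100 GPUs
    "epochs": 30,                # ~12 hours to converge on K400
}

for key, value in sorted(train_config.items()):
    print(f"{key}: {value}")
```

A dict like this could be passed to whatever training entry point a re-implementation uses; it mainly serves as a checklist of what the paper does and does not pin down (note that dataset-specific settings such as batch size and data augmentation are not reported).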