Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

MoMa: Modulating Mamba for Adapting Image Foundation Models to Video Recognition

Authors: Yuhuan Yang, Chaofan Ma, Zhenjie Mao, Jiangchao Yao, Ya Zhang, Yanfeng Wang

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments on multiple video benchmarks demonstrate the effectiveness of MoMa, achieving superior performance with reduced computational cost. We performed a thorough evaluation of our model across several benchmarks: Section 4.1 for standard CLIP-adapter baselines, Section 4.2 for long video baselines, and Section 4.3 for zero-shot transfer. Section 4.4 provides ablation studies to analyze our model from multiple perspectives.
Researcher Affiliation Academia 1Cooperative Medianet Innovation Center, Shanghai Jiao Tong University 2Shanghai Artificial Intelligence Laboratory 3School of Artificial Intelligence, Shanghai Jiao Tong University. Correspondence to: Jiangchao Yao <EMAIL>, Yanfeng Wang <EMAIL>.
Pseudocode No The paper describes the methodology using prose and mathematical equations (e.g., Equation (1), (2), (3), (4), (6), (7), (8), (9), (10), (11), (12)), but does not include any structured pseudocode or algorithm blocks.
Open Source Code No The paper does not contain an explicit statement about the release of source code or a link to a code repository.
Open Datasets Yes We first evaluate our method on Kinetics-400 (K400) (Kay et al., 2017) and Something-Something V2 (SSv2) (Goyal et al., 2017). ... To further demonstrate the effectiveness of our method in capturing long video sequences, we evaluate our method on Breakfast (Kuehne et al., 2014) and COIN (Tang et al., 2019). ... we use the model trained on K400, and evaluate it on two relatively small video recognition datasets: HMDB51 (Kuehne et al., 2011) and UCF101 (Soomro et al., 2012).
Dataset Splits Yes Kinetics-400 (K400) ... containing 240K training videos and 20K validation videos for 400 human action categories. Something-Something V2 (SSv2) ... contains about 168.9K training videos and 24.7K validation videos for 174 classes. ... Following VideoMamba's (Li et al., 2024b) setting, we further PEFT our models trained on K400 from Table 1.
Hardware Specification Yes We use 8 Tesla V100 GPUs and fp16 precision for training.
Software Dependencies No The paper mentions the 'AdamW optimizer' and 'CLIP' as tools/models used, and 'fp16 precision' for training, but it does not specify any software dependencies with their version numbers (e.g., Python 3.x, PyTorch 1.x, CUDA x.x).
Experiment Setup Yes Implementation Details. We use the pre-trained CLIP as our base model. We set the split window size w = 8. For SSM hyper-parameters, we set its hidden state to 16 and hidden dimension to 384, and use a GELU activation layer similar to CLIP. We adopt the same prompt as ActionCLIP (Wang et al., 2021). We use the AdamW optimizer with learning rate 3e-4 and weight decay 0.05. Training a model on the K400 dataset for 30 epochs takes about 12 hours to converge.
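For reproduction attempts, the hyper-parameters quoted above can be collected into a single configuration record. This is a minimal sketch: the dict keys and structure are illustrative (the paper releases no code), and only the values are taken from the Experiment Setup entry.

```python
# Hypothetical training config assembled from the paper's reported
# Implementation Details; key names are assumptions, values are from the text.
train_config = {
    "base_model": "CLIP",        # pre-trained CLIP backbone
    "window_size": 8,            # split window size w
    "ssm_hidden_state": 16,      # SSM hidden state size
    "ssm_hidden_dim": 384,       # SSM hidden dimension
    "activation": "gelu",        # activation, matching CLIP
    "optimizer": "AdamW",
    "lr": 3e-4,
    "weight_decay": 0.05,
    "precision": "fp16",         # trained on 8 Tesla V100 GPUs
    "epochs": 30,                # ~12 hours to converge on K400
}

for key, value in sorted(train_config.items()):
    print(f"{key}: {value}")
```

A dict like this could be passed to whatever training entry point a re-implementation uses; it mainly serves as a checklist of what the paper does and does not pin down (note that dataset-specific settings such as batch size and data augmentation are not reported).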