Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Bisecle: Binding and Separation in Continual Learning for Video Language Understanding

Authors: Yue Tan, Xiaoqian Hu, Hao Xue, Celso de Melo, Flora D. Salim

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We perform a thorough evaluation of the proposed Bisecle, demonstrating its ability to mitigate forgetting and enhance cross-task generalization on several Video QA benchmarks. Extensive Experiments. We conduct extensive experiments to validate the effectiveness of Bisecle, demonstrating significant improvements in mitigating forgetting and enhancing cross-task generalization across three Video QA benchmarks.
Researcher Affiliation	Collaboration	Yue Tan School of Computer Science University of New South Wales Sydney, Australia EMAIL Celso De Melo DEVCOM Army Research Laboratory USA EMAIL
Pseudocode	No	The paper describes the methodology in prose and mathematical equations. There are no explicit sections or figures labeled "Pseudocode" or "Algorithm".
Open Source Code	Yes	Anonymized code and preprocessed data are publicly available. Original datasets are cited in Section 4.
Open Datasets	Yes	Datasets. We conduct experiments on three Video QA datasets, i.e., NExT-QA [54], DramaQA [55], and STAR [56].
Dataset Splits	Yes	For NExT-QA, we split questions into eight task types (e.g., causal why/how, temporal what/when, and descriptive where/how many). Following prior work [9], we adopt the task order <TP, CW, DC, TC, DL, DO, TN, CH>. For DramaQA, we partition questions into five types and use the order with maximum forgetting, i.e., <What, Who, Where, How, Why>. For STAR, we follow its reasoning tasks <Interaction, Sequence, Prediction, Feasibility> to evaluate situational understanding in the continual learning scenarios. More details are in Appendix A.1.1.
Hardware Specification	Yes	We conduct all experiments on two NVIDIA H100 GPUs. Experiments are conducted on two NVIDIA H100 GPUs (94GB of memory per GPU). The GPU-hours for training is around 500 in total.
Software Dependencies	No	The paper mentions software components like "LLaMA-Adapter", "LLaMA-2-7B", "ViT-L/14", and "AdamW optimizer", but does not provide specific version numbers for these.
Experiment Setup	Yes	Implementation Details. We use LLaMA-Adapter [11] as our backbone model, following [9]. We use the pre-trained LLaMA-2-7B [59] as the LLM and ViT-L/14 [60, 61] as the visual encoder, both of which are fixed during the continual learning process. All models are trained for five epochs with a batch size of 32 on all datasets. The number of adapter layers is set to 32, the adapter length is 10, and the weight decay is 0.14. We conduct all experiments on two NVIDIA H100 GPUs. Detailed experimental settings can be found in Appendix A. Appendix A.1.4: We use dataset-specific batch sizes together with AdamW across all tasks. In particular, for NExT-QA we set the batch size to 32, for DramaQA to 4, and for STAR to 16. All experiments employ the AdamW optimizer with a base learning rate of 0.09. Weight decay is 0.14 for NExT-QA and 0.10 for both DramaQA and STAR. Video inputs consist of 10 frames resized to 224 × 224, and token sequences are truncated or padded to 128 tokens for NExT-QA, 280 for DramaQA, and 150 for STAR. We train each model for 5 epochs (with 2 warm-up epochs) and fix the random seed to 0 for all tasks. Experiments are conducted on two NVIDIA H100 GPUs (94GB of memory per GPU). The GPU-hours for training is around 500 in total. Table 9: Training Details of Contrastive Learning. It specifies # Task Types, Task Type Embedding Size, Negative Temperature, and Contrastive Loss Weight for each dataset.
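
The per-dataset settings quoted in the Dataset Splits and Experiment Setup rows can be collected into a single configuration object, which may help when attempting a reproduction. The following is a minimal sketch, not the authors' code: the class name `TrainConfig`, the dictionary `CONFIGS`, and all field names are hypothetical, and only the numeric values and task orders are taken from the quoted text. The contrastive-learning settings from Table 9 are omitted because their values are not given here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the hyperparameters quoted above."""
    dataset: str
    batch_size: int
    weight_decay: float
    max_tokens: int               # sequences are truncated or padded to this length
    task_order: Tuple[str, ...]   # continual-learning task sequence
    # Settings reported as shared across all datasets.
    base_lr: float = 0.09         # AdamW base learning rate, as quoted
    epochs: int = 5
    warmup_epochs: int = 2
    num_frames: int = 10
    image_size: int = 224         # frames resized to 224 x 224
    adapter_layers: int = 32
    adapter_len: int = 10
    seed: int = 0


CONFIGS = {
    "NExT-QA": TrainConfig(
        dataset="NExT-QA", batch_size=32, weight_decay=0.14, max_tokens=128,
        task_order=("TP", "CW", "DC", "TC", "DL", "DO", "TN", "CH"),
    ),
    "DramaQA": TrainConfig(
        dataset="DramaQA", batch_size=4, weight_decay=0.10, max_tokens=280,
        task_order=("What", "Who", "Where", "How", "Why"),
    ),
    "STAR": TrainConfig(
        dataset="STAR", batch_size=16, weight_decay=0.10, max_tokens=150,
        task_order=("Interaction", "Sequence", "Prediction", "Feasibility"),
    ),
}

if __name__ == "__main__":
    for name, cfg in CONFIGS.items():
        print(name, cfg.batch_size, cfg.weight_decay, cfg.max_tokens, len(cfg.task_order))
```

Keeping the shared values as defaults and the dataset-specific ones as required fields makes the differences between the three benchmarks explicit; the quoted text states both "a batch size of 32 on all datasets" and dataset-specific batch sizes, and this sketch follows the more detailed Appendix A.1.4 values.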