Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Rethinking Out-of-Distribution Detection and Generalization with Collective Behavior Dynamics

Authors: Zhenbin Wang, Lei Zhang, Wei Huang, Zhao Zhang, Zizhou Wang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Evaluated on six benchmark datasets, our method rivals the SoTA approaches for OOD generalization and can be seamlessly integrated with them to deliver additional gains. The code is available at https://github.com/wongzbb/CBD.
Researcher Affiliation	Academia	Zhenbin Wang1, Lei Zhang1,2, Wei Huang1, Zhao Zhang1, Zizhou Wang3; 1 Machine Intelligence Laboratory, Sichuan University, Chengdu, China; 2 Tianfu Jincheng Laboratory, Chengdu, China; 3 Institute of High Performance Computing, A*STAR, Singapore; EMAIL, EMAIL
Pseudocode	Yes	C Algorithm Protocol. Algo. 1 and Algo. 2 give the algorithmic protocol of our framework, which is easy to implement and applicable to common OOD problems. Algorithm 1: Algorithm pseudocode of CBD-De. Algorithm 2: Algorithm pseudocode of CBD-Gen.
Open Source Code	Yes	The code is available at https://github.com/wongzbb/CBD.
Open Datasets	Yes	For OOD detection, following the latest OpenOOD [94], we evaluated our CBD-De on three InD datasets: CIFAR-10, CIFAR-100 [38], and ImageNet-1k [12]. For CIFAR-10 and CIFAR-100, we employed ResNet-18 and assessed performance against four OOD datasets: MNIST [13], SVHN [59], Textures [10], and Places365 [101]. For ImageNet-1k, we utilized pre-trained ResNet-50 and Vit-b16 models, evaluating against three OOD datasets: iNaturalist [79], Textures, and OpenImage-O [84]. For OOD generalization, we adhere to the DomainBed [23] in our experimental setup and use BCE as classification loss, evaluating our CBD-Gen on five datasets: VLCS [16], PACS [45], OfficeHome [82], TerraInc [5], and DomainNet [60].
Dataset Splits	Yes	For CIFAR-10 and CIFAR-100, ... we use the full training set (50,000 images) for model training. From the official test set, 1,000 samples are reserved as the InD validation set, while the remaining 9,000 images are used as the InD test set. For OOD validation, we follow the OpenOOD protocol by selecting 1,000 images from Tiny ImageNet [41]... For ImageNet-1k, ... we use 45,000 images from the official ImageNet validation set as the InD test set, while the remaining 5,000 images are held out as the InD validation set. To facilitate hyperparameter tuning and avoid information leakage from the test sets, we construct an OOD validation set using 1,763 images from OpenImage-O [84]... OOD Generalization. We adopt the training and evaluation protocol established in DomainBed [23], ensuring consistency in dataset splits, training schedules, and model selection criteria. Specifically, we follow the leave-one-domain-out evaluation strategy, where the model is trained on all but one domain and evaluated on the held-out domain to assess its generalization ability to unseen distributions.
Hardware Specification	Yes	Our experiments are conducted using a combination of NVIDIA GeForce RTX 3090 Ti and NVIDIA A100 GPUs. Specifically, we utilized four 3090 Ti GPUs and one A100 GPU with 40GB of memory. The OOD detection experiments are performed exclusively on the 3090 Ti GPUs, while the OOD generalization experiments were carried out on both the 3090 Ti and A100 GPUs.
Software Dependencies	No	Our formulation of steady-state prediction is originally defined through PDE. To make the solution computationally tractable, we replace the numerical PDE integration with an MLP-based steady-state predictor that directly approximates the stationary solution in a single forward pass. To assess the fidelity of this approximation, we conduct experiments using a numerical PDE solver implemented with DeepXDE [53]... We use the official pretrained weights released by PyTorch to ensure reproducibility.
Experiment Setup	Yes	OOD Detection. We train the distribution prediction head F_θ and the potential prediction head ϕ_θ using the Adam optimizer with a learning rate of 5 × 10⁻⁴ and a batch size of 128, adhering to standard post-hoc configurations. Training proceeds for 50 epochs. OOD Generalization. In our experiments, the wavevector k in the dispersion relation loss L_disp is randomly sampled from a Gaussian distribution per batch, with a scaling factor of 0.5 and a wavenumber N of 250. The loss weights α and β are both set to 0.01. Additional hyperparameters, including learning rate, weight decay, and dropout rate, are tuned following [9] and detailed in Table 3. We adopt early stopping and utilize the Adam optimizer. Consistent with DomainBed, the batch size is set to 32 for all datasets, except for DomainNet, which uses a batch size of 24.
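
The split sizes and the leave-one-domain-out strategy quoted in the Dataset Splits row can be illustrated with a small sketch. This is not the authors' code: the helper names `split_indices` and `leave_one_domain_out`, the random shuffle, and the seed are assumptions made for illustration (OpenOOD and DomainBed ship their own split definitions), and the PACS domain names are supplied here and do not appear in the excerpt above.

```python
import random


def split_indices(n_total, n_val, seed=0):
    """Shuffle range(n_total) and return (validation, test) index lists.

    Hypothetical helper: the shuffle and seed are assumptions, not the
    fixed sample lists used by the benchmark protocols.
    """
    idx = list(range(n_total))
    random.Random(seed).shuffle(idx)
    return idx[:n_val], idx[n_val:]


def leave_one_domain_out(domains):
    """Yield (training_domains, held_out_domain), one pair per domain."""
    for held_out in domains:
        yield [d for d in domains if d != held_out], held_out


# CIFAR-10/100: 1,000 InD validation + 9,000 InD test from the official test set.
cifar_val, cifar_test = split_indices(10_000, 1_000)

# ImageNet-1k: 5,000 InD validation + 45,000 InD test from the official validation set.
imagenet_val, imagenet_test = split_indices(50_000, 5_000)

# PACS domain names are an illustrative assumption, not taken from the excerpt.
pacs_domains = ["art_painting", "cartoon", "photo", "sketch"]
pacs_runs = list(leave_one_domain_out(pacs_domains))

if __name__ == "__main__":
    print(len(cifar_val), len(cifar_test))
    print(len(imagenet_val), len(imagenet_test))
    for train_domains, held_out in pacs_runs:
        print("train on", train_domains, "-> evaluate on", held_out)
```

Each leave-one-domain-out run trains on all but one domain and evaluates on the held-out one, so a dataset with four domains yields four runs.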