Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Neural Genetic Search in Discrete Spaces

Authors: Hyeonah Kim, Sanghyeok Choi, Jiwoo Son, Jinkyoo Park, Changhyun Kwon

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluated our algorithm in three distinct domains where sequential generative models are widely applied: routing problems, red-teaming language models, and de novo molecular design. Our extensive experiments validate that NGS can serve as an effective test-time search method, fulfilling its main purpose.
Researcher Affiliation | Collaboration | 1Mila Quebec AI Institute, 2Université de Montréal, 3KAIST, 4Omelet. Correspondence to: Hyeonah Kim <EMAIL>, Sanghyeok Choi <EMAIL>, Changhyun Kwon <EMAIL>.
Pseudocode | Yes | Algorithm 1 Neural Genetic Search
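Algorithm 1 itself is not reproduced in this report. For orientation only, the sketch below shows a generic genetic-search loop of the kind NGS instantiates; in the paper, crossover and mutation are driven by a learned generative policy, whereas here they are user-supplied placeholder functions. All names, signatures, and defaults are illustrative assumptions, not the authors' implementation.

```python
import random

def genetic_search(init_population, fitness, crossover, mutate,
                   n_iters=10, pop_size=100, n_offspring=100, seed=0):
    """Generic genetic-search skeleton (illustrative only).

    fitness:   callable mapping a solution to a score (higher is better)
    crossover: callable combining two parent solutions into a child
    mutate:    callable applying stochastic perturbation to a child
    """
    rng = random.Random(seed)
    # Keep the top-`pop_size` solutions as the initial population.
    population = sorted(init_population, key=fitness, reverse=True)[:pop_size]
    for _ in range(n_iters):
        offspring = []
        for _ in range(n_offspring):
            parent1, parent2 = rng.sample(population, 2)
            offspring.append(mutate(crossover(parent1, parent2), rng))
        # Elitist survivor selection: parents compete with offspring.
        population = sorted(population + offspring,
                            key=fitness, reverse=True)[:pop_size]
    return max(population, key=fitness)
```

With elitist survivor selection as above, the best fitness in the population never decreases across iterations, which matches the role of NGS as a test-time search that only refines candidates.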
Open Source Code | Yes | The code is available at https://github.com/hyeonahkimm/ngs.
Open Datasets | Yes | We benchmark our model against baselines on real-world TSP and CVRP instances from TSPLib (Reinelt, 1991) and CVRPLib-X (Uchoa et al., 2017). Following prior works (Olivecrona et al., 2017), we adopt an LSTM policy to generate SMILES sequences. Since we have a limited budget, we use 8K calls to train the policy using GFlowNets and 2K to conduct the genetic search with NGS; see details in Appendix B.3.
Dataset Splits | No | The paper mentions that for red-teaming language models, 1,024 attack prompts are generated and evaluated, and for routing problems, models trained on instances of certain sizes are used for evaluation on other size ranges. For molecular design, 8K evaluations are allocated for training. However, explicit training/validation/test split percentages for the datasets are not provided in a reproducible format, nor are citations to standard splits for all datasets used for model training/fine-tuning.
Hardware Specification | Yes | Computing resource. We use a server with two sockets of AMD EPYC 7542 32-Core Processors and a single NVIDIA RTX A6000 GPU for the routing and de novo molecular design experiments. For the red-teaming language models task, we use a cloud server with four NVIDIA A100 HBM2e 80GB PCIe GPUs.
Software Dependencies | No | The paper mentions various software components and models like GPT2, Llama Guard-3-8B, MiniLMv2, Concorde, PyVRP, LKH3, and uses GNNs and LSTMs. However, it does not provide specific version numbers for these software libraries, frameworks (like PyTorch, TensorFlow), or solvers, which are necessary for reproducible software dependencies.
Experiment Setup | Yes | Hyperparameters. For sampling, we use a mini-batch size of 1,000. We use 100 for the number of ants in ACO and the number of offspring in NGS, so the two algorithms run the same number of iterations: 10 when generating 1,000 candidates and 100 when generating 10,000 (the long setting). Note that for TSP and CVRP, we employ local search after solution generation for all baselines, as is common in heatmap-based approaches. We use 100 for both the population size and offspring size of NGS, 0.01 for the stochastic mutation rate μ, and 0.001 for the weight-shifting factor κ in rank-based sampling.
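The rank-based sampling with weight-shifting factor κ mentioned above can be illustrated with a small sketch. The weighting form w_i = 1 / (κ·N + rank_i) is a common choice in this literature and is assumed here purely for illustration; the exact form used by NGS may differ from this.

```python
def rank_based_probs(scores, kappa=0.001):
    """Illustrative rank-based selection probabilities.

    Assumes the common weighting w_i = 1 / (kappa * N + rank_i), where
    rank_i is the 0-indexed rank of solution i (rank 0 = highest score).
    Small kappa sharpens the distribution toward top-ranked solutions.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: -scores[i])  # best first
    ranks = [0] * n
    for rank, i in enumerate(order):
        ranks[i] = rank
    weights = [1.0 / (kappa * n + rank) for rank in ranks]
    total = sum(weights)
    return [w / total for w in weights]
```

Because the weights depend only on ranks, this selection scheme is invariant to the scale of the fitness values, which is convenient when fitness ranges differ widely across tasks (tour length vs. attack success vs. molecular score).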