Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

Data-Efficient Learning with Neural Programs

Authors: Alaia Solko-Breslin, Seewon Choi, Ziyang Li, Neelay Velingker, Rajeev Alur, Mayur Naik, Eric Wong

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluation shows that for the latter benchmarks, ISED has comparable performance to state-of-the-art neurosymbolic frameworks. For the former, we use adaptations of prior work on gradient approximations of black-box components as a baseline, and show that ISED achieves comparable accuracy but in a more data- and sample-efficient manner.
Researcher Affiliation | Academia | Alaia Solko-Breslin, Seewon Choi, Ziyang Li, Neelay Velingker, Rajeev Alur, Mayur Naik, Eric Wong; University of Pennsylvania
Pseudocode | Yes | We present the pseudocode of the algorithm in Algorithm 1 and describe its steps with the hand-written formula task. Algorithm 1: ISED training pipeline
Open Source Code | Yes | Code is available at https://github.com/alaiasolkobreslin/ISED. We release code for all baselines and ISED to reproduce the results reported in the paper.
Open Datasets | Yes | Leaf Classification. In this task, we use a dataset, which we call LEAF-ID, containing leaf images of 11 different plant species [4], containing 330 training samples and 110 testing samples. Scene Recognition. We use a dataset containing scene images from 9 different room types [16], consisting of 830 training examples and 92 testing examples. MNIST-R [13, 14] contains 11 tasks operating on inputs of images of handwritten digits from the MNIST dataset [11]. We use the SatNet dataset consisting of 9K training samples and 500 test samples [25].
Dataset Splits | No | The paper provides specific training and testing splits for datasets like LEAF-ID (330 training, 110 testing), Scene Recognition (830 training, 92 testing), and MNIST-R tasks (5K training, 500 testing), but does not explicitly mention a distinct validation set or its size for most experiments. It only details 'training samples' and 'testing samples'.
Hardware Specification | Yes | All of our experiments were conducted on a machine with two 20-core Intel Xeon CPUs, one NVIDIA RTX 2080 Ti GPU, and 755 GB RAM.
Software Dependencies | No | The paper mentions 'PyTorch [17]', 'YOLOv8 [20]', and 'CLIP [19]' but does not provide their specific version numbers. It specifies GPT-4 versions as 'gpt-4-1106-preview and gpt-4o'. However, for other key software and baselines like DeepProbLog, Scallop, A-NeSI, NASR, and IndeCateR, no specific version numbers are provided.
Experiment Setup | Yes | Unless otherwise noted, the sample count, i.e., the number of calls to the program P per training example, is fixed at 100 for all relevant methods. We use the Adam optimizer with the best learning rate among {1e-3, 5e-4, 1e-4}. We train for maximum 100 epochs, but stop early if the training saturates. For MNIST-R tasks, we used learning rate 1e-4 and trained ISED for 10 epochs, REINFORCE and IndeCateR for 50 epochs, and A-NeSI and NASR for 100 epochs.
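
The Experiment Setup entry describes a small learning-rate sweep with an epoch cap and early stopping. The sketch below is one possible way to express that protocol in plain Python; it is not the authors' code. The learning-rate grid, the 100-epoch cap, and the sample count of 100 come from the quoted text. Everything else is an assumption made for illustration: the names `train_with_early_stopping`, `select_learning_rate`, and `toy_epoch`, the `PATIENCE` and `MIN_DELTA` thresholds (the source only says training stops "if the training saturates"), and the choice of a monitored loss as the selection criterion.

```python
from typing import Callable, Dict, Sequence, Tuple

# Values quoted in the Experiment Setup entry.
LEARNING_RATES = (1e-3, 5e-4, 1e-4)
MAX_EPOCHS = 100
SAMPLE_COUNT = 100  # calls to the program P per training example (unused by the toy below)

# Hypothetical saturation criterion; the source does not specify one.
PATIENCE = 5      # consecutive epochs without improvement before stopping
MIN_DELTA = 1e-3  # smallest loss decrease that counts as an improvement

# A callable that runs one epoch at a given learning rate and returns the monitored loss.
EpochFn = Callable[[float, int], float]


def train_with_early_stopping(
    run_epoch: EpochFn,
    lr: float,
    max_epochs: int = MAX_EPOCHS,
    patience: int = PATIENCE,
    min_delta: float = MIN_DELTA,
) -> Tuple[float, int]:
    """Train for at most max_epochs; return (best_loss, epochs_run)."""
    best_loss = float("inf")
    stale = 0
    for epoch in range(1, max_epochs + 1):
        loss = run_epoch(lr, epoch)
        if best_loss - loss > min_delta:
            best_loss, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return best_loss, epoch  # training has saturated
    return best_loss, max_epochs


def select_learning_rate(
    run_epoch: EpochFn,
    learning_rates: Sequence[float] = LEARNING_RATES,
) -> Tuple[float, Dict[float, Tuple[float, int]]]:
    """Run one training job per learning rate and keep the lowest best loss."""
    results = {lr: train_with_early_stopping(run_epoch, lr) for lr in learning_rates}
    best_lr = min(results, key=lambda lr: results[lr][0])
    return best_lr, results


def toy_epoch(lr: float, epoch: int) -> float:
    """Stand-in for a real epoch: loss falls linearly, then plateaus at an lr-dependent floor."""
    floor = abs(lr - 5e-4) * 100  # zero only for lr = 5e-4 (arbitrary toy choice)
    return floor + max(0.0, 1.0 - 0.1 * epoch)


if __name__ == "__main__":
    best_lr, results = select_learning_rate(toy_epoch)
    print("selected learning rate:", best_lr)
    for lr, (loss, epochs) in results.items():
        print(f"lr={lr:g}: best loss {loss:.4f} after {epochs} epochs")
```

In a real run, `run_epoch` would wrap an Adam update loop over the training set. Because the Dataset Splits entry notes that no separate validation set is reported, which quantity the authors monitored when picking the learning rate remains an open question that this sketch does not resolve.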