Not a Number: Identifying Instance Features for Capability-Oriented Evaluation
Authors: Ryan Burnell, John Burden, Danaja Rutar, Konstantinos Voudouris, Lucy Cheke, José Hernández-Orallo
IJCAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we present a new methodology to identify and build informative instance features that can provide explanatory and predictive power to analyse the behaviour of AI systems more robustly. ... We illustrate this methodology with the Animal-AI competition as a representative example of how we can revisit existing competitions and benchmarks in AI even when evaluation data is sparse. |
| Researcher Affiliation | Academia | ¹Leverhulme Centre for the Future of Intelligence, University of Cambridge, UK; ²Centre for the Study of Existential Risk, University of Cambridge, UK; ³VRAIN, Universitat Politècnica de València, Spain |
| Pseudocode | No | The paper provides a step-by-step summary of its methodology in Section 6 and Appendix A.4, but these are descriptive text points, not structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The appendix includes further results and plots, and can be found with all the code and data on GitHub: https://github.com/RyanBurnell/NotANumber |
| Open Datasets | Yes | As a proof of concept, we apply this methodology to the Animal-AI (AAI) Olympics [Crosby et al., 2020], a competition that evaluated AI agents in a 3D environment across a range of task categories, such as spatial memory and causal reasoning. |
| Dataset Splits | No | The paper discusses splitting data for analysis ('split the data into 75% test and 25% deployment') and mentions building 'predictive models'. However, it does not provide explicit details about training/validation/test splits for reproducibility of their own predictive models, nor does it refer to standard predefined splits for such a purpose. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory, or cloud resources) used to run the experiments or analyses described. |
| Software Dependencies | No | The paper mentions that the 'Animal-AI (AAI) environment is built in Unity [Juliani et al., 2018]', but it does not specify the version of Unity or any other specific software dependencies with their version numbers used for their analysis or predictive modeling. |
| Experiment Setup | No | The paper describes the setup of the Animal-AI competition tasks and environment, which is the subject of their analysis. However, it does not provide details of the experimental setup for their own research, such as hyperparameters or training configurations for the predictive models they built (e.g., C5.0). |
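To make concrete the kind of analysis the table above refers to — instance features feeding a predictive model, with the quoted "75% test and 25% deployment" split — here is a minimal, self-contained sketch. All feature names, data, and the threshold-rule "model" are hypothetical stand-ins (the paper itself used C5.0 decision trees); this illustrates only the split-and-predict pattern, not the authors' actual pipeline.

```python
import random

# Hypothetical instance records: each pairs illustrative instance
# features (e.g. reward distance, obstacle count) with a pass/fail
# outcome. Names and values are invented for this sketch.
random.seed(0)
instances = [
    {"distance": random.uniform(1, 10),
     "obstacles": random.randint(0, 3)}
    for _ in range(200)
]
for inst in instances:
    # Toy ground truth: far rewards behind more obstacles are harder.
    inst["solved"] = inst["distance"] + 2 * inst["obstacles"] < 8

# 75% "test" / 25% "deployment" split, as quoted from the paper.
random.shuffle(instances)
cut = int(0.75 * len(instances))          # 150 / 50
test, deployment = instances[:cut], instances[cut:]

# Stand-in predictive model: a single-feature threshold rule fit on
# the test portion (a placeholder for a real decision-tree learner).
_, threshold = max(
    (sum((i["distance"] < t) == i["solved"] for i in test), t)
    for t in range(1, 11)
)

# Evaluate the fitted rule on the held-out deployment portion.
acc = sum((i["distance"] < threshold) == i["solved"]
          for i in deployment) / len(deployment)
print(f"threshold={threshold}, deployment accuracy={acc:.2f}")
```

The point of the split is that the rule's parameters are chosen on one portion of the instances and its predictive power is measured on the other, which is the property the "Dataset Splits" row notes the paper does not fully specify for its own models.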