Evaluating and Improving Interactions with Hazy Oracles
Authors: Stephan J. Lemmer, Jason J. Corso
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate this new formalization and an innovative deferred inference method on the disparate tasks of Single-Target Video Object Tracking and Referring Expression Comprehension, ultimately reducing error by up to 48% without any change to the underlying model or its parameters. |
| Researcher Affiliation | Academia | University of Michigan, Ann Arbor, Michigan, USA lemmersj@umich.edu, jjcorso@umich.edu |
| Pseudocode | Yes | Algorithm 1: Calculating DEV<br>DEV ← 0; DDC ← 1<br>while DDC ≤ 10 do<br>&nbsp;&nbsp;tasks ← draw_tasks()<br>&nbsp;&nbsp;DEV ← DEV + calc_error(tasks) / (10(len(tasks)+1))<br>&nbsp;&nbsp;N ← 0<br>&nbsp;&nbsp;while N < len(tasks) do<br>&nbsp;&nbsp;&nbsp;&nbsp;cur_task ← find_task_to_defer(tasks, DDC)<br>&nbsp;&nbsp;&nbsp;&nbsp;response ← get_new_input(cur_task)<br>&nbsp;&nbsp;&nbsp;&nbsp;updated_task ← aggregate_fn(cur_task, response)<br>&nbsp;&nbsp;&nbsp;&nbsp;update_tasks(tasks, updated_task)<br>&nbsp;&nbsp;&nbsp;&nbsp;DEV ← DEV + calc_error(tasks) / (10(len(tasks)+1))<br>&nbsp;&nbsp;&nbsp;&nbsp;N ← N + 1<br>&nbsp;&nbsp;end while<br>&nbsp;&nbsp;DDC ← DDC + 1<br>end while |
| Open Source Code | No | The paper does not provide an explicit statement or a link to open-source code for the methodology described in this paper. |
| Open Datasets | Yes | Since it is the only VOT dataset, to our knowledge, that contains multiple annotations per tracked object, we perform our analysis using the crowdsourced data from Lemmer et al. (Lemmer, Song, and Corso 2021). This dataset consists of nine first-frame annotations for every video in the OTB-100 dataset (Wu, Lim, and Yang 2013). ... For the task model, our evaluation uses the UNITER architecture (Chen et al. 2020), which formulates referring expression comprehension as classification over a set of externally-provided bounding boxes. ... We train and evaluate on the RefCOCO (Kazemzadeh et al. 2014) dataset because it contains multiple references to all but one target object... |
| Dataset Splits | Yes | We maintain the val, test A, and test B splits from previous works (Yu et al. 2016), but note our evaluation measures per-task performance instead of per-phrase performance, making it incorrect to directly compare our results to other evaluations. |
| Hardware Specification | Yes | Our model is trained on a single GeForce GTX Titan XP GPU using the training settings given by the original authors with a few small modifications: we use full precision floating point operations, adjust the batch size from 128 to 64, and accumulate gradients over two steps. |
| Software Dependencies | No | The paper mentions using 'Scikit-Learn' but does not specify a version number for it or for other key software components like deep learning frameworks (e.g., PyTorch, TensorFlow). |
| Experiment Setup | Yes | For VOT, we use DBSCAN with epsilon 10 and minimum samples 20, and for our method, samples are scattered by adding a normally-distributed random value with standard deviation 7 to every dimension. For Referring Expression Comprehension, we use full precision floating point operations, adjust the batch size from 128 to 64, and accumulate gradients over two steps, performing Monte Carlo dropout with 100 passes. |
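The Algorithm 1 pseudocode quoted in the table can be sketched in Python. This is a minimal toy reconstruction, not the paper's implementation: `draw_tasks`, `find_task_to_defer`, `get_new_input`, `aggregate_fn`, and `calc_error` are hypothetical stubs standing in for the paper's task sampling, deferral policy, oracle query, aggregation, and error metric.

```python
import random

# --- Hypothetical stubs: each toy "task" is just an error value in [0, 1]. ---
def draw_tasks():
    return [random.random() for _ in range(5)]

def find_task_to_defer(tasks, ddc):
    # Toy deferral policy: pick the task with the highest current error.
    return max(range(len(tasks)), key=lambda i: tasks[i])

def get_new_input(task):
    return task * 0.5  # a fresh oracle input halves the toy task's error

def aggregate_fn(task, response):
    return response

def calc_error(tasks):
    return sum(tasks) / len(tasks)

def calc_dev(max_ddc=10, seed=0):
    """Deferred Error Volume (DEV), following the structure of Algorithm 1:
    average error over deferral-decision counts (DDC) and deferral steps."""
    random.seed(seed)
    dev = 0.0
    for ddc in range(1, max_ddc + 1):
        tasks = draw_tasks()
        dev += calc_error(tasks) / (max_ddc * (len(tasks) + 1))
        for _ in range(len(tasks)):
            i = find_task_to_defer(tasks, ddc)
            response = get_new_input(tasks[i])
            tasks[i] = aggregate_fn(tasks[i], response)
            dev += calc_error(tasks) / (max_ddc * (len(tasks) + 1))
    return dev

print(calc_dev())
```

Because every error term is divided by the total number of terms, `10(len(tasks)+1)`, the returned DEV is a weighted average of per-step errors and stays in [0, 1] for this toy setup.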
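The experiment-setup row describes scattering samples with normally-distributed noise (standard deviation 7 per dimension) before density-based clustering. A stdlib-only sketch of that scattering step follows; the example boxes and copy count are hypothetical, and the subsequent clustering would use something like scikit-learn's `DBSCAN(eps=10, min_samples=20)`, which is not shown here.

```python
import random

def scatter(sample, sigma, rng):
    """Add N(0, sigma) noise to every dimension of one sample, mirroring
    the scattering described in the VOT experiment setup."""
    return [x + rng.gauss(0.0, sigma) for x in sample]

def scatter_all(samples, n_copies, sigma=7.0, seed=0):
    """Produce n_copies noisy copies of each sample (e.g. a 4-D bounding
    box [x, y, w, h]) as input points for density-based clustering."""
    rng = random.Random(seed)
    return [scatter(s, sigma, rng) for s in samples for _ in range(n_copies)]

# Hypothetical first-frame boxes from two annotators.
boxes = [[100.0, 50.0, 40.0, 30.0], [102.0, 51.0, 38.0, 29.0]]
scattered = scatter_all(boxes, n_copies=20)
print(len(scattered))  # 40 scattered points to feed into clustering
```

Seeding a local `random.Random` keeps the scattering reproducible without touching the global random state.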