Adaptive Coordination in Social Embodied Rearrangement

Authors: Andrew Szot, Unnat Jain, Dhruv Batra, Zsolt Kira, Ruta Desai, Akshara Rai

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our results demonstrate that BDP learns adaptive agents that can tackle visual coordination, and zero-shot generalize to new partners in unseen environments, achieving 35% higher success and 32% higher efficiency compared to baselines. Our experiments show BDP achieves 35% higher success and is 32% more efficient when coordinating with unseen agents compared to the approach from (Jaderberg et al., 2019), averaged over 3 tasks.
Researcher Affiliation | Collaboration | 1 Meta AI, 2 Georgia Institute of Technology. Correspondence to: Andrew Szot <aszot3@gatech.edu>.
Pseudocode | Yes | Algorithm 1 presents the pseudocode for stage 1 and stage 2 training of BDP (a hedged structural sketch of this two-stage loop is given after the table).
Open Source Code | Yes | All code is available at https://bit.ly/43vNgFk.
Open Datasets | Yes | We build on the Home Assistant Benchmark (Szot et al., 2021) in the AI Habitat simulator (Savva et al., 2019) that studies Rearrangement (Batra et al., 2020a). Social Rearrangement extends Rearrangement to a collaborative setting, where two agents coordinate to rearrange objects as efficiently as possible. We follow the standard dataset split in the ReplicaCAD (Szot et al., 2021) scene dataset with YCB objects (Calli et al., 2015).
Dataset Splits | No | The paper mentions training and evaluation datasets but does not specify a distinct validation set with exact percentages, sample counts, or a formal citation to a predefined validation split.
Hardware Specification | Yes | DD-PPO (Wijmans et al., 2019) is used to distribute training across 4 GPUs. Each GPU runs 32 parallel environments and collects 128 simulation steps per rollout. Training runs on NVIDIA V100 GPUs. (See the rollout-size note after the table.)
Software Dependencies | No | The paper mentions components such as PPO, DD-PPO, ResNet18, LSTM, and AI Habitat, but it does not provide specific version numbers for these or other relevant software libraries and frameworks.
Experiment Setup | Yes | For all methods we use the same hyperparameters unless stated otherwise. For PPO policy optimization, we use a learning rate of 0.0003, 2 epochs per update, 2 mini-batches per epoch, a clip parameter of 0.2, an entropy coefficient of 0.001, and clip the gradient norm to a maximum value of 0.2. For return estimation, we use a discount factor of γ = 0.99 and GAE with λ = 0.95. For the discriminator in BDP we also use a learning rate of 0.0003. BDP weighs the diversity reward by 0.01 before adding it to the task reward. (These settings are gathered into the configuration sketch after the table.)
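
The pseudocode row refers to Algorithm 1's two-stage training of BDP. Below is a minimal structural sketch of a two-stage, population-then-adaptation loop, pieced together only from details quoted in this table (a discriminator, a diversity reward weighted by 0.01 and added to the task reward, and coordination with unseen partners). The assumption that stage 2 trains a single adaptive agent against the frozen stage-1 population, and all names (Policy, Discriminator, rollout, ppo_update), are illustrative stubs, not the authors' implementation.

import random

class Policy:
    """Stub policy; the real policies are recurrent visual policies."""
    def act(self, obs, partner_id=None):
        return random.choice([0, 1, 2])  # placeholder discrete action

class Discriminator:
    """Stub discriminator scoring how identifiable a partner's behavior is."""
    def log_prob(self, partner_id, trajectory):
        return 0.0  # placeholder for log p(partner_id | trajectory)
    def update(self, partner_id, trajectory):
        pass  # placeholder discriminator update

def rollout(ego, partner, partner_id):
    """Collect a (stubbed) joint trajectory and its task reward."""
    trajectory = [ego.act(None), partner.act(None, partner_id)]
    return trajectory, 0.0

def ppo_update(policy, reward):
    pass  # placeholder for a PPO update on the collected rollout

POP_SIZE = 8            # assumed population size, not reported in this table
DIVERSITY_WEIGHT = 0.01 # diversity reward weight reported in the table

# Stage 1: train a population of partners; each update adds a
# discriminator-based diversity bonus to the task reward.
partners = [Policy() for _ in range(POP_SIZE)]
ego_stage1, discriminator = Policy(), Discriminator()
for _ in range(100):
    pid = random.randrange(POP_SIZE)
    traj, task_r = rollout(ego_stage1, partners[pid], pid)
    diversity_r = discriminator.log_prob(pid, traj)
    ppo_update(partners[pid], task_r + DIVERSITY_WEIGHT * diversity_r)
    discriminator.update(pid, traj)

# Stage 2 (assumed): freeze the population and train one adaptive ego agent
# against randomly sampled frozen partners using the task reward only.
ego_stage2 = Policy()
for _ in range(100):
    pid = random.randrange(POP_SIZE)
    traj, task_r = rollout(ego_stage2, partners[pid], pid)
    ppo_update(ego_stage2, task_r)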
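
For the hardware row, the reported 4 GPU workers, 32 parallel environments per GPU, and 128 simulation steps per rollout imply the following per-update batch of environment steps, assuming every worker contributes one full rollout before each update:

# Back-of-the-envelope rollout size per PPO update from the reported DD-PPO setup.
NUM_GPUS = 4
ENVS_PER_GPU = 32
STEPS_PER_ROLLOUT = 128

steps_per_update = NUM_GPUS * ENVS_PER_GPU * STEPS_PER_ROLLOUT
print(steps_per_update)  # 16384 environment steps gathered per update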
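
For the experiment setup row, the reported optimization settings are collected below for convenience. The dictionary layout and key names are illustrative only and do not mirror the authors' actual configuration files.

# Optimization settings as reported in the paper, gathered into one place.
PPO_CONFIG = {
    "lr": 3e-4,              # learning rate for PPO policy optimization
    "ppo_epochs": 2,         # epochs per update
    "num_mini_batches": 2,   # mini-batches per epoch
    "clip_param": 0.2,       # PPO clipping parameter
    "entropy_coef": 0.001,   # entropy bonus coefficient
    "max_grad_norm": 0.2,    # gradient-norm clipping threshold
    "gamma": 0.99,           # discount factor
    "gae_lambda": 0.95,      # GAE lambda for return estimation
}
BDP_CONFIG = {
    "discriminator_lr": 3e-4,         # learning rate for the BDP discriminator
    "diversity_reward_weight": 0.01,  # weight on the diversity reward before adding it to the task reward
}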