PHYRE: A New Benchmark for Physical Reasoning

Authors: Anton Bakhtin, Laurens van der Maaten, Justin Johnson, Laura Gustafson, Ross Girshick

NeurIPS 2019

Reproducibility assessment. Each entry below gives a reproducibility variable, its result, and the LLM response (supporting excerpts from the paper, or a note where no evidence was found).
Research Type: Experimental
LLM Response: "We test several modern learning algorithms on PHYRE and find that these algorithms fall short in solving the puzzles efficiently. We expect that PHYRE will encourage the development of novel sample-efficient agents that learn efficient but useful models of physics. For code and to play PHYRE for yourself, please visit https://player.phyre.ai." From Section 4 (Experiments): "We conduct experiments to obtain baseline results for within-template and cross-template generalization on the PHYRE benchmark. Experiments are performed separately on each tier. Code reproducing the results of our experiments is available from https://phyre.ai. Figure 3 presents success-percentage curves for all five agents on both PHYRE tiers (B and 2B) in both generalization settings (within-template and cross-template): the curves show the percentage of tasks solved as a function of the number of solution attempts per task, and are computed by averaging over all 10 folds in PHYRE. Table 1a presents the corresponding mean AUCCESS (and its standard deviation)."
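For readers reproducing the evaluation: AUCCESS is the paper's weighted area under the success curve, with weights w_k = log(k + 1) - log(k) over k = 1, ..., 100 attempts, so that solving a task in fewer attempts scores higher. A minimal sketch (the function name and input convention here are ours, not the paper's):

```python
import numpy as np

def auccess(attempts_to_solve, max_attempts=100):
    """Weighted area under the success curve (AUCCESS).

    attempts_to_solve: for each task, the attempt number (1-indexed) on
    which it was first solved, or None if unsolved within max_attempts.
    """
    ks = np.arange(1, max_attempts + 1)
    # Weights w_k = log(k + 1) - log(k) put more mass on early attempts.
    weights = np.log(ks + 1) - np.log(ks)
    solved = np.array([a for a in attempts_to_solve if a is not None])
    n_tasks = len(attempts_to_solve)
    # s_k: fraction of tasks solved within at most k attempts.
    s = np.array([(solved <= k).sum() / n_tasks for k in ks])
    return float((weights * s).sum() / weights.sum())

# Example: of three tasks, two are solved (on attempts 1 and 5).
print(auccess([1, 5, None]))  # ~0.55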
Researcher Affiliation: Industry
LLM Response: "Anton Bakhtin, Laurens van der Maaten, Justin Johnson, Laura Gustafson, Ross Girshick (Facebook AI Research), {yolo,lvdmaaten,jcjohns,lgustafson,rbg}@fb.com"
Pseudocode: No
LLM Response: The paper describes methods in text but does not include any explicit pseudocode or algorithm blocks.
Open Source Code: Yes
LLM Response: "For code and to play PHYRE for yourself, please visit https://player.phyre.ai." "Code reproducing the results of our experiments is available from https://phyre.ai."
Open Datasets: Yes
LLM Response: "Towards this goal, we have developed the PHYRE (PHYsical REasoning) benchmark. PHYRE provides a set of physics puzzles in a simulated 2D world. Each template defines 100 such tasks. Task templates are used to measure an agent's generalization ability in two settings. In the within-template setting, an agent trains on a subset of tasks in the template and is evaluated on the remaining tasks within that template. To measure cross-template generalization, test tasks are selected exclusively from templates that were not used for training. PHYRE provides: a fully deterministic environment (agents always produce the same result on a task); a process that deterministically splits the tasks into 10 folds containing a training, validation, and test set. As a result, agents are always compared on exactly the same task splits. Task splits are available for both tiers and both generalization settings (within-template and cross-template)."
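The deterministic environment is exposed through the authors' phyre Python package. A brief sketch based on the package's public tutorial API (the exact names and the example task id are illustrative and may differ across versions):

```python
import phyre

# Map an evaluation setup to its action tier ('ball' or 'two_balls').
tier = phyre.eval_setup_to_action_tier('ball_within_template')

# Task ids have the form 'template:task', e.g. task 001 of template 00000.
simulator = phyre.initialize_simulator(['00000:001'], tier)

# Sample candidate actions from the tier's discrete action space.
actions = simulator.build_discrete_action_space(max_actions=1000)

# The environment is fully deterministic: simulating the same action on
# the same task always returns the same status.
simulation = simulator.simulate_action(0, actions[0], need_images=False)
print(simulation.status.is_solved())
```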
Dataset Splits: Yes
LLM Response: "PHYRE provides: ... a process that deterministically splits the tasks into 10 folds containing a training, validation, and test set. As a result, agents are always compared on exactly the same task splits. Task splits are available for both tiers and both generalization settings (within-template and cross-template). To avoid overfitting on test tasks, hyperparameter tuning is only to be performed based on the validation set: we discourage tuning of hyperparameters based on test task performance. For results on the test set, we use these tuned hyperparameters and train agents on the union of the training and validation sets."
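In code, this protocol amounts to iterating over the 10 deterministic folds, for example (a sketch using the phyre package's get_fold helper; the setup name is one of the four documented evaluation setups):

```python
import phyre

# Four evaluation setups: {ball, two_balls} x {within, cross}_template.
eval_setup = 'ball_cross_template'

for fold_id in range(10):
    # Deterministic split: the same (setup, fold) pair always yields the
    # same train/validation/test task ids, for every agent.
    train_ids, dev_ids, test_ids = phyre.get_fold(eval_setup, fold_id)
    # Per the protocol above: tune hyperparameters on dev_ids only, then
    # train the final agent on train_ids + dev_ids and report on test_ids.
```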
Hardware Specification: No
LLM Response: The paper does not mention any specific hardware specifications (e.g., GPU models, CPU types, memory amounts) used for the experiments.
Software Dependencies: No
LLM Response: The paper mentions using
Experiment Setup: Yes
LLM Response: "Following [4], we train the network by minimizing the cross-entropy between the soft prediction and the observed reward. During training, we sample batches with an equal number of positive and negative triplets. Our network comprises: (1) an action encoder that transforms the 3D or 6D (depending on the tier) action representation using a multi-layer perceptron with a single hidden layer; (2) an observation encoder that transforms the observation image into a hidden representation using a convolutional network (CNN); and (3) a fusion module that combines the action and observation representations and makes a prediction. Our action encoder is an MLP with a single hidden layer with 512 units and ReLU activations. Our observation encoder is a ResNet-18 [13]. For the fusion module, we follow [37] and use the action encoder to predict a bias and gain for each channel in the CNN. The output of the action encoder thus contains twice as many values as there are channels in the CNN at the fusion point. To expedite action ranking, we fuse both models before the last residual block of the CNN. The network is trained end-to-end using stochastic gradient descent with the Adam optimizer [22]. We anneal the learning rate to 0 using a half cosine schedule without restarts [28]. The DQN(-O) agents were trained on 100,000 actions per task. All agents are permitted to make up to 100 attempts per task."
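As a concrete reading of that description, here is a minimal PyTorch sketch of the action-conditioned scoring network. The fusion point, layer sizes, and activations follow the quoted text; everything else (module names, input resolution, the 3-channel observation encoding) is an assumption for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class ActionScoringNetwork(nn.Module):
    """Sketch of the described DQN-style agent: an MLP action encoder
    predicts a per-channel gain and bias (following [37]) that modulate
    a ResNet-18 observation encoder before its last residual block."""

    def __init__(self, action_dim=3):  # 3D (B tier) or 6D (2B tier)
        super().__init__()
        resnet = torchvision.models.resnet18()
        # Observation encoder: ResNet-18 up to (but excluding) the last
        # residual block; features have 256 channels at this point.
        self.obs_encoder = nn.Sequential(*list(resnet.children())[:-3])
        fuse_channels = 256
        # Action encoder: MLP with one hidden layer of 512 ReLU units,
        # emitting 2 values (gain, bias) per channel at the fusion point.
        self.action_encoder = nn.Sequential(
            nn.Linear(action_dim, 512),
            nn.ReLU(),
            nn.Linear(512, 2 * fuse_channels),
        )
        # Remaining trunk: last residual block plus a scoring head that
        # produces the soft prediction of task success.
        self.head = nn.Sequential(
            resnet.layer4,
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(512, 1),
        )

    def forward(self, observation, action):
        feats = self.obs_encoder(observation)            # (N, 256, H, W)
        gain, bias = self.action_encoder(action).chunk(2, dim=1)
        feats = feats * gain[:, :, None, None] + bias[:, :, None, None]
        return self.head(feats).squeeze(1)               # success logit

# Training as described: cross-entropy between the soft prediction and
# the observed binary reward, on batches balanced between positive and
# negative (observation, action, reward) triplets.
net = ActionScoringNetwork()
obs = torch.randn(8, 3, 128, 128)   # rendered observation (assumed encoding)
act = torch.rand(8, 3)              # ball tier action: (x, y, radius)
reward = torch.randint(0, 2, (8,)).float()
loss = F.binary_cross_entropy_with_logits(net(obs, act), reward)
loss.backward()
```

Ranking candidate actions then reduces to scoring many (observation, action) pairs with this network and attempting the highest-scoring ones first, up to the benchmark's limit of 100 attempts per task.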