Humans demonstrate remarkable abilities to predict physical events in complex scenes. Two classes of models for physical scene understanding have recently been proposed: “Intuitive Physics Engines”, or IPEs, which posit that people make predictions by running approximate probabilistic simulations in causal mental models similar in nature to video-game physics engines, and memory-based models, which make judgments based on analogies to stored experiences of previously encountered scenes and physical outcomes. Versions of the latter have recently been instantiated in convolutional neural network (CNN) architectures. Here we report four experiments that, to our knowledge, are the first rigorous comparisons of simulation-based and CNN-based models, where both approaches are concretely instantiated in algorithms that can run on raw image inputs and produce as outputs physical judgments such as whether a stack of blocks will fall. Both approaches can achieve super-human accuracy levels and can quantitatively predict human judgments to a similar degree, but only the simulation-based models generalize to novel situations in ways that people do, and are qualitatively consistent with systematic perceptual illusions and judgment asymmetries that people show.