The visual system requires surprisingly little signal in its inputs to compute rich, three-dimensional (3D) shape percepts. Even under highly degraded stimulus conditions, we can accurately interpret images in terms of volumetric objects. What computations support such broad generalization in the visual system? To address this question, we exploit two degraded image modalities (silhouettes and two-tone “Mooney” images) alongside ordinary shaded images. We test two distinct approaches to vision: deep networks for classification and analysis-by-synthesis for scene inference. Deep networks perform substantially below human level even after training on 18 times more images per category than existing large-scale object-classification datasets provide. We also present a novel analysis-by-synthesis architecture that infers 3D scenes from images via optimization in a differentiable, physically-based renderer. This model, too, falls substantially short of human performance. Nevertheless, both approaches can explain some of the key behavioral patterns. We discuss the insights these results provide for reverse-engineering visual cognition.
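To make the analysis-by-synthesis idea concrete, the sketch below shows the general pattern of inferring latent scene parameters by gradient descent through a differentiable renderer. It is a minimal illustration, not the architecture presented here: the toy `render` function (a soft 2D disk), the scene parameterization (`position`, `radius`), and the optimizer settings are all assumptions made for the example, and a physically-based renderer would replace the stand-in.

```python
# Minimal sketch: analysis-by-synthesis as gradient-based optimization of
# latent scene parameters through a differentiable renderer.
# NOTE: the renderer below is a toy stand-in (a soft disk "silhouette"),
# not the physically-based renderer described in the text.
import torch

def render(position, radius, image_size=64):
    """Toy differentiable renderer: soft disk centered at `position`."""
    ys, xs = torch.meshgrid(
        torch.linspace(-1.0, 1.0, image_size),
        torch.linspace(-1.0, 1.0, image_size),
        indexing="ij",
    )
    dist = torch.sqrt((xs - position[0]) ** 2 + (ys - position[1]) ** 2)
    # Smooth occupancy so gradients flow across the silhouette boundary.
    return torch.sigmoid((radius - dist) * 25.0)

# Observed image; in practice this would be a silhouette or Mooney image.
target = render(torch.tensor([0.3, -0.2]), torch.tensor(0.4)).detach()

# Latent scene hypothesis, initialized at a guess and refined by optimization.
position = torch.tensor([0.0, 0.0], requires_grad=True)
radius = torch.tensor(0.25, requires_grad=True)
optimizer = torch.optim.Adam([position, radius], lr=0.05)

for step in range(200):
    optimizer.zero_grad()
    prediction = render(position, radius)
    loss = torch.mean((prediction - target) ** 2)  # pixel reconstruction loss
    loss.backward()   # gradients propagate through the renderer
    optimizer.step()  # update the scene hypothesis

print(position.detach(), radius.detach())
```

The design choice this illustrates is that inference is cast as fitting: the model explains an image by searching for scene parameters whose rendering reproduces it, rather than by mapping pixels directly to a category label.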