In this project, we build a modular, scalable system that can collect, store,
and process millions of satellite images. We apply this system to a data-rich
environment to test the relative importance of the two key limitations
constraining the prevailing literature. To overcome classic data availability
concerns, and to quantify their implications in an economically meaningful
context, we work with an outcome variable directly correlated with key
indicators of socioeconomic well-being.
We collect public records of home sale prices in the United States, and
then progressively degrade this rich sample in ways that mimic
the sampling strategies employed in actual survey-based datasets. Pairing each
house with a corresponding set of satellite images, we use image-based features
to predict housing prices within each of these degraded samples. To generalize
beyond any given featurization methodology, our system contains an independent
featurization module that can be replaced with any preferred image
classification tool.
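
To make the design concrete, the following is a minimal sketch of how sample
degradation and an interchangeable featurization step might fit together. It
is an illustration under stated assumptions, not the system's actual code:
the names Sale, Featurizer, mean_intensity, uniform_subsample,
cluster_subsample, and build_design are all hypothetical, the "image" is a
flat list of numbers standing in for pixel data, and the two degradation
schemes shown (uniform thinning and region-level clustering) are only
examples of survey-style sampling.

```python
import random
from dataclasses import dataclass
from typing import List, Protocol, Sequence, Tuple


@dataclass(frozen=True)
class Sale:
    """One ground-truth record: a home sale paired with an image tile."""
    house_id: int
    region: str               # hypothetical grouping key, e.g. a county
    price: float
    image: Sequence[float]    # stand-in for real pixel data


class Featurizer(Protocol):
    """Any callable mapping an image to a feature vector can be plugged in."""
    def __call__(self, image: Sequence[float]) -> List[float]: ...


def mean_intensity(image: Sequence[float]) -> List[float]:
    """Toy featurizer (assumption): a single feature, the mean pixel value."""
    return [sum(image) / len(image)]


def uniform_subsample(sales: Sequence[Sale], fraction: float,
                      rng: random.Random) -> List[Sale]:
    """Degrade the sample by keeping a uniformly random fraction of records."""
    k = min(len(sales), max(1, round(fraction * len(sales))))
    return rng.sample(list(sales), k)


def cluster_subsample(sales: Sequence[Sale], n_regions: int,
                      rng: random.Random) -> List[Sale]:
    """Degrade the sample by keeping every record in a few random regions."""
    regions = sorted({s.region for s in sales})
    kept = set(rng.sample(regions, min(n_regions, len(regions))))
    return [s for s in sales if s.region in kept]


def build_design(sales: Sequence[Sale],
                 featurize: Featurizer) -> Tuple[List[List[float]], List[float]]:
    """Apply the chosen featurizer; return features X and price targets y."""
    X = [featurize(s.image) for s in sales]
    y = [s.price for s in sales]
    return X, y
```

Because build_design depends only on the Featurizer interface, replacing
mean_intensity with a different feature extractor leaves the degradation and
prediction steps unchanged, which is the kind of separation the featurization
module is meant to provide.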
Our initial findings demonstrate that while satellite imagery can be used to
predict housing prices with considerable accuracy, the size and nature of the
ground-truth sample are fundamental determinants of the usefulness of imagery
for this category of socioeconomic prediction. We quantify the returns to
improving the distribution and size of observed data, and show that the image
classification method is a second-order concern. Our results provide clear
guidance for the development of adaptive sampling strategies in data-sparse
locations where satellite-based metrics may be integrated with standard survey
data, while also suggesting that advances in image classification techniques
for satellite imagery could be further augmented by more robust sampling
strategies.