Computational techniques have much to offer in addressing questions of societal significance. Many such questions can be framed as prediction problems and approached with data-driven methods. Beyond prediction, understanding human behavior is a distinguishing goal in societally relevant domains. In this work, I describe societally significant problems that can be solved with a collective probabilistic approach.
These problems pose many challenges to techniques that assume data independence, homogeneity and scale. In settings of societal importance, dependencies can define the data in question, from complex relationships between people to continuity between consecutive events. Rather than being generated by a single, uniform source, data in these domains can be derived from and described by heterogeneous sources. Finally, though many data-driven methods depend on large numbers of observations and high-quality labels to guarantee quality results, in domains of critical social value it is often infeasible to gather such quantities. These challenges demand methods that can utilize data dependencies, incorporate diverse forms of information, and reason over small numbers of instances with potentially ambiguous labels.
There are also many opportunities in these domains. Models concerned with societally relevant problems can draw on the knowledge established by existing academic disciplines, from the social to the natural sciences. Such knowledge can inform each step of research, from choosing an appropriate problem to putting results into perspective. Furthermore, the abundance of data generated by virtual and online activity, and by mobile and sensor networks, offers opportunities to obtain new insights into human behavior. The scale of this data necessitates computational methods. Methods that can leverage prior knowledge and remain efficient even with large datasets can offer much in these domains.
In my work, I utilize a collective probabilistic approach for data-driven social good. This approach can capitalize on structure between data instances rather than flattening it. Furthermore, it can readily incorporate domain knowledge, which, especially when combined with a collective approach, is instrumental in learning from small datasets. When datasets are large, this approach leverages a class of probabilistic graphical models that offers efficient inference. Finally, this approach can be extended to model unobserved phenomena with latent-variable representations.
I demonstrate the benefits of this approach in three societally relevant domains: sustainability, education and malicious behavior. While these domains are diverse, the problems they present share several commonalities that are critical in data-driven modeling. For example, modeling data structure, from spatial relationships to social interactions, can reduce issues of sparsity and noise. Domain knowledge can also combat these issues, in addition to improving model interpretability. I show the benefits of domain knowledge in discovering sustainable products, predicting course performance and detecting cyberbullying. In both sustainability and malicious behavior, I demonstrate how to utilize spatio-temporal structure in the seemingly distinct tasks of disaggregating appliances and predicting the movements of human traffickers. In education and malicious behavior, I show how unobserved social structure is instrumental not only in modeling learning and aggression, but also in interpreting these dynamics in groups. In all three domains, I show how to model, represent and interpret latent structure. Thus, while making contributions to each problem setting and domain, I also contribute to the broader goal of data-driven modeling for social good.