Machine learning classifiers are now widely used to make decisions about individuals across a broad variety of societal contexts: education admissions, health insurance, medical diagnosis, court decisions, marketing, face recognition, and more, a trend that is likely to continue to grow. It is well-recognized that these machine learning models are susceptible to built-in biases that can lead to systematic discrimination against protected groups. The machine learning research community has begun to recognize this important issue and in the past few years has devoted considerable research resources to developing principles, frameworks, and algorithmic solutions to address it.
In this context, this work addresses the understudied problem of assessing how accurate, calibrated, and fair a model is, and how much confidence we should place in that assessment, given access to only a limited amount of labeled data. Specifically, we propose a Bayesian framework for assessing (with uncertainty) performance metrics of black-box classifiers, which is particularly important when labeled data is scarce. To improve the label-efficiency of assessment, we develop active Bayesian assessment strategies for an array of fundamental tasks, including (1) estimating model performance, (2) identifying model deficiencies, and (3) comparing performance between groups. When unlabeled data is available, we develop a new hierarchical Bayesian methodology that leverages information from both unlabeled and labeled data.
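To make the core idea concrete, the following is a minimal sketch (not the paper's exact algorithm) of Bayesian assessment with active labeling: a Beta posterior is maintained over the classifier's accuracy within each group, updated as labeled examples arrive, and a Thompson-sampling-style rule chooses which group to label next. All names here (e.g., BetaAccuracy, thompson_select) and the two-group setup are illustrative assumptions.

```python
# Sketch: Beta-Bernoulli posteriors over per-group accuracy with active selection.
import numpy as np

class BetaAccuracy:
    """Beta posterior over a classifier's accuracy within one group."""
    def __init__(self, alpha=1.0, beta=1.0):  # uniform prior
        self.alpha, self.beta = alpha, beta

    def update(self, correct: bool):
        # One Bernoulli observation: was the classifier's prediction correct?
        if correct:
            self.alpha += 1
        else:
            self.beta += 1

    def mean(self):
        return self.alpha / (self.alpha + self.beta)

    def sample(self, rng):
        return rng.beta(self.alpha, self.beta)

def thompson_select(posteriors, rng):
    """Pick the group whose sampled accuracy is lowest, i.e. the group most
    likely to be under-performing and hence most informative to label next."""
    draws = [p.sample(rng) for p in posteriors]
    return int(np.argmin(draws))

# Usage with two hypothetical groups and a simulated labeling oracle.
rng = np.random.default_rng(0)
groups = [BetaAccuracy(), BetaAccuracy()]
true_acc = [0.92, 0.78]                    # unknown in practice
for _ in range(200):
    g = thompson_select(groups, rng)       # actively choose a group
    correct = rng.random() < true_acc[g]   # query a label, check the prediction
    groups[g].update(correct)
print([round(p.mean(), 3) for p in groups])
```

The posterior mean gives a point estimate of each group's accuracy, while the Beta parameters directly quantify how much confidence the limited labeled data supports.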
Via a series of experiments assessing modern neural classifiers (e.g., ResNet and BERT) on several standard image and text classification datasets, we demonstrate that our proposed approaches require significantly fewer labels than baseline methods. One particular use case is the increasingly common situation in which the user of a black-box classification model needs to assess its performance from a fairness perspective, independently of the claims made by the entity that trained the model. We demonstrate that the methodology developed in this work is well-suited to such an application.
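As a hedged illustration of that fairness-auditing use case (a sketch under assumed audit counts, not the paper's exact procedure), one can compare the posterior accuracy of two groups directly: given a small labeled audit set per group, Monte Carlo draws from each group's Beta posterior yield the probability of an accuracy gap and a credible interval for its size.

```python
# Sketch: posterior probability that a classifier is less accurate on group A
# than on group B, from small per-group audit sets. Counts are illustrative.
import numpy as np

rng = np.random.default_rng(0)
# (correct, incorrect) counts observed on the audit set, per group
counts = {"group_A": (41, 9), "group_B": (47, 3)}

def accuracy_posterior_samples(correct, incorrect, n=100_000):
    # Beta(1 + correct, 1 + incorrect) posterior under a uniform prior
    return rng.beta(1 + correct, 1 + incorrect, size=n)

acc_a = accuracy_posterior_samples(*counts["group_A"])
acc_b = accuracy_posterior_samples(*counts["group_B"])

print("P(accuracy_A < accuracy_B) =", np.mean(acc_a < acc_b))
print("95% credible interval for the gap:",
      np.percentile(acc_b - acc_a, [2.5, 97.5]))
```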