In many machine learning domains, misclassification costs are
different for different examples, in the same way that class membership
probabilities are example-dependent. In these domains, both costs and
probabilities are unknown for test examples, so both cost estimators and
probability estimators must be learned. This paper first discusses how to make
optimal decisions given cost and probability estimates, and then presents
decision tree learning methods for obtaining well-calibrated probability
estimates. The paper then explains how to obtain unbiased estimators for
example-dependent costs, taking into account the difficulty that in general,
probabilities and costs are not independent random variables, and the training
examples for which costs are known are not representative of all examples. The
latter problem is called sample selection bias in econometrics. Our solution
to it is based on Nobel prize-winning work due to the economist James Heckman.
We show that the methods we propose are successful in a comprehensive
comparison with MetaCost on the well-known and difficult dataset from
the KDD'98 data mining contest.
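
As a minimal sketch of the decision step described above (not the paper's implementation), the following assumes that a per-example probability estimate and per-example cost estimates are already available, and picks the action with the lowest expected cost. The function name `best_action`, the two-class setup, and all numbers in the usage example are hypothetical and chosen only for illustration.

```python
def best_action(prob_positive, cost):
    """Return the action with the lowest expected cost for one example.

    prob_positive: estimated P(class = 1 | x) for this example (assumed given).
    cost: dict mapping (action, true_class) -> estimated cost for this example;
          negative values represent benefits.
    """
    # Collect the distinct actions that appear in the cost table.
    actions = sorted({action for action, _ in cost})

    def expected_cost(action):
        # Weight the cost under each true class by its estimated probability.
        return ((1.0 - prob_positive) * cost[(action, 0)]
                + prob_positive * cost[(action, 1)])

    return min(actions, key=expected_cost)


def mailing_costs(donation, mail_cost=1.0):
    """Hypothetical cost table: mailing always costs mail_cost, and a
    responder (class 1) returns `donation`; skipping costs nothing."""
    return {
        ("mail", 0): mail_cost,
        ("mail", 1): mail_cost - donation,
        ("skip", 0): 0.0,
        ("skip", 1): 0.0,
    }


# Same response probability, different example-dependent payoffs.
large = best_action(0.05, mailing_costs(donation=30.0))
small = best_action(0.05, mailing_costs(donation=10.0))
```

With these illustrative numbers, the two examples share the same response probability but receive different decisions, because the cost table differs per example: the expected cost of mailing is negative for the larger payoff and positive for the smaller one.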
Pre-2018 CSE ID: CS2001-0664