15  Classification models

15.1 Discussion: Assessing classification models and precision/recall

Below, I try to explain classification problems, precision and recall, and the ROC curve and ROC-AUC (machine-learning validation measures) by analogy …

This loosely connects to the classification models used in the ‘EA Survey Donations post/bookdown’, particularly the consideration of model performance.

Spy-detection: We have been asked to come up with a method for classifying spies based on observable characteristics of a suspect population (‘population’ in the statistical sense). For example, some of them have moustaches, some wear pointed hats, some have sinister cackles and lapdogs to a greater or lesser degree, and some appear to be wealthier than others… Let us suppose that:

  • We have previous similar suspect populations that we know to be the same in nature (‘drawn from an identical pool’) as the future suspect populations we will need to make these predictions for.
  • We ultimately learned with certainty which members of the previous suspect populations turned out to be spies.
  • (To keep intuition and calculation simple) we know that in these populations half of the people are spies and half are completely innocent. (This is called ‘balanced classes’.)

We are asked to come up with a method that uses the observable characteristics to classify each individual as more or less likely to be a spy. Given a group of individuals’ characteristics, our method needs to assign a number to each one that might be interpreted as the “likelihood this person is a spy”. For example, our ‘model’ might say “anyone with a moustache is 75% likely to be a spy, and anyone without a moustache is 25% likely to be a spy.” (If half of the population wears moustaches, this would also give us an average prediction in line with the known overall 50% spy rate.) Or, to make the ‘model’ a bit more complicated (while allowing us to ‘order’ the predictions), we might give the probability of being a spy as

\(0.75 - 0.25 \cdot wtdev\)

if they wear a moustache, and

\(0.25 - 0.25 \cdot wtdev\)

otherwise, where \(wtdev\) represents some “normalized” measure of the person’s weight relative to the average weight in the sample. We predict that skinnier people are slightly more likely to be spies.

Note that, since everyone’s weight is slightly different, this will yield a slightly different predicted probability for each observed individual. Thus, our model can also be seen as a “ranking”: for any group of individuals we are given, we can order them from least likely to be a spy (the fattest clean-shaven people) to most likely to be a spy (the skinniest moustache-wearer).
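To make the toy model concrete, here is a minimal sketch (the suspects, their moustache indicators, and their \(wtdev\) values are invented purely for illustration):

```python
# Toy spy-scoring model from the text: base probability 0.75 with a moustache,
# 0.25 without, shifted down by 0.25 * wtdev (so skinnier people look more suspicious).
def spy_probability(has_moustache, wtdev):
    base = 0.75 if has_moustache else 0.25
    return base - 0.25 * wtdev

# Hypothetical suspects: (name, has_moustache, wtdev), where wtdev is the
# normalized deviation of the person's weight from the sample average.
suspects = [
    ("A", True, -0.8),   # skinny moustache-wearer
    ("B", True, 0.4),
    ("C", False, -0.5),
    ("D", False, 0.9),   # heavy and clean-shaven
]

# The model also induces a ranking: sort from least to most likely to be a spy.
for name, moustache, wtdev in sorted(suspects, key=lambda s: spy_probability(s[1], s[2])):
    print(name, round(spy_probability(moustache, wtdev), 3))
# Prints D (0.025), C (0.375), B (0.65), A (0.95): the skinny moustache-wearer tops the list.
```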

How should we evaluate a model, either in absolute terms or against other models? This depends on how the model’s predictions will be used, and on the cost of getting it right or wrong in either direction.

Our model might be applied by a very soft judge (think Mr. Van Driessen from B&B) who thinks it is much worse to convict an innocent person than to set free many spies. On the other hand, it might be applied by a very tight-ass judge (think the gym coach) who puts almost no cost on convicting the innocent … he just wants to catch spies. How can we conceive of the ‘success’ of our model?

We could compare across different models to see if one does ‘universally better or worse’, in a sense to be defined. Suppose each model gives a probability of being a spy for each person. As noted, this implies that each model “rank-orders” a population of individuals in terms of most to least likely to be a spy.

In a given (test) set of data, we might see something like the number-line plot of dots above: the person our model claims is most likely to be a spy is on the right, and the one predicted least likely is on the left. In application, the liberal judge (“vDriessen”) might convict only the “top 1/3” of the distribution, while the tight-ass (“Coach”) might put 2/3 of these ‘dots’ in prison. Each has made a different type of mistake in practice: vDriessen let a bunch of spies go free (and falsely convicted only a few), while the Coach put a lot of innocent people in prison (and, in practice, though not in the dots example above, would also let a few spies go free).
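Here is a hedged sketch of how the same ranked list, cut at two different points, produces the two judges’ different mistakes (the scores and spy labels below are made up):

```python
# Predicted spy probabilities and true labels (1 = spy, 0 = innocent) for a
# hypothetical test group of ten suspects.
scores = [0.95, 0.85, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    1,    0,    1,    0,    0,    0]

def convictions(scores, labels, share_convicted):
    """Convict the top `share_convicted` fraction of the ranked list; return
    (spies convicted, innocents convicted, spies set free)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    k = round(share_convicted * len(scores))
    convicted = order[:k]
    spies_caught = sum(labels[i] for i in convicted)
    innocents_jailed = sum(1 - labels[i] for i in convicted)
    spies_freed = sum(labels) - spies_caught
    return spies_caught, innocents_jailed, spies_freed

print("vDriessen (top 1/3):", convictions(scores, labels, 1 / 3))  # (2, 1, 3): few innocents jailed, several spies freed
print("Coach     (top 2/3):", convictions(scores, labels, 2 / 3))  # (5, 2, 0): more spies caught, more innocents jailed
```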

We might consider that our model has some probability of being applied by various types of judges. It might be that our model performs “better” than an alternative model when applied by the liberal vDriessen but “worse” when applied by the strict Coach. On the other hand, we can imagine that some models might perform better for nearly all types of judge! How could we measure this? The ROC curve can be helpful:

Above we see models that are strictly ordered from best to worst … the lines do not cross (this need not always be the case). The blue line depicts a model that is strictly better than the orange line, which is itself strictly better than a “random classifier”. A ‘random classifier’ gives us no information: if we use it, all we can do is choose some probability of convicting each person. Using this, if we increase the conviction rate by 2 per 100, then on average we convict one more spy and one more innocent person, in our example. So, by what measure is the blue classifier ‘better’ than the orange one (and both better than the random classifier)?
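To spell out that arithmetic as a worked check (under the balanced-classes assumption introduced above): if we convict an additional, randomly chosen fraction \(\delta\) of a pool of \(N\) suspects, then in expectation we convict \(\tfrac{1}{2}\delta N\) more spies and \(\tfrac{1}{2}\delta N\) more innocents. With \(\delta = 2/100\) and \(N = 100\), that is one more spy and one more innocent, as stated. On an ROC plot, this tradeoff traces out the 45-degree diagonal.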

Remember, each classifier gives a ‘ranked ordering’ of the population. Suppose that we consider going down the list from the most guilty-looking to the most innocent-looking, increasing the number of people that we choose to convict.

As we go down the ranking and decide to convict a larger share of the population, we add some possibility of convicting an innocent person (the ‘cost’) in order to gain the ‘benefit’ of convicting some true spies. As depicted in the ROC curves above, as we do this the blue curve (classifier) achieves a better tradeoff than the orange, green, or red curves … (this is because it gives us a better-ordered list of who is more likely to be guilty).

As the judge using the blue curve decides to go further down the ‘blue curve list’ (paying the cost of convicting more innocent people), he also catches more spies than if he had used the ‘orange, green, or red curve list’. The benefit-to-cost ratio is higher at all ‘rates of conviction’ here.
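A minimal sketch of how this tradeoff is traced out as an ROC curve (the labels and the two classifiers’ scores below are invented, with ‘blue’ deliberately better ordered than ‘orange’):

```python
# Sweep the conviction threshold from strictest to most lenient, one ranked case
# at a time, recording the false-positive rate (share of innocents convicted)
# and the true-positive rate (share of spies convicted) at each step.
def roc_points(scores, labels):
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

labels        = [1, 1, 0, 1, 0, 1, 0, 0]                    # 1 = spy, 0 = innocent
blue_scores   = [0.9, 0.8, 0.65, 0.7, 0.3, 0.6, 0.2, 0.1]   # nearly perfect ordering
orange_scores = [0.9, 0.6, 0.8, 0.7, 0.5, 0.4, 0.55, 0.1]   # noisier ordering

print("blue  :", roc_points(blue_scores, labels))
print("orange:", roc_points(orange_scores, labels))
# At every false-positive rate, 'blue' has caught at least as many spies as 'orange':
# its curve lies on or above the other, and both lie on or above the 45-degree diagonal.
```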

This is fine in a relative sense … we know blue is better here (better than all the others except for the purple ‘perfect classifier’, which is equivalent to a tool that always ‘perfectly orders’ guilt, so there is no tradeoff).

But:

  • What if the curves cross?
  • How do we consider the “overall” success of these classifier models? Maybe they are all doing a terrible job.

For the second question, the ‘area under the curve’ (AUC) seems like a decent measure: if I understand correctly, it somehow ‘averages’ the gain in insight over the random classifier across all possible types of judge.
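A hedged sketch of that area measure (the ROC points here are made up; in practice they would come from a threshold sweep like the one sketched earlier):

```python
def auc_trapezoid(points):
    """Trapezoid-rule area under a piecewise-linear ROC curve, given as
    (false_positive_rate, true_positive_rate) pairs sorted by FPR."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical ROC curves: a good classifier versus the random-classifier diagonal.
good_roc   = [(0.0, 0.0), (0.0, 0.75), (0.25, 0.75), (0.25, 1.0), (1.0, 1.0)]
random_roc = [(0.0, 0.0), (1.0, 1.0)]

print(auc_trapezoid(good_roc))    # 0.9375 -- close to the perfect score of 1.0
print(auc_trapezoid(random_roc))  # 0.5    -- no information beyond the base rate
```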

I have been told that the following is the case:

Given randomly chosen examples from classes 1 and 2, the AUC is the probability that our classifier assigns a higher probability of being in class 1 to the class-1 example (a true positive) than to the class-2 example (a false positive).

I am reinterpreting this more simply as…

Compare a randomly chosen spy (a case \(i\) where \(y_i=1\)) to a randomly chosen innocent person (a case \(j\) where \(y_j=0\)) … in the testing data. Recall that a ‘classifier’ (at least typically, e.g., as in a logistic regression) assigns to each case some probability \(\hat{p}\) of being a spy, based on its features. The AUC tells me “the probability that my classifier assigns a larger probability of guilt \(\hat{p}_i\) to the randomly chosen spy than the \(\hat{p}_j\) it assigns to the randomly chosen innocent person.”

An analogy has also been made to the Wilcoxon rank-sum (Mann-Whitney) statistic.
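A minimal simulation of that reinterpretation, with invented scores: draw many random (spy, innocent) pairs from the test data and record how often the model scores the spy higher. The long-run frequency (counting ties as one half) equals the AUC, and it is the same quantity the Wilcoxon/Mann-Whitney rank-sum statistic measures, up to rescaling.

```python
import random

random.seed(0)
spy_scores      = [0.9, 0.7, 0.6, 0.4]    # hypothetical predicted probabilities for true spies
innocent_scores = [0.8, 0.55, 0.5, 0.1]   # hypothetical predicted probabilities for innocents

# Repeatedly compare a randomly chosen spy's score with a randomly chosen innocent's score.
n_trials = 100_000
wins = 0.0
for _ in range(n_trials):
    s = random.choice(spy_scores)
    t = random.choice(innocent_scores)
    if s > t:
        wins += 1
    elif s == t:
        wins += 0.5   # conventional handling of ties
print(wins / n_trials)  # simulated AUC; the exact pairwise value here is 11/16 = 0.6875
```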