
Wednesday, September 28, 2011

SAS: Proc Logistic shows all tied

Logistic regression is used mostly for predicting binary events. I use logistic regression very often as a tool in my professional life, to predict various 0-1 outcomes. For carrying out logistic regression (and other statistical data processing jobs), I primarily use a popular statistical package called SAS. It has been around since the initial years of statistics-based marketing, and has established itself as a de facto standard in the risk analytics domain.

When you run a logistic regression in SAS, it shows you a lot of interesting and important parameters in the output. Among other things, it will show you the parameter estimates you can use to make predictions, various statistics about the data entered, and the statistical confidence of the individual estimates as well as of the overall model.

One of the summary reports that tells you how well the model is doing is found at the bottom of the main output -
Association of Predicted Probabilities and Observed Responses
Percent Concordant     85.6    Somers' D    0.714
Percent Discordant     14.2    Gamma        0.715
Percent Tied            0.2    Tau-a        0.279
Pairs                  7791    c            0.857


An analyst can look at all the above parameters to make a quick judgement on how well the model will perform when put to the test of predicting the outcome on a new set of data. The 'c' is basically the area under the ROC curve, and Somers' D corresponds to the gini coefficient under certain conditions. The c should ideally vary between 0.5 and 1, with 0.5 meaning the model is not working at all.
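For reference, every number in this table is derived from the counts of concordant pairs (\(n_c\)), discordant pairs (\(n_d\)) and tied pairs (\(n_t\)) out of the total number of pairs \(P\); for instance,
$$c = \frac{n_c + 0.5\,n_t}{P}, \qquad \text{Somers' } D = \frac{n_c - n_d}{P} = 2c - 1,$$
which you can verify against the table above (0.856 + 0.5 × 0.002 = 0.857, and 0.856 − 0.142 = 0.714).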

Therefore I was puzzled when I saw this in one of my outputs -
Association of Predicted Probabilities and Observed Responses
Percent Concordant          0.0    Somers' D    -.000
Percent Discordant          0.0    Gamma        -1.00
Percent Tied              100.0    Tau-a        -.000
Pairs                 183334788    c            0.500

The c is 0.5, and Somers' D is 0 - which means the model is pretty much useless. However, all the other tables in the output (not shown here) told me that the model's performance was good! Why would this happen? This repeated a couple of times recently for different models - in some cases the c was not 0.5, but it was still much lower than what I would expect from experience. One hint I had was that over time my team has been trying to model target events which are more and more rare. I could trace the problem to 'Percent Concordant/Discordant/Tied' being measured incorrectly in this particular table.

To measure these concordance percentages, you look at all possible pairs of observations in your pool which have opposite outcomes (assuming a binary outcome). Then you count in how many of these pairs the model predicts the outcomes the way they actually happened (concordant), in how many the model predicts the other way round and is therefore incorrect (discordant), and in how many the model score is exactly the same (tied). All the other values in this particular table are calculated from this kind of pairing. The table above tells us that all the pairs of observations with different outcomes are predicted to have exactly the same score - effectively translating into the model being completely useless.
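If you want to check what this table should have contained, a brute-force version of the pairing is easy to write. Below is a minimal sketch (not the algorithm SAS uses internally); it assumes a scored dataset called scored with the observed 0-1 outcome bad and the predicted probability phat - both placeholder names - and since it forms every bad-good pair explicitly it can be slow on large datasets.

/* Exact pair counting - every bad account paired with every good account */
proc sql;
   create table pair_counts as
   select sum(a.phat > b.phat) as n_concordant,
          sum(a.phat < b.phat) as n_discordant,
          sum(a.phat = b.phat) as n_tied,
          count(*)             as n_pairs
   from scored(where=(bad=1)) as a,
        scored(where=(bad=0)) as b;
quit;

data pair_counts;
   set pair_counts;
   pct_concordant = 100 * n_concordant / n_pairs;
   pct_discordant = 100 * n_discordant / n_pairs;
   pct_tied       = 100 * n_tied / n_pairs;
   somers_d       = (n_concordant - n_discordant) / n_pairs;
   c              = (n_concordant + 0.5 * n_tied) / n_pairs;   /* area under ROC */
run;

Because no binning or rounding of the scores is involved here, these numbers will not show the spurious 100% tied behavior.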

However, when I checked the data myself, I saw this was not the case. Why would SAS report incorrect statistics in this particular table? The answer was found after a lot of research - to speed up the calculations, SAS treats two scores as identical if they differ by less than 0.002. However, there is no mention of this in the output. The documentation was also difficult to find. I suspect it was added recently, along with an option that lets you prevent this from happening (more on this later).

This can lead to consequences of varying degree when the event being predicted is very rare. Since some or all of the pairs are considered tied when the scores are within 0.002 of each other, the concordance figures come out wrong. In all cases the result is an incorrect calculation of c or Somers' D, showing much lower predictive power (even zero in the extreme case above) than the model really has.

Ideally SAS should have done two things -
1. The output should label the table as "Estimated Association of Predicted Probabilities" or something along those lines.
2. The algorithm should vary the 0.002 threshold based on the overall event rate in the data.

What are the workarounds for the end users?

First, all analysts should be aware of this and not panic if they see strange numbers in this particular table. They should instead look at the other standard reports and tables they create to evaluate model performance, disregarding this table completely.

Second, there are two approaches you can take to get the right values.

Approach 1 is to use the option BINWIDTH=0 on the MODEL statement in PROC LOGISTIC. Other than the fact that it can take longer, as mentioned in the SAS documentation (which should be okay, since accuracy in this case should win over run time), the hitch is that the option appears in the SAS documentation for version 9.22 and does not work with SAS 9.1.3, which my organization uses (which makes me feel that the whole explanation was added recently).
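For those on 9.22 or later, the call would look something like this (mydata, bad and x1-x3 are placeholder names) -

proc logistic data=mydata;
   model bad(event='1') = x1 x2 x3 / binwidth=0;   /* 0 = no binning, pairs compared exactly */
run;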

Approach 2 is to write your own code/macros to calculate and report the correct values of whichever statistics from this table you normally look at (for example, gini or ROC). It should not be difficult for a seasoned analyst, and it is something I would recommend as a good exercise for someone with medium experience. After all, who knows what else could come as a surprise if you rely on the original SAS procedures?

Tuesday, September 27, 2011

All about "Information Value"

In statistical data mining, we sometimes need to determine which variables out of a set are best at capturing a desired behavior. For example, let's say you have a pool of customers for your credit card company, and you want to determine who among them is about to default (i.e. refuse to pay up after possibly making a huge expense). You then need to identify which of the attributes you have on the customers can potentially identify and alert you to such behavior. One of the popular ways analysts do this is by looking at something called 'Information Value'. In the context of data mining it is also sometimes referred to by the short form - InfoVal.

Definition
Information Value of \(x\) for measuring \(y\) is a number that attempts to quantify the predictive power of \(x\) in capturing \(y\). Let's assume the target variable \(y\), which we are interested in being able to measure, is a 0-1 variable (or an indicator). Let's further assume that it is the indicator of accounts which will go bad in the immediate future. Let's now divide our population into 10 equal parts after sorting the entire pool by \(x\), creating the deciles. Now we are all set to define Information Value -
$$IV_x = \sum_{i=1}^{10}{\left(bad_i-good_i\right)\ln\frac{bad_i}{good_i}}$$
Here,
\(i\) runs over the 10 deciles into which we have divided the data,
\(bad_i\) is the proportion of bad accounts captured in the \(i\)th decile out of all bad accounts in the population,
\(good_i\) similarly is the proportion of good (i.e. not bad) accounts in the \(i\)th decile.

Note that the variable whose effectiveness you want to measure enters the calculation through the sorting - it is the variable by which the entire data is sorted and divided into deciles.
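To make the recipe concrete, here is a minimal SAS sketch of the calculation; pool, x and bad are placeholder names for the dataset, the candidate variable and the 0-1 target respectively.

/* Step 1: sort by x and form the deciles */
proc rank data=pool groups=10 out=ranked;
   var x;
   ranks decile;                     /* decile takes values 0 to 9 */
run;

/* Step 2: proportions of bads and goods per decile, then sum the IV terms */
proc sql;
   create table iv_parts as
   select decile,
          sum(bad=1) / (select sum(bad=1) from ranked) as bad_i,
          sum(bad=0) / (select sum(bad=0) from ranked) as good_i,
          (calculated bad_i - calculated good_i)
             * log(calculated bad_i / calculated good_i) as iv_i
   from ranked
   group by decile;

   select sum(iv_i) as information_value
   from iv_parts;
quit;

Note that log() in SAS is the natural logarithm, matching the \(\ln\) in the definition, and that any decile with zero goods or zero bads will make the calculation fail - one of the issues discussed below.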

How does it work?
But why does it work? You can check that if \(x\) has no information on \(y\) at all, then \(IV\) will turn out to be zero. That's because when you sort by \(x\) and create deciles, the deciles are as good as random with respect to \(y\). Hence, each decile should capture 10% of the total bads and 10% of the total goods, so \(bad_i-good_i=0\) and \(\ln\frac{bad_i}{good_i}=0\), and every term of the sum vanishes.

On the other hand, if after sorting by \(x\) some decile has a higher or lower concentration of bads than goods, then that particular decile is different from the overall population, and \(x\) lets us create it. Such a decile contributes a positive value to the summation which defines \(IV\) in the equation above (both factors in the term have the same sign). So it is clear that for a good \(x\) variable there will be more such deciles where the proportions of goods and bads differ - and by a larger margin the more effective \(x\) is in capturing \(y\) - hence \(IV\) indeed gives a measure of the predictive power of \(x\).

Issues
However, there is something artificial in the definition of \(IV\) above - the functional form. Indeed, there are many different ways one could choose the function being summed.

To give some examples - \(\sum_{i=1}^{10}{\left(bad_i-good_i\right)^2}\), \(\sum_{i=1}^{10}{\left|bad_i-good_i\right|\frac{n_i}{n}}\), etc. all should be equally good candidates.

The last one in particular is interesting - because it has the proportion \(n_i/n\) in the formula, it is a consistent measure. That is, if you divide the data into 20 parts, 30 parts, and so on, the value approaches a limit; incidentally, the limit it converges to is essentially gini/2 under some assumptions. For \(IV\), on the other hand, this leads to a problem - you cannot divide the data indefinitely, as you may hit segments which have no good (or no bad) accounts at all, in which case taking the \(\ln\) will blow up.

Also, for the same reason, it is inaccurate and unfair to use \(IV\) to compare two variables when one of them has ties - ties make it impossible to unambiguously divide the population into 10 equal parts. (This is worth stressing because it is a mistake I have seen many analysts make, even when drawing Lorenz curves, for example, where the inequality of the deciles can and should always be taken into account. For \(IV\), it simply cannot be.)

Origin
If \(IV\) has these problems, how did the definition come about in the first place? And should you use it at all?

I can only guess how such a metric came into being and became popular despite its drawbacks. The concept of information arises in several other branches of mathematics, e.g. information theory, where you measure a very interesting quantity called entropy (which vaguely resembles \(IV\)), and Fisher's information of a variable (which deals with how much information a statistic contains about a particular parameter of a distribution). I think someone was aware of these concepts and wanted to create something similar for the corporate world to measure the "information captured in a variable".

The way most corporate workplaces operate, you show that something works and it becomes the de facto standard unless someone else challenges it and shows that something else works better. Such challenges are rare in most places, which is why senior leadership needs to make a conscious effort to engage employees in thinking rather than following tradition. For \(IV\), this was pretty much the case. We have already seen that it works, and after it was shown to work it became a standard - without proper mathematical/statistical backing to explore its pros and cons and to see whether it could be bettered. And once something becomes a standard, it gains inertia and becomes legacy - which propels it further through 'common wisdom'. The employees of that particular organization learned about it (they learned the formula, understood what it tries to measure, learned the SAS code or SQL queries required to compute it), and when they migrated, the knowledge migrated with them.

Recommendation
The only reason you might want to use \(IV\) is legacy - you may have worked at a bank which used it, and may have developed some reference points (i.e. rules of thumb like "if \(IV\) is more than \(abc\) then the variable is good, otherwise it is bad") for good or bad \(IV\) values. Even if you have used \(IV\) before, it should not take very long to develop similar reference points for gini - you will be able to do that over the course of one or two projects. In general, though, I recommend not using \(IV\), in favor of other more intuitive measures - ROC, gini, KS to name a few. I personally prefer gini - it is easy to understand, is consistent, and is very robust in measuring the power of a variable.

To conclude, though \(IV\) is still popular in some corners of the banking world, I would recommend not using it, in favor of gini/ROC, which measure the same thing while being more intuitive and without the flaws of \(IV\).