Logistic regression is used mostly for predicting binary events. I use it very often in my professional life to predict various 0-1 outcomes. For running logistic regressions (and other statistical data processing jobs) I primarily use SAS, a popular statistical package. It has been around since the early years of statistics-based marketing and has established itself as a de facto standard in the risk analytics domain.
When you run a logistic regression in SAS, the output shows you a number of interesting and important results. Among them are the parameter estimates with which you can make predictions, various statistics about the data entered, and the statistical confidence of the individual estimates as well as of the overall model.
One of the summary reports that tells you how well the model is doing appears at the bottom of the main output:
Association of Predicted Probabilities and Observed Responses

    Percent Concordant      85.6    Somers' D    0.714
    Percent Discordant      14.2    Gamma        0.715
    Percent Tied             0.2    Tau-a        0.279
    Pairs                   7791    c            0.857
An analyst can look at all the above parameters to make a quick judgement on how well the model will perform when put to the test of predicting outcomes on a new set of data. The 'c' is basically the area under the ROC curve, and Somers' D corresponds to the Gini coefficient under certain conditions. The c should ideally range between 0.5 and 1, with 0.5 meaning the model is not working at all.
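Incidentally, these summary statistics are tied together by a simple identity. With C, D and T the counts of concordant, discordant and tied pairs, and P = C + D + T the total number of pairs:

    c = (C + 0.5*T) / P
    Somers' D = (C - D) / P = 2c - 1

The table above obeys it: 2 × 0.857 - 1 = 0.714.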
Therefore I was puzzled when I saw this in one of my outputs:
Association of Predicted Probabilities and Observed Responses

    Percent Concordant       0.0    Somers' D    -.000
    Percent Discordant       0.0    Gamma        -1.00
    Percent Tied           100.0    Tau-a        -.000
    Pairs              183334788    c            0.500
The c is 0.5 and Somers' D is 0, which would mean the model is pretty much useless. However, all the other tables in the output (not shown here) told me that the model's performance was good! Why would this happen? It repeated a couple of times recently for different models; in some cases the c was not 0.5, but it was still much lower than what I would expect from experience. One clue I had was that over time my team had been modelling target events that are more and more rare. I eventually traced the problem to 'Percent Concordant/Discordant/Tied' being measured incorrectly in this particular table.
To measure these concordance percentages, you look at all possible pairs of observations in your pool that have opposite outcomes (assuming a binary outcome). You then count in how many of these pairs the model ranks the outcomes the way they actually happened (concordant), in how many the model predicts the other way round and is therefore incorrect (discordant), and in how many the model scores are exactly the same (tied). All the other values in this particular table are calculated from this kind of pairing. The table above is telling us that every pair of observations with different outcomes was predicted to have exactly the same score, which effectively translates into the model being completely useless.
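The counting itself is easy to replicate outside the procedure. Here is a minimal sketch in SAS, assuming a hypothetical scored dataset SCORED with the observed outcome BAD (0/1) and the predicted probability PHAT; it brute-forces the cross join of events against non-events, so it is only practical on modest sample sizes:

    /* Each comparison evaluates to 1 or 0, so the SUMs count pairs */
    proc sql;
       create table pairs as
       select sum(e.phat > n.phat) as concordant,
              sum(e.phat < n.phat) as discordant,
              sum(e.phat = n.phat) as tied
       from scored(where=(bad=1)) as e,
            scored(where=(bad=0)) as n;
    quit;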
However, when I checked the data myself, I saw this was not the case. Why would SAS report incorrect statistics in this particular table? The answer was found after a lot of research: to optimize the calculations, SAS assumes two scores are identical if they differ by less than 0.002. Notice, however, that there is no mention of this in the output, and the documentation was also difficult to find. I suspect it was added recently, along with an option that lets you prevent this from happening (more on this later).
This can have consequences of varying degrees when the event being predicted is very rare. When the event rate is tiny, most predicted probabilities are themselves tiny and sit close together, so some or all of the pairs fall within 0.002 of each other and are counted as ties. The result is an incorrect calculation of c or Somers' D, showing much lower predictive power (even zero, as in the extreme case above) than the model really has.
Ideally, SAS should have done two things:
1. The output should label the table as "Estimated Association of Predicted Probabilities" or something along those lines.
2. The algorithm should vary the 0.002 threshold based on the overall rate of the event in the data.
What are the workarounds for the end users?
First, all analysts should be aware of this issue and not panic if they see strange numbers in this particular table. They should instead look at the other standard reports and tables they create to evaluate model performance, disregarding this table completely.
Second, there are two approaches you can take to get the right values.
Approach 1 is to use the BINWIDTH=0 option on the MODEL statement of PROC LOGISTIC, as shown below. Apart from the fact that it can take longer, as mentioned in the SAS documentation (which should be okay, since accuracy in this case should win over time taken), the hitch is that the option appears in the SAS documentation for version 9.22 and does not work with SAS 9.1.3, which my organization uses (which makes me feel that this whole explanation was added recently).
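For those on a recent enough version, it is a one-line change. A sketch with hypothetical dataset and variable names:

    /* BINWIDTH=0 compares the raw predicted probabilities exactly,
       instead of bucketing them into 0.002-wide bins (SAS 9.2+) */
    proc logistic data=devsample;
       model bad(event='1') = x1 x2 x3 / binwidth=0;
    run;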
Approach 2 is to write your own code/macros to calculate and report the correct values of whichever statistics from this table you normally look at (for example, Gini or the ROC area); a sketch follows below. It should not be difficult for a seasoned analyst, and it is something I would recommend as a good exercise for someone with medium experience. After all, who knows what else could come as a surprise if you rely on SAS's built-in procedures?
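As a starting point, the exact pair counts from the PROC SQL step shown earlier are enough to recover the headline statistics. A minimal sketch, assuming the counts were saved in the one-row table PAIRS created above:

    data assoc;
       set pairs;
       total    = concordant + discordant + tied;
       c        = (concordant + 0.5*tied) / total;    /* area under the ROC curve */
       somers_d = (concordant - discordant) / total;  /* equals 2*c - 1, i.e. the Gini */
    run;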