Tuesday, September 27, 2011

All about "Information Value"

In statistical data mining, we often need to determine which variables out of a set best capture a desired behavior. For example, say you have a pool of customers at your credit card company, and you want to determine who among them is about to default (i.e. refuse to pay up, possibly after making a huge expense). You then need to identify which of the attributes you have on a customer can potentially identify and alert you to such behavior. One of the popular ways analysts do this is by looking at something called 'Information Value', which in the data mining context is also sometimes referred to by the short form - InfoVal.

Definition
Information Value of \(x\) for measuring \(y\) is a number that attempts to quantify the predictive power of \(x\) in capturing \(y\). Let's assume the target variable \(y\) which we are interested in measuring is a 0-1 variable (an indicator) - say, the indicator of whether an account will go bad in the immediate future. Now sort the entire pool by \(x\) and divide the population into 10 equal parts, creating the deciles. With that, we are all set to define Information Value -
$$IV_x = \sum_{i=1}^{10}{\left(bad_i-good_i\right)\ln\frac{bad_i}{good_i}}$$
Here,
\(i\) runs over the 10 deciles into which we have divided the data,
\(bad_i\) is the proportion of all bad accounts in the population that fall in the \(i\)th decile,
\(good_i\) is similarly the proportion of all good (i.e. not bad) accounts that fall in the \(i\)th decile.

Note that the variable whose effectiveness we want to measure does get used: it is the variable by which the entire data is sorted and divided into deciles.
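
To make the definition concrete, here is a minimal sketch in Python (using numpy; the function name information_value and the equal-sized-split choice are illustrative, not a standard implementation):

```python
import numpy as np

def information_value(x, y, n_bins=10):
    """IV of score x for a 0-1 target y (1 = bad), per the formula above.

    Sorts by x, splits into n_bins roughly equal groups, and sums
    (bad_i - good_i) * ln(bad_i / good_i) over the groups. Note the log
    blows up if a group has no goods or no bads.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=int)
    order = np.argsort(x, kind="mergesort")   # stable sort by x
    groups = np.array_split(order, n_bins)    # ~equal-sized deciles
    total_bad = y.sum()
    total_good = len(y) - total_bad
    iv = 0.0
    for idx in groups:
        bads_here = y[idx].sum()
        bad_i = bads_here / total_bad                   # share of all bads in this decile
        good_i = (len(idx) - bads_here) / total_good    # share of all goods in this decile
        iv += (bad_i - good_i) * np.log(bad_i / good_i)
    return iv
```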

How does it work?
But why does it work? You can check that if \(x\) has no information on \(y\) at all, then the \(IV\) turns out to be zero. That's because when you sort by \(x\) and create deciles, the deciles are as good as random with respect to \(y\). Hence each decile should capture roughly 10% of total bads and 10% of total goods, so \(bad_i-good_i=0\) and \(\ln\frac{bad_i}{good_i}=0\), and every term of the sum vanishes.

On the other hand, if after sorting by \(x\) some decile has a higher or lower concentration of bads than goods, that means the particular decile is different from the overall population, and \(x\) lets us carve it out. Such a decile contributes a positive value to the summation defining \(IV\) above (both factors of the term have the same sign). So for a good \(x\) variable there will be more of such deciles where the proportions of goods and bads differ, and by a larger margin the more effective \(x\) is in capturing \(y\) - hence \(IV\) indeed gives a measure of the predictive power of \(x\).
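
Both cases are easy to simulate, reusing the information_value sketch above (the 5% bad rate and the way the informative variable is constructed are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
y = (rng.random(n) < 0.05).astype(int)    # ~5% bad rate

x_noise = rng.normal(size=n)              # carries no information about y
x_signal = y + rng.normal(size=n)         # shifted upward for bads, so it separates them

# assumes information_value() from the sketch in the Definition section is in scope
print(information_value(x_noise, y))      # close to zero
print(information_value(x_signal, y))     # clearly positive
```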

Issues
However, there is something artificial in the definition of \(IV\) above - the functional form. There are many different ways one could construct the quantity that is being summed up.

To give some examples - \(\sum_{i=1}^{10}{\left(bad_i-good_i\right)^2}\), \(\sum_{i=1}^{10}{\left|bad_i-good_i\right|\frac{n_i}{n}}\), etc. all should be equally good candidates.

The last one in particular is interesting, because it has the population proportion \(\frac{n_i}{n}\) in the equation, making it a consistent measure. That is, if you decide to divide the data into 20 parts, 30 parts and so on, you move closer and closer to a limit. Incidentally, the limit to which it converges is essentially gini/2 under some assumptions. For \(IV\), on the other hand, this leads to a problem - you cannot divide the data indefinitely, as you may hit segments which have no good (or no bad) accounts at all, in which case taking the \(\ln\) will blow up.

For the same reason, it is inaccurate and unfair to compare two variables by \(IV\) when one of them has ties, which make it impossible to unambiguously divide the population into 10 equal parts. (This is worth stressing because it is a mistake I have seen many analysts make, even when drawing Lorenz curves, for example, where the inequality of decile sizes can and should be taken into account. For \(IV\), it cannot be.)
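
Both practical issues are easy to reproduce with the information_value sketch above: push the number of bins high enough and some bin ends up with no bads (or no goods) and the log blows up; add heavy ties in \(x\) and the value starts depending on how ties happen to be broken across deciles. The setup below is illustrative only:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
y = (rng.random(n) < 0.05).astype(int)
x = y + rng.normal(size=n)

# assumes information_value() from the sketch in the Definition section is in scope

# 1. Dividing indefinitely: with tiny bins, some contain zero bads or zero goods,
#    log(0) appears and the sum is no longer finite.
for k in (10, 100, 1_000, 20_000):
    print(k, information_value(x, y, n_bins=k))

# 2. Ties: with a heavily tied variable, which records land in which decile depends
#    on an arbitrary tie-break (here, the record order), so the same variable can
#    come out with different IVs.
x_tied = np.round(x)                      # only a handful of distinct values
perm = rng.permutation(n)
print(information_value(x_tied, y))
print(information_value(x_tied[perm], y[perm]))
```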

Origin
If \(IV\) has these problems, how did the definition come about in the first place? And should you use it?

I can only guess how such a metric came into being and became popular despite its drawbacks. The concept of information arises in several other branches of mathematics, e.g. Information Theory, where you measure a very interesting quantity called entropy (which vaguely resembles \(IV\)), and Fisher's Information (which deals with how much information a statistic contains about a particular parameter of a distribution). I think someone was aware of these concepts and wanted to create something similar for the corporate world to measure the "information captured in a variable".

In most corporate settings, the way things work is that you show something works, and it becomes de facto unless someone else challenges it and shows something else works better. Such challenges are rare in most places, which is why senior leadership needs to make a conscious effort to engage employees in thinking rather than following tradition. For \(IV\), this was pretty much the case. We have already seen that it works, and after it was shown to work it became a standard, without proper mathematical/statistical backing to explore its pros and cons and to see whether it could be bettered. And once something becomes a standard, it gains inertia and becomes legacy, which propels it through 'common wisdom'. The employees of that particular organization learned about it (they learned the formula, understood what it tries to measure, learned the SAS code or SQL queries required to compute it), and when they migrated, the knowledge migrated with them.

Recommendation
The only reason you might want to use \(IV\) is legacy - you may have worked at a bank which used it, and may have developed some reference points (i.e. rules of thumb like "if \(IV\) is more than \(abc\) then the variable is good, otherwise it is bad") for good or bad \(IV\) values. Even so, it should not take very long to develop similar reference points for gini - you will be able to do that over the course of one or two projects. In general, though, I recommend not using \(IV\), in favor of other more intuitive measures - ROC, gini and KS, to name some. I personally prefer gini - it is easy to understand, is consistent, and is very robust in measuring the power of a variable.
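
For reference, gini can be computed directly from the ranks of \(x\) as 2*AUC - 1. A minimal sketch, assuming numpy and scipy are available (the function name is just illustrative):

```python
import numpy as np
from scipy.stats import rankdata

def gini(x, y):
    """Gini (2*AUC - 1) of score x for a 0-1 target y (1 = bad).

    Uses the rank-sum (Mann-Whitney) form of AUC, with average ranks so
    that ties in x are handled consistently.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=int)
    n_bad = y.sum()
    n_good = len(y) - n_bad
    ranks = rankdata(x)                   # average ranks; tied values share a rank
    auc = (ranks[y == 1].sum() - n_bad * (n_bad + 1) / 2) / (n_bad * n_good)
    return 2 * auc - 1
```

The rank-sum form keeps the computation simple and makes the tie handling explicit, which is precisely where the decile-based \(IV\) runs into trouble.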

To conclude, though \(IV\) is still popular in some corners of the banking world, I would recommend not using it, in favor of gini/ROC, which measure the same thing while being more intuitive and without the flaws of \(IV\).

4 comments:

  1. At a more macro level, this is a great way to compute the *relevance* of a unit (or even a portfolio) of information to any given business function. Computing it would give IT managers a sense of how to prioritize information management investments.

    For an ongoing discussion of information as a corporate asset and quantifying its value: www.centerforinfonomics.org or the Center for Infonomics LinkedIn Group.

  2. Query: How would ROC/Gini capture non-monotonic relationships in the data?

  3. It won't. If you suspect a non-monotonic relationship, you have no choice other than to bucket your data, i.e. bin it into parts by your \(x\) variable. Then you order the buckets by the average of the target variable, and calculate Gini. But you have to be careful not to overfit while choosing the number of buckets.
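
A minimal sketch of the bucket-and-reorder idea described in that reply, reusing the gini() sketch from the Recommendation section (the bucket count of 10 and the U-shaped example are arbitrary illustrations):

```python
import numpy as np

def binned_gini(x, y, n_bins=10):
    # assumes gini() from the sketch in the Recommendation section is in scope
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=int)
    order = np.argsort(x, kind="mergesort")
    score = np.empty(len(x))
    for idx in np.array_split(order, n_bins):   # bucket the data by x
        score[idx] = y[idx].mean()              # each bucket's bad rate becomes its score
    return gini(score, y)                       # ranking by bad rate = re-ordered buckets

# U-shaped relationship: raw gini is near zero, binned gini picks up the signal
rng = np.random.default_rng(3)
x = rng.normal(size=100_000)
y = (rng.random(100_000) < np.where(np.abs(x) > 1.5, 0.25, 0.02)).astype(int)
print(gini(x, y), binned_gini(x, y))
```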

  4. to Anonymous: KS is agnostic to the monotonic character of the bivariate relation, though it is keyed off a kind of max deviation. I like the idea of re-ordering to calculate the Gini. It is no more arbitrary than IV and you at least know what you are measuring, hence Gini << IV <= Voodoo.
