
Monday, November 11, 2013

Regressing \(\ln(Y)\) instead of \(Y\)

TLDR: If you have an estimate of \(Z\), you can't just take \(e^{\text{estimate}}\) to estimate \(E(e^Z)\).

A bias correction factor of \(e^{\hat\sigma^2/2}\) has to be applied to the "common sense" estimator \(e^{\hat{E(Z)}}\) to correctly estimate \(E(Y)=E(e^Z)\). The right estimate is \(E(Y)\ \hat =\ e^{\hat \sigma^2/2}e^{\hat{E(Z)}}\).


Let's take an attribute \(Y\) which has a lognormal distribution - e.g. income, spend, revenue, etc. Since \(Z=\ln(Y)\sim N(\mu,\sigma^2)\), we may choose to model \(Z\) instead of \(Y\), and aspire to get a better estimate of \(Y\) from our estimate of \(\ln(Y)\).

Suppose we model \(Z=\ln(Y)\) instead of \(Y\), so that \(Y=e^Z\). We estimate \(E(Z)=\mu\ \hat=\ \hat \mu= f(X)\) based on independent variables \(X\). (Read the symbol \(\hat =\) as "estimated as".)

Given \(\hat \mu\) estimates \(E(Z)\), a common-sense option to estimate \(E(Y)\) might seem to be \(e^{\hat \mu}\), since \(Y=e^Z\).

But this will not give the best results - simply because \(E(Y)=E(e^Z)\ne e^{E(Z)}\). In fact, by Jensen's inequality, \(e^{E(Z)}\) systematically underestimates \(E(e^Z)\).

\(E(Y)=e^{\mu+\sigma^2/2}\), where \(\sigma^2\) is the variance of the error \(Z-\hat Z\) - and hence a good estimate of \(E(Y)\) would be \(E(Y)\ \hat=\ e^{\hat \mu+\hat \sigma^2/2}\).
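
To see the gap concretely, here is a quick Monte Carlo check - a minimal sketch in Python with numpy (not part of the original post; \(\mu\) and \(\sigma\) are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.8                     # illustrative values, not from the post

z = rng.normal(mu, sigma, size=1_000_000)
y = np.exp(z)                            # Y = e^Z is lognormal

print(y.mean())                          # sample E(Y), close to e^(mu + sigma^2/2) ~ 10.2
print(np.exp(z.mean()))                  # naive e^(mean of Z), close to e^mu ~ 7.4 - biased low
print(np.exp(mu + sigma**2 / 2))         # the corrected formula, matching the sample mean of Y
```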

Estimating \(\sigma^2\)

We are used to estimating \(E(Z)\hat=\hat \mu\), which is just the familiar regression estimate \(\sum \hat \beta_i X_i\). We will now need to estimate \(\hat\sigma^2\) too, to get an accurate point estimate of \(E(Y)=E(e^Z)\).

OLS


If you are running an Ordinary Least Squares regression, an unbiased estimate for \(\sigma^2\) is \(\frac{SSE}{n-k}\) where \(n\)=#observations, and \(k\)=#parameters in the model.

Most statistical packages report this - and if not, you can calculate it as \(\sum (Z-\hat Z)^2/(n-k)\). SAS reports it if you use PROC REG; in fact, \(\hat \sigma\) is already reported there as "Root MSE", so you can directly take \(\text{Root MSE}^2\) as an estimate of \(\sigma^2\).
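
Outside SAS, the whole recipe is only a few lines. Here is a minimal sketch in Python with numpy - the simulated data, the variable names and the true coefficients are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate data where ln(Y) is linear in X plus normal noise (purely illustrative)
n, k = 5000, 3                                    # k = number of parameters (intercept + 2 slopes)
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 0.5, -0.3])
z = X @ beta_true + rng.normal(0, 0.6, size=n)    # Z = ln(Y)
y = np.exp(z)

# OLS fit of Z on X
beta_hat, *_ = np.linalg.lstsq(X, z, rcond=None)
z_hat = X @ beta_hat

# Unbiased estimate of sigma^2: SSE / (n - k), i.e. the square of SAS's "Root MSE"
sigma2_hat = np.sum((z - z_hat) ** 2) / (n - k)

# Naive vs. bias-corrected estimates of E(Y)
y_hat_naive = np.exp(z_hat)
y_hat_corrected = np.exp(z_hat + sigma2_hat / 2)

print(y.mean(), y_hat_naive.mean(), y_hat_corrected.mean())
```

On average, the naive estimate comes in low, while the corrected one tracks the observed mean of \(Y\).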

Other Regression Frameworks (Machine Learning - RandomForest, NN, KNN, etc.)


A generic way of estimating \(\sigma^2\) is to borrow the assumption of homoscedasticity from OLS - i.e. that \(\sigma^2\) does not vary from observation to observation.

Under this assumption, the law of large numbers can be used to show that \(\sum (Z-\hat Z)^2/n\) converges in probability to \(\sigma^2\), and hence remains a good estimator - even if it may not be unbiased for small \(n\).

If the number of parameters in the model is known, it is recommended to use \(\sum (Z-\hat Z)^2/(n-k)\), mimicking the OLS estimator - this corrects for the bias to some extent, although for large \(n\) the difference between \(1/(n-k)\) and \(1/n\) is small.
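
As a hedged sketch of what that looks like in practice, here is the same correction applied around a scikit-learn RandomForestRegressor (scikit-learn is my choice of tool, not something from the post; X and y are assumed to be numpy arrays like the ones simulated above):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

z = np.log(y)                               # model Z = ln(Y) instead of Y

rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X, z)
z_hat = rf.predict(X)

# Homoscedasticity assumption: a single sigma^2 for all observations.
# With the number of parameters k unknown, fall back to dividing by n.
sigma2_hat = np.mean((z - z_hat) ** 2)

y_hat = np.exp(z_hat + sigma2_hat / 2)      # bias-corrected point estimate of E(Y)
```

In practice it is safer to compute the residuals on a holdout sample (or out-of-bag predictions) rather than in-sample, since a flexible learner can fit its own training residuals too well and understate \(\sigma^2\).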

 

Proof

If \(Z=\ln(Y)\sim N(\mu,\sigma^2)\), then \(E(Y)=E(e^Z)=e^{\mu+\sigma^2/2}\).

Citing the mean of the lognormal distribution on Wikipedia may work as "proof" in most cases. Just for completeness, a full mathematical proof is also given below.

\[E(e^Z)=\int_{-\infty}^{\infty}{e^z\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(z-\mu)^2}{2\sigma^2}}}\,dz=\int_{-\infty}^{\infty}{\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{\color{#3366FF}{(z-\mu)^2-2\sigma^2 z}}{2\sigma^2}}}\,dz\]
 \[\begin{array}{rcl}
\color{#3366FF}{(z-\mu)^2-2\sigma^2 z}&=&z^2-2\mu z + \mu^2-2\sigma^2z\\
&=&z^2-2(\mu+\sigma^2) z + \mu^2\\
&=&\left(z-(\mu+\sigma^2)\right)^2 + \mu^2-(\mu+\sigma^2)^2\\
&=&\left(z-(\mu+\sigma^2)\right)^2 - 2\mu\sigma^2-\sigma^4\\
&=&\color{green}{\left(z-(\mu+\sigma^2)\right)^2} - \color{red}{2\sigma^2(\mu+\sigma^2/2)}\\
\end{array}\]
\[\begin{array}{rcl}
E(e^Z)&=&\int_{-\infty}^{\infty}{\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{\color{green}{\left(z-(\mu+\sigma^2)\right)^2} - \color{red}{2\sigma^2(\mu+\sigma^2/2)}}{2\sigma^2}}}\,dz\\
&=&\color{red}{e^{\mu+\sigma^2/2}}\int_{-\infty}^{\infty}{\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{\left(z-(\mu+\sigma^2)\right)^2}{2\sigma^2}}}\,dz\\
&=&\color{red}{e^{\mu+\sigma^2/2}}\\
\end{array}\]

Friday, August 16, 2013

Random Forest: Per prediction confidence

Random Forest

Random Forest (RF) is a machine learning technique to automatically classify data into its most likely category based on a training sample. For example, you may want to categorize a movie as good or bad, a drug as effective or ineffective for a given patient, or a bitmap image as one of the digits 0-9.

To perform this feat, RF first 'looks' at a sample with known categories and 'learns' from it. This is called the 'training' process.

Since it uses training data to learn, the method is called supervised learning, as opposed to unsupervised learning, where data is grouped into its most likely clusters without any hint of what those clusters might be from previously available data.

Some other supervised learning techniques are Logistic Regression, Neural Networks, Support Vector Machines, and Gradient Boosting Machines.

A Random Forest, as the name indicates, generates a collection of trees - viz. Decision Trees - at the end of the learning process. Each tree by itself is a predictor. To categorize a new observation, it is run through all the trees, and the results are aggregated.



Testing Random Forest

How well does it work?

To test it, I created a dataset where I color a point \((x,y)\) Green if \(y>\sin(x)\), and Red otherwise.

Note that conventional methods like Logistic Regression, \(P(\text{green})=\left({1+e^{-(\alpha+\beta_1 x + \beta_2 y)}}\right)^{-1}\), would fail on this data, as they depend on the decision boundary being linear.

Test Function

Then I created samples of different sizes consisting of \((x,y)\) points and applied the red/green labeling to train the RF.

Training Samples


As I was interested in seeing how the quality of the prediction varies with the amount of data, I fed the above samples in separately. The number of trees that RF generates, which up to a point increases the predictive power, was also varied. It was RF's job to come up with the function just by looking at these points.
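
The original charts are not reproduced in code here, but the experiment is easy to recreate. Below is a minimal sketch using numpy and scikit-learn (my tooling choice, not necessarily the post's); the coordinate ranges, sample sizes and tree counts are assumptions chosen to mirror the description above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

def make_sample(n):
    """Random (x, y) points labeled 1 (Green) if y > sin(x), else 0 (Red)."""
    x = rng.uniform(0, 10, n)               # coordinate ranges are my assumption
    y = rng.uniform(-1.5, 1.5, n)
    return np.column_stack([x, y]), (y > np.sin(x)).astype(int)

# A dense grid on which to "draw" the learned decision boundary
gx, gy = np.meshgrid(np.linspace(0, 10, 200), np.linspace(-1.5, 1.5, 200))
grid = np.column_stack([gx.ravel(), gy.ravel()])

for n_obs in (100, 1000, 10000):            # varying amount of training data
    for n_trees in (50, 1000):              # varying number of trees
        X_train, labels = make_sample(n_obs)
        rf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
        rf.fit(X_train, labels)
        proba_green = rf.predict_proba(grid)[:, 1].reshape(gx.shape)
        # proba_green can now be plotted (e.g. with matplotlib's contourf)
        # and compared against the true boundary y = sin(x)
```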

Guess made by Random Forest


Looking at these pictures, it is apparent that RF recreated the function very accurately, even with as few as 100 observations - for which it may actually be difficult for a person to guess the function.

For our simple data, the incremental gain by increasing the number of trees from 50 to 1000 is small.

Confidence

RF has a natural candidate that can be used to estimate how confident it is in each prediction. If there are \(n\) trees, we can simply compute, for each observation \(k\), the variance of the per-tree predictions: \(v_k=\frac{1}{n}\sum_{i=1}^n\left(\hat p_{ik}-\bar p_k\right)^2\), where \(\hat p_{ik}\) is the prediction of the \(i\)th tree and \(\bar p_k=\frac{1}{n}\sum_{i=1}^n \hat p_{ik}\).
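
Continuing the sketch above (still scikit-learn; the rf and grid variables come from that sketch), the per-tree spread is directly accessible because the fitted forest exposes its individual trees:

```python
import numpy as np

# Probability of "green" from each individual tree, for every grid point
per_tree = np.stack([tree.predict_proba(grid)[:, 1] for tree in rf.estimators_])

mean_p = per_tree.mean(axis=0)     # the usual aggregated RF prediction
var_p = per_tree.var(axis=0)       # v_k: how much the trees disagree at point k
std_p = np.sqrt(var_p)

# One plausible notion of "relative" standard deviation for plotting;
# the exact normalization used for the post's chart is not specified
rel_std = std_p / (mean_p + 1e-9)
```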

A sense of the confidence can be seen by looking at the white region in the charts above. Below is a plot of the relative standard deviation which brings it out clearly.

Relative Uncertainty of Random Forest

Darker areas correspond to regions of higher uncertainty.

Observe that the most uncertain regions are at the class boundary, as one would expect, since it is at the boundary that you would be most unsure about what color a point will have.

Also as the size of the training sample is increased, uncertainty recedes more and more towards the boundary of the classes.

Conclusion

For each data point, the variance of the outcomes across all the trees can be used to estimate the confidence in its prediction. It is especially valuable when estimating continuous attributes (though the simple example above demonstrates it on discrete outcomes), but it is generally overlooked. In a scenario where a bad decision is costly, the confidence can be used to filter the data before taking a decision.

Tuesday, September 27, 2011

All about "Information Value"

In statistical data mining, we sometimes need to determine which variables out of a set are best at capturing a desired behavior. For example, let's say you have a pool of customers for your credit card company, and you want to determine who among them is about to default (i.e. refuse to pay up after possibly making a huge expense). You then need to identify which of the attributes you have on the customer can potentially identify and alert you of such behavior. One of the popular ways analysts do this is by looking at something called 'Information Value'. In the context of data mining it is also sometimes referred to by the short form - InfoVal.

Definition
Information Value of \(x\) for measuring \(y\) is a number that attempts to quantify the predictive power of \(x\) in capturing \(y\). Let's assume the target variable \(y\), which we are interested in being able to measure, is a 0-1 variable (or an indicator). Let's further assume that it indicates whether an account will go bad in the immediate future. Let's now divide our population into 10 equal parts after sorting the entire pool by \(x\), creating the deciles. Now we are all set to define Information Value -
$$IV_x = \sum_{i=1}^{10}{\left(bad_i-good_i\right)\ln\frac{bad_i}{good_i}}$$
Here,
\(i\) runs over the 10 deciles into which we have divided the data,
\(bad_i\) is the proportion of bad accounts captured in the \(i\)th decile out of all bad accounts in the population,
\(good_i\) similarly is the proportion of good (i.e. not bad) accounts in the \(i\)th decile.

Note that the variable whose effectiveness you want to measure enters only through the sorting - it is the variable by which the entire data is sorted and divided into deciles.
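
As a concrete reference, here is a minimal sketch of the computation in Python with pandas (my tooling choice; the post itself mentions SAS/SQL). The helper name information_value is made up:

```python
import numpy as np
import pandas as pd

def information_value(x, y, n_bins=10):
    """IV of x for a 0/1 target y, using equal-population bins of x.

    Assumes every bin ends up with at least one good and one bad account;
    otherwise the log term is undefined (see the 'Issues' section below).
    """
    df = pd.DataFrame({"x": x, "y": y})
    df["bin"] = pd.qcut(df["x"], q=n_bins, duplicates="drop")   # deciles by x

    grouped = df.groupby("bin", observed=True)["y"]
    bad = grouped.sum()                     # bad accounts per bin (y == 1)
    good = grouped.count() - bad            # good accounts per bin

    bad_i = bad / bad.sum()                 # share of all bads falling in each bin
    good_i = good / good.sum()              # share of all goods falling in each bin

    return float(np.sum((bad_i - good_i) * np.log(bad_i / good_i)))
```

A quick sanity check: feeding in an \(x\) that is pure noise should return an \(IV\) close to zero, in line with the argument in the next section.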

How does it work?
But why does it work? You can check that if \(x\) has no information on \(y\) at all, then the \(IV\) will turn out to be zero. That's because when you sort by \(x\) and create deciles, the deciles are as good as random with respect to \(y\). Hence, each decile should capture 10% of the total bads and 10% of the total goods, so \(bad_i-good_i=0\) and \(\ln\frac{bad_i}{good_i}=0\), and every term of the sum vanishes.

On the other hand, if after sorting by \(x\) some decile has a higher or lower concentration of bads than goods, that means the decile is different from the overall population, and \(x\) lets us create it. Such a decile contributes a positive value to the summation which defines \(IV\) in the equation above (both factors of each term share the same sign). So for a good \(x\) variable there will be more deciles where the proportions of goods and bads differ - and by a larger margin the more effective \(x\) is in capturing \(y\) - hence \(IV\) indeed gives a measure of the predictive power of \(x\).

Issues
However, there is something artificial in the definition of \(IV\) above - the functional form. Indeed, there are many different ways to construct the quantity that is being summed up.

To give some examples - \(\sum_{i=1}^{10}{\left(bad_i-good_i\right)^2}\), \(\sum_{i=1}^{10}{\left|bad_i-good_i\right|\frac{n_i}{n}}\), etc. should all be equally good candidates.

The last one in particular is interesting, because it has the decile's share of the population, \(n_i/n\), in the equation, making it a consistent measure. That is, if you decide to divide the data into 20 parts, 30 parts and so on, it gets closer and closer to a limit. Incidentally, under some assumptions the limit to which it converges is essentially gini/2. For \(IV\), on the other hand, this leads to a problem - you cannot divide the data indefinitely, as you may hit segments which have no good (or no bad) accounts at all, in which case the \(\ln\) blows up.
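
To make that concrete, here is a hedged sketch (reusing the hypothetical information_value helper above) that computes \(IV\) and the \(\sum{\left|bad_i-good_i\right|\frac{n_i}{n}}\) alternative on the same simulated data at increasingly fine binnings - the simulated variable and bad rate are arbitrary, and the exact point at which \(IV\) breaks down will vary:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
score = rng.normal(size=n)                                  # a predictive variable
bad_flag = (rng.uniform(size=n) < 1 / (1 + np.exp(-(score - 3)))).astype(int)  # low bad rate

def abs_diff_measure(x, y, n_bins):
    """The |bad_i - good_i| * n_i/n alternative, on the same bins."""
    df = pd.DataFrame({"x": x, "y": y})
    df["bin"] = pd.qcut(df["x"], q=n_bins, duplicates="drop")
    grouped = df.groupby("bin", observed=True)["y"]
    bad = grouped.sum()
    good = grouped.count() - bad
    bad_i, good_i = bad / bad.sum(), good / good.sum()
    return float(np.sum(np.abs(bad_i - good_i) * grouped.count() / len(df)))

for n_bins in (10, 20, 50, 200, 500):
    print(n_bins,
          information_value(score, bad_flag, n_bins),   # typically hits inf once a fine bin has no bads
          abs_diff_measure(score, bad_flag, n_bins))    # stays finite and settles toward a limit
```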

Also, for the same reason, it is inaccurate and unfair to compare two variables when one of them has ties - which make it impossible to unambiguously divide the population into 10 equal parts. (This is a cautionary note, since it is a mistake I have seen many analysts make - even when drawing Lorenz curves, for example, where the unequal decile sizes can and should be taken into account. For \(IV\), to stress, they cannot be.)

Origin
If \(IV\) has these problems, how did the definition come about in the first place? Also should you use it?

I can only guess how such a metric came into being, and became popular despite its drawbacks. The concept of information arises in several other branches of mathematics, e.g. information theory, where you measure a very interesting quantity called entropy (which vaguely resembles \(IV\)), and Fisher's information of a variable (which deals with how much information a statistic contains about a particular parameter of a distribution). I think someone was aware of these concepts and wanted to create something similar for the corporate world to measure "information captured in a variable". The way things work in most corporate settings is that you show something works, and it becomes de facto unless someone else challenges it and shows something else works better. Such challenges are rare in most places, which is why senior leadership needs to make a conscious effort to engage employees in thinking rather than following tradition. For \(IV\), this was pretty much the case. We have already seen that it works, and after it was shown to work it became a standard - without proper mathematical/statistical backing to explore its pros and cons and to see if it could be bettered. And once something becomes a standard, it gains inertia and becomes legacy - which propels it through 'common wisdom'. The employees of that particular organization learned about it (they learned the formula, understood what it tries to measure, learned the SAS code or SQL queries required to compute it). When they migrated, the knowledge migrated with them.

Recommendation
The only reason you might want to use \(IV\) is legacy - you may have worked at a bank which used it, and may have developed some reference points (i.e. thumb rules like "if \(IV\) is more than \(abc\) then the variable is good, otherwise it is bad") for good or bad \(IV\) values. Even if you have used \(IV\) before, it should not take very long to develop similar reference points for gini - you will be able to do that over the course of one or two projects. In general, though, I recommend not using \(IV\), in favor of other, more intuitive measures - ROC, gini, and KS, to name some. I personally prefer gini - it is easy to understand, is consistent, and is very robust in measuring the power of a variable.

To conclude, though \(IV\) is still popular in some parts of the banking world, I would recommend not using it, in favor of gini/ROC, which measure the same thing while being more intuitive and without the flaws of \(IV\).