
Monday, November 11, 2013

Regressing ln(Y) instead of Y

TLDR: If you have an estimate for $Z$, you can't just take $e^{\text{estimate}}$ to estimate $e^Z$.

A bias correction factor of $e^{\hat{\sigma}^2/2}$ has to be applied to the "common sense" estimator $e^{\widehat{E(Z)}}$ to correctly estimate $Y=e^Z$. The right estimate is $Y=e^Z \;\hat{=}\; e^{\hat{\sigma}^2/2}\, e^{\widehat{E(Z)}}$.


Let's take an attribute $Y$ which has a lognormal distribution - e.g. income, spend, revenue, etc. Since $\ln(Y) \sim N(\mu,\sigma^2)$, we may choose to model $\ln(Y)=Z$ instead of $Y$, and aspire to get a better estimate of $Y$ from our estimate of $\ln(Y)$.

Suppose we model $Z=\ln(Y)$ instead of $Y$, so that $Y=e^Z$. We estimate $E(Z)=\mu \;\hat{=}\; \hat{\mu}=f(X)$ based on independent variables $X$. (Read the symbol $\hat{=}$ as "estimated as".)

Given that $\hat{\mu}$ estimates $E(Z)$, a common-sense option to estimate $E(Y)$ might seem to be $e^{\hat{\mu}}$, since $Y=e^Z$.

But this will not give the best results - simply because $E(Y)=E(e^Z) \neq e^{E(Z)}$. (In fact, since $e^z$ is convex, Jensen's inequality tells us $E(e^Z) > e^{E(Z)}$, so the naive estimator is biased low.)

$E(Y)=e^{\mu+\sigma^2/2}$, where $\sigma^2$ is the variance of the error $Z-\hat{Z}$ - and hence a good estimate of $E(Y)$ would be $E(Y) \;\hat{=}\; e^{\hat{\mu}+\hat{\sigma}^2/2}$.
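A quick simulation makes the size of this bias concrete. The numbers below ($\mu=1$, $\sigma=0.8$) are hypothetical, chosen just to illustrate the gap between the naive and corrected estimators:

```python
import numpy as np

# Compare the naive estimator e^mean(Z) with the corrected estimator
# e^(mean(Z) + s^2/2), where Z = ln(Y) ~ N(mu, sigma^2).
# mu and sigma here are made-up illustration values.
rng = np.random.default_rng(0)
mu, sigma = 1.0, 0.8
z = rng.normal(mu, sigma, size=1_000_000)

true_mean = np.exp(mu + sigma**2 / 2)            # E(Y) = e^(mu + sigma^2/2)
naive = np.exp(z.mean())                         # biased low, close to e^mu
corrected = np.exp(z.mean() + z.var(ddof=1) / 2) # bias-corrected estimate

print(true_mean, naive, corrected)
```

With these parameters the naive estimate comes in roughly 25-30% below the true mean, while the corrected one lands on top of it.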

Estimating $\sigma^2$

We are used to estimating $E(Z) \;\hat{=}\; \hat{\mu}$, which is just the familiar regression estimate $\sum_i \hat{\beta}_i X_i$. We will now need to estimate $\hat{\sigma}^2$ too, to get an accurate point estimate of $Y=e^Z$.

OLS


If you are running an Ordinary Least Squares regression, an unbiased estimate for $\sigma^2$ is $\frac{SSE}{n-k}$, where $n$ = #observations and $k$ = #parameters in the model.

Most statistical packages report these - and if not, you can calculate it as $\sum(Z-\hat{Z})^2/(n-k)$. SAS reports all of these if you use PROC REG; in fact, in SAS $\hat{\sigma}$ is already reported as "Root MSE", so you can directly take $\text{Root MSE}^2$ as an estimate of $\sigma^2$.
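The same quantities are easy to compute by hand. Here is a sketch of the full OLS workflow on synthetic data (the coefficients 0.3 and 0.2 and the noise level are made up for the example): fit $Z=\ln(Y)$ on $X$, estimate $\sigma^2$ as $SSE/(n-k)$, then apply the correction to the predictions.

```python
import numpy as np

# Synthetic example: Z = ln(Y) = 0.3 + 0.2*x + noise, noise ~ N(0, sigma^2).
# All parameter values are illustrative, not from any real dataset.
rng = np.random.default_rng(1)
n, sigma = 5000, 0.5
x = rng.uniform(0, 10, size=n)
z = 0.3 + 0.2 * x + rng.normal(0, sigma, size=n)

X = np.column_stack([np.ones(n), x])             # design matrix: intercept + x, so k = 2
beta_hat, *_ = np.linalg.lstsq(X, z, rcond=None)
z_hat = X @ beta_hat
k = X.shape[1]
sigma2_hat = np.sum((z - z_hat) ** 2) / (n - k)  # SSE / (n - k), i.e. Root MSE squared

# Bias-corrected point estimate of E(Y | X)
y_hat = np.exp(z_hat + sigma2_hat / 2)
print(beta_hat, sigma2_hat)
```

`sigma2_hat` should come out close to the true $\sigma^2 = 0.25$ used to generate the data.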

Other Regression Frameworks (Machine Learning - RandomForest, NN, KNN, etc.)


A generic way of estimating $\sigma^2$ is to borrow the assumption of homoscedasticity from OLS - i.e. that $\sigma^2$ does not vary from person to person.

Under this assumption, the law of large numbers can be used to show that $\sum(Z-\hat{Z})^2/n$ converges in probability to $\sigma^2$, and hence it remains a good estimator - even if it may not be unbiased for small $n$.

If the number of parameters in the model is known, it is recommended to use $\sum(Z-\hat{Z})^2/(n-k)$, mimicking the OLS estimator - it will correct for the bias to some extent, although for large $n$ the difference between $1/(n-k)$ and $1/n$ is small.
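The recipe is model-agnostic: whatever produces $\hat{Z}$, average the squared training residuals. As a stand-in for a black-box learner (RandomForest, NN, KNN, ...), the sketch below uses a crude binned-mean regressor on a nonlinear signal; everything about the data is synthetic and illustrative.

```python
import numpy as np

# Generic residual-based estimate of sigma^2 under homoscedasticity.
# The "model" here is a binned-mean predictor standing in for any ML model;
# the sin(x) signal and noise level are made up for illustration.
rng = np.random.default_rng(2)
n, sigma = 20_000, 0.6
x = rng.uniform(0, 10, size=n)
z = np.sin(x) + rng.normal(0, sigma, size=n)     # Z = ln(Y), nonlinear in x

bins = np.linspace(0, 10, 51)                    # 50 equal-width bins
idx = np.digitize(x, bins) - 1
bin_means = np.array([z[idx == b].mean() for b in range(50)])
z_hat = bin_means[idx]                           # model prediction of Z

sigma2_hat = np.mean((z - z_hat) ** 2)           # sum (Z - Z_hat)^2 / n
y_hat = np.exp(z_hat + sigma2_hat / 2)           # corrected prediction of Y
print(sigma2_hat)
```

Since a binned-mean model has no obvious parameter count $k$, this uses the $1/n$ version; with 20,000 observations the distinction is negligible.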

 

Proof

If $Z=\ln(Y) \sim N(\mu,\sigma^2)$, then $E(Y)=E(e^Z)=e^{\mu+\sigma^2/2}$.

Citing the mean of the lognormal distribution on Wikipedia may work as "proof" in most cases. Just for completeness, a full mathematical proof is also given below.

$$E(e^Z)=\int_{-\infty}^{\infty} e^z \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(z-\mu)^2}{2\sigma^2}}\,dz = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(z-\mu)^2 - 2\sigma^2 z}{2\sigma^2}}\,dz$$

Completing the square in the numerator of the exponent:

$$(z-\mu)^2 - 2\sigma^2 z = z^2 - 2\mu z + \mu^2 - 2\sigma^2 z = z^2 - 2(\mu+\sigma^2)z + \mu^2 = \left(z-(\mu+\sigma^2)\right)^2 + \mu^2 - (\mu+\sigma^2)^2 = \left(z-(\mu+\sigma^2)\right)^2 - 2\mu\sigma^2 - \sigma^4 = \left(z-(\mu+\sigma^2)\right)^2 - 2\sigma^2\left(\mu+\sigma^2/2\right)$$

$$E(e^Z)=\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{\left(z-(\mu+\sigma^2)\right)^2 - 2\sigma^2(\mu+\sigma^2/2)}{2\sigma^2}}\,dz = e^{\mu+\sigma^2/2}\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{\left(z-(\mu+\sigma^2)\right)^2}{2\sigma^2}}\,dz = e^{\mu+\sigma^2/2},$$

where the last integral equals 1, being the total mass of a $N(\mu+\sigma^2,\sigma^2)$ density.
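The identity can also be sanity-checked numerically. A minimal Monte Carlo sketch, with arbitrary illustrative values of $\mu$ and $\sigma$:

```python
import numpy as np

# Monte Carlo check of E(e^Z) = e^(mu + sigma^2/2) for Z ~ N(mu, sigma^2).
# mu and sigma are arbitrary values picked for the check.
rng = np.random.default_rng(3)
mu, sigma = 0.5, 1.2
z = rng.normal(mu, sigma, size=2_000_000)

empirical = np.exp(z).mean()
theoretical = np.exp(mu + sigma**2 / 2)
print(empirical, theoretical)
```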