Monday, November 11, 2013

Regressing \(\ln(Y)\) instead of \(Y\)

TLDR: If you have an estimate for \(Z\), you can't just take \(e^{\text{estimate}}\) to estimate \(E(e^Z)\)

A bias correction factor of \(e^{\hat\sigma^2/2}\) has to be applied to the "common sense" estimator \(e^{\hat{E(Z)}}\) to correctly estimate \(E(Y)=E(e^Z)\). The right estimate is \(E(Y)\ \hat =\ e^{\hat \sigma^2/2}\,e^{\hat{E(Z)}}\).


Let's take an attribute \(Y\) which has a lognormal distribution - e.g. income, spend, or revenue. Since \(\ln(Y)\sim N(\mu,\sigma^2)\), we may choose to model \(\ln(Y)=Z\) instead of \(Y\), and aspire to get a better estimate of \(Y\) from our estimate of \(\ln(Y)\).

Suppose we model \(Z=\ln(Y)\) instead of \(Y\), so that \(Y=e^Z\). We estimate \(E(Z)=\mu\ \hat=\ \hat \mu= f(X)\) based on independent variables \(X\). (Read the symbol \(\hat =\) as "estimated as".)

Given that \(\hat \mu\) estimates \(E(Z)\), a common-sense way to estimate \(E(Y)\) might seem to be \(e^{\hat \mu}\), since \(Y=e^Z\).

But this will not give the best results - simply because \(E(Y)=E(e^Z)\ne e^{E(Z)}\). In fact, since \(e^z\) is convex, Jensen's inequality gives \(E(e^Z)>e^{E(Z)}\), so the common-sense estimator is biased downward.

\(E(Y)=e^{\mu+\sigma^2/2}\), where \(\sigma^2\) is the variance of \(Z\) around its mean - in the regression setting, the variance of the error \(Z-\hat Z\) - and hence a good estimate of \(E(Y)\) would be \(E(Y)\ \hat=\ e^{\hat \mu+\hat \sigma^2/2}\).
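To get a feel for the size of this correction, here is a minimal simulation sketch in Python/NumPy (the values \(\mu=1\), \(\sigma=0.8\) are arbitrary illustrative choices):

```python
# Sketch: for Z ~ N(mu, sigma^2), the naive estimator exp(mean(Z))
# undershoots E(Y) = exp(mu + sigma^2/2); the corrected one does not.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 0.8          # arbitrary illustrative parameters

z = rng.normal(mu, sigma, size=1_000_000)
y = np.exp(z)                 # Y is lognormal

naive     = np.exp(z.mean())                      # e^(mu-hat)
corrected = np.exp(z.mean() + z.var(ddof=1) / 2)  # e^(mu-hat + sigma2-hat/2)

print(f"true E(Y)   = {np.exp(mu + sigma**2 / 2):.4f}")  # ~3.74
print(f"sample mean = {y.mean():.4f}")
print(f"naive       = {naive:.4f}")      # ~e^1 = 2.72, biased low
print(f"corrected   = {corrected:.4f}")  # close to the true mean
```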

Estimating \(\sigma^2\)

We are used to estimating \(E(Z)\hat=\hat \mu\), which is just the familiar regression estimate \(\sum \hat \beta_i X_i\). We will now need to estimate \(\hat\sigma^2\) too, to get an accurate point estimate of \(E(Y)=E(e^Z)\).

OLS


If you are running an Ordinary Least Squares regression, an unbiased estimate for \(\sigma^2\) is \(\frac{SSE}{n-k}\), where \(SSE=\sum (Z-\hat Z)^2\) is the sum of squared errors, \(n\) = #observations, and \(k\) = #parameters in the model.

Most statistical packages report these - and if not, you can calculate it as \(\sum (Z-\hat Z)^2/(n-k)\). SAS reports all of these if you use PROC REG - in fact, in SAS \(\hat \sigma\) is already reported as "Root MSE", so you can directly take \(\text{Root MSE}^2\) as an estimate of \(\sigma^2\).
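As an illustration of the same workflow outside SAS, here is a sketch using Python's statsmodels, whose OLS results report \(SSE/(n-k)\) directly as mse_resid (the simulated data and coefficients are illustrative assumptions):

```python
# Sketch: fit OLS on Z = ln(Y), then bias-correct the back-transform.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 5_000
x = rng.uniform(0, 1, size=n)
z = 0.5 + 2.0 * x + rng.normal(0, 0.7, size=n)  # Z = ln(Y), illustrative model
y = np.exp(z)

X = sm.add_constant(x)
fit = sm.OLS(z, X).fit()

sigma2_hat = fit.mse_resid          # SSE/(n-k); equals Root MSE squared
z_hat = fit.predict(X)

y_hat_naive     = np.exp(z_hat)                   # biased low
y_hat_corrected = np.exp(z_hat + sigma2_hat / 2)  # with the e^(sigma2/2) factor

print(f"mean(Y)         = {y.mean():.3f}")
print(f"mean(naive)     = {y_hat_naive.mean():.3f}")
print(f"mean(corrected) = {y_hat_corrected.mean():.3f}")
```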

Other Regression Frameworks (Machine Learning - RandomForest, NN, KNN, etc.)


A generic way of estimating \(\sigma^2\) is to borrow the assumption of homoscedasticity from OLS - i.e. that \(\sigma^2\) does not vary from person to person.

Under this assumption, the law of large numbers can be used to show that \(\sum (Z-\hat Z)^2/n\) converges in probability to \(\sigma^2\), and hence it remains a good estimator - even if it may not be unbiased for small \(n\).

If the number of parameters in the model is known, it is recommended to use \(\sum (Z-\hat Z)^2/(n-k)\), mimicking the OLS estimator - it will correct for the bias to some extent, although for large \(n\) the difference between \(1/(n-k)\) and \(1/n\) is small.
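Here is a sketch of this recipe with a machine-learning regressor, using scikit-learn's RandomForestRegressor as an illustrative choice. One caveat I am adding beyond the above: a flexible learner can overfit, so its in-sample residuals can understate \(\sigma^2\); the sketch therefore computes \(\sum (Z-\hat Z)^2/n\) on a held-out set.

```python
# Sketch: generic sigma^2 estimate for a non-OLS learner, assuming
# homoscedasticity; residual variance taken on a held-out set.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n = 10_000
X = rng.uniform(0, 1, size=(n, 3))
z = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(0, 0.6, size=n)  # Z = ln(Y)

X_tr, X_te, z_tr, z_te = train_test_split(X, z, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_tr, z_tr)

resid = z_te - model.predict(X_te)
sigma2_hat = np.mean(resid**2)      # sum (Z - Z-hat)^2 / n on the holdout

y_hat = np.exp(model.predict(X_te) + sigma2_hat / 2)  # bias-corrected
print(f"mean actual Y    = {np.exp(z_te).mean():.3f}")
print(f"mean predicted Y = {y_hat.mean():.3f}")
```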


Proof

If \(Z=\ln(Y)\sim N(\mu,\sigma^2)\), then \(E(Y)=E(e^Z)=e^{\mu+\sigma^2/2}\).

Citing the mean of the lognormal distribution on Wikipedia may work as "proof" in most cases. Just for completeness, a full mathematical proof is also given below.

\[E(e^Z)=\int_{-\infty}^{\infty}{e^z\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(z-\mu)^2}{2\sigma^2}}}\,dz=\int_{-\infty}^{\infty}{\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{\color{#3366FF}{(z-\mu)^2-2\sigma^2 z}}{2\sigma^2}}}\,dz\]
 \[\begin{array}{rcl}
\color{#3366FF}{(z-\mu)^2-2\sigma^2 z}&=&z^2-2\mu z + \mu^2-2\sigma^2z\\
&=&z^2-2(\mu+\sigma^2) z + \mu^2\\
&=&\left(z-(\mu+\sigma^2)\right)^2 + \mu^2-(\mu+\sigma^2)^2\\
&=&\left(z-(\mu+\sigma^2)\right)^2 - 2\mu\sigma^2-\sigma^4\\
&=&\color{green}{\left(z-(\mu+\sigma^2)\right)^2} - \color{red}{2\sigma^2(\mu+\sigma^2/2)}\\
\end{array}\]
\[\begin{array}{rcl}
E(e^Z)&=&\int_{-\infty}^{\infty}{\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{\color{green}{\left(z-(\mu+\sigma^2)\right)^2} - \color{red}{2\sigma^2(\mu+\sigma^2/2)}}{2\sigma^2}}}\,dz\\
&=&\color{red}{e^{\mu+\sigma^2/2}}\int_{-\infty}^{\infty}{\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{\left(z-(\mu+\sigma^2)\right)^2}{2\sigma^2}}}\,dz\\
&=&\color{red}{e^{\mu+\sigma^2/2}}\\
\end{array}\]
The last step follows because the remaining integral is that of a \(N(\mu+\sigma^2,\,\sigma^2)\) density over the whole real line, and hence equals 1.
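As a sanity check on the closed form, the left-hand side can also be integrated numerically - a quick sketch with SciPy's quad (the values of \(\mu\) and \(\sigma\) are arbitrary):

```python
# Sketch: numerically verify E(e^Z) = e^(mu + sigma^2/2) by integrating
# the integrand from the proof over the whole real line.
import numpy as np
from scipy.integrate import quad

mu, sigma = 1.0, 0.8   # arbitrary illustrative parameters

def integrand(z):
    # e^z times the N(mu, sigma^2) density
    return np.exp(z) * np.exp(-(z - mu)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

numeric, _ = quad(integrand, -np.inf, np.inf)
closed_form = np.exp(mu + sigma**2 / 2)

print(f"numeric     = {numeric:.6f}")
print(f"closed form = {closed_form:.6f}")   # the two agree
```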