## Monday, November 11, 2013

### Regressing $$\ln(Y)$$ instead of $$Y$$

TL;DR: If you have an estimate for $$E(Z)$$, you can't just take $$e^{\text{estimate}}$$ to estimate $$E(e^Z)$$

A bias correction factor of $$e^{\hat\sigma^2/2}$$ has to be applied to the "common sense" estimator $$e^{\hat{E(Z)}}$$ to correctly estimate $$E(Y)=E(e^Z)$$. The right estimate is $$E(Y)\ \hat =\ e^{\hat \sigma^2/2}\,e^{\hat{E(Z)}}$$.

Let's take an attribute $$Y$$ which has a lognormal distribution - e.g. income, spend, revenue, etc. Since $$\ln(Y)\sim N(\mu,\sigma^2)$$, we may choose to model $$Z=\ln(Y)$$ instead of $$Y$$, and aspire to get a better estimate of $$Y$$ from our estimate of $$\ln(Y)$$.

Suppose we model $$Z=\ln(Y)$$ instead of $$Y$$, so that $$Y=e^Z$$. We estimate $$E(Z)=\mu\ \hat=\ \hat \mu= f(X)$$ based on independent variables $$X$$. (Read the symbol $$\hat =$$ as "estimated as".)

Given that $$\hat \mu$$ estimates $$E(Z)$$, a common-sense option for estimating $$E(Y)$$ might seem to be $$e^{\hat \mu}$$, since $$Y=e^Z$$.

But this will not give the best results - simply because $$E(Y)=E(e^Z)\ne e^{E(Z)}$$.

$$E(Y)=e^{\mu+\sigma^2/2}$$, where $$\sigma^2$$ is the variance of the error $$Z-\hat Z$$ - and hence a good estimate of $$E(Y)$$ would be $$E(Y)\ \hat=\ e^{\hat \mu+\hat \sigma^2/2}$$.
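A quick simulation makes the bias visible. This is a minimal sketch using NumPy; the parameter values and variable names are arbitrary, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 2.0, 0.8                 # parameters of Z = ln(Y) ~ N(mu, sigma^2)
z = rng.normal(mu, sigma, size=1_000_000)

naive = np.exp(z.mean())                    # e^{estimate of E(Z)} -- biased low
corrected = np.exp(z.mean() + z.var() / 2)  # e^{mu_hat + sigma_hat^2 / 2}

true_mean = np.exp(mu + sigma**2 / 2)       # exact E(Y) for a lognormal
print(naive, corrected, true_mean)
```

With these parameters the naive estimator undershoots $$E(Y)$$ by roughly a quarter, while the corrected one lands on it.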

### Estimating $$\sigma^2$$

We are used to estimating $$E(Z)\hat=\hat \mu$$, which is just the familiar regression estimate $$\sum \hat \beta_i X_i$$. We now also need an estimate $$\hat\sigma^2$$ of $$\sigma^2$$, to get an accurate point estimate of $$Y=e^Z$$.

#### OLS

If you are running an Ordinary Least Squares regression, an unbiased estimate for $$\sigma^2$$ is $$\frac{SSE}{n-k}$$ where $$n$$=#observations, and $$k$$=#parameters in the model.

Most statistical packages report this - and if not, you can calculate it as $$\sum (Z-\hat Z)^2/(n-k)$$. SAS's PROC REG, for instance, already reports $$\hat \sigma$$ as "Root MSE", so you can directly take $$\text{Root MSE}^2$$ as an estimate of $$\sigma^2$$.
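If your package doesn't report it, the calculation is only a few lines. Here is a sketch with NumPy on simulated data (the model, coefficients, and sample sizes are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: Z = ln(Y) generated from a linear model with normal noise
n, k = 500, 3                            # n observations, k parameters (incl. intercept)
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, 0.5, -0.3])
z = X @ beta + rng.normal(0, 0.6, size=n)   # true sigma = 0.6, sigma^2 = 0.36

# Fit OLS and compute the unbiased residual variance SSE / (n - k)
beta_hat, *_ = np.linalg.lstsq(X, z, rcond=None)
z_hat = X @ beta_hat
sse = np.sum((z - z_hat) ** 2)
sigma2_hat = sse / (n - k)               # sqrt of this is SAS's "Root MSE"

# Bias-corrected point estimates of Y = e^Z for each observation
y_hat = np.exp(z_hat + sigma2_hat / 2)
```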

#### Other Regression Frameworks (Machine Learning - RandomForest, NN, KNN, etc.)

A generic way of estimating $$\sigma^2$$ is to borrow the assumption of homoscedasticity from OLS - i.e. that $$\sigma^2$$ does not vary from person to person.

Under this assumption, the law of large numbers can be used to show that $$\sum (Z-\hat Z)^2/n$$ converges in probability to $$\sigma^2$$, and hence remains a good estimator - even if it may not be unbiased for small $$n$$.

If the number of parameters in the model is known, it is recommended to use $$\sum (Z-\hat Z)^2/(n-k)$$, mimicking the OLS estimator - it corrects for the bias to some extent, although for large $$n$$ the difference between $$1/(n-k)$$ and $$1/n$$ is small.
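This recipe works with any regressor, since it only needs the targets and the predictions. A hypothetical helper sketching the idea (the function name and the stand-in "model" below are my own, not from any library):

```python
import numpy as np

def corrected_exp_prediction(z, z_hat, n_params=0):
    """Given targets z = ln(y) and predictions z_hat from any regression
    model, estimate sigma^2 from the residuals under homoscedasticity and
    return bias-corrected predictions of y = e^z."""
    n = len(z)
    sigma2_hat = np.sum((z - z_hat) ** 2) / (n - n_params)
    return np.exp(z_hat + sigma2_hat / 2)

# Toy usage: the "model" here is just the sample mean, standing in for the
# predictions of a RandomForest, NN, KNN, etc.
rng = np.random.default_rng(2)
z = rng.normal(1.0, 0.5, size=10_000)
z_hat = np.full_like(z, z.mean())
y_pred = corrected_exp_prediction(z, z_hat)
```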

### Proof

If $$Z=\ln(Y)\sim N(\mu,\sigma^2)$$, then $$E(Y)=E(e^Z)=e^{\mu+\sigma^2/2}$$.

Citing the mean of the lognormal distribution on Wikipedia may work as "proof" in most cases. Just for completeness, a full mathematical proof is also given below.

$E(e^Z)=\int_{-\infty}^{\infty}{e^z\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(z-\mu)^2}{2\sigma^2}}}\,dz=\int_{-\infty}^{\infty}{\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{\color{#3366FF}{(z-\mu)^2-2\sigma^2 z}}{2\sigma^2}}}\,dz$
$\begin{array}{rcl} \color{#3366FF}{(z-\mu)^2-2\sigma^2 z}&=&z^2-2\mu z + \mu^2-2\sigma^2z\\ &=&z^2-2(\mu+\sigma^2) z + \mu^2\\ &=&\left(z-(\mu+\sigma^2)\right)^2 + \mu^2-(\mu+\sigma^2)^2\\ &=&\left(z-(\mu+\sigma^2)\right)^2 - 2\mu\sigma^2-\sigma^4\\ &=&\color{green}{\left(z-(\mu+\sigma^2)\right)^2} - \color{red}{2\sigma^2(\mu+\sigma^2/2)}\\ \end{array}$
$\begin{array}{rcl} E(e^Z)&=&\int_{-\infty}^{\infty}{\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{\color{green}{\left(z-(\mu+\sigma^2)\right)^2} - \color{red}{2\sigma^2(\mu+\sigma^2/2)}}{2\sigma^2}}}\,dz\\ &=&\color{red}{e^{\mu+\sigma^2/2}}\int_{-\infty}^{\infty}{\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{\left(z-(\mu+\sigma^2)\right)^2}{2\sigma^2}}}\,dz\\ &=&\color{red}{e^{\mu+\sigma^2/2}}\\ \end{array}$
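The identity is also easy to sanity-check numerically. A small Monte Carlo sketch (arbitrary parameter values, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(3)

mu, sigma = 0.5, 1.2
z = rng.normal(mu, sigma, size=2_000_000)

mc = np.exp(z).mean()                    # Monte Carlo estimate of E(e^Z)
closed_form = np.exp(mu + sigma**2 / 2)  # e^{mu + sigma^2/2} from the proof
print(mc, closed_form)
```

The two numbers agree to well within Monte Carlo error.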