TLDR: If you have an estimate $\hat\mu$ of $E(Z)$, you can't just take $e^{\hat\mu}$ to estimate $E(e^Z)$.
A bias-correction factor of $e^{\hat\sigma^2/2}$ has to be applied to the "common sense" estimator $e^{\widehat{E(Z)}}$ to correctly estimate $Y=e^Z$. The right estimate is $Y=e^Z \;\hat{=}\; e^{\hat\sigma^2/2}\, e^{\widehat{E(Z)}}$.
Suppose we model $Z=\ln(Y)$ instead of $Y$, so that $Y=e^Z$. We estimate $E(Z)=\mu \;\hat{=}\; \hat\mu=f(X)$ based on independent variables $X$. (Read the symbol $\hat{=}$ as "estimated as".)
Given that $\hat\mu$ estimates $E(Z)$, a common-sense option to estimate $E(Y)$ might seem to be $e^{\hat\mu}$, since $Y=e^Z$.
But this will not give the best results, simply because $E(Y)=E(e^Z)\neq e^{E(Z)}$. In fact, since $e^z$ is convex, Jensen's inequality gives $E(e^Z) > e^{E(Z)}$, so $e^{\hat\mu}$ systematically underestimates $E(Y)$.
Instead, $E(Y)=e^{\mu+\sigma^2/2}$, where $\sigma^2$ is the variance of the error $Z-\hat Z$; hence a good estimate of $E(Y)$ is $E(Y) \;\hat{=}\; e^{\hat\mu+\hat\sigma^2/2}$.
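A quick Monte Carlo sketch makes the bias visible (the values of $\mu$, $\sigma$ and the seed here are illustrative, not from the text): draw $Z\sim N(\mu,\sigma^2)$ and compare the naive estimator $e^{\bar Z}$ with the corrected one.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 0.8

# Z ~ N(mu, sigma^2); Y = e^Z is lognormal with true mean e^(mu + sigma^2/2)
z = rng.normal(mu, sigma, size=1_000_000)
y = np.exp(z)

naive = np.exp(z.mean())                    # e^E(Z): biased low
corrected = np.exp(z.mean() + z.var() / 2)  # e^(E(Z) + sigma^2/2)

print(f"sample mean of Y : {y.mean():.3f}")
print(f"naive estimate   : {naive:.3f}")
print(f"corrected        : {corrected:.3f}")
```

The naive estimate lands near $e^\mu$, well below the sample mean of $Y$, while the corrected estimate tracks it closely.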
Estimating $\sigma^2$
We are used to estimating $E(Z) \;\hat{=}\; \hat\mu$, which is just the familiar regression estimate $\sum\hat\beta_i X_i$. We will now need to estimate $\hat\sigma^2$ too, to get an accurate point estimate of $Y=e^Z$.
OLS
If you are running an Ordinary Least Squares regression, an unbiased estimate for $\sigma^2$ is $SSE/(n-k)$, where $n$ = #observations and $k$ = #parameters in the model. Most statistical packages report these, and if not, you can calculate it as $\sum(Z-\hat Z)^2/(n-k)$. SAS reports all of these if you use PROC REG; in fact, in SAS $\hat\sigma$ is already reported as "Root MSE", and you can directly take $\text{Root MSE}^2$ as an estimate of $\sigma^2$.
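As a sketch of the OLS recipe above (simulated data and plain numpy, not SAS output; the coefficients and noise level are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Simulated data: ln(Y) is linear in x with homoscedastic Gaussian noise
x = rng.uniform(0, 2, n)
z = 0.5 + 1.2 * x + rng.normal(0, 0.6, n)   # Z = ln(Y), true sigma^2 = 0.36
y = np.exp(z)

# OLS of Z on X; k = 2 parameters (intercept and slope)
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, z, rcond=None)
z_hat = X @ beta

# sigma^2 estimated as SSE/(n - k) -- the square of "Root MSE"
k = X.shape[1]
sigma2_hat = np.sum((z - z_hat) ** 2) / (n - k)

# Bias-corrected point estimate of E(Y | X)
y_hat = np.exp(z_hat + sigma2_hat / 2)
```

On average the corrected predictions $e^{\hat Z + \hat\sigma^2/2}$ line up with the observed $Y$, while the uncorrected $e^{\hat Z}$ fall short.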
Other Regression Frameworks (Machine Learning - Random Forest, NN, KNN, etc.)
A generic way of estimating $\sigma^2$ is to borrow the assumption of homoscedasticity from OLS, i.e. that $\sigma^2$ does not vary from observation to observation. Under this assumption, the law of large numbers can be used to show that $\sum(Z-\hat Z)^2/n$ converges in probability to $\sigma^2$, and hence it remains a good estimator, even if it may not be unbiased for small $n$. If the number of parameters in the model is known, it is recommended to use $\sum(Z-\hat Z)^2/(n-k)$, mimicking the OLS estimator; it will correct for the bias to some extent, although for large $n$ the difference between $1/(n-k)$ and $1/n$ will be small.
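The same correction applied outside OLS might look like the sketch below. A hand-rolled k-nearest-neighbour regressor stands in for whatever ML model you actually use (the data, the model, and $k=10$ are all illustrative assumptions); the only model-specific part is computing $\hat Z$, after which the pooled residual estimate of $\sigma^2$ and the correction are identical.

```python
import numpy as np

def knn_predict(x_train, z_train, x_query, k=10):
    """Plain k-nearest-neighbour regression of Z on x (stand-in for any ML model)."""
    preds = np.empty(len(x_query))
    for i, xq in enumerate(x_query):
        nearest = np.argsort(np.abs(x_train - xq))[:k]
        preds[i] = z_train[nearest].mean()
    return preds

rng = np.random.default_rng(2)
n = 1000
x = rng.uniform(0, 2, n)
z = 1.0 + np.sin(x) + rng.normal(0, 0.5, n)   # Z = ln(Y), nonlinear in x

z_hat = knn_predict(x, z, x)

# Homoscedasticity assumed: pool all squared residuals into one estimate
sigma2_hat = np.mean((z - z_hat) ** 2)        # sum (Z - Zhat)^2 / n

# Bias-corrected point estimate of E(Y | X)
y_hat = np.exp(z_hat + sigma2_hat / 2)
```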
Proof
If $Z=\ln(Y)\sim N(\mu,\sigma^2)$, then $E(Y)=E(e^Z)=e^{\mu+\sigma^2/2}$. Citing the mean of the lognormal distribution on Wikipedia may work as "proof" in most cases. Just for completeness, a full mathematical proof is also given below.
$$E(e^Z)=\int_{-\infty}^{\infty} e^z \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(z-\mu)^2}{2\sigma^2}}\,dz = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(z-\mu)^2 - 2\sigma^2 z}{2\sigma^2}}\,dz$$

Completing the square in the exponent:

$$\begin{aligned}
(z-\mu)^2 - 2\sigma^2 z &= z^2 - 2\mu z + \mu^2 - 2\sigma^2 z\\
&= z^2 - 2(\mu+\sigma^2)z + \mu^2\\
&= \left(z-(\mu+\sigma^2)\right)^2 + \mu^2 - (\mu+\sigma^2)^2\\
&= \left(z-(\mu+\sigma^2)\right)^2 - 2\mu\sigma^2 - \sigma^4\\
&= \left(z-(\mu+\sigma^2)\right)^2 - 2\sigma^2\left(\mu+\sigma^2/2\right)
\end{aligned}$$

Therefore

$$E(e^Z)=\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{\left(z-(\mu+\sigma^2)\right)^2 - 2\sigma^2(\mu+\sigma^2/2)}{2\sigma^2}}\,dz = e^{\mu+\sigma^2/2}\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{\left(z-(\mu+\sigma^2)\right)^2}{2\sigma^2}}\,dz = e^{\mu+\sigma^2/2}$$

where the last step holds because the remaining integrand is the density of an $N(\mu+\sigma^2,\sigma^2)$ distribution and so integrates to 1.